[
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760702#comment-17760702
]
ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------
etseidl commented on PR #197:
URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1700313640
> As an implementation detail, can we ignore the `rep-def` histogram when
> `max-rep <= 1, max-def <= 1`, since we already have page-ordinal in OffsetIndex
> and null-count in ColumnIndex? This might take less space but make it a bit
> tricky. @etseidl @emkornfield
I think that would be ok. My current implementation only writes the
histograms when `max_level > 0`, but that could easily be changed to
`max_level > 1`. On the read side, the logic is a little harder, but not
unmanageable, especially since we already have to deal with the
`max_level == 0` case. Once we settle on where everything goes, I'll modify my
code to make use of the new structures and see if there are any problems.
@emkornfield does this work for you?
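To make the read-side point concrete, here is a rough sketch (not my actual implementation) of how a reader might rebuild a missing definition-level histogram. The function names and the `min_level` parameter are made up for illustration, and it assumes a flat column (`max_rep == 0`) where the page's value count and null count are already known from the indexes.

```python
from typing import List, Optional


def should_write_histogram(max_level: int, min_level: int = 1) -> bool:
    """Write-side check (hypothetical helper).

    min_level=1 mirrors `max_level > 0`; min_level=2 mirrors `max_level > 1`.
    """
    return max_level >= min_level


def def_level_histogram(
    max_def: int,
    num_values: int,
    null_count: int,
    stored: Optional[List[int]] = None,
) -> List[int]:
    """Return the definition-level histogram for one page.

    Uses the stored histogram when present. Otherwise reconstructs it,
    assuming a flat column where every null has definition level 0.
    """
    if stored is not None:
        return stored
    if max_def == 0:
        # Required column: every value sits at level 0.
        return [num_values]
    if max_def == 1:
        # Optional column: level 0 holds the nulls, level 1 the rest.
        return [null_count, num_values - null_count]
    raise ValueError("histogram cannot be inferred when max_def > 1")


# A page with 10 values, 3 of them null, and no stored histogram.
print(def_level_histogram(max_def=1, num_values=10, null_count=3))
```

The `max_def == 1` branch is the extra case the reader would need if the writer threshold moves from `> 0` to `> 1`; the `max_def == 0` branch is the one that has to exist either way.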
> Second, I think the size should go in `OffsetIndex` rather
> than `ColumnIndex`.
I'm fine with this. It's kind of in the weeds, but splitting it up this way
does save a little space and processing, since we don't have to encode the
`SizeStatistics` wrapper.
> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
> Key: PARQUET-2261
> URL: https://issues.apache.org/jira/browse/PARQUET-2261
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Micah Kornfield
> Assignee: Micah Kornfield
> Priority: Major
>