[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759099#comment-17759099 ]
ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------

emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1305871232


##########
src/main/thrift/parquet.thrift:
##########
@@ -974,6 +1050,13 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+  /**
+   * Repetition and definition level histograms for the pages.
+   *
+   * This contains some redundancy with null_counts, however, to accommodate the
+   * widest range of readers both should be populated.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms;

Review Comment:
   > If this PR is adopted as-is, then the above structure wouldn't be needed at all for fixed length data types, just for byte arrays, and the chunk_size field wouldn't be necessary either since it would already be in the column chunk's SizeStatistics struct. Does this seem reasonable? If so I can submit a draft after this PR is merged.

   @etseidl Given that people are willing to try contributing implementations based on the current scope, I would propose keeping the PR as is, and once it is implemented we can follow up by considering the additions that meet your use case? This makes sure that the overall approach proposed here is viable and hopefully any additional.

   Given this proposal, I think the main question is whether we should change the histogram in the PageIndex to be SizeEstimate. This currently doesn't blow up the data structures too much, but it doesn't quite align semantically with the exact use case that the page index is meant for (filtering only).

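   For readers following the thread, the redundancy with null_counts that the doc comment in the diff mentions can be sketched roughly as below. This is only an illustration under stated assumptions, not code from the PR: the function name and the sample histograms are hypothetical, and it assumes bucket i of a definition level histogram counts the values at definition level i, with the last bucket holding the fully defined (non-null) values.

```python
from typing import List


def null_count_from_def_histogram(def_level_histogram: List[int]) -> int:
    """Derive a page's null count from its definition level histogram.

    Assumption (not stated in the thread): def_level_histogram[i] is the
    number of values in the page with definition level i, and the last
    bucket corresponds to the column's maximum definition level.
    """
    if not def_level_histogram:
        raise ValueError("histogram must have at least one bucket")
    # Every value below the maximum definition level is treated as null here.
    return sum(def_level_histogram[:-1])


# Hypothetical per-page histograms for an optional flat column (max level 1).
page_histograms = [[3, 7], [0, 10], [5, 5]]
null_counts = [null_count_from_def_histogram(h) for h in page_histograms]
```

   Under these assumptions a reader that understands the histograms could recompute null_counts, which is why the comment calls the two fields partly redundant; older readers would still rely on null_counts alone.
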
> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
>                 Key: PARQUET-2261
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2261
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)