emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1317866553
##########
src/main/thrift/parquet.thrift:
##########
@@ -977,6 +1073,15 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+  /**
+   * Repetition and definition level histograms for the pages.
+   *
+   * This contains some redundancy with null_counts, however, to accommodate
+   * the widest range of readers both should be populated when either the max
+   * definition and repetition level meet the requirements specified in
+   * RepetitionDefinitionLevelHistogram.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms

Review Comment:
   For page-level encoding, if we are really concerned about this, what do you think about just having a flat list of size `(max_level + 1) * number of pages`? This would be the most compact representation in memory.

   I think whether to flip the inner/outer list meaning here then depends to some extent on the use case. Elsewhere, others have commented that difficulty of implementation is a concern in general. While 4K isn't necessarily trivial, under common use cases (tens of columns where an index lookup is needed) it might be over-optimizing.
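
To make the flat-list suggestion concrete, here is a minimal sketch of how a reader might index a single list of size `(max_level + 1) * number of pages`. It assumes a page-major layout (all levels for page 0, then all levels for page 1, and so on), which is only one of the two possible orderings; the function names are hypothetical and not part of parquet-format or any Parquet library.

```python
from typing import List


def flat_index(page: int, level: int, max_level: int) -> int:
    """Offset of the (page, level) counter in a page-major flat list.

    Assumption: each page contributes max_level + 1 consecutive counters.
    """
    return page * (max_level + 1) + level


def page_histogram(flat: List[int], page: int, max_level: int) -> List[int]:
    """All level counters for one page (a contiguous slice)."""
    width = max_level + 1
    return flat[page * width:(page + 1) * width]


def level_counts(flat: List[int], level: int, max_level: int) -> List[int]:
    """Counter for one level across every page (a strided slice)."""
    return flat[level::max_level + 1]


# Made-up example data: max_level = 2, three pages, so 3 * 3 = 9 counters.
example = [5, 0, 3,   # page 0: counts for levels 0, 1, 2
           1, 1, 6,   # page 1
           0, 0, 8]   # page 2
```

With this ordering a per-page lookup is a contiguous slice and a per-level scan is strided. Flipping the inner/outer meaning (level-major, offset `level * num_pages + page`) reverses that trade-off, which is why the choice depends on the use case.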