[
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759204#comment-17759204
]
ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------
etseidl commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1306335556
##########
src/main/thrift/parquet.thrift:
##########
@@ -974,6 +1050,13 @@ struct ColumnIndex {
/** A list containing the number of null values for each page **/
5: optional list<i64> null_counts
+ /**
+ * Repetition and definition level histograms for the pages.
+ *
+ * This contains some redundancy with null_counts, however, to accommodate
the
+ * widest range of readers both should be populated.
+ **/
+ 6: optional list<RepetitionDefinitionLevelHistogram>
repetition_definition_level_histograms;
Review Comment:
@emkornfield sorry...didn't see this reply.
> I think given that people are willing to try to contribute implementations
based on the current scope I would propose keeping the PR as is, and once
implemented we can follow-up with considering the additions that meet your
use-case?
That is of course reasonable. I think this is a very worthwhile addition to
the spec, so thank you all for getting it this far.
> Given this proposal I think the main question is whether we sould change
the histogram in the PageIndex to be SizeEstimate, this currently doesn't blow
up datastructures too much but doesn't quite semantically align with the exact
use-case that page index is meant for (filtering only).
Yes, I agree that using `SizeStatistics` will not significantly blow up the
metadata, so it really is down to whether you all believe it's appropriate to
have sizing information in the page indexes or not.
> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
> Key: PARQUET-2261
> URL: https://issues.apache.org/jira/browse/PARQUET-2261
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Micah Kornfield
> Assignee: Micah Kornfield
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)