etseidl commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1306335556
##########
src/main/thrift/parquet.thrift:
##########
@@ -974,6 +1050,13 @@ struct ColumnIndex {
/** A list containing the number of null values for each page **/
5: optional list<i64> null_counts
+ /**
+ * Repetition and definition level histograms for the pages.
+ *
+ * This contains some redundancy with null_counts, however, to accommodate
the
+ * widest range of readers both should be populated.
+ **/
+ 6: optional list<RepetitionDefinitionLevelHistogram>
repetition_definition_level_histograms;
Review Comment:
@emkornfield sorry...didn't see this reply.
> I think given that people are willing to try to contribute implementations
based on the current scope I would propose keeping the PR as is, and once
implemented we can follow-up with considering the additions that meet your
use-case?
That is of course reasonable. I think this is a very worthwhile addition to
the spec, so thank you all for getting it this far.
> Given this proposal I think the main question is whether we sould change
the histogram in the PageIndex to be SizeEstimate, this currently doesn't blow
up datastructures too much but doesn't quite semantically align with the exact
use-case that page index is meant for (filtering only).
Yes, I agree that using `SizeStatistics` will not significantly blow up the
metadata, so it really is down to whether you all believe it's appropriate to
have sizing information in the page indexes or not.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]