[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759204#comment-17759204 ]

ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------

etseidl commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1306335556


##########
src/main/thrift/parquet.thrift:
##########
@@ -974,6 +1050,13 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+  /** 
+    * Repetition and definition level histograms for the pages.  
+    *
+    * This contains some redundancy with null_counts, however, to accommodate the
+    * widest range of readers both should be populated.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms; 

Review Comment:
   @emkornfield sorry...didn't see this reply. 
   
   > I think given that people are willing to try to contribute implementations based on the current scope I would propose keeping the PR as is, and once implemented we can follow-up with considering the additions that meet your use-case?
   
   That is of course reasonable.  I think this is a very worthwhile addition to 
the spec, so thank you all for getting it this far.
   
   > Given this proposal I think the main question is whether we should change the histogram in the PageIndex to be SizeEstimate, this currently doesn't blow up datastructures too much but doesn't quite semantically align with the exact use-case that page index is meant for (filtering only).
   
   Yes, I agree that using `SizeStatistics` will not significantly blow up the 
metadata, so it really is down to whether you all believe it's appropriate to 
have sizing information in the page indexes or not.
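
The redundancy with `null_counts` noted in the Thrift comment in the diff above can be illustrated with a minimal Python sketch. It assumes a definition level histogram has one bucket per level (index `i` holds the number of entries observed at definition level `i`), and that a page's null count covers every entry below the maximum definition level, which is the case for a flat optional column. The function name `null_count_from_histogram` and the sample histogram are hypothetical and are not part of the PR or the spec.

```python
from typing import List


def null_count_from_histogram(definition_level_histogram: List[int]) -> int:
    """Derive a page's null count from its definition level histogram.

    Assumption: the histogram has max_definition_level + 1 buckets, and
    only the last bucket counts fully defined (non-null) leaf values.
    """
    if not definition_level_histogram:
        raise ValueError("histogram needs max_definition_level + 1 buckets")
    # Every bucket except the last one holds entries that stop short of
    # the leaf, so their sum is the number of nulls under this assumption.
    return sum(definition_level_histogram[:-1])


# Hypothetical page of a flat optional column (max definition level 1):
# 3 entries at level 0 (null) and 7 entries at level 1 (present).
page_histogram = [3, 7]
page_null_count = null_count_from_histogram(page_histogram)
```

Under these assumptions a reader that understands the histograms could recompute `null_counts`, which is why the comment asks writers to populate both for readers that only know the older field.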





> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
>                 Key: PARQUET-2261
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2261
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>            Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
