[ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759099#comment-17759099
 ] 

ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------

emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1305871232


##########
src/main/thrift/parquet.thrift:
##########
@@ -974,6 +1050,13 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+  /** 
+    * Repetition and definition level histograms for the pages.  
+    *
+    * This contains some redundancy with null_counts, however, to accommodate the
+    * widest range of readers both should be populated.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms;

Review Comment:
   > If this PR is adopted as-is, then the above structure wouldn't be needed 
at all for fixed length data types, just for byte arrays, and the chunk_size 
field wouldn't be necessary either since it would already be in the column 
chunk's SizeStatistics struct. Does this seem reasonable? If so I can submit a 
draft after this PR is merged.
   
   @etseidl Given that people are willing to contribute implementations based 
on the current scope, I would propose keeping the PR as is. Once it is 
implemented, we can follow up by considering the additions that meet your 
use case. This makes sure that the overall approach proposed here is viable 
and hopefully any additional.
   
   Given this proposal, I think the main question is whether we should change 
the histogram in the PageIndex to be a SizeEstimate. The histogram currently 
doesn't blow up the data structures too much, but it doesn't quite align 
semantically with the exact use case that the page index is meant for 
(filtering only).
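
   To illustrate the redundancy with null_counts that the doc comment in the 
diff mentions, here is a minimal Python sketch. It is not part of the PR: the 
function name and the plain-list histogram layout are hypothetical, and it 
assumes that `histogram[i]` holds the number of entries at definition level 
`i`, with the last index being the column's maximum definition level. For a 
flat optional column the result is the page's null count. For nested columns, 
entries below the maximum level may also be empty lists, so whether the result 
matches null_counts depends on how the writer counts them.

```python
from typing import List


def null_count_from_definition_histogram(histogram: List[int]) -> int:
    """Count entries whose definition level is below the maximum.

    Assumption (hypothetical layout, not taken from the PR): histogram[i] is
    the number of entries with definition level i, and the last index is the
    column's maximum definition level.
    """
    if not histogram:
        raise ValueError("histogram must contain at least one level")
    # Every entry below the maximum definition level has no stored value.
    return sum(histogram[:-1])


# Hypothetical page of a flat optional column (max definition level 1):
# 3 entries at level 0 (null) and 7 entries at level 1 (value present).
page_histogram = [3, 7]
page_null_count = null_count_from_definition_histogram(page_histogram)
```

   Under these assumptions a reader holding only the histogram could recover 
the null count, which is the overlap the doc comment refers to.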





> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
>                 Key: PARQUET-2261
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2261
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>            Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
