mapleFU commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1303273552


##########
src/main/thrift/parquet.thrift:
##########
@@ -974,6 +1050,13 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+  /** 
+    * Repetition and definition level histograms for the pages.  
+    *
+    * This contains some redundancy with null_counts, however, to accommodate the
+    * widest range of readers both should be populated.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms;

Review Comment:
   Hmm, first of all, the PageIndex might not be part of the "footer", because
there is some flexibility in where it is placed (each column chunk in a row
group carries a `(length, offset)` pair for its column index and offset index).
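
To illustrate the placement point, here is a minimal Python sketch of fetching an index blob by its recorded position. It assumes the reader already holds the `column_index_offset` and `column_index_length` values from a column chunk; `read_index_bytes` and the fake file contents are hypothetical, and decoding the Thrift payload is left out.

```python
import io


def read_index_bytes(f, offset, length):
    """Return the `length` bytes starting at absolute file position `offset`."""
    f.seek(offset)
    data = f.read(length)
    if len(data) != length:
        raise EOFError(f"expected {length} bytes at offset {offset}, got {len(data)}")
    return data


# Stand-in for a Parquet file (not a real one): the index blob sits wherever
# the writer put it, and the reader only follows the recorded offset and length.
fake_file = io.BytesIO(b"PAR1" + b"<pages>" + b"<column-index>" + b"<footer>" + b"PAR1")
column_index_offset = len(b"PAR1") + len(b"<pages>")
column_index_length = len(b"<column-index>")

blob = read_index_bytes(fake_file, column_index_offset, column_index_length)
```

Because the reader only follows the offset and length, the writer is free to place the index anywhere in the file, which is why it need not be treated as part of the footer.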
   
   Estimating batch size is important, but putting page-level statistics in
the "index" or "footer" seems a bit odd to me (because we might keep them
per page instead). If you want this, I think you can draft a new pull request
in this repo, and maybe put the statistics in the footer or the index.
   
   I've searched in the project:
   
   1. `OffsetIndex` has a compressed size, but that is really for IO.
   2. `ColumnMetaData` has `encoding_stats`, but that is per encoding.
   
   You're welcome to draft it here. We could even encode user-defined stats in
`key_value_metadata` as a base64 or base85 string.
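
As a rough sketch of the `key_value_metadata` idea (an illustration, not an agreed format): the values there are strings, so binary stats need a text encoding such as base64. The key name `example.page_level_histogram`, the little-endian int64 layout, and both helper functions are assumptions made for this example. A plain dict stands in for the metadata list.

```python
import base64
import struct

# Hypothetical key name; nothing in the Parquet spec defines it.
STATS_KEY = "example.page_level_histogram"


def encode_stats(values):
    """Pack signed 64-bit counts little-endian and return a base64 ASCII string."""
    raw = struct.pack(f"<{len(values)}q", *values)
    return base64.b64encode(raw).decode("ascii")


def decode_stats(text):
    """Inverse of encode_stats: base64 text back to a list of ints."""
    raw = base64.b64decode(text)
    count = len(raw) // 8  # each value occupies 8 bytes
    return list(struct.unpack(f"<{count}q", raw))


# A dict stands in for the file's key/value metadata entries.
key_value_metadata = {STATS_KEY: encode_stats([10, 0, 25])}
roundtrip = decode_stats(key_value_metadata[STATS_KEY])
```

A reader that does not know the key simply ignores it, so older readers are unaffected. The cost is that the stats are opaque to any tool that does not know the layout.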


