[ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757871#comment-17757871
 ] 

ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------

gszadovszky commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1302642695


##########
src/main/thrift/parquet.thrift:
##########
@@ -974,6 +1050,13 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+  /** 
+    * Repetition and definition level histograms for the pages.  
+    *
+    * This contains some redundancy with null_counts, however, to accommodate the
+    * widest range of readers both should be populated.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms;

Review Comment:
   I think the size of a Parquet file, both in total and in its individual
parts (footer, column/offset indexes, pages), significantly impacts the
related systems. We should be very careful about adding new metadata, since it
is usually only useful if all implementations write it, which increases the
size of the average Parquet file.
   I am not against adding new structures to the format that are not
potentially useful for everyone, but in this case we should state that writing
these structures is optional (not only syntactically but semantically as well).
   @etseidl, do you want to scan Parquet files written by any kind of writer,
or do you have control over the writing?





> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
>                 Key: PARQUET-2261
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2261
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>            Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
