[
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758121#comment-17758121
]
ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------
etseidl commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1303204476
##########
src/main/thrift/parquet.thrift:
##########
@@ -974,6 +1050,13 @@ struct ColumnIndex {
/** A list containing the number of null values for each page **/
5: optional list<i64> null_counts
+ /**
+ * Repetition and definition level histograms for the pages.
+ *
+ * This contains some redundancy with null_counts, however, to accommodate the
+ * widest range of readers both should be populated.
+ **/
+ 6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms;
Review Comment:
@gszadovszky
> I think the size of a Parquet file in total and also in the different
> parts (footer, column/offset indexes, pages) impacts the related systems
> significantly. We should be very careful about adding new metadata, since
> it is usually only useful if written by all the implementations, increasing
> the size of the average Parquet file.
I am in total agreement.
> I am not against adding new structures to the format that are not
> potentially useful for everyone, but in this case we should state that
> writing these structures is optional (not only syntactically but
> semantically as well).
Yes, this should be optional. And, in fact, it would only be non-zero for byte
array columns, so the size impact should not be huge compared to adding the
histograms. Or is the argument to not add the histograms to the ColumnIndex?
> do you want to scan Parquet files written by any kind of writer, or do you
> have control over the writing?
While I have control over the writing, there will potentially be many
different implementations used depending on what we're doing. We might use
libcudf for some of the ETL and data grooming, spark/parquet-mr for querying,
arrow-rs or polars for bulk import, etc. It would be nice if all
implementations converged one day on the same set of functionality.
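The redundancy between null_counts and the proposed histograms can be sketched with a small example (a hypothetical helper, not part of any Parquet implementation): for a column with maximum definition level D, every value recorded at a definition level below D is not a materialized leaf value, so for a flat nullable column a page's null count equals the sum of the definition-level histogram entries below D.

```python
def null_count_from_def_histogram(def_level_histogram, max_def_level):
    """Derive a page's null count from its definition-level histogram.

    def_level_histogram[i] is the number of values recorded with
    definition level i. For a flat (non-nested) nullable column with
    maximum definition level max_def_level, every level below the
    maximum corresponds to a null value. (For nested columns, lower
    levels can also mean empty lists or null ancestors, so the
    interpretation is more involved.)
    """
    return sum(def_level_histogram[:max_def_level])

# Example: a simple nullable column (max definition level 1) with
# 3 nulls (def level 0) and 7 non-null values (def level 1):
assert null_count_from_def_histogram([3, 7], 1) == 3
```

This is also why populating both fields is a compatibility measure rather than strictly necessary: a reader that understands the histograms could reconstruct null_counts itself, but older readers only look at null_counts.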
> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
> Key: PARQUET-2261
> URL: https://issues.apache.org/jira/browse/PARQUET-2261
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Micah Kornfield
> Assignee: Micah Kornfield
> Priority: Major
>