[GitHub] [parquet-format] etseidl commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

via GitHub Wed, 23 Aug 2023 08:38:35 -0700


etseidl commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1303204476



##########
src/main/thrift/parquet.thrift:
##########
@@ -974,6 +1050,13 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+  /** 
+    * Repetition and definition level histograms for the pages.  
+    *
+    * This contains some redundancy with null_counts, however, to accommodate the
+    * widest range of readers both should be populated.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms; 

Review Comment:
   @gszadovszky 
   > I think the size of a Parquet file in total and also in the different
   > parts (footer, column/offset indexes, pages) impact the related systems
   > significantly. We should be very careful about adding new metadata since they
   > are usually useful if they are written by all the implementations increasing
   > the size of the average Parquet file.
   
   I am in total agreement.
   
   > I am not against adding new structures to the format that is not
   > potentially useful for everyone, but in this case we should state that the
   > writing of these structures are optional (not only syntactically but
   > semantically as well).
   
   Yes, this should be optional. In fact, it would only be non-zero for byte
   array columns, so the size impact should not be huge compared to adding the
   histograms. Or is the argument that the histograms should not be added to
   the ColumnIndex?
   
   > do you want to scan Parquet files written by any kinds of writers or you
   > have control over the writing?
   
   While I have control over the writing, potentially many different
   implementations will be used, depending on what we're doing. We might use
   libcudf for some of the ETL and data grooming, Spark/parquet-mr for querying,
   arrow-rs or polars for bulk import, etc. It would be nice if all
   implementations converged one day on the same set of functionality.
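
To illustrate the redundancy with null_counts that the doc comment in the diff mentions, here is a minimal sketch for the simplest case only: a flat OPTIONAL column whose maximum definition level is 1. It assumes each page's definition level histogram is a list in which index i holds the number of values at definition level i; that layout, and the function name `null_counts_from_histograms`, are illustrative assumptions and are not taken from the PR. Nested or repeated columns need more care and are not covered.

```python
from typing import List


def null_counts_from_histograms(def_level_histograms: List[List[int]]) -> List[int]:
    """Derive per-page null counts from per-page definition level histograms.

    Assumption (hypothetical layout): histogram[i] is the number of values in
    the page whose definition level equals i. Only flat OPTIONAL columns
    (max definition level 1) are handled, where level 0 means "null".
    """
    null_counts = []
    for histogram in def_level_histograms:
        if len(histogram) != 2:
            raise ValueError(
                "sketch only covers flat optional columns (max definition level 1)"
            )
        # Definition level 0 means the value is null for a flat optional column.
        null_counts.append(histogram[0])
    return null_counts


# Two pages: the first has 3 nulls and 7 present values, the second has no nulls.
page_histograms = [[3, 7], [0, 10]]
page_null_counts = null_counts_from_histograms(page_histograms)
```

In this case a reader that understands the histograms could recover null_counts on its own, which is why the doc comment asks writers to populate both fields for readers that only understand null_counts.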
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
