[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

ASF GitHub Bot (Jira) Fri, 08 Sep 2023 10:19:04 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17763187#comment-17763187
 ]


ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------

emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1320143220


##########
src/main/thrift/parquet.thrift:
##########
@@ -977,6 +1038,25 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * Contains repetition level histograms for each page
+   * concatenated together.  The repetition_level_histogram field on
+   * SizeStatistics contains more details.
+   *
+   * When present the length should always be (number of pages *
+   * (max_repetition_level + 1)) elements in size.
+   *
+   * Element 0 is the first element of the histogram for the first page.
+   * Element (max_repetition_level + 1) is the first element of the histogram
+   * for the second page.

Review Comment:
   I'll also point out that if we are very concerned about eking out the most 
efficiency for reads, then probably the best organization here is to group the 
entries by level, listing every page within each level, and to have each entry 
be the cumulative sum:
   
   `[<page 0, histogram[0]>, <page 1, histogram[0]>, <page 0, histogram[0] + 
histogram[1]>, <page 1, histogram[0] + histogram[1]>]`. 
   
   This allows a check such as "all not null" to be done with a vectorized 
subtraction followed by checking for positive elements in the result.  I'm 
happy to change this if we don't think this is too much complexity.
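
   To make the two layouts concrete, here is a minimal Python sketch.  It is 
only an illustration under stated assumptions, not code from the PR: the 
function names (`page_major`, `level_major_cumulative`, `pages_with_level`) 
are hypothetical, plain lists stand in for vectorized arrays, and the reading 
of "subtract and check for positive elements" as "subtract the previous 
level's slice from the current one" is my interpretation.

```python
from itertools import accumulate


def page_major(histograms):
    """Layout in the diff: each page's histogram, concatenated page by page."""
    return [count for hist in histograms for count in hist]


def level_major_cumulative(histograms):
    """Alternative layout (assumed reading of the comment): for each level,
    one entry per page holding histogram[0] + ... + histogram[level]."""
    cumulative = [list(accumulate(hist)) for hist in histograms]
    num_levels = len(histograms[0])
    return [cum[level] for level in range(num_levels) for cum in cumulative]


def pages_with_level(flat, num_pages, level):
    """Pages with at least one value at `level`: subtract the previous
    level's slice from this level's slice and keep the positive results."""
    current = flat[level * num_pages:(level + 1) * num_pages]
    if level > 0:
        previous = flat[(level - 1) * num_pages:level * num_pages]
    else:
        previous = [0] * num_pages
    # A list comprehension stands in for a vectorized subtraction here.
    diff = [c - p for c, p in zip(current, previous)]
    return [page for page, d in enumerate(diff) if d > 0]


# Two pages, max level 1, so each page has two histogram buckets.
histograms = [[3, 0], [2, 5]]
flat_pages = page_major(histograms)
flat_levels = level_major_cumulative(histograms)
level_one_pages = pages_with_level(flat_levels, num_pages=2, level=1)
```

   In the page-major layout the entry for page `p` and level `l` sits at 
index `p * (max_level + 1) + l`; in the level-major layout each level is one 
contiguous slice of `num_pages` entries, which is what makes the slice-wise 
subtraction possible.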





> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
>                 Key: PARQUET-2261
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2261
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>            Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
