[
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17763203#comment-17763203
]
ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------
etseidl commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1320192142
##########
src/main/thrift/parquet.thrift:
##########
@@ -977,6 +1038,25 @@ struct ColumnIndex {
/** A list containing the number of null values for each page **/
5: optional list<i64> null_counts
+
+ /**
+ * Contains repetition level histograms for each page
+ * concatenated together. The repetition_level_histogram field on
+ * SizeStatistics contains more details.
+ *
+ * When present the length should always be (number of pages *
+ * (max_repetition_level + 1)) elements in size.
+ *
+ * Element 0 is the first element of the histogram for the first page.
+ * Element (max_repetition_level + 1) is the first element of the histogram
+ * for the second page.
Review Comment:
So you are proposing transposing the 2D interpretation of the array as
currently defined, so that the histogram data for a given level is contiguous
in memory, and further proposing that each level is the inclusive sum of the
preceding levels. So the last "column" would be the same histogram as in the
column chunk's `SizeStatistics`. Is this correct?
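To make the two layouts concrete, here is a minimal Python sketch of the indexing as I understand it. The helper names (`page_major_index`, `level_major_index`, `transpose`) are hypothetical and are not part of the Parquet format or of any Parquet library; the page-major formula follows the doc comment in the diff above, and the level-major formula is only my reading of the proposed transposition.

```python
# Hypothetical helpers for illustration only; not part of any Parquet API.

def page_major_index(page: int, level: int, max_level: int) -> int:
    """Flat index in the layout as currently defined:
    one complete histogram per page, stored back to back."""
    return page * (max_level + 1) + level


def level_major_index(page: int, level: int, num_pages: int) -> int:
    """Flat index in the transposed layout (assumed reading of the proposal):
    all pages for a given level are contiguous."""
    return level * num_pages + page


def transpose(flat: list[int], num_pages: int, max_level: int) -> list[int]:
    """Re-order a page-major list into level-major order."""
    assert len(flat) == num_pages * (max_level + 1)
    return [
        flat[page_major_index(page, level, max_level)]
        for level in range(max_level + 1)
        for page in range(num_pages)
    ]


# Two pages with max_repetition_level = 2, so each page histogram has 3 entries.
page_major = [1, 2, 3, 4, 5, 6]
level_major = transpose(page_major, num_pages=2, max_level=2)
```

In the page-major form, element `max_repetition_level + 1` (index 3 here) is the first entry of the second page's histogram, which matches the doc comment. In the level-major form the same value sits at index 1, next to the first page's entry for that level.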
I'm fine with the transposition if that makes the filtering use case easier.
But since my primary concern is the size estimation, I'd prefer not having to
do the extra work of computing deltas to figure out value counts. But at least
with your proposal the counts are inclusive, so I wouldn't have to look outside
the column index to get the information I want (unlike using the offset index
to get page row counts, where you have to look at the row group's `num_rows` to
get the row count for the last page).
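A rough Python sketch of the extra work involved, assuming the inclusive sum runs across pages for a single level (that is how I read the "last column" remark; the function names are hypothetical). It also shows the offset index case for comparison, where the per-page `first_row_index` values alone are not enough and the row group's `num_rows` is needed to close out the last page.

```python
from itertools import accumulate

# Hypothetical helpers for illustration only; not part of any Parquet API.

def to_inclusive(per_page: list[int]) -> list[int]:
    """Running (inclusive) sum of per-page counts for one level."""
    return list(accumulate(per_page))


def to_per_page(inclusive: list[int]) -> list[int]:
    """Recover per-page counts from inclusive sums by taking deltas.
    The first page's delta is taken against 0, so no outside data is needed."""
    return [cur - prev for prev, cur in zip([0] + inclusive[:-1], inclusive)]


def rows_per_page(first_row_indexes: list[int], num_rows: int) -> list[int]:
    """Per-page row counts from offset-index first_row_index values.
    The last page needs the row group's num_rows as its upper bound."""
    bounds = first_row_indexes + [num_rows]
    return [end - start for start, end in zip(bounds, bounds[1:])]


counts = [3, 5, 2]               # per-page counts for one repetition level
running = to_inclusive(counts)   # last entry equals the chunk-level total
```

With inclusive sums the final entry for each level is already the column chunk total, so size estimation for the whole chunk needs no deltas; only per-page estimates pay the subtraction cost.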
> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
> Key: PARQUET-2261
> URL: https://issues.apache.org/jira/browse/PARQUET-2261
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Micah Kornfield
> Assignee: Micah Kornfield
> Priority: Major
>