[
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757728#comment-17757728
]
ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------
etseidl commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1302271092
##########
src/main/thrift/parquet.thrift:
##########
@@ -974,6 +1050,13 @@ struct ColumnIndex {
/** A list containing the number of null values for each page **/
5: optional list<i64> null_counts
+ /**
+ * Repetition and definition level histograms for the pages.
+ *
+ * This contains some redundancy with null_counts, however, to accommodate the
+ * widest range of readers both should be populated.
+ **/
+ 6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms;
Review Comment:
> The information would be necessary for both fixed width types, accounting
> for "size" of variable width data, and also estimating size of repeated fields
> (these might not apply to your use-case, depending on how nulls are handled
Ah, ok. So then I need the full SizeStatistics in some form. Whether
that's in one place on either the OffsetIndex or ColumnIndex, or split across
both (with histograms on ColumnIndex and
`unencoded_variable_width_stored_bytes` in the OffsetIndex) doesn't matter too
much.
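
To make the dependency concrete, here is a rough sketch of how a reader might combine a per-page definition level histogram with `unencoded_variable_width_stored_bytes` to estimate decoded size. It is not taken from the PR: the function names, the `offset_width` parameter, and the Arrow-like "data bytes plus one offset per value" layout are assumptions made for illustration. It also assumes the histogram has one entry per definition level from 0 up to the maximum level, so the last entry counts the non-null leaf values.

```python
from typing import List, Optional


def non_null_count(def_level_histogram: List[int]) -> int:
    """Count non-null leaf values in a page.

    Assumes the histogram is indexed by definition level (0..max level),
    so the last entry counts values that are fully defined.
    """
    return def_level_histogram[-1] if def_level_histogram else 0


def estimate_decoded_bytes(
    def_level_histogram: List[int],
    fixed_width: Optional[int] = None,
    unencoded_variable_width_stored_bytes: Optional[int] = None,
    offset_width: int = 4,  # hypothetical: bytes per offset in the in-memory layout
) -> int:
    """Estimate the in-memory size of one page's non-null values."""
    n = non_null_count(def_level_histogram)
    if fixed_width is not None:
        # Fixed width types: only the non-null count is needed.
        return n * fixed_width
    if unencoded_variable_width_stored_bytes is None:
        raise ValueError("variable width columns need the unencoded byte count")
    # Variable width types: raw data bytes plus one offset per non-null value.
    return unencoded_variable_width_stored_bytes + n * offset_width


# A flat optional column (max definition level 1): 3 nulls, 7 non-null values.
page_histogram = [3, 7]
int64_estimate = estimate_decoded_bytes(page_histogram, fixed_width=8)
string_estimate = estimate_decoded_bytes(
    page_histogram, unencoded_variable_width_stored_bytes=100
)
```

Under these assumptions the fixed width estimate needs only the histogram, while the variable width estimate needs both pieces, which is why it matters less where each lives than that both are available per page.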
> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
> Key: PARQUET-2261
> URL: https://issues.apache.org/jira/browse/PARQUET-2261
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Micah Kornfield
> Assignee: Micah Kornfield
> Priority: Major
>