[
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758121#comment-17758121
]
ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------
etseidl commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1303204476
##########
src/main/thrift/parquet.thrift:
##########
@@ -974,6 +1050,13 @@ struct ColumnIndex {
/** A list containing the number of null values for each page **/
5: optional list<i64> null_counts
+ /**
+ * Repetition and definition level histograms for the pages.
+ *
+ * This contains some redundancy with null_counts, however, to accommodate the
+ * widest range of readers both should be populated.
+ **/
+ 6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms;
Review Comment:
@gszadovszky
> I think the size of a Parquet file in total and also in the different
> parts (footer, column/offset indexes, pages) impacts the related systems
> significantly. We should be very careful about adding new metadata, since
> it is usually only useful if written by all the implementations, increasing
> the size of the average Parquet file.
I am in total agreement.
> I am not against adding new structures to the format that are not
> potentially useful for everyone, but in this case we should state that
> writing these structures is optional (not only syntactically but
> semantically as well).
Yes, this should be optional. And, in fact, it would only be non-zero for byte
array columns, so the size impact should not be huge compared to adding the
histograms. Or is the argument to not add the histograms to the ColumnIndex?
> do you want to scan Parquet files written by any kind of writer, or do you
> have control over the writing?
While I have control over the writing, there will potentially be many
different implementations used depending on what we're doing. We might use
libcudf for some of the ETL and data grooming, spark/parquet-mr for querying,
arrow-rs or polars for bulk import, etc. It would be nice if all
implementations converged one day on the same set of functionality.
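The redundancy between null_counts and the proposed histograms can be sketched with a small example (a hypothetical helper, not part of any Parquet implementation): for a column with maximum definition level D, every value recorded at a definition level below D is not a materialized leaf value, so for a flat nullable column a page's null count equals the sum of the definition-level histogram entries below D.

```python
def null_count_from_def_histogram(def_level_histogram, max_def_level):
    """Derive a page's null count from its definition-level histogram.

    def_level_histogram[i] is the number of values recorded with
    definition level i. For a flat (non-nested) nullable column with
    maximum definition level max_def_level, every level below the
    maximum corresponds to a null value. (For nested columns, lower
    levels can also mean empty lists or null ancestors, so the
    interpretation is more involved.)
    """
    return sum(def_level_histogram[:max_def_level])

# Example: a simple nullable column (max definition level 1) with
# 3 nulls (def level 0) and 7 non-null values (def level 1):
assert null_count_from_def_histogram([3, 7], 1) == 3
```

This is also why populating both fields is a compatibility measure rather than strictly necessary: a reader that understands the histograms could reconstruct null_counts itself, but older readers only look at null_counts.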
> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
> Key: PARQUET-2261
> URL: https://issues.apache.org/jira/browse/PARQUET-2261
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Micah Kornfield
> Assignee: Micah Kornfield
> Priority: Major
>