[
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758117#comment-17758117
]
ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------
etseidl commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1303197041
##########
src/main/thrift/parquet.thrift:
##########
@@ -974,6 +1050,13 @@ struct ColumnIndex {
/** A list containing the number of null values for each page **/
5: optional list<i64> null_counts
+ /**
+ * Repetition and definition level histograms for the pages.
+ *
+ * This contains some redundancy with null_counts, however, to accommodate the
+ * widest range of readers both should be populated.
+ **/
+ 6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms;
Review Comment:
> If you are not reading a parquet file in the streaming fashion, why
> SizeStatistics in the column-chunk level is not enough? The pages of different
> columns are not aligned and you somehow will end up with reading the entire
> column chunk.
@wgtmac just because the pages aren't aligned doesn't mean I have to read
them all :wink: In a large row group with small pages, the non-alignment can be
minimized and there can still be a win from not reading unnecessary pages.
As to why the column-chunk level sizing info isn't enough, I have files
where the unencoded size is over 40x larger than the on-disk size, due
primarily to vast savings from dictionary encoding. So a 1GB row group could
potentially blow up to 40GB when fully decoded. In the constrained environment
of a GPU that's not tenable. Knowing in advance which pages I can read and
decode while still keeping everything on the GPU is very beneficial. To get
this sizing information now, we have to read and decompress every page, doing
most of the work of decoding the file just to find the total size of all the
byte arrays. I'd prefer not to have to make two passes through the file :smile:
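To make the use case concrete, here is a rough sketch of the kind of planning that per-page decoded sizes would allow. Everything in it is an assumption for illustration: `plan_batches`, `page_decoded_bytes`, and `budget_bytes` are hypothetical names, not part of any Parquet reader API, and the sizes are presumed to come from page-level metadata rather than from decoding the pages.

```python
from typing import List


def plan_batches(page_decoded_bytes: List[int], budget_bytes: int) -> List[List[int]]:
    """Group consecutive page indices so each group's decoded size fits the budget.

    page_decoded_bytes is assumed to hold the decoded size of each page,
    as it might be read from page-level size statistics (hypothetical input).
    """
    batches: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for page_index, size in enumerate(page_decoded_bytes):
        if size > budget_bytes:
            # A single page that cannot fit needs different handling (not shown).
            raise ValueError(f"page {page_index} alone exceeds the budget")
        if current and current_bytes + size > budget_bytes:
            # Adding this page would overflow the budget: close the batch.
            batches.append(current)
            current, current_bytes = [], 0
        current.append(page_index)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


# Illustrative sizes only, not measurements from a real file.
print(plan_batches([4, 4, 4, 10, 1], budget_bytes=10))
```

With sizes known up front, this kind of plan can be made in a single pass over the metadata; without them, the sizes are only known after each page has been read and decompressed.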
> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
> Key: PARQUET-2261
> URL: https://issues.apache.org/jira/browse/PARQUET-2261
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Micah Kornfield
> Assignee: Micah Kornfield
> Priority: Major
>