[
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758266#comment-17758266
]
ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------
GregoryKimball commented on PR #197:
URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1690724411
Thank you @emkornfield for suggesting this change, @pitrou for your
[comment](https://github.com/apache/parquet-format/pull/197#discussion_r1301338683)
and @mapleFU, @wgtmac, @gszadovszky, @etseidl for the discussion.
In the libcudf [chunked parquet
reader](https://docs.rapids.ai/api/libcudf/stable/classcudf_1_1io_1_1chunked__parquet__reader),
we would benefit greatly from having `SizeStatistics` added to `ColumnIndex`,
for example:
```
ColumnMetaData:
  optional SizeStatistics size_estimate_statistics;
ColumnIndex:
  optional list<SizeStatistics> size_estimate_statistics;
```
We would benefit from having page-level values for
`unencoded_variable_width_stored_bytes` because they would help us step through a
row group and yield consistently sized table "chunks". We created the chunked
reader to read row groups that expand to tables of 10-100 GB or more when
decompressed and decoded.
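To illustrate the kind of stepping described above, here is a minimal sketch of greedy chunk planning from per-page byte counts. It is not the libcudf implementation; `plan_chunks`, `page_sizes`, and `chunk_limit` are hypothetical names, and the input is assumed to be one decoded-size estimate per page (for example, a page-level `unencoded_variable_width_stored_bytes` value).

```python
from typing import Iterable, List


def plan_chunks(page_sizes: Iterable[int], chunk_limit: int) -> List[List[int]]:
    """Group consecutive page indices into chunks of roughly chunk_limit bytes.

    page_sizes: hypothetical per-page decoded-size estimates, in bytes.
    chunk_limit: hypothetical target upper bound on bytes per output chunk.
    A single page larger than chunk_limit still gets a chunk of its own.
    """
    chunks: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for index, size in enumerate(page_sizes):
        # Close the current chunk if adding this page would exceed the limit.
        if current and current_bytes + size > chunk_limit:
            chunks.append(current)
            current = []
            current_bytes = 0
        current.append(index)
        current_bytes += size
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    # Six pages with made-up size estimates and a 100-byte limit.
    print(plan_chunks([40, 30, 50, 10, 90, 20], chunk_limit=100))
```

Without page-level sizes, a reader can only make this decision after decoding, which is why having the estimate in the metadata matters for this use case.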
The `repetition_definition_level_histograms` field is also useful for estimating
the row count per page and aligning pages between ColumnChunks. We don't need
to track `FullSizeStatistics` in our use case; the histograms and
`unencoded_variable_width_stored_bytes` at the page level will suffice.
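As a rough sketch of how the histograms could be used this way: in Parquet's nested encoding, a value with repetition level 0 starts a new row, so the level-0 bucket of a page's repetition-level histogram counts the rows that begin in that page. The function names below are hypothetical, and the code assumes each page supplies a histogram indexed by repetition level.

```python
from itertools import accumulate
from typing import List, Sequence


def rows_in_page(repetition_level_histogram: Sequence[int]) -> int:
    """Rows starting in a page: the count of values with repetition level 0."""
    return repetition_level_histogram[0]


def page_row_offsets(histograms: Sequence[Sequence[int]]) -> List[int]:
    """Cumulative row offsets at each page boundary, starting from 0."""
    return [0] + list(accumulate(rows_in_page(h) for h in histograms))


def shared_row_boundaries(offsets_a: Sequence[int], offsets_b: Sequence[int]) -> List[int]:
    """Row offsets at which both column chunks have a page boundary."""
    return sorted(set(offsets_a) & set(offsets_b))


if __name__ == "__main__":
    # Made-up histograms: each inner list is [count at level 0, count at level 1].
    column_a = [[3, 5], [2, 4], [5, 1]]
    column_b = [[5, 0], [5, 2]]
    offsets_a = page_row_offsets(column_a)
    offsets_b = page_row_offsets(column_b)
    print(offsets_a, offsets_b, shared_row_boundaries(offsets_a, offsets_b))
```

The shared boundaries are candidate split points where pages from both ColumnChunks can be cut without splitting a row, assuming pages begin on row boundaries.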
Thank you for your help!
> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
> Key: PARQUET-2261
> URL: https://issues.apache.org/jira/browse/PARQUET-2261
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Micah Kornfield
> Assignee: Micah Kornfield
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)