[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

ASF GitHub Bot (Jira) Wed, 23 Aug 2023 15:26:05 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758266#comment-17758266
 ]


ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------

GregoryKimball commented on PR #197:
URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1690724411

   Thank you @emkornfield for suggesting this change, @pitrou for your 
[comment](https://github.com/apache/parquet-format/pull/197#discussion_r1301338683)
 and @mapleFU, @wgtmac, @gszadovszky, @etseidl for the discussion.
   
   In the libcudf [chunked parquet 
reader](https://docs.rapids.ai/api/libcudf/stable/classcudf_1_1io_1_1chunked__parquet__reader),
 we would benefit greatly from having `SizeStatistics` available in 
`ColumnIndex` as well as in `ColumnMetaData`, for example:
   ```
   ColumnMetaData:
   optional SizeStatistics size_estimate_statistics;
   
   ColumnIndex:
   optional list<SizeStatistics> size_estimate_statistics;
   ```
   
   Page-level values for `unencoded_variable_width_stored_bytes` would let us 
step through a row group and yield consistently sized table "chunks". We created 
the chunked reader to read row groups that expand to tables of 10-100 GB or more 
when decompressed and decoded.
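
To illustrate the idea, here is a minimal sketch of how per-page size estimates could drive chunk boundaries. It is not libcudf's implementation: `plan_chunks`, `page_sizes`, and `byte_budget` are hypothetical names, and `page_sizes` is assumed to hold one decoded-size estimate per page (for example, derived from a page-level `unencoded_variable_width_stored_bytes`).

```python
from typing import Iterator, List, Tuple


def plan_chunks(page_sizes: List[int], byte_budget: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (first_page, end_page) ranges of consecutive pages.

    page_sizes  -- hypothetical per-page decoded-size estimates, in bytes
    byte_budget -- hypothetical target upper bound for one output chunk

    A page larger than the budget is emitted as a chunk of its own,
    since a page is the smallest unit this sketch can split on.
    """
    start = 0      # index of the first page in the current chunk
    running = 0    # estimated decoded bytes accumulated in the current chunk
    for i, size in enumerate(page_sizes):
        # Close the current chunk if adding this page would exceed the budget,
        # but never emit an empty chunk.
        if i > start and running + size > byte_budget:
            yield (start, i)
            start, running = i, 0
        running += size
    if start < len(page_sizes):
        yield (start, len(page_sizes))


if __name__ == "__main__":
    for first, end in plan_chunks([40, 40, 40, 200, 10], byte_budget=100):
        print(f"pages [{first}, {end})")
```

Without page-level estimates, a reader can only learn the decoded size of a page by decoding it, which is what makes planning like this hard today.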
   
   The `repetition_definition_level_histograms` are also useful for estimating 
the row count per page and for aligning pages between ColumnChunks. We don't 
need to track `FullSizeStatistics` in our use case; the histograms and 
`unencoded_variable_width_stored_bytes` at the page level will suffice.
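
A rough sketch of the row-count estimate, under stated assumptions: each page's repetition-level histogram has already been extracted as a plain list of counts indexed by level, and every page begins on a row boundary. The function names are hypothetical. Because a repetition level of 0 marks the first value of a new row, bin 0 of the histogram counts the rows that begin in that page.

```python
from itertools import accumulate
from typing import List


def rows_per_page(rep_level_histograms: List[List[int]]) -> List[int]:
    """Rows starting in each page: the count of repetition level 0."""
    return [hist[0] for hist in rep_level_histograms]


def page_row_offsets(rep_level_histograms: List[List[int]]) -> List[int]:
    """First-row index of every page, followed by the total row count."""
    return [0] + list(accumulate(rows_per_page(rep_level_histograms)))


def shared_row_boundaries(offsets_a: List[int], offsets_b: List[int]) -> List[int]:
    """Row indices at which both column chunks start a new page."""
    return sorted(set(offsets_a) & set(offsets_b))


if __name__ == "__main__":
    # Hypothetical histograms for two column chunks of the same row group.
    col_a = [[3, 5], [2, 1]]
    col_b = [[1, 0], [2, 4], [2, 2]]
    print(shared_row_boundaries(page_row_offsets(col_a), page_row_offsets(col_b)))
```

The shared boundaries are the row indices where both columns can be cut without splitting a page, which is one way to think about aligning pages between ColumnChunks.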
   
   Thank you for your help!




> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
>                 Key: PARQUET-2261
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2261
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>            Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
