[
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760723#comment-17760723
]
ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------
etseidl commented on PR #197:
URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1700406992
Since we all seem to be in agreement now, it's probably good to list the
options available and then make a decision on which to use. My (probably
incomplete) list would be:
1. Simply add `SizeStatistics` to `ColumnIndex`. This is the simplest
solution, keeps the new data together, and mirrors what is being added to
`ColumnMetaData`. The downside is extra storage and work for clients that may
not use this new information.
2. Add `RepetitionDefinitionLevelHistogram` to `ColumnIndex` and
`unencoded_variable_width_stored_bytes` to `OffsetIndex` (either by adding it
as an optional field in the `PageLocation`, or as an optional `list<i64>` in
`OffsetIndex`). This is the next simplest to implement, and has modest savings
over option 1. It suffers from the same drawback as option 1: clients are
forced to read this extra information whether or not they use it.
3. Add a size/location pair to `ColumnMetaData` and a new struct containing
`list<SizeStatistics>`, mirroring how `OffsetIndex` is written. This lets
clients that have no need for this information ignore it, and lets clients
that don't need the full column/offset indexes read just the size information.
The downside is added complexity, plus a third structure to read for those
clients that will use all three.
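To make the comparison concrete, here is a rough sketch of where the per-page size data would live under each option, written as Python dataclasses rather than Thrift. Everything beyond the names quoted above is an assumption for illustration: the `*Opt1`/`*Opt2`/`*Opt3` class names, `SizeIndexOpt3`, the `size_index_offset`/`size_index_length` fields, the one-entry-per-page lists, and the flat list standing in for `RepetitionDefinitionLevelHistogram`. It is not the layout in the PR.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SizeStatistics:
    # Assumed contents, based only on the two names mentioned in the options.
    unencoded_variable_width_stored_bytes: Optional[int] = None
    # Simplified stand-in for RepetitionDefinitionLevelHistogram.
    repetition_definition_level_histogram: Optional[List[int]] = None


# Option 1: everything rides along in ColumnIndex (assumed: one entry per page).
@dataclass
class ColumnIndexOpt1:
    size_statistics: Optional[List[SizeStatistics]] = None


# Option 2: histograms in ColumnIndex, byte counts in OffsetIndex.
@dataclass
class ColumnIndexOpt2:
    level_histograms: Optional[List[List[int]]] = None


@dataclass
class OffsetIndexOpt2:
    # The optional list<i64> variant; the other variant would put the
    # value on each PageLocation instead.
    unencoded_variable_width_stored_bytes: Optional[List[int]] = None


# Option 3: a separate structure, located via a pair of fields in the metadata.
@dataclass
class ColumnMetaDataOpt3:
    size_index_offset: Optional[int] = None  # hypothetical field name
    size_index_length: Optional[int] = None  # hypothetical field name


@dataclass
class SizeIndexOpt3:  # hypothetical struct name
    size_statistics: List[SizeStatistics] = field(default_factory=list)


def structures_holding_size_info(option: int) -> List[str]:
    """Which structures a reader must parse to get all per-page size data."""
    return {
        1: ["ColumnIndex"],
        2: ["ColumnIndex", "OffsetIndex"],
        3: ["SizeIndex"],
    }[option]
```

Under this sketch, a reader that only wants size data parses one extra structure in option 3 and none of the existing indexes, while options 1 and 2 always pull the size data in alongside the existing indexes.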
I think 3 is maybe the most flexible, but since I'd almost always be using
all three structures anyway, I'd likely vote for 1 or 2. If forced to pick, I'd
probably take 1 right now since I already have it implemented :) I do have the
cycles to try out 2 and 3 and can report back if that would be helpful.
> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
> Key: PARQUET-2261
> URL: https://issues.apache.org/jira/browse/PARQUET-2261
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Micah Kornfield
> Assignee: Micah Kornfield
> Priority: Major
>