[
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760608#comment-17760608
]
ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------
etseidl commented on PR #197:
URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1699773196
Hi all, just wanted to share some preliminary results with the new
statistics. I implemented this PR using both the
`RepetitionDefinitionLevelHistogram` and the full `SizeStatistics` struct in
the `ColumnIndex`. I used four files I use frequently for testing; two large
files with a flat schema and varying mixes of integer and string data, and two
smaller files that are deeply nested. The table below shows the impact on the
size of the `ColumnIndex`, as well as the impact to total file size, for each
of the test files.
```
-----------------------------------------------------------------------------
|          |                 | column index size (bytes)                    |
| file     | file size (MiB) | no size stats | histograms | full size stats |
-----------------------------------------------------------------------------
| flat 1   |          1883.1 |       1730740 |    2229005 |         2498311 |
-----------------------------------------------------------------------------
| flat 2   |          1695.4 |       2322339 |    2884139 |         3265139 |
-----------------------------------------------------------------------------
| nested 1 |            12.1 |          3085 |       4287 |            4683 |
-----------------------------------------------------------------------------
| nested 2 |           282.2 |         22704 |      34852 |           38267 |
-----------------------------------------------------------------------------
```
For the files with a flat schema, the histograms resulted in a 24-29%
increase in the index size. Adding in the unencoded size bumped that to a
41-44% increase. The relatively large impact of the added size info here is
because the histograms themselves are small for flat files: a) there is no
repetition level histogram, and b) the definition level histogram is small (2
bins). For the nested files, the histograms added between 39% and 54% to the
`ColumnIndex` size, now that the repetition level histograms are populated and
the max definition level is as high as 9. For these files, the addition of the
size info had a less dramatic effect, with the full stats adding between 52%
and 69% to the index.
The overall impact on file size was negligible, however, with the largest
increase being an additional 0.053%.
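For anyone who wants to re-derive the percentages, here is a minimal Python
sketch that recomputes them from the table. The names (`ROWS`, `pct_increase`,
`summarize`) are made up for illustration, and it assumes the file sizes in the
table are MiB of 1024 * 1024 bytes and can be used as the baseline for the
file-size impact, so the last column is approximate.

```python
MIB = 1024 * 1024

# (file, file size in MiB, ColumnIndex bytes: no size stats, histograms, full size stats)
# Values copied from the table; the tuple layout is an illustrative choice.
ROWS = [
    ("flat 1", 1883.1, 1730740, 2229005, 2498311),
    ("flat 2", 1695.4, 2322339, 2884139, 3265139),
    ("nested 1", 12.1, 3085, 4287, 4683),
    ("nested 2", 282.2, 22704, 34852, 38267),
]


def pct_increase(base, new):
    """Relative growth of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base


def summarize(rows=ROWS):
    """Return per-file index growth and approximate file-size impact."""
    out = {}
    for name, size_mib, base, hist, full in rows:
        out[name] = {
            # ColumnIndex growth with histograms only
            "hist_pct": pct_increase(base, hist),
            # ColumnIndex growth with the full size statistics
            "full_pct": pct_increase(base, full),
            # Extra index bytes relative to the whole file (assumed baseline)
            "file_pct": 100.0 * (full - base) / (size_mib * MIB),
        }
    return out


if __name__ == "__main__":
    for name, r in summarize().items():
        print(
            f"{name:9s} histograms +{r['hist_pct']:.1f}%  "
            f"full +{r['full_pct']:.1f}%  file +{r['file_pct']:.4f}%"
        )
```

The largest file-size impact comes from the "flat 2" row, which has the
largest index relative to its file size.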
So the good news here is no dramatic increase in file sizes, but the bad
news is a pretty significant hit to `ColumnIndex` sizes. If the latter is a
concern, perhaps it would be better to move the per-page size statistics into
their own structure, separate from the page indexes. Then the page histogram
data could be skipped altogether if the filtering predicate doesn't include any
`null` logic.
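To make the bin counts mentioned above concrete, here is a rough Python sketch
of a level histogram. It assumes a histogram holds one counter per level from 0
to the max level (so max level + 1 bins); `level_histogram` and the sample
levels are hypothetical and not taken from the PR.

```python
def level_histogram(levels, max_level):
    """Count how many values were written at each level 0..max_level.

    Assumption for illustration: one bin per level, max_level + 1 bins total.
    """
    hist = [0] * (max_level + 1)
    for lvl in levels:
        hist[lvl] += 1
    return hist


# Flat optional column: max definition level is 1, so only 2 bins.
# Definition level 0 means the value is null, 1 means it is present.
flat_def_levels = [1, 1, 0, 1, 0, 1]
flat_hist = level_histogram(flat_def_levels, max_level=1)
null_count = flat_hist[0]

# Nested column with max definition level 9: 10 bins per page.
nested_hist = level_histogram([9, 3, 0, 9, 7], max_level=9)
```

With only 2 small counters per page, the flat files pay little for the
histograms, while a column with max definition level 9 stores 10 counters per
page plus a repetition level histogram.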
> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
> Key: PARQUET-2261
> URL: https://issues.apache.org/jira/browse/PARQUET-2261
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Micah Kornfield
> Assignee: Micah Kornfield
> Priority: Major
>