[ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760608#comment-17760608
 ] 

ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------

etseidl commented on PR #197:
URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1699773196

   Hi all, just wanted to share some preliminary results with the new 
statistics. I implemented this PR using both the 
`RepetitionDefinitionLevelHistogram` and the full `SizeStatistics` struct in 
the `ColumnIndex`. I used four files that I use frequently for testing: two 
large files with a flat schema and varying mixes of integer and string data, 
and two smaller files that are deeply nested. The table below shows the impact 
on the size of the `ColumnIndex`, as well as the impact on total file size, for 
each of the test files.
   
   ```
   -----------------------------------------------------------------------------
   |          |                 |          column index size (bytes)           |
   | file     | file size (MiB) | no size stats | histograms | full size stats |
   -----------------------------------------------------------------------------
   | flat 1   |     1883.1      |    1730740    |  2229005   |     2498311     |
   | flat 2   |     1695.4      |    2322339    |  2884139   |     3265139     |
   | nested 1 |       12.1      |       3085    |     4287   |        4683     |
   | nested 2 |      282.2      |      22704    |    34852   |       38267     |
   -----------------------------------------------------------------------------
   ```
   For the files with a flat schema, the histograms resulted in a 24-29% 
increase in the index size. Adding in the unencoded size bumped that to a 
41-44% increase. The relatively large impact of the added size info is due to 
a) the lack of a repetition level histogram, and b) the small definition level 
histogram (2 bins), which together leave the histograms themselves small. For 
the nested files, the histograms added roughly 39-54% to the `ColumnIndex` 
size, since the repetition level histograms are now populated and the max 
definition level is as high as 9. For these files, the addition of the size 
info had a less dramatic effect, with the full stats adding between 52% and 69% 
to the index.
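
   As a quick check of the percentages quoted above, here is a minimal sketch 
that recomputes them from the byte counts in the table. The names `sizes`, 
`pct_increase`, and `increases` are made up for this illustration and are not 
part of the PR.

```python
# Column index sizes in bytes, copied from the table:
# (no size stats, histograms only, full size stats)
sizes = {
    "flat 1": (1730740, 2229005, 2498311),
    "flat 2": (2322339, 2884139, 3265139),
    "nested 1": (3085, 4287, 4683),
    "nested 2": (22704, 34852, 38267),
}


def pct_increase(base, new):
    """Relative growth of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base


# Map each file to (histogram increase %, full size stats increase %).
increases = {
    name: (pct_increase(base, hist), pct_increase(base, full))
    for name, (base, hist, full) in sizes.items()
}

for name, (hist_pct, full_pct) in increases.items():
    print(f"{name:9s} histograms: +{hist_pct:.1f}%  full size stats: +{full_pct:.1f}%")
```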
   
   The overall impact on file size was negligible, however, with the largest 
increase being an additional 0.053%.
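
   The file size figure can be reproduced with a similar sketch. It assumes 
that the file sizes in the table are the baseline sizes and that the only 
growth comes from the larger `ColumnIndex`; both are assumptions of this 
illustration, as are the names `files`, `added_pct`, and `overhead`.

```python
MIB = 1024 * 1024

# (file size in MiB, index bytes with no size stats, index bytes with full size stats)
files = {
    "flat 1": (1883.1, 1730740, 2498311),
    "flat 2": (1695.4, 2322339, 3265139),
    "nested 1": (12.1, 3085, 4683),
    "nested 2": (282.2, 22704, 38267),
}


def added_pct(file_mib, base_index, full_index):
    """Extra index bytes as a percentage of the total file size."""
    return 100.0 * (full_index - base_index) / (file_mib * MIB)


overhead = {name: added_pct(*row) for name, row in files.items()}

# File with the largest relative growth.
worst = max(overhead, key=overhead.get)
print(f"largest increase: {worst} at {overhead[worst]:.3f}% of file size")
```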
   
   So the good news here is that there is no dramatic increase in file sizes, 
but the bad news is a pretty significant hit to `ColumnIndex` sizes. If the 
latter is a concern, perhaps it would be better to move the per-page size 
statistics into their own structure, separate from the page indexes. Then the 
page histogram data could be skipped altogether if the filtering predicate 
doesn't include any `null` logic.
   




> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
>                 Key: PARQUET-2261
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2261
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>            Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
