We would like to add statistics to better estimate the size of pages and
column chunks after they are read back into memory from Parquet:

https://github.com/apache/parquet-format/pull/197

Additionally, this metadata can support finer-grained null filters and
list lengths for nested types.

At a high level, this PR adds a new field to track the size of
variable-length data in byte array columns, and histograms of repetition
and definition levels, at both the column chunk level and the page index
level.
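
As a rough illustration of what these statistics capture, here is a small
Python sketch. It is only an approximation of the idea: the function names
and dictionary keys are made up for this example and are not taken from
the PR or from any Parquet library.

```python
from collections import Counter


def level_histogram(levels, max_level):
    """Count how often each level in 0..max_level occurs."""
    counts = Counter(levels)
    return [counts.get(level, 0) for level in range(max_level + 1)]


def size_stats(values, def_levels, rep_levels, max_def, max_rep):
    """Hypothetical helper: summarize one page or column chunk.

    `values` holds only the non-null byte strings; the levels follow the
    usual Parquet repetition/definition level encoding.
    """
    return {
        # Total bytes of variable-length data, before any encoding.
        "variable_length_bytes": sum(len(v) for v in values),
        "definition_level_histogram": level_histogram(def_levels, max_def),
        "repetition_level_histogram": level_histogram(rep_levels, max_rep),
    }


# Optional, non-repeated byte array column holding: "a", null, "bcd", null
values = [b"a", b"bcd"]
def_levels = [1, 0, 1, 0]  # 1 = value present, 0 = null
rep_levels = [0, 0, 0, 0]  # no repetition in a flat column

stats = size_stats(values, def_levels, rep_levels, max_def=1, max_rep=0)
print(stats)
```

For this flat optional column, the count at definition level 0 is the
number of nulls, and the byte total is the part of the in-memory size that
fixed-width metadata alone cannot predict.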

Two prototype implementations have been manually tested to be
interoperable with the new format additions:

1.  RapidsAI CuDF: https://github.com/rapidsai/cudf/pull/14000
2.  Parquet MR: https://github.com/apache/parquet-mr/pull/1177


This vote will be open for at least 72 hours.

[ ] +1 Add these statistics to the format specification
[ ] +0
[ ] -1 Do not add these statistics to the format specification because...

My vote is +1 (non-binding).


Thanks,
Micah
