Generally, I think encoded size is a good first target. DuckDB does this well for picking dictionary encoding and bloom filters, and it splits pages by row count; however, it does not seem to select among the Parquet v2 encodings.
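
To make that concrete: a size-based advisor can trial-encode a sample of
the column with each candidate encoding and keep whichever output is
smallest. Below is a minimal Rust sketch of the idea, not the arrow-rs
API (Candidate, the toy encoders, and pick_encoding are all made-up
names):

    // A candidate encoding: a name plus a function that encodes a sample.
    struct Candidate {
        name: &'static str,
        encode: fn(&[i64]) -> Vec<u8>,
    }

    // Baseline: fixed-width little-endian values.
    fn plain(values: &[i64]) -> Vec<u8> {
        values.iter().flat_map(|v| v.to_le_bytes()).collect()
    }

    // Toy delta encoder: zig-zag varints of consecutive differences.
    fn delta(values: &[i64]) -> Vec<u8> {
        let mut out = Vec::new();
        let mut prev = 0i64;
        for &v in values {
            let mut z = (((v - prev) << 1) ^ ((v - prev) >> 63)) as u64;
            prev = v;
            loop {
                let byte = (z & 0x7f) as u8;
                z >>= 7;
                if z == 0 {
                    out.push(byte);
                    break;
                }
                out.push(byte | 0x80);
            }
        }
        out
    }

    // Pick the candidate whose trial encoding of the sample is smallest.
    fn pick_encoding(sample: &[i64]) -> &'static str {
        let candidates = [
            Candidate { name: "plain", encode: plain },
            Candidate { name: "delta", encode: delta },
        ];
        candidates
            .iter()
            .min_by_key(|c| (c.encode)(sample).len())
            .map(|c| c.name)
            .unwrap()
    }

    fn main() {
        // Monotonically increasing values should favor the delta encoder.
        let sample: Vec<i64> = (0..1024).collect();
        println!("picked: {}", pick_encoding(&sample));
    }
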
Picking dictionary encoding can be straightforward, but choosing row-group sizes and page-split points is affected by reader workloads and the writer's memory limit.

Best,
Xuwei Fu

On Fri, Oct 31, 2025 at 09:15, Kenny Daniel <[email protected]> wrote:

> It would be great if there were more discussion of these encoder choices.
> Authoring the hyparquet reader was more straightforward in some ways than
> authoring the writer. A million little choices came up, and sometimes I
> would just guess, and other times I would look at reference implementations
> (mostly duckdb because it has the cleanest code of all the parquet
> implementations imo)
>
> Examples:
>
> - Choosing between RLE and bitpack in bitpacked hybrid encoding (try both:
>   <https://github.com/hyparam/hyparquet-writer/blob/v0.8.0/src/encoding.js#L14-L24>)
> - When to use dictionary encoding? (non-boolean columns where rows / unique > 2:
>   <https://github.com/hyparam/hyparquet-writer/blob/v0.8.0/src/column.js#L91-L100>)
> - When to split pages? By row count? By compressed bytes? By uncompressed bytes?
> - Default row group size?
>
> On Thu, Oct 30, 2025 at 2:09 PM Andrew Lamb <[email protected]> wrote:
>
> > Hello,
> >
> > I wanted to start a discussion about adaptive encoding selection as a
> > way to improve the performance of Parquet implementations, which we
> > began at the sync earlier this week.
> >
> > Specifically, the question came up: "would we need to specify some way
> > in the spec to pick between any new encodings?" and I think the
> > consensus answer was "no, that would be up to the writer implementation".
> >
> > At the moment, all open source Parquet implementations I know of use
> > simple heuristics for choosing the encoding, but we are discussing a
> > more dynamic encoding advisor in arrow-rs (thanks MapleFU) [1]. Are
> > there other conversations that people are aware of?
> >
> > Andrew
> >
> > p.s. This is part of a larger point and misconception about the Parquet
> > format, specifically that the choice of encoding is dictated by the
> > format. Many new file format proposals, such as BtrBlocks [2], include
> > a dynamic approach to encoding selection (after claiming that "Parquet"
> > uses static heuristics):
> >
> > > In BtrBlocks, we test each encoding scheme on a sample and select the
> > > scheme that performs best.
> >
> > (the paper then proposes a sampling technique to avoid skewed samples)
> >
> > [1]: https://github.com/apache/arrow-rs/issues/8378
> > [2]: https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/btrblocks.pdf
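
P.S. The "rows / unique > 2" rule from Kenny's second example above boils
down to something like the following. hyparquet itself is JavaScript;
this Rust restatement and the should_use_dictionary name are mine:

    use std::collections::HashSet;

    // Use a dictionary when each distinct value repeats, on average,
    // more than twice, i.e. rows / unique > 2.
    fn should_use_dictionary(values: &[&str]) -> bool {
        let unique: HashSet<&&str> = values.iter().collect();
        values.len() > 2 * unique.len()
    }

    fn main() {
        let repetitive = ["a", "b", "a", "a", "b", "a"];
        let distinct = ["a", "b", "c", "d", "e", "f"];
        assert!(should_use_dictionary(&repetitive));
        assert!(!should_use_dictionary(&distinct));
    }
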
