It would be great if there were more discussion of these encoder choices. Authoring the hyparquet reader was more straightforward in some ways than authoring the writer. A million little choices came up in the writer; sometimes I would just guess, and other times I would look at reference implementations (mostly duckdb, because it has the cleanest code of all the parquet implementations imo).

Examples:
- Choosing between RLE and bitpack in bitpacked hybrid encoding (try both <https://github.com/hyparam/hyparquet-writer/blob/v0.8.0/src/encoding.js#L14-L24>)
- When to use dictionary encoding? (non-boolean columns where rows / unique > 2 <https://github.com/hyparam/hyparquet-writer/blob/v0.8.0/src/column.js#L91-L100>)
- When to split pages? By row count? By compressed bytes? By uncompressed bytes?
- Default row group size?

On Thu, Oct 30, 2025 at 2:09 PM Andrew Lamb <[email protected]> wrote:

> Hello,
>
> I wanted to start a discussion about adaptive encoding selection as a way
> to improve the performance of Parquet implementations that we started at
> the sync earlier this week.
>
> Specifically the question came up "would we need to specify some way in the
> spec to pick between any new encodings" and I think the consensus answer
> was "no, that would be up to the writer implementation"
>
> At the moment, all open source Parquet implementations I know of have a
> simple heuristics for choosing the encoding, but we are discussing a more
> dynamic encoding advisor in arrow-rs (thanks MapleFU)[1] . Are there other
> conversations that people are aware of?
>
>
> Andrew
>
> p.s. This is part of a larger point and misconception about the Parquet
> format, specifically that the choice of encoding is dictated by the
> format. Many new file format proposal such as BtrBlocks[2] include a
> dynamic approach to encoding selection (after claiming that "Parquet" uses
> static heuristics):
>
> > In BtrBlocks, we test each encoding scheme on a sample and select the
> > scheme that performs best.
>
> (the paper then proposes a sampling technique to avoid skewed samples)
>
> [1]: https://github.com/apache/arrow-rs/issues/8378
> [2]: https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/btrblocks.pdf
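
For concreteness, here is a rough sketch of the first two heuristics from my examples list (try both for RLE vs bitpack, and rows / unique > 2 for dictionary). This is not the hyparquet-writer code: the function names are made up for illustration, the size estimates assume the whole input is encoded either as pure RLE runs or as a single bit-packed run, and the dictionary check assumes primitive values that a `Set` can deduplicate.

```javascript
// Hypothetical helpers for illustration only; not the hyparquet-writer API.

/** Bytes needed to store n as an unsigned LEB128 varint. */
function varintSize(n) {
  let size = 1
  while (n >= 128) {
    n = Math.floor(n / 128)
    size++
  }
  return size
}

/** Estimated bytes if every run of equal values becomes one RLE run. */
function rleSize(values, bitWidth) {
  const valueBytes = Math.ceil(bitWidth / 8) // repeated value is byte-aligned
  let size = 0
  let i = 0
  while (i < values.length) {
    let run = 1
    while (i + run < values.length && values[i + run] === values[i]) run++
    size += varintSize(run * 2) + valueBytes // header = run << 1
    i += run
  }
  return size
}

/** Estimated bytes if all values go into a single bit-packed run. */
function bitPackedSize(values, bitWidth) {
  const groups = Math.ceil(values.length / 8) // values are packed in groups of 8
  return varintSize(groups * 2 + 1) + groups * bitWidth // header = (groups << 1) | 1
}

/** "Try both": pick whichever estimate is smaller. */
function chooseHybrid(values, bitWidth) {
  return rleSize(values, bitWidth) <= bitPackedSize(values, bitWidth)
    ? 'rle'
    : 'bitpacked'
}

/** Dictionary heuristic: non-boolean columns where rows / unique > 2. */
function shouldUseDictionary(values, type) {
  if (type === 'BOOLEAN' || values.length === 0) return false
  const unique = new Set(values).size // assumes primitive values
  return values.length / unique > 2
}
```

A real writer could also mix RLE and bit-packed runs within one page, which this sketch does not attempt; that is one more of the choices that would be nice to see discussed.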
