Re: Discussion: Dynamic encoding selection for Parquet

Kenny Daniel Thu, 30 Oct 2025 18:15:37 -0700

It would be great if there were more discussion of these encoder choices.
Authoring the hyparquet reader was more straightforward in some ways than
authoring the writer. With the writer, a million little choices came up;
sometimes I would just guess, and other times I would look at reference
implementations (mostly duckdb, because it has the cleanest code of all the
parquet implementations imo).


Examples:

   - Choosing between RLE and bitpack in bitpacked hybrid encoding (try both):
     <https://github.com/hyparam/hyparquet-writer/blob/v0.8.0/src/encoding.js#L14-L24>
   - When to use dictionary encoding? (non-boolean columns where rows /
     unique > 2):
     <https://github.com/hyparam/hyparquet-writer/blob/v0.8.0/src/column.js#L91-L100>
   - When to split pages? By row count? By compressed bytes? By
   uncompressed bytes?
   - Default row group size?
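
To make the first two heuristics above concrete, here is a rough sketch.
It is written in Python rather than hyparquet-writer's JavaScript and is not
a copy of the linked code: the function names are made up for illustration,
and the size estimates assume each RLE run is a varint header plus one
byte-aligned value, and each bit-packed group of 8 values takes bit_width
bytes.

```python
def should_use_dictionary(values):
    """Heuristic from the list above: non-boolean column where rows / unique > 2."""
    if not values or isinstance(values[0], bool):
        return False
    return len(values) / len(set(values)) > 2


def uleb128_len(n):
    """Number of bytes needed to write n as an unsigned LEB128 varint."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length


def rle_size(values, bit_width):
    """Estimated bytes if every run of equal values becomes one RLE run."""
    value_bytes = (bit_width + 7) // 8  # repeated value is byte-aligned
    size = 0
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        # header is varint(run_length << 1), followed by the repeated value
        size += uleb128_len((j - i) << 1) + value_bytes
        i = j
    return size


def bitpacked_size(values, bit_width):
    """Estimated bytes if all values go into a single bit-packed run."""
    groups = (len(values) + 7) // 8  # bit-packed runs hold groups of 8 values
    # header is varint((groups << 1) | 1); each group takes bit_width bytes
    return uleb128_len((groups << 1) | 1) + groups * bit_width


def choose_hybrid(values, bit_width):
    """'Try both': estimate each size and keep the smaller one."""
    rle = rle_size(values, bit_width)
    packed = bitpacked_size(values, bit_width)
    return "rle" if rle <= packed else "bitpacked"
```

A real writer can mix RLE and bit-packed runs within one page, so comparing
the two all-or-nothing sizes is only the simplest version of "try both".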




On Thu, Oct 30, 2025 at 2:09 PM Andrew Lamb <[email protected]> wrote:

> Hello,
>
> I wanted to start a discussion about adaptive encoding selection as a way
> to improve the performance of Parquet implementations that we started at
> the sync earlier this week.
>
> Specifically the question came up "would we need to specify some way in the
> spec to pick between any new encodings" and I think the consensus answer
> was "no, that would be up to the writer implementation"
>
> At the moment, all open source Parquet implementations I know of use
> simple heuristics for choosing the encoding, but we are discussing a more
> dynamic encoding advisor in arrow-rs (thanks MapleFU)[1]. Are there other
> conversations that people are aware of?
>
>
> Andrew
>
> p.s.  This is part of a larger point and misconception about the Parquet
> format, specifically that the choice of encoding is dictated by the
> format.  Many new file format proposals, such as BtrBlocks[2], include a
> dynamic approach to encoding selection (after claiming that "Parquet" uses
> static heuristics):
>
> > In BtrBlocks, we test each encoding scheme on a sample and select the
> scheme that performs best.
>
> (the paper then proposes a sampling technique to avoid skewed samples)
>
> [1]: https://github.com/apache/arrow-rs/issues/8378
> [2]: https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/btrblocks.pdf
>
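
The "test each encoding scheme on a sample" idea quoted above could look
roughly like the sketch below. Everything in it is an assumption made for
illustration: the two toy encoders stand in for real Parquet encodings, and
take_sample simply concatenates a few contiguous chunks spread across the
column, which is not the sampling technique from the BtrBlocks paper.

```python
import struct


def encode_plain(values):
    """Toy PLAIN encoding: each value as a little-endian int32."""
    return struct.pack(f"<{len(values)}i", *values)


def encode_run_length(values):
    """Toy run-length encoding: (run length, value) pairs as int32."""
    out = bytearray()
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        out += struct.pack("<ii", j - i, values[i])
        i = j
    return bytes(out)


# Hypothetical candidate set; a real writer would list its own encodings.
ENCODERS = {"plain": encode_plain, "run_length": encode_run_length}


def take_sample(values, chunks=4, chunk_len=16):
    """Concatenate a few contiguous chunks spread evenly over the column."""
    if len(values) <= chunks * chunk_len:
        return list(values)
    stride = len(values) // chunks
    sample = []
    for c in range(chunks):
        start = c * stride
        sample.extend(values[start:start + chunk_len])
    return sample


def pick_encoding(values):
    """Encode the sample with every candidate and keep the smallest output."""
    sample = take_sample(values)
    return min(ENCODERS, key=lambda name: len(ENCODERS[name](sample)))
```

For example, pick_encoding([7] * 1000) selects "run_length", while
pick_encoding(list(range(1000))) selects "plain". Since this is purely a
writer-side decision, nothing in the file format has to change to support it.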
