Picking the right encoding for Parquet is much simpler than for BtrBlocks
or FastLanes: those formats have to search a tree of nested encodings up to
some depth, whereas for Parquet the "tree" is fixed.

As we add more encodings, encoding selection will require a more
sophisticated search.

Right now the decision tree for Parquet is rather simple (a sketch follows
below):
1. try dictionary encoding; if profitable, encode data pages with RLE [1]
2. if dictionary is not profitable:
  a. for strings, try DELTA_LENGTH_BYTE_ARRAY and DELTA_BYTE_ARRAY and pick
the better one
  b. otherwise, try DELTA_BINARY_PACKED or BYTE_STREAM_SPLIT
3. finally, decide between data page v1 and v2 (sometimes v1 is better than
v2)
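
To make that concrete, here is a minimal Python sketch of that decision
tree. It is only an illustration: size_of stands in for a writer's
"encode and measure" step and is not any real Parquet API.

    from typing import Callable, Sequence

    def choose_encoding(
        values: Sequence,
        is_string: bool,
        size_of: Callable[[Sequence, str], int],
    ) -> str:
        # 1. Prefer dictionary when the dictionary page plus RLE-encoded
        #    ids beat plain encoding.
        if size_of(values, "RLE_DICTIONARY") < size_of(values, "PLAIN"):
            return "RLE_DICTIONARY"
        # 2a. Strings: pick the better of the two delta byte-array encodings.
        if is_string:
            candidates = ("DELTA_LENGTH_BYTE_ARRAY", "DELTA_BYTE_ARRAY")
        # 2b. Otherwise: delta bit-packing vs. byte-stream split.
        else:
            candidates = ("DELTA_BINARY_PACKED", "BYTE_STREAM_SPLIT")
        return min(candidates, key=lambda enc: size_of(values, enc))

    # Toy usage with a fake size function that favors dictionary on
    # low-cardinality data:
    fake_size = lambda vs, enc: (len(set(vs)) + len(vs) // 8
                                 if enc == "RLE_DICTIONARY" else len(vs))
    assert choose_encoding(["a"] * 100 + ["b"], True, fake_size) == "RLE_DICTIONARY"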

We could bring Parquet closer to nested encodings by changing the spec a
bit, without adding genuinely new encodings:
1. present dictionary encoding as a transformation from any value domain to
the integer domain (sketched below)
2. allow encoding those integers with PLAIN, RLE, DELTA_BINARY_PACKED and
BYTE_STREAM_SPLIT
3. "add" the resulting encodings (RLE_DICTIONARY already exists):
DELTA_DICTIONARY, BYTE_STREAM_SPLIT_DICTIONARY
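
To illustrate point 1, here is a toy Python sketch of dictionary encoding
as a pure domain transformation: any value domain becomes integer ids, and
those ids can then be fed to any integer encoding. Composing the transform
with RLE, DELTA_BINARY_PACKED, etc. is all that names like DELTA_DICTIONARY
would mean; of these composed encodings, only RLE_DICTIONARY exists in the
spec today.

    def dictionary_transform(values):
        """Map arbitrary values to (dictionary, integer ids)."""
        dictionary, ids, index = [], [], {}
        for v in values:
            if v not in index:
                index[v] = len(dictionary)
                dictionary.append(v)
            ids.append(index[v])
        return dictionary, ids

    dictionary, ids = dictionary_transform(["a", "b", "a", "a", "c"])
    assert dictionary == ["a", "b", "c"]
    assert ids == [0, 1, 0, 0, 2]
    # ids is now in the integer domain, so PLAIN, RLE, DELTA_BINARY_PACKED
    # or BYTE_STREAM_SPLIT could each encode it.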

This would add more decisions for the writer and could produce better
Parquet files. That said, I don't see BYTE_STREAM_SPLIT being very useful
for dictionary ids; DELTA_BINARY_PACKED may beat RLE in some cases, though.

On Fri, Oct 31, 2025 at 3:38 AM wish maple <[email protected]> wrote:

> Generally, I think encoded size can be a good optimization target at
> first. DuckDB does well at picking dictionary encoding and Bloom filters,
> and at splitting pages by row count. However, it seems that DuckDB doesn't
> pick among the Parquet v2 encodings.
>
> Picking dictionary encoding may be straightforward, but row-group sizing
> and page splitting are affected by reader workloads and the writer's
> memory limit.
>
>
> Best,
> Xuwei Fu
>
> On Fri, Oct 31, 2025 at 9:15 AM Kenny Daniel <[email protected]> wrote:
>
> > It would be great if there were more discussion of these encoder choices.
> > Authoring the hyparquet reader was more straightforward in some ways than
> > authoring the writer. A million little choices came up; sometimes I would
> > just guess, and other times I would look at reference implementations
> > (mostly DuckDB, because it has the cleanest code of all the Parquet
> > implementations imo).
> >
> > Examples:
> >
> >    - Choosing between RLE and bit-packing in the RLE/bit-packed hybrid
> >      encoding (try both:
> >      https://github.com/hyparam/hyparquet-writer/blob/v0.8.0/src/encoding.js#L14-L24)
> >    - When to use dictionary encoding? (non-boolean columns where
> >      rows / unique > 2:
> >      https://github.com/hyparam/hyparquet-writer/blob/v0.8.0/src/column.js#L91-L100)
> >    - When to split pages? By row count? By compressed bytes? By
> >      uncompressed bytes?
> >    - Default row group size?
> >
> > On Thu, Oct 30, 2025 at 2:09 PM Andrew Lamb <[email protected]> wrote:
> >
> > > Hello,
> > >
> > > I wanted to start a discussion, continuing from the sync earlier this
> > > week, about adaptive encoding selection as a way to improve the
> > > performance of Parquet implementations.
> > >
> > > Specifically, the question came up: "would we need to specify in the
> > > spec some way to pick between any new encodings?" I think the consensus
> > > answer was "no, that would be up to the writer implementation".
> > >
> > > At the moment, all open source Parquet implementations I know of have
> > > simple heuristics for choosing the encoding, but a more dynamic encoding
> > > advisor is being discussed in arrow-rs (thanks MapleFU) [1]. Are there
> > > other conversations that people are aware of?
> > >
> > >
> > > Andrew
> > >
> > > p.s. This is part of a larger point about a common misconception
> > > regarding the Parquet format: that the choice of encoding is dictated
> > > by the format. Many new file format proposals, such as BtrBlocks [2],
> > > include a dynamic approach to encoding selection (after claiming that
> > > "Parquet" uses static heuristics):
> > >
> > > > In BtrBlocks, we test each encoding scheme on a sample and select the
> > > > scheme that performs best.
> > >
> > > (the paper then proposes a sampling technique to avoid skewed samples)
> > >
> > > [1]: https://github.com/apache/arrow-rs/issues/8378
> > > [2]: https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/btrblocks.pdf
> > >
> >
>
