Picking the right encoding for Parquet is much simpler than for BtrBlocks or FastLanes. The latter have to search a tree of nested encodings up to some depth, whereas for Parquet the "tree" is fixed.

As we add more encodings, encoding selection will require a more sophisticated search. Right now the decision tree for Parquet is rather simple (a sketch in code follows the list):

1. try dictionary; if profitable, encode data pages with RLE [1]
2. if not dictionary:
   a. if strings, try DELTA_LENGTH_BYTE_ARRAY or DELTA_BYTE_ARRAY and pick the best
   b. otherwise, try DELTA_BINARY_PACKED or BYTE_STREAM_SPLIT
3. finally, decide on page v1 or v2 (sometimes v1 is better than v2)
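To make that concrete, here is the tree as writer pseudocode. This is a hand-written Python sketch: dictionary_is_profitable and trial_size are hypothetical stand-ins rather than any real implementation's API, and the size model is a toy.

    def dictionary_is_profitable(values):
        # Toy heuristic: dictionary pays off when values repeat often.
        # Real writers also cap dictionary size and can fall back to
        # plain encoding mid-write.
        return len(set(values)) * 2 < len(values)

    def trial_size(values, encoding):
        # Stand-in for trial-encoding a page and measuring its bytes; a
        # real writer would run the actual encoder here. The discount
        # factors are made up purely for illustration.
        plain = sum(len(repr(v)) for v in values)
        discount = {"DELTA_BINARY_PACKED": 0.5, "BYTE_STREAM_SPLIT": 0.8,
                    "DELTA_LENGTH_BYTE_ARRAY": 0.7, "DELTA_BYTE_ARRAY": 0.6}
        return plain * discount.get(encoding, 1.0)

    def choose_encoding(values, is_string):
        # 1. Dictionary first; data pages then carry RLE-encoded ids.
        if dictionary_is_profitable(values):
            return "RLE_DICTIONARY"
        # 2. Otherwise trial the type-appropriate candidates and keep
        # whichever comes out smallest.
        if is_string:
            candidates = ["DELTA_LENGTH_BYTE_ARRAY", "DELTA_BYTE_ARRAY"]
        else:
            candidates = ["DELTA_BINARY_PACKED", "BYTE_STREAM_SPLIT"]
        return min(candidates, key=lambda e: trial_size(values, e))
        # 3. Page v1 vs. v2 is decided separately; v1 sometimes wins.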
We could bring Parquet closer to nested encodings by changing the spec a bit, without adding new encodings (a sketch follows the list):

1. present dictionary encoding as a transformation from any domain to the integer domain
2. allow encoding integers with PLAIN, RLE, DELTA_BINARY_PACKED and BYTE_STREAM_SPLIT
3. "add" the new encodings (RLE_DICTIONARY already exists): DELTA_DICTIONARY, BYTE_STREAM_SPLIT_DICTIONARY

This would add more decisions to the writer and potentially generate better Parquet files. That said, I don't see BYTE_STREAM_SPLIT being super useful for dictionary ids. DELTA_BINARY_PACKED may be better than RLE in some cases, though.
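Here is what point 1 looks like in code: dictionary encoding becomes a value-to-id transform, and any integer encoding composes with it. The function names are mine, not the spec's; id_encoder is whatever integer encoder the writer would otherwise use.

    def dictionary_transform(values):
        # Transformation from any domain to the integer domain:
        # distinct values map to dense ids in first-seen order.
        dictionary, ids = {}, []
        for v in values:
            ids.append(dictionary.setdefault(v, len(dictionary)))
        return list(dictionary), ids

    def dictionary_encode(values, id_encoder):
        # The "new" encodings are then just compositions:
        #   RLE_DICTIONARY               = transform + RLE(ids)
        #   DELTA_DICTIONARY             = transform + DELTA_BINARY_PACKED(ids)
        #   BYTE_STREAM_SPLIT_DICTIONARY = transform + BYTE_STREAM_SPLIT(ids)
        dictionary, ids = dictionary_transform(values)
        return dictionary, id_encoder(ids)

The writer would then pick id_encoder the same way it picks any other integer encoding, e.g. delta-packing the ids of roughly sorted data, which is where DELTA_BINARY_PACKED could beat RLE.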
On Fri, Oct 31, 2025 at 3:38 AM wish maple <[email protected]> wrote:

> Generally I think encoded size can be a good target at first. DuckDB
> does well at picking dictionary encoding and bloom filters, and at
> splitting pages by row. However, it seems that DuckDB won't pick the
> Parquet v2 encodings.
>
> Picking dictionary encoding might be straightforward sometimes;
> however, the row-group size and where to split pages are affected by
> reader workloads and the writer's memory limit.
>
> Best,
> Xuwei Fu
>
> On Fri, Oct 31, 2025 at 09:15, Kenny Daniel <[email protected]> wrote:
>
> > It would be great if there were more discussion of these encoder choices.
> > Authoring the hyparquet reader was more straightforward in some ways than
> > authoring the writer. A million little choices came up; sometimes I would
> > just guess, and other times I would look at reference implementations
> > (mostly DuckDB, because it has the cleanest code of all the Parquet
> > implementations, imo).
> >
> > Examples:
> >
> > - Choosing between RLE and bit-packing in the RLE/bit-packed hybrid
> >   encoding (try both:
> >   https://github.com/hyparam/hyparquet-writer/blob/v0.8.0/src/encoding.js#L14-L24)
> > - When to use dictionary encoding? (non-boolean columns where
> >   rows / unique > 2:
> >   https://github.com/hyparam/hyparquet-writer/blob/v0.8.0/src/column.js#L91-L100)
> > - When to split pages? By row count? By compressed bytes? By
> >   uncompressed bytes?
> > - Default row group size?
> >
> > On Thu, Oct 30, 2025 at 2:09 PM Andrew Lamb <[email protected]> wrote:
> >
> > > Hello,
> > >
> > > I wanted to start a discussion about adaptive encoding selection as a
> > > way to improve the performance of Parquet implementations, which we
> > > started at the sync earlier this week.
> > >
> > > Specifically, the question came up "would we need to specify some way
> > > in the spec to pick between any new encodings", and I think the
> > > consensus answer was "no, that would be up to the writer implementation".
> > >
> > > At the moment, all open source Parquet implementations I know of use
> > > simple heuristics for choosing the encoding, but we are discussing a
> > > more dynamic encoding advisor in arrow-rs (thanks MapleFU) [1]. Are
> > > there other conversations that people are aware of?
> > >
> > > Andrew
> > >
> > > p.s. This is part of a larger point, and a common misconception, about
> > > the Parquet format: specifically, that the choice of encoding is
> > > dictated by the format. Many new file format proposals, such as
> > > BtrBlocks [2], include a dynamic approach to encoding selection (after
> > > claiming that "Parquet" uses static heuristics):
> > >
> > > > In BtrBlocks, we test each encoding scheme on a sample and select
> > > > the scheme that performs best.
> > >
> > > (the paper then proposes a sampling technique to avoid skewed samples)
> > >
> > > [1]: https://github.com/apache/arrow-rs/issues/8378
> > > [2]: https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/btrblocks.pdf
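(For anyone skimming: a minimal sketch of the sample-and-select idea in the BtrBlocks quote above. The single random sample is a deliberate simplification, since the paper's contribution is precisely a better sampling scheme, and the encoder names in the usage comment are hypothetical.)

    import random

    def select_encoding(values, encoders, sample_size=1024, seed=0):
        # Trial every candidate encoder on a small sample and keep the
        # one that yields the fewest bytes. BtrBlocks samples several
        # disjoint runs to avoid skew; one random sample is a shortcut.
        rng = random.Random(seed)
        sample = rng.sample(values, min(sample_size, len(values)))
        return min(encoders, key=lambda enc: len(enc(sample)))

    # e.g. best = select_encoding(ints, [rle_encode, delta_encode]),
    # where rle_encode and delta_encode are functions returning bytes.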
