Re: Discussion: Dynamic encoding selection for Parquet

Alkis Evlogimenos Mon, 10 Nov 2025 00:19:05 -0800

DELTA_BINARY_PACKED of dictionary ids makes sense if the writer assigns ids
in frequency order and the input values follow a power-law/Zipf distribution.


For example, say we get half a million dictionary ids but most values are
concentrated in the first 500 ids. RLE will use a bit width of 19 for every
value, whereas most miniblocks will have small deltas (magnitude < 500) and
will pack very well in <= 9 bits.
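
To make the arithmetic concrete, here is a rough Python sketch, not a
real encoder. It assumes a two-tier id distribution (99.9% of values drawn
uniformly from the first 500 ids, the rest from all 500,000), a block size of
128 and a miniblock size of 32. The function names and all of these
parameters are made up for illustration. It ignores headers and RLE runs and
only compares the fixed bit-packed width with the per-miniblock delta widths.

```python
import random

BLOCK = 128      # values per block (assumed writer setting)
MINIBLOCK = 32   # values per miniblock (assumed writer setting)


def sample_ids(count, num_ids=500_000, hot_ids=500, hot_prob=0.999, seed=0):
    """Hypothetical skewed id stream: mostly 'hot' ids, occasionally any id."""
    rng = random.Random(seed)
    return [
        rng.randrange(hot_ids) if rng.random() < hot_prob
        else rng.randrange(num_ids)
        for _ in range(count)
    ]


def delta_miniblock_widths(ids):
    """Bit width each miniblock would need under a delta scheme that stores
    deltas relative to the minimum delta of the enclosing block."""
    deltas = [b - a for a, b in zip(ids, ids[1:])]
    widths = []
    for start in range(0, len(deltas), BLOCK):
        block = deltas[start:start + BLOCK]
        min_delta = min(block)
        for m in range(0, len(block), MINIBLOCK):
            mini = block[m:m + MINIBLOCK]
            # Largest adjusted delta in the miniblock decides its width.
            widths.append((max(mini) - min_delta).bit_length())
    return widths


ids = sample_ids(BLOCK * 1000 + 1)
fixed_width = (500_000 - 1).bit_length()   # width needed for the largest id
widths = delta_miniblock_widths(ids)
avg_width = sum(widths) / len(widths)

print("fixed bit width:", fixed_width)
print("average delta miniblock width:", round(avg_width, 2))
print("narrowest / widest miniblock:", min(widths), max(widths))
```

Two caveats about this sketch: deltas are signed and taken relative to the
block's minimum delta, so a miniblock of ids below 500 can need one bit more
than log2(500) suggests, and a single rare id widens every miniblock in its
block. How much is saved overall therefore depends heavily on how skewed the
data really is.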

On Sun, Nov 9, 2025 at 11:03 AM Antoine Pitrou <[email protected]> wrote:

> On Mon, 3 Nov 2025 08:26:00 +0100
> Alkis Evlogimenos
> <[email protected]>
> wrote:
> >
> > We could bring Parquet closer to nested encodings by changing the spec a
> > bit without adding new encodings:
> > 1. present dictionary encoding as a transformation from any domain to the
> > integer domain
> > 2. allow encoding integers with PLAIN, RLE, DELTA_BINARY_PACKED and
> > BYTE_STREAM_SPLIT
>
> I'm not sure delta-encoding of dictionary ids makes much sense? Unless
> perhaps the "active" subset of the dictionary changes slowly along a
> column.
>
> > 3. "add" the new encodings (RLE_DICTIONARY exists): DELTA_DICTIONARY,
> > BYTE_STREAM_SPLIT_DICTIONARY
> >
> > This would add more decisions to the writer and potentially generate
> > better Parquet files. That said I don't see BYTE_STREAM_SPLIT being
> > super useful
> > for dictionary ids. DELTA_BINARY_PACKED may be better than RLE in some
> > cases though.
>
> Given that dictionary ids are generally small integers,
> BYTE_STREAM_SPLIT could increase their compressibility quite a bit. How
> it would fare compared to RLE is an open question.
>
> Regards
>
> Antoine.
>
>
>
