DELTA_BINARY_PACKED of dictionary ids makes sense if the writer assigns ids in frequency order and the input values follow a power-law (Zipf-like) distribution.
For example, say we get half a million dictionary ids, but most values are concentrated in the first 500 ids. RLE will use a bit width of 19, but most miniblocks will have small deltas (magnitude < 500) and will pack very well, in about 9 bits.

On Sun, Nov 9, 2025 at 11:03 AM Antoine Pitrou <[email protected]> wrote:

> On Mon, 3 Nov 2025 08:26:00 +0100
> Alkis Evlogimenos
> <[email protected]>
> wrote:
> >
> > We could bring Parquet closer to nested encodings by changing the spec a
> > bit without adding new encodings:
> > 1. present dictionary encoding as a transformation from any domain to the
> > integer domain
> > 2. allow encoding integers with PLAIN, RLE, DELTA_BINARY_PACKED and
> > BYTE_STREAM_SPLIT
>
> I'm not sure delta-encoding of dictionary ids makes much sense? Unless
> perhaps the "active" subset of the dictionary changes slowly along a
> column.
>
> > 3. "add" the new encodings (RLE_DICTIONARY exists): DELTA_DICTIONARY,
> > BYTE_STREAM_SPLIT_DICTIONARY
> >
> > This would add more decisions to the writer and potentially generate
> > better Parquet files. That said I don't see BYTE_STREAM_SPLIT being
> > super useful for dictionary ids. DELTA_BINARY_PACKED may be better than
> > RLE in some cases though.
>
> Given that dictionary ids are generally small integers,
> BYTE_STREAM_SPLIT could increase their compressibility quite a bit. How
> it would fare compared to RLE is an open question.
>
> Regards
>
> Antoine.
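
P.S. To make the bit-width arithmetic at the top of this mail concrete, here is a rough Python sketch. It is not taken from any Parquet implementation. The function names, the 128-value block with 4 miniblocks, and the skewed id generator (a crude stand-in for a Zipf-like distribution) are all assumptions for illustration. It only estimates bit widths and ignores headers, padding and the RLE runs themselves. Note that deltas are signed, so after subtracting the per-block minimum delta a miniblock can need one bit more than the ids it covers.

```python
import random
from collections import Counter


def rle_bit_width(ids):
    """Bit width needed to bit-pack every id: enough bits for the largest one."""
    return max(ids).bit_length()


def delta_miniblock_bit_widths(ids, block_size=128, miniblocks_per_block=4):
    """Approximate the per-miniblock bit widths of a delta encoding.

    Simplified model (assumption): deltas between consecutive ids are grouped
    into blocks, the minimum delta of each block is subtracted, and every
    miniblock uses the width of its largest adjusted delta. A trailing
    partial block is sized as-is rather than padded.
    """
    deltas = [b - a for a, b in zip(ids, ids[1:])]
    miniblock_size = block_size // miniblocks_per_block
    widths = []
    for start in range(0, len(deltas), block_size):
        block = deltas[start:start + block_size]
        min_delta = min(block)
        for m in range(0, len(block), miniblock_size):
            adjusted = [d - min_delta for d in block[m:m + miniblock_size]]
            widths.append(max(adjusted).bit_length())
    return widths


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical skew: 99.9% of values hit the 500 most frequent ids,
    # the rest are spread over a dictionary of 500,000 entries.
    ids = [
        rng.randrange(500) if rng.random() < 0.999 else rng.randrange(500_000)
        for _ in range(100_000)
    ]
    print("single bit-packed width:", rle_bit_width(ids))
    histogram = Counter(delta_miniblock_bit_widths(ids))
    for width in sorted(histogram):
        print(f"miniblocks at {width:2d} bits: {histogram[width]}")
```

The thing to look at in the output is how many miniblocks still land at a wide bit width: one rare id is enough to widen its whole miniblock (and, through the minimum delta, possibly its neighbours), so the gain depends heavily on how skewed the ids really are.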
