On Mon, 3 Nov 2025 08:26:00 +0100
Alkis Evlogimenos <[email protected]> wrote:
>
> We could bring Parquet closer to nested encodings by changing the spec a
> bit without adding new encodings:
> 1. present dictionary encoding as a transformation from any domain to the
> integer domain
> 2. allow encoding integers with PLAIN, RLE, DELTA_BINARY_PACKED and
> BYTE_STREAM_SPLIT

I'm not sure delta-encoding of dictionary ids makes much sense? Unless
perhaps the "active" subset of the dictionary changes slowly along a
column.

> 3. "add" the new encodings (RLE_DICTIONARY exists): DELTA_DICTIONARY,
> BYTE_STREAM_SPLIT_DICTIONARY
>
> This would add more decisions to the writer and potentially generate better
> Parquet files. That said I don't see BYTE_STREAM_SPLIT being super useful
> for dictionary ids. DELTA_BINARY_PACKED may be better than RLE in some
> cases though.

Given that dictionary ids are generally small integers,
BYTE_STREAM_SPLIT could increase their compressibility quite a bit. How
it would fare compared to RLE is an open question.

Regards

Antoine.
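
For anyone who wants to poke at the BYTE_STREAM_SPLIT point, here is a rough sketch, not taken from any Parquet implementation. It assumes dictionary ids are stored as 4-byte little-endian integers, uses uniformly random ids below 200 as a stand-in for a real column, and uses zlib as a stand-in for whatever page compression is configured. The helper names byte_stream_split and byte_stream_join are made up for illustration. It only compares plain against byte-stream-split layout and says nothing about RLE.

```python
import random
import zlib


def byte_stream_split(values, width=4):
    """Put byte k of every little-endian value into stream k, then
    concatenate the streams (the layout BYTE_STREAM_SPLIT describes)."""
    raw = b"".join(v.to_bytes(width, "little") for v in values)
    return b"".join(raw[k::width] for k in range(width))


def byte_stream_join(data, width=4):
    """Inverse of byte_stream_split: re-interleave the streams."""
    n = len(data) // width
    streams = [data[k * n:(k + 1) * n] for k in range(width)]
    return [
        int.from_bytes(bytes(s[i] for s in streams), "little")
        for i in range(n)
    ]


# Hypothetical column: 10,000 ids drawn from a 200-entry dictionary.
random.seed(0)
ids = [random.randrange(200) for _ in range(10_000)]

plain = b"".join(v.to_bytes(4, "little") for v in ids)
split = byte_stream_split(ids)

# The transformation is lossless and does not change the byte count.
assert byte_stream_join(split) == ids
assert len(split) == len(plain)

# Compressed sizes of the two layouts, for comparison.
print("plain + zlib:", len(zlib.compress(plain)))
print("split + zlib:", len(zlib.compress(split)))
```

With small ids the upper three byte streams are all zeros, which is why the split layout is expected to help a general-purpose compressor. Real columns with runs or skewed id distributions could behave quite differently, so the numbers from this synthetic input should not be read as a benchmark.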
