On Mon, 3 Nov 2025 08:26:00 +0100
Alkis Evlogimenos <[email protected]> wrote:
> 
> We could bring Parquet closer to nested encodings by changing the spec a
> bit without adding new encodings:
> 1. present dictionary encoding as a transformation from any domain to the
> integer domain
> 2. allow encoding integers with PLAIN, RLE, DELTA_BINARY_PACKED and
> BYTE_STREAM_SPLIT

I'm not sure delta-encoding of dictionary ids makes much sense, unless
perhaps the "active" subset of the dictionary changes slowly along a
column.
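
To make the "slowly changing active subset" case concrete, here is a
rough Python sketch. Everything in it is an assumption for illustration:
the synthetic ids, the window size, and the helper name
delta_block_widths are made up, and the per-block width computation only
mimics the general idea behind DELTA_BINARY_PACKED (subtract the block's
minimum delta, then bit-pack), not its exact layout.

```python
import random

def delta_block_widths(ids, block=128):
    """Bit width needed per block of deltas after subtracting the block minimum."""
    deltas = [b - a for a, b in zip(ids, ids[1:])]
    widths = []
    for start in range(0, len(deltas), block):
        chunk = deltas[start:start + block]
        lowest = min(chunk)
        widths.append(max(d - lowest for d in chunk).bit_length())
    return widths

# Synthetic column: ids are drawn from a 16-entry "active" window whose
# base advances by one every 8 rows (a purely hypothetical distribution).
rng = random.Random(0)
window = 16
ids = [i // 8 + rng.randrange(window) for i in range(100_000)]

raw_width = max(ids).bit_length()        # bits per id if bit-packed directly
delta_widths = delta_block_widths(ids)   # bits per delta, block by block

print("raw bit width:", raw_width)
print("widest delta block:", max(delta_widths))
```

With ids spread uniformly over the whole dictionary instead, the deltas
would span roughly twice the id range and this advantage would disappear.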

> 3. "add" the new encodings (RLE_DICTIONARY exists): DELTA_DICTIONARY,
> BYTE_STREAM_SPLIT_DICTIONARY
> 
> This would add more decisions to the writer and potentially generate better
> Parquet files. That said I don't see BYTE_STREAM_SPLIT being super useful
> for dictionary ids. DELTA_BINARY_PACKED may be better than RLE in some
> cases though.

Given that dictionary ids are generally small integers,
BYTE_STREAM_SPLIT could increase their compressibility quite a bit. How
it would fare compared to RLE is an open question.
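
One way to start probing that question is a small experiment like the
Python sketch below. It is only an illustration under stated
assumptions: the ids are synthetic and uniformly random, they are stored
as 4-byte little-endian integers, zlib stands in for whatever
general-purpose codec would follow, and the helper names
byte_stream_split and byte_stream_join are hypothetical. It does not
include an RLE baseline.

```python
import random
import struct
import zlib

def byte_stream_split(values, width=4):
    """Put byte k of every little-endian value into stream k, then concatenate the streams."""
    raw = struct.pack(f"<{len(values)}i", *values)
    return b"".join(raw[k::width] for k in range(width))

def byte_stream_join(data, width=4):
    """Inverse of byte_stream_split: interleave the streams back into values."""
    n = len(data) // width
    streams = [data[k * n:(k + 1) * n] for k in range(width)]
    raw = bytes(b for group in zip(*streams) for b in group)
    return list(struct.unpack(f"<{n}i", raw))

# Synthetic dictionary ids: small non-negative integers (assumed distribution).
rng = random.Random(0)
ids = [rng.randrange(200) for _ in range(10_000)]

plain = struct.pack(f"<{len(ids)}i", *ids)
split = byte_stream_split(ids)

plain_size = len(zlib.compress(plain))
split_size = len(zlib.compress(split))
print("PLAIN + zlib:", plain_size, "bytes")
print("BYTE_STREAM_SPLIT + zlib:", split_size, "bytes")
```

Because every id here fits in one byte, the three upper byte streams are
all zeros, which is the property a downstream compressor can exploit.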

Regards

Antoine.

