When the input data fits a Zipf distribution, assigning ids to values when
they are first seen gets *very* close to the "optimal" dictionary with ids
ordered by frequency. I ran such experiments in the past on web and email
data, and the natural method of assigning ids on first occurrence produced
ids that were not far from optimal for data at that scale. I do not recall
the exact numbers, but sorting ids to get the optimal assignment did not
yield a large enough compression improvement to justify the complexity.
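
For illustration, here is a minimal Python sketch of the comparison
described above; it is not the original experiment, and the vocabulary
size, sample size, and per-id bit cost used as a compressibility proxy
are my own assumptions:

import random
from collections import Counter

random.seed(0)

# Zipf-like sample: the weight of the value at rank r is proportional to 1/r.
vocab = list(range(10_000))
weights = [1.0 / (rank + 1) for rank in vocab]
data = random.choices(vocab, weights=weights, k=1_000_000)

# First-seen assignment: a value gets the next id the first time it appears.
first_seen = {}
for v in data:
    if v not in first_seen:
        first_seen[v] = len(first_seen)

# Frequency-ordered ("optimal") assignment: the most frequent value gets id 0.
by_freq = {v: i for i, (v, _) in enumerate(Counter(data).most_common())}

def total_bits(ids):
    # Rough compressibility proxy: bits needed to represent each id alone.
    return sum(max(1, i.bit_length()) for i in ids)

bits_seen = total_bits(first_seen[v] for v in data)
bits_freq = total_bits(by_freq[v] for v in data)
print(f"first-seen vs frequency-ordered bit ratio: {bits_seen / bits_freq:.3f}")

Because frequent values tend to appear early in a Zipf-distributed stream,
first-seen ids correlate strongly with frequency rank, so the ratio should
come out close to 1.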

Writers typically assign ids in this first-seen order, so the encoding
should work well for all writers.
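
As a hypothetical sketch of that writer behavior (the function name and
signature are mine, not any Parquet implementation's API): the encoder
assigns each new value the next id on first occurrence, so the dictionary
grows in column data order while frequent values still tend to receive
small ids.

from typing import Hashable, List, Tuple

def dictionary_encode(column: List[Hashable]) -> Tuple[List[Hashable], List[int]]:
    # First-seen id assignment: the dictionary grows in data order.
    dictionary: List[Hashable] = []
    id_of: dict = {}
    ids: List[int] = []
    for value in column:
        if value not in id_of:
            id_of[value] = len(dictionary)
            dictionary.append(value)
        ids.append(id_of[value])
    return dictionary, ids

dictionary, ids = dictionary_encode(["a", "b", "a", "a", "c", "b", "a"])
print(dictionary)  # ['a', 'b', 'c']
print(ids)         # [0, 1, 0, 0, 2, 1, 0]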

On Mon, Nov 10, 2025 at 11:44 AM Antoine Pitrou <[email protected]> wrote:

> On Mon, 10 Nov 2025 09:17:31 +0100
> Alkis Evlogimenos
> <[email protected]>
> wrote:
> > DELTA_BINARY_PACKED of dictionary ids makes sense if writer assigns ids
> > in frequency order and the input values fit some power law/zipf
> > distribution.
>
> Does any writer actually do that, though? Parquet data is typically
> written page by page, and the column chunk's unique dictionary page
> will grow in column data order.
>
> Regards
>
> Antoine.