When the input data fits a Zipf distribution, assigning ids to values in first-seen order gets *very* close to the "optimal" dictionary, i.e. ids ordered by descending frequency. I ran such experiments in the past on web and email data, and the natural first-seen assignment produced ids that were not far from optimal for data of that size. I do not recall the exact numbers, but re-sorting the ids into the optimal assignment did not yield a large enough compression improvement to justify the complexity.
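A minimal sketch of the comparison described above (not the code from the original experiments): generate Zipf-like data, assign dictionary ids both in first-seen order and in descending-frequency order, and compare the occurrence-weighted mean id as a rough proxy for how many bits a small-integer codec would spend on the id stream. The helper names are made up for illustration.

```python
import random
from collections import Counter

def first_seen_ids(values):
    """Assign dictionary ids in the order values are first seen (the natural writer behavior)."""
    ids = {}
    for v in values:
        if v not in ids:
            ids[v] = len(ids)
    return ids

def frequency_ids(values):
    """Assign dictionary ids in descending frequency order (the 'optimal' assignment)."""
    counts = Counter(values)
    return {v: i for i, (v, _) in enumerate(counts.most_common())}

def mean_id(values, ids):
    # Average assigned id weighted by occurrence: a crude proxy for the
    # cost of encoding the id stream with a small-integer codec.
    return sum(ids[v] for v in values) / len(values)

random.seed(0)
# Zipf-ish sample: value k is drawn with probability proportional to 1/k.
population = list(range(1, 1001))
weights = [1.0 / k for k in population]
values = random.choices(population, weights=weights, k=100_000)

print("first-seen mean id:", mean_id(values, first_seen_ids(values)))
print("frequency mean id: ", mean_id(values, frequency_ids(values)))
```

Frequency order is provably minimal for this metric (rearrangement inequality), so first-seen can only be worse; on Zipf-like data the gap tends to be small because frequent values are also seen early and therefore get small ids anyway.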
Writers typically assign ids as above, so it should work for all writers.

On Mon, Nov 10, 2025 at 11:44 AM Antoine Pitrou <[email protected]> wrote:

> On Mon, 10 Nov 2025 09:17:31 +0100
> Alkis Evlogimenos <[email protected]> wrote:
> > DELTA_BINARY_PACKED of dictionary ids makes sense if writer assigns ids in
> > frequency order and the input values fit some power law/zipf distribution.
>
> Does any writer actually do that, though? Parquet data is typically
> written page per page, and the column chunk's unique dictionary page
> will grow in column data order.
>
> Regards
>
> Antoine.
