Hello,

I wanted to continue the discussion we started at the sync earlier this week
about adaptive encoding selection as a way to improve the performance of
Parquet implementations.

Specifically, the question came up: "would we need to specify some way in the
spec to pick between any new encodings?" I think the consensus answer
was "no, that would be up to the writer implementation."

At the moment, all open source Parquet implementations I know of use
simple heuristics for choosing the encoding, but we are discussing a more
dynamic encoding advisor in arrow-rs (thanks MapleFU)[1]. Are there other
conversations that people are aware of?
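
To make the contrast concrete, here is a minimal Rust sketch of the kind of
static heuristic I mean (this is not arrow-rs code; the enum, the 10%
cardinality threshold, and the function are invented for illustration):

use std::collections::HashSet;

#[derive(Debug, PartialEq)]
enum Encoding {
    Plain,
    RleDictionary,
}

// Toy static heuristic: prefer dictionary encoding when the column chunk
// has low cardinality (here, at most ~10% distinct values), otherwise plain.
// The threshold is illustrative, not something any implementation mandates.
fn choose_encoding(values: &[&str]) -> Encoding {
    let distinct: HashSet<&&str> = values.iter().collect();
    if distinct.len() * 10 <= values.len() {
        Encoding::RleDictionary
    } else {
        Encoding::Plain
    }
}

fn main() {
    let repetitive = vec!["a"; 1000];
    assert_eq!(choose_encoding(&repetitive), Encoding::RleDictionary);
}

A dynamic advisor would instead look at statistics of the data actually being
written (or a sample of it) before committing to an encoding.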


Andrew

p.s.  This is part of a larger point about a common misconception regarding
the Parquet format, namely that the choice of encoding is dictated by the
format itself.  Many new file format proposals such as BtrBlocks[2] include a
dynamic approach to encoding selection (after claiming that "Parquet" uses
static heuristics):

> In BtrBlocks, we test each encoding scheme on a sample and select the
> scheme that performs best.

(the paper then proposes a sampling technique to avoid skewed samples)
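
For reference, a sample-and-measure approach in that spirit might look roughly
like this in Rust (again just a sketch, not any existing API; the candidate
list and the size estimates are stand-ins for real encoders):

#[derive(Debug, Clone, Copy, PartialEq)]
enum Candidate {
    Plain,
    Dictionary,
    DeltaBinaryPacked,
}

// Rough encoded-size estimates for a sample of i64 values; only meant to be
// good enough to compare candidates against each other, not real encoders.
fn encoded_size(sample: &[i64], candidate: Candidate) -> usize {
    match candidate {
        Candidate::Plain => sample.len() * 8,
        Candidate::Dictionary => {
            let mut distinct: Vec<i64> = sample.to_vec();
            distinct.sort_unstable();
            distinct.dedup();
            distinct.len() * 8 + sample.len() * 2 // dictionary + narrow keys
        }
        Candidate::DeltaBinaryPacked => {
            // assume small deltas (~2 bytes each) for this toy estimate
            8 + sample.len().saturating_sub(1) * 2
        }
    }
}

// Estimate the size of the sample under every candidate and keep the smallest.
fn pick_encoding(sample: &[i64]) -> Candidate {
    [Candidate::Plain, Candidate::Dictionary, Candidate::DeltaBinaryPacked]
        .into_iter()
        .min_by_key(|c| encoded_size(sample, *c))
        .unwrap()
}

fn main() {
    let monotonic: Vec<i64> = (0..1024).collect();
    println!("{:?}", pick_encoding(&monotonic)); // likely DeltaBinaryPacked
}

The interesting part, as the paper notes, is choosing the sample so it is
representative of the whole column chunk.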

[1]: https://github.com/apache/arrow-rs/issues/8378
[2]: https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/btrblocks.pdf
