Hello, I wanted to continue the discussion we started at the sync earlier this week about adaptive encoding selection as a way to improve the performance of Parquet implementations.
Specifically, the question came up: "would we need to specify some way in the spec to pick between any new encodings?" I think the consensus answer was "no, that would be up to the writer implementation."

At the moment, all open source Parquet implementations I know of use simple heuristics for choosing the encoding, but we are discussing a more dynamic encoding advisor in arrow-rs (thanks MapleFU) [1]. Are there other conversations that people are aware of?

Andrew

p.s. This is part of a larger point, and a misconception about the Parquet format: specifically, that the choice of encoding is dictated by the format. Many new file format proposals, such as BtrBlocks [2], include a dynamic approach to encoding selection (after claiming that "Parquet" uses static heuristics):

> In BtrBlocks, we test each encoding scheme on a sample and select the scheme that performs best.

(The paper then proposes a sampling technique to avoid skewed samples.)

[1]: https://github.com/apache/arrow-rs/issues/8378
[2]: https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/btrblocks.pdf
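For anyone who wants a concrete picture of what "test each encoding on a sample" means, here is a minimal, purely illustrative sketch in Python. The cost functions and encoding names below are hypothetical stand-ins (rough size estimates, not real Parquet encoders), just to show the shape of a sample-based encoding advisor.

```python
def plain_size(values):
    # PLAIN: every value stored verbatim (assume 4 bytes per int32 value).
    return 4 * len(values)

def dict_size(values):
    # Dictionary-style: distinct values (4 bytes each) plus a small
    # per-value index (assume 1 byte each); ignores bit-packing details.
    return 4 * len(set(values)) + len(values)

def rle_size(values):
    # Run-length-style: one (value, run-length) pair (assume 8 bytes) per run.
    runs = 1 + sum(1 for a, b in zip(values, values[1:]) if a != b)
    return 8 * runs

# Candidate encodings with their (hypothetical) cost models.
CANDIDATES = {"PLAIN": plain_size, "DICTIONARY": dict_size, "RLE": rle_size}

def choose_encoding(column, sample_size=1024):
    # Estimate each candidate's encoded size on a sample of the column
    # and pick the cheapest -- the writer-side decision the spec leaves open.
    sample = column[:sample_size]
    return min(CANDIDATES, key=lambda name: CANDIDATES[name](sample))
```

For example, a constant column would select the run-length candidate, a low-cardinality column the dictionary candidate, and a column of all-distinct values plain encoding. A real advisor would also need the skew-aware sampling the BtrBlocks paper describes, since a prefix of the column may not be representative.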
