I've developed a stable codec for numerical columns called Quantile Compression <https://github.com/mwlon/quantile-compression>. It achieves a compression ratio about 30% higher than even Zstd, at similar compression and decompression times. It does this by tailoring its encoding to the data type (floats, ints, timestamps, bools).
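For a rough sense of what a 30% higher compression ratio means in bytes, here is a back-of-the-envelope sketch. The function name `size_reduction` and its parameters are hypothetical, introduced only for illustration. The 1.3 factor is the approximate figure quoted above, not a measurement on ORC files, and `numeric_fraction` is assumed to be the share of compressed bytes that the numerical columns occupy under the baseline codec.

```python
def size_reduction(ratio_gain: float, numeric_fraction: float = 1.0) -> float:
    """Fraction of total compressed bytes saved when the numerical share of
    a file compresses `ratio_gain` times better than the baseline codec.

    Illustrative arithmetic only; both inputs are assumptions.
    """
    if ratio_gain <= 0:
        raise ValueError("ratio_gain must be positive")
    if not 0.0 <= numeric_fraction <= 1.0:
        raise ValueError("numeric_fraction must be between 0 and 1")
    # A ratio that is `ratio_gain` times higher shrinks the output to
    # 1 / ratio_gain of the baseline size, so the saving is 1 - 1 / ratio_gain.
    return numeric_fraction * (1.0 - 1.0 / ratio_gain)


# A purely numerical column.
print(f"{size_reduction(1.3):.1%}")       # 23.1%
# A file whose compressed bytes are half numerical, half everything else.
print(f"{size_reduction(1.3, 0.5):.1%}")  # 11.5%
```

So a 30% higher ratio corresponds to roughly 23% fewer bytes on the numerical data itself, and the saving on a whole file scales with how much of that file is numerical.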
I'm using it in my own projects, and a few others have adopted it, but it would also be a good fit for ORC columns. Assuming a 50-50 split between text-like and numerical data, it could reduce the average ORC file size by over 10% with no extra compute cost. Incorporating it into ORC would be quite powerful, since the codec by itself only works on a single flat column of non-nullable numbers, while ORC already handles multiple columns, nulls, and nested types.

Would the ORC community be interested in this? How can we make this available to users? I've already built a Spark connector <https://github.com/pancake-db/spark-pancake-connector> for a project using this codec and have seen fast query times.

Thanks,
Martin