I've developed a stable codec for numerical columns called Quantile
Compression <https://github.com/mwlon/quantile-compression>.
It achieves a compression ratio about 30% higher than Zstd's at similar
compression and decompression speeds, by specializing for the data type
(floats, ints, timestamps, bools).

I'm using it in my own projects, and a few others have adopted it, but it
would also be perfect for ORC columns. Assuming a 50-50 split between
text-like and numerical data, it could reduce the average ORC file size by
over 10% with no extra compute cost. ORC would also complement the codec
well: by itself it only works on a single flat column of non-nullable
numbers, so it needs a format like ORC around it to handle nulls and
nested data.
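
For anyone who wants to check the "over 10%" figure, here is a rough
back-of-the-envelope sketch. It assumes the 50-50 split refers to compressed
size under Zstd and that the text-like half is left unchanged; the function
name and parameters are just for illustration and are not part of the codec.

```python
def file_size_reduction(numeric_fraction: float, ratio_gain: float) -> float:
    """Estimate the fractional reduction in total file size.

    numeric_fraction: share of the current (Zstd-compressed) file taken up
        by numerical columns (assumed, e.g. 0.5 for a 50-50 split).
    ratio_gain: relative improvement in compression ratio on those columns
        (e.g. 0.30 for a 30% higher ratio).
    """
    # A ratio that is (1 + gain) times higher shrinks the numerical part
    # to 1 / (1 + gain) of its current size.
    new_numeric = numeric_fraction / (1.0 + ratio_gain)
    # The text-like part is assumed to stay exactly as it is.
    new_total = (1.0 - numeric_fraction) + new_numeric
    return 1.0 - new_total


reduction = file_size_reduction(0.5, 0.30)
print(f"{reduction:.1%}")  # prints 11.5%
```

A 30% higher ratio shrinks the numerical half by about 23%, which works out
to roughly 11.5% of the whole file under these assumptions.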

Would the ORC community be interested in this? How can we make this
available to users? I've already built a Spark connector
<https://github.com/pancake-db/spark-pancake-connector> for a project using
this codec and gotten fast query times.

Thanks,
Martin
