Thank you for sharing, Martin.

For the codec, you can take advantage of our benchmark suite
(NONE/ZLIB/SNAPPY/ZSTD) to show the benefits in the ORC format.

https://github.com/apache/orc/blob/main/java/bench/core/src/java/org/apache/orc/bench/core/CompressionKind.java#L34-L37

For the Spark connector, I'd recommend you also send this to dev@spark.
You will get attention on both parts (codec and connector).

I'm looking forward to seeing your benchmark results.

Dongjoon.


On Thu, May 19, 2022 at 4:12 PM Martin Loncaric <m.w.lonca...@gmail.com>
wrote:

> I've developed a stable codec for numerical columns called Quantile
> Compression <https://github.com/mwlon/quantile-compression>.
> It achieves about a 30% higher compression ratio than even Zstd at similar
> compression and decompression time. It achieves this by tailoring to the
> data type (floats, ints, timestamps, bools).
>
> I'm using it in my own projects, and a few others have adopted it, but it
> would also be perfect for ORC columns. Assuming a 50-50 split between
> text-like and numerical data, it could reduce the average ORC file size by
> over 10% with no extra compute cost. Incorporating it into ORC would be
> quite powerful since the codec by itself only works on a single flat column
> of non-nullable numbers.
>
> Would the ORC community be interested in this? How can we make this
> available to users? I've already built a Spark connector
> <https://github.com/pancake-db/spark-pancake-connector> for a project
> using this codec and have gotten fast query times.
>
> Thanks,
> Martin
>
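To make the "tailoring to the data type" claim concrete, here is a minimal Python sketch. It is NOT the Quantile Compression algorithm (see the linked repository for that); it only illustrates why a codec that understands a numeric column's structure can beat a generic byte-oriented codec: delta-encoding a timestamp-like column leaves low-entropy residuals that compress far better. The data is hypothetical.

```python
# Illustration only: not the Quantile Compression algorithm, just a sketch of
# why type-aware numeric encoding helps a generic byte codec like zlib.
import struct
import zlib

# A monotonically drifting, timestamp-like column (hypothetical data).
values = [1_650_000_000_000 + i * 1000 + (i * 37) % 5 for i in range(10_000)]

# Generic approach: serialize raw little-endian 8-byte ints, then compress.
raw = b"".join(struct.pack("<q", v) for v in values)
generic = zlib.compress(raw, 9)

# Type-aware approach: delta-encode first (small, low-entropy residuals),
# then feed the same byte-level codec.
deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
delta_raw = b"".join(struct.pack("<q", d) for d in deltas)
type_aware = zlib.compress(delta_raw, 9)

print(len(generic), len(type_aware))
```

On this data the delta-encoded stream compresses to a small fraction of the generic one; a real codec like Quantile Compression goes well beyond this by modeling the value distribution itself.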
