This is interesting, but it sounds like it corresponds more to the RLE encoding 
that we do than to the generic compression code. 

Has anyone done a Java version of the library? It is faster to iterate on this 
kind of design in Java. On the other hand, I’ve heard that someone is thinking 
about doing a Rust ORC reader & writer, but it isn’t done yet. 😊

.. Owen

> On May 19, 2022, at 17:16, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
> 
> Thank you for sharing, Martin.
> 
> For the codec, you can take advantage of our benchmark suite
> (NONE/ZLIB/SNAPPY/ZSTD) to show the benefits in ORC format.
> 
> https://github.com/apache/orc/blob/main/java/bench/core/src/java/org/apache/orc/bench/core/CompressionKind.java#L34-L37
> 
> For the Spark connector, I'd recommend sending this to dev@spark as
> well. You will get attention on both parts (codec and connector).
> 
> I'm looking forward to seeing your benchmark results.
> 
> Dongjoon.
> 
> 
>> On Thu, May 19, 2022 at 4:12 PM Martin Loncaric <m.w.lonca...@gmail.com>
>> wrote:
>> 
>> I've developed a stable codec for numerical columns called Quantile
>> Compression <https://github.com/mwlon/quantile-compression>.
>> It achieves about a 30% higher compression ratio than even Zstd at similar
>> compression and decompression times. It does this by tailoring to the
>> data type (floats, ints, timestamps, bools).
>> 
>> I'm using it in my own projects, and a few others have adopted it, but it
>> would also be perfect for ORC columns. Assuming a 50-50 split between
>> text-like and numerical data, it could reduce the average ORC file size by
>> over 10% with no extra compute cost. Incorporating it into ORC would be
>> quite powerful since the codec by itself only works on a single flat column
>> of non-nullable numbers.
>> 
>> Would the ORC community be interested in this? How can we make this
>> available to users? I've already built a Spark connector
>> <https://github.com/pancake-db/spark-pancake-connector> for a project using
>> this codec and gotten fast query times.
>> 
>> Thanks,
>> Martin
>> 
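
For readers checking the "over 10%" estimate in Martin's message, here is a rough sketch of the arithmetic. It assumes the 50-50 split refers to compressed bytes under Zstd and that a "30% higher compression ratio" means the numerical half shrinks to 1/1.3 of its Zstd size. The function name and parameters are made up for illustration and are not part of ORC or Quantile Compression.

```python
def blended_size_reduction(numeric_fraction, ratio_gain):
    """Estimate the fractional reduction in total file size.

    numeric_fraction: share of the compressed file taken by numerical
        columns (assumption: measured on the baseline-compressed bytes).
    ratio_gain: relative improvement in compression ratio on those
        columns, e.g. 0.30 for a 30% higher ratio.
    """
    # A ratio that is (1 + g) times higher shrinks the data to 1 / (1 + g)
    # of its previous compressed size.
    numeric_saving = 1.0 - 1.0 / (1.0 + ratio_gain)
    # Only the numerical share of the file benefits; the rest is unchanged.
    return numeric_fraction * numeric_saving


reduction = blended_size_reduction(numeric_fraction=0.5, ratio_gain=0.30)
print(f"{reduction:.1%}")  # prints 11.5%
```

Under these assumptions the numerical half shrinks by about 23%, so the whole file shrinks by about 11.5%, which is consistent with the figure quoted above. A different split or a different baseline codec would change the result.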
