Re: New Codec for ORC

Dongjoon Hyun Fri, 20 May 2022 09:49:40 -0700

BTW, the license of that Rust ORC reader looks unusual to me.

`ANTI-CAPITALIST SOFTWARE LICENSE (v 1.4)`


I guess we need to skip that repository.

Dongjoon


On Fri, May 20, 2022 at 9:43 AM Dongjoon Hyun <[email protected]>
wrote:

> Thank you for sharing it, Ian.
>
> Dongjoon.
>
>
> On Fri, May 20, 2022 at 9:27 AM Ian Joiner <[email protected]> wrote:
>
>> There is already a Rust ORC reader:
>> https://rustrepo.com/repo/travisbrown-orcrs
>> We still need a writer, though. Given six months, I could write one.
>> Then I can also integrate it into Arrow Rust.
>>
>> Ian
>>
>> On Friday, May 20, 2022, Dongjoon Hyun <[email protected]> wrote:
>>
>> > +1 for Owen's advice.
>> >
>> > BTW, Rust ORC reader & writer sounds like a great idea.
>> >
>> > Dongjoon.
>> >
>> >
>> > On Thu, May 19, 2022 at 10:28 PM Owen O'Malley <[email protected]>
>> > wrote:
>> >
>> > > This is interesting, but it sounds like it corresponds more to the RLE
>> > > encoding that we do than to the generic compression code.
>> > >
>> > > Has anyone done a Java version of the library? It is faster to iterate
>> > > on this kind of design in Java. On the other hand, I’ve heard that
>> > > someone is thinking about doing a Rust ORC reader & writer, but it
>> > > isn’t done yet. 😊
>> > >
>> > > .. Owen
>> > >
>> > > > On May 19, 2022, at 17:16, Dongjoon Hyun <[email protected]>
>> > > wrote:
>> > > >
>> > > > Thank you for sharing, Martin.
>> > > >
>> > > > For the codec, you can take advantage of our benchmark suite
>> > > > (NONE/ZLIB/SNAPPY/ZSTD) to show the benefits in ORC format.
>> > > >
>> > > >
>> > > > https://github.com/apache/orc/blob/main/java/bench/core/src/java/org/apache/orc/bench/core/CompressionKind.java#L34-L37
>> > > >
>> > > > For the Spark connector, I'd recommend also sending this to dev@spark.
>> > > > You will get attention on both parts (codec and connector).
>> > > >
>> > > > I'm looking forward to seeing your benchmark results.
>> > > >
>> > > > Dongjoon.
>> > > >
>> > > >
>> > > >> On Thu, May 19, 2022 at 4:12 PM Martin Loncaric <[email protected]>
>> > > >> wrote:
>> > > >>
>> > > >> I've developed a stable codec for numerical columns called Quantile
>> > > >> Compression <https://github.com/mwlon/quantile-compression>.
>> > > >> It achieves about a 30% higher compression ratio than even Zstd at
>> > > >> similar compression and decompression times. It does this by
>> > > >> tailoring to the data type (floats, ints, timestamps, bools).
>> > > >>
>> > > >> I'm using it in my own projects, and a few others have adopted it, but
>> > > >> it would also be perfect for ORC columns. Assuming a 50-50 split
>> > > >> between text-like and numerical data, it could reduce the average ORC
>> > > >> file size by over 10% with no extra compute cost. Incorporating it
>> > > >> into ORC would be quite powerful since the codec by itself only works
>> > > >> on a single flat column of non-nullable numbers.
>> > > >>
>> > > >> Would the ORC community be interested in this? How can we make this
>> > > >> available to users? I've already built a Spark connector
>> > > >> <https://github.com/pancake-db/spark-pancake-connector> for a project
>> > > >> using this codec and gotten fast query times.
>> > > >>
>> > > >> Thanks,
>> > > >> Martin
>> > > >>
>> > >
>> >
>>
>
