There is already a Rust ORC reader:
https://rustrepo.com/repo/travisbrown-orcrs
We still need a writer, though. If I have six months, I can write one.
Then I can also integrate it into Arrow Rust.
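
To make the Arrow integration concrete, the writer I have in mind would
consume Arrow RecordBatches and own the stripe encoding. Here is a rough
sketch with placeholder names only; none of this is an existing arrow-rs or
ORC API:

    // Hypothetical interface; every name below is a placeholder, sketched
    // against arrow-rs's RecordBatch type.
    use arrow::error::ArrowError;
    use arrow::record_batch::RecordBatch;

    /// An ORC writer that consumes Arrow record batches.
    pub trait OrcBatchWriter {
        /// Buffer one batch; the writer decides when to flush a stripe.
        fn write_batch(&mut self, batch: &RecordBatch) -> Result<(), ArrowError>;
        /// Flush buffered rows and emit the ORC footer and postscript.
        fn close(self) -> Result<(), ArrowError>;
    }

    /// Stub that only counts rows; the stripe encoding (RLE, dictionaries,
    /// generic compression) is the part that still needs to be written.
    pub struct CountingWriter {
        rows: usize,
    }

    impl OrcBatchWriter for CountingWriter {
        fn write_batch(&mut self, batch: &RecordBatch) -> Result<(), ArrowError> {
            self.rows += batch.num_rows();
            Ok(())
        }

        fn close(self) -> Result<(), ArrowError> {
            println!("would finalize a file of {} rows here", self.rows);
            Ok(())
        }
    }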

Ian

On Friday, May 20, 2022, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> +1 for Owen's advice.
>
> BTW, a Rust ORC reader & writer sounds like a great idea.
>
> Dongjoon.
>
>
> On Thu, May 19, 2022 at 10:28 PM Owen O'Malley <owen.omal...@gmail.com>
> wrote:
>
> > This is interesting, but it sounds like it corresponds more to the RLE
> > encoding that we do than to the generic compression code.
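> >
> > For example, the distinction is roughly this (a toy sketch, not ORC's
> > actual RLE implementation):
> >
> >     // Toy run-length encoder over a typed integer column. ORC's RLE works
> >     // at this value level; the generic codec layer (zlib/snappy/zstd) only
> >     // ever sees an opaque byte stream afterwards.
> >     fn rle_encode(values: &[i64]) -> Vec<(i64, u32)> {
> >         let mut runs: Vec<(i64, u32)> = Vec::new();
> >         for &v in values {
> >             if let Some(last) = runs.last_mut() {
> >                 if last.0 == v {
> >                     last.1 += 1; // extend the current run
> >                     continue;
> >                 }
> >             }
> >             runs.push((v, 1)); // start a new run
> >         }
> >         runs
> >     }
> >
> >     fn main() {
> >         let column = [7_i64, 7, 7, 7, 42, 42, 0, 0, 0];
> >         // Value-level encoding: [(7, 4), (42, 2), (0, 3)]
> >         println!("{:?}", rle_encode(&column));
> >         // A generic codec would instead compress the raw bytes with no
> >         // knowledge that they are 64-bit integers.
> >     }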
> >
> > Has anyone done a Java version of the library? It is faster to iterate on
> > this kind of design in Java. On the other hand, I’ve heard that someone is
> > thinking about doing a Rust ORC reader & writer, but it isn’t done yet. 😊
> >
> > .. Owen
> >
> > > On May 19, 2022, at 17:16, Dongjoon Hyun <dongjoon.h...@gmail.com>
> > > wrote:
> > >
> > > Thank you for sharing, Martin.
> > >
> > > For the codec one, you can take advantage of our benchmark suite
> > > (NONE/ZLIB/SNAPPY/ZSTD) to show the benefits in ORC format.
> > >
> > >
> > > https://github.com/apache/orc/blob/main/java/bench/core/src/java/org/apache/orc/bench/core/CompressionKind.java#L34-L37
> > >
> > > For the Spark connector one, I'd recommend sending it to dev@spark too.
> > > You will get attention from both sides (codec and connector).
> > >
> > > Then, I'm looking forward to seeing your benchmark result.
> > >
> > > Dongjoon.
> > >
> > >
> > >> On Thu, May 19, 2022 at 4:12 PM Martin Loncaric <m.w.lonca...@gmail.com>
> > >> wrote:
> > >>
> > >> I've developed a stable codec for numerical columns called Quantile
> > >> Compression <https://github.com/mwlon/quantile-compression>.
> > >> It has about a 30% higher compression ratio than even Zstd for similar
> > >> compression and decompression time. It achieves this by tailoring to the
> > >> data type (floats, ints, timestamps, bools).
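> > >>
> > >> Typical usage is just a couple of calls on a flat slice of numbers. A
> > >> simplified sketch (the exact function names may differ between releases;
> > >> see the repo docs):
> > >>
> > >>     // Simplified q_compress sketch; check the crate docs for the exact,
> > >>     // current API.
> > >>     use q_compress::{auto_compress, auto_decompress, DEFAULT_COMPRESSION_LEVEL};
> > >>
> > >>     fn main() {
> > >>         // The codec sees exactly one flat, non-nullable numerical column.
> > >>         let timestamps: Vec<i64> =
> > >>             (0..1_000_000).map(|i| 1_650_000_000 + i * 3).collect();
> > >>
> > >>         let compressed: Vec<u8> =
> > >>             auto_compress(&timestamps, DEFAULT_COMPRESSION_LEVEL);
> > >>         let recovered: Vec<i64> =
> > >>             auto_decompress(&compressed).expect("invalid bytes");
> > >>         assert_eq!(recovered, timestamps);
> > >>     }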
> > >>
> > >> I'm using it in my own projects, and a few others have adopted it, but it
> > >> would also be perfect for ORC columns. Assuming a 50-50 split between
> > >> text-like and numerical data, it could reduce the average ORC file size by
> > >> over 10% with no extra compute cost. Incorporating it into ORC would be
> > >> quite powerful since the codec by itself only works on a single flat column
> > >> of non-nullable numbers.
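> > >>
> > >> (Rough math behind the 10%: a 30% better ratio shrinks the numerical half
> > >> to about 1/1.3 ≈ 77% of its Zstd size, a ~23% saving on that half, or
> > >> roughly 11-12% of the whole file.)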
> > >>
> > >> Would the ORC community be interested in this? How can we make this
> > >> available to users? I've already built a Spark connector
> > >> <https://github.com/pancake-db/spark-pancake-connector> for a project using
> > >> this codec and gotten fast query times.
> > >>
> > >> Thanks,
> > >> Martin
> > >>
> >
>
