Uh. I didn’t realize that. Give me 6 months and I can provide both the reader and the writer.
Ian

On Friday, May 20, 2022, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> BTW, the license of that Rust ORC reader looks unusual to me.
>
> `ANTI-CAPITALIST SOFTWARE LICENSE (v 1.4)`
>
> I guess we need to skip that repository.
>
> Dongjoon
>
> On Fri, May 20, 2022 at 9:43 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>
>> Thank you for sharing it, Ian.
>>
>> Dongjoon.
>>
>> On Fri, May 20, 2022 at 9:27 AM Ian Joiner <iajoiner...@gmail.com> wrote:
>>
>>> There is already a Rust ORC reader:
>>> https://rustrepo.com/repo/travisbrown-orcrs
>>> We still need a writer, though. If I have 6 months to do so, I can write one.
>>> Then I can also integrate it into Arrow Rust.
>>>
>>> Ian
>>>
>>> On Friday, May 20, 2022, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>
>>>> +1 for Owen's advice.
>>>>
>>>> BTW, a Rust ORC reader & writer sounds like a great idea.
>>>>
>>>> Dongjoon.
>>>>
>>>> On Thu, May 19, 2022 at 10:28 PM Owen O'Malley <owen.omal...@gmail.com> wrote:
>>>>
>>>>> This is interesting, but it sounds like it corresponds more to the RLE
>>>>> encoding that we do rather than the generic compression code.
>>>>>
>>>>> Has anyone done a Java version of the library? It is faster to iterate on
>>>>> this kind of design in Java. On the other hand, I've heard that someone is
>>>>> thinking about doing a Rust ORC reader & writer, but it isn't done yet. 😊
>>>>>
>>>>> .. Owen
>>>>>
>>>>>> On May 19, 2022, at 17:16, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>
>>>>>> Thank you for sharing, Martin.
>>>>>>
>>>>>> For the codec part, you can take advantage of our benchmark suite
>>>>>> (NONE/ZLIB/SNAPPY/ZSTD) to show the benefits in the ORC format.
>>>>>>
>>>>>> https://github.com/apache/orc/blob/main/java/bench/core/src/java/org/apache/orc/bench/core/CompressionKind.java#L34-L37
>>>>>>
>>>>>> For the Spark connector part, I'd recommend sending it to dev@spark too.
>>>>>> You will get attention on both parts (codec and connector).
>>>>>>
>>>>>> I'm looking forward to seeing your benchmark results.
>>>>>>
>>>>>> Dongjoon.
>>>>>>
>>>>>>> On Thu, May 19, 2022 at 4:12 PM Martin Loncaric <m.w.lonca...@gmail.com> wrote:
>>>>>>>
>>>>>>> I've developed a stable codec for numerical columns called Quantile
>>>>>>> Compression <https://github.com/mwlon/quantile-compression>.
>>>>>>> It has about a 30% higher compression ratio than even Zstd for similar
>>>>>>> compression and decompression time. It achieves this by tailoring to the
>>>>>>> data type (floats, ints, timestamps, bools).
>>>>>>>
>>>>>>> I'm using it in my own projects, and a few others have adopted it, but it
>>>>>>> would also be perfect for ORC columns. Assuming a 50-50 split between
>>>>>>> text-like and numerical data, it could reduce the average ORC file size by
>>>>>>> over 10% with no extra compute cost. Incorporating it into ORC would be
>>>>>>> quite powerful, since the codec by itself only works on a single flat
>>>>>>> column of non-nullable numbers.
>>>>>>>
>>>>>>> Would the ORC community be interested in this? How can we make this
>>>>>>> available to users? I've already built a Spark connector
>>>>>>> <https://github.com/pancake-db/spark-pancake-connector> for a project
>>>>>>> using this codec and gotten fast query times.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Martin
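A note on the numbers quoted above: a 30% higher compression ratio means the numerical half of the data would shrink to roughly 1/1.3 ≈ 77% of its Zstd-compressed size (about 23% smaller), so with a 50-50 text/numeric split the overall file would shrink by roughly 11-12%, consistent with the "over 10%" estimate. For a concrete picture of the constraint Martin mentions (one flat column of non-nullable numbers), here is a minimal Rust sketch of round-tripping such a column through the quantile-compression crate; the function names (auto_compress / auto_decompress) and the DEFAULT_COMPRESSION_LEVEL constant are assumptions about the crate's API that may differ by version, and this is not an ORC integration:

    // Sketch only: compressing a single flat, non-nullable numeric column
    // with the quantile-compression crate (q_compress). The helper names
    // below are assumed and may differ between crate versions.
    use q_compress::{auto_compress, auto_decompress, DEFAULT_COMPRESSION_LEVEL};

    fn main() {
        // One flat column of non-nullable i64 values (e.g. microsecond timestamps).
        let column: Vec<i64> = (0..1_000_000i64)
            .map(|i| 1_650_000_000_000_000 + i * 1_000)
            .collect();

        // Compress the whole column into a byte buffer.
        let bytes: Vec<u8> = auto_compress(&column, DEFAULT_COMPRESSION_LEVEL);
        println!("compressed {} values into {} bytes", column.len(), bytes.len());

        // Round-trip: decompression must recover the exact original values.
        let recovered: Vec<i64> = auto_decompress(&bytes).expect("decompression failed");
        assert_eq!(recovered, column);
    }

An ORC integration would additionally have to handle nulls and nested types before handing a flat numeric buffer to the codec, which is why the thread frames this as an ORC-side effort rather than a change to the codec itself.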