RE: Re: New Codec for ORC

Martin Loncaric Fri, 20 May 2022 09:46:53 -0700

Yes, I agree this would be better as an RLE encoding alternative (though it
isn't really RLE). Is there any ORC flag to choose between RLE
implementations, or is there only one available per specification? Is there
a benchmark suite for these encodings as well?


I'm working on a simple Java library right now (with a subset of
functionality via JNI) and should have it out in a couple of days. You can
try it out then.

On 2022/05/20 15:48:42 Dongjoon Hyun wrote:
> +1 for Owen's advice.
>
> BTW, Rust ORC reader & writer sounds like a great idea.
>
> Dongjoon.
>
>
> On Thu, May 19, 2022 at 10:28 PM Owen O'Malley <ow...@gmail.com>
> wrote:
>
> > This is interesting, but it sounds like it corresponds more to the rle
> > encoding that we do rather than the generic compression code.
> >
> > Has anyone done a Java version of the library? It is faster to iterate on
> > this kind of design in Java. On the other hand, I’ve heard that someone is
> > thinking about doing a Rust ORC reader & writer, but it isn’t done yet. 😊
> >
> > .. Owen
> >
> > > On May 19, 2022, at 17:16, Dongjoon Hyun <do...@gmail.com>
> > wrote:
> > >
> > > Thank you for sharing, Martin.
> > >
> > > For codec one, you can take advantage of our benchmark suite
> > > (NONE/ZLIB/SNAPPY/ZSTD) to show the benefits in ORC format.
> > >
> > >
> > > https://github.com/apache/orc/blob/main/java/bench/core/src/java/org/apache/orc/bench/core/CompressionKind.java#L34-L37
> > >
> > > For the Spark connector one, I'd recommend sending it to dev@spark
> > > too. You will get attention in both parts (codec and connector).
> > >
> > > Then, I'm looking forward to seeing your benchmark result.
> > >
> > > Dongjoon.
> > >
> > >
> > >> On Thu, May 19, 2022 at 4:12 PM Martin Loncaric <m.w.lonca...@gmail.com>
> > >> wrote:
> > >>
> > >> I've developed a stable codec for numerical columns called Quantile
> > >> Compression <https://github.com/mwlon/quantile-compression>.
> > >> It has about 30% higher compression ratio than even Zstd for similar
> > >> compression and decompression time. It achieves this by tailoring to the
> > >> data type (floats, ints, timestamps, bools).
> > >>
> > >> I'm using it in my own projects, and a few others have adopted it, but
> > >> it would also be perfect for ORC columns. Assuming a 50-50 split between
> > >> text-like and numerical data, it could reduce the average ORC file size
> > >> by over 10% with no extra compute cost. Incorporating it into ORC would
> > >> be quite powerful since the codec by itself only works on a single flat
> > >> column of non-nullable numbers.
> > >>
> > >> Would the ORC community be interested in this? How can we make this
> > >> available to users? I've already built a Spark connector
> > >> <https://github.com/pancake-db/spark-pancake-connector> for a project
> > >> using this codec and gotten fast query times.
> > >>
> > >> Thanks,
> > >> Martin
> > >>
> >
>
