Re: New Codec for ORC

Dongjoon Hyun Fri, 20 May 2022 10:16:28 -0700

Hi, Ian.

That was not a request to, or criticism of, that repo (or you) at all. There
is no time limit on anyone's contributions.


We are on the Apache ORC dev mailing list, which discusses Apache ORC-related
matters, especially those on

- https://orc.apache.org (The Apache ORC website)
- https://github.com/apache/orc (Commits / GitHub Issues / PRs)
- https://issues.apache.org (Apache JIRA Issues)
- Some other ASF resources (other ASF projects' repos and mailing lists)

In general, we are supposed to focus on what has already been contributed
through ASF channels.
That is a little different from users, who have broader options in choosing
what they use.

Best,
Dongjoon.


On Fri, May 20, 2022 at 9:54 AM Ian Joiner <[email protected]> wrote:

> Uh. I didn’t realize that. Give me 6 months and I can provide both the
> reader and the writer.
>
> Ian
>
> On Friday, May 20, 2022, Dongjoon Hyun <[email protected]> wrote:
>
> > BTW, the license of that Rust ORC reader looks unusual to me.
> >
> > `ANTI-CAPITALIST SOFTWARE LICENSE (v 1.4)`
> >
> > I guess we need to skip that repository.
> >
> > Dongjoon
> >
> >
> > On Fri, May 20, 2022 at 9:43 AM Dongjoon Hyun <[email protected]>
> > wrote:
> >
> > > Thank you for sharing it, Ian.
> > >
> > > Dongjoon.
> > >
> > >
> > > On Fri, May 20, 2022 at 9:27 AM Ian Joiner <[email protected]> wrote:
> > >
> > >> There is already a Rust ORC reader:
> > >> https://rustrepo.com/repo/travisbrown-orcrs
> > >> We still need a writer though. If I have 6 months to do so I can write
> > >> one.
> > >> Then I can also integrate it into Arrow Rust.
> > >>
> > >> Ian
> > >>
> > >> On Friday, May 20, 2022, Dongjoon Hyun <[email protected]> wrote:
> > >>
> > >> > +1 for Owen's advice.
> > >> >
> > >> > BTW, Rust ORC reader & writer sounds like a great idea.
> > >> >
> > >> > Dongjoon.
> > >> >
> > >> >
> > >> > On Thu, May 19, 2022 at 10:28 PM Owen O'Malley <[email protected]>
> > >> > wrote:
> > >> >
> > >> > > This is interesting, but it sounds like it corresponds more to the rle
> > >> > > encoding that we do rather than the generic compression code.
> > >> > >
> > >> > > Has anyone done a Java version of the library? It is faster to iterate on
> > >> > > this kind of design in Java. On the other hand, I’ve heard that someone is
> > >> > > thinking about doing a Rust ORC reader & writer, but it isn’t done yet. 😊
> > >> > >
> > >> > > .. Owen
> > >> > >
> > >> > > > On May 19, 2022, at 17:16, Dongjoon Hyun <[email protected]>
> > >> > > > wrote:
> > >> > > >
> > >> > > > Thank you for sharing, Martin.
> > >> > > >
> > >> > > > For codec one, you can take advantage of our benchmark suite
> > >> > > > (NONE/ZLIB/SNAPPY/ZSTD) to show the benefits in ORC format.
> > >> > > >
> > >> > > >
> > >> > > > https://github.com/apache/orc/blob/main/java/bench/core/src/java/org/apache/orc/bench/core/CompressionKind.java#L34-L37
> > >> > > >
> > >> > > > For the Spark connector one, I'd like to recommend you to send dev@spark
> > >> > > > too. You will get attention in both parts (codec and connector).
> > >> > > >
> > >> > > > Then, I'm looking forward to seeing your benchmark result.
> > >> > > >
> > >> > > > Dongjoon.
> > >> > > >
> > >> > > >
> > >> > > >> On Thu, May 19, 2022 at 4:12 PM Martin Loncaric <[email protected]>
> > >> > > >> wrote:
> > >> > > >>
> > >> > > >> I've developed a stable codec for numerical columns called Quantile
> > >> > > >> Compression <https://github.com/mwlon/quantile-compression>.
> > >> > > >> It has about 30% higher compression ratio than even Zstd for similar
> > >> > > >> compression and decompression time. It achieves this by tailoring to the
> > >> > > >> data type (floats, ints, timestamps, bools).
> > >> > > >>
> > >> > > >> I'm using it in my own projects, and a few others have adopted it, but it
> > >> > > >> would also be perfect for ORC columns. Assuming a 50-50 split between
> > >> > > >> text-like and numerical data, it could reduce the average ORC file size by
> > >> > > >> over 10% with no extra compute cost. Incorporating it into ORC would be
> > >> > > >> quite powerful since the codec by itself only works on a single flat column
> > >> > > >> of non-nullable numbers.
> > >> > > >>
> > >> > > >> Would the ORC community be interested in this? How can we make this
> > >> > > >> available to users? I've already built a Spark connector
> > >> > > >> <https://github.com/pancake-db/spark-pancake-connector> for a project using
> > >> > > >> this codec and gotten fast query times.
> > >> > > >>
> > >> > > >> Thanks,
> > >> > > >> Martin
> > >> > > >>
> > >> > >
> > >> >
> > >>
> > >
> >
>
