Hi, Ian.

That was not a request to, or criticism of, that repo (or you) at all. There
is no time limit on anyone's contributions.

We are on the Apache ORC dev mailing list, which discusses Apache ORC-related
matters, especially those on

- https://orc.apache.org (The Apache ORC website)
- https://github.com/apache/orc (Commits / GitHub Issues / PRs)
- https://issues.apache.org (Apache JIRA Issues)
- Some other ASF resources (other ASF projects' repos and mailing lists)

In general, we are supposed to focus on what has already been contributed
through ASF channels.
That is a little different from users, who have broader options in choosing
what they use.

Best,
Dongjoon.


On Fri, May 20, 2022 at 9:54 AM Ian Joiner <iajoiner...@gmail.com> wrote:

> Uh. I didn’t realize that. Give me 6 months and I can provide both the
> reader and the writer.
>
> Ian
>
> On Friday, May 20, 2022, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>
> > BTW, the license of that Rust ORC reader looks unusual to me.
> >
> > `ANTI-CAPITALIST SOFTWARE LICENSE (v 1.4)`
> >
> > I guess we need to skip that repository.
> >
> > Dongjoon
> >
> >
> > On Fri, May 20, 2022 at 9:43 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
> > wrote:
> >
> > > Thank you for sharing it, Ian.
> > >
> > > Dongjoon.
> > >
> > >
> > > On Fri, May 20, 2022 at 9:27 AM Ian Joiner <iajoiner...@gmail.com>
> > wrote:
> > >
> > >> There is already a Rust ORC reader:
> > >> https://rustrepo.com/repo/travisbrown-orcrs
> > >> We still need a writer though. If I have 6 months to do so I can write
> > >> one.
> > >> Then I can also integrate it into Arrow Rust.
> > >>
> > >> Ian
> > >>
> > >> On Friday, May 20, 2022, Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
> > >>
> > >> > +1 for Owen's advice.
> > >> >
> > >> > BTW, Rust ORC reader & writer sounds like a great idea.
> > >> >
> > >> > Dongjoon.
> > >> >
> > >> >
> > >> > On Thu, May 19, 2022 at 10:28 PM Owen O'Malley <
> > owen.omal...@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > This is interesting, but it sounds like it corresponds more to the
> > rle
> > >> > > encoding that we do rather than the generic compression code.
> > >> > >
> > >> > > Has anyone done a Java version of the library? It is faster to
> > >> iterate on
> > >> > > this kind of design in Java. On the other hand, I’ve heard that
> > >> someone
> > >> > is
> > >> > > thinking about doing a Rust ORC reader & writer, but it isn’t done
> > >> yet.
> > >> > 😊
> > >> > >
> > >> > > .. Owen
> > >> > >
> > >> > > > On May 19, 2022, at 17:16, Dongjoon Hyun <
> dongjoon.h...@gmail.com
> > >
> > >> > > wrote:
> > >> > > >
> > >> > > > Thank you for sharing, Martin.
> > >> > > >
> > >> > > > For codec one, you can take advantage of our benchmark suite
> > >> > > > (NONE/ZLIB/SNAPPY/ZSTD) to show the benefits in ORC format.
> > >> > > >
> > >> > > >
> > >> > > > https://github.com/apache/orc/blob/main/java/bench/core/src/java/org/apache/orc/bench/core/CompressionKind.java#L34-L37
> > >> > > >
> > >> > > > For the Spark connector one, I'd like to recommend you to send
> > >> > dev@spark
> > >> > > > too. You will get attention in both parts (codec and connector).
> > >> > > >
> > >> > > > Then, I'm looking forward to seeing your benchmark result.
> > >> > > >
> > >> > > > Dongjoon.
> > >> > > >
> > >> > > >
> > >> > > >> On Thu, May 19, 2022 at 4:12 PM Martin Loncaric <
> > >> > m.w.lonca...@gmail.com
> > >> > > >
> > >> > > >> wrote:
> > >> > > >>
> > >> > > >> I've developed a stable codec for numerical columns called
> > Quantile
> > >> > > >> Compression <https://github.com/mwlon/quantile-compression>.
> > >> > > >> It has about 30% higher compression ratio than even Zstd for
> > >> similar
> > >> > > >> compression and decompression time. It achieves this by
> tailoring
> > >> to
> > >> > the
> > >> > > >> data type (floats, ints, timestamps, bools).
> > >> > > >>
> > >> > > >> I'm using it in my own projects, and a few others have adopted
> > it,
> > >> but
> > >> > > it
> > >> > > >> would also be perfect for ORC columns. Assuming a 50-50 split
> > >> between
> > >> > > >> text-like and numerical data, it could reduce the average ORC
> > file
> > >> > size
> > >> > > by
> > >> > > >> over 10% with no extra compute cost. Incorporating it into ORC
> > >> would
> > >> > be
> > >> > > >> quite powerful since the codec by itself only works on a single
> > >> flat
> > >> > > column
> > >> > > >> of non-nullable numbers.
> > >> > > >>
> > >> > > >> Would the ORC community be interested in this? How can we make
> > this
> > >> > > >> available to users? I've already built a Spark connector
> > >> > > >> <https://github.com/pancake-db/spark-pancake-connector> for a
> > >> project
> > >> > > >> using
> > >> > > >> this codec and gotten fast query times.
> > >> > > >>
> > >> > > >> Thanks,
> > >> > > >> Martin
> > >> > > >>
> > >> > >
> > >> >
> > >>
> > >
> >
>
