Thank you for sharing the progress, Ian.

Dongjoon

On Tue, Jul 12, 2022 at 3:09 PM Ian Joiner <iajoiner...@gmail.com> wrote:

> Hi Dongjoon,
>
> Thanks a lot for reaching out to me! I will let you know when I have any
> questions. Currently the code base is hosted here:
> https://github.com/iajoiner/orc-rs. Once it is done I will talk to you guys
> about it. :)
>
> Thanks,
> Ian
>
> On Thu, Jul 7, 2022 at 2:53 PM Dongjoon Hyun <dongj...@apache.org> wrote:
>
> > Hi, Ian.
> >
> > Is there anything the Apache ORC community can do to help you? :)
> >
> > Dongjoon.
> >
> > On 2022/05/20 17:35:46 Ian Joiner wrote:
> > > Hi Dongjoon,
> > >
> > > > Haha I understand. As the guy who wrote the ORC write adapter in Arrow
> > > > and who wants to understand both Rust and the internals of big data
> > > > formats better, I’d love to help out.
> > >
> > > I will file a self-assigned issue then.
> > >
> > > Ian
> > >
> > > > On Friday, May 20, 2022, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
> > >
> > > > Hi, Ian.
> > > >
> > > > > It was not a request of, or blame directed at, that repo (or you) at
> > > > > all. There is no time limit for someone's contributions.
> > > > >
> > > > > We are on the Apache ORC dev mailing list, which discusses Apache ORC
> > > > > related matters, especially on
> > > >
> > > > - https://orc.apache.org (The Apache ORC website)
> > > > - https://github.com/apache/orc (Commits / GitHub Issues / PRs)
> > > > - https://issues.apache.org (Apache JIRA Issues)
> > > > > - Some other ASF resources (other ASF project repos and mailing lists)
> > > >
> > > > > In general, we are supposed to focus on what has already been
> > > > > contributed to the ASF channels.
> > > > > That is a little different from users, who have broader options in
> > > > > choosing what they use.
> > > >
> > > > Best,
> > > > Dongjoon.
> > > >
> > > >
> > > > > On Fri, May 20, 2022 at 9:54 AM Ian Joiner <iajoiner...@gmail.com> wrote:
> > > >
> > > > > > Uh. I didn’t realize that. Give me 6 months and I can provide both
> > > > > > the reader and the writer.
> > > > >
> > > > > Ian
> > > > >
> > > > > > On Friday, May 20, 2022, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
> > > > >
> > > > > > BTW, the license of that Rust ORC reader looks unusual to me.
> > > > > >
> > > > > > `ANTI-CAPITALIST SOFTWARE LICENSE (v 1.4)`
> > > > > >
> > > > > > I guess we need to skip that repository.
> > > > > >
> > > > > > Dongjoon
> > > > > >
> > > > > >
> > > > > > > On Fri, May 20, 2022 at 9:43 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
> > > > > >
> > > > > > > Thank you for sharing it, Ian.
> > > > > > >
> > > > > > > Dongjoon.
> > > > > > >
> > > > > > >
> > > > > > > > On Fri, May 20, 2022 at 9:27 AM Ian Joiner <iajoiner...@gmail.com> wrote:
> > > > > > >
> > > > > > >> There is already a Rust ORC reader:
> > > > > > >> https://rustrepo.com/repo/travisbrown-orcrs
> > > > > > > >> We still need a writer though. If I have 6 months to do so I can
> > > > > > > >> write one.
> > > > > > > >> Then I can also integrate it into Arrow Rust.
> > > > > > >>
> > > > > > >> Ian
> > > > > > >>
> > > > > > > >> On Friday, May 20, 2022, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
> > > > > > >>
> > > > > > >> > +1 for Owen's advice.
> > > > > > >> >
> > > > > > >> > BTW, Rust ORC reader & writer sounds like a great idea.
> > > > > > >> >
> > > > > > >> > Dongjoon.
> > > > > > >> >
> > > > > > >> >
> > > > > > > >> > On Thu, May 19, 2022 at 10:28 PM Owen O'Malley <owen.omal...@gmail.com> wrote:
> > > > > > >> >
> > > > > > > >> > > This is interesting, but it sounds like it corresponds more to
> > > > > > > >> > > the RLE encoding that we do rather than the generic compression
> > > > > > > >> > > code.
> > > > > > > >> > >
> > > > > > > >> > > Has anyone done a Java version of the library? It is faster to
> > > > > > > >> > > iterate on this kind of design in Java. On the other hand, I’ve
> > > > > > > >> > > heard that someone is thinking about doing a Rust ORC reader &
> > > > > > > >> > > writer, but it isn’t done yet. 😊
> > > > > > >> > >
> > > > > > >> > > .. Owen
> > > > > > >> > >
> > > > > > > >> > > > On May 19, 2022, at 17:16, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
> > > > > > >> > > >
> > > > > > >> > > > Thank you for sharing, Martin.
> > > > > > >> > > >
> > > > > > > >> > > > For the codec, you can take advantage of our benchmark suite
> > > > > > > >> > > > (NONE/ZLIB/SNAPPY/ZSTD) to show the benefits in the ORC format.
> > > > > > >> > > >
> > > > > > >> > > >
> > > > > > >> > >
> > https://github.com/apache/orc/blob/main/java/bench/core/src/
> > > > > > >> > java/org/apache/orc/bench/core/CompressionKind.java#L34-L37
> > > > > > >> > > >
> > > > > > > >> > > > For the Spark connector, I'd recommend writing to dev@spark
> > > > > > > >> > > > too. You will get attention on both parts (codec and connector).
> > > > > > >> > > >
> > > > > > > >> > > > Then, I'm looking forward to seeing your benchmark results.
> > > > > > >> > > >
> > > > > > >> > > > Dongjoon.
> > > > > > >> > > >
> > > > > > >> > > >
> > > > > > > >> > > >> On Thu, May 19, 2022 at 4:12 PM Martin Loncaric <m.w.lonca...@gmail.com> wrote:
> > > > > > >> > > >>
> > > > > > > >> > > >> I've developed a stable codec for numerical columns called
> > > > > > > >> > > >> Quantile Compression <https://github.com/mwlon/quantile-compression>.
> > > > > > > >> > > >> It has about a 30% higher compression ratio than even Zstd for
> > > > > > > >> > > >> similar compression and decompression time. It achieves this by
> > > > > > > >> > > >> tailoring to the data type (floats, ints, timestamps, bools).
> > > > > > >> > > >>
> > > > > > > >> > > >> I'm using it in my own projects, and a few others have adopted
> > > > > > > >> > > >> it, but it would also be perfect for ORC columns. Assuming a
> > > > > > > >> > > >> 50-50 split between text-like and numerical data, it could
> > > > > > > >> > > >> reduce the average ORC file size by over 10% with no extra
> > > > > > > >> > > >> compute cost. Incorporating it into ORC would be quite powerful
> > > > > > > >> > > >> since the codec by itself only works on a single flat column of
> > > > > > > >> > > >> non-nullable numbers.
> > > > > > >> > > >>
> > > > > > > >> > > >> Would the ORC community be interested in this? How can we make
> > > > > > > >> > > >> this available to users? I've already built a Spark connector
> > > > > > > >> > > >> <https://github.com/pancake-db/spark-pancake-connector> for a
> > > > > > > >> > > >> project using this codec and gotten fast query times.
> > > > > > >> > > >>
> > > > > > >> > > >> Thanks,
> > > > > > >> > > >> Martin
> > > > > > >> > > >>
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
