Uh. I didn’t realize that. Give me 6 months and I can provide both the reader and the writer.
Ian

On Friday, May 20, 2022, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> BTW, the license of that Rust ORC reader looks unusual to me.
>
> `ANTI-CAPITALIST SOFTWARE LICENSE (v 1.4)`
>
> I guess we need to skip that repository.
>
> Dongjoon
>
> On Fri, May 20, 2022 at 9:43 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>
>> Thank you for sharing it, Ian.
>>
>> Dongjoon.
>>
>> On Fri, May 20, 2022 at 9:27 AM Ian Joiner <iajoiner...@gmail.com> wrote:
>>
>>> There is already a Rust ORC reader:
>>> https://rustrepo.com/repo/travisbrown-orcrs
>>> We still need a writer, though. If I have 6 months to do so, I can write one.
>>> Then I can also integrate it into Arrow Rust.
>>>
>>> Ian
>>>
>>> On Friday, May 20, 2022, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>
>>>> +1 for Owen's advice.
>>>>
>>>> BTW, a Rust ORC reader & writer sounds like a great idea.
>>>>
>>>> Dongjoon.
>>>>
>>>> On Thu, May 19, 2022 at 10:28 PM Owen O'Malley <owen.omal...@gmail.com> wrote:
>>>>
>>>>> This is interesting, but it sounds like it corresponds more to the RLE
>>>>> encoding that we do rather than the generic compression code.
>>>>>
>>>>> Has anyone done a Java version of the library? It is faster to iterate on
>>>>> this kind of design in Java. On the other hand, I've heard that someone is
>>>>> thinking about doing a Rust ORC reader & writer, but it isn't done yet. 😊
>>>>>
>>>>> .. Owen
>>>>>
>>>>>> On May 19, 2022, at 17:16, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>
>>>>>> Thank you for sharing, Martin.
>>>>>>
>>>>>> For the codec part, you can take advantage of our benchmark suite
>>>>>> (NONE/ZLIB/SNAPPY/ZSTD) to show the benefits in the ORC format.
>>>>>>
>>>>>> https://github.com/apache/orc/blob/main/java/bench/core/src/java/org/apache/orc/bench/core/CompressionKind.java#L34-L37
>>>>>>
>>>>>> For the Spark connector part, I'd recommend sending it to dev@spark too.
>>>>>> You will get attention on both parts (codec and connector).
>>>>>>
>>>>>> I'm looking forward to seeing your benchmark results.
>>>>>>
>>>>>> Dongjoon.
>>>>>>
>>>>>>> On Thu, May 19, 2022 at 4:12 PM Martin Loncaric <m.w.lonca...@gmail.com> wrote:
>>>>>>>
>>>>>>> I've developed a stable codec for numerical columns called Quantile
>>>>>>> Compression <https://github.com/mwlon/quantile-compression>.
>>>>>>> It has about a 30% higher compression ratio than even Zstd for similar
>>>>>>> compression and decompression time. It achieves this by tailoring to the
>>>>>>> data type (floats, ints, timestamps, bools).
>>>>>>>
>>>>>>> I'm using it in my own projects, and a few others have adopted it, but it
>>>>>>> would also be perfect for ORC columns. Assuming a 50-50 split between
>>>>>>> text-like and numerical data, it could reduce the average ORC file size by
>>>>>>> over 10% with no extra compute cost. Incorporating it into ORC would be
>>>>>>> quite powerful, since the codec by itself only works on a single flat
>>>>>>> column of non-nullable numbers.
>>>>>>>
>>>>>>> Would the ORC community be interested in this? How can we make this
>>>>>>> available to users? I've already built a Spark connector
>>>>>>> <https://github.com/pancake-db/spark-pancake-connector> for a project
>>>>>>> using this codec and gotten fast query times.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Martin
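A note on the numbers quoted above: a 30% higher compression ratio means the numerical half of the data would shrink to roughly 1/1.3 ≈ 77% of its Zstd-compressed size (about 23% smaller), so with a 50-50 text/numeric split the overall file would shrink by roughly 11-12%, consistent with the "over 10%" estimate. For a concrete picture of the constraint Martin mentions (one flat column of non-nullable numbers), here is a minimal Rust sketch of round-tripping such a column through the quantile-compression crate; the function names (auto_compress / auto_decompress) and the DEFAULT_COMPRESSION_LEVEL constant are assumptions about the crate's API that may differ by version, and this is not an ORC integration:

    // Sketch only: compressing a single flat, non-nullable numeric column
    // with the quantile-compression crate (q_compress). The helper names
    // below are assumed and may differ between crate versions.
    use q_compress::{auto_compress, auto_decompress, DEFAULT_COMPRESSION_LEVEL};

    fn main() {
        // One flat column of non-nullable i64 values (e.g. microsecond timestamps).
        let column: Vec<i64> = (0..1_000_000i64)
            .map(|i| 1_650_000_000_000_000 + i * 1_000)
            .collect();

        // Compress the whole column into a byte buffer.
        let bytes: Vec<u8> = auto_compress(&column, DEFAULT_COMPRESSION_LEVEL);
        println!("compressed {} values into {} bytes", column.len(), bytes.len());

        // Round-trip: decompression must recover the exact original values.
        let recovered: Vec<i64> = auto_decompress(&bytes).expect("decompression failed");
        assert_eq!(recovered, column);
    }

An ORC integration would additionally have to handle nulls and nested types before handing a flat numeric buffer to the codec, which is why the thread frames this as an ORC-side effort rather than a change to the codec itself.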