I'd lean toward using apache-orc, but as Dongjoon said it isn't worth doing anything public until the code is committed and released. (If someone outside the project preemptively claimed apache-orc, there must be a process for getting control of trademarked names.)
.. Owen

On Mon, May 30, 2022 at 11:09 PM Ian Joiner <iajoiner...@gmail.com> wrote:
> Hi,
>
> In preliminary exploration for adding Rust to the project I found that the
> crate names orc, orcrs and orc-rs have all been used. I wonder whether we
> should name our new Rust crate orc_rs or apache-orc, both of which are
> still unused. The former basically follows Avro convention while the latter
> makes it clear that we are Apache ORC as opposed to some other project with
> the same abbreviation or unofficial implementation of ORC in Rust (such as
> orcrs). If any of you can come up with names better than these two please
> let me know.
>
> Thanks,
> Ian
>
> On Friday, May 20, 2022, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>
> > Hi, Ian.
> >
> > It was not a request or blame to that repo (and you) at all. There is no
> > time limit for someone's contributions.
> >
> > We are in Apache ORC dev mailing address which discusses Apache ORC
> > related stuff, especially, on
> >
> > - https://orc.apache.org (The Apache ORC website)
> > - https://github.com/apache/orc (Commits / GitHub Issues / PRs)
> > - https://issues.apache.org (Apache JIRA Issues)
> > - Some other ASF resources (other ASF projects repo and mailing lists)
> >
> > In general, we are supposed to focus on some stuff contributed to the ASF
> > channel already.
> > It's a little different from users who have more broader options to
> > choose what they use.
> >
> > Best,
> > Dongjoon.
> >
> >
> > On Fri, May 20, 2022 at 9:54 AM Ian Joiner <iajoiner...@gmail.com> wrote:
> >
> > > Uh. I didn’t realize that. Give me 6 months and I can provide both the
> > > reader and the writer.
> > >
> > > Ian
> > >
> > > On Friday, May 20, 2022, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
> > >
> > > > BTW, the license of that Rust ORC reader looks unusual to me.
> > > >
> > > > `ANTI-CAPITALIST SOFTWARE LICENSE (v 1.4)`
> > > >
> > > > I guess we need to skip that repository.
> > > >
> > > > Dongjoon
> > > >
> > > >
> > > > On Fri, May 20, 2022 at 9:43 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
> > > >
> > > > > Thank you for sharing it, Ian.
> > > > >
> > > > > Dongjoon.
> > > > >
> > > > >
> > > > > On Fri, May 20, 2022 at 9:27 AM Ian Joiner <iajoiner...@gmail.com> wrote:
> > > > >
> > > > >> There is already a Rust ORC reader:
> > > > >> https://rustrepo.com/repo/travisbrown-orcrs
> > > > >> We still need a writer though. If I have 6 months to do so I can
> > > > >> write one.
> > > > >> Then I can also integrate it into Arrow Rust.
> > > > >>
> > > > >> Ian
> > > > >>
> > > > >> On Friday, May 20, 2022, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
> > > > >>
> > > > >> > +1 for Owen's advice.
> > > > >> >
> > > > >> > BTW, Rust ORC reader & writer sounds like a great idea.
> > > > >> >
> > > > >> > Dongjoon.
> > > > >> >
> > > > >> >
> > > > >> > On Thu, May 19, 2022 at 10:28 PM Owen O'Malley <owen.omal...@gmail.com> wrote:
> > > > >> >
> > > > >> > > This is interesting, but it sounds like it corresponds more to
> > > > >> > > the rle encoding that we do rather than the generic compression
> > > > >> > > code.
> > > > >> > >
> > > > >> > > Has anyone done a Java version of the library? It is faster to
> > > > >> > > iterate on this kind of design in Java. On the other hand, I’ve
> > > > >> > > heard that someone is thinking about doing a Rust ORC reader &
> > > > >> > > writer, but it isn’t done yet. 😊
> > > > >> > >
> > > > >> > > .. Owen
> > > > >> > >
> > > > >> > > > On May 19, 2022, at 17:16, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
> > > > >> > > >
> > > > >> > > > Thank you for sharing, Martin.
> > > > >> > > >
> > > > >> > > > For codec one, you can take advantage of our benchmark suite
> > > > >> > > > (NONE/ZLIB/SNAPPY/ZSTD) to show the benefits in ORC format.
> > > > >> > > >
> > > > >> > > > https://github.com/apache/orc/blob/main/java/bench/core/src/java/org/apache/orc/bench/core/CompressionKind.java#L34-L37
> > > > >> > > >
> > > > >> > > > For the Spark connector one, I'd like to recommend you to send
> > > > >> > > > dev@spark too. You will get attention in both parts (codec and
> > > > >> > > > connector).
> > > > >> > > >
> > > > >> > > > Then, I'm looking forward to seeing your benchmark result.
> > > > >> > > >
> > > > >> > > > Dongjoon.
> > > > >> > > >
> > > > >> > > >
> > > > >> > > >> On Thu, May 19, 2022 at 4:12 PM Martin Loncaric <m.w.lonca...@gmail.com> wrote:
> > > > >> > > >>
> > > > >> > > >> I've developed a stable codec for numerical columns called
> > > > >> > > >> Quantile Compression
> > > > >> > > >> <https://github.com/mwlon/quantile-compression>.
> > > > >> > > >> It has about 30% higher compression ratio than even Zstd for
> > > > >> > > >> similar compression and decompression time. It achieves this
> > > > >> > > >> by tailoring to the data type (floats, ints, timestamps,
> > > > >> > > >> bools).
> > > > >> > > >>
> > > > >> > > >> I'm using it in my own projects, and a few others have adopted
> > > > >> > > >> it, but it would also be perfect for ORC columns. Assuming a
> > > > >> > > >> 50-50 split between text-like and numerical data, it could
> > > > >> > > >> reduce the average ORC file size by over 10% with no extra
> > > > >> > > >> compute cost. Incorporating it into ORC would be quite
> > > > >> > > >> powerful since the codec by itself only works on a single flat
> > > > >> > > >> column of non-nullable numbers.
> > > > >> > > >>
> > > > >> > > >> Would the ORC community be interested in this? How can we make
> > > > >> > > >> this available to users? I've already built a Spark connector
> > > > >> > > >> <https://github.com/pancake-db/spark-pancake-connector> for a
> > > > >> > > >> project using this codec and gotten fast query times.
> > > > >> > > >>
> > > > >> > > >> Thanks,
> > > > >> > > >> Martin