Yes, I agree this would be better as an RLE encoding alternative (though it isn't really RLE). Is there any ORC flag to choose between RLE implementations, or is there only one available per specification? Is there a benchmark suite for these encodings as well?
I'm working on a simple Java library right now (with a subset of functionality via JNI), should have it out in a couple of days. You can try it out then.

On 2022/05/20 15:48:42 Dongjoon Hyun wrote:
> +1 for Owen's advice.
>
> BTW, Rust ORC reader & writer sounds like a great idea.
>
> Dongjoon.
>
>
> On Thu, May 19, 2022 at 10:28 PM Owen O'Malley <ow...@gmail.com>
> wrote:
>
> > This is interesting, but it sounds like it corresponds more to the rle
> > encoding that we do rather than the generic compression code.
> >
> > Has anyone done a Java version of the library? It is faster to iterate on
> > this kind of design in Java. On the other hand, I’ve heard that someone is
> > thinking about doing a Rust ORC reader & writer, but it isn’t done yet. 😊
> >
> > .. Owen
> >
> > > On May 19, 2022, at 17:16, Dongjoon Hyun <do...@gmail.com> wrote:
> > >
> > > Thank you for sharing, Martin.
> > >
> > > For codec one, you can take advantage of our benchmark suite
> > > (NONE/ZLIB/SNAPPY/ZSTD) to show the benefits in ORC format.
> > >
> > > https://github.com/apache/orc/blob/main/java/bench/core/src/java/org/apache/orc/bench/core/CompressionKind.java#L34-L37
> > >
> > > For the Spark connector one, I'd like to recommend you to send dev@spark
> > > too. You will get attention in both parts (codec and connector).
> > >
> > > Then, I'm looking forward to seeing your benchmark result.
> > >
> > > Dongjoon.
> > >
> > >> On Thu, May 19, 2022 at 4:12 PM Martin Loncaric <m.w.lonca...@gmail.com>
> > >> wrote:
> > >>
> > >> I've developed a stable codec for numerical columns called Quantile
> > >> Compression <https://github.com/mwlon/quantile-compression>.
> > >> It has about 30% higher compression ratio than even Zstd for similar
> > >> compression and decompression time. It achieves this by tailoring to the
> > >> data type (floats, ints, timestamps, bools).
> > >>
> > >> I'm using it in my own projects, and a few others have adopted it, but it
> > >> would also be perfect for ORC columns. Assuming a 50-50 split between
> > >> text-like and numerical data, it could reduce the average ORC file size by
> > >> over 10% with no extra compute cost. Incorporating it into ORC would be
> > >> quite powerful since the codec by itself only works on a single flat column
> > >> of non-nullable numbers.
> > >>
> > >> Would the ORC community be interested in this? How can we make this
> > >> available to users? I've already built a Spark connector
> > >> <https://github.com/pancake-db/spark-pancake-connector> for a project using
> > >> this codec and gotten fast query times.
> > >>
> > >> Thanks,
> > >> Martin