Micah,

Thanks for the link to the PR. I'll add more comments on those specific proposals there, but let me outline the broader ideas here, since I'm not sure the PR is the best place to expand on details.
I'm wondering how important retaining strict O(1) random access is for Arrow's use cases. Would "substantially sub-linear", or "approximately O(c)" where c is a very small constant (a la B-tree), work?

> I'm not sure if this was provided as an example of something to
> add, but I believe this is already supported in the spec.
> In the longer term I think it is potentially something Arrow should include,
> but given the current number of developers contributing across all languages
> I'm wary of tacking on too much before we have working reference
> implementations.

I can understand wanting to tackle less: start with just one new proposal and see it implemented widely first.

> If you think there are better encodings to tackle in the short term I'd
> welcome feedback on the proposal (or a more formal proposal of your own).

An alternative to RLE that stores run-ends for every element is to use RLE or other variable-length encoding schemes, but combine them with an index of offsets for every Nth element. This gives O(c) random access, since you never have to decode more than N elements at a time. N can be tuned for either better compression (larger N) or faster reads (smaller N, less to decode per access). I've put a rough sketch of this idea at the end of this message.

I also have my own SIMD-friendly variable-length encoding which can quickly compress both floating-point and integer data. Some details here:
https://github.com/filodb/FiloDB/blob/develop/doc/compression.md#predictive-nibblepacking

Happy to give more details, perhaps in a separate channel if needed.

-Evan

> Thanks,
> -Micah
>
> [1] https://github.com/apache/arrow/pull/4815/files
>
>
> On Wed, Mar 11, 2020 at 9:24 AM Evan Chan <evan_c...@apple.com> wrote:
> Sure thing.
>
> Computation speed needs to be thought about in context. We might find
> something which takes up half the space to be a little more computationally
> expensive, but in the grand scheme of things faster to compute, because more of
> it can fit in memory and it saves I/O. I definitely agree that just making
> something smaller without other gains might not be worth it.
>
> Some small changes help both computation and size. For example, encoding
> nulls in dictionary values helps reduce the need for both bitmap storage and
> lookup.
>
> So I suppose the process goes something like this?
>
> 1) Propose designs on this mailing list
> 2) PR for the format specification
> 3) Start implementing across the different languages
>
> -Evan
>
> > On Mar 10, 2020, at 10:31 PM, Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> > +1 to what Wes said.
> >
> > I'm hoping to have more time to spend on this at the end of Q2 / beginning of
> > Q3 if no progress is made by then.
> >
> > I still think we should be careful about what is added to the spec; in
> > particular, we should be focused on encodings that can be used to improve
> > computational efficiency rather than just smaller size. Also, it is
> > important to note that any sort of encoding/compression must be supportable
> > across multiple languages/platforms.
> >
> > Thanks,
> > Micah
> >
> > On Tue, Mar 10, 2020 at 3:12 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> >> On Tue, Mar 10, 2020 at 5:01 PM Evan Chan <evan_c...@apple.com.invalid>
> >> wrote:
> >>>
> >>> Martin,
> >>>
> >>> Many thanks for the links.
> >>>
> >>> My main concern is not actually FP and integer data, but sparse string
> >> data. Having many very sparse arrays, each with a bitmap and values
> >> (assume a dictionary as well), would be really expensive. I have some ideas I'd
> >> like to throw out there, around something like a MapArray (think of it
> >> essentially as dictionaries of keys and values, plus a List<List<u8>> for
> >> example), but something optimized for sparseness.
> >>>
> >>> Overall, while I appreciate that Arrow arrays are designed to be super fast
> >> for computation, I'd like to be able to keep more of such data in memory,
> >> so I'm interested in more compact representations that ideally don't
> >> need a compressor; more like an encoding.
> >>>
> >>> I saw a thread in the middle of last year about RLE encodings; this would be a step in
> >> the right direction, I think. It could be implemented properly such that
> >> it doesn't make random access that bad.
> >>>
> >>> As for FP, I have my own scheme which is tested at scale, SIMD-friendly,
> >> should perform relatively well, and can fit in with different predictors
> >> including XOR, DFCM, etc. Given the high cardinality of most such data,
> >> I wonder if it wouldn't be simpler to stick with one such scheme for all FP
> >> data.
> >>>
> >>> Anyway, I'm most curious whether there is a plan to implement RLE, the
> >> FP schemes you describe, etc., and bring them to Arrow. That is, is there a plan
> >> for space-efficient encodings overall in Arrow?
> >>
> >> It's been discussed many times in the past. As Arrow is developed by
> >> volunteers, if someone volunteers their time to work on it, it can
> >> happen. The first step would be to build consensus about what sort of
> >> protocol-level additions (see the format/ directory and associated
> >> documentation) are needed. Once there is consensus about what to build,
> >> and a complete specification for it, implementation can move
> >> forward.
> >>
> >>> Thanks very much,
> >>> Evan
> >>>
> >>>> On Mar 10, 2020, at 1:41 AM, Radev, Martin <martin.ra...@tum.de>
> >> wrote:
> >>>>
> >>>> Hey Evan,
> >>>>
> >>>> Thank you for the interest.
> >>>>
> >>>> There has been some effort to compress floating-point data on the
> >> Parquet side, namely the BYTE_STREAM_SPLIT encoding. On its own it does not
> >> compress floating-point data, but it makes the data more compressible when a
> >> compressor such as ZSTD, LZ4, etc. is used. It only works well for
> >> high-entropy floating-point data, roughly >= 15 bits
> >> of entropy per element. I suppose the encoding might actually also make
> >> sense for high-entropy integer data, but I am not sure.
> >>>> For low-entropy data, dictionary encoding is good, though I suspect
> >> there is room for performance improvements.
> >>>> My final report on the encoding is here:
> >> https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf
> >>>>
> >>>> Note that at some point my investigation converged on essentially the same
> >> solution as the one in https://github.com/powturbo/Turbo-Transpose.
> >>>>
> >>>> Maybe the points I sent can be helpful.
> >>>>
> >>>> Kind regards,
> >>>>
> >>>> Martin
> >>>>
> >>>> ________________________________
> >>>> From: evan_c...@apple.com <evan_c...@apple.com> on behalf of Evan
> >> Chan <evan_c...@apple.com.INVALID>
> >>>> Sent: Tuesday, March 10, 2020 5:15:48 AM
> >>>> To: dev@arrow.apache.org
> >>>> Subject: Summary of RLE and other compression efforts?
> >>>>
> >>>> Hi folks,
> >>>>
> >>>> I'm curious about the state of efforts toward more compressed encodings
> >> in the Arrow columnar format. I saw earlier discussions about RLE, but
> >> is there a place that summarizes all of the different ongoing efforts
> >> to bring in more compressed encodings?
> >>>>
> >>>> Is there an effort to compress floating-point or integer data using
> >> techniques such as XOR compression and Delta-Delta? I can contribute to
> >> some of these efforts as well.
> >>>>
> >>>> Thanks,
> >>>> Evan
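
P.S. To make the offset-index idea above a bit more concrete, here is a rough sketch in C++. Everything in it is hypothetical (the Run/IndexedRle names are made up for illustration and are not part of the Arrow format or any existing library); it just shows RLE runs paired with a skip-index entry for every Nth logical element, so a random read never has to decode more than roughly N elements' worth of runs.

// Hypothetical sketch, not an Arrow API: RLE runs plus a skip-index entry
// for every Nth logical element. A random read starts from the indexed run
// and scans forward, decoding at most N elements' worth of runs.
#include <cassert>
#include <cstdint>
#include <vector>

struct Run {
  int64_t value;   // the repeated value (any fixed-width type would do)
  int64_t length;  // number of logical elements covered by this run
};

class IndexedRle {
 public:
  IndexedRle(std::vector<Run> runs, int64_t index_interval)
      : runs_(std::move(runs)), interval_(index_interval) {
    // Skip index: entry k records the run containing logical position
    // k * interval_ and the logical position where that run starts.
    int64_t run_start = 0;
    for (size_t r = 0; r < runs_.size(); ++r) {
      int64_t run_end = run_start + runs_[r].length;
      while (static_cast<int64_t>(index_.size()) * interval_ < run_end) {
        index_.push_back({r, run_start});
      }
      run_start = run_end;
    }
    length_ = run_start;
  }

  // Random access: jump via the skip index, then scan forward. Worst case
  // we walk only the runs that begin inside one interval, i.e. O(N), not O(n).
  int64_t Get(int64_t i) const {
    assert(i >= 0 && i < length_);
    const IndexEntry& e = index_[i / interval_];
    size_t r = e.run;
    int64_t run_start = e.run_start;
    while (run_start + runs_[r].length <= i) {
      run_start += runs_[r].length;
      ++r;
    }
    return runs_[r].value;
  }

  int64_t length() const { return length_; }

 private:
  struct IndexEntry {
    size_t run;         // index of the run containing position k * interval_
    int64_t run_start;  // logical position at which that run begins
  };
  std::vector<Run> runs_;
  std::vector<IndexEntry> index_;
  int64_t interval_;
  int64_t length_ = 0;
};

// Example: 1000 logical elements in 3 runs, indexed every 128 elements.
//   IndexedRle arr({{7, 500}, {42, 300}, {7, 200}}, /*index_interval=*/128);
//   arr.Get(750);  // == 42, jumps near position 750 instead of scanning from 0

The same structure works whether the index is kept per N elements or per N runs; indexing per N elements is what bounds the decode cost per access, which is the property I care about for random reads.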