Martin,

Many thanks for the links.  

My main concern is not actually FP and integer data, but sparse string data.  
Having many very sparse arrays, each with its own bitmap and values buffer 
(assume dictionary encoding as well), would be really expensive. I have some 
ideas I’d like to throw out there, around something like a MapArray (think of 
it essentially as dictionaries of keys and values, plus a List<List<u8>>, for 
example), but optimized for sparseness.
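
To illustrate the kind of thing I mean, here is a rough, purely hypothetical 
Python sketch (all names and the layout are made up for illustration, not a 
proposal for an actual Arrow layout): many sparse string columns stored as one 
map-like column, where each row carries only a small list of (key id, value id) 
pairs into shared dictionaries, so absent cells cost nothing.

    # Hypothetical sketch only -- names and layout invented for illustration.
    key_dict = ["colA", "colB", "colC"]    # shared dictionary of sparse column names
    value_dict = ["foo", "bar"]            # shared dictionary of string values

    # One small list of (key_id, value_id) pairs per row; empty list = all absent.
    rows = [
        [(0, 1)],             # row 0: colA = "bar"
        [],                   # row 1: nothing set
        [(1, 0), (2, 1)],     # row 2: colB = "foo", colC = "bar"
    ]

    def get(row, col_name):
        key_id = key_dict.index(col_name)
        for k, v in rows[row]:
            if k == key_id:
                return value_dict[v]
        return None           # absent cell

The point is that per-row storage scales with the number of populated cells, 
not with the number of declared columns.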

Overall, while I appreciate that Arrow arrays are designed to be super fast for 
computation, I’d like to be able to keep more of such data in memory, so I’m 
interested in more compact representations that ideally don’t need a 
compressor, more like an encoding.

I saw a thread around the middle of last year about RLE encodings; that would 
be a step in the right direction, I think. It could be implemented such that 
random access doesn’t suffer too much.
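
For example, if runs are stored with cumulative run ends, a lookup is just a 
binary search over the run ends. Rough Python sketch (not tied to any proposed 
layout):

    import bisect

    values = ["a", "b", "a"]       # one value per run
    run_ends = [4, 9, 12]          # exclusive logical end offset of each run

    def get(i):
        run = bisect.bisect_right(run_ends, i)   # O(log n_runs) lookup
        return values[run]

    assert [get(i) for i in range(12)] == ["a"] * 4 + ["b"] * 5 + ["a"] * 3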

As for FP, I have my own scheme which is tested at scale, SIMD-friendly, and 
should perform relatively well, and it can fit in with different predictors 
including XOR, DFCM, etc.   Given the high cardinality of most such data, I 
wonder if it wouldn’t be simpler to stick with one such scheme for all FP data.
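
Just to illustrate the kind of XOR predictor I mean (this is the generic 
Gorilla-style XOR-with-previous idea, not my actual scheme): XOR each double 
with the previous one, so nearby values leave residuals with many leading zero 
bits that encode compactly. Rough Python sketch:

    import struct

    def xor_residuals(values):
        prev = 0
        out = []
        for v in values:
            bits = struct.unpack("<Q", struct.pack("<d", v))[0]
            out.append(bits ^ prev)    # residual vs. previous value's bit pattern
            prev = bits
        return out

    # Repeated or nearby values leave residuals full of leading zero bits:
    print([hex(r) for r in xor_residuals([1.0, 1.0, 1.5, 1.5])])
    # ['0x3ff0000000000000', '0x0', '0x8000000000000', '0x0']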

Anyway, I’m most curious whether there is a plan to implement RLE, the FP 
schemes you describe, etc., and bring them to Arrow. That is, is there an 
overall plan for space-efficient encodings in Arrow?

Thanks very much,
Evan

> On Mar 10, 2020, at 1:41 AM, Radev, Martin <martin.ra...@tum.de> wrote:
> 
> Hey Evan,
> 
> 
> thank you for the interest.
> 
> There has been some effort toward compressing floating-point data on the 
> Parquet side, namely the BYTE_STREAM_SPLIT encoding. On its own it does not 
> compress floating-point data, but it makes the data more compressible when a 
> compressor such as ZSTD, LZ4, etc. is used. It only works well for 
> high-entropy floating-point data, roughly at least 15 bits of entropy per 
> element. I suppose the encoding might also make sense for high-entropy 
> integer data, but I am not sure.
> For low-entropy data, the dictionary encoding is good, though I suspect 
> there is room for performance improvements.
> This is my final report for the encoding here: 
> https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf
> 
> Note that at some point my investigation turned out to be quite the same 
> solution as the one in https://github.com/powturbo/Turbo-Transpose.
> 
> 
> Maybe the points I sent can be helpful.
> 
> 
> Kind regards,
> 
> Martin
> 
> ________________________________
> From: evan_c...@apple.com <evan_c...@apple.com> on behalf of Evan Chan 
> <evan_c...@apple.com.INVALID>
> Sent: Tuesday, March 10, 2020 5:15:48 AM
> To: dev@arrow.apache.org
> Subject: Summary of RLE and other compression efforts?
> 
> Hi folks,
> 
> I’m curious about the state of efforts for more compressed encodings in the 
> Arrow columnar format.  I saw discussions previously about RLE, but is there 
> a place that summarizes all of the different ongoing efforts to bring in 
> more compressed encodings?
> 
> Is there an effort to compress floating point or integer data using 
> techniques such as XOR compression and Delta-Delta?  I can contribute to some 
> of these efforts as well.
> 
> Thanks,
> Evan
> 
> 
