See this past mailing list thread

https://lists.apache.org/thread.html/a99124e57c14c3c9ef9d98f3c80cfe1dd25496bf3ff7046778add937%40%3Cdev.arrow.apache.org%3E

and associated PR

https://github.com/apache/arrow/pull/4815

There hasn't been much movement on this, primarily because all
the key people who've expressed interest in it have been really busy
with other matters (myself included). Having RLE encoding in memory at
minimum would be a huge benefit for a number of applications, so it
would be great to continue the discussion and create a more
comprehensive proposal document describing what we would like to
implement (and what we do not want to implement).

On Tue, Mar 10, 2020 at 3:41 AM Radev, Martin <martin.ra...@tum.de> wrote:
>
> Hey Evan,
>
>
> thank you for the interest.
>
> There has been some effort to compress floating-point data on the Parquet
> side, namely the BYTE_STREAM_SPLIT encoding. On its own it does not compress
> floating-point data, but it makes the data more compressible when a
> compressor such as ZSTD or LZ4 is used. It only works well for high-entropy
> floating-point data, with at least about 15 bits of entropy per
> element. I suppose the encoding might also make sense for
> high-entropy integer data, but I am not sure.
> For low-entropy data, dictionary encoding is good, though I suspect there
> is room for performance improvements.
> My final report on the encoding is here:
> https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf
>
> Note that at some point my investigation turned out to be much the same
> solution as the one in https://github.com/powturbo/Turbo-Transpose.
>
>
> Maybe the points I sent can be helpful.
>
>
> Kind regards,
>
> Martin
>
> ________________________________
> From: evan_c...@apple.com <evan_c...@apple.com> on behalf of Evan Chan 
> <evan_c...@apple.com.INVALID>
> Sent: Tuesday, March 10, 2020 5:15:48 AM
> To: dev@arrow.apache.org
> Subject: Summary of RLE and other compression efforts?
>
> Hi folks,
>
> I’m curious about the state of efforts for more compressed encodings in the 
> Arrow columnar format.  I saw discussions previously about RLE, but is there 
> a place to summarize all of the different efforts that are ongoing to bring 
> more compressed encodings?
>
> Is there an effort to compress floating point or integer data using 
> techniques such as XOR compression and Delta-Delta?  I can contribute to some 
> of these efforts as well.
>
> Thanks,
> Evan
>
>
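
For readers unfamiliar with the BYTE_STREAM_SPLIT idea mentioned in the
quoted message, here is a minimal sketch of the byte transposition it
relies on. It is not the Parquet or Arrow implementation. It assumes
32-bit little-endian floats, and the function names are made up for
illustration. The transform only reorders bytes, so the output is the
same size as the input. Any size reduction would come from a compressor
applied afterwards.

```python
import struct

WIDTH = 4  # assumption: 32-bit floats, so 4 byte streams


def byte_stream_split_encode(values):
    """Hypothetical helper: gather byte k of every value into stream k."""
    raw = struct.pack(f"<{len(values)}f", *values)
    # raw[k::WIDTH] picks byte k of each value; the streams are concatenated.
    return b"".join(raw[k::WIDTH] for k in range(WIDTH))


def byte_stream_split_decode(encoded):
    """Hypothetical helper: scatter each stream back to its byte position."""
    n = len(encoded) // WIDTH
    raw = bytearray(len(encoded))
    for k in range(WIDTH):
        raw[k::WIDTH] = encoded[k * n:(k + 1) * n]
    return list(struct.unpack(f"<{n}f", bytes(raw)))


if __name__ == "__main__":
    data = [1.0, 2.5, -3.25, 0.15625]
    encoded = byte_stream_split_encode(data)
    # The round trip is lossless for values representable as 32-bit floats.
    print(byte_stream_split_decode(encoded) == data)
```

The encoded bytes could then be handed to a general-purpose compressor
such as the ones named in the quoted message. How much that helps
depends on the data.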
