Hi Radev,

Thanks a lot for providing so many technical details. I need to read them carefully.
I think FP encoding is definitely a useful feature. I hope this feature can
be implemented in Arrow soon, so that we can use it in our system.

Best,
Liya Fan

On Thu, Jul 11, 2019 at 5:55 PM Radev, Martin <martin.ra...@tum.de> wrote:

> Hello Liya Fan,
>
> this explains the technique, but for a more complex case:
> https://fgiesen.wordpress.com/2011/01/24/x86-code-compression-in-kkrunchy/
>
> For FP data, the approach which seemed to work best is the following.
>
> Say we have a buffer of two 32-bit floating-point values:
>
> buf = [af, bf]
>
> We interpret each FP value as a 32-bit uint and look at each individual
> byte. We have 8 bytes in total for this small input.
>
> buf = [af0, af1, af2, af3, bf0, bf1, bf2, bf3]
>
> Then we apply stream splitting and the new buffer becomes:
>
> newbuf = [af0, bf0, af1, bf1, af2, bf2, af3, bf3]
>
> We compress newbuf.
>
> Due to similarities in the sign bits, mantissa bits and most significant
> exponent bits, we might get a lot more repetitions in the data. For
> scientific data, the 2nd and 3rd bytes of each 32-bit value are probably
> largely noise. Thus, in the original representation we would always have a
> few bytes of data which could appear somewhere else in the buffer, followed
> by a couple of bytes of likely noise. In the new representation we have a
> long stream of data which could compress well, followed by a sequence of
> noise towards the end.
>
> This transformation improved the compression ratio, as can be seen in the
> report.
>
> It also improved speed for ZSTD. This could be because ZSTD decides how to
> compress each chunk of data - RLE, a new Huffman tree, the Huffman tree of
> the previous frame, or a raw representation. Each option can achieve a
> different compression ratio and compression/decompression speed. It turned
> out that when the transformation is applied, zstd would attempt to compress
> fewer frames and copy the others, which could lead to fewer attempts to
> build a Huffman tree. It's hard to pinpoint the exact reason.
>
> I did not try other lossless text compressors, but I expect similar results.
>
> For code, I can polish my patches, create a Jira task and submit the
> patches for review.
>
>
> Regards,
>
> Martin
>
>
> ________________________________
> From: Fan Liya <liya.fa...@gmail.com>
> Sent: Thursday, July 11, 2019 11:32:53 AM
> To: dev@arrow.apache.org
> Cc: Raoofy, Amir; Karlstetter, Roman
> Subject: Re: Adding a new encoding for FP data
>
> Hi Radev,
>
> Thanks for the information. It seems interesting.
> IMO, Arrow has much to do for data compression. However, it seems there are
> some differences between in-memory data compression and external-storage
> data compression.
>
> Could you please provide some reference for stream splitting?
>
> Best,
> Liya Fan
>
> On Thu, Jul 11, 2019 at 5:15 PM Radev, Martin <martin.ra...@tum.de> wrote:
>
> > Hello people,
> >
> > there has been a discussion on the Apache Parquet mailing list on adding
> > a new encoder for FP data. The reason for this is that the compressors
> > supported by Apache Parquet (zstd, gzip, etc.) do not compress raw FP
> > data well.
> >
> > In my investigation it turned out that a very simple technique, named
> > stream splitting, can improve the compression ratio and even the speed
> > for some of the compressors.
> >
> > You can read about the results here:
> > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
> >
> > I went through the developer guide for Apache Arrow and wrote a patch to
> > add the new encoding and test coverage for it.
> >
> > I will polish my patch and work in parallel to extend the Apache Parquet
> > format for the new encoding.
> >
> > If you have any concerns, please let me know.
> >
> > Regards,
> >
> > Martin
> >
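
As a concrete illustration of the stream splitting transformation Martin
describes above, here is a minimal sketch in C++. The function names and the
std::vector-based layout are illustrative assumptions, not the actual
Arrow/Parquet patch:

#include <cstddef>
#include <cstdint>
#include <vector>

// Scatter byte b of every 32-bit value into stream b, so that
// [af0 af1 af2 af3  bf0 bf1 bf2 bf3] becomes [af0 bf0  af1 bf1  af2 bf2  af3 bf3].
std::vector<uint8_t> StreamSplit(const std::vector<float>& values) {
  const size_t n = values.size();
  std::vector<uint8_t> out(n * sizeof(float));
  const uint8_t* in = reinterpret_cast<const uint8_t*>(values.data());
  for (size_t i = 0; i < n; ++i) {
    for (size_t b = 0; b < sizeof(float); ++b) {
      out[b * n + i] = in[i * sizeof(float) + b];  // stream b, element i
    }
  }
  return out;
}

// Inverse transform for the decoding path: gather byte b of element i back
// out of stream b so the buffer can be reinterpreted as floats again.
std::vector<float> StreamUnsplit(const std::vector<uint8_t>& split) {
  const size_t n = split.size() / sizeof(float);
  std::vector<float> out(n);
  uint8_t* raw = reinterpret_cast<uint8_t*>(out.data());
  for (size_t i = 0; i < n; ++i) {
    for (size_t b = 0; b < sizeof(float); ++b) {
      raw[i * sizeof(float) + b] = split[b * n + i];
    }
  }
  return out;
}

The split buffer is what would be handed to the general-purpose compressor
(zstd, gzip, ...); decompression applies the inverse transform before the
data is exposed as floats again.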