Hi Martin,

Can you clarify: were you expecting the encoding to be used only in Parquet, or more generally in Arrow?
Thanks,
Micah

On Thu, Jul 11, 2019 at 7:06 AM Wes McKinney <wesmck...@gmail.com> wrote:

> hi folks,
>
> If you could participate in Micah's discussion about compression and
> encoding generally at
>
> https://lists.apache.org/thread.html/a99124e57c14c3c9ef9d98f3c80cfe1dd25496bf3ff7046778add937@%3Cdev.arrow.apache.org%3E
>
> it would be helpful. I personally think that Arrow would benefit from
> an alternate protocol message type to the current RecordBatch (as
> defined in Message.fbs) that allows for encoded or compressed columns.
> This won't be an overnight change (more on the order of months of
> work), but it's worth taking the time to carefully consider the
> implications of developing and supporting such a feature for the long
> term.
>
> On Thu, Jul 11, 2019 at 5:34 AM Fan Liya <liya.fa...@gmail.com> wrote:
> >
> > Hi Radev,
> >
> > Thanks a lot for providing so many technical details. I need to read
> > them carefully.
> >
> > I think FP encoding is definitely a useful feature.
> > I hope this feature can be implemented in Arrow soon, so that we can
> > use it in our system.
> >
> > Best,
> > Liya Fan
> >
> > On Thu, Jul 11, 2019 at 5:55 PM Radev, Martin <martin.ra...@tum.de> wrote:
> > >
> > > Hello Liya Fan,
> > >
> > > This post explains the technique, but for a more complex case:
> > > https://fgiesen.wordpress.com/2011/01/24/x86-code-compression-in-kkrunchy/
> > >
> > > For FP data, the approach that seemed to work best is the following.
> > >
> > > Say we have a buffer of two 32-bit floating point values:
> > >
> > > buf = [af, bf]
> > >
> > > We interpret each FP value as a 32-bit uint and look at each
> > > individual byte. We have 8 bytes in total for this small input:
> > >
> > > buf = [af0, af1, af2, af3, bf0, bf1, bf2, bf3]
> > >
> > > Then we apply stream splitting and the new buffer becomes:
> > >
> > > newbuf = [af0, bf0, af1, bf1, af2, bf2, af3, bf3]
> > >
> > > We compress newbuf.
> > >
> > > Due to similarities in the sign bits, mantissa bits and MSB exponent
> > > bits, we might have a lot more repetition in the data. For scientific
> > > data, the 2nd and 3rd bytes of each 32-bit value are probably largely
> > > noise. Thus, in the original representation we would always have a
> > > few bytes of data which could appear somewhere else in the buffer,
> > > followed by a couple of bytes of likely noise. In the new
> > > representation we have a long stream of data which could compress
> > > well, followed by a sequence of noise towards the end.
> > >
> > > This transformation improved the compression ratio, as can be seen
> > > in the report.
> > >
> > > It also improved speed for ZSTD. This could be because ZSTD decides
> > > how to compress each frame of data - RLE, a new Huffman tree, the
> > > Huffman tree of the previous frame, or the raw representation. Each
> > > can potentially achieve a different compression ratio and
> > > compression/decompression speed. It turned out that when the
> > > transformation is applied, zstd would attempt to compress fewer
> > > frames and copy the others, which could lead to fewer attempts to
> > > build a Huffman tree. It's hard to pinpoint the exact reason.
> > >
> > > I did not try other lossless text compressors, but I expect similar
> > > results.
> > >
> > > For code, I can polish my patches, create a Jira task and submit the
> > > patches for review.
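For concreteness, here is a minimal sketch of the transform described above, in C++. It is not the actual patch from this thread; the function names SplitStreams and MergeStreams are illustrative.

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Byte stream splitting for 32-bit floats: byte k of value i moves to
    // position k * num_values + i, so the output holds all first bytes,
    // then all second bytes, and so on. For buf = [af, bf] this yields
    // [af0, bf0, af1, bf1, af2, bf2, af3, bf3], as in the example above.
    std::vector<uint8_t> SplitStreams(const float* values, size_t num_values) {
      std::vector<uint8_t> out(num_values * sizeof(float));
      for (size_t i = 0; i < num_values; ++i) {
        uint8_t bytes[sizeof(float)];
        std::memcpy(bytes, &values[i], sizeof(float));  // view the float as raw bytes
        for (size_t k = 0; k < sizeof(float); ++k) {
          out[k * num_values + i] = bytes[k];
        }
      }
      return out;
    }

    // Inverse transform for decoding: gather byte k of value i from stream k.
    std::vector<float> MergeStreams(const uint8_t* split, size_t num_values) {
      std::vector<float> out(num_values);
      for (size_t i = 0; i < num_values; ++i) {
        uint8_t bytes[sizeof(float)];
        for (size_t k = 0; k < sizeof(float); ++k) {
          bytes[k] = split[k * num_values + i];
        }
        std::memcpy(&out[i], bytes, sizeof(float));
      }
      return out;
    }

The split output would then be handed to the general-purpose compressor (zstd, gzip, etc.), and decompression applies the inverse before the bytes are reinterpreted as floats.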
> > >
> > > Regards,
> > > Martin
> > >
> > > ________________________________
> > > From: Fan Liya <liya.fa...@gmail.com>
> > > Sent: Thursday, July 11, 2019 11:32:53 AM
> > > To: dev@arrow.apache.org
> > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > Subject: Re: Adding a new encoding for FP data
> > >
> > > Hi Radev,
> > >
> > > Thanks for the information. It seems interesting.
> > > IMO, Arrow has much to do for data compression. However, it seems
> > > there are some differences between in-memory data compression and
> > > external storage data compression.
> > >
> > > Could you please provide some reference for stream splitting?
> > >
> > > Best,
> > > Liya Fan
> > >
> > > On Thu, Jul 11, 2019 at 5:15 PM Radev, Martin <martin.ra...@tum.de> wrote:
> > >
> > > > Hello people,
> > > >
> > > > There has been discussion in the Apache Parquet mailing list on
> > > > adding a new encoder for FP data. The reason for this is that the
> > > > compressors supported by Apache Parquet (zstd, gzip, etc.) do not
> > > > compress raw FP data well.
> > > >
> > > > In my investigation it turns out that a very simple technique,
> > > > named stream splitting, can improve the compression ratio and even
> > > > speed for some of the compressors.
> > > >
> > > > You can read about the results here:
> > > > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
> > > >
> > > > I went through the developer guide for Apache Arrow and wrote a
> > > > patch to add the new encoding and test coverage for it.
> > > >
> > > > I will polish my patch and work in parallel to extend the Apache
> > > > Parquet format for the new encoding.
> > > >
> > > > If you have any concerns, please let me know.
> > > >
> > > > Regards,
> > > >
> > > > Martin
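The announcement above mentions test coverage for the encoding; a round-trip check is the natural first test, since the transform must be exactly invertible. A minimal sketch, assuming the illustrative SplitStreams/MergeStreams helpers from the earlier sketch:

    #include <cassert>
    #include <cstdint>
    #include <vector>

    int main() {
      // Values of similar magnitude, as in typical scientific data: the
      // sign and exponent bytes repeat, so the split streams compress well.
      std::vector<float> input = {1.0f, 1.5f, 1.25f, 1.75f};
      std::vector<uint8_t> split = SplitStreams(input.data(), input.size());
      std::vector<float> output = MergeStreams(split.data(), input.size());
      assert(output == input);  // splitting then merging is lossless
      return 0;
    }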