Hi Radev,

Thanks a lot for providing so much technical detail. I need to read it
carefully.

I think FP encoding is definitely a useful feature.
I hope this feature can be implemented in Arrow soon, so that we can use it
in our system.

Best,
Liya Fan

On Thu, Jul 11, 2019 at 5:55 PM Radev, Martin <martin.ra...@tum.de> wrote:

> Hello Liya Fan,
>
>
> This explains the technique, but for a more complex case:
>
> https://fgiesen.wordpress.com/2011/01/24/x86-code-compression-in-kkrunchy/
>
> For FP data, the approach that seemed to work best is the following.
>
> Say we have a buffer of two 32-bit floating point values:
>
> buf = [af, bf]
>
> We interpret each FP value as a 32-bit uint and look at each individual
> byte. We have 8 bytes in total for this small input.
>
> buf = [af0, af1, af2, af3, bf0, bf1, bf2, bf3]
>
> Then we apply stream splitting and the new buffer becomes:
>
> newbuf = [af0, bf0, af1, bf1, af2, bf2, af3, bf3]
>
> We compress newbuf.
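>
> To make the transform concrete, here is a minimal sketch in C++ for 32-bit
> floats (not the actual patch code; the function name is just for
> illustration):
>
> #include <cstddef>
> #include <cstdint>
> #include <cstring>
> #include <vector>
>
> // Minimal sketch of stream splitting for 32-bit floats.
> // Byte i of value j in the input ends up in stream i of the output.
> std::vector<uint8_t> SplitStreams(const float* values, size_t num_values) {
>   std::vector<uint8_t> out(num_values * sizeof(float));
>   for (size_t j = 0; j < num_values; ++j) {
>     uint8_t bytes[sizeof(float)];
>     std::memcpy(bytes, &values[j], sizeof(float));  // view the FP value as raw bytes
>     for (size_t i = 0; i < sizeof(float); ++i) {
>       out[i * num_values + j] = bytes[i];  // stream i holds byte i of every value
>     }
>   }
>   return out;  // this is the buffer handed to the compressor
> }
>
> Decoding just reverses the copy, so the inverse transform is equally cheap.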
>
> Due to similarities in the sign bits, mantissa bits and MSB exponent bits,
> we might have a lot more repetition in the data. For scientific data, the
> 2nd and 3rd byte of each 32-bit value are probably largely noise. Thus, in
> the original representation we would always have a few bytes of data which
> could appear somewhere else in the buffer, followed by a couple of bytes of
> likely noise. In the new representation we have a long stream of data which
> should compress well, followed by a sequence of noise towards the end.
>
> This transformation improved the compression ratio, as can be seen in the
> report.
>
> It also improved speed for ZSTD. This could be because ZSTD decides how to
> compress each chunk of data - RLE, a new Huffman tree, the Huffman tree of
> the previous frame, or the raw representation. Each can potentially achieve
> a different compression ratio and compression/decompression speed. It
> turned out that when the transformation is applied, zstd would attempt to
> compress fewer frames and copy the others verbatim. This could lead to
> fewer attempts to build a Huffman tree. It's hard to pinpoint the exact
> reason.
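>
> For reference, one way to sanity-check the ratio difference, using zstd's
> one-shot API and assuming a SplitStreams helper like the sketch above (an
> illustrative name, not from the patch), would be:
>
> #include <zstd.h>
> #include <cstdio>
>
> // Compress the raw buffer and the split buffer with the same settings
> // and print both compressed sizes for comparison.
> void CompareRatios(const float* values, size_t num_values) {
>   const size_t raw_size = num_values * sizeof(float);
>   std::vector<uint8_t> split = SplitStreams(values, num_values);
>   std::vector<uint8_t> dst(ZSTD_compressBound(raw_size));
>   size_t raw_out = ZSTD_compress(dst.data(), dst.size(), values, raw_size, 1);
>   size_t split_out = ZSTD_compress(dst.data(), dst.size(), split.data(), raw_size, 1);
>   std::printf("raw: %zu bytes, split: %zu bytes\n", raw_out, split_out);
> }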
>
> I did not try other general-purpose lossless compressors, but I expect
> similar results.
>
> As for the code, I can polish my patches, create a Jira task and submit
> them for review.
>
>
> Regards,
>
> Martin
>
>
> ________________________________
> From: Fan Liya <liya.fa...@gmail.com>
> Sent: Thursday, July 11, 2019 11:32:53 AM
> To: dev@arrow.apache.org
> Cc: Raoofy, Amir; Karlstetter, Roman
> Subject: Re: Adding a new encoding for FP data
>
> Hi Radev,
>
> Thanks for the information. It seems interesting.
> IMO, Arrow has much to do for data compression. However, there seem to be
> some differences between compressing in-memory data and compressing data
> for external storage.
>
> Could you please provide some references for stream splitting?
>
> Best,
> Liya Fan
>
> On Thu, Jul 11, 2019 at 5:15 PM Radev, Martin <martin.ra...@tum.de> wrote:
>
> > Hello people,
> >
> >
> > There has been discussion on the Apache Parquet mailing list about adding
> > a new encoding for FP data.
> > The reason for this is that the compressors supported by Apache Parquet
> > (zstd, gzip, etc.) do not compress raw FP data well.
> >
> >
> > In my investigation, it turned out that a very simple technique, called
> > stream splitting, can improve the compression ratio and even the speed for
> > some of the compressors.
> >
> > You can read about the results here:
> > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
> >
> >
> > I went through the developer guide for Apache Arrow and wrote a patch to
> > add the new encoding and test coverage for it.
> >
> > I will polish my patch and work in parallel to extend the Apache Parquet
> > format for the new encoding.
> >
> >
> > If you have any concerns, please let me know.
> >
> >
> > Regards,
> >
> > Martin
> >
> >
>
