Re: Adding a new encoding for FP data - unsubscribe

2019-07-11 Thread Bryan Cutler
Mani, please send a reply to dev-unsubscr...@arrow.apache.org to remove
yourself from the list.


RE: Adding a new encoding for FP data - unsubscribe

2019-07-11 Thread mani vannan
All, 

Can someone please help me unsubscribe from this group?

Thank you.


Re: Adding a new encoding for FP data

2019-07-11 Thread Radev, Martin
Hello Micah,


the changes will go to the C++ implementation of Parquet within Arrow.

In that sense, if Arrow uses the compression and encoding methods available in 
Parquet in any way, I expect a benefit.


My plan is to add the new encoding to parquet-cpp and parquet-mr (Java).


If you have any more questions or concerns, let me know.

I am close to done with my patch.


Regards,

Martin



Re: Adding a new encoding for FP data

2019-07-11 Thread Micah Kornfield
Hi Martin,
Can you clarify whether you were expecting the encoding to be used only in
Parquet, or more generally in Arrow?

Thanks,
Micah


Re: Adding a new encoding for FP data

2019-07-11 Thread Wes McKinney
hi folks,

If you could participate in Micah's discussion about compression and
encoding generally at

https://lists.apache.org/thread.html/a99124e57c14c3c9ef9d98f3c80cfe1dd25496bf3ff7046778add937@%3Cdev.arrow.apache.org%3E

it would be helpful. I personally think that Arrow would benefit from
an alternate protocol message type to the current RecordBatch (as
defined in Message.fbs) that allows for encoded or compressed columns.
This won't be an overnight change (more on the order of months of
work), but it's worth taking the time to carefully consider the
implications of developing and supporting such a feature for the long
term.



Re: Adding a new encoding for FP data

2019-07-11 Thread Fan Liya
Hi Radev,

Thanks a lot for providing so many technical details. I need to read them
carefully.

I think FP encoding is definitely a useful feature.
I hope this feature can be implemented in Arrow soon, so that we can use it
in our system.

Best,
Liya Fan



Re: Adding a new encoding for FP data

2019-07-11 Thread Radev, Martin
Hello Liya Fan,


this explains the technique but for a more complex case:

https://fgiesen.wordpress.com/2011/01/24/x86-code-compression-in-kkrunchy/

For FP data, the approach which seemed to be the best is the following.

Say we have a buffer of two 32-bit floating point values:

buf = [af, bf]

We interpret each FP value as a 32-bit uint and look at each individual byte. 
We have 8 bytes in total for this small input.

buf = [af0, af1, af2, af3, bf0, bf1, bf2, bf3]

Then we apply stream splitting and the new buffer becomes:

newbuf = [af0, bf0, af1, bf1, af2, bf2, af3, bf3]

We compress newbuf.

Due to similarities in the sign bits, mantissa bits and MSB exponent bits, we 
might have a lot more repetitions in the data. For scientific data, the 2nd and 
3rd bytes of each 32-bit value are probably largely noise. Thus in the original 
representation we would always have a few bytes of data which could appear 
somewhere else in the buffer, followed by a couple of bytes of likely noise. In 
the new representation we have a long stream of data which could compress well, 
followed by a sequence of noise towards the end.
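
The transformation described above can be sketched in a few lines of Python (a minimal illustration of the byte layout only, not the actual parquet-cpp patch; the function names are made up for this example):

```python
import struct

def split_streams(values):
    # Pack the float32 values into their raw little-endian bytes.
    raw = struct.pack(f"<{len(values)}f", *values)
    n = len(values)
    # Gather byte k of every value: all byte-0s first, then all byte-1s, etc.
    # For buf = [af, bf] this yields [af0, bf0, af1, bf1, af2, bf2, af3, bf3].
    return bytes(raw[i * 4 + k] for k in range(4) for i in range(n))

def merge_streams(split, n):
    # Inverse transform: put byte k of value i back at position i*4 + k.
    raw = bytes(split[k * n + i] for i in range(n) for k in range(4))
    return list(struct.unpack(f"<{n}f", raw))
```

The transform is a pure byte permutation, so it is exactly invertible and leaves the buffer size unchanged; only the compressibility of the bytes changes.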

This transformation improved compression ratio as can be seen in the report.

It also improved speed for ZSTD. This could be because ZSTD decides how to 
compress the data: RLE, a new Huffman tree, the Huffman tree of the previous 
frame, or a raw representation. Each option can achieve a different compression 
ratio and compression/decompression speed. It turned out that when the 
transformation is applied, zstd would attempt to compress fewer frames and copy 
the others. This could lead to fewer attempts to build a Huffman tree. It's 
hard to pinpoint the exact reason.

I did not try other general-purpose lossless compressors, but I expect similar results.
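
As a rough way to see the effect on another compressor, here is a self-contained sketch that applies the split to synthetic float32 data and compresses both layouts with zlib (standing in for the compressors discussed; the synthetic signal and exact sizes are assumptions for illustration, and numbers will differ per compressor and dataset):

```python
import random
import struct
import zlib

random.seed(42)
# Synthetic "scientific" data: a smooth trend plus small noise, so the
# sign/exponent bytes repeat a lot while the low mantissa bytes are noisy.
values = [100.0 + 0.05 * i + random.uniform(-0.01, 0.01) for i in range(4096)]
raw = struct.pack(f"<{len(values)}f", *values)

n = len(values)
# Byte stream splitting: concatenate all byte-0s, then byte-1s, and so on.
split = bytes(raw[i * 4 + k] for k in range(4) for i in range(n))

plain_size = len(zlib.compress(raw, 9))
split_size = len(zlib.compress(split, 9))
print(f"zlib on raw bytes:   {plain_size}")
print(f"zlib on split bytes: {split_size}")
```

On data like this the split layout compresses noticeably better, because the noisy low-order bytes end up in contiguous runs instead of interrupting the repetitive high-order bytes.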

For code, I can polish my patches, create a Jira task and submit the patches 
for review.


Regards,

Martin



From: Fan Liya 
Sent: Thursday, July 11, 2019 11:32:53 AM
To: dev@arrow.apache.org
Cc: Raoofy, Amir; Karlstetter, Roman
Subject: Re: Adding a new encoding for FP data

Hi Radev,

Thanks for the information. It seems interesting.
IMO, Arrow has much to do for data compression. However, there seem to be some
differences between in-memory data compression and external-storage data
compression.

Could you please provide some reference for stream splitting?

Best,
Liya Fan

On Thu, Jul 11, 2019 at 5:15 PM Radev, Martin  wrote:

> Hello people,
>
>
> there has been discussion in the Apache Parquet mailing list on adding a
> new encoder for FP data.
> The reason for this is that the compressors supported by Apache Parquet
> (zstd, gzip, etc.) do not compress raw FP data well.
>
>
> In my investigation it turns out that a very simple technique,
> named stream splitting, can improve the compression ratio and even speed
> for some of the compressors.
>
> You can read about the results here:
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
>
>
> I went through the developer guide for Apache Arrow and wrote a patch to
> add the new encoding and test coverage for it.
>
> I will polish my patch and work in parallel to extend the Apache Parquet
> format for the new encoding.
>
>
> If you have any concerns, please let me know.
>
>
> Regards,
>
> Martin
>
>