(Oops, the repeating binary expansion is 1100... with period 4, not 011 with period 3. That gives 4 possible alignments of the pattern, so exactly 2 bits of entropy for the repeating part of the 52 mantissa bits. The argument is the same though.)
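For anyone who wants to check the bit pattern, here is a minimal sketch in Python. The helper name `mantissa_bits` and the choice of sampling mantissa bits 8..11 are my own, for illustration only; nothing here comes from pco or Parquet. It assumes IEEE 754 doubles and skips multiples of 0.5, which are exact in binary and have no repeating tail.

```python
import struct


def mantissa_bits(x: float) -> str:
    """Return the 52 explicit mantissa bits of an IEEE 754 double, MSB first."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    return format(raw & ((1 << 52) - 1), "052b")


# 0.1 -> "1001" repeated 12 times, then "1010" (the final bits are rounded).
print(mantissa_bits(0.1))

# Sample a 4-bit window well past the integer part for many multiples of 0.1.
# Multiples of 0.5 (k divisible by 5) are skipped because they are exact.
phases = {
    mantissa_bits(k / 10)[8:12]
    for k in range(1, 1000)
    if k % 5 != 0
}
print(sorted(phases))  # the four rotations of 0011
```

Only four windows show up, one per alignment of the period-4 pattern, which is where the log2(4) = 2 bits comes from.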
On Thu, Jan 11, 2024 at 10:02 PM Martin Loncaric <[email protected]> wrote:

> To reach a conclusion on this thread, I understand the overall sentiment
> as:
>
> Pco could technically work as a Parquet encoding, but people are wary of
> its newness and weak FFI support. It seems there is no immediate action to
> take, but would be worthwhile to consider this again further in the future.
>
> On Thu, Jan 11, 2024 at 9:47 PM Martin Loncaric <[email protected]>
> wrote:
>
>>> I must admit I'm a bit surprised by these results. The first thing is
>>> that the Pcodec results were actually obtained using dictionary
>>> encoding. Then I don't understand what is Pcodec-encoded: the dictionary
>>> values or the dictionary indices?
>>
>> No, pco cannot be dictionary encoded; it only goes from vec<T> -> Bytes
>> and back. Some of Parquet's existing encodings are like this as well.
>>
>>> The second thing is that the BYTE_STREAM_SPLIT + Zstd results are much
>>> worse than the PLAIN + Zstd results, which is unexpected (though not
>>> impossible).
>>
>> I explained briefly in the blog post, but BYTE_STREAM_SPLIT does terribly
>> for this data because there is high correlation among each number's bytes.
>> For instance, if each double is a multiple of 0.1, then the 52 mantissa
>> bits will look like 011011011011011... (011 repeating). That means there
>> are only 3 possibilities (<2 bits of entropy) for the last 6+ bytes of each
>> number. BYTE_STREAM_SPLIT throws this away, requiring 6+ times as many bits
>> for them.
>>
>> On Mon, Jan 8, 2024 at 10:44 AM Antoine Pitrou <[email protected]>
>> wrote:
>>
>>> Hello Martin,
>>>
>>> On Sat, 6 Jan 2024 17:09:07 -0500
>>> Martin Loncaric <[email protected]>
>>> wrote:
>>> > > It would be very interesting to expand the comparison against
>>> > > BYTE_STREAM_SPLIT + compression.
>>> >
>>> > Antoine: I created one now, at the bottom of the post
>>> > <https://graphallthethings.com/posts/the-parquet-we-could-have>. In this
>>> > case, BYTE_STREAM_SPLIT did worse.
>>>
>>> I must admit I'm a bit surprised by these results. The first thing is
>>> that the Pcodec results were actually obtained using dictionary
>>> encoding. Then I don't understand what is Pcodec-encoded: the dictionary
>>> values or the dictionary indices?
>>>
>>> The second thing is that the BYTE_STREAM_SPLIT + Zstd results are much
>>> worse than the PLAIN + Zstd results, which is unexpected (though not
>>> impossible).
>>>
>>> Regards
>>>
>>> Antoine.
