On Mon, Jan 8, 2024 at 10:44 AM Antoine Pitrou <[email protected]> wrote:
> Hello Martin,
>
> On Sat, 6 Jan 2024 17:09:07 -0500
> Martin Loncaric <[email protected]> wrote:
> > > It would be very interesting to expand the comparison against
> > > BYTE_STREAM_SPLIT + compression.
> >
> > Antoine: I created one now, at the bottom of the post
> > <https://graphallthethings.com/posts/the-parquet-we-could-have>. In
> > this case, BYTE_STREAM_SPLIT did worse.
>
> I must admit I'm a bit surprised by these results. The first thing is
> that the Pcodec results were actually obtained using dictionary
> encoding. Then I don't understand what is Pcodec-encoded: the
> dictionary values or the dictionary indices?

No, pco cannot be dictionary encoded; it only goes from vec<T> -> Bytes
and back. Some of Parquet's existing encodings are like this as well.

> The second thing is that the BYTE_STREAM_SPLIT + Zstd results are much
> worse than the PLAIN + Zstd results, which is unexpected (though not
> impossible).

I explained briefly in the blog post, but BYTE_STREAM_SPLIT does
terribly for this data because there is high correlation among each
number's bytes. For instance, if each double is a multiple of 0.1, then
the 52 mantissa bits will look like 1001100110011001... (1001
repeating). That means there are only about 4 possibilities (roughly 2
bits of entropy) for the last 6+ bytes of each number.
BYTE_STREAM_SPLIT throws this correlation away, requiring 6+ times as
many bits for them.
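[Editor's note: to make the mantissa claim concrete, here is a small
self-contained Rust demo, illustrative and not from the thread, that
prints the 52 stored mantissa bits of a few multiples of 0.1. For 0.1
itself the bits are "1001" repeated twelve times, ending in 1010 from
rounding; most other multiples are rotations of the same block, while
exactly representable values like 0.5 come out as all zeros.]

    // Illustrative demo: print the 52 stored mantissa bits of some
    // multiples of 0.1. The trailing bits of each value are (mostly)
    // a rotation of the same 4-bit repeating block, so the last 6+
    // bytes carry only ~2 bits of entropy in total.
    fn main() {
        for i in 1..=8u32 {
            let x = f64::from(i) * 0.1;
            // Keep only the low 52 of the 64 bits: the mantissa field.
            let mantissa = x.to_bits() & ((1u64 << 52) - 1);
            println!("{:.1} -> {:052b}", x, mantissa);
        }
    }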

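[Editor's note: for context, the gist of BYTE_STREAM_SPLIT is sketched
below in illustrative Rust, not the actual parquet implementation. The
i-th byte of every value is grouped into its own stream before
general-purpose compression, which helps when byte positions are
individually low-entropy but discards the correlation between a single
value's own bytes.]

    // Illustrative sketch of the BYTE_STREAM_SPLIT idea: transpose the
    // values' bytes so that stream i holds the i-th byte of every
    // value. A general-purpose codec like Zstd then sees each byte
    // position separately, losing cross-byte correlation within a
    // single value.
    fn byte_stream_split(values: &[f64]) -> [Vec<u8>; 8] {
        let mut streams: [Vec<u8>; 8] = Default::default();
        for v in values {
            for (i, b) in v.to_le_bytes().iter().enumerate() {
                streams[i].push(*b);
            }
        }
        streams
    }

    fn main() {
        let values: Vec<f64> = (1..=4).map(|i| i as f64 * 0.1).collect();
        for (i, s) in byte_stream_split(&values).iter().enumerate() {
            // Each stream holds one byte position across all values.
            println!("stream {}: {:02x?}", i, s);
        }
    }

On the multiples-of-0.1 example, each stream is low-entropy on its own,
but the codec can no longer see that a value's trailing bytes all share
one phase of the repeating block, which is the correlation the reply
above describes.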