>
> I must admit I'm a bit surprised by these results. The first thing is
> that the Pcodec results were actually obtained using dictionary
> encoding. Then I don't understand what gets Pcodec-encoded: the
> dictionary values or the dictionary indices?


No, the Pcodec results did not use dictionary encoding; pco cannot be
combined with it. It only goes from Vec<T> -> bytes and back. Some of
Parquet's existing encodings are like this as well.
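
For reference, this is roughly what that interface looks like via pco's
standalone Rust API. I'm writing the names simpler_compress,
simple_decompress, and DEFAULT_COMPRESSION_LEVEL from memory of the pco
README, so treat the exact signatures as assumptions:

    use pco::standalone::{simple_decompress, simpler_compress};
    use pco::DEFAULT_COMPRESSION_LEVEL;

    fn main() {
        // pco consumes a whole slice of numbers and emits opaque bytes.
        let nums: Vec<f64> = (0..1000).map(|i| i as f64 * 0.1).collect();
        let compressed =
            simpler_compress(&nums, DEFAULT_COMPRESSION_LEVEL).unwrap();
        // Decompression recovers the full Vec<T>; there is no per-value
        // or dictionary-index entry point to hook into.
        let recovered: Vec<f64> = simple_decompress(&compressed).unwrap();
        assert_eq!(nums, recovered);
    }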

> The second thing is that the BYTE_STREAM_SPLIT + Zstd results are much
> worse than the PLAIN + Zstd results, which is unexpected (though not
> impossible).


I explained briefly in the blog post, but BYTE_STREAM_SPLIT does terribly
for this data because there is high correlation among each number's bytes.
For instance, if each double is a multiple of 0.1, then the 52 mantissa
bits will look like 0011001100110011... (0011 repeating, the binary
expansion of 1/10). That means there are only 4 possibilities (2 bits of
entropy) for the last 6+ bytes of each number. BYTE_STREAM_SPLIT throws
this correlation away, requiring 6+ times as many bits for them.
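
To see this concretely, here is a minimal Rust sketch (the sample values
are just illustrative):

    fn main() {
        // Multiples of 0.1 inherit the repeating binary expansion of
        // 1/10, so their trailing mantissa bytes are all one of a few
        // rotations of 0b00110011: 0x33, 0x66, 0x99, or 0xcc.
        // e.g. 0.1 prints 3fb999999999999a; 0.3 prints 3fd3333333333333.
        // Zstd on PLAIN data can model these repeated bytes, but
        // BYTE_STREAM_SPLIT scatters them across separate streams,
        // one per byte position.
        for x in [0.1f64, 0.2, 0.3, 0.6, 0.7, 1.3] {
            println!("{x}: {:016x}", x.to_bits());
        }
    }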

On Mon, Jan 8, 2024 at 10:44 AM Antoine Pitrou <[email protected]> wrote:

>
> Hello Martin,
>
> On Sat, 6 Jan 2024 17:09:07 -0500
> Martin Loncaric <[email protected]>
> wrote:
> > >
> > > It would be very interesting to expand the comparison against
> > > BYTE_STREAM_SPLIT + compression.
> >
> > Antoine: I created one now, at the bottom of the post
> > <https://graphallthethings.com/posts/the-parquet-we-could-have>. In this
> > case, BYTE_STREAM_SPLIT did worse.
>
> I must admit I'm a bit surprised by these results. The first thing is
> that the Pcodec results were actually obtained using dictionary
> encoding. Then I don't understand what gets Pcodec-encoded: the
> dictionary values or the dictionary indices?
>
> The second thing is that the BYTE_STREAM_SPLIT + Zstd results are much
> worse than the PLAIN + Zstd results, which is unexpected (though not
> impossible).
>
> Regards
>
> Antoine.
