Re: Pitch for Pcodec Encoding in Parquet

Martin Loncaric Thu, 11 Jan 2024 19:21:02 -0800

(Oops, the repeating binary pattern is 1100... with period 4, so there are
exactly 2 bits of entropy for the 52 mantissa bits. The argument is the same
though.)
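
For anyone who wants to check the period-4 pattern themselves, here is a
minimal sketch using only the Python standard library. The helper names
(mantissa_bits, has_period_4) and the skip/trim cutoffs are my own choices for
illustration, not anything from pco or Parquet. The first few mantissa bits
depend on the value's magnitude and the last few can be changed by rounding,
so the check ignores both ends.

```python
import struct


def mantissa_bits(x: float) -> str:
    """Return the 52 explicit mantissa bits of an IEEE 754 double."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    return format(raw & ((1 << 52) - 1), "052b")


def has_period_4(bits: str, skip: int = 8, trim: int = 4) -> bool:
    """Check that `bits` repeats with period 4.

    Ignores the first `skip` bits (they depend on the magnitude) and the
    last `trim` bits (they may be altered by round-to-nearest).
    """
    body = bits[skip:len(bits) - trim]
    return all(body[i] == body[i + 4] for i in range(len(body) - 4))


# A few multiples of 0.1 (assumed small, so that 8 leading bits are
# enough to get past the integer part).
values = [k / 10 for k in (1, 3, 7, 9, 11, 13)]

for v in values:
    bits = mantissa_bits(v)
    print(f"{v:>4}: {bits}  period 4: {has_period_4(bits)}")

# The "phase" of the repeating pattern, sampled at a fixed bit offset.
phases = {mantissa_bits(v)[8:12] for v in values}
print("distinct phases:", sorted(phases))
```

The distinct phases should all be rotations of 1100, which is where the
2 bits of entropy come from: once the phase is known, the rest of the
mantissa tail is determined. Splitting the bytes into separate streams hides
that relationship from the downstream compressor.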


On Thu, Jan 11, 2024 at 10:02 PM Martin Loncaric <[email protected]>
wrote:

> To reach a conclusion on this thread, I understand the overall sentiment
> as:
>
> Pco could technically work as a Parquet encoding, but people are wary of
> its newness and weak FFI support. It seems there is no immediate action to
> take, but it would be worthwhile to consider this again later on.
>
> On Thu, Jan 11, 2024 at 9:47 PM Martin Loncaric <[email protected]>
> wrote:
>
>>> I must admit I'm a bit surprised by these results. The first thing is
>>> that the Pcodec results were actually obtained using dictionary
>>> encoding. Then I don't understand what is Pcodec-encoded: the dictionary
>>> values or the dictionary indices?
>>
>>
>> No, pco cannot be dictionary encoded; it only goes from vec<T> -> Bytes
>> and back. Some of Parquet's existing encodings are like this as well.
>>
>>> The second thing is that the BYTE_STREAM_SPLIT + Zstd results are much
>>> worse than the PLAIN + Zstd results, which is unexpected (though not
>>> impossible).
>>
>>
>> I explained briefly in the blog post, but BYTE_STREAM_SPLIT does terribly
>> for this data because there is high correlation among each number's bytes.
>> For instance, if each double is a multiple of 0.1, then the 52 mantissa
>> bits will look like 011011011011011... (011 repeating). That means there
>> are only 3 possibilities (<2 bits of entropy) for the last 6+ bytes of each
>> number. BYTE_STREAM_SPLIT throws this away, requiring 6+ times as many bits
>> for them.
>>
>> On Mon, Jan 8, 2024 at 10:44 AM Antoine Pitrou <[email protected]>
>> wrote:
>>
>>>
>>> Hello Martin,
>>>
>>> On Sat, 6 Jan 2024 17:09:07 -0500
>>> Martin Loncaric <[email protected]>
>>> wrote:
>>> > >
>>> > > It would be very interesting to expand the comparison against
>>> > > BYTE_STREAM_SPLIT + compression.
>>> >
>>> > Antoine: I created one now, at the bottom of the post
>>> > <https://graphallthethings.com/posts/the-parquet-we-could-have>. In this
>>> > case, BYTE_STREAM_SPLIT did worse.
>>>
>>> I must admit I'm a bit surprised by these results. The first thing is
>>> that the Pcodec results were actually obtained using dictionary
>>> encoding. Then I don't understand what is Pcodec-encoded: the dictionary
>>> values or the dictionary indices?
>>>
>>> The second thing is that the BYTE_STREAM_SPLIT + Zstd results are much
>>> worse than the PLAIN + Zstd results, which is unexpected (though not
>>> impossible).
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>>
>>>
