My personal sentiment is: not only its newness, but the fact that it is
1) highly non-trivial (it seems much more complicated than all other
   Parquet encodings);
2) maintained by a single person, both the spec and the implementation
   (please correct me if I'm wrong?); and
3) has little to no adoption currently (again, please correct me if I'm
   wrong?).

Of course the adoption issue is a chicken-and-egg problem, but given that
Parquet files are used for long-term storage (not just transient data),
it's probably not a good idea to be an early adopter here. And of course,
if the encoding were simpler, points 2 and 3 wouldn't really hurt.

This is just my opinion!

Regards

Antoine.


On Thu, 11 Jan 2024 22:02:02 -0500
Martin Loncaric <m.w.lonca...@gmail.com> wrote:

> To reach a conclusion on this thread, I understand the overall sentiment
> as:
>
> Pco could technically work as a Parquet encoding, but people are wary of
> its newness and weak FFI support. It seems there is no immediate action
> to take, but it would be worthwhile to consider this again in the future.
>
> On Thu, Jan 11, 2024 at 9:47 PM Martin Loncaric <m.w.lonca...@gmail.com>
> wrote:
>
> > > I must admit I'm a bit surprised by these results. The first thing is
> > > that the Pcodec results were actually obtained using dictionary
> > > encoding. Then I don't understand what is Pcodec-encoded: the
> > > dictionary values or the dictionary indices?
> >
> > No, pco cannot be dictionary encoded; it only goes from Vec<T> -> Bytes
> > and back. Some of Parquet's existing encodings are like this as well.
> >
> > > The second thing is that the BYTE_STREAM_SPLIT + Zstd results are much
> > > worse than the PLAIN + Zstd results, which is unexpected (though not
> > > impossible).
> >
> > I explained briefly in the blog post, but BYTE_STREAM_SPLIT does
> > terribly for this data because there is high correlation among each
> > number's bytes. For instance, if each double is a multiple of 0.1, then
> > the 52 mantissa bits will look like 0011001100110011... (0011
> > repeating). That means there are only 4 possibilities (about 2 bits of
> > entropy) for the last 6+ bytes of each number. BYTE_STREAM_SPLIT throws
> > this correlation away, requiring 6+ times as many bits for them.
> >
> > On Mon, Jan 8, 2024 at 10:44 AM Antoine Pitrou
> > <antoine-+zn9apsxkcfqfi55v6+...@public.gmane.org> wrote:
> >
> > > Hello Martin,
> > >
> > > On Sat, 6 Jan 2024 17:09:07 -0500
> > > Martin Loncaric <m.w.lonca...@gmail.com> wrote:
> > > > > It would be very interesting to expand the comparison against
> > > > > BYTE_STREAM_SPLIT + compression.
> > > >
> > > > Antoine: I created one now, at the bottom of the post
> > > > <https://graphallthethings.com/posts/the-parquet-we-could-have>.
> > > > In this case, BYTE_STREAM_SPLIT did worse.
> > >
> > > I must admit I'm a bit surprised by these results. The first thing is
> > > that the Pcodec results were actually obtained using dictionary
> > > encoding. Then I don't understand what is Pcodec-encoded: the
> > > dictionary values or the dictionary indices?
> > >
> > > The second thing is that the BYTE_STREAM_SPLIT + Zstd results are much
> > > worse than the PLAIN + Zstd results, which is unexpected (though not
> > > impossible).
> > >
> > > Regards
> > >
> > > Antoine.
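
To make the mantissa argument above concrete, here is a minimal Rust
sketch (illustrative only; the mantissa_bits helper and the sample values
are assumptions, not code from the thread or from pco). It prints the
repeating 4-bit mantissa pattern for multiples of 0.1, then emulates the
byte regrouping that BYTE_STREAM_SPLIT performs, which is the step that
discards the within-value correlation described above:

    // Illustrative sketch, not from the thread: inspect the mantissa bits
    // of multiples of 0.1 and emulate BYTE_STREAM_SPLIT's byte regrouping.

    fn mantissa_bits(x: f64) -> String {
        // Low 52 bits of the IEEE 754 double representation.
        let mantissa = x.to_bits() & ((1u64 << 52) - 1);
        format!("{:052b}", mantissa)
    }

    fn main() {
        // Each mantissa is a phase of the repeating 4-bit pattern 0011
        // (with rounding in the last couple of bits).
        for i in [1, 3, 7, 9] {
            let x = i as f64 * 0.1;
            println!("{:.1}: {}", x, mantissa_bits(x));
        }

        // BYTE_STREAM_SPLIT: byte j of every value goes into stream j, so
        // a per-stream compressor never sees the correlation *within* one
        // value's eight bytes.
        let values: Vec<f64> = [0.1, 0.3, 0.7, 0.9].to_vec();
        let mut streams = vec![Vec::<u8>::new(); 8];
        for v in &values {
            for (j, b) in v.to_le_bytes().iter().enumerate() {
                streams[j].push(*b);
            }
        }
        for (j, s) in streams.iter().enumerate() {
            println!("stream {}: {:02x?}", j, s);
        }
    }

Running this shows four phase strings in the first loop, and each of the
six low-order streams cycling through only about four byte values (~2 bits
each under entropy coding), versus roughly 2 bits for the whole suffix if
it were modeled jointly: the "6+ times as many bits" figure above.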