My personal sentiment is: not only its newness, but the fact that it is
1) highly non-trivial (it seems much more complicated than all other
   Parquet encodings);
2) maintained by a single person, both the spec and the implementation
   (please correct me if I'm wrong?); and
3) has little to no adoption currently (again, please correct me if I'm
   wrong?).

Of course the adoption issue is a chicken-and-egg problem, but given that
Parquet files are used for long-term storage (not just transient data),
it's probably not a good idea to be an early adopter here. And of course,
if the encoding were simpler, points 2 and 3 wouldn't really hurt.

This is just my opinion!

Regards

Antoine.


On Thu, 11 Jan 2024 22:02:02 -0500
Martin Loncaric <m.w.lonca...@gmail.com> wrote:

> To reach a conclusion on this thread, I understand the overall sentiment
> as:
>
> Pco could technically work as a Parquet encoding, but people are wary of
> its newness and weak FFI support. It seems there is no immediate action
> to take, but it would be worthwhile to consider this again in the future.
>
> On Thu, Jan 11, 2024 at 9:47 PM Martin Loncaric <m.w.lonca...@gmail.com>
> wrote:
>
> > > I must admit I'm a bit surprised by these results. The first thing is
> > > that the Pcodec results were actually obtained using dictionary
> > > encoding. Then I don't understand what is Pcodec-encoded: the
> > > dictionary values or the dictionary indices?
> >
> > No, pco cannot be dictionary encoded; it only goes from Vec<T> -> Bytes
> > and back. Some of Parquet's existing encodings are like this as well.
> >
> > > The second thing is that the BYTE_STREAM_SPLIT + Zstd results are much
> > > worse than the PLAIN + Zstd results, which is unexpected (though not
> > > impossible).
> >
> > I explained briefly in the blog post, but BYTE_STREAM_SPLIT does
> > terribly for this data because there is high correlation among each
> > number's bytes. For instance, if each double is a multiple of 0.1, then
> > the 52 mantissa bits will look like 0011001100110011... (0011
> > repeating). That means there are only 4 possibilities (about 2 bits of
> > entropy) for the last 6+ bytes of each number. BYTE_STREAM_SPLIT throws
> > this correlation away, requiring 6+ times as many bits for them.
> >
> > On Mon, Jan 8, 2024 at 10:44 AM Antoine Pitrou
> > <antoine-+zn9apsxkcfqfi55v6+...@public.gmane.org> wrote:
> >
> > > Hello Martin,
> > >
> > > On Sat, 6 Jan 2024 17:09:07 -0500
> > > Martin Loncaric <m.w.lonca...@gmail.com> wrote:
> > > > > It would be very interesting to expand the comparison against
> > > > > BYTE_STREAM_SPLIT + compression.
> > > >
> > > > Antoine: I created one now, at the bottom of the post
> > > > <https://graphallthethings.com/posts/the-parquet-we-could-have>.
> > > > In this case, BYTE_STREAM_SPLIT did worse.
> > >
> > > I must admit I'm a bit surprised by these results. The first thing is
> > > that the Pcodec results were actually obtained using dictionary
> > > encoding. Then I don't understand what is Pcodec-encoded: the
> > > dictionary values or the dictionary indices?
> > >
> > > The second thing is that the BYTE_STREAM_SPLIT + Zstd results are much
> > > worse than the PLAIN + Zstd results, which is unexpected (though not
> > > impossible).
> > >
> > > Regards
> > >
> > > Antoine.
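
To make the mantissa argument above concrete, here is a minimal Rust
sketch (illustrative only; the mantissa_bits helper and the sample values
are assumptions, not code from the thread or from pco). It prints the
repeating 4-bit mantissa pattern for multiples of 0.1, then emulates the
byte regrouping that BYTE_STREAM_SPLIT performs, which is the step that
discards the within-value correlation described above:

    // Illustrative sketch, not from the thread: inspect the mantissa bits
    // of multiples of 0.1 and emulate BYTE_STREAM_SPLIT's byte regrouping.

    fn mantissa_bits(x: f64) -> String {
        // Low 52 bits of the IEEE 754 double representation.
        let mantissa = x.to_bits() & ((1u64 << 52) - 1);
        format!("{:052b}", mantissa)
    }

    fn main() {
        // Each mantissa is a phase of the repeating 4-bit pattern 0011
        // (with rounding in the last couple of bits).
        for i in [1, 3, 7, 9] {
            let x = i as f64 * 0.1;
            println!("{:.1}: {}", x, mantissa_bits(x));
        }

        // BYTE_STREAM_SPLIT: byte j of every value goes into stream j, so
        // a per-stream compressor never sees the correlation *within* one
        // value's eight bytes.
        let values: Vec<f64> = [0.1, 0.3, 0.7, 0.9].to_vec();
        let mut streams = vec![Vec::<u8>::new(); 8];
        for v in &values {
            for (j, b) in v.to_le_bytes().iter().enumerate() {
                streams[j].push(*b);
            }
        }
        for (j, s) in streams.iter().enumerate() {
            println!("stream {}: {:02x?}", j, s);
        }
    }

Running this shows four phase strings in the first loop, and each of the
six low-order streams cycling through only about four byte values (~2 bits
each under entropy coding), versus roughly 2 bits for the whole suffix if
it were modeled jointly: the "6+ times as many bits" figure above.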