Micah: I've added a format doc now: https://github.com/mwlon/pcodec/blob/main/docs/format.md. Would appreciate any feedback or thoughts on it.
On Thu, Jan 11, 2024 at 11:47 PM Micah Kornfield <[email protected]> wrote:

> > Pco could technically work as a Parquet encoding, but people are wary of
> > its newness and weak FFI support. It seems there is no immediate action to
> > take, but would be worthwhile to consider this again further in the future.
>
> I guess I'm more optimistic on the potential gaps. I think if there were a
> spec that allowed one to code it from scratch, I'd be willing to take a
> crack at seeing what it would take for another implementation in either
> Java or C++. (I looked at the links you provided but they were somewhat too
> high-level). I think having a spec would also guard against the "newness"
> concern.
>
> I can't say there wouldn't be other technical blockers but at least this
> would be someplace to start?
>
> Cheers,
> Micah
>
> On Thu, Jan 11, 2024 at 7:21 PM Martin Loncaric <[email protected]> wrote:
>
> > (Oops, the repeating binary decimal is 1100... with period 4, so exactly
> > 2 bits of entropy for the 52 mantissa bits. The argument is the same
> > though.)
> >
> > On Thu, Jan 11, 2024 at 10:02 PM Martin Loncaric <[email protected]> wrote:
> >
> > > To reach a conclusion on this thread, I understand the overall
> > > sentiment as:
> > >
> > > Pco could technically work as a Parquet encoding, but people are wary
> > > of its newness and weak FFI support. It seems there is no immediate
> > > action to take, but would be worthwhile to consider this again further
> > > in the future.
> > >
> > > On Thu, Jan 11, 2024 at 9:47 PM Martin Loncaric <[email protected]> wrote:
> > >
> > > > > I must admit I'm a bit surprised by these results. The first thing
> > > > > is that the Pcodec results were actually obtained using dictionary
> > > > > encoding. Then I don't understand what is Pcodec-encoded: the
> > > > > dictionary values or the dictionary indices?
> > > >
> > > > No, pco cannot be dictionary encoded; it only goes from vec<T> ->
> > > > Bytes and back. Some of Parquet's existing encodings are like this
> > > > as well.
> > > >
> > > > > The second thing is that the BYTE_STREAM_SPLIT + Zstd results are
> > > > > much worse than the PLAIN + Zstd results, which is unexpected
> > > > > (though not impossible).
> > > >
> > > > I explained briefly in the blog post, but BYTE_STREAM_SPLIT does
> > > > terribly for this data because there is high correlation among each
> > > > number's bytes. For instance, if each double is a multiple of 0.1,
> > > > then the 52 mantissa bits will look like 011011011011011... (011
> > > > repeating). That means there are only 3 possibilities (<2 bits of
> > > > entropy) for the last 6+ bytes of each number. BYTE_STREAM_SPLIT
> > > > throws this away, requiring 6+ times as many bits for them.
> > > >
> > > > On Mon, Jan 8, 2024 at 10:44 AM Antoine Pitrou <[email protected]> wrote:
> > > >
> > > > > Hello Martin,
> > > > >
> > > > > On Sat, 6 Jan 2024 17:09:07 -0500
> > > > > Martin Loncaric <[email protected]> wrote:
> > > > >
> > > > > > > It would be very interesting to expand the comparison against
> > > > > > > BYTE_STREAM_SPLIT + compression.
> > > > > >
> > > > > > Antoine: I created one now, at the bottom of the post
> > > > > > <https://graphallthethings.com/posts/the-parquet-we-could-have>.
> > > > > > In this case, BYTE_STREAM_SPLIT did worse.
> > > > >
> > > > > I must admit I'm a bit surprised by these results. The first thing
> > > > > is that the Pcodec results were actually obtained using dictionary
> > > > > encoding. Then I don't understand what is Pcodec-encoded: the
> > > > > dictionary values or the dictionary indices?
> > > > >
> > > > > The second thing is that the BYTE_STREAM_SPLIT + Zstd results are
> > > > > much worse than the PLAIN + Zstd results, which is unexpected
> > > > > (though not impossible).
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
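
P.S. For anyone skimming the quoted thread above, the mantissa/entropy argument is easy to check empirically. Here is a minimal, self-contained Rust sketch (illustrative only; it uses nothing from pco or Parquet, and the value range 1..1000 is made up for the example). It prints the 52 mantissa bits of a few multiples of 0.1 and counts how many distinct patterns the low 6 bytes of each double actually take.

use std::collections::HashSet;

fn main() {
    // Nearest doubles to i/10 for i = 1..1000, i.e. multiples of 0.1.
    let nums: Vec<f64> = (1..1000).map(|i| i as f64 / 10.0).collect();

    // Print a few 52-bit mantissas; note the period-4 repeating tail
    // (cyclic shifts of 1100), give or take rounding in the last bits.
    for &x in &nums[..4] {
        let mantissa = x.to_bits() & ((1u64 << 52) - 1);
        println!("{:>4}: {:052b}", x, mantissa);
    }

    // Count distinct values of the low 6 bytes (48 bits) of each double's
    // bit pattern. Only a handful of patterns appear, nowhere near the
    // 2^48 that 48 independent-looking bits could take.
    let distinct: HashSet<[u8; 6]> = nums
        .iter()
        .map(|x| {
            let b = x.to_bits().to_le_bytes();
            [b[0], b[1], b[2], b[3], b[4], b[5]]
        })
        .collect();
    println!("distinct low-6-byte patterns: {}", distinct.len());
}

In other words, the six trailing bytes of each number jointly carry only about 2 bits of information, but the split layout moves each number's correlated bytes far apart, so a general-purpose compressor like Zstd can no longer exploit that correlation and ends up spending several times as many bits on them, which is the gap Martin describes.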
