Micah: I've added a format doc now: https://github.com/mwlon/pcodec/blob/main/docs/format.md. Would appreciate any feedback or thoughts on it.
On Thu, Jan 11, 2024 at 11:47 PM Micah Kornfield <[email protected]> wrote:

> > Pco could technically work as a Parquet encoding, but people are wary of
> > its newness and weak FFI support. It seems there is no immediate action to
> > take, but would be worthwhile to consider this again further in the future.
>
> I guess I'm more optimistic on the potential gaps. I think if there were a
> spec that allowed one to code it from scratch, I'd be willing to take a
> crack at seeing what it would take for another implementation in either
> Java or C++. (I looked at the links you provided but they were somewhat too
> high-level). I think having a spec would also guard against the "newness"
> concern.
>
> I can't say there wouldn't be other technical blockers but at least this
> would be someplace to start?
>
> Cheers,
> Micah
>
> On Thu, Jan 11, 2024 at 7:21 PM Martin Loncaric <[email protected]> wrote:
>
> > (Oops, the repeating binary decimal is 1100... with period 4, so exactly
> > 2 bits of entropy for the 52 mantissa bits. The argument is the same
> > though.)
> >
> > On Thu, Jan 11, 2024 at 10:02 PM Martin Loncaric <[email protected]> wrote:
> >
> > > To reach a conclusion on this thread, I understand the overall
> > > sentiment as:
> > >
> > > Pco could technically work as a Parquet encoding, but people are wary
> > > of its newness and weak FFI support. It seems there is no immediate
> > > action to take, but would be worthwhile to consider this again further
> > > in the future.
> > >
> > > On Thu, Jan 11, 2024 at 9:47 PM Martin Loncaric <[email protected]> wrote:
> > >
> > > > > I must admit I'm a bit surprised by these results. The first thing
> > > > > is that the Pcodec results were actually obtained using dictionary
> > > > > encoding. Then I don't understand what is Pcodec-encoded: the
> > > > > dictionary values or the dictionary indices?
> > > >
> > > > No, pco cannot be dictionary encoded; it only goes from vec<T> ->
> > > > Bytes and back. Some of Parquet's existing encodings are like this
> > > > as well.
> > > >
> > > > > The second thing is that the BYTE_STREAM_SPLIT + Zstd results are
> > > > > much worse than the PLAIN + Zstd results, which is unexpected
> > > > > (though not impossible).
> > > >
> > > > I explained briefly in the blog post, but BYTE_STREAM_SPLIT does
> > > > terribly for this data because there is high correlation among each
> > > > number's bytes. For instance, if each double is a multiple of 0.1,
> > > > then the 52 mantissa bits will look like 011011011011011... (011
> > > > repeating). That means there are only 3 possibilities (<2 bits of
> > > > entropy) for the last 6+ bytes of each number. BYTE_STREAM_SPLIT
> > > > throws this away, requiring 6+ times as many bits for them.
> > > >
> > > > On Mon, Jan 8, 2024 at 10:44 AM Antoine Pitrou <[email protected]> wrote:
> > > >
> > > > > Hello Martin,
> > > > >
> > > > > On Sat, 6 Jan 2024 17:09:07 -0500
> > > > > Martin Loncaric <[email protected]> wrote:
> > > > >
> > > > > > > It would be very interesting to expand the comparison against
> > > > > > > BYTE_STREAM_SPLIT + compression.
> > > > > >
> > > > > > Antoine: I created one now, at the bottom of the post
> > > > > > <https://graphallthethings.com/posts/the-parquet-we-could-have>.
> > > > > > In this case, BYTE_STREAM_SPLIT did worse.
> > > > >
> > > > > I must admit I'm a bit surprised by these results. The first thing
> > > > > is that the Pcodec results were actually obtained using dictionary
> > > > > encoding. Then I don't understand what is Pcodec-encoded: the
> > > > > dictionary values or the dictionary indices?
> > > > >
> > > > > The second thing is that the BYTE_STREAM_SPLIT + Zstd results are
> > > > > much worse than the PLAIN + Zstd results, which is unexpected
> > > > > (though not impossible).
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
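
P.S. For anyone skimming the quoted thread above, the mantissa/entropy argument is easy to check empirically. Here is a minimal, self-contained Rust sketch (illustrative only; it uses nothing from pco or Parquet, and the value range 1..1000 is made up for the example). It prints the 52 mantissa bits of a few multiples of 0.1 and counts how many distinct patterns the low 6 bytes of each double actually take.

use std::collections::HashSet;

fn main() {
    // Nearest doubles to i/10 for i = 1..1000, i.e. multiples of 0.1.
    let nums: Vec<f64> = (1..1000).map(|i| i as f64 / 10.0).collect();

    // Print a few 52-bit mantissas; note the period-4 repeating tail
    // (cyclic shifts of 1100), give or take rounding in the last bits.
    for &x in &nums[..4] {
        let mantissa = x.to_bits() & ((1u64 << 52) - 1);
        println!("{:>4}: {:052b}", x, mantissa);
    }

    // Count distinct values of the low 6 bytes (48 bits) of each double's
    // bit pattern. Only a handful of patterns appear, nowhere near the
    // 2^48 that 48 independent-looking bits could take.
    let distinct: HashSet<[u8; 6]> = nums
        .iter()
        .map(|x| {
            let b = x.to_bits().to_le_bytes();
            [b[0], b[1], b[2], b[3], b[4], b[5]]
        })
        .collect();
    println!("distinct low-6-byte patterns: {}", distinct.len());
}

In other words, the six trailing bytes of each number jointly carry only about 2 bits of information, but the split layout moves each number's correlated bytes far apart, so a general-purpose compressor like Zstd can no longer exploit that correlation and ends up spending several times as many bits on them, which is the gap Martin describes.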
