Hi Martin,
I agree with Gang's point about tANS. I opened an issue against the
pcodec repo enumerating some items that I think could be improved for
someone trying to implement this from the spec [1]. Maybe we can use that
as a centralized place to discuss any understanding/clarity issues (I hope
this feedback is helpful in general).

Having read in a bit more detail, I have a few follow-up questions:
1. I assume the way this would be integrated is that each page using the
encoding would be self-contained (i.e., it contains the pco header,
metadata, and data pages)? (From the branched thread with Xuwei, I believe
this is intended to be treated as a Parquet encoding and not a
compression.)
2. The docs say this becomes effective at ~20k elements. Given that is
the default number of rows per page in parquet-mr (when page indexes are
used), is it effective at exactly 20k elements? (In the benchmark results,
what was the average number of values per page?)
3. It looks like there is a patent controversy [2] around ANS algorithms.
IIUC the patent granted might only affect rANS and not the tANS used by
pcodec; does that match your understanding? Are you aware of any other
potential patent issues that could come up related to pcodec?

Thanks,
Micah

[1] https://github.com/mwlon/pcodec/issues/149
[2] https://en.wikipedia.org/wiki/Asymmetric_numeral_systems#Patent_controversy


On Sat, Jan 13, 2024 at 8:23 PM Gang Wu <[email protected]> wrote:

> Hi Martin,
>
> Sorry for chiming in late. I have just read your blog post and the format
> specs. Below are my two cents:
>
> - The PCO spec is a good starting point with a good explanation of the
>   format definition. For people unfamiliar with the background, it would
>   be good to also include the mechanism of the algorithm. For example,
>   how does tANS entropy encoding work?
> - Maintaining JNI in parquet-mr is a burden. Though we have incorporated
>   zstd-jni, which is provided by the hadoop dependency, the library
>   itself provides a fallback mechanism to leverage a pure Java
>   implementation via aircompressor. So a pure Java implementation of PCO
>   is always desirable.
> - CMIW, it seems to have the potential to achieve a good compression
>   ratio on integers having a common suffix, e.g. decimal(10,2) values
>   that all have a .00 suffix, or timestamp values that all have their
>   milliseconds set to 0 (see the sketch after this list).
> - To add a new parquet format change, the community usually requires two
>   PoC implementations (without restriction on library or language)
>   before a formal vote.
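>
> For the common-suffix case above, here is a minimal sketch (plain Rust
> with made-up values, not using the pco crate itself) of why such data
> should be cheap to encode: once a common multiplier is factored out,
> far fewer bits per value remain.
>
> // Hypothetical decimal(10,2) values that all end in .00, stored as
> // scaled integers (cents). They are all multiples of 100.
> fn gcd(a: i64, b: i64) -> i64 {
>     if b == 0 { a.abs() } else { gcd(b, a % b) }
> }
>
> fn main() {
>     let cents: Vec<i64> = vec![1200, 5700, 30000, 499500, 100];
>     let g = cents.iter().copied().fold(0, gcd);
>     println!("common factor: {g}"); // prints 100
>     // An encoder aware of this factor only needs to store value / g,
>     // saving roughly log2(100) ~ 6.6 bits per value.
>     for v in &cents {
>         println!("{v} = {g} * {}", v / g);
>     }
> }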
>
> Best,
> Gang
>
> On Sun, Jan 14, 2024 at 12:43 AM Martin Loncaric <[email protected]>
> wrote:
>
> > Micah: I've added a format doc now:
> > https://github.com/mwlon/pcodec/blob/main/docs/format.md. Would
> > appreciate any feedback or thoughts on it.
> >
> > On Thu, Jan 11, 2024 at 11:47 PM Micah Kornfield <[email protected]>
> > wrote:
> >
> > > >
> > > > Pco could technically work as a Parquet encoding, but people are
> > > > wary of its newness and weak FFI support. It seems there is no
> > > > immediate action to take, but would be worthwhile to consider this
> > > > again further in the future.
> > >
> > >
> > > I guess I'm more optimistic on the potential gaps. I think if there
> > > were a spec that allowed one to code it from scratch, I'd be willing
> > > to take a crack at seeing what it would take for another
> > > implementation in either Java or C++. (I looked at the links you
> > > provided, but they were somewhat too high-level.) I think having a
> > > spec would also guard against the "newness" concern.
> > >
> > > I can't say there wouldn't be other technical blockers, but at least
> > > this would be someplace to start?
> > >
> > > Cheers,
> > > Micah
> > >
> > > On Thu, Jan 11, 2024 at 7:21 PM Martin Loncaric <[email protected]>
> > > wrote:
> > >
> > > > (Oops, the repeating binary decimal is 1100... with period 4, so
> > > > exactly 2 bits of entropy for the 52 mantissa bits. The argument is
> > > > the same though.)
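> > > >
> > > > (A quick way to see this: a minimal Rust sketch printing the 52
> > > > stored mantissa bits of a few multiples of 0.1. The period-4
> > > > pattern and its handful of phases are visible directly, up to
> > > > rounding in the last bit or two.)
> > > >
> > > > fn main() {
> > > >     for k in [1u64, 3, 7, 11] {
> > > >         let x = k as f64 * 0.1;
> > > >         // Low 52 bits of the IEEE 754 representation.
> > > >         let mantissa = x.to_bits() & ((1u64 << 52) - 1);
> > > >         println!("{x:4.1}: {mantissa:052b}");
> > > >     }
> > > > }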
> > > >
> > > > On Thu, Jan 11, 2024 at 10:02 PM Martin Loncaric <[email protected]>
> > > > wrote:
> > > >
> > > > > To reach a conclusion on this thread, I understand the overall
> > > > > sentiment as:
> > > > >
> > > > > Pco could technically work as a Parquet encoding, but people are
> > > > > wary of its newness and weak FFI support. It seems there is no
> > > > > immediate action to take, but would be worthwhile to consider
> > > > > this again further in the future.
> > > > >
> > > > > On Thu, Jan 11, 2024 at 9:47 PM Martin Loncaric <[email protected]>
> > > > > wrote:
> > > > >
> > > > >>> I must admit I'm a bit surprised by these results. The first
> > > > >>> thing is that the Pcodec results were actually obtained using
> > > > >>> dictionary encoding. Then I don't understand what is
> > > > >>> Pcodec-encoded: the dictionary values or the dictionary indices?
> > > > >>
> > > > >> No, pco cannot be dictionary encoded; it only goes from Vec<T> ->
> > > > >> bytes and back. Some of Parquet's existing encodings are like
> > > > >> this as well.
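> > > > >>
> > > > >> If it helps, the whole standalone API is essentially just this
> > > > >> roundtrip. A rough sketch (the exact function names in the pco
> > > > >> crate may differ slightly between versions):
> > > > >>
> > > > >> use pco::standalone::{simple_decompress, simpler_compress};
> > > > >> use pco::DEFAULT_COMPRESSION_LEVEL;
> > > > >>
> > > > >> fn main() -> pco::errors::PcoResult<()> {
> > > > >>     // Vec<T> -> bytes: no dictionary and no per-value framing.
> > > > >>     let nums: Vec<i64> = (0..2000).map(|i| i * 100).collect();
> > > > >>     let compressed = simpler_compress(&nums, DEFAULT_COMPRESSION_LEVEL)?;
> > > > >>     // bytes -> Vec<T>.
> > > > >>     let recovered = simple_decompress::<i64>(&compressed)?;
> > > > >>     assert_eq!(recovered, nums);
> > > > >>     Ok(())
> > > > >> }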
> > > > >>
> > > > >>> The second thing is that the BYTE_STREAM_SPLIT + Zstd results
> > > > >>> are much worse than the PLAIN + Zstd results, which is
> > > > >>> unexpected (though not impossible).
> > > > >>
> > > > >>
> > > > >> I explained briefly in the blog post, but BYTE_STREAM_SPLIT does
> > > > >> terribly for this data because there is high correlation among
> > > > >> each number's bytes. For instance, if each double is a multiple
> > > > >> of 0.1, then the 52 mantissa bits will look like
> > > > >> 011011011011011... (011 repeating). That means there are only 3
> > > > >> possibilities (<2 bits of entropy) for the last 6+ bytes of each
> > > > >> number. BYTE_STREAM_SPLIT throws this away, requiring 6+ times
> > > > >> as many bits for them.
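> > > > >>
> > > > >> Concretely, BYTE_STREAM_SPLIT just transposes the values' bytes
> > > > >> into one stream per byte position before general-purpose
> > > > >> compression. A minimal sketch of that transposition (not
> > > > >> Parquet's actual implementation):
> > > > >>
> > > > >> fn byte_stream_split(values: &[f64]) -> [Vec<u8>; 8] {
> > > > >>     // Stream i collects byte i of every value (little-endian).
> > > > >>     let mut streams: [Vec<u8>; 8] = Default::default();
> > > > >>     for v in values {
> > > > >>         for (i, b) in v.to_le_bytes().iter().enumerate() {
> > > > >>             streams[i].push(*b);
> > > > >>         }
> > > > >>     }
> > > > >>     streams
> > > > >> }
> > > > >>
> > > > >> fn main() {
> > > > >>     let vals: Vec<f64> = (1..=4).map(|k| k as f64 * 0.1).collect();
> > > > >>     for (i, s) in byte_stream_split(&vals).iter().enumerate() {
> > > > >>         // Each stream looks near-random on its own, even though
> > > > >>         // the streams are jointly low-entropy.
> > > > >>         println!("stream {i}: {s:02x?}");
> > > > >>     }
> > > > >> }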
> > > > >>
> > > > >> On Mon, Jan 8, 2024 at 10:44 AM Antoine Pitrou <[email protected]>
> > > > >> wrote:
> > > > >>
> > > > >>>
> > > > >>> Hello Martin,
> > > > >>>
> > > > >>> On Sat, 6 Jan 2024 17:09:07 -0500
> > > > >>> Martin Loncaric <[email protected]>
> > > > >>> wrote:
> > > > >>> > >
> > > > >>> > > It would be very interesting to expand the comparison
> > > > >>> > > against BYTE_STREAM_SPLIT + compression.
> > > > >>> >
> > > > >>> > Antoine: I created one now, at the bottom of the post
> > > > >>> > <https://graphallthethings.com/posts/the-parquet-we-could-have>.
> > > > >>> > In this case, BYTE_STREAM_SPLIT did worse.
> > > > >>>
> > > > >>> I must admit I'm a bit surprised by these results. The first
> > > > >>> thing is that the Pcodec results were actually obtained using
> > > > >>> dictionary encoding. Then I don't understand what is
> > > > >>> Pcodec-encoded: the dictionary values or the dictionary indices?
> > > > >>>
> > > > >>> The second thing is that the BYTE_STREAM_SPLIT + Zstd results
> > > > >>> are much worse than the PLAIN + Zstd results, which is
> > > > >>> unexpected (though not impossible).
> > > > >>>
> > > > >>> Regards
> > > > >>>
> > > > >>> Antoine.
> > > > >>>
> > > > >>>
> > > > >>>
> > > >
> > >
> >
>
