> Another thing that could help with adoption here is if pcodec had a
> specification document (apologies if I missed it), that would allow others
> to more easily port it.
>

+1 to this. Parquet's existing encodings are defined by a specification,
not by any particular library. If we wanted Pcodec to be integrated into
Parquet, it should be as a specification for an encoding, not as a
library. Whether a given Parquet implementation then uses your
implementation of Pcodec should be a separate decision made by each
implementation. Do you have the specification for the codec written down
somewhere?

For some added context, there are many Parquet libraries even outside of
Apache governance. For example, both Velox and DuckDB have their own C++
implementations of Parquet, independent of the Apache C++ one.

On Fri, Jan 5, 2024 at 9:26 AM Micah Kornfield <[email protected]>
wrote:

> >
> > I don't believe Apache has any restriction against Rust. We are not
> > collectively beholden to any other organization's restrictions, are we?
>
>
> It is correct that Apache does not have any restrictions.  The point is
> mostly about:
> 1.  Even if there is no restriction, maintainers of Apache projects need to
> maintain their own tool chains, and adding a new dependency might not be
> something they care to take on (I am not actively involved in tool chain
> maintenance, but the Arrow/Parquet C++ build system is particularly complex
> due to the wide range of systems it targets).
> 2.  IMO it is important to consider downstream users in these decisions as
> well.
>
> Another thing that could help with adoption here is if pcodec had a
> specification document (apologies if I missed it), that would allow others
> to more easily port it.
>
> Thanks,
> Micah
>
>
> On Fri, Jan 5, 2024 at 5:53 AM Martin Loncaric <[email protected]>
> wrote:
>
> > I would make the comparison to byte_stream_split immediately, filtering
> > down to only float columns, but it looks like that's the one encoding not
> > supported by arrow-rs. I'm seeing if I can get this merged:
> > https://github.com/apache/arrow-rs/pull/4183.
> >
> > In the meantime I'll see if I can do a compression-ratio-only comparison
> > using pyarrow or something.
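> >
> > For example, something along these lines with pyarrow (a rough,
> > untested sketch; the file names are placeholders, and it assumes a
> > pyarrow version whose write_table still accepts use_byte_stream_split):
> >
> >     import os
> >     import pyarrow as pa
> >     import pyarrow.parquet as pq
> >
> >     table = pq.read_table("data.parquet")  # placeholder input file
> >
> >     # Keep only float columns, since BYTE_STREAM_SPLIT currently
> >     # applies to floating-point types.
> >     floats = table.select(
> >         [f.name for f in table.schema if pa.types.is_floating(f.type)]
> >     )
> >
> >     # Baseline: default encodings + zstd compression.
> >     pq.write_table(floats, "baseline.parquet", compression="zstd")
> >
> >     # BYTE_STREAM_SPLIT + zstd; dictionary encoding has to be disabled,
> >     # since pyarrow prefers dictionary when both are enabled.
> >     pq.write_table(
> >         floats,
> >         "bss.parquet",
> >         compression="zstd",
> >         use_dictionary=False,
> >         use_byte_stream_split=True,
> >     )
> >
> >     for f in ("baseline.parquet", "bss.parquet"):
> >         print(f, os.path.getsize(f), "bytes")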
> >
> > Micah:
> >
> > > maintainers of parquet don't necessarily
> > > have strong influence on all toolchain decisions their organizations may
> > > make.
> >
> >
> > I don't believe Apache has any restriction against Rust. We are not
> > collectively beholden to any other organization's restrictions, are we?
> >
> > > It does sound like a good idea for you to start publishing Maven packages
> > > and other native language bindings to generally expand the reach of your
> > > project.
> >
> >
> > Totally agreed. My understanding is that the JVM and C++ implementations
> > are most important to support, and other languages can follow (e.g. as
> > they have for byte stream split, apparently). Rust<>C++ bindings aren't
> > too hard since you only need to build for the target architecture. JNI
> > and some others are trickier.
> >
> > On Fri, Jan 5, 2024 at 6:10 AM Antoine Pitrou <[email protected]> wrote:
> >
> > >
> > > Hello,
> > >
> > > It would be very interesting to expand the comparison against
> > > BYTE_STREAM_SPLIT + compression.
> > >
> > > See https://issues.apache.org/jira/browse/PARQUET-2414 for a proposal
> > > to extend the range of types supporting BYTE_STREAM_SPLIT.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > On Wed, 3 Jan 2024 00:10:14 -0500
> > > Martin Loncaric <[email protected]>
> > > wrote:
> > > > I'd like to propose and get feedback on a new encoding for numerical
> > > > columns: pco. I just did a blog post demonstrating how this would
> > > > perform on various real-world datasets
> > > > <https://graphallthethings.com/posts/the-parquet-we-could-have>.
> > > > TL;DR: pco losslessly achieves a much better compression ratio
> > > > (44-158% higher) and slightly faster decompression speed than
> > > > zstd-compressed Parquet. On the other hand, it compresses somewhat
> > > > slower at the default compression level, but I think this difference
> > > > may disappear in future updates.
> > > >
> > > > I think supporting this optional encoding would be an enormous win,
> > > > but I'm not blind to the difficulties of implementing it:
> > > > * Writing a good JVM implementation would be very difficult, so we'd
> > > > probably have to make a JNI library.
> > > > * Pco must be compressed one "chunk" (probably one per Parquet data
> > > > page) at a time, with no way to estimate the encoded size until it
> > > > has already done >50% of the compression work. I suspect the best
> > > > solution is to split pco data pages based on unencoded size (rough
> > > > sketch below), which is different from existing encodings. I think
> > > > this makes sense since pco fulfills the role usually played by
> > > > compression in Parquet.
> > > >
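> > > > As a rough sketch of the page-splitting idea (hypothetical writer
> > > > logic, not any existing Parquet API; the names and the 1 MiB
> > > > threshold are made up for illustration):
> > > >
> > > >     # Flush a page once the *unencoded* bytes buffered for it cross
> > > >     # a threshold, instead of predicting the encoded size up front.
> > > >     PAGE_UNENCODED_LIMIT = 1 << 20  # e.g. 1 MiB of raw values
> > > >
> > > >     def write_column(values, value_size, compress_chunk, emit_page):
> > > >         buffered = []
> > > >         buffered_bytes = 0
> > > >         for v in values:
> > > >             buffered.append(v)
> > > >             buffered_bytes += value_size
> > > >             if buffered_bytes >= PAGE_UNENCODED_LIMIT:
> > > >                 # One pco chunk per Parquet data page.
> > > >                 emit_page(compress_chunk(buffered))
> > > >                 buffered, buffered_bytes = [], 0
> > > >         if buffered:
> > > >             emit_page(compress_chunk(buffered))
> > > >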
> > > > Please let me know what you think of this idea.
> > > >
> > > > Thanks,
> > > > Martin
> > > >
> > >
> > >
> > >
> > >
> >
>
