IMHO, any implementation relying on JNI on Java is a non-starter. The Java
ecosystem strongly prefers pure Java libraries over libraries with native
components. parquet-mr is currently a pure Java library. Turning it into a
mixed library with native components and JNI would be such a maintenance
burden that even the best encoding, even one 10x faster than any other,
would not be worth it.

Pure Java has the big advantage of running on every platform where the JVM
runs, while JNI variants need a native binary for each platform. While it
might be comparatively simple to maintain binaries for x86 Windows and
Linux, once you move to other architectures, you usually run into problems.

I am not a maintainer of Parquet, so this is just my own opinion, but I
doubt that a commit introducing JNI bindings will ever be accepted into
parquet-mr, for the aforementioned reasons. I maintain a C++ project
myself, so I am indifferent about the Java implementation of Parquet. But
if I were using it in a Java project that had to run in a lot of
environments, I would refuse to use it if it had JNI bindings.

Cheers,
Jan





On Fri, Jan 5, 2024 at 14:53, Martin Loncaric <m.w.lonca...@gmail.com>
wrote:

> I would make the comparison to byte_stream_split immediately, filtering
> down to only float columns, but it looks like it's the one encoding not
> supported by arrow-rs. I'm seeing if I can get this merged in:
> https://github.com/apache/arrow-rs/pull/4183.
>
> In the meantime I'll see if I can do a compression-ratio-only comparison
> using pyarrow or something.
>
> Micah:
>
> > maintainers of parquet don't necessarily
> > have strong influence on all toolchain decisions their organizations may
> > make.
>
>
> I don't believe Apache has any restriction against Rust. We are not
> collectively beholden to any other organization's restrictions, are we?
>
> > It does sound like a good idea for you to start publishing Maven packages
> > and other native language bindings to generally expand the reach of your
> > project.
>
>
> Totally agreed. My understanding is that the JVM and C++ implementations
> are most important to support, and other languages can follow (e.g. as they
> have for byte stream split, apparently). Rust<>C++ bindings aren't too hard
> since you only need to build for the target architecture. JNI and some
> others are trickier.
>
> On Fri, Jan 5, 2024 at 6:10 AM Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > Hello,
> >
> > It would be very interesting to expand the comparison against
> > BYTE_STREAM_SPLIT + compression.
> >
> > See https://issues.apache.org/jira/browse/PARQUET-2414 for a proposal
> > to extend the range of types supporting BYTE_STREAM_SPLIT.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On Wed, 3 Jan 2024 00:10:14 -0500
> > Martin Loncaric <m.w.lonca...@gmail.com>
> > wrote:
> > > I'd like to propose and get feedback on a new encoding for numerical
> > > columns: pco. I just did a blog post demonstrating how this would
> > > perform on various real-world datasets
> > > <https://graphallthethings.com/posts/the-parquet-we-could-have>.
> > > TL;DR: pco losslessly achieves much better compression ratio
> > > (44-158% higher) and slightly faster decompression speed than
> > > zstd-compressed Parquet. On the other hand, it compresses somewhat
> > > slower at default compression level, but I think this difference may
> > > disappear in future updates.
> > >
> > > I think supporting this optional encoding would be an enormous win,
> > > but I'm not blind to the difficulties of implementing it:
> > > * Writing a good JVM implementation would be very difficult, so we'd
> > > probably have to make a JNI library.
> > > * Pco must be compressed one "chunk" (probably one per Parquet data
> > > page) at a time, with no way to estimate the encoded size until it
> > > has already done >50% of the compression work. I suspect the best
> > > solution is to split pco data pages based on unencoded size, which is
> > > different from existing encodings. I think this makes sense since pco
> > > fulfills the role usually played by compression in Parquet.
> > >
> > > Please let me know what you think of this idea.
> > >
> > > Thanks,
> > > Martin
> > >
> >
> >
> >
> >
>
