Hi,

Thanks for bringing this up!

For the 1.14.0 release, I think it would be good to include some open
PRs, e.g. [1].

Thanks Gabor for the idea of new APIs! I agree that we need to clean
up some misused APIs and remove the Hadoop dependencies. At the same
time, I have some concerns. For example, I recently looked into how
Apache Spark and Apache Iceberg support vectorized reading of Parquet.
There is a lot of duplicated code between them, yet they expose
different high-level APIs. If we aim to support a similar vectorized
reader based on Arrow vectors, I am not sure these clients would be
willing to migrate, given the differences in type systems, the cost of
vector conversion, etc. That said, this is worth doing, and we need to
collect sufficient feedback from the different communities.
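
To make the "different levels of APIs" idea a bit more concrete, here is
a rough Java sketch of a row-wise level and a batch (vectorized) level
over the same data. Every name in it (RowReader, BatchReader,
InMemoryLongColumn) is hypothetical and made up for illustration; none
of them exist in parquet-mr, and a plain long[] stands in for whatever
vector type (e.g. an Arrow vector) a real API would hand out.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical names throughout; these are not existing parquet-mr classes.

/** Row-wise level: easy to use, one value (or record) at a time. */
interface RowReader<T> extends Iterator<T>, AutoCloseable {
    @Override
    void close();
}

/** Vectorized level: hands out one batch of column values at a time. */
interface BatchReader extends AutoCloseable {
    /** Returns the next batch, or null once the data is exhausted. */
    long[] nextBatch();

    @Override
    void close();
}

/** Toy in-memory source standing in for a decoded INT64 column. */
final class InMemoryLongColumn {
    private final long[] values;

    InMemoryLongColumn(long... values) {
        this.values = values.clone();
    }

    RowReader<Long> rowReader() {
        return new RowReader<Long>() {
            private int pos = 0;

            @Override
            public boolean hasNext() {
                return pos < values.length;
            }

            @Override
            public Long next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return values[pos++];
            }

            @Override
            public void close() {
                // Nothing to release for an in-memory source.
            }
        };
    }

    BatchReader batchReader(int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        return new BatchReader() {
            private int pos = 0;

            @Override
            public long[] nextBatch() {
                if (pos >= values.length) {
                    return null;
                }
                int end = Math.min(pos + batchSize, values.length);
                long[] batch = Arrays.copyOfRange(values, pos, end);
                pos = end;
                return batch;
            }

            @Override
            public void close() {
                // Nothing to release for an in-memory source.
            }
        };
    }
}
```

Clients would only see the two interfaces; everything behind them could
stay internal. The open question from above remains: whatever the batch
type ends up being, engines with their own vector types would still pay
a conversion cost.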

[1] https://github.com/apache/parquet-mr/pull/1139

Best,
Gang

On Wed, Feb 21, 2024 at 8:48 PM Gábor Szádovszky <[email protected]> wrote:

> Thanks for bringing this up, Fokko.
> Unfortunately, I won't be able to join next week. (Hopefully I will be
> there at the one after.)
> So, let me write my thoughts here.
>
> I agree it is time to start preparing the next parquet-mr release. I have
> some thoughts:
> - We should check that parquet-mr implements everything introduced by the
> new parquet-format release
> - We should check all ongoing PRs and Jiras that seem to be targeting
> the next parquet-mr release, and decide if we want to wait for them or not
> - I am currently doing some work related to direct memory. Not all the
> related jiras are created. Will try to create them and set 1.14.0 as
> target. Will try to finalize everything by the end of next week.
>
> About parquet-mr 2.0: we need to decide what we expect from it. The Java
> upgrade is just one thing, and it can even be done without a major version
> (e.g. separate releases for different Java versions).
> My original thought about 2.0 was to provide a new API for our clients:
>
> - We've had many issues because different API users started using
> classes/methods that were originally implemented for internal use only.
> Like reading the pages directly.
>
> - We need to have different levels of APIs that support all current
> use-cases, e.g.:
>
>   - Easy-to-use, high-level row-wise reading/writing
>
>   - Vectorized reading/writing; probably native support of Arrow vectors
>
> - We need to get rid of the Hadoop dependencies
>
> - The goal is to have a well-defined public API that we share with our
> clients and hide everything else. It is much easier to keep backward
> compatibility for the public API only.
>
> - The new API itself does not need a major release. We can start working on
> it in a separate module. We'll need some minor release cycles to build it.
> (We'll need our clients' feedback.) What we need a major release for is
> (after having the finalized new API) moving all current public classes to
> internal modules.
>
>
> Cheers,
> Gabor
>
>
>
> Fokko Driesprong <[email protected]> wrote (on Wed, Feb 21, 2024,
> 13:04):
>
> > Hi everyone,
> >
> > I'm seeing some great progress on the Parquet side and it was almost one
> > year ago that I ran the last 1.13.1 release (May 2023). Are there any
> > considerations of doing a 1.14.0 release?
> >
> > Looking forward, I would like to discuss a Parquet-mr 2.0 release.
> >
> >    - Looking at other projects in the space there are more and more that
> >    are moving to Java 11+, for example, Spark 4.0 (June 2024) and Iceberg
> >    2.0 (the first release after 1.5.0 that's being voted on right now).
> >    - We currently have support for Hadoop 2.x which is compiled against
> >    Java 7. I would suggest dropping everything below 3.3 as that's the
> >    minimal version supporting Java 11
> >    <https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions>.
> >    Because some APIs changed, we also have to use reflection, which is
> >    not great.
> >
> > I would also like to thank Xinli for updating the Parquet Sync invite. I
> > was there on the 30th of January, but all by myself. The next sync next
> > week Tuesday would be a great opportunity to go over this topic.
> >
> > Looking forward to your thoughts!
> >
> > Kind regards,
> > Fokko Driesprong
> >
>
