Thanks for bringing this up, Fokko.
Unfortunately, I won't be able to join next week. (Hopefully I will be
there at the one after.)
So, let me write my thoughts here.

I agree it is time to start preparing the next parquet-mr release. I have
some thoughts:
- We should check that parquet-mr implements everything introduced by the
new parquet-format release
- We should check all ongoing PRs and jiras that seem to be targeting
the next parquet-mr release, and decide if we want to wait for them or not
- I am currently doing some work related to direct memory. Not all the
related jiras have been created yet. I will try to create them and set
1.14.0 as the target, and to finalize everything by the end of next week.

About parquet-mr 2.0: we need to decide what we expect from it. The java
upgrade is just one thing, and it can even be done without a major version
(e.g. separate releases for different java versions).
My original thought about 2.0 was to provide a new API for our clients:

- We've had many issues because different API users started using
classes/methods that were originally implemented for internal use only,
such as reading the pages directly.

- We need to have different levels of APIs that support all current
use-cases, e.g.:

  - Easy-to-use high-level row-wise reading/writing

  - Vectorized reading/writing; probably native support of Arrow vectors

- We need to get rid of the Hadoop dependencies

- The goal is to have a well-defined public API that we share with our
clients and hide everything else. It is much easier to keep backward
compatibility for the public API only.

- The new API itself does not need a major release. We can start working on
it in a separate module. We'll need some minor release cycles to build it.
(We'll need our clients' feedback.) What we need a major release for is
moving all current public classes to internal modules once the new API is
finalized.
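
To make the layering idea a bit more concrete, here is a rough sketch of
what "different levels" could look like. Every name in it (RowReader,
BatchReader, LongBatch, InMemoryBatchReader, BatchBackedRowReader) is
hypothetical and made up for illustration; none of them exist in parquet-mr
today, and the in-memory source only stands in for the real internal page
reading code. LongBatch is a placeholder for something like an Arrow vector.
Note that nothing here depends on Hadoop.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical public type: one batch of values of a single column. */
final class LongBatch {
    private final long[] values;
    LongBatch(long[] values) { this.values = values.clone(); }
    int size() { return values.length; }
    long get(int i) { return values[i]; }
}

/** Hypothetical public API, vectorized level: hands out whole batches. */
interface BatchReader extends AutoCloseable {
    /** Returns the next batch, or null when the source is exhausted. */
    LongBatch nextBatch();
    @Override void close();
}

/** Hypothetical public API, high level: one record at a time. */
interface RowReader<T> extends Iterator<T>, AutoCloseable {
    @Override void close();
}

/** Stand-in for internal code that clients should never touch directly. */
final class InMemoryBatchReader implements BatchReader {
    private final long[] data;
    private final int batchSize;
    private int position = 0;

    InMemoryBatchReader(long[] data, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.data = data.clone();
        this.batchSize = batchSize;
    }

    @Override public LongBatch nextBatch() {
        if (position >= data.length) {
            return null;
        }
        int end = Math.min(position + batchSize, data.length);
        LongBatch batch = new LongBatch(Arrays.copyOfRange(data, position, end));
        position = end;
        return batch;
    }

    @Override public void close() { /* nothing to release in this sketch */ }
}

/** The row-wise level is built on top of the vectorized level. */
final class BatchBackedRowReader implements RowReader<Long> {
    private final BatchReader batches;
    private LongBatch current;
    private int index;

    BatchBackedRowReader(BatchReader batches) { this.batches = batches; }

    @Override public boolean hasNext() {
        // Fetch batches until one has an unread value or the source ends.
        while (current == null || index >= current.size()) {
            current = batches.nextBatch();
            index = 0;
            if (current == null) {
                return false;
            }
        }
        return true;
    }

    @Override public Long next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.get(index++);
    }

    @Override public void close() { batches.close(); }
}

class ApiSketch {
    public static void main(String[] args) {
        long[] column = {1, 2, 3, 4, 5};
        try (RowReader<Long> rows =
                 new BatchBackedRowReader(new InMemoryBatchReader(column, 2))) {
            long sum = 0;
            while (rows.hasNext()) {
                sum += rows.next();
            }
            System.out.println("sum = " + sum);
        }
    }
}
```

In a layout like this, clients would only see the two interfaces and the
batch type, and everything else could change freely between releases.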


Cheers,
Gabor



Fokko Driesprong <[email protected]> wrote (on Wed, Feb 21, 2024,
13:04):

> Hi everyone,
>
> I'm seeing some great progress on the Parquet side and it was almost one
> year ago that I ran the last 1.13.1 release (May 2023). Are there any
> considerations of doing a 1.14.0 release?
>
> Looking forward, I would like to discuss a Parquet-mr 2.0 release.
>
>    - Looking at other projects in the space there are more and more that
>    are moving to Java 11+, for example, Spark 4.0 (June 2024) and Iceberg
>    2.0 (the first release after 1.5.0 that's being voted on right now).
>    - We currently have support for Hadoop 2.x which is compiled against
>    Java 7. I would suggest dropping everything below 3.3 as that's the
>    minimal version supporting Java 11
>    <https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions>.
>    Because some APIs changed, we also have to use reflection, which is not
>    great.
>
> I would also like to thank Xinli for updating the Parquet Sync invite. I
> was there on the 30th of January, but all by myself. The next sync next
> week Tuesday would be a great opportunity to go over this topic.
>
> Looking forward to your thoughts!
>
> Kind regards,
> Fokko Driesprong
>
