Hi, https://issues.apache.org/jira/browse/PARQUET-2450 is currently affecting many of our customers. https://github.com/apache/parquet-mr/pull/1300#issuecomment-2046590751 will fix the issue. Can 1.14.0 be expedited? Or can we do a 1.13.2 patch release to get this fix out faster?
Let me know if there's anything on my end that I can do to help.

On 2024/02/27 14:42:39 Fokko Driesprong wrote:
> Hey everyone,
>
> Thanks for the many responses.
>
> > We should check that parquet-mr implements everything introduced by the
> > new parquet-format release.
>
> Good call and I fully agree with that. Let's double check that before
> starting any releases.
>
> > We should check on every ongoing PR and Jira that seems to be targeting
> > the next parquet-mr release, and decide if we want to wait for them or
> > not.
>
> I'm happy to do a first pass on that.
>
> > I am currently doing some work related to direct memory. Not all the
> > related jiras are created. Will try to create them and set 1.14.0 as
> > target. Will try to finalize everything by the end of next week.
>
> Thanks, it is not my main area of expertise, but let me know if you need a
> review. I would not want to rush the release if there is still ongoing
> work, just wanted to get the ball rolling and collect expectations.
>
> For the new API, I feel like we're doing a 1.15 and then jumping to 2.0,
> which is also totally fine with me.
>
> For those who'll be there, see you at the sync!
>
> Kind regards,
> Fokko Driesprong
>
> On Thu, 22 Feb 2024 at 13:47, Steve Loughran <st...@cloudera.com.invalid>
> wrote:
>
> > Apologies for not making any progress - been too busy with releases.
> >
> > This week I am helping Hadoop 3.4.0 out the door. Hopefully we will only
> > need one more iteration to get the packaging right (essentially strip
> > out as many transitive JARs as we can). My release module does actually
> > build parquet as one stage in the validation, so I'm happy we aren't
> > breaking your build.
> >
> > Moving to 3.3+ would be absolutely wonderful; it has been out for years
> > and we have fixed many issues as well as done our best to move to less
> > insecure transitive dependencies - that is still ongoing. It is ongoing
> > forever, I suspect.
> > Unless you use a release with vector IO (3.3.5+) you'll still need to
> > use reflection there.
> >
> > What you will get as soon as you move to 3.3.0 is the openFile() API,
> > which lets you:
> > - explicitly declare the read/seek policy of a file. For parquet,
> >   "random" is what you want.
> > - pass in the FileStatus or file length when opening a file. For object
> >   stores, that can save the overhead of an HTTP HEAD request, as we can
> >   skip the probe for the existence and length of the file.
> >
> > Random IO is the biggest saving here; the s3a FS tries to guess your
> > read policy and switches to random on the first backwards seek, but it
> > isn't perfect.
> >
> > Regarding vectored read APIs, the Hadoop one maps trivially to the java
> > nio scatter/gather read API, which can deliver great speed-ups on native
> > storage, especially SSD - more from the ability to do parallel block
> > reads than anything else. What does that mean? Use the hadoop raw local
> > FS and you get it. It also means that any non-hadoop java code should
> > use the nio read API directly.
> >
> > Anyway: I do plan to get onto that PR as soon as I get a chance:
> > - add range overlap detection in the parquet code
> > - make sure all hadoop filesystems reject overlapping ranges too. s3a
> >   already does AFAIK, but I want consistency, contract tests and
> >   coverage in the specification.
> >
> > On Wed, 21 Feb 2024 at 15:30, Gang Wu <us...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > Thanks for bringing this up!
> > >
> > > For the 1.14.0 release, I think it would be good to include some open
> > > PRs, e.g. [1].
> > >
> > > Thanks Gabor for the idea of new APIs! I agree that we need to clean
> > > up some misused APIs and remove the Hadoop dependencies. In the
> > > meantime, I actually have some concerns. For example, I have recently
> > > investigated how Apache Spark and Apache Iceberg support vectorized
> > > reading of parquet.
> > > I have seen much code duplication between them, but they have
> > > different high-level APIs. If we aim to support a similar vectorized
> > > reader based on Arrow vectors, I am not sure if these clients are
> > > willing to migrate due to the differences in type system, performance
> > > of vector conversion, etc. That said, this is worth doing and we need
> > > to collect sufficient feedback from different communities.
> > >
> > > [1] https://github.com/apache/parquet-mr/pull/1139
> > >
> > > Best,
> > > Gang
> > >
> > > On Wed, Feb 21, 2024 at 8:48 PM Gábor Szádovszky <ga...@apache.org>
> > > wrote:
> > >
> > > > Thanks for bringing this up, Fokko.
> > > > Unfortunately, I won't be able to join next week. (Hopefully I will
> > > > be there at the one after.)
> > > > So, let me write my thoughts here.
> > > >
> > > > I agree it is time to start preparing the next parquet-mr release.
> > > > I have some thoughts:
> > > > - We should check that parquet-mr implements everything introduced
> > > >   by the new parquet-format release
> > > > - We should check on every ongoing PR and jira that seems to be
> > > >   targeting the next parquet-mr release, and decide if we want to
> > > >   wait for them or not
> > > > - I am currently doing some work related to direct memory. Not all
> > > >   the related jiras are created. Will try to create them and set
> > > >   1.14.0 as target. Will try to finalize everything by the end of
> > > >   next week.
> > > >
> > > > About parquet-mr 2.0: we need to decide what we expect from it. The
> > > > java upgrade is just one thing, and it can even be done without a
> > > > major version (e.g. separate releases for different java versions).
> > > > My original thought about 2.0 was to provide a new API for our
> > > > clients:
> > > >
> > > > - We've had many issues because different API users started using
> > > > classes/methods that were originally implemented for internal use only.
> > > > Like reading the pages directly.
> > > > - We need to have different levels of APIs that support all current
> > > >   use-cases, e.g.:
> > > >   - easy-to-use, high-level row-wise reading/writing
> > > >   - vectorized reading/writing; probably native support of Arrow
> > > >     vectors
> > > > - We need to get rid of the Hadoop dependencies
> > > > - The goal is to have a well-defined public API that we share with
> > > >   our clients and hide everything else. It is much easier to keep
> > > >   backward compatibility for the public API only.
> > > > - The new API itself does not need a major release. We can start
> > > >   working on it in a separate module. We'll need some minor release
> > > >   cycles to build it. (We'll need our clients' feedback.) What we
> > > >   need a major release for is (after having finalized the new API)
> > > >   moving all current public classes to internal modules.
> > > >
> > > > Cheers,
> > > > Gabor
> > > >
> > > > Fokko Driesprong <fo...@apache.org> wrote (on Wed, 21 Feb 2024 at
> > > > 13:04):
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > I'm seeing some great progress on the Parquet side, and it was
> > > > > almost one year ago that I ran the last 1.13.1 release (May 2023).
> > > > > Are there any considerations of doing a 1.14.0 release?
> > > > >
> > > > > Looking forward, I would like to discuss a parquet-mr 2.0 release.
> > > > >
> > > > > - Looking at other projects in the space, there are more and more
> > > > >   that are moving to Java 11+, for example, Spark 4.0 (June 2024)
> > > > >   and Iceberg 2.0 (the first release after 1.5.0 that's being
> > > > >   voted on right now).
> > > > > - We currently have support for Hadoop 2.x, which is compiled
> > > > >   against Java 7.
> > > > >   I would suggest dropping everything below 3.3, as that's the
> > > > >   minimal version supporting Java 11
> > > > >   <https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions>.
> > > > >   Because some APIs changed, we also have to use reflection, which
> > > > >   is not great.
> > > > >
> > > > > I would also like to thank Xinli for updating the Parquet Sync
> > > > > invite. I was there on the 30th of January, but all by myself. The
> > > > > next sync, next week Tuesday, would be a great opportunity to go
> > > > > over this topic.
> > > > >
> > > > > Looking forward to your thoughts!
> > > > >
> > > > > Kind regards,
> > > > > Fokko Driesprong
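For reference, the openFile() builder Steve describes further up the thread can be sketched roughly as below. This is an illustrative example, not code from the thread: it assumes Hadoop 3.3.x on the classpath, and the standard `fs.option.openfile.read.policy` key is only recognized from Hadoop 3.3.5 on (unrecognized `opt()` keys are simply ignored, so it is safe to set on earlier 3.3.x releases too).

```java
// Sketch: opening a file for the seek-heavy access pattern Parquet needs,
// declaring the "random" read policy and passing the known FileStatus so
// object stores can skip the HEAD probe for the file's existence and length.
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;

public class OpenFileSketch {
  public static FSDataInputStream openForRandomIO(FileSystem fs, FileStatus st)
      throws Exception {
    return fs.openFile(st.getPath())
        .opt("fs.option.openfile.read.policy", "random") // seek-heavy formats like Parquet
        .withFileStatus(st)                              // skips the existence/length probe
        .build()                                         // CompletableFuture<FSDataInputStream>
        .get();
  }
}
```

Without `withFileStatus()`, an object-store filesystem such as s3a typically issues a HEAD request before the first read, which is exactly the overhead described above.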
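And what "use the nio read API directly" can look like for a batch of (offset, length) ranges: a minimal sketch using only the JDK, relying on the fact that `FileChannel` positioned reads are thread-safe so the ranges can be fetched in parallel. The class and method names (`RangeReadSketch`, `readRanges`) are made up for illustration and are not parquet-mr or Hadoop APIs.

```java
// Sketch: parallel positioned reads of several byte ranges of one file.
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.CompletableFuture;

public class RangeReadSketch {
  // Reads ranges[i] = lengths[i] bytes starting at offsets[i]; each buffer
  // is returned flipped, ready for reading (short at EOF).
  public static ByteBuffer[] readRanges(Path file, long[] offsets, int[] lengths)
      throws Exception {
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      @SuppressWarnings("unchecked")
      CompletableFuture<ByteBuffer>[] pending = new CompletableFuture[offsets.length];
      for (int i = 0; i < offsets.length; i++) {
        final long off = offsets[i];
        final int len = lengths[i];
        pending[i] = CompletableFuture.supplyAsync(() -> {
          try {
            ByteBuffer buf = ByteBuffer.allocate(len);
            while (buf.hasRemaining()) {
              // Positioned read: does not touch the channel's own position,
              // which is what makes the parallel reads safe.
              if (ch.read(buf, off + buf.position()) < 0) break; // EOF
            }
            buf.flip();
            return buf;
          } catch (Exception e) {
            throw new RuntimeException(e);
          }
        });
      }
      ByteBuffer[] out = new ByteBuffer[pending.length];
      for (int i = 0; i < out.length; i++) {
        out[i] = pending[i].join(); // join before the channel closes
      }
      return out;
    }
  }
}
```

On local SSD most of the benefit comes from issuing the block reads in parallel, matching Steve's observation about the raw local FS.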