On my end, the only PR waiting for the 1.14 release is [1] and it is very close
to being merged. As the release process is pretty much the same for 1.14.0
and 1.13.2, I'd prefer to expedite the process for 1.14.0.

[1] https://github.com/apache/parquet-mr/pull/1139

Best,
Gang

On Thu, Apr 11, 2024 at 6:40 AM Suresh, Adi <[email protected]>
wrote:

> Hi, https://issues.apache.org/jira/browse/PARQUET-2450 is currently
> affecting many of our customers.
> https://github.com/apache/parquet-mr/pull/1300#issuecomment-2046590751
> will fix the issue. Can 1.14.0 be expedited? Or can we do a 1.13.2 patch
> release to get this fix out faster?
>
> Let me know if there’s anything on my end that I can do to help.
>
> On 2024/02/27 14:42:39 Fokko Driesprong wrote:
> > Hey everyone,
> >
> > Thanks for the many responses.
> >
> > > We should check that parquet-mr implements everything introduced by the
> > > new parquet-format release.
> >
> >
> > Good call and I fully agree with that. Let's double check that before
> > starting any releases.
> >
> > > We should check on every ongoing PR and Jira that seems to be targeting
> > > the next parquet-mr release, and decide if we want to wait for them or
> > > not.
> >
> >
> > I'm happy to do a first pass on that.
> >
> > > I am currently doing some work related to direct memory. Not all the
> > > related jiras are created. Will try to create them and set 1.14.0 as
> > > target. Will try to finalize everything by the end of next week.
> >
> >
> > Thanks, it is not my main area of expertise, but let me know if you need a
> > review. I would not want to rush the release if there is still ongoing
> > work, just wanted to get the ball rolling and collect expectations.
> >
> > For the new API, I feel like we're doing a 1.15 and then jumping to 2.0,
> > which is also totally fine with me.
> >
> > For those who'll be there, see you at the sync!
> >
> > Kind regards,
> > Fokko Driesprong
> >
> > On Thu, 22 Feb 2024 at 13:47, Steve Loughran <[email protected]> wrote:
> >
> > > Apologies for not making any progress -been too busy with releases.
> > >
> > > This week I am helping Hadoop 3.4.0 out the door. Hopefully we will only
> > > need one more iteration to get the packaging right (essentially strip out
> > > as many transitive JARs as we can). My release module does actually build
> > > parquet as one stage in the validation, so I'm happy we aren't breaking
> > > your build.
> > >
> > > Moving to 3.3+ would be absolutely wonderful; it has been out for years
> > > and we have fixed many issues as well as done our best to move to less
> > > insecure transitive dependencies -that is still ongoing. It is ongoing
> > > forever I suspect.
> > >
> > > Unless you use a release with vector IO (3.3.5+) you'll still need to use
> > > reflection there.
> > >
> > > What you will get as soon as you move to 3.3.0 is the openFile() API,
> > > which lets you:
> > > - Explicitly declare the read/seek policy of a file. For parquet,
> > > "random" is what you want.
> > > - Pass in the filestatus or file length when opening a file. For object
> > > stores, that can save the overhead of an HTTP HEAD request as we can
> > > skip the probe for the existence and length of the file.
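> > >
> > > In code, roughly (untested sketch; the standard read policy option key
> > > needs 3.3.5+, older 3.3.x only has s3a-specific options):
> > >
> > > import org.apache.hadoop.conf.Configuration;
> > > import org.apache.hadoop.fs.FSDataInputStream;
> > > import org.apache.hadoop.fs.FileStatus;
> > > import org.apache.hadoop.fs.FileSystem;
> > > import org.apache.hadoop.fs.Path;
> > >
> > > // Sketch: open a parquet file with an explicit "random" read policy and
> > > // a known FileStatus so object stores can skip the existence/length probe.
> > > FSDataInputStream openForParquet(Path path, FileStatus status,
> > >     Configuration conf) throws Exception {
> > >   FileSystem fs = path.getFileSystem(conf);
> > >   return fs.openFile(path)
> > >       .opt("fs.option.openfile.read.policy", "random")  // 3.3.5+ key
> > >       .withFileStatus(status)   // saves the HEAD request on object stores
> > >       .build()                  // CompletableFuture<FSDataInputStream>
> > >       .get();
> > > }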
> > >
> > > Random IO is the biggest saving here; s3a FS tries to guess your read
> > > policy and switch to random on the first backwards seek, but it isn't
> > > perfect.
> > >
> > > Regarding vectored read APIs, the Hadoop one maps trivially to the java
> > > nio scatter/gather read API, which can deliver great speed-ups on native
> > > storage, especially SSD -more from the ability to do parallel block reads
> > > than anything else. What does that mean? Use the Hadoop raw local FS and
> > > you get it. It also means that any non-hadoop java code should use the
> > > nio read API directly.
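> > >
> > > For the plain JDK route, the nio scattering read looks like this
> > > (untested sketch):
> > >
> > > import java.nio.ByteBuffer;
> > > import java.nio.channels.FileChannel;
> > > import java.nio.file.Path;
> > > import java.nio.file.StandardOpenOption;
> > >
> > > // Sketch: one scattering read fills several buffers in sequence from the
> > > // current channel position; the channel decides how to split the I/O.
> > > void scatterRead(Path file) throws Exception {
> > >   try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
> > >     ByteBuffer header = ByteBuffer.allocate(8);
> > >     ByteBuffer body = ByteBuffer.allocate(64 * 1024);
> > >     long bytesRead = ch.read(new ByteBuffer[] { header, body });
> > >     // header and body now hold consecutive bytes read from the file
> > >   }
> > > }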
> > >
> > > Anyway: I do plan to get onto that PR request as soon as I get a chance.
> > > - add range overlap detection in the parquet code
> > > - make sure all hadoop filesystems reject that too. s3a already does
> > > AFAIK, but I want consistency, contract tests and coverage in the
> > > specification.
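> > >
> > > The overlap check itself is just a sort by offset plus an adjacent-pair
> > > comparison; untested sketch, with a made-up Range type standing in for
> > > whatever range class the real code uses:
> > >
> > > import java.util.Comparator;
> > > import java.util.List;
> > >
> > > record Range(long offset, long length) {}  // illustrative only
> > >
> > > // Sketch: reject a vectored read request whose ranges overlap.
> > > void checkNoOverlap(List<Range> ranges) {
> > >   List<Range> sorted = ranges.stream()
> > >       .sorted(Comparator.comparingLong(Range::offset))
> > >       .toList();
> > >   for (int i = 1; i < sorted.size(); i++) {
> > >     Range prev = sorted.get(i - 1);
> > >     if (prev.offset() + prev.length() > sorted.get(i).offset()) {
> > >       throw new IllegalArgumentException(
> > >           "overlapping ranges: " + prev + " and " + sorted.get(i));
> > >     }
> > >   }
> > > }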
> > >
> > >
> > > On Wed, 21 Feb 2024 at 15:30, Gang Wu <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > Thanks for bringing this up!
> > > >
> > > > For the 1.14.0 release, I think it would be good to include some open
> > > > PRs, e.g. [1].
> > > >
> > > > Thanks Gabor for the idea of new APIs! I agree that we need to clean
> > > > up some misused APIs and remove the Hadoop dependencies. In the
> > > > meantime, I actually have some concerns. For example, I have recently
> > > > investigated how Apache Spark and Apache Iceberg support vectorized
> > > > reading of parquet. I have seen a lot of code duplication between them,
> > > > but they have different high-level APIs. If we aim to support a similar
> > > > vectorized reader based on Arrow vectors, I am not sure if these
> > > > clients are willing to migrate due to differences in type systems,
> > > > performance of vector conversion, etc. That said, this is worth doing
> > > > and we need to collect sufficient feedback from different communities.
> > > >
> > > > [1] https://github.com/apache/parquet-mr/pull/1139
> > > >
> > > > Best,
> > > > Gang
> > > >
> > > > On Wed, Feb 21, 2024 at 8:48 PM Gábor Szádovszky <[email protected]> wrote:
> > > >
> > > > > Thanks for bringing this up, Fokko.
> > > > > Unfortunately, I won't be able to join next week. (Hopefully I will
> > > > > be there at the one after.)
> > > > > So, let me write my thoughts here.
> > > > >
> > > > > I agree it is time to start preparing the next parquet-mr release.
> > > > > I have some thoughts:
> > > > > - We should check that parquet-mr implements everything introduced
> > > > > by the new parquet-format release
> > > > > - We should check on every ongoing PR and Jira that seems to be
> > > > > targeting the next parquet-mr release, and decide if we want to
> > > > > wait for them or not
> > > > > - I am currently doing some work related to direct memory. Not all
> > > > > the related jiras are created. Will try to create them and set
> > > > > 1.14.0 as target. Will try to finalize everything by the end of
> > > > > next week.
> > > > >
> > > > > About parquet-mr 2.0: we need to decide what we expect from it. The
> > > > > java upgrade is just one thing that can even be done without a major
> > > > > version (e.g. separate releases for different java versions).
> > > > > My original thought about 2.0 was to provide a new API for our
> > > > > clients:
> > > > >
> > > > > - We've had many issues because different API users started using
> > > > > classes/methods that were originally implemented for internal use
> > > > > only, like reading the pages directly.
> > > > >
> > > > > - We need to have different levels of APIs that support all current
> > > > > use-cases. e.g.:
> > > > >
> > > > > - Easy to use high level row-wise reading/writing
> > > > >
> > > > > - vectorized reading/writing; probably native support of Arrow
> > > > > vectors
> > > > >
> > > > > - We need to get rid of the Hadoop dependencies
> > > > >
> > > > > - The goal is to have a well-defined public API that we share with
> > > > > our clients and hide everything else. It is much easier to keep
> > > > > backward compatibility for the public API only.
> > > > >
> > > > > - The new API itself does not need a major release. We can start
> > > > > working on it in a separate module. We'll need some minor release
> > > > > cycles to build it. (We'll need our clients' feedback.) What we need
> > > > > a major release for is (after having the finalized new API) moving
> > > > > all current public classes to internal modules.
> > > > >
> > > > >
> > > > > Cheers,
> > > > > Gabor
> > > > >
> > > > >
> > > > >
> > > > > Fokko Driesprong <[email protected]> wrote on Wed, 21 Feb 2024 at 13:04:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > I'm seeing some great progress on the Parquet side and it was
> > > > > > almost one year ago that I ran the last 1.13.1 release (May 2023).
> > > > > > Are there any considerations of doing a 1.14.0 release?
> > > > > >
> > > > > > Looking forward, I would like to discuss a Parquet-mr 2.0 release.
> > > > > >
> > > > > >    - Looking at other projects in the space, there are more and
> > > > > >    more that are moving to Java 11+, for example, Spark 4.0 (June
> > > > > >    2024) and Iceberg 2.0 (the first release after 1.5.0 that's
> > > > > >    being voted on right now).
> > > > > >    - We currently have support for Hadoop 2.x, which is compiled
> > > > > >    against Java 7. I would suggest dropping everything below 3.3
> > > > > >    as that's the minimal version supporting Java 11
> > > > > >    <https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions>.
> > > > > >    Because some APIs changed, we also have to use reflection,
> > > > > >    which is not great.
> > > > > >
> > > > > > I would also like to thank Xinli for updating the Parquet Sync
> > > > > > invite. I was there on the 30th of January, but all by myself.
> > > > > > The next sync, next week Tuesday, would be a great opportunity to
> > > > > > go over this topic.
> > > > > >
> > > > > > Looking forward to your thoughts!
> > > > > >
> > > > > > Kind regards,
> > > > > > Fokko Driesprong
> > > > > >
> > > > >
> > > >
> > >
> >
>
