I'm doing this.

One thing to consider here is that the current Hadoop 3.3.x release is 3.3.6.

If the minimum version for releases is 3.3.0, then I think that should be
the version we compile against, to guarantee no accidental use of newer
APIs and classes. 3.3.6 might still be good as the test version though...
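
To make that concrete, here is a hypothetical sketch (not parquet-java
code) of the kind of accident a 3.3.0 compile floor catches: the vectored
read API only arrived in Hadoop 3.3.5 (HADOOP-18103), so code like this
compiles cleanly against 3.3.6 but fails with NoSuchMethodError on a
3.3.0 runtime.

  import java.io.IOException;
  import java.nio.ByteBuffer;
  import java.util.List;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileRange;

  // Hypothetical helper: FileRange and readVectored() exist only from
  // Hadoop 3.3.5, so building against the 3.3.0 floor turns an accidental
  // use like this into a compile-time error instead of a runtime failure.
  static void readRange(FSDataInputStream in, long offset, int length)
      throws IOException {
    List<FileRange> ranges =
        List.of(FileRange.createFileRange(offset, length));
    in.readVectored(ranges, ByteBuffer::allocate);
  }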



On Wed, 13 Nov 2024 at 09:13, Fokko Driesprong <[email protected]> wrote:

> From the replies here, and looking at the major cloud providers, I don't
> see any concerns regarding moving the lower bound to Hadoop 3.3.x. As
> suggested on the issue
> <https://github.com/apache/parquet-java/issues/2943>, it would be good to
> first get rid of the Hadoop 2 profile and all the error-prone reflection.
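>
> To illustrate (a hypothetical sketch, not the actual parquet-java code),
> supporting Hadoop 2 forces runtime probes like the one below; with a
> Hadoop 3.3 floor the shim collapses into a plain method call. Names here
> are made up for the example.
>
>   import java.io.IOException;
>   import java.lang.reflect.Method;
>   import java.nio.ByteBuffer;
>   import org.apache.hadoop.fs.FSDataInputStream;
>
>   // Hypothetical shim: probe for a ByteBuffer read method at runtime,
>   // since a Hadoop 2 classpath may not support it.
>   static int readViaReflection(FSDataInputStream in, ByteBuffer buf)
>       throws IOException {
>     try {
>       Method m = in.getClass().getMethod("read", ByteBuffer.class);
>       return (int) m.invoke(in, buf);
>     } catch (ReflectiveOperationException e) {
>       throw new IOException("ByteBuffer read not available", e);
>     }
>   }
>
>   // With a Hadoop 3.3+ floor, the same operation is a direct call:
>   static int readDirect(FSDataInputStream in, ByteBuffer buf)
>       throws IOException {
>     return in.read(buf);  // ByteBufferReadable is part of the 3.x API
>   }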
>
> Thanks everyone!
>
> Kind regards,
> Fokko
>
>
>
> On Mon, 11 Nov 2024 at 17:23, Steve Loughran
> <[email protected]> wrote:
>
> > That's about what I expected.
> >
> > HDInsight is probably a fork of Hadoop 3.1.x that is kept up to date;
> > certainly they do almost all of the work on the abfs connector against
> > trunk, with backports to the 3.4 branch, while AWS developers are
> > contributing great stuff in the S3A codebase (while I get left with the
> > mundane stuff like libraries forgetting to close streams:
> > https://github.com/apache/hadoop/pull/7151).
> >
> > Cloudera code is itself a 3.1.x fork but is more up to date w.r.t. Java
> > 11 and CVEs; ~everything on hadoop branch-3.4 for s3a and abfs is in,
> > and ~all internal changes go into apache trunk and branch-3.4 first.
> > That's not just "community spirit": Microsoft, Amazon, Cloudera and many
> > others sharing a common codebase means that we all benefit from the
> > broader test coverage, especially of those "so rare you will never see
> > them" failure conditions which actually happen a few times a day across
> > the entire user bases of everyone's products (e.g. HADOOP-19221). Having
> > parquet on 3.3.0+ means that everyone will be using up-to-date code, so
> > problems which surface in testing should be replicable in your own IDEs
> > and tests.
> >
> > Steve
> >
> > * More testing is always welcome, especially: third-party stores, long
> > and slow haul links, proxies, VPNs, customer-supplied encryption keys,
> > heavy load, and more. It's those configurations which neither developers
> > nor the CI builds test which can always benefit from extra coverage. And
> > tests *through* parquet are the way to be sure that parquet's code isn't
> > hitting regressions.
> >
> >
> > On Thu, 7 Nov 2024 at 19:36, Fokko Driesprong <[email protected]> wrote:
> >
> > > Thanks for jumping in here, Steve.
> > >
> > > I agree with you; my only concern is that this is quite a jump.
> > > However, looking at the ecosystem, it might not be such a problem.
> > > Looking at the cloud providers:
> > >
> > > AWS active EMR distributions:
> > >
> > >    1. EMR 7.3.0
> > >    <https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-730-release.html>
> > >    is at Hadoop 3.3.6
> > >    2. EMR 6.15
> > >    <https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-6150-release.html>
> > >    is at Hadoop 3.3.6 (<6.6.x is EOL
> > >    <https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-standard-support.html#emr-stadard-support-policy>)
> > >    3. EMR 5.36
> > >    <https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-5362-release.html>
> > >    is at Hadoop 2.10.1 (≤5.35 is EOL
> > >    <https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-standard-support.html#emr-stadard-support-policy>,
> > >    so only bugfixes for 5.36.x)
> > >
> > > GCP active Dataproc distributions:
> > >
> > >    - Dataproc 2.2.x
> > >    <https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.2>
> > >    is at Hadoop 3.3.6
> > >    - Dataproc 2.1.x
> > >    <https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.1>
> > >    is at Hadoop 3.3.6
> > >    - Dataproc 2.0.x
> > >    <https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.0>
> > >    is at Hadoop 3.2.4 (EOL 2024/07/31
> > >    <https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-version-clusters>)
> > >
> > > Azure active HDInsight distributions:
> > >
> > >    - HDInsight 5.x
> > >    <https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-5x-component-versioning>
> > >    is at Hadoop 3.3.4
> > >    - HDInsight 4.0
> > >    <https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-40-component-versioning>
> > >    is at Hadoop 3.1.1 (they call out certain HDInsight 4.0 cluster
> > >    types that have retired or will be retiring soon).
> > >
> > > Or, query engines:
> > >
> > >    - Spark 3.5.3
> > >    <https://github.com/apache/spark/blob/d39f5ab99f67ce959b4379ecc3d6e262c10146cf/pom.xml#L125>
> > >    is at Hadoop 3.3.4
> > >    - Spark 3.4.4
> > >    <https://github.com/apache/spark/blob/d3d84e045cc484cf7b70d36410a554238d7aef0e/pom.xml#L122>
> > >    is at Hadoop 3.3.4
> > >
> > > Hive 3.x has also been marked as EOL since October
> > > <https://hive.apache.org/general/downloads/>, and Hive 4 is also at
> > > Hadoop 3.3.6
> > > <https://github.com/apache/hive/blob/c29bab6ff780e6d1cea74e995a50528364ae383a/pom.xml#L143>.
> > >
> > > Looking at where the ecosystem is, jumping to Hadoop 3.3.x seems
> > > reasonable to me. Users on an older Hadoop version can still use
> > > Parquet 1.14.x.
> > >
> > > Kind regards,
> > > Fokko
> > >
> > >
> > >
> > > On Thu, 7 Nov 2024 at 16:16, Steve Loughran
> > > <[email protected]> wrote:
> > >
> > > > On Mon, 4 Nov 2024 at 09:02, Fokko Driesprong
> > > > <[email protected]> wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > Breaking the radio silence from my end, I was enjoying paternity
> > > > > leave.
> > > > >
> > > > > I have wanted to bring this up for a while. In Parquet we're still
> > > > > supporting Hadoop 2.7.3, which was released in August 2016
> > > > > <https://hadoop.apache.org/release/2.7.3.html>. For things like
> > > > > JDK 21 support, we have to drop these old versions. I was curious
> > > > > about what everyone thinks is a reasonable lower bound.
> > > > >
> > > > > My suggested route is to bump it to Hadoop 2.9.3
> > > > > <https://github.com/apache/parquet-java/pull/2944/> (November 2019)
> > > > > for Parquet 1.15.0, and then drop Hadoop 2 in the major release
> > > > > after that. Any thoughts, questions or concerns?
> > > > >
> > > > I'd be ruthless and say Hadoop 3.3.x only.
> > > >
> > > > Hadoop 2.x is nominally "Java 7" only. Really.
> > > >
> > > > Hadoop 3.3.x is Java 8, but you really need to be on Hadoop 3.4.x to
> > > > get a set of dependencies which work OK with Java 17+.
> > > >
> > > > Staying with older releases hampers parquet in terms of testing,
> > > > maintenance, the inability to use improvements written in the past
> > > > five or more years, and more.
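> > > >
> > > > As one example of those improvements (a sketch only, not parquet
> > > > code; helper names are made up): the openFile() builder in Hadoop
> > > > 3.3.x lets callers declare a read policy and a known file length,
> > > > so stores like S3A can plan range reads and skip a HEAD probe. The
> > > > standard option keys shown landed later in the 3.3 line.
> > > >
> > > >   import org.apache.hadoop.fs.FSDataInputStream;
> > > >   import org.apache.hadoop.fs.FileSystem;
> > > >   import org.apache.hadoop.fs.Path;
> > > >
> > > >   // Hypothetical helper: hint the read policy and length up front.
> > > >   static FSDataInputStream openForRandomIO(FileSystem fs, Path path,
> > > >       long knownLength) throws Exception {
> > > >     return fs.openFile(path)
> > > >         .opt("fs.option.openfile.read.policy", "random")
> > > >         .opt("fs.option.openfile.length", Long.toString(knownLength))
> > > >         .build()   // CompletableFuture<FSDataInputStream>
> > > >         .get();
> > > >   }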
> > > >
> > > > My proposal would be:
> > > >
> > > >    - 1.14.x: move to Hadoop 2.9.3
> > > >    - 1.15.x: Hadoop 3.3.x only
> > > >
> > >
> >
>
