I think we just need to provide two options, Hadoop 3.2 and Hadoop 2.7, and let end users choose the one they need. Thus, SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI) is a high-priority task for the Spark 3.1 release, in my view.
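For reference, here is a minimal sketch of how an end user could check which Hadoop version a given PySpark installation was built against, which is exactly the choice the two variants would expose. It goes through the py4j gateway (SparkContext._jvm), an internal interface, so treat it as a diagnostic snippet rather than a supported API; the local[1] master and the app name are just placeholders.

    # Minimal sketch: print the Hadoop version bundled with the local PySpark
    # install. Uses the internal py4j gateway (SparkContext._jvm), so this is
    # a diagnostic snippet, not a supported API.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[1]")               # placeholder; any master works
        .appName("hadoop-version-check")  # placeholder app name
        .getOrCreate()
    )
    try:
        # org.apache.hadoop.util.VersionInfo ships with the bundled Hadoop jars.
        version_info = spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo
        print("Hadoop version:", version_info.getVersion())  # e.g. 2.7.4 or 3.2.0
    finally:
        spark.stop()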
I do not know how to track the popularity of Hadoop 2 vs Hadoop 3. Based on
https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs, it sounds
like Hadoop 3.x is not as popular as Hadoop 2.7.

On Tue, Jun 23, 2020 at 8:08 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> I fully understand your concern, but we cannot live with Hadoop 2.7.4
> forever, Xiao. Like Hadoop 2.6, we should let it go.
>
> So, are you saying that CRAN/PyPI should have every combination of Apache
> Spark, including the Hive 1.2 distribution?
>
> What is your suggestion, as a PMC member, on the Hadoop 3.2 migration
> path? I'd love to remove the road blocks for that.
>
> As a side note, Homebrew is not an official Apache Spark channel, but it
> is also a popular distribution channel in the community, and it is already
> using the Hadoop 3.2 distribution. Hadoop 2.7 is too old for the year 2021
> (Apache Spark 3.1), isn't it?
>
> Bests,
> Dongjoon.
>
> On Tue, Jun 23, 2020 at 7:55 PM Xiao Li <lix...@databricks.com> wrote:
>
>> Then, it will be a little complex after this PR. It might make the
>> community more confused.
>>
>> In PyPI and CRAN, we are using Hadoop 2.7 as the default profile;
>> however, in the other distributions, we would be using Hadoop 3.2 as the
>> default? How do we explain this to the community? I would not change the
>> default, for consistency.
>>
>> Xiao
>>
>> On Tue, Jun 23, 2020 at 7:18 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>> wrote:
>>
>>> Thanks. Uploading PySpark to PyPI is a simple manual step, and our
>>> release script can still build PySpark with Hadoop 2.7 if we want.
>>> So, `No` for the following question. I updated my PR according to your
>>> comment.
>>>
>>> > If we change the default, will it impact them? If YES,...
>>>
>>> Given the comments on the PR, the following becomes irrelevant to the
>>> current PR.
>>>
>>> > SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI)
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Tue, Jun 23, 2020 at 12:09 AM Xiao Li <lix...@databricks.com> wrote:
>>>
>>>> Our monthly PyPI downloads of PySpark have reached 5.4 million. We
>>>> should avoid forcing current PySpark users to upgrade their Hadoop
>>>> versions. If we change the default, will it impact them? If YES, I
>>>> think we should not do it until it is ready and they have a
>>>> workaround. So far, our PyPI downloads still rely on our default
>>>> version.
>>>>
>>>> Please correct me if my concern is not valid.
>>>>
>>>> Xiao
>>>>
>>>> On Tue, Jun 23, 2020 at 12:04 AM Dongjoon Hyun
>>>> <dongjoon.h...@gmail.com> wrote:
>>>>
>>>>> Hi, All.
>>>>>
>>>>> I am bumping this thread again with the title "Use Hadoop-3.2 as a
>>>>> default Hadoop profile in 3.1.0?"
>>>>> There has been some recent discussion on the following PR. Please
>>>>> let us know your thoughts.
>>>>>
>>>>> https://github.com/apache/spark/pull/28897
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>> On Fri, Nov 1, 2019 at 9:41 AM Xiao Li <lix...@databricks.com> wrote:
>>>>>
>>>>>> Hi, Steve,
>>>>>>
>>>>>> Thanks for your comments! My major quality concern is not about
>>>>>> Hadoop 3.2 itself. In this release, the Hive execution module
>>>>>> upgrade (from 1.2 to 2.3), the Hive thrift-server upgrade, and
>>>>>> JDK 11 support are added to the Hadoop 3.2 profile only. Compared
>>>>>> with the Hadoop 2.x profile, the Hadoop 3.2 profile is riskier
>>>>>> because of these changes.
>>>>>>
>>>>>> To speed up the adoption of Spark 3.0, which has many other highly
>>>>>> desirable features, I am proposing to keep the Hadoop 2.x profile as
>>>>>> the default.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Xiao.
>>>>>>
>>>>>> On Fri, Nov 1, 2019 at 5:33 AM Steve Loughran <ste...@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>> What is the current default value? The 2.x releases are becoming
>>>>>>> EOL: 2.7 is dead, there might be a 2.8.x, and for now 2.9 is the
>>>>>>> branch-2 release getting attention. 2.10.0 shipped yesterday, but
>>>>>>> the ".0" means there will inevitably be surprises.
>>>>>>>
>>>>>>> One issue with using older versions is that any problem reported
>>>>>>> (especially stack traces you can blame me for) will generally be
>>>>>>> met by a response of "does it go away when you upgrade?" The other
>>>>>>> issue is how much test coverage things are getting.
>>>>>>>
>>>>>>> W.r.t. Hadoop 3.2 stability, nothing major has been reported. The
>>>>>>> ABFS client is there, and the big Guava update (HADOOP-16213) went
>>>>>>> in. People will either love or hate that.
>>>>>>>
>>>>>>> There are no major changes in the s3a code between 3.2.0 and 3.2.1;
>>>>>>> I have a large backport planned though, including changes to better
>>>>>>> handle AWS caching of 404s generated from HEAD requests before an
>>>>>>> object was actually created.
>>>>>>>
>>>>>>> It would be really good if the Spark distributions shipped with
>>>>>>> later versions of the Hadoop artifacts.
>>>>>>>
>>>>>>> On Mon, Oct 28, 2019 at 7:53 PM Xiao Li <lix...@databricks.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> The stability and quality of the Hadoop 3.2 profile are unknown.
>>>>>>>> The changes are massive, including the Hive execution module and a
>>>>>>>> new version of the Hive thriftserver.
>>>>>>>>
>>>>>>>> To reduce the risk, I would like to keep the current default
>>>>>>>> version unchanged. When it becomes stable, we can change the
>>>>>>>> default profile to Hadoop-3.2.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Xiao
>>>>>>>>
>>>>>>>> On Mon, Oct 28, 2019 at 12:51 PM Sean Owen <sro...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I'm OK with that, but I don't have a strong opinion or much
>>>>>>>>> information about the implications.
>>>>>>>>> That said, my guess is we're close to the point where we don't
>>>>>>>>> need to support Hadoop 2.x anyway, so, yeah.
>>>>>>>>>
>>>>>>>>> On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun
>>>>>>>>> <dongjoon.h...@gmail.com> wrote:
>>>>>>>>> >
>>>>>>>>> > Hi, All.
>>>>>>>>> >
>>>>>>>>> > There was a discussion about publishing artifacts built with
>>>>>>>>> > Hadoop 3. However, we are still publishing with Hadoop 2.7.3,
>>>>>>>>> > and `3.0-preview` will be the same because we have not changed
>>>>>>>>> > anything yet.
>>>>>>>>> >
>>>>>>>>> > Technically, we need to change two places for publishing:
>>>>>>>>> >
>>>>>>>>> > 1. Jenkins Snapshot Publishing
>>>>>>>>> > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
>>>>>>>>> >
>>>>>>>>> > 2. Release Snapshot/Release Publishing
>>>>>>>>> > https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh
>>>>>>>>> >
>>>>>>>>> > To minimize the change, we need to switch our default Hadoop
>>>>>>>>> > profile.
>>>>>>>>> >
>>>>>>>>> > Currently, the default is the `hadoop-2.7 (2.7.4)` profile, and
>>>>>>>>> > `hadoop-3.2 (3.2.0)` is optional. We had better use the
>>>>>>>>> > `hadoop-3.2` profile by default and `hadoop-2.7` optionally.
>>>>>>>>> >
>>>>>>>>> > Note that this means we use Hive 2.3.6 by default. Only the
>>>>>>>>> > `hadoop-2.7` distribution will use `Hive 1.2.1`, like Apache
>>>>>>>>> > Spark 2.4.x.
>>>>>>>>> >
>>>>>>>>> > Bests,
>>>>>>>>> > Dongjoon.
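P.S. A rough way to see which Hadoop profile a pip-installed PySpark was actually built with, without starting a JVM, is to list the Hadoop jars bundled inside the package. This is only a sketch and assumes the current pip packaging layout, where the jars live under the installed pyspark/jars/ directory:

    # Rough sketch: list the Hadoop jars bundled with a pip-installed PySpark
    # to see which Hadoop profile the package was built with. Assumes the jars
    # live under pyspark/jars/, which is the current pip packaging layout.
    import glob
    import os

    import pyspark

    jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
    for jar in sorted(glob.glob(os.path.join(jars_dir, "hadoop-*.jar"))):
        # e.g. hadoop-client-2.7.4.jar with today's default distribution
        print(os.path.basename(jar))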