Hey Nicholas,

Thanks for pointing this out. I just realized that I misread the spark-hadoop-cloud POM. Previously, in Spark 2.4, two profiles, "hadoop-2.7" and "hadoop-3.1", were referenced in the spark-hadoop-cloud POM (here <https://github.com/apache/spark/blob/v2.4.4/hadoop-cloud/pom.xml#L174> and here <https://github.com/apache/spark/blob/v2.4.4/hadoop-cloud/pom.xml#L213>). But in the current master (3.0.0-SNAPSHOT), only the "hadoop-3.2" profile is mentioned, and I came to the wrong conclusion that spark-hadoop-cloud in Spark 3.0.0 is only available with the "hadoop-3.2" profile. Apologies for the misleading information.
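For reference, a quick way to double-check which profiles actually pull the module in is to build it explicitly under a given profile. A minimal sketch (the profile and module names below are assumed from current master):

    # Sketch: build only the hadoop-cloud module (plus its upstream deps) with the Hadoop 3.2 profile
    ./build/mvn -Phadoop-3.2 -Phadoop-cloud -pl hadoop-cloud -am -DskipTests package

Running the same command with -Phadoop-2.7 instead should show whether the module still resolves against the Hadoop 2.x line.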
Cheng

On Tue, Nov 19, 2019 at 8:57 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> I don't think the default Hadoop version matters except for the spark-hadoop-cloud module, which is only meaningful under the hadoop-3.2 profile.
>
> What do you mean by "only meaningful under the hadoop-3.2 profile"?
>
> On Tue, Nov 19, 2019 at 5:40 PM Cheng Lian <lian.cs....@gmail.com> wrote:
>
>> Hey Steve,
>>
>> In terms of Maven artifacts, I don't think the default Hadoop version matters except for the spark-hadoop-cloud module, which is only meaningful under the hadoop-3.2 profile. All the other spark-* artifacts published to Maven Central are Hadoop-version-neutral.
>>
>> Another issue with switching the default Hadoop version to 3.2 is the PySpark distribution. Right now, we only publish PySpark artifacts prebuilt with Hadoop 2.x to PyPI. I'm not sure whether bumping the Hadoop dependency to 3.2 is feasible for PySpark users. Or maybe we should publish PySpark prebuilt with both Hadoop 2.x and 3.x. I'm open to suggestions on this one.
>>
>> Again, as long as the Hive 2.3 and Hadoop 3.2 upgrades can be decoupled via the proposed hive-2.3 profile, I personally don't have a preference between Hadoop 2.7 and 3.2 as the default Hadoop version. But just to minimize the release management work, in case we decide to publish the other spark-* Maven artifacts from a Hadoop 2.7 build, we can still special-case spark-hadoop-cloud and publish it using a hadoop-3.2 build.
>>
>> On Mon, Nov 18, 2019 at 8:39 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>
>>> I also agree with Steve and Felix.
>>>
>>> Let's have another thread to discuss the Hive issue, because this thread was originally about the `hadoop` version.
>>>
>>> And now we can have a `hive-2.3` profile for both the `hadoop-2.7` and `hadoop-3.0` versions. We don't need to mix both.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Mon, Nov 18, 2019 at 8:19 PM Felix Cheung <felixcheun...@hotmail.com> wrote:
>>>
>>>> 1000% with Steve, the org.spark-project hive 1.2 will need a solution. It is old and rather buggy, and it's been *years*.
>>>>
>>>> I think we should decouple the hive change from everything else if people are concerned?
>>>>
>>>> ------------------------------
>>>> *From:* Steve Loughran <ste...@cloudera.com.INVALID>
>>>> *Sent:* Sunday, November 17, 2019 9:22:09 AM
>>>> *To:* Cheng Lian <lian.cs....@gmail.com>
>>>> *Cc:* Sean Owen <sro...@gmail.com>; Wenchen Fan <cloud0...@gmail.com>; Dongjoon Hyun <dongjoon.h...@gmail.com>; dev <dev@spark.apache.org>; Yuming Wang <wgy...@gmail.com>
>>>> *Subject:* Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?
>>>>
>>>> Can I take this moment to remind everyone that the version of Hive which Spark has historically bundled (the org.spark-project one) is an orphan project put together to deal with Hive's shading issues and a source of unhappiness in the Hive project. Whatever gets shipped should do its best to avoid including that artifact.
>>>>
>>>> Postponing a switch to Hadoop 3.x until after Spark 3.0 is probably the safest move from a risk minimisation perspective. If something has broken, you can start with the assumption that it is in the o.a.s packages without having to debug o.a.hadoop and o.a.hive first. There is a cost: if there are problems with the Hadoop / Hive dependencies, those teams will inevitably ignore filed bug reports, for the same reason the Spark team will probably close 1.6-related JIRAs as WONTFIX. WONTFIX responses for the Hadoop 2.x line include any compatibility issues with Java 9+. Do bear that in mind. It's not been tested, it has dependencies on artifacts we know are incompatible, and as far as the Hadoop project is concerned, people should move to branch 3 if they want to run on a modern version of Java.
>>>>
>>>> It would be really, really good if the published Spark Maven artefacts (a) included the spark-hadoop-cloud JAR and (b) were dependent upon Hadoop 3.x. That way, people doing things with their own projects will get up-to-date dependencies and won't get WONTFIX responses themselves.
>>>>
>>>> -Steve
>>>>
>>>> PS: There is a discussion on hadoop-dev about making Hadoop 2.10 the official "last ever" branch-2 release and then declaring its predecessors EOL; 2.10 will be the transition release.
>>>>
>>>> On Sun, Nov 17, 2019 at 1:50 AM Cheng Lian <lian.cs....@gmail.com> wrote:
>>>>
>>>> Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I thought the original proposal was to replace Hive 1.2 with Hive 2.3, which seemed risky, and therefore we only introduced Hive 2.3 under the hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong here...
>>>>
>>>> Sean, Yuming's PR https://github.com/apache/spark/pull/26533 showed that Hadoop 2 + Hive 2 + JDK 11 looks promising. My major motivation is not about demand, but risk control: coupling the Hive 2.3, Hadoop 3.2, and JDK 11 upgrades together looks too risky.
>>>>
>>>> On Sat, Nov 16, 2019 at 4:03 AM Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>> I'd prefer simply not making Hadoop 3 the default until 3.1+, rather than introducing yet another build combination. Does Hadoop 2 + Hive 2 work, and is there demand for it?
>>>>
>>>> On Sat, Nov 16, 2019 at 3:52 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>> >
>>>> > Do we have a limitation on the number of pre-built distributions? Seems this time we need
>>>> > 1. hadoop 2.7 + hive 1.2
>>>> > 2. hadoop 2.7 + hive 2.3
>>>> > 3. hadoop 3 + hive 2.3
>>>> >
>>>> > AFAIK we always build with JDK 8 (but make it JDK 11 compatible), so we don't need to add the JDK version to the combination.
>>>> >
>>>> > On Sat, Nov 16, 2019 at 4:05 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>> >>
>>>> >> Thank you for the suggestion.
>>>> >>
>>>> >> Having a `hive-2.3` profile sounds good to me because it's orthogonal to Hadoop 3. IIRC, it was originally proposed that way, but we put it under `hadoop-3.2` to avoid adding new profiles at that time.
>>>> >>
>>>> >> And I'm wondering if you are considering additional pre-built distributions and Jenkins jobs.
>>>> >>
>>>> >> Bests,
>>>> >> Dongjoon.
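For concreteness, the three combinations Wenchen lists above would roughly map onto builds along these lines. This is only a sketch: the hive-2.3 profile is the one proposed in this thread, and the distribution names are made up for illustration:

    # 1. Hadoop 2.7 + Hive 1.2 (the current default combination)
    ./dev/make-distribution.sh --name hadoop2.7 --tgz -Phadoop-2.7 -Phive -Phive-thriftserver
    # 2. Hadoop 2.7 + Hive 2.3 (assumes the proposed hive-2.3 profile lands under that name)
    ./dev/make-distribution.sh --name hadoop2.7-hive2.3 --tgz -Phadoop-2.7 -Phive-2.3 -Phive -Phive-thriftserver
    # 3. Hadoop 3.2 + Hive 2.3
    ./dev/make-distribution.sh --name hadoop3.2 --tgz -Phadoop-3.2 -Phive-2.3 -Phive -Phive-thriftserver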