> I don't think the default Hadoop version matters except for the spark-hadoop-cloud module, which is only meaningful under the hadoop-3.2 profile.
What do you mean by "only meaningful under the hadoop-3.2 profile"?
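My best guess, sketched as build invocations (profile names taken from the
current pom; I'm assuming the hadoop-cloud profile is what pulls in the
spark-hadoop-cloud module, so correct me if that's wrong):

    # sketch: spark-hadoop-cloud built against the Hadoop 3.x connectors
    ./build/mvn -Phadoop-3.2 -Phadoop-cloud -DskipTests clean package

    # sketch: the same module under the default Hadoop 2.7 profile
    ./build/mvn -Phadoop-cloud -DskipTests clean package

Is the point that the module compiles either way, but only the hadoop-3.2
build gives it a cloud-connector dependency set worth publishing?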
On Tue, Nov 19, 2019 at 5:40 PM Cheng Lian <lian.cs....@gmail.com> wrote:

> Hey Steve,
>
> In terms of Maven artifacts, I don't think the default Hadoop version
> matters except for the spark-hadoop-cloud module, which is only meaningful
> under the hadoop-3.2 profile. All the other spark-* artifacts published to
> Maven central are Hadoop-version-neutral.
>
> Another issue with switching the default Hadoop version to 3.2 is the
> PySpark distribution. Right now, we only publish PySpark artifacts
> prebuilt with Hadoop 2.x to PyPI. I'm not sure whether bumping the Hadoop
> dependency to 3.2 is feasible for PySpark users. Or maybe we should
> publish PySpark prebuilt with both Hadoop 2.x and 3.x. I'm open to
> suggestions on this one.
>
> Again, as long as the Hive 2.3 and Hadoop 3.2 upgrades can be decoupled
> via the proposed hive-2.3 profile, I personally don't have a preference
> between Hadoop 2.7 and 3.2 as the default Hadoop version. But just to
> minimize the release management work, in case we decide to publish the
> other spark-* Maven artifacts from a Hadoop 2.7 build, we can still
> special-case spark-hadoop-cloud and publish it using a hadoop-3.2 build.
>
> On Mon, Nov 18, 2019 at 8:39 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> I also agree with Steve and Felix.
>>
>> Let's have another thread to discuss the Hive issue, because this thread
>> was originally about the `hadoop` version.
>>
>> And now we can have a `hive-2.3` profile for both the `hadoop-2.7` and
>> `hadoop-3.0` versions. We don't need to mix both.
>>
>> Bests,
>> Dongjoon.
>>
>> On Mon, Nov 18, 2019 at 8:19 PM Felix Cheung <felixcheun...@hotmail.com>
>> wrote:
>>
>>> 1000% with Steve, the org.spark-project hive 1.2 will need a solution.
>>> It is old and rather buggy; and it's been *years*.
>>>
>>> I think we should decouple the hive change from everything else, if
>>> people are concerned?
>>>
>>> ------------------------------
>>> *From:* Steve Loughran <ste...@cloudera.com.INVALID>
>>> *Sent:* Sunday, November 17, 2019 9:22:09 AM
>>> *To:* Cheng Lian <lian.cs....@gmail.com>
>>> *Cc:* Sean Owen <sro...@gmail.com>; Wenchen Fan <cloud0...@gmail.com>;
>>> Dongjoon Hyun <dongjoon.h...@gmail.com>; dev <dev@spark.apache.org>;
>>> Yuming Wang <wgy...@gmail.com>
>>> *Subject:* Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?
>>>
>>> Can I take this moment to remind everyone that the version of hive
>>> which spark has historically bundled (the org.spark-project one) is an
>>> orphan project put together to deal with Hive's shading issues and a
>>> source of unhappiness in the Hive project. Whatever gets shipped should
>>> do its best to avoid including that artifact.
>>>
>>> Postponing the switch to hadoop 3.x until after spark 3.0 is probably
>>> the safest move from a risk-minimisation perspective. If something
>>> breaks, you can start with the assumption that it is in the o.a.s
>>> packages without having to debug o.a.hadoop and o.a.hive first. There
>>> is a cost: if there are problems with the hadoop/hive dependencies,
>>> those teams will inevitably ignore filed bug reports, for the same
>>> reason the spark team will probably close 1.6-related JIRAs as WONTFIX.
>>> WONTFIX responses for the Hadoop 2.x line include any compatibility
>>> issues with Java 9+. Do bear that in mind. It's not been tested, it has
>>> dependencies on artifacts we know are incompatible, and as far as the
>>> Hadoop project is concerned, people should move to branch 3 if they
>>> want to run on a modern version of Java.
>>>
>>> It would be really, really good if the published spark maven artefacts
>>> (a) included the spark-hadoop-cloud JAR and (b) were dependent upon
>>> hadoop 3.x. That way, people doing things with their own projects will
>>> get up-to-date dependencies and won't get WONTFIX responses themselves.
>>>
>>> -Steve
>>>
>>> PS: Discussion on hadoop-dev @ making Hadoop 2.10 the official "last
>>> ever" branch-2 release and then declaring its predecessors EOL; 2.10
>>> will be the transition release.
>>>
>>> On Sun, Nov 17, 2019 at 1:50 AM Cheng Lian <lian.cs....@gmail.com>
>>> wrote:
>>>
>>> Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I
>>> thought the original proposal was to replace Hive 1.2 with Hive 2.3,
>>> which seemed risky, and therefore we only introduced Hive 2.3 under the
>>> hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally
>>> wrong here...
>>>
>>> Sean, Yuming's PR https://github.com/apache/spark/pull/26533 showed
>>> that Hadoop 2 + Hive 2 + JDK 11 looks promising. My major motivation is
>>> not demand but risk control: coupling the Hive 2.3, Hadoop 3.2, and JDK
>>> 11 upgrades together looks too risky.
>>>
>>> On Sat, Nov 16, 2019 at 4:03 AM Sean Owen <sro...@gmail.com> wrote:
>>>
>>> I'd prefer simply not making Hadoop 3 the default until 3.1+, rather
>>> than introduce yet another build combination. Does Hadoop 2 + Hive 2
>>> work, and is there demand for it?
>>>
>>> On Sat, Nov 16, 2019 at 3:52 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>> >
>>> > Do we have a limit on the number of pre-built distributions? Seems
>>> > this time we need:
>>> > 1. hadoop 2.7 + hive 1.2
>>> > 2. hadoop 2.7 + hive 2.3
>>> > 3. hadoop 3 + hive 2.3
>>> >
>>> > AFAIK we always build with JDK 8 (but make it JDK 11 compatible), so
>>> > we don't need to add the JDK version to the combination.
>>> >
>>> > On Sat, Nov 16, 2019 at 4:05 PM Dongjoon Hyun
>>> > <dongjoon.h...@gmail.com> wrote:
>>> >>
>>> >> Thank you for the suggestion.
>>> >>
>>> >> Having a `hive-2.3` profile sounds good to me because it's
>>> >> orthogonal to Hadoop 3. IIRC, it was originally proposed that way,
>>> >> but we put it under `hadoop-3.2` to avoid adding new profiles at the
>>> >> time.
>>> >>
>>> >> And I'm wondering if you are considering additional pre-built
>>> >> distributions and Jenkins jobs.
>>> >>
>>> >> Bests,
>>> >> Dongjoon.