Do we have a limitation on the number of pre-built distributions? It seems this time we need:

1. Hadoop 2.7 + Hive 1.2
2. Hadoop 2.7 + Hive 2.3
3. Hadoop 3 + Hive 2.3

AFAIK we have always built with JDK 8 (while keeping it JDK 11 compatible), so we don't need to add the JDK version to the combination.
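For illustration only, a rough sketch of how those three packages might be produced with the existing make-distribution.sh script. The hive-1.2 and hive-2.3 profile names are the ones proposed in this thread and are assumed here, not final; the other profiles (hadoop-2.7, hadoop-3.2, hive, hive-thriftserver, yarn) already exist in the build:

    # Combination 1: Hadoop 2.7 + Hive 1.2 (assumed hive-1.2 profile)
    ./dev/make-distribution.sh --name hadoop2.7-hive1.2 --tgz \
        -Phadoop-2.7 -Phive-1.2 -Phive -Phive-thriftserver -Pyarn

    # Combination 2: Hadoop 2.7 + Hive 2.3 (assumed hive-2.3 profile)
    ./dev/make-distribution.sh --name hadoop2.7-hive2.3 --tgz \
        -Phadoop-2.7 -Phive-2.3 -Phive -Phive-thriftserver -Pyarn

    # Combination 3: Hadoop 3.2 + Hive 2.3
    ./dev/make-distribution.sh --name hadoop3.2-hive2.3 --tgz \
        -Phadoop-3.2 -Phive-2.3 -Phive -Phive-thriftserver -Pyarn

All three would still be built with JDK 8, per the note above, so the JDK does not multiply the matrix.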
On Sat, Nov 16, 2019 at 4:05 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> Thank you for the suggestion.
>
> Having a `hive-2.3` profile sounds good to me because it's orthogonal to Hadoop 3.
> IIRC, it was originally proposed that way, but we put it under `hadoop-3.2` to
> avoid adding new profiles at that time.
>
> And I'm wondering if you are considering additional pre-built distributions and
> Jenkins jobs.
>
> Bests,
> Dongjoon.
>
> On Fri, Nov 15, 2019 at 1:38 PM Cheng Lian <lian.cs....@gmail.com> wrote:
>
>> Cc Yuming, Steve, and Dongjoon
>>
>> On Fri, Nov 15, 2019 at 10:37 AM Cheng Lian <lian.cs....@gmail.com> wrote:
>>
>>> Similar to Xiao, my major concern about making Hadoop 3.2 the default Hadoop
>>> version is quality control. The current hadoop-3.2 profile covers too many
>>> major component upgrades, i.e.:
>>>
>>> - Hadoop 3.2
>>> - Hive 2.3
>>> - JDK 11
>>>
>>> We have already found and fixed some feature and performance regressions
>>> related to these upgrades. Empirically, I'm not surprised at all if more
>>> regressions are lurking somewhere. On the other hand, we do want the
>>> community's help to evaluate and stabilize these new changes. Following
>>> that, I'd like to propose:
>>>
>>> 1. Introduce a new profile hive-2.3 to enable (hopefully) less risky
>>>    Hadoop/Hive/JDK version combinations.
>>>
>>>    This new profile allows us to decouple Hive 2.3 from the hadoop-3.2
>>>    profile, so that users can try out less risky Hadoop/Hive/JDK
>>>    combinations: if you only want Hive 2.3 and/or JDK 11, you don't need to
>>>    face potential regressions introduced by the Hadoop 3.2 upgrade.
>>>
>>>    Yuming Wang has already sent out PR #26533
>>>    <https://github.com/apache/spark/pull/26533> to exercise the Hadoop 2.7 +
>>>    Hive 2.3 + JDK 11 combination (this PR does not have the hive-2.3 profile
>>>    yet), and the result looks promising: the Kafka streaming and Arrow
>>>    related test failures should be irrelevant to the topic discussed here.
>>>
>>>    After decoupling Hive 2.3 and Hadoop 3.2, I don't think it makes much
>>>    difference whether Hadoop 2.7 or Hadoop 3.2 is the default Hadoop
>>>    version. Users who are still running Hadoop 2.x in production will have
>>>    to use a hadoop-provided prebuilt package or build Spark 3.0 against
>>>    their own 2.x version anyway. It does make a difference for cloud users
>>>    who don't use Hadoop at all, though. And it probably also helps to
>>>    stabilize the Hadoop 3.2 code path faster, since our PR builder will
>>>    exercise it regularly.
>>>
>>> 2. Defer the Hadoop 2.x upgrade to Spark 3.1+.
>>>
>>>    I personally do want to bump our Hadoop 2.x version to 2.9 or even 2.10.
>>>    Steve has already stated the benefits very well. My worry here is still
>>>    quality control: Spark 3.0 already has tons of changes and major
>>>    component version upgrades that are subject to all kinds of known and
>>>    hidden regressions. Having Hadoop 2.7 there provides us a safety net,
>>>    since it's proven to be stable. To me, it's much less risky to upgrade
>>>    Hadoop 2.7 to 2.9/2.10 after we stabilize the Hadoop 3.2/Hive 2.3
>>>    combinations in the next 1 or 2 Spark 3.x releases.
>>>
>>> Cheng
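To make the decoupling Cheng describes concrete, here is a rough, non-authoritative sketch of the kinds of builds it would allow. The hive-2.3 profile name is assumed from the proposal above; the hadoop-provided profile and the hadoop.version property already exist in the build, and the JDK path and 2.8.5 version are only placeholders:

    # Less risky combination: stay on Hadoop 2.7, opt in to Hive 2.3 (and optionally JDK 11),
    # roughly the combination exercised in PR #26533
    export JAVA_HOME=/path/to/jdk-11   # hypothetical JDK 11 location
    ./build/mvn -Phadoop-2.7 -Phive-2.3 -Phive -Phive-thriftserver -DskipTests clean package

    # Hadoop 2.x production users: a "Hadoop free" build that runs against the cluster's Hadoop,
    # or a build against a specific in-house 2.x version
    ./dev/make-distribution.sh --name hadoop-provided --tgz \
        -Phadoop-provided -Phive -Phive-thriftserver -Pyarn
    ./build/mvn -Phadoop-2.7 -Dhadoop.version=2.8.5 -Phive -DskipTests clean package

The point of the sketch is only that none of these commands pulls in the Hadoop 3.2 upgrade, which is the decoupling being proposed.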
>>> On Mon, Nov 4, 2019 at 11:24 AM Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>>> I get that CDH and HDP backport a lot and in that way left 2.7 behind.
>>>> But they kept the public APIs stable at the 2.7 level, because that's kind
>>>> of the point. Aren't those the Hadoop APIs Spark uses?
>>>>
>>>> On Mon, Nov 4, 2019 at 10:07 AM Steve Loughran
>>>> <ste...@cloudera.com.invalid> wrote:
>>>>
>>>>> On Mon, Nov 4, 2019 at 12:39 AM Nicholas Chammas
>>>>> <nicholas.cham...@gmail.com> wrote:
>>>>>
>>>>>> On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran
>>>>>> <ste...@cloudera.com.invalid> wrote:
>>>>>>
>>>>>>> It would be really good if the Spark distributions shipped with later
>>>>>>> versions of the Hadoop artifacts.
>>>>>>
>>>>>> I second this. If we need to keep a Hadoop 2.x profile around, why not
>>>>>> make it Hadoop 2.8 or something newer?
>>>>>
>>>>> Go for 2.9.
>>>>>
>>>>>> Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>
>>>>>>> Given that the latest HDP 2.x is still Hadoop 2.7, bumping the Hadoop 2
>>>>>>> profile to the latest would probably be an issue for us.
>>>>>>
>>>>>> When was the last time HDP 2.x bumped their minor version of Hadoop? Do
>>>>>> we want to wait for them to bump to Hadoop 2.8 before we do the same?
>>>>>
>>>>> The internal builds of CDH and HDP are not those of ASF 2.7.x. A really
>>>>> large proportion of the later branch-2 patches are backported; 2.7 was
>>>>> left behind a long time ago.