So, are we able to conclude our plans as below?

1. In Spark 3.0, we release:
   - Hadoop 3.2 + Hive 2.3 + JDK8 build that also works with JDK 11
   - Hadoop 2.7 + Hive 2.3 + JDK8 build that also works with JDK 11
   - Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)
2. In Spark 3.1, we target:
   - Hadoop 3.2 + Hive 2.3 + JDK8 build that also works with JDK 11
   - Hadoop 2.7 + Hive 2.3 + JDK8 build that also works with JDK 11 (default)

3. Avoid removing the "Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)" combo right away after cutting branch-3.0, to see whether Hive 2.3 is considered stable in general. I roughly suspect that would be a couple of months after the Spark 3.0 release (?).

BTW, maybe we should officially note that the "Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)" combination is deprecated in Spark 3.0 anyway.

On Wed, Nov 20, 2019 at 9:52 AM Cheng Lian <lian.cs....@gmail.com> wrote:

> Thanks for taking care of this, Dongjoon!
>
> We can target SPARK-20202 to 3.1.0, but I don't think we should do it immediately after cutting branch-3.0. The Hive 1.2 code paths can only be removed once the Hive 2.3 code paths are proven to be stable. If they turn out to be buggy in Spark 3.1, we may want to further postpone SPARK-20202 to 3.2.0 by then.
>
> On Tue, Nov 19, 2019 at 2:53 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>
>> Yes. It does. I meant SPARK-20202.
>>
>> Thanks. I understand that it can be considered like the Scala version issue. So, that's the reason why I put this as a `policy` issue from the beginning.
>>
>> > First of all, I want to put this as a policy issue instead of a technical issue.
>>
>> From the policy perspective, we should remove this immediately if we have a solution to fix this. For now, I set `Target Versions` of SPARK-20202 to `3.1.0` according to the current discussion status.
>>
>> https://issues.apache.org/jira/browse/SPARK-20202
>>
>> And, if there are no other issues, I'll create a PR to remove it from the `master` branch when we cut `branch-3.0`.
>>
>> For the additional `hadoop-2.7 with Hive 2.3` pre-built distribution, what do you think about this, Sean? The preparation has already started in another email thread, and I believe that is a keystone to prove `Hive 2.3` version stability (which Cheng/Hyukjin/you asked for).
>>
>> Bests,
>> Dongjoon.
>>
>> On Tue, Nov 19, 2019 at 2:09 PM Cheng Lian <lian.cs....@gmail.com> wrote:
>>
>>> It's kinda like a Scala version upgrade. Historically, we only remove the support of an older Scala version when the newer version is proven to be stable after one or more Spark minor versions.
>>>
>>> On Tue, Nov 19, 2019 at 2:07 PM Cheng Lian <lian.cs....@gmail.com> wrote:
>>>
>>>> Hmm, what exactly did you mean by "remove the usage of forked `hive` in Apache Spark 3.0 completely officially"? I thought you wanted to remove the forked Hive 1.2 dependencies completely, no? As long as we still keep Hive 1.2 in Spark 3.0, I'm fine with that. I personally don't have a particular preference between using Hive 1.2 or 2.3 as the default Hive version. After all, end users and providers who need a particular version combination can always build Spark with the proper profiles themselves.
>>>>
>>>> And thanks for clarifying the Hive 2.3.5 issue. I didn't notice that it's due to the folder name.
>>>>
>>>> On Tue, Nov 19, 2019 at 11:15 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>
>>>>> BTW, `hive.version.short` is a directory name. We are using 2.3.6 only.
>>>>>
>>>>> For the directory name, we use '1.2.1' and '2.3.5' because we just delayed renaming the directories until the 3.0.0 deadline to minimize the diff.
>>>>>
>>>>> We can rename them right now if we want.
>>>>>
>>>>> On Tue, Nov 19, 2019 at 11:11 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>
>>>>>> Hi, Cheng.
>>>>>>
>>>>>> This is irrelevant to JDK11 and Hadoop 3. I'm talking about the JDK8 world. If we consider them, the comparison would be the following.
>>>>>>
>>>>>> +----------+-----------------+-------------------+
>>>>>> |          | Hive 1.2.1 fork | Apache Hive 2.3.6 |
>>>>>> +----------+-----------------+-------------------+
>>>>>> |Legitimate|        X        |         O         |
>>>>>> |JDK11     |        X        |         O         |
>>>>>> |Hadoop3   |        X        |         O         |
>>>>>> |Hadoop2   |        O        |         O         |
>>>>>> |Functions |     Baseline    |        More       |
>>>>>> |Bug fixes |     Baseline    |        More       |
>>>>>> +----------+-----------------+-------------------+
>>>>>>
>>>>>> To stabilize Spark's Hive 2.3 usage, we should use it ourselves (including Jenkins/GitHub Actions/AppVeyor).
>>>>>>
>>>>>> For me, the AS-IS 3.0 is not enough for that. Following your advice, to give more visibility to the whole community:
>>>>>>
>>>>>> 1. We need to provide an additional `hadoop-2.7 with Hive 2.3` pre-built distribution.
>>>>>> 2. We need to switch our default Hive usage to 2.3 in `master` for 3.1 after the `branch-3.0` cut.
>>>>>>
>>>>>> I know that we have been reluctant to do (1) and (2) due to their burden. But it's time to prepare. Without them, we are going to come up short again and again.
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>> On Tue, Nov 19, 2019 at 9:26 AM Cheng Lian <lian.cs....@gmail.com> wrote:
>>>>>>
>>>>>>> Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x minor release to stabilize the Hive 2.3 code paths before retiring the Hive 1.2 fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still buggy in terms of JDK 11 support. (BTW, I just found that our root POM is referring to both Hive 2.3.6 and 2.3.5 at the moment, see here <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L135> and here <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L2927>.)
>>>>>>>
>>>>>>> Again, I'm happy to get rid of ancient legacy dependencies like Hadoop 2.7 and the Hive 1.2 fork, but I do believe that we need a safety net for Spark 3.0. For preview releases, I'm afraid their visibility is not good enough to cover such major upgrades.
>>>>>>>
>>>>>>> On Tue, Nov 19, 2019 at 8:39 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thank you for the feedback, Hyukjin and Sean.
>>>>>>>>
>>>>>>>> I proposed `preview-2` for that purpose, but I'm also +1 for doing that in 3.1 if we can make a decision to eliminate the illegitimate Hive fork reference immediately after the `branch-3.0` cut.
>>>>>>>>
>>>>>>>> Sean, I'm referencing Cheng Lian's email for the status of `hadoop-2.7`:
>>>>>>>>
>>>>>>>> - https://lists.apache.org/thread.html/623dd9a6d4e951daeec985feffede12c7b419e03c2965018de7a72f1@%3Cdev.spark.apache.org%3E
>>>>>>>>
>>>>>>>> The way I see this is that it's not a user problem. The Apache Spark community hasn't tried to drop the illegitimate Hive fork yet. We need to drop it ourselves because we created it and it's our bad.
>>>>>>>>
>>>>>>>> Bests,
>>>>>>>> Dongjoon.
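(For reference, a minimal sketch of how the additional "hadoop-2.7 with Hive 2.3" pre-built distribution proposed above might be produced. The profile names follow this thread; the make-distribution.sh flags and the exact profile set are assumptions for illustration and may differ from what the release scripts actually use.)

    # Hypothetical invocation for an extra binary artifact that pairs
    # Hadoop 2.7 with the Apache Hive 2.3 client; adjust profiles as needed.
    ./dev/make-distribution.sh \
      --name hadoop2.7-hive2.3 \
      --tgz \
      -Phadoop-2.7 -Phive -Phive-2.3 -Phive-thriftserver -Pyarn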
>>>>>>>>
>>>>>>>> On Tue, Nov 19, 2019 at 5:06 AM Sean Owen <sro...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Just to clarify, as even I have lost the details over time: hadoop-2.7 works with hive-2.3? It isn't tied to hadoop-3.2?
>>>>>>>>> Roughly how much risk is there in using the Hive 1.x fork over Hive 2.x, for end users using Hive via Spark?
>>>>>>>>> I don't have a strong opinion, other than sharing the view that we have to dump the Hive 1.x fork at the first opportunity. The question is simply how much risk that entails. Keep in mind that Spark 3.0 is already something that people understand works differently; we can accept some behavior changes.
>>>>>>>>>
>>>>>>>>> On Mon, Nov 18, 2019 at 11:11 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>>>> >
>>>>>>>>> > Hi, All.
>>>>>>>>> >
>>>>>>>>> > First of all, I want to put this as a policy issue instead of a technical issue. Also, this is orthogonal to the `hadoop` version discussion.
>>>>>>>>> >
>>>>>>>>> > The Apache Spark community kept (not maintained) the forked Apache Hive 1.2.1 because there were no other options before. As we see in SPARK-20202, it's not a desirable situation among Apache projects.
>>>>>>>>> >
>>>>>>>>> > https://issues.apache.org/jira/browse/SPARK-20202
>>>>>>>>> >
>>>>>>>>> > Also, please note that we `kept` it, not `maintained` it, because we know it's not good. There were several attempts to update that forked repository for various reasons (Hadoop 3 support is one example), but those attempts were also turned down.
>>>>>>>>> >
>>>>>>>>> > As of Apache Spark 3.0, it seems that we have a new feasible option, the `hive-2.3` profile. What about moving further in this direction?
>>>>>>>>> >
>>>>>>>>> > For example, can we remove the usage of the forked `hive` in Apache Spark 3.0 completely and officially? If someone still needs the forked `hive`, we can have a `hive-1.2` profile. Of course, it should not be the default profile in the community.
>>>>>>>>> >
>>>>>>>>> > I want to say this is a goal we should achieve someday. If we don't do anything, nothing happens. At least we need to prepare for this. Without any preparation, Spark 3.1+ will be the same.
>>>>>>>>> >
>>>>>>>>> > Shall we focus on what our problems with Hive 2.3.6 are? If the only reason is that we didn't use it before, we can release another `3.0.0-preview` for that.
>>>>>>>>> >
>>>>>>>>> > Bests,
>>>>>>>>> > Dongjoon.
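(For reference, a minimal sketch of how the Hive profile choice discussed in this thread would look at build time. The `hive-2.3` profile and the proposed `hive-1.2` profile are named in the thread; the remaining flags and profile combination are assumptions for illustration only, not the definitive build configuration.)

    # Build against the Apache Hive 2.3.x client (the legitimate option that
    # also targets JDK 11 and Hadoop 3):
    ./build/mvn -Phadoop-2.7 -Phive -Phive-2.3 -Phive-thriftserver -DskipTests clean package

    # Build against the forked Hive 1.2.1, kept behind an explicit,
    # non-default profile for users who still need it:
    ./build/mvn -Phadoop-2.7 -Phive -Phive-1.2 -Phive-thriftserver -DskipTests clean package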