Yes, it does. I meant SPARK-20202. Thanks. I understand that it can be considered like a Scala version issue. That's why I put this as a `policy` issue from the beginning.
> First of all, I want to put this as a policy issue instead of a technical issue.

From the policy perspective, we should remove this immediately once we have a solution to fix it. For now, I set the `Target Versions` of SPARK-20202 to `3.1.0` according to the current discussion status.

https://issues.apache.org/jira/browse/SPARK-20202

And, if there are no other issues, I'll create a PR to remove it from the `master` branch when we cut `branch-3.0`.

For the additional `hadoop-2.7 with Hive 2.3` pre-built distribution, what do you think, Sean? The preparation has already started in another email thread, and I believe it is a keystone for proving the stability of `Hive 2.3` (which Cheng/Hyukjin/you asked for).

Bests,
Dongjoon.

On Tue, Nov 19, 2019 at 2:09 PM Cheng Lian <lian.cs....@gmail.com> wrote:

> It's kinda like a Scala version upgrade. Historically, we only remove
> the support of an older Scala version when the newer version has proven
> to be stable after one or more Spark minor versions.
>
> On Tue, Nov 19, 2019 at 2:07 PM Cheng Lian <lian.cs....@gmail.com> wrote:
>
>> Hmm, what exactly did you mean by "officially remove the usage of the
>> forked `hive` in Apache Spark 3.0 completely"? I thought you wanted to
>> remove the forked Hive 1.2 dependencies completely, no? As long as we
>> still keep Hive 1.2 in Spark 3.0, I'm fine with that. I personally
>> don't have a particular preference between using Hive 1.2 or 2.3 as
>> the default Hive version. After all, end-users and providers who need
>> a particular version combination can always build Spark with the
>> proper profiles themselves.
>>
>> And thanks for clarifying the Hive 2.3.5 issue. I didn't notice that
>> it was due to the folder name.
>>
>> On Tue, Nov 19, 2019 at 11:15 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
>> wrote:
>>
>>> BTW, `hive.version.short` is a directory name. We are using 2.3.6 only.
>>>
>>> For the directory names, we use '1.2.1' and '2.3.5' because we delayed
>>> renaming the directories until the 3.0.0 deadline to minimize the diff.
>>>
>>> We can replace them immediately if we want to, right now.
>>>
>>> On Tue, Nov 19, 2019 at 11:11 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>> wrote:
>>>
>>>> Hi, Cheng.
>>>>
>>>> This is irrelevant to JDK11 and Hadoop 3. I'm talking about the JDK8
>>>> world. If we consider them, it looks like the following.
>>>>
>>>> +----------+-----------------+-------------------+
>>>> |          | Hive 1.2.1 fork | Apache Hive 2.3.6 |
>>>> +----------+-----------------+-------------------+
>>>> |Legitimate|        X        |         O         |
>>>> |JDK11     |        X        |         O         |
>>>> |Hadoop3   |        X        |         O         |
>>>> |Hadoop2   |        O        |         O         |
>>>> |Functions |     Baseline    |        More       |
>>>> |Bug fixes |     Baseline    |        More       |
>>>> +----------+-----------------+-------------------+
>>>>
>>>> To stabilize Spark's Hive 2.3 usage, we should use it ourselves
>>>> (including Jenkins/GitHub Actions/AppVeyor).
>>>>
>>>> For me, the AS-IS 3.0 is not enough for that. Following your advice,
>>>> to give more visibility to the whole community:
>>>>
>>>> 1. We need to provide an additional `hadoop-2.7 with Hive 2.3`
>>>> pre-built distribution.
>>>> 2. We need to switch our default Hive usage to 2.3 in `master` for
>>>> 3.1 after the `branch-3.0` cut.
>>>>
>>>> I know that we have been reluctant to do (1) and (2) because of the
>>>> burden, but it's time to prepare. Without them, we will come up
>>>> short again and again.
>>>>
>>>> Bests,
>>>> Dongjoon.
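(For reference, producing the pre-built distribution in (1) would look roughly like the command below. This is only a sketch based on the profiles named in this thread; the distribution name is illustrative, and the exact profile names should be checked against the build scripts at release time.)

    # Build a Hadoop 2.7 + Hive 2.3 binary distribution (illustrative).
    ./dev/make-distribution.sh --name hadoop2.7-hive2.3 --tgz \
        -Phadoop-2.7 -Phive -Phive-thriftserver -Phive-2.3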
>>>>
>>>> On Tue, Nov 19, 2019 at 9:26 AM Cheng Lian <lian.cs....@gmail.com>
>>>> wrote:
>>>>
>>>>> Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x
>>>>> minor release to stabilize the Hive 2.3 code paths before retiring
>>>>> the Hive 1.2 fork. Even today, the Hive 2.3.6 version bundled in
>>>>> Spark 3.0 is still buggy in terms of JDK 11 support. (BTW, I just
>>>>> found that our root POM is referring to both Hive 2.3.6 and 2.3.5
>>>>> at the moment, see here
>>>>> <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L135>
>>>>> and here
>>>>> <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L2927>
>>>>> .)
>>>>>
>>>>> Again, I'm happy to get rid of ancient legacy dependencies like
>>>>> Hadoop 2.7 and the Hive 1.2 fork, but I do believe that we need a
>>>>> safety net for Spark 3.0. For preview releases, I'm afraid their
>>>>> visibility is not good enough to cover such major upgrades.
>>>>>
>>>>> On Tue, Nov 19, 2019 at 8:39 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thank you for the feedback, Hyukjin and Sean.
>>>>>>
>>>>>> I proposed `preview-2` for that purpose, but I'm also +1 for doing
>>>>>> it in 3.1 if we can make a decision to eliminate the illegitimate
>>>>>> Hive fork reference immediately after the `branch-3.0` cut.
>>>>>>
>>>>>> Sean, I'm referencing Cheng Lian's email for the status of
>>>>>> `hadoop-2.7`:
>>>>>>
>>>>>> -
>>>>>> https://lists.apache.org/thread.html/623dd9a6d4e951daeec985feffede12c7b419e03c2965018de7a72f1@%3Cdev.spark.apache.org%3E
>>>>>>
>>>>>> The way I see this, it's not a user problem. The Apache Spark
>>>>>> community hasn't tried to drop the illegitimate Hive fork yet.
>>>>>> We need to drop it ourselves because we created it, and it's our
>>>>>> bad.
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>> On Tue, Nov 19, 2019 at 5:06 AM Sean Owen <sro...@gmail.com> wrote:
>>>>>>
>>>>>>> Just to clarify, as even I have lost the details over time:
>>>>>>> hadoop-2.7 works with hive-2.3? It isn't tied to hadoop-3.2?
>>>>>>> Roughly how much risk is there in using the Hive 1.x fork over
>>>>>>> Hive 2.x, for end users using Hive via Spark?
>>>>>>> I don't have a strong opinion, other than sharing the view that
>>>>>>> we have to dump the Hive 1.x fork at the first opportunity. The
>>>>>>> question is simply how much risk that entails. Keep in mind that
>>>>>>> Spark 3.0 is already something people understand works
>>>>>>> differently; we can accept some behavior changes.
>>>>>>>
>>>>>>> On Mon, Nov 18, 2019 at 11:11 PM Dongjoon Hyun <
>>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>> >
>>>>>>> > Hi, All.
>>>>>>> >
>>>>>>> > First of all, I want to put this as a policy issue instead of a
>>>>>>> > technical issue. Also, this is orthogonal to the `hadoop`
>>>>>>> > version discussion.
>>>>>>> >
>>>>>>> > The Apache Spark community kept (not maintained) the forked
>>>>>>> > Apache Hive 1.2.1 because there were no other options before.
>>>>>>> > As we can see in SPARK-20202, it's not a desirable situation
>>>>>>> > among Apache projects.
>>>>>>> >
>>>>>>> > https://issues.apache.org/jira/browse/SPARK-20202
>>>>>>> >
>>>>>>> > Also, please note that we `kept`, not `maintained`, it because
>>>>>>> > we know it's not good. There were several attempts to update
>>>>>>> > that forked repository for several reasons (Hadoop 3 support is
>>>>>>> > one example), but those attempts were also turned down.
>>>>>>> >
>>>>>>> > From Apache Spark 3.0, it seems that we have a new feasible
>>>>>>> > option, the `hive-2.3` profile. What about moving further in
>>>>>>> > this direction?
>>>>>>> >
>>>>>>> > For example, can we officially remove the usage of the forked
>>>>>>> > `hive` in Apache Spark 3.0 completely? If someone still needs
>>>>>>> > to use the forked `hive`, we can have a `hive-1.2` profile. Of
>>>>>>> > course, it should not be a default profile in the community.
>>>>>>> >
>>>>>>> > I want to say this is a goal we should achieve someday.
>>>>>>> > If we don't do anything, nothing happens. At least we need to
>>>>>>> > prepare for this. Without any preparation, Spark 3.1+ will be
>>>>>>> > the same.
>>>>>>> >
>>>>>>> > Shall we focus on what our problems with Hive 2.3.6 are?
>>>>>>> > If the only reason is that we didn't use it before, we can
>>>>>>> > release another `3.0.0-preview` for that.
>>>>>>> >
>>>>>>> > Bests,
>>>>>>> > Dongjoon.
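(For anyone who would still need the forked dependency while it remains available, opting into the proposed `hive-1.2` profile would be an explicit, non-default build step along these lines. This is a sketch assuming the profile wiring discussed above; the actual flags may differ once the profile lands.)

    # Build against the forked Hive 1.2 explicitly (illustrative; non-default).
    ./build/mvn -DskipTests -Phadoop-2.7 -Phive -Phive-thriftserver \
        -Phive-1.2 clean package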