Hmm, what exactly did you mean by "remove the usage of forked `hive` in Apache Spark 3.0 completely officially"? I thought you wanted to remove the forked Hive 1.2 dependencies completely, no? As long as we still keep Hive 1.2 in Spark 3.0, I'm fine with that. I personally don't have a particular preference between using Hive 1.2 or 2.3 as the default Hive version. After all, end users and vendors who need a particular version combination can always build Spark with the proper profiles themselves.
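For example, something along these lines should work once both profiles exist (a sketch only: I'm assuming the `hive-2.3` profile already in master and the `hive-1.2` profile proposed in this thread, and the exact flags may differ):

    # Build against Apache Hive 2.3.x (the hive-2.3 profile discussed in this thread)
    ./build/mvn -Phive -Phive-thriftserver -Phive-2.3 -DskipTests clean package

    # Build against the forked Hive 1.2, if a hive-1.2 profile is kept as an opt-in
    ./build/mvn -Phive -Phive-thriftserver -Phive-1.2 -DskipTests clean package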
And thanks for clarifying the Hive 2.3.5 issue. I didn't notice that it's due to the folder name.

On Tue, Nov 19, 2019 at 11:15 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> BTW, `hive.version.short` is a directory name. We are using 2.3.6 only.
>
> For the directory names, we use '1.2.1' and '2.3.5' because we just delayed renaming the directories until the 3.0.0 deadline to minimize the diff.
>
> We can rename them immediately if we want.
>
> On Tue, Nov 19, 2019 at 11:11 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>
>> Hi, Cheng.
>>
>> This is irrelevant to JDK11 and Hadoop 3; I'm talking about the JDK8 world. If we consider them, it looks like the following.
>>
>> +------------+-----------------+-------------------+
>> |            | Hive 1.2.1 fork | Apache Hive 2.3.6 |
>> +------------+-----------------+-------------------+
>> | Legitimate |        X        |         O         |
>> | JDK11      |        X        |         O         |
>> | Hadoop3    |        X        |         O         |
>> | Hadoop2    |        O        |         O         |
>> | Functions  |    Baseline     |       More        |
>> | Bug fixes  |    Baseline     |       More        |
>> +------------+-----------------+-------------------+
>>
>> To stabilize Spark's Hive 2.3 usage, we should use it ourselves (including Jenkins/GitHub Actions/AppVeyor).
>>
>> For me, the AS-IS 3.0 is not enough for that. Following your advice, to give more visibility to the whole community:
>>
>> 1. We need to provide an additional `hadoop-2.7 with Hive 2.3` pre-built distribution.
>> 2. We need to switch our default Hive usage to 2.3 in `master` for 3.1 after the `branch-3.0` branch cut.
>>
>> I know that we have been reluctant to do (1) and (2) due to their burden. But it's time to prepare. Without them, we are going to fall short again and again.
>>
>> Bests,
>> Dongjoon.
>>
>> On Tue, Nov 19, 2019 at 9:26 AM Cheng Lian <lian.cs....@gmail.com> wrote:
>>
>>> Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x minor release to stabilize the Hive 2.3 code paths before retiring the Hive 1.2 fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still buggy in terms of JDK 11 support. (BTW, I just found that our root POM is referring to both Hive 2.3.6 and 2.3.5 at the moment, see here <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L135> and here <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L2927>.)
>>>
>>> Again, I'm happy to get rid of ancient legacy dependencies like Hadoop 2.7 and the Hive 1.2 fork, but I do believe that we need a safety net for Spark 3.0. As for preview releases, I'm afraid their visibility is not good enough to cover such major upgrades.
>>>
>>> On Tue, Nov 19, 2019 at 8:39 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>
>>>> Thank you for the feedback, Hyukjin and Sean.
>>>>
>>>> I proposed `preview-2` for that purpose, but I'm also +1 for doing that at 3.1 if we can make a decision to eliminate the illegitimate Hive fork reference immediately after the `branch-3.0` cut.
>>>>
>>>> Sean, I'm referencing Cheng Lian's email for the status of `hadoop-2.7`:
>>>>
>>>> - https://lists.apache.org/thread.html/623dd9a6d4e951daeec985feffede12c7b419e03c2965018de7a72f1@%3Cdev.spark.apache.org%3E
>>>>
>>>> The way I see this, it's not a user problem. The Apache Spark community hasn't tried to drop the illegitimate Hive fork yet. We need to drop it ourselves because we created it; it's our bad.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>> On Tue, Nov 19, 2019 at 5:06 AM Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> Just to clarify, as even I have lost the details over time: hadoop-2.7 works with hive-2.3? It isn't tied to hadoop-3.2?
>>>>> Roughly how much risk is there in using the Hive 1.x fork over Hive 2.x, for end users using Hive via Spark?
>>>>> I don't have a strong opinion, other than sharing the view that we have to dump the Hive 1.x fork at the first opportunity. The question is simply how much risk that entails, keeping in mind that Spark 3.0 is already something that people understand works differently. We can accept some behavior changes.
>>>>>
>>>>> On Mon, Nov 18, 2019 at 11:11 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>> >
>>>>> > Hi, All.
>>>>> >
>>>>> > First of all, I want to put this as a policy issue instead of a technical issue. Also, this is orthogonal to the `hadoop` version discussion.
>>>>> >
>>>>> > The Apache Spark community has kept (not maintained) the forked Apache Hive 1.2.1 because there were no other options before. As we can see in SPARK-20202, it's not a desirable situation among Apache projects.
>>>>> >
>>>>> > https://issues.apache.org/jira/browse/SPARK-20202
>>>>> >
>>>>> > Also, please note that we `kept`, not `maintained`, it because we know it's not good. There have been several attempts to update that forked repository for several reasons (Hadoop 3 support is one example), but those attempts were also turned down.
>>>>> >
>>>>> > From Apache Spark 3.0, it seems that we have a new feasible option: the `hive-2.3` profile. What about moving forward further in this direction?
>>>>> >
>>>>> > For example, can we remove the usage of forked `hive` in Apache Spark 3.0 completely officially? If someone still needs the forked `hive`, we can have a `hive-1.2` profile. Of course, it should not be the default profile in the community.
>>>>> >
>>>>> > I want to say this is a goal we should achieve someday. If we don't do anything, nothing happens. At least we need to prepare for this. Without any preparation, Spark 3.1+ will be the same.
>>>>> >
>>>>> > Shall we focus on what our problems with Hive 2.3.6 are? If the only reason is that we haven't used it before, we can release another `3.0.0-preview` for that.
>>>>> >
>>>>> > Bests,
>>>>> > Dongjoon.
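P.S. For anyone who wants to try the combination in Dongjoon's item (1) ahead of an official pre-built distribution, here is a rough sketch. dev/make-distribution.sh is the existing packaging script; the profile names follow this thread, and the exact flags and output name may differ:

    # Build a Hadoop 2.7 + Hive 2.3 distribution
    # (assuming the hive-2.3 profile also applies with -Phadoop-2.7)
    ./dev/make-distribution.sh --name hadoop2.7-hive2.3 --tgz \
        -Phadoop-2.7 -Phive -Phive-thriftserver -Phive-2.3

    # Sanity-check which Hive the resulting distribution actually bundles
    tar tzf spark-*-bin-hadoop2.7-hive2.3.tgz | grep hive-exec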