Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x minor
release to stabilize the Hive 2.3 code paths before retiring the Hive 1.2
fork. Even today, the Hive 2.3.6 bundled in Spark 3.0 still has bugs in its
JDK 11 support. (BTW, I just found that our root POM refers to both Hive
2.3.6 and 2.3.5 at the moment; see here
<https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L135>
and here
<https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L2927>
.)
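
To illustrate the shape of that inconsistency, it boils down to two
`hive.version` properties that have drifted apart. The snippet below is a
paraphrase rather than a verbatim copy of the POM, so please check the two
links above for the exact lines and their enclosing profiles:

    <!-- one declaration, used by the Hive 2.3 code path -->
    <hive.version>2.3.6</hive.version>
    ...
    <!-- a second, stale declaration elsewhere in the same root POM -->
    <hive.version>2.3.5</hive.version>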

Again, I'm happy to get rid of ancient legacy dependencies like Hadoop 2.7
and the Hive 1.2 fork, but I do believe that we need a safety net for Spark
3.0. As for the preview releases, I'm afraid their visibility is not good
enough to cover such major upgrades.
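
For anyone who wants to exercise both code paths while the fork is still in
the tree, the switch would just be a Maven profile. A rough sketch, assuming
the `hive-2.3` / `hive-1.2` profile names that Dongjoon proposes below (the
exact names and defaults are still under discussion, so treat these commands
as illustrative rather than authoritative):

    # build against the bundled Hive 2.3.x code path
    ./build/mvn -Phadoop-2.7 -Phive -Phive-thriftserver -Phive-2.3 \
      -DskipTests clean package

    # explicitly opt back into the legacy forked Hive 1.2
    ./build/mvn -Phadoop-2.7 -Phive -Phive-thriftserver -Phive-1.2 \
      -DskipTests clean package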

On Tue, Nov 19, 2019 at 8:39 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
wrote:

> Thank you for the feedback, Hyukjin and Sean.
>
> I proposed `preview-2` for that purpose, but I'm also +1 on doing that in
> 3.1 if we can decide to eliminate the illegitimate Hive fork reference
> immediately after the `branch-3.0` cut.
>
> Sean, I'm referencing Cheng Lian's email for the status of `hadoop-2.7`.
>
> -
> https://lists.apache.org/thread.html/623dd9a6d4e951daeec985feffede12c7b419e03c2965018de7a72f1@%3Cdev.spark.apache.org%3E
>
> The way I see it, this is not a user problem. The Apache Spark community
> hasn't tried to drop the illegitimate Hive fork yet.
> We need to drop it ourselves because we created it, and it's our bad.
>
> Bests,
> Dongjoon.
>
>
>
> On Tue, Nov 19, 2019 at 5:06 AM Sean Owen <sro...@gmail.com> wrote:
>
>> Just to clarify, as even I have lost the details over time: hadoop-2.7
>> works with hive-2.3? It isn't tied to hadoop-3.2?
>> Roughly how much risk is there in using the Hive 1.x fork over Hive
>> 2.x, for end users using Hive via Spark?
>> I don't have a strong opinion, other than sharing the view that we
>> have to dump the Hive 1.x fork at the first opportunity. The question
>> is simply how much risk that entails, keeping in mind that Spark 3.0
>> is already something people understand works differently, and we can
>> accept some behavior changes.
>>
>> On Mon, Nov 18, 2019 at 11:11 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>> wrote:
>> >
>> > Hi, All.
>> >
>> > First of all, I want to put this forward as a policy issue instead of
>> > a technical issue.
>> > Also, this is orthogonal to the `hadoop` version discussion.
>> >
>> > The Apache Spark community has kept (not maintained) the forked Apache
>> > Hive 1.2.1 because there was no other option before. As we can see in
>> > SPARK-20202, it's not a desirable situation among Apache projects.
>> >
>> >     https://issues.apache.org/jira/browse/SPARK-20202
>> >
>> > Also, please note that I say `kept`, not `maintained`, because we know
>> > it's not good.
>> > There have been several attempts to update that forked repository
>> > for various reasons (Hadoop 3 support is one example),
>> > but those attempts were also turned down.
>> >
>> > As of Apache Spark 3.0, it seems that we have a new feasible option:
>> > the `hive-2.3` profile. What about moving further in this direction?
>> >
>> > For example, can we officially and completely remove the usage of the
>> > forked `hive` in Apache Spark 3.0? If someone still needs to use the
>> > forked `hive`, we can have a `hive-1.2` profile. Of course, it should
>> > not be a default profile in the community.
>> >
>> > I want to say this is a goal we should achieve someday.
>> > If we don't do anything, nothing will happen. At the least, we need to
>> > prepare for this.
>> > Without any preparation, Spark 3.1+ will be the same.
>> >
>> > Shall we focus on what our actual problems with Hive 2.3.6 are?
>> > If the only reason is that we haven't used it before, we can release
>> > another `3.0.0-preview` for that.
>> >
>> > Bests,
>> > Dongjoon.
>>
>
