So, are we able to conclude our plans as below?

1. In Spark 3.0, we release:
   - Hadoop 3.2 + Hive 2.3 + JDK8 build that also works with JDK 11
   - Hadoop 2.7 + Hive 2.3 + JDK8 build that also works with JDK 11
   - Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)
2. In Spark 3.1, we target:
   - Hadoop 3.2 + Hive 2.3 + JDK8 build that also works with JDK 11
   - Hadoop 2.7 + Hive 2.3 + JDK8 build that also works with JDK 11 (default)

3. Avoid removing the "Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)" combo right away after cutting branch-3.0, to see whether Hive 2.3 is considered stable in general. I roughly suspect that would be a couple of months after the Spark 3.0 release (?).

BTW, maybe we should officially note that the "Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)" combination is deprecated in Spark 3.0 anyway.

On Wed, Nov 20, 2019 at 9:52 AM Cheng Lian <lian.cs....@gmail.com> wrote:

> Thanks for taking care of this, Dongjoon!
>
> We can target SPARK-20202 to 3.1.0, but I don't think we should do it immediately after cutting branch-3.0. The Hive 1.2 code paths can only be removed once the Hive 2.3 code paths are proven to be stable. If they turn out to be buggy in Spark 3.1, we may want to further postpone SPARK-20202 to 3.2.0 by then.
>
> On Tue, Nov 19, 2019 at 2:53 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>
>> Yes. It does. I meant SPARK-20202.
>>
>> Thanks. I understand that it can be considered like the Scala version issue. So, that's the reason why I put this as a `policy` issue from the beginning.
>>
>> > First of all, I want to put this as a policy issue instead of a technical issue.
>>
>> From the policy perspective, we should remove this immediately if we have a solution to fix this. For now, I set `Target Versions` of SPARK-20202 to `3.1.0` according to the current discussion status.
>>
>> https://issues.apache.org/jira/browse/SPARK-20202
>>
>> And, if there are no other issues, I'll create a PR to remove it from the `master` branch when we cut `branch-3.0`.
>>
>> For the additional `hadoop-2.7 with Hive 2.3` pre-built distribution, what do you think about this, Sean? The preparation has already started in another email thread, and I believe that is a keystone to prove `Hive 2.3` version stability (which Cheng/Hyukjin/you asked for).
>>
>> Bests,
>> Dongjoon.
>>
>> On Tue, Nov 19, 2019 at 2:09 PM Cheng Lian <lian.cs....@gmail.com> wrote:
>>
>>> It's kinda like a Scala version upgrade. Historically, we only remove the support of an older Scala version when the newer version is proven to be stable after one or more Spark minor versions.
>>>
>>> On Tue, Nov 19, 2019 at 2:07 PM Cheng Lian <lian.cs....@gmail.com> wrote:
>>>
>>>> Hmm, what exactly did you mean by "remove the usage of forked `hive` in Apache Spark 3.0 completely officially"? I thought you wanted to remove the forked Hive 1.2 dependencies completely, no? As long as we still keep Hive 1.2 in Spark 3.0, I'm fine with that. I personally don't have a particular preference between using Hive 1.2 or 2.3 as the default Hive version. After all, end users and providers who need a particular version combination can always build Spark with the proper profiles themselves.
>>>>
>>>> And thanks for clarifying the Hive 2.3.5 issue. I didn't notice that it's due to the folder name.
>>>>
>>>> On Tue, Nov 19, 2019 at 11:15 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>
>>>>> BTW, `hive.version.short` is a directory name. We are using 2.3.6 only.
>>>>>
>>>>> For the directory name, we use '1.2.1' and '2.3.5' because we just delayed renaming the directories until the 3.0.0 deadline to minimize the diff.
>>>>>
>>>>> We can rename them right now if we want.
>>>>>
>>>>> On Tue, Nov 19, 2019 at 11:11 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>
>>>>>> Hi, Cheng.
>>>>>>
>>>>>> This is irrelevant to JDK11 and Hadoop 3. I'm talking about the JDK8 world. If we consider them, the comparison would be the following.
>>>>>>
>>>>>> +----------+-----------------+-------------------+
>>>>>> |          | Hive 1.2.1 fork | Apache Hive 2.3.6 |
>>>>>> +----------+-----------------+-------------------+
>>>>>> |Legitimate|        X        |         O         |
>>>>>> |JDK11     |        X        |         O         |
>>>>>> |Hadoop3   |        X        |         O         |
>>>>>> |Hadoop2   |        O        |         O         |
>>>>>> |Functions |     Baseline    |        More       |
>>>>>> |Bug fixes |     Baseline    |        More       |
>>>>>> +----------+-----------------+-------------------+
>>>>>>
>>>>>> To stabilize Spark's Hive 2.3 usage, we should use it ourselves (including Jenkins/GitHub Actions/AppVeyor).
>>>>>>
>>>>>> For me, the AS-IS 3.0 is not enough for that. Following your advice, to give more visibility to the whole community:
>>>>>>
>>>>>> 1. We need to provide an additional `hadoop-2.7 with Hive 2.3` pre-built distribution.
>>>>>> 2. We need to switch our default Hive usage to 2.3 in `master` for 3.1 after the `branch-3.0` cut.
>>>>>>
>>>>>> I know that we have been reluctant to do (1) and (2) due to their burden. But it's time to prepare. Without them, we are going to come up short again and again.
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>> On Tue, Nov 19, 2019 at 9:26 AM Cheng Lian <lian.cs....@gmail.com> wrote:
>>>>>>
>>>>>>> Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x minor release to stabilize the Hive 2.3 code paths before retiring the Hive 1.2 fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still buggy in terms of JDK 11 support. (BTW, I just found that our root POM is referring to both Hive 2.3.6 and 2.3.5 at the moment, see here <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L135> and here <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L2927>.)
>>>>>>>
>>>>>>> Again, I'm happy to get rid of ancient legacy dependencies like Hadoop 2.7 and the Hive 1.2 fork, but I do believe that we need a safety net for Spark 3.0. For preview releases, I'm afraid their visibility is not good enough to cover such major upgrades.
>>>>>>>
>>>>>>> On Tue, Nov 19, 2019 at 8:39 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thank you for the feedback, Hyukjin and Sean.
>>>>>>>>
>>>>>>>> I proposed `preview-2` for that purpose, but I'm also +1 for doing that in 3.1 if we can make a decision to eliminate the illegitimate Hive fork reference immediately after the `branch-3.0` cut.
>>>>>>>>
>>>>>>>> Sean, I'm referencing Cheng Lian's email for the status of `hadoop-2.7`:
>>>>>>>>
>>>>>>>> - https://lists.apache.org/thread.html/623dd9a6d4e951daeec985feffede12c7b419e03c2965018de7a72f1@%3Cdev.spark.apache.org%3E
>>>>>>>>
>>>>>>>> The way I see this is that it's not a user problem. The Apache Spark community hasn't tried to drop the illegitimate Hive fork yet. We need to drop it ourselves because we created it and it's our bad.
>>>>>>>>
>>>>>>>> Bests,
>>>>>>>> Dongjoon.
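(For reference, a minimal sketch of how the additional "hadoop-2.7 with Hive 2.3" pre-built distribution proposed above might be produced. The profile names follow this thread; the make-distribution.sh flags and the exact profile set are assumptions for illustration and may differ from what the release scripts actually use.)

    # Hypothetical invocation for an extra binary artifact that pairs
    # Hadoop 2.7 with the Apache Hive 2.3 client; adjust profiles as needed.
    ./dev/make-distribution.sh \
      --name hadoop2.7-hive2.3 \
      --tgz \
      -Phadoop-2.7 -Phive -Phive-2.3 -Phive-thriftserver -Pyarn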
>>>>>>>>
>>>>>>>> On Tue, Nov 19, 2019 at 5:06 AM Sean Owen <sro...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Just to clarify, as even I have lost the details over time: hadoop-2.7 works with hive-2.3? It isn't tied to hadoop-3.2?
>>>>>>>>> Roughly how much risk is there in using the Hive 1.x fork over Hive 2.x, for end users using Hive via Spark?
>>>>>>>>> I don't have a strong opinion, other than sharing the view that we have to dump the Hive 1.x fork at the first opportunity. The question is simply how much risk that entails. Keep in mind that Spark 3.0 is already something that people understand works differently; we can accept some behavior changes.
>>>>>>>>>
>>>>>>>>> On Mon, Nov 18, 2019 at 11:11 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>>>> >
>>>>>>>>> > Hi, All.
>>>>>>>>> >
>>>>>>>>> > First of all, I want to put this as a policy issue instead of a technical issue. Also, this is orthogonal to the `hadoop` version discussion.
>>>>>>>>> >
>>>>>>>>> > The Apache Spark community kept (not maintained) the forked Apache Hive 1.2.1 because there were no other options before. As we see in SPARK-20202, it's not a desirable situation among Apache projects.
>>>>>>>>> >
>>>>>>>>> > https://issues.apache.org/jira/browse/SPARK-20202
>>>>>>>>> >
>>>>>>>>> > Also, please note that we `kept` it, not `maintained` it, because we know it's not good. There were several attempts to update that forked repository for various reasons (Hadoop 3 support is one example), but those attempts were also turned down.
>>>>>>>>> >
>>>>>>>>> > As of Apache Spark 3.0, it seems that we have a new feasible option, the `hive-2.3` profile. What about moving further in this direction?
>>>>>>>>> >
>>>>>>>>> > For example, can we remove the usage of the forked `hive` in Apache Spark 3.0 completely and officially? If someone still needs the forked `hive`, we can have a `hive-1.2` profile. Of course, it should not be the default profile in the community.
>>>>>>>>> >
>>>>>>>>> > I want to say this is a goal we should achieve someday. If we don't do anything, nothing happens. At least we need to prepare for this. Without any preparation, Spark 3.1+ will be the same.
>>>>>>>>> >
>>>>>>>>> > Shall we focus on what our problems with Hive 2.3.6 are? If the only reason is that we didn't use it before, we can release another `3.0.0-preview` for that.
>>>>>>>>> >
>>>>>>>>> > Bests,
>>>>>>>>> > Dongjoon.
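(For reference, a minimal sketch of how the Hive profile choice discussed in this thread would look at build time. The `hive-2.3` profile and the proposed `hive-1.2` profile are named in the thread; the remaining flags and profile combination are assumptions for illustration only, not the definitive build configuration.)

    # Build against the Apache Hive 2.3.x client (the legitimate option that
    # also targets JDK 11 and Hadoop 3):
    ./build/mvn -Phadoop-2.7 -Phive -Phive-2.3 -Phive-thriftserver -DskipTests clean package

    # Build against the forked Hive 1.2.1, kept behind an explicit,
    # non-default profile for users who still need it:
    ./build/mvn -Phadoop-2.7 -Phive -Phive-1.2 -Phive-thriftserver -DskipTests clean package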