Yes, it does. I meant SPARK-20202. Thanks. I understand that it can be considered like a Scala version issue. That's why I put this as a `policy` issue from the beginning.
> First of all, I want to put this as a policy issue instead of a technical issue.

From the policy perspective, we should remove this immediately once we have a solution to fix it. For now, I set the `Target Versions` of SPARK-20202 to `3.1.0` according to the current discussion status.

https://issues.apache.org/jira/browse/SPARK-20202

And, if there are no other issues, I'll create a PR to remove it from the `master` branch when we cut `branch-3.0`.

For the additional `hadoop-2.7 with Hive 2.3` pre-built distribution, what do you think, Sean? The preparation has already started in another email thread, and I believe it is a keystone for proving the stability of `Hive 2.3` (which Cheng/Hyukjin/you asked for).

Bests,
Dongjoon.

On Tue, Nov 19, 2019 at 2:09 PM Cheng Lian <lian.cs....@gmail.com> wrote:

> It's kinda like a Scala version upgrade. Historically, we only remove
> the support of an older Scala version when the newer version has proven
> to be stable after one or more Spark minor versions.
>
> On Tue, Nov 19, 2019 at 2:07 PM Cheng Lian <lian.cs....@gmail.com> wrote:
>
>> Hmm, what exactly did you mean by "officially remove the usage of the
>> forked `hive` in Apache Spark 3.0 completely"? I thought you wanted to
>> remove the forked Hive 1.2 dependencies completely, no? As long as we
>> still keep Hive 1.2 in Spark 3.0, I'm fine with that. I personally
>> don't have a particular preference between using Hive 1.2 or 2.3 as
>> the default Hive version. After all, end-users and providers who need
>> a particular version combination can always build Spark with the
>> proper profiles themselves.
>>
>> And thanks for clarifying the Hive 2.3.5 issue. I didn't notice that
>> it was due to the folder name.
>>
>> On Tue, Nov 19, 2019 at 11:15 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
>> wrote:
>>
>>> BTW, `hive.version.short` is a directory name. We are using 2.3.6 only.
>>>
>>> For the directory names, we use '1.2.1' and '2.3.5' because we delayed
>>> renaming the directories until the 3.0.0 deadline to minimize the diff.
>>>
>>> We can replace them immediately if we want to, right now.
>>>
>>> On Tue, Nov 19, 2019 at 11:11 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>> wrote:
>>>
>>>> Hi, Cheng.
>>>>
>>>> This is irrelevant to JDK11 and Hadoop 3. I'm talking about the JDK8
>>>> world. If we consider them, it looks like the following.
>>>>
>>>> +----------+-----------------+-------------------+
>>>> |          | Hive 1.2.1 fork | Apache Hive 2.3.6 |
>>>> +----------+-----------------+-------------------+
>>>> |Legitimate|        X        |         O         |
>>>> |JDK11     |        X        |         O         |
>>>> |Hadoop3   |        X        |         O         |
>>>> |Hadoop2   |        O        |         O         |
>>>> |Functions |     Baseline    |        More       |
>>>> |Bug fixes |     Baseline    |        More       |
>>>> +----------+-----------------+-------------------+
>>>>
>>>> To stabilize Spark's Hive 2.3 usage, we should use it ourselves
>>>> (including Jenkins/GitHub Actions/AppVeyor).
>>>>
>>>> For me, the AS-IS 3.0 is not enough for that. Following your advice,
>>>> to give more visibility to the whole community:
>>>>
>>>> 1. We need to provide an additional `hadoop-2.7 with Hive 2.3`
>>>> pre-built distribution.
>>>> 2. We need to switch our default Hive usage to 2.3 in `master` for
>>>> 3.1 after the `branch-3.0` cut.
>>>>
>>>> I know that we have been reluctant to do (1) and (2) because of the
>>>> burden, but it's time to prepare. Without them, we will come up
>>>> short again and again.
>>>>
>>>> Bests,
>>>> Dongjoon.
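(For reference, producing the pre-built distribution in (1) would look roughly like the command below. This is only a sketch based on the profiles named in this thread; the distribution name is illustrative, and the exact profile names should be checked against the build scripts at release time.)

    # Build a Hadoop 2.7 + Hive 2.3 binary distribution (illustrative).
    ./dev/make-distribution.sh --name hadoop2.7-hive2.3 --tgz \
        -Phadoop-2.7 -Phive -Phive-thriftserver -Phive-2.3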
>>>>
>>>> On Tue, Nov 19, 2019 at 9:26 AM Cheng Lian <lian.cs....@gmail.com>
>>>> wrote:
>>>>
>>>>> Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x
>>>>> minor release to stabilize the Hive 2.3 code paths before retiring
>>>>> the Hive 1.2 fork. Even today, the Hive 2.3.6 version bundled in
>>>>> Spark 3.0 is still buggy in terms of JDK 11 support. (BTW, I just
>>>>> found that our root POM is referring to both Hive 2.3.6 and 2.3.5
>>>>> at the moment, see here
>>>>> <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L135>
>>>>> and here
>>>>> <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L2927>
>>>>> .)
>>>>>
>>>>> Again, I'm happy to get rid of ancient legacy dependencies like
>>>>> Hadoop 2.7 and the Hive 1.2 fork, but I do believe that we need a
>>>>> safety net for Spark 3.0. For preview releases, I'm afraid their
>>>>> visibility is not good enough to cover such major upgrades.
>>>>>
>>>>> On Tue, Nov 19, 2019 at 8:39 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thank you for the feedback, Hyukjin and Sean.
>>>>>>
>>>>>> I proposed `preview-2` for that purpose, but I'm also +1 for doing
>>>>>> it in 3.1 if we can make a decision to eliminate the illegitimate
>>>>>> Hive fork reference immediately after the `branch-3.0` cut.
>>>>>>
>>>>>> Sean, I'm referencing Cheng Lian's email for the status of
>>>>>> `hadoop-2.7`:
>>>>>>
>>>>>> -
>>>>>> https://lists.apache.org/thread.html/623dd9a6d4e951daeec985feffede12c7b419e03c2965018de7a72f1@%3Cdev.spark.apache.org%3E
>>>>>>
>>>>>> The way I see this, it's not a user problem. The Apache Spark
>>>>>> community hasn't tried to drop the illegitimate Hive fork yet.
>>>>>> We need to drop it ourselves because we created it, and it's our
>>>>>> bad.
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>> On Tue, Nov 19, 2019 at 5:06 AM Sean Owen <sro...@gmail.com> wrote:
>>>>>>
>>>>>>> Just to clarify, as even I have lost the details over time:
>>>>>>> hadoop-2.7 works with hive-2.3? It isn't tied to hadoop-3.2?
>>>>>>> Roughly how much risk is there in using the Hive 1.x fork over
>>>>>>> Hive 2.x, for end users using Hive via Spark?
>>>>>>> I don't have a strong opinion, other than sharing the view that
>>>>>>> we have to dump the Hive 1.x fork at the first opportunity. The
>>>>>>> question is simply how much risk that entails. Keep in mind that
>>>>>>> Spark 3.0 is already something people understand works
>>>>>>> differently; we can accept some behavior changes.
>>>>>>>
>>>>>>> On Mon, Nov 18, 2019 at 11:11 PM Dongjoon Hyun <
>>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>> >
>>>>>>> > Hi, All.
>>>>>>> >
>>>>>>> > First of all, I want to put this as a policy issue instead of a
>>>>>>> > technical issue. Also, this is orthogonal to the `hadoop`
>>>>>>> > version discussion.
>>>>>>> >
>>>>>>> > The Apache Spark community kept (not maintained) the forked
>>>>>>> > Apache Hive 1.2.1 because there were no other options before.
>>>>>>> > As we can see in SPARK-20202, it's not a desirable situation
>>>>>>> > among Apache projects.
>>>>>>> >
>>>>>>> > https://issues.apache.org/jira/browse/SPARK-20202
>>>>>>> >
>>>>>>> > Also, please note that we `kept`, not `maintained`, it because
>>>>>>> > we know it's not good. There were several attempts to update
>>>>>>> > that forked repository for several reasons (Hadoop 3 support is
>>>>>>> > one example), but those attempts were also turned down.
>>>>>>> >
>>>>>>> > From Apache Spark 3.0, it seems that we have a new feasible
>>>>>>> > option, the `hive-2.3` profile. What about moving further in
>>>>>>> > this direction?
>>>>>>> >
>>>>>>> > For example, can we officially remove the usage of the forked
>>>>>>> > `hive` in Apache Spark 3.0 completely? If someone still needs
>>>>>>> > to use the forked `hive`, we can have a `hive-1.2` profile. Of
>>>>>>> > course, it should not be a default profile in the community.
>>>>>>> >
>>>>>>> > I want to say this is a goal we should achieve someday.
>>>>>>> > If we don't do anything, nothing happens. At least we need to
>>>>>>> > prepare for this. Without any preparation, Spark 3.1+ will be
>>>>>>> > the same.
>>>>>>> >
>>>>>>> > Shall we focus on what our problems with Hive 2.3.6 are?
>>>>>>> > If the only reason is that we didn't use it before, we can
>>>>>>> > release another `3.0.0-preview` for that.
>>>>>>> >
>>>>>>> > Bests,
>>>>>>> > Dongjoon.
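(For anyone who would still need the forked dependency while it remains available, opting into the proposed `hive-1.2` profile would be an explicit, non-default build step along these lines. This is a sketch assuming the profile wiring discussed above; the actual flags may differ once the profile lands.)

    # Build against the forked Hive 1.2 explicitly (illustrative; non-default).
    ./build/mvn -DskipTests -Phadoop-2.7 -Phive -Phive-thriftserver \
        -Phive-1.2 clean package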