Hmm, what exactly did you mean by "remove the usage of forked `hive` in Apache Spark 3.0 completely officially"? I thought you wanted to remove the forked Hive 1.2 dependencies completely, no? As long as we still keep Hive 1.2 in Spark 3.0, I'm fine with that. I personally don't have a particular preference between using Hive 1.2 or 2.3 as the default Hive version. After all, end users and vendors who need a particular version combination can always build Spark with the proper profiles themselves.
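For example, something along these lines should work once both profiles exist (a sketch only: I'm assuming the `hive-2.3` profile already in master and the `hive-1.2` profile proposed in this thread, and the exact flags may differ):

    # Build against Apache Hive 2.3.x (the hive-2.3 profile discussed in this thread)
    ./build/mvn -Phive -Phive-thriftserver -Phive-2.3 -DskipTests clean package

    # Build against the forked Hive 1.2, if a hive-1.2 profile is kept as an opt-in
    ./build/mvn -Phive -Phive-thriftserver -Phive-1.2 -DskipTests clean package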
And thanks for clarifying the Hive 2.3.5 issue. I didn't notice that it's due to the folder name.

On Tue, Nov 19, 2019 at 11:15 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> BTW, `hive.version.short` is a directory name. We are using 2.3.6 only.
>
> For the directory names, we use '1.2.1' and '2.3.5' because we just delayed renaming the directories until the 3.0.0 deadline to minimize the diff.
>
> We can rename them immediately if we want.
>
> On Tue, Nov 19, 2019 at 11:11 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>
>> Hi, Cheng.
>>
>> This is irrelevant to JDK11 and Hadoop 3; I'm talking about the JDK8 world. If we consider them, it looks like the following.
>>
>> +------------+-----------------+-------------------+
>> |            | Hive 1.2.1 fork | Apache Hive 2.3.6 |
>> +------------+-----------------+-------------------+
>> | Legitimate |        X        |         O         |
>> | JDK11      |        X        |         O         |
>> | Hadoop3    |        X        |         O         |
>> | Hadoop2    |        O        |         O         |
>> | Functions  |    Baseline     |       More        |
>> | Bug fixes  |    Baseline     |       More        |
>> +------------+-----------------+-------------------+
>>
>> To stabilize Spark's Hive 2.3 usage, we should use it ourselves (including Jenkins/GitHub Actions/AppVeyor).
>>
>> For me, the AS-IS 3.0 is not enough for that. Following your advice, to give more visibility to the whole community:
>>
>> 1. We need to provide an additional `hadoop-2.7 with Hive 2.3` pre-built distribution.
>> 2. We need to switch our default Hive usage to 2.3 in `master` for 3.1 after the `branch-3.0` branch cut.
>>
>> I know that we have been reluctant to do (1) and (2) due to their burden. But it's time to prepare. Without them, we are going to fall short again and again.
>>
>> Bests,
>> Dongjoon.
>>
>> On Tue, Nov 19, 2019 at 9:26 AM Cheng Lian <lian.cs....@gmail.com> wrote:
>>
>>> Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x minor release to stabilize the Hive 2.3 code paths before retiring the Hive 1.2 fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still buggy in terms of JDK 11 support. (BTW, I just found that our root POM is referring to both Hive 2.3.6 and 2.3.5 at the moment, see here <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L135> and here <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L2927>.)
>>>
>>> Again, I'm happy to get rid of ancient legacy dependencies like Hadoop 2.7 and the Hive 1.2 fork, but I do believe that we need a safety net for Spark 3.0. As for preview releases, I'm afraid their visibility is not good enough to cover such major upgrades.
>>>
>>> On Tue, Nov 19, 2019 at 8:39 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>
>>>> Thank you for the feedback, Hyukjin and Sean.
>>>>
>>>> I proposed `preview-2` for that purpose, but I'm also +1 for doing that at 3.1 if we can make a decision to eliminate the illegitimate Hive fork reference immediately after the `branch-3.0` cut.
>>>>
>>>> Sean, I'm referencing Cheng Lian's email for the status of `hadoop-2.7`:
>>>>
>>>> - https://lists.apache.org/thread.html/623dd9a6d4e951daeec985feffede12c7b419e03c2965018de7a72f1@%3Cdev.spark.apache.org%3E
>>>>
>>>> The way I see this, it's not a user problem. The Apache Spark community hasn't tried to drop the illegitimate Hive fork yet. We need to drop it ourselves because we created it; it's our bad.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>> On Tue, Nov 19, 2019 at 5:06 AM Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> Just to clarify, as even I have lost the details over time: hadoop-2.7 works with hive-2.3? It isn't tied to hadoop-3.2?
>>>>> Roughly how much risk is there in using the Hive 1.x fork over Hive 2.x, for end users using Hive via Spark?
>>>>> I don't have a strong opinion, other than sharing the view that we have to dump the Hive 1.x fork at the first opportunity. The question is simply how much risk that entails, keeping in mind that Spark 3.0 is already something that people understand works differently. We can accept some behavior changes.
>>>>>
>>>>> On Mon, Nov 18, 2019 at 11:11 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>> >
>>>>> > Hi, All.
>>>>> >
>>>>> > First of all, I want to put this as a policy issue instead of a technical issue. Also, this is orthogonal to the `hadoop` version discussion.
>>>>> >
>>>>> > The Apache Spark community has kept (not maintained) the forked Apache Hive 1.2.1 because there were no other options before. As we can see in SPARK-20202, it's not a desirable situation among Apache projects.
>>>>> >
>>>>> > https://issues.apache.org/jira/browse/SPARK-20202
>>>>> >
>>>>> > Also, please note that we `kept`, not `maintained`, it because we know it's not good. There have been several attempts to update that forked repository for several reasons (Hadoop 3 support is one example), but those attempts were also turned down.
>>>>> >
>>>>> > From Apache Spark 3.0, it seems that we have a new feasible option: the `hive-2.3` profile. What about moving forward further in this direction?
>>>>> >
>>>>> > For example, can we remove the usage of forked `hive` in Apache Spark 3.0 completely officially? If someone still needs the forked `hive`, we can have a `hive-1.2` profile. Of course, it should not be the default profile in the community.
>>>>> >
>>>>> > I want to say this is a goal we should achieve someday. If we don't do anything, nothing happens. At least we need to prepare for this. Without any preparation, Spark 3.1+ will be the same.
>>>>> >
>>>>> > Shall we focus on what our problems with Hive 2.3.6 are? If the only reason is that we haven't used it before, we can release another `3.0.0-preview` for that.
>>>>> >
>>>>> > Bests,
>>>>> > Dongjoon.
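P.S. For anyone who wants to try the combination in Dongjoon's item (1) ahead of an official pre-built distribution, here is a rough sketch. dev/make-distribution.sh is the existing packaging script; the profile names follow this thread, and the exact flags and output name may differ:

    # Build a Hadoop 2.7 + Hive 2.3 distribution
    # (assuming the hive-2.3 profile also applies with -Phadoop-2.7)
    ./dev/make-distribution.sh --name hadoop2.7-hive2.3 --tgz \
        -Phadoop-2.7 -Phive -Phive-thriftserver -Phive-2.3

    # Sanity-check which Hive the resulting distribution actually bundles
    tar tzf spark-*-bin-hadoop2.7-hive2.3.tgz | grep hive-exec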