Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-02-04 Thread Felix Cheung
Likely need a shim (which we should have anyway) because of namespace/import 
changes.

I’m a huge +1 on this.
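
Purely for illustration, here is a minimal sketch of the kind of shim layer being
discussed. These are not Spark's actual shim classes; the trait, object, and
reflected class-name strings below are assumptions made up for the example:

    // Minimal sketch only -- not Spark's real shim API. The idea is to confine
    // version-specific Hive class/package names to one small layer so the rest
    // of the code base never imports them directly.
    trait HiveVersionShim {
      /** Fully qualified name of the version-specific class we need (hypothetical). */
      def metadataHiveClass: String

      /** Load the class reflectively so only the shim knows the exact namespace. */
      def loadMetadataHive(loader: ClassLoader): Class[_] =
        loader.loadClass(metadataHiveClass)
    }

    // One shim per supported built-in Hive version; the locations are hypothetical.
    object Hive12Shim extends HiveVersionShim {
      val metadataHiveClass = "org.apache.hadoop.hive.some.OldLocation" // pre-2.x package (made up)
    }

    object Hive23Shim extends HiveVersionShim {
      val metadataHiveClass = "org.apache.hadoop.hive.some.NewLocation" // 2.x package (made up)
    }

    object HiveShims {
      /** Pick the shim matching the built-in Hive version Spark was built against. */
      def shimFor(builtinHiveVersion: String): HiveVersionShim =
        if (builtinHiveVersion.startsWith("2.")) Hive23Shim else Hive12Shim
    }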



From: Hyukjin Kwon 
Sent: Monday, February 4, 2019 12:27 PM
To: Xiao Li
Cc: Sean Owen; Felix Cheung; Ryan Blue; Marcelo Vanzin; Yuming Wang; dev
Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

I should check the details and feasibility myself, but it sounds fine to me
if it doesn't need a big extra effort.

On Tue, 5 Feb 2019, 4:15 am Xiao Li <gatorsm...@gmail.com> wrote:
Yes. When our support/integration with Hive 2.x becomes stable, we can do it in 
Hadoop 2.x profile too, if needed. The whole proposal is to minimize the risk 
and ensure the release stability and quality.

Hyukjin Kwon <gurwls...@gmail.com> 于2019年2月4日周一 下午12:01写道:
Xiao, to check if I understood correctly, do you mean the below?

1. Use our fork with Hadoop 2.x profile for now, and use Hive 2.x with Hadoop 
3.x profile.
2. Make another newer version of thrift server by Hive 2.x(?) in Spark side.
3. Target the transition to Hive 2.x completely and slowly later in the future.



2019년 2월 5일 (화) 오전 1:16, Xiao Li <gatorsm...@gmail.com>님이 작성:
To reduce the impact and risk of upgrading Hive execution JARs, we can just 
upgrade the built-in Hive to 2.x when using the profile of Hadoop 3.x. The 
support of Hadoop 3 will be still experimental in our next release. That means, 
the impact and risk are very minimal for most users who are still using Hadoop 
2.x profile.

The code changes in Spark thrift server are massive. It is risky and hard to 
review. The original code of our Spark thrift server is from Hive-service 
1.2.1. To reduce the risk of the upgrade, we can inline the new version. In the 
future, we can completely get rid of the thrift server, and build our own 
high-performant JDBC server.

Does this proposal sound good to you?

In the last two weeks, Yuming was trying this proposal. Now, he is on vacation. 
In China, today is already the lunar New Year. I would not expect he will reply 
this email in the next 7 days.

Cheers,

Xiao



Sean Owen <sro...@gmail.com> 于2019年2月4日周一 上午7:56写道:
I was unclear from this thread what the objection to these PRs is:

https://github.com/apache/spark/pull/23552
https://github.com/apache/spark/pull/23553

Would we like to specifically discuss whether to merge these or not? I
hear support for it, concerns about continuing to support Hive too,
but I wasn't clear whether those concerns specifically argue against
these PRs.


On Fri, Feb 1, 2019 at 2:03 PM Felix Cheung <felixcheun...@hotmail.com> wrote:
>
> What’s the update and next step on this?
>
> We have real users getting blocked by this issue.
>
>
> 
> From: Xiao Li <gatorsm...@gmail.com>
> Sent: Wednesday, January 16, 2019 9:37 AM
> To: Ryan Blue
> Cc: Marcelo Vanzin; Hyukjin Kwon; Sean Owen; Felix Cheung; Yuming Wang; dev
> Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
>
> Thanks for your feedbacks!
>
> Working with Yuming to reduce the risk of stability and quality. Will keep 
> you posted when the proposal is ready.
>
> Cheers,
>
> Xiao
>
> Ryan Blue <rb...@netflix.com> 于2019年1月16日周三 上午9:27写道:
>>
>> +1 for what Marcelo and Hyukjin said.
>>
>> In particular, I agree that we can't expect Hive to release a version that 
>> is now more than 3 years old just to solve a problem for Spark. Maybe that 
>> would have been a reasonable ask instead of publishing a fork years ago, but 
>> I think this is now Spark's problem.
>>
>> On Tue, Jan 15, 2019 at 9:02 PM Marcelo Vanzin <van...@cloudera.com> wrote:
>>>
>>> +1 to that. HIVE-16391 by itself means we're giving up things like
>>> Hadoop 3, and we're also putting the burden on the Hive folks to fix a
>>> problem that we created.
>>>
>>> The current PR is basically a Spark-side fix for that bug. It does
>>> mean also upgrading Hive (which gives us Hadoop 3, yay!), but I think
>>> it's really the right path to take here.
>>>
>>> On Tue, Jan 15, 2019 at 6:32 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>> >
>>> > Resolving HIVE-16391 means Hive to release 1.2.x that contains the fixes 
>>> > of our Hive fork (correct me if I am mistaken).
>>> >
>>> > Just to be honest by myself and as a personal opinion, that basically 
>>> > says Hive to take care of Spark's dependency.
>>> > Hive looks going ahead for 3.1.x and no one would use the newer release 
>>> > of 1.2.x. In practice, Spark doesn't make a release 1.6.x anymore for 

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-02-04 Thread Hyukjin Kwon
I should check the details and feasibility myself, but it sounds fine to me
if it doesn't need a big extra effort.

On Tue, 5 Feb 2019, 4:15 am Xiao Li wrote:
> Yes. When our support/integration with Hive 2.x becomes stable, we can do
> it in Hadoop 2.x profile too, if needed. The whole proposal is to minimize
> the risk and ensure the release stability and quality.
>
> Hyukjin Kwon  于2019年2月4日周一 下午12:01写道:
>
>> Xiao, to check if I understood correctly, do you mean the below?
>>
>> 1. Use our fork with Hadoop 2.x profile for now, and use Hive 2.x with
>> Hadoop 3.x profile.
>> 2. Make another newer version of thrift server by Hive 2.x(?) in Spark
>> side.
>> 3. Target the transition to Hive 2.x completely and slowly later in the
>> future.
>>
>>
>>
>> 2019년 2월 5일 (화) 오전 1:16, Xiao Li 님이 작성:
>>
>>> To reduce the impact and risk of upgrading Hive execution JARs, we can
>>> just upgrade the built-in Hive to 2.x when using the profile of Hadoop 3.x.
>>> The support of Hadoop 3 will be still experimental in our next release.
>>> That means, the impact and risk are very minimal for most users who are
>>> still using Hadoop 2.x profile.
>>>
>>> The code changes in Spark thrift server are massive. It is risky and
>>> hard to review. The original code of our Spark thrift server is from
>>> Hive-service 1.2.1. To reduce the risk of the upgrade, we can inline the
>>> new version. In the future, we can completely get rid of the thrift server,
>>> and build our own high-performant JDBC server.
>>>
>>> Does this proposal sound good to you?
>>>
>>> In the last two weeks, Yuming was trying this proposal. Now, he is on
>>> vacation. In China, today is already the lunar New Year. I would not expect
>>> he will reply this email in the next 7 days.
>>>
>>> Cheers,
>>>
>>> Xiao
>>>
>>>
>>>
>>> Sean Owen  于2019年2月4日周一 上午7:56写道:
>>>
>>>> I was unclear from this thread what the objection to these PRs is:
>>>>
>>>> https://github.com/apache/spark/pull/23552
>>>> https://github.com/apache/spark/pull/23553
>>>>
>>>> Would we like to specifically discuss whether to merge these or not? I
>>>> hear support for it, concerns about continuing to support Hive too,
>>>> but I wasn't clear whether those concerns specifically argue against
>>>> these PRs.
>>>>
>>>>
>>>> On Fri, Feb 1, 2019 at 2:03 PM Felix Cheung 
>>>> wrote:
>>>> >
>>>> > What’s the update and next step on this?
>>>> >
>>>> > We have real users getting blocked by this issue.
>>>> >
>>>> >
>>>> > 
>>>> > From: Xiao Li 
>>>> > Sent: Wednesday, January 16, 2019 9:37 AM
>>>> > To: Ryan Blue
>>>> > Cc: Marcelo Vanzin; Hyukjin Kwon; Sean Owen; Felix Cheung; Yuming
>>>> Wang; dev
>>>> > Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
>>>> >
>>>> > Thanks for your feedbacks!
>>>> >
>>>> > Working with Yuming to reduce the risk of stability and quality. Will
>>>> keep you posted when the proposal is ready.
>>>> >
>>>> > Cheers,
>>>> >
>>>> > Xiao
>>>> >
>>>> > Ryan Blue  于2019年1月16日周三 上午9:27写道:
>>>> >>
>>>> >> +1 for what Marcelo and Hyukjin said.
>>>> >>
>>>> >> In particular, I agree that we can't expect Hive to release a
>>>> version that is now more than 3 years old just to solve a problem for
>>>> Spark. Maybe that would have been a reasonable ask instead of publishing a
>>>> fork years ago, but I think this is now Spark's problem.
>>>> >>
>>>> >> On Tue, Jan 15, 2019 at 9:02 PM Marcelo Vanzin 
>>>> wrote:
>>>> >>>
>>>> >>> +1 to that. HIVE-16391 by itself means we're giving up things like
>>>> >>> Hadoop 3, and we're also putting the burden on the Hive folks to
>>>> fix a
>>>> >>> problem that we created.
>>>> >>>
>>>> >>> The current PR is basically a Spark-side fix for that bug. It does
>>>> >>> mean also upgrading Hive (which gives us Hadoop 3, yay!), but I
>>>> think
>

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-02-04 Thread Xiao Li
Yes. When our support/integration with Hive 2.x becomes stable, we can do it
in the Hadoop 2.x profile too, if needed. The whole proposal is to minimize
risk and ensure release stability and quality.

Hyukjin Kwon  于2019年2月4日周一 下午12:01写道:

> Xiao, to check if I understood correctly, do you mean the below?
>
> 1. Use our fork with Hadoop 2.x profile for now, and use Hive 2.x with
> Hadoop 3.x profile.
> 2. Make another newer version of thrift server by Hive 2.x(?) in Spark
> side.
> 3. Target the transition to Hive 2.x completely and slowly later in the
> future.
>
>
>
> 2019년 2월 5일 (화) 오전 1:16, Xiao Li 님이 작성:
>
>> To reduce the impact and risk of upgrading Hive execution JARs, we can
>> just upgrade the built-in Hive to 2.x when using the profile of Hadoop 3.x.
>> The support of Hadoop 3 will be still experimental in our next release.
>> That means, the impact and risk are very minimal for most users who are
>> still using Hadoop 2.x profile.
>>
>> The code changes in Spark thrift server are massive. It is risky and hard
>> to review. The original code of our Spark thrift server is from
>> Hive-service 1.2.1. To reduce the risk of the upgrade, we can inline the
>> new version. In the future, we can completely get rid of the thrift server,
>> and build our own high-performant JDBC server.
>>
>> Does this proposal sound good to you?
>>
>> In the last two weeks, Yuming was trying this proposal. Now, he is on
>> vacation. In China, today is already the lunar New Year. I would not expect
>> he will reply this email in the next 7 days.
>>
>> Cheers,
>>
>> Xiao
>>
>>
>>
>> Sean Owen  于2019年2月4日周一 上午7:56写道:
>>
>>> I was unclear from this thread what the objection to these PRs is:
>>>
>>> https://github.com/apache/spark/pull/23552
>>> https://github.com/apache/spark/pull/23553
>>>
>>> Would we like to specifically discuss whether to merge these or not? I
>>> hear support for it, concerns about continuing to support Hive too,
>>> but I wasn't clear whether those concerns specifically argue against
>>> these PRs.
>>>
>>>
>>> On Fri, Feb 1, 2019 at 2:03 PM Felix Cheung 
>>> wrote:
>>> >
>>> > What’s the update and next step on this?
>>> >
>>> > We have real users getting blocked by this issue.
>>> >
>>> >
>>> > 
>>> > From: Xiao Li 
>>> > Sent: Wednesday, January 16, 2019 9:37 AM
>>> > To: Ryan Blue
>>> > Cc: Marcelo Vanzin; Hyukjin Kwon; Sean Owen; Felix Cheung; Yuming
>>> Wang; dev
>>> > Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
>>> >
>>> > Thanks for your feedbacks!
>>> >
>>> > Working with Yuming to reduce the risk of stability and quality. Will
>>> keep you posted when the proposal is ready.
>>> >
>>> > Cheers,
>>> >
>>> > Xiao
>>> >
>>> > Ryan Blue  于2019年1月16日周三 上午9:27写道:
>>> >>
>>> >> +1 for what Marcelo and Hyukjin said.
>>> >>
>>> >> In particular, I agree that we can't expect Hive to release a version
>>> that is now more than 3 years old just to solve a problem for Spark. Maybe
>>> that would have been a reasonable ask instead of publishing a fork years
>>> ago, but I think this is now Spark's problem.
>>> >>
>>> >> On Tue, Jan 15, 2019 at 9:02 PM Marcelo Vanzin 
>>> wrote:
>>> >>>
>>> >>> +1 to that. HIVE-16391 by itself means we're giving up things like
>>> >>> Hadoop 3, and we're also putting the burden on the Hive folks to fix
>>> a
>>> >>> problem that we created.
>>> >>>
>>> >>> The current PR is basically a Spark-side fix for that bug. It does
>>> >>> mean also upgrading Hive (which gives us Hadoop 3, yay!), but I think
>>> >>> it's really the right path to take here.
>>> >>>
>>> >>> On Tue, Jan 15, 2019 at 6:32 PM Hyukjin Kwon 
>>> wrote:
>>> >>> >
>>> >>> > Resolving HIVE-16391 means Hive to release 1.2.x that contains the
>>> fixes of our Hive fork (correct me if I am mistaken).
>>> >>> >
>>> >>> > Just to be honest by myself and as a personal opinion, that
>>> basically says Hive to take care of Spark'

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-02-04 Thread Hyukjin Kwon
Xiao, to check that I understood correctly, do you mean the following?

1. Use our fork with the Hadoop 2.x profile for now, and use Hive 2.x with the
Hadoop 3.x profile.
2. Make another, newer version of the thrift server based on Hive 2.x(?) on
the Spark side.
3. Target the complete transition to Hive 2.x slowly, later in the future.



2019년 2월 5일 (화) 오전 1:16, Xiao Li 님이 작성:

> To reduce the impact and risk of upgrading Hive execution JARs, we can
> just upgrade the built-in Hive to 2.x when using the profile of Hadoop 3.x.
> The support of Hadoop 3 will be still experimental in our next release.
> That means, the impact and risk are very minimal for most users who are
> still using Hadoop 2.x profile.
>
> The code changes in Spark thrift server are massive. It is risky and hard
> to review. The original code of our Spark thrift server is from
> Hive-service 1.2.1. To reduce the risk of the upgrade, we can inline the
> new version. In the future, we can completely get rid of the thrift server,
> and build our own high-performant JDBC server.
>
> Does this proposal sound good to you?
>
> In the last two weeks, Yuming was trying this proposal. Now, he is on
> vacation. In China, today is already the lunar New Year. I would not expect
> he will reply this email in the next 7 days.
>
> Cheers,
>
> Xiao
>
>
>
> Sean Owen  于2019年2月4日周一 上午7:56写道:
>
>> I was unclear from this thread what the objection to these PRs is:
>>
>> https://github.com/apache/spark/pull/23552
>> https://github.com/apache/spark/pull/23553
>>
>> Would we like to specifically discuss whether to merge these or not? I
>> hear support for it, concerns about continuing to support Hive too,
>> but I wasn't clear whether those concerns specifically argue against
>> these PRs.
>>
>>
>> On Fri, Feb 1, 2019 at 2:03 PM Felix Cheung 
>> wrote:
>> >
>> > What’s the update and next step on this?
>> >
>> > We have real users getting blocked by this issue.
>> >
>> >
>> > ________________
>> > From: Xiao Li 
>> > Sent: Wednesday, January 16, 2019 9:37 AM
>> > To: Ryan Blue
>> > Cc: Marcelo Vanzin; Hyukjin Kwon; Sean Owen; Felix Cheung; Yuming Wang;
>> dev
>> > Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
>> >
>> > Thanks for your feedbacks!
>> >
>> > Working with Yuming to reduce the risk of stability and quality. Will
>> keep you posted when the proposal is ready.
>> >
>> > Cheers,
>> >
>> > Xiao
>> >
>> > Ryan Blue  于2019年1月16日周三 上午9:27写道:
>> >>
>> >> +1 for what Marcelo and Hyukjin said.
>> >>
>> >> In particular, I agree that we can't expect Hive to release a version
>> that is now more than 3 years old just to solve a problem for Spark. Maybe
>> that would have been a reasonable ask instead of publishing a fork years
>> ago, but I think this is now Spark's problem.
>> >>
>> >> On Tue, Jan 15, 2019 at 9:02 PM Marcelo Vanzin 
>> wrote:
>> >>>
>> >>> +1 to that. HIVE-16391 by itself means we're giving up things like
>> >>> Hadoop 3, and we're also putting the burden on the Hive folks to fix a
>> >>> problem that we created.
>> >>>
>> >>> The current PR is basically a Spark-side fix for that bug. It does
>> >>> mean also upgrading Hive (which gives us Hadoop 3, yay!), but I think
>> >>> it's really the right path to take here.
>> >>>
>> >>> On Tue, Jan 15, 2019 at 6:32 PM Hyukjin Kwon 
>> wrote:
>> >>> >
>> >>> > Resolving HIVE-16391 means Hive to release 1.2.x that contains the
>> fixes of our Hive fork (correct me if I am mistaken).
>> >>> >
>> >>> > Just to be honest by myself and as a personal opinion, that
>> basically says Hive to take care of Spark's dependency.
>> >>> > Hive looks going ahead for 3.1.x and no one would use the newer
>> release of 1.2.x. In practice, Spark doesn't make a release 1.6.x anymore
>> for instance,
>> >>> >
>> >>> > Frankly, my impression was that it's, honestly, our mistake to fix.
>> Since Spark community is big enough, I was thinking we should try to fix it
>> by ourselves first.
>> >>> > I am not saying upgrading is the only way to get through this but I
>> think we should at least try first, and see what's next.
>> >>> >
>> >>> > It does, yes, sound more risky to upgrade it in our side but I
>> think it's worth to check and try it and see if it's possible.
>> >>> > I think this is a standard approach to upgrade the dependency than
>> using the fork or letting Hive side to release another 1.2.x.
>> >>> >
>> >>> > If we fail to upgrade it for critical or inevitable reasons
>> somehow, yes, we could find an alternative but that basically means
>> >>> > we're going to stay in 1.2.x for, at least, a long time (say ..
>> until Spark 4.0.0?).
>> >>> >
>> >>> > I know somehow it happened to be sensitive but to be just literally
>> honest to myself, I think we should make a try.
>> >>> >
>> >>>
>> >>>
>> >>> --
>> >>> Marcelo
>> >>
>> >>
>> >>
>> >> --
>> >> Ryan Blue
>> >> Software Engineer
>> >> Netflix
>>
>


Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-02-04 Thread Xiao Li
To reduce the impact and risk of upgrading the Hive execution JARs, we can
upgrade the built-in Hive to 2.x only when using the Hadoop 3.x profile.
Support for Hadoop 3 will still be experimental in our next release. That
means the impact and risk are very minimal for most users, who are still on
the Hadoop 2.x profile.

The code changes in the Spark thrift server are massive; they are risky and
hard to review. The original code of our Spark thrift server comes from
hive-service 1.2.1. To reduce the risk of the upgrade, we can inline the new
version. In the future, we can get rid of the thrift server completely and
build our own high-performance JDBC server.

Does this proposal sound good to you?

In the last two weeks, Yuming has been trying out this proposal. He is now on
vacation, and in China today is already the Lunar New Year, so I would not
expect him to reply to this email in the next 7 days.

Cheers,

Xiao



Sean Owen  于2019年2月4日周一 上午7:56写道:

> I was unclear from this thread what the objection to these PRs is:
>
> https://github.com/apache/spark/pull/23552
> https://github.com/apache/spark/pull/23553
>
> Would we like to specifically discuss whether to merge these or not? I
> hear support for it, concerns about continuing to support Hive too,
> but I wasn't clear whether those concerns specifically argue against
> these PRs.
>
>
> On Fri, Feb 1, 2019 at 2:03 PM Felix Cheung 
> wrote:
> >
> > What’s the update and next step on this?
> >
> > We have real users getting blocked by this issue.
> >
> >
> > 
> > From: Xiao Li 
> > Sent: Wednesday, January 16, 2019 9:37 AM
> > To: Ryan Blue
> > Cc: Marcelo Vanzin; Hyukjin Kwon; Sean Owen; Felix Cheung; Yuming Wang;
> dev
> > Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
> >
> > Thanks for your feedbacks!
> >
> > Working with Yuming to reduce the risk of stability and quality. Will
> keep you posted when the proposal is ready.
> >
> > Cheers,
> >
> > Xiao
> >
> > Ryan Blue  于2019年1月16日周三 上午9:27写道:
> >>
> >> +1 for what Marcelo and Hyukjin said.
> >>
> >> In particular, I agree that we can't expect Hive to release a version
> that is now more than 3 years old just to solve a problem for Spark. Maybe
> that would have been a reasonable ask instead of publishing a fork years
> ago, but I think this is now Spark's problem.
> >>
> >> On Tue, Jan 15, 2019 at 9:02 PM Marcelo Vanzin 
> wrote:
> >>>
> >>> +1 to that. HIVE-16391 by itself means we're giving up things like
> >>> Hadoop 3, and we're also putting the burden on the Hive folks to fix a
> >>> problem that we created.
> >>>
> >>> The current PR is basically a Spark-side fix for that bug. It does
> >>> mean also upgrading Hive (which gives us Hadoop 3, yay!), but I think
> >>> it's really the right path to take here.
> >>>
> >>> On Tue, Jan 15, 2019 at 6:32 PM Hyukjin Kwon 
> wrote:
> >>> >
> >>> > Resolving HIVE-16391 means Hive to release 1.2.x that contains the
> fixes of our Hive fork (correct me if I am mistaken).
> >>> >
> >>> > Just to be honest by myself and as a personal opinion, that
> basically says Hive to take care of Spark's dependency.
> >>> > Hive looks going ahead for 3.1.x and no one would use the newer
> release of 1.2.x. In practice, Spark doesn't make a release 1.6.x anymore
> for instance,
> >>> >
> >>> > Frankly, my impression was that it's, honestly, our mistake to fix.
> Since Spark community is big enough, I was thinking we should try to fix it
> by ourselves first.
> >>> > I am not saying upgrading is the only way to get through this but I
> think we should at least try first, and see what's next.
> >>> >
> >>> > It does, yes, sound more risky to upgrade it in our side but I think
> it's worth to check and try it and see if it's possible.
> >>> > I think this is a standard approach to upgrade the dependency than
> using the fork or letting Hive side to release another 1.2.x.
> >>> >
> >>> > If we fail to upgrade it for critical or inevitable reasons somehow,
> yes, we could find an alternative but that basically means
> >>> > we're going to stay in 1.2.x for, at least, a long time (say ..
> until Spark 4.0.0?).
> >>> >
> >>> > I know somehow it happened to be sensitive but to be just literally
> honest to myself, I think we should make a try.
> >>> >
> >>>
> >>>
> >>> --
> >>> Marcelo
> >>
> >>
> >>
> >> --
> >> Ryan Blue
> >> Software Engineer
> >> Netflix
>


Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-02-04 Thread Sean Owen
I was unclear from this thread what the objection to these PRs is:

https://github.com/apache/spark/pull/23552
https://github.com/apache/spark/pull/23553

Would we like to specifically discuss whether to merge these or not? I
hear support for it, concerns about continuing to support Hive too,
but I wasn't clear whether those concerns specifically argue against
these PRs.


On Fri, Feb 1, 2019 at 2:03 PM Felix Cheung  wrote:
>
> What’s the update and next step on this?
>
> We have real users getting blocked by this issue.
>
>
> 
> From: Xiao Li 
> Sent: Wednesday, January 16, 2019 9:37 AM
> To: Ryan Blue
> Cc: Marcelo Vanzin; Hyukjin Kwon; Sean Owen; Felix Cheung; Yuming Wang; dev
> Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
>
> Thanks for your feedbacks!
>
> Working with Yuming to reduce the risk of stability and quality. Will keep 
> you posted when the proposal is ready.
>
> Cheers,
>
> Xiao
>
> Ryan Blue  于2019年1月16日周三 上午9:27写道:
>>
>> +1 for what Marcelo and Hyukjin said.
>>
>> In particular, I agree that we can't expect Hive to release a version that 
>> is now more than 3 years old just to solve a problem for Spark. Maybe that 
>> would have been a reasonable ask instead of publishing a fork years ago, but 
>> I think this is now Spark's problem.
>>
>> On Tue, Jan 15, 2019 at 9:02 PM Marcelo Vanzin  wrote:
>>>
>>> +1 to that. HIVE-16391 by itself means we're giving up things like
>>> Hadoop 3, and we're also putting the burden on the Hive folks to fix a
>>> problem that we created.
>>>
>>> The current PR is basically a Spark-side fix for that bug. It does
>>> mean also upgrading Hive (which gives us Hadoop 3, yay!), but I think
>>> it's really the right path to take here.
>>>
>>> On Tue, Jan 15, 2019 at 6:32 PM Hyukjin Kwon  wrote:
>>> >
>>> > Resolving HIVE-16391 means Hive to release 1.2.x that contains the fixes 
>>> > of our Hive fork (correct me if I am mistaken).
>>> >
>>> > Just to be honest by myself and as a personal opinion, that basically 
>>> > says Hive to take care of Spark's dependency.
>>> > Hive looks going ahead for 3.1.x and no one would use the newer release 
>>> > of 1.2.x. In practice, Spark doesn't make a release 1.6.x anymore for 
>>> > instance,
>>> >
>>> > Frankly, my impression was that it's, honestly, our mistake to fix. Since 
>>> > Spark community is big enough, I was thinking we should try to fix it by 
>>> > ourselves first.
>>> > I am not saying upgrading is the only way to get through this but I think 
>>> > we should at least try first, and see what's next.
>>> >
>>> > It does, yes, sound more risky to upgrade it in our side but I think it's 
>>> > worth to check and try it and see if it's possible.
>>> > I think this is a standard approach to upgrade the dependency than using 
>>> > the fork or letting Hive side to release another 1.2.x.
>>> >
>>> > If we fail to upgrade it for critical or inevitable reasons somehow, yes, 
>>> > we could find an alternative but that basically means
>>> > we're going to stay in 1.2.x for, at least, a long time (say .. until 
>>> > Spark 4.0.0?).
>>> >
>>> > I know somehow it happened to be sensitive but to be just literally 
>>> > honest to myself, I think we should make a try.
>>> >
>>>
>>>
>>> --
>>> Marcelo
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix




Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-02-01 Thread Koert Kuipers
Introducing Hive serdes in sql core sounds a bit like a step back to me.
How can you build Spark without Hive support if there are imports for
org.apache.hadoop.hive.serde2 in sql core? Are these imports very limited in
scope (and don't pull all of Hive in with them)?

On Fri, Feb 1, 2019 at 3:03 PM Felix Cheung 
wrote:

> What’s the update and next step on this?
>
> We have real users getting blocked by this issue.
>
>
> --
> *From:* Xiao Li 
> *Sent:* Wednesday, January 16, 2019 9:37 AM
> *To:* Ryan Blue
> *Cc:* Marcelo Vanzin; Hyukjin Kwon; Sean Owen; Felix Cheung; Yuming Wang;
> dev
> *Subject:* Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
>
> Thanks for your feedbacks!
>
> Working with Yuming to reduce the risk of stability and quality. Will keep
> you posted when the proposal is ready.
>
> Cheers,
>
> Xiao
>
> Ryan Blue  于2019年1月16日周三 上午9:27写道:
>
>> +1 for what Marcelo and Hyukjin said.
>>
>> In particular, I agree that we can't expect Hive to release a version
>> that is now more than 3 years old just to solve a problem for Spark. Maybe
>> that would have been a reasonable ask instead of publishing a fork years
>> ago, but I think this is now Spark's problem.
>>
>> On Tue, Jan 15, 2019 at 9:02 PM Marcelo Vanzin 
>> wrote:
>>
>>> +1 to that. HIVE-16391 by itself means we're giving up things like
>>> Hadoop 3, and we're also putting the burden on the Hive folks to fix a
>>> problem that we created.
>>>
>>> The current PR is basically a Spark-side fix for that bug. It does
>>> mean also upgrading Hive (which gives us Hadoop 3, yay!), but I think
>>> it's really the right path to take here.
>>>
>>> On Tue, Jan 15, 2019 at 6:32 PM Hyukjin Kwon 
>>> wrote:
>>> >
>>> > Resolving HIVE-16391 means Hive to release 1.2.x that contains the
>>> fixes of our Hive fork (correct me if I am mistaken).
>>> >
>>> > Just to be honest by myself and as a personal opinion, that basically
>>> says Hive to take care of Spark's dependency.
>>> > Hive looks going ahead for 3.1.x and no one would use the newer
>>> release of 1.2.x. In practice, Spark doesn't make a release 1.6.x anymore
>>> for instance,
>>> >
>>> > Frankly, my impression was that it's, honestly, our mistake to fix.
>>> Since Spark community is big enough, I was thinking we should try to fix it
>>> by ourselves first.
>>> > I am not saying upgrading is the only way to get through this but I
>>> think we should at least try first, and see what's next.
>>> >
>>> > It does, yes, sound more risky to upgrade it in our side but I think
>>> it's worth to check and try it and see if it's possible.
>>> > I think this is a standard approach to upgrade the dependency than
>>> using the fork or letting Hive side to release another 1.2.x.
>>> >
>>> > If we fail to upgrade it for critical or inevitable reasons somehow,
>>> yes, we could find an alternative but that basically means
>>> > we're going to stay in 1.2.x for, at least, a long time (say .. until
>>> Spark 4.0.0?).
>>> >
>>> > I know somehow it happened to be sensitive but to be just literally
>>> honest to myself, I think we should make a try.
>>> >
>>>
>>>
>>> --
>>> Marcelo
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>


Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-02-01 Thread Felix Cheung
What’s the update and next step on this?

We have real users getting blocked by this issue.



From: Xiao Li 
Sent: Wednesday, January 16, 2019 9:37 AM
To: Ryan Blue
Cc: Marcelo Vanzin; Hyukjin Kwon; Sean Owen; Felix Cheung; Yuming Wang; dev
Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

Thanks for your feedbacks!

Working with Yuming to reduce the risk of stability and quality. Will keep you 
posted when the proposal is ready.

Cheers,

Xiao

Ryan Blue <rb...@netflix.com> 于2019年1月16日周三 上午9:27写道:
+1 for what Marcelo and Hyukjin said.

In particular, I agree that we can't expect Hive to release a version that is 
now more than 3 years old just to solve a problem for Spark. Maybe that would 
have been a reasonable ask instead of publishing a fork years ago, but I think 
this is now Spark's problem.

On Tue, Jan 15, 2019 at 9:02 PM Marcelo Vanzin <van...@cloudera.com> wrote:
+1 to that. HIVE-16391 by itself means we're giving up things like
Hadoop 3, and we're also putting the burden on the Hive folks to fix a
problem that we created.

The current PR is basically a Spark-side fix for that bug. It does
mean also upgrading Hive (which gives us Hadoop 3, yay!), but I think
it's really the right path to take here.

On Tue, Jan 15, 2019 at 6:32 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
> Resolving HIVE-16391 means Hive to release 1.2.x that contains the fixes of 
> our Hive fork (correct me if I am mistaken).
>
> Just to be honest by myself and as a personal opinion, that basically says 
> Hive to take care of Spark's dependency.
> Hive looks going ahead for 3.1.x and no one would use the newer release of 
> 1.2.x. In practice, Spark doesn't make a release 1.6.x anymore for instance,
>
> Frankly, my impression was that it's, honestly, our mistake to fix. Since 
> Spark community is big enough, I was thinking we should try to fix it by 
> ourselves first.
> I am not saying upgrading is the only way to get through this but I think we 
> should at least try first, and see what's next.
>
> It does, yes, sound more risky to upgrade it in our side but I think it's 
> worth to check and try it and see if it's possible.
> I think this is a standard approach to upgrade the dependency than using the 
> fork or letting Hive side to release another 1.2.x.
>
> If we fail to upgrade it for critical or inevitable reasons somehow, yes, we 
> could find an alternative but that basically means
> we're going to stay in 1.2.x for, at least, a long time (say .. until Spark 
> 4.0.0?).
>
> I know somehow it happened to be sensitive but to be just literally honest to 
> myself, I think we should make a try.
>


--
Marcelo


--
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-16 Thread Xiao Li
Thanks for your feedback!

Working with Yuming to reduce the risk to stability and quality. Will keep
you posted when the proposal is ready.

Cheers,

Xiao

Ryan Blue  于2019年1月16日周三 上午9:27写道:

> +1 for what Marcelo and Hyukjin said.
>
> In particular, I agree that we can't expect Hive to release a version that
> is now more than 3 years old just to solve a problem for Spark. Maybe that
> would have been a reasonable ask instead of publishing a fork years ago,
> but I think this is now Spark's problem.
>
> On Tue, Jan 15, 2019 at 9:02 PM Marcelo Vanzin 
> wrote:
>
>> +1 to that. HIVE-16391 by itself means we're giving up things like
>> Hadoop 3, and we're also putting the burden on the Hive folks to fix a
>> problem that we created.
>>
>> The current PR is basically a Spark-side fix for that bug. It does
>> mean also upgrading Hive (which gives us Hadoop 3, yay!), but I think
>> it's really the right path to take here.
>>
>> On Tue, Jan 15, 2019 at 6:32 PM Hyukjin Kwon  wrote:
>> >
>> > Resolving HIVE-16391 means Hive to release 1.2.x that contains the
>> fixes of our Hive fork (correct me if I am mistaken).
>> >
>> > Just to be honest by myself and as a personal opinion, that basically
>> says Hive to take care of Spark's dependency.
>> > Hive looks going ahead for 3.1.x and no one would use the newer release
>> of 1.2.x. In practice, Spark doesn't make a release 1.6.x anymore for
>> instance,
>> >
>> > Frankly, my impression was that it's, honestly, our mistake to fix.
>> Since Spark community is big enough, I was thinking we should try to fix it
>> by ourselves first.
>> > I am not saying upgrading is the only way to get through this but I
>> think we should at least try first, and see what's next.
>> >
>> > It does, yes, sound more risky to upgrade it in our side but I think
>> it's worth to check and try it and see if it's possible.
>> > I think this is a standard approach to upgrade the dependency than
>> using the fork or letting Hive side to release another 1.2.x.
>> >
>> > If we fail to upgrade it for critical or inevitable reasons somehow,
>> yes, we could find an alternative but that basically means
>> > we're going to stay in 1.2.x for, at least, a long time (say .. until
>> Spark 4.0.0?).
>> >
>> > I know somehow it happened to be sensitive but to be just literally
>> honest to myself, I think we should make a try.
>> >
>>
>>
>> --
>> Marcelo
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-16 Thread Ryan Blue
+1 for what Marcelo and Hyukjin said.

In particular, I agree that we can't expect Hive to release a version that
is now more than 3 years old just to solve a problem for Spark. Maybe that
would have been a reasonable ask instead of publishing a fork years ago,
but I think this is now Spark's problem.

On Tue, Jan 15, 2019 at 9:02 PM Marcelo Vanzin  wrote:

> +1 to that. HIVE-16391 by itself means we're giving up things like
> Hadoop 3, and we're also putting the burden on the Hive folks to fix a
> problem that we created.
>
> The current PR is basically a Spark-side fix for that bug. It does
> mean also upgrading Hive (which gives us Hadoop 3, yay!), but I think
> it's really the right path to take here.
>
> On Tue, Jan 15, 2019 at 6:32 PM Hyukjin Kwon  wrote:
> >
> > Resolving HIVE-16391 means Hive to release 1.2.x that contains the fixes
> of our Hive fork (correct me if I am mistaken).
> >
> > Just to be honest by myself and as a personal opinion, that basically
> says Hive to take care of Spark's dependency.
> > Hive looks going ahead for 3.1.x and no one would use the newer release
> of 1.2.x. In practice, Spark doesn't make a release 1.6.x anymore for
> instance,
> >
> > Frankly, my impression was that it's, honestly, our mistake to fix.
> Since Spark community is big enough, I was thinking we should try to fix it
> by ourselves first.
> > I am not saying upgrading is the only way to get through this but I
> think we should at least try first, and see what's next.
> >
> > It does, yes, sound more risky to upgrade it in our side but I think
> it's worth to check and try it and see if it's possible.
> > I think this is a standard approach to upgrade the dependency than using
> the fork or letting Hive side to release another 1.2.x.
> >
> > If we fail to upgrade it for critical or inevitable reasons somehow,
> yes, we could find an alternative but that basically means
> > we're going to stay in 1.2.x for, at least, a long time (say .. until
> Spark 4.0.0?).
> >
> > I know somehow it happened to be sensitive but to be just literally
> honest to myself, I think we should make a try.
> >
>
>
> --
> Marcelo
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Marcelo Vanzin
+1 to that. HIVE-16391 by itself means we're giving up things like
Hadoop 3, and we're also putting the burden on the Hive folks to fix a
problem that we created.

The current PR is basically a Spark-side fix for that bug. It does
mean also upgrading Hive (which gives us Hadoop 3, yay!), but I think
it's really the right path to take here.

On Tue, Jan 15, 2019 at 6:32 PM Hyukjin Kwon  wrote:
>
> Resolving HIVE-16391 means Hive to release 1.2.x that contains the fixes of 
> our Hive fork (correct me if I am mistaken).
>
> Just to be honest by myself and as a personal opinion, that basically says 
> Hive to take care of Spark's dependency.
> Hive looks going ahead for 3.1.x and no one would use the newer release of 
> 1.2.x. In practice, Spark doesn't make a release 1.6.x anymore for instance,
>
> Frankly, my impression was that it's, honestly, our mistake to fix. Since 
> Spark community is big enough, I was thinking we should try to fix it by 
> ourselves first.
> I am not saying upgrading is the only way to get through this but I think we 
> should at least try first, and see what's next.
>
> It does, yes, sound more risky to upgrade it in our side but I think it's 
> worth to check and try it and see if it's possible.
> I think this is a standard approach to upgrade the dependency than using the 
> fork or letting Hive side to release another 1.2.x.
>
> If we fail to upgrade it for critical or inevitable reasons somehow, yes, we 
> could find an alternative but that basically means
> we're going to stay in 1.2.x for, at least, a long time (say .. until Spark 
> 4.0.0?).
>
> I know somehow it happened to be sensitive but to be just literally honest to 
> myself, I think we should make a try.
>


-- 
Marcelo




Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Hyukjin Kwon
Resolving HIVE-16391 would mean Hive releasing a 1.2.x that contains the fixes
from our Hive fork (correct me if I am mistaken).

To be honest, and as a personal opinion, that basically asks Hive to take care
of Spark's dependency.
Hive looks to be moving ahead to 3.1.x, and no one would use a newer 1.2.x
release. By analogy, Spark doesn't make 1.6.x releases anymore, for instance.

Frankly, my impression is that this is our mistake to fix. Since the Spark
community is big enough, I was thinking we should try to fix it ourselves
first.
I am not saying upgrading is the only way to get through this, but I think we
should at least try first and see what comes next.

Yes, it does sound riskier to upgrade it on our side, but I think it's worth
checking and trying to see if it's possible.
I think upgrading the dependency is a more standard approach than using the
fork or asking the Hive side to release another 1.2.x.

If we fail to upgrade it for critical or unavoidable reasons, yes, we could
find an alternative, but that basically means we're going to stay on 1.2.x
for, at least, a long time (say, until Spark 4.0.0?).

I know this has somehow become a sensitive topic, but to be literally honest
with myself, I think we should give it a try.


Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Sean Owen
It's almost certainly needed just to get off the fork of Hive we're
not supposed to have. Yes it's going to impact dependencies, so would
need to happen at Spark 3.
Separately, its usage could be reduced or removed -- this I don't know
much about. But it doesn't really make it harder or easier.

On Tue, Jan 15, 2019 at 12:40 PM Xiao Li  wrote:
>
> Since Spark 2.0, we have been trying to move all the Hive-specific logics to 
> a separate package and make Hive as a data source like the other built-in 
> data sources. You might see a lot of refactoring PRs for this goal. Hive will 
> be still an important data source Spark supports for sure.
>
> Now, the upgrade of Hive execution JAR touches so many code and changes many 
> dependencies. Any PR like this looks very risky to me. Both quality and 
> stability are my major concern. This could impact the adoption rate of our 
> upcoming Spark 3.0 release, which will contain many important features. I 
> doubt whether upgrading the Hive execution JAR is really needed?




Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Xiao Li
If https://issues.apache.org/jira/browse/HIVE-16391 can be resolved, we do
not need to keep our fork of Hive.

Sean Owen  于2019年1月15日周二 上午10:44写道:

> It's almost certainly needed just to get off the fork of Hive we're
> not supposed to have. Yes it's going to impact dependencies, so would
> need to happen at Spark 3.
> Separately, its usage could be reduced or removed -- this I don't know
> much about. But it doesn't really make it harder or easier.
>
> On Tue, Jan 15, 2019 at 12:40 PM Xiao Li  wrote:
> >
> > Since Spark 2.0, we have been trying to move all the Hive-specific
> logics to a separate package and make Hive as a data source like the other
> built-in data sources. You might see a lot of refactoring PRs for this
> goal. Hive will be still an important data source Spark supports for sure.
> >
> > Now, the upgrade of Hive execution JAR touches so many code and changes
> many dependencies. Any PR like this looks very risky to me. Both quality
> and stability are my major concern. This could impact the adoption rate of
> our upcoming Spark 3.0 release, which will contain many important features.
> I doubt whether upgrading the Hive execution JAR is really needed?
>


Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Xiao Li
Since Spark 2.0, we have been trying to move all the Hive-specific logic into
a separate package and make Hive a data source like the other built-in data
sources. You might have seen a lot of refactoring PRs toward this goal. Hive
will certainly remain an important data source that Spark supports.

Now, the upgrade of the Hive execution JAR touches so much code and changes so
many dependencies. Any PR like this looks very risky to me. Both quality and
stability are my major concerns. This could impact the adoption rate of our
upcoming Spark 3.0 release, which will contain many important features. I
doubt whether upgrading the Hive execution JAR is really needed.


Ryan Blue  于2019年1月15日周二 上午10:15写道:

> Xiao, thanks for clarifying.
>
> There are a few use cases for metastore tables. Felix mentions a good one,
> custom metastore tables. There are also common formats that Spark doesn't
> support natively. Spark has CSV support, but the behavior is different from
> Hive's delimited format. Hive also supports Sequence file tables. We have a
> few of those that are old.
>
> The other use case that comes to mind is mixed-format tables. I don't
> think Spark supports a different format per partition without going through
> the Hive read path. We use this feature to convert old tables to Parquet by
> simply writing new partitions in Parquet format. Without this, it would be
> a much more painful migration process. Only the jobs that read older
> partitions need to go through Hive, so converting to a HadoopFs table
> usually works.
>
> rb
>
> On Tue, Jan 15, 2019 at 10:07 AM Felix Cheung 
> wrote:
>
>> One common case we have is a custom input format.
>>
>> In any case, even when Hive metatstore is protocol compatible we should
>> still upgrade or replace the hive jar from a fork, as Sean says, from a ASF
>> release process standpoint. Unless there is a plan for removing hive
>> integration (all of it) from the spark core project..
>>
>>
>> --
>> *From:* Xiao Li 
>> *Sent:* Tuesday, January 15, 2019 10:03 AM
>> *To:* Felix Cheung
>> *Cc:* rb...@netflix.com; Yuming Wang; dev
>> *Subject:* Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
>>
>> Let me take my words back. To read/write a table, Spark users do not use
>> the Hive execution JARs, unless they explicitly create the Hive serde
>> tables. Actually, I want to understand the motivation and use cases why
>> your usage scenarios need to create Hive serde tables instead of our Spark
>> native tables?
>>
>> BTW, we are still using Hive metastore as our metadata store. This does
>> not require the Hive execution JAR upgrade, based on my understanding.
>> Users can upgrade it to the newer version of Hive metastore.
>>
>> Felix Cheung  于2019年1月15日周二 上午9:56写道:
>>
>>> And we are super 100% dependent on Hive...
>>>
>>>
>>> --
>>> *From:* Ryan Blue 
>>> *Sent:* Tuesday, January 15, 2019 9:53 AM
>>> *To:* Xiao Li
>>> *Cc:* Yuming Wang; dev
>>> *Subject:* Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
>>>
>>> How do we know that most Spark users are not using Hive? I wouldn't be
>>> surprised either way, but I do want to make sure we aren't making decisions
>>> based on any one person's (or one company's) experience about what "most"
>>> Spark users do.
>>>
>>> On Tue, Jan 15, 2019 at 9:44 AM Xiao Li  wrote:
>>>
>>>> Hi, Yuming,
>>>>
>>>> Thank you for your contributions! The community aims at reducing the
>>>> dependence on Hive. Currently, most of Spark users are not using Hive. The
>>>> changes looks risky to me.
>>>>
>>>> To support Hadoop 3.x, we just need to resolve this JIRA:
>>>> https://issues.apache.org/jira/browse/HIVE-16391
>>>>
>>>> Cheers,
>>>>
>>>> Xiao
>>>>
>>>> Yuming Wang  于2019年1月15日周二 上午8:41写道:
>>>>
>>>>> Dear Spark Developers and Users,
>>>>>
>>>>>
>>>>>
>>>>> Hyukjin and I plan to upgrade the built-in Hive from1.2.1-spark2
>>>>> <https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2> to2.3.4
>>>>> <https://github.com/apache/hive/releases/tag/rel%2Frelease-2.3.4> to
>>>>> solve some critical issues, such as support Hadoop 3.x, solve some ORC and
>>>>> Parquet issues. This is the list:
>>>>>
>>>>> *Hive issues*:
>>>>>
>>>

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Ryan Blue
Xiao, thanks for clarifying.

There are a few use cases for metastore tables. Felix mentions a good one,
custom metastore tables. There are also common formats that Spark doesn't
support natively. Spark has CSV support, but the behavior is different from
Hive's delimited format. Hive also supports Sequence file tables. We have a
few of those that are old.

The other use case that comes to mind is mixed-format tables. I don't think
Spark supports a different format per partition without going through the
Hive read path. We use this feature to convert old tables to Parquet by
simply writing new partitions in Parquet format. Without this, it would be
a much more painful migration process. Only the jobs that read older
partitions need to go through Hive, so converting to a HadoopFs table
usually works.

rb
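
As a concrete picture of the kind of legacy table described above, here is a
minimal sketch (table and column names are hypothetical) of a Hive serde
SequenceFile table, which Spark can only read and write through the Hive path.
Something like the following, run from spark-shell or a small app:

    import org.apache.spark.sql.SparkSession

    // Sketch only: requires a Spark build with Hive support and a metastore;
    // the table and column names are hypothetical.
    val spark = SparkSession.builder()
      .appName("legacy-serde-table")
      .enableHiveSupport()
      .getOrCreate()

    // A Hive serde table stored as SequenceFile: reads and writes go through
    // the Hive SerDe classes shipped in the built-in Hive execution JARs.
    spark.sql("""
      CREATE TABLE IF NOT EXISTS legacy_events (id BIGINT, payload STRING)
      PARTITIONED BY (ds STRING)
      STORED AS SEQUENCEFILE
    """)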

On Tue, Jan 15, 2019 at 10:07 AM Felix Cheung 
wrote:

> One common case we have is a custom input format.
>
> In any case, even when Hive metatstore is protocol compatible we should
> still upgrade or replace the hive jar from a fork, as Sean says, from a ASF
> release process standpoint. Unless there is a plan for removing hive
> integration (all of it) from the spark core project..
>
>
> --
> *From:* Xiao Li 
> *Sent:* Tuesday, January 15, 2019 10:03 AM
> *To:* Felix Cheung
> *Cc:* rb...@netflix.com; Yuming Wang; dev
> *Subject:* Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
>
> Let me take my words back. To read/write a table, Spark users do not use
> the Hive execution JARs, unless they explicitly create the Hive serde
> tables. Actually, I want to understand the motivation and use cases why
> your usage scenarios need to create Hive serde tables instead of our Spark
> native tables?
>
> BTW, we are still using Hive metastore as our metadata store. This does
> not require the Hive execution JAR upgrade, based on my understanding.
> Users can upgrade it to the newer version of Hive metastore.
>
> Felix Cheung  于2019年1月15日周二 上午9:56写道:
>
>> And we are super 100% dependent on Hive...
>>
>>
>> --
>> *From:* Ryan Blue 
>> *Sent:* Tuesday, January 15, 2019 9:53 AM
>> *To:* Xiao Li
>> *Cc:* Yuming Wang; dev
>> *Subject:* Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
>>
>> How do we know that most Spark users are not using Hive? I wouldn't be
>> surprised either way, but I do want to make sure we aren't making decisions
>> based on any one person's (or one company's) experience about what "most"
>> Spark users do.
>>
>> On Tue, Jan 15, 2019 at 9:44 AM Xiao Li  wrote:
>>
>>> Hi, Yuming,
>>>
>>> Thank you for your contributions! The community aims at reducing the
>>> dependence on Hive. Currently, most of Spark users are not using Hive. The
>>> changes looks risky to me.
>>>
>>> To support Hadoop 3.x, we just need to resolve this JIRA:
>>> https://issues.apache.org/jira/browse/HIVE-16391
>>>
>>> Cheers,
>>>
>>> Xiao
>>>
>>> Yuming Wang  于2019年1月15日周二 上午8:41写道:
>>>
>>>> Dear Spark Developers and Users,
>>>>
>>>>
>>>>
>>>> Hyukjin and I plan to upgrade the built-in Hive from1.2.1-spark2
>>>> <https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2> to2.3.4
>>>> <https://github.com/apache/hive/releases/tag/rel%2Frelease-2.3.4> to
>>>> solve some critical issues, such as support Hadoop 3.x, solve some ORC and
>>>> Parquet issues. This is the list:
>>>>
>>>> *Hive issues*:
>>>>
>>>> [SPARK-26332 
>>>> <https://issues.apache.org/jira/browse/SPARK-26332>][HIVE-10790]
>>>> Spark sql write orc table on viewFS throws exception
>>>>
>>>> [SPARK-25193 
>>>> <https://issues.apache.org/jira/browse/SPARK-25193>][HIVE-12505]
>>>> insert overwrite doesn't throw exception when drop old data fails
>>>>
>>>> [SPARK-26437 
>>>> <https://issues.apache.org/jira/browse/SPARK-26437>][HIVE-13083]
>>>> Decimal data becomes bigint to query, unable to query
>>>>
>>>> [SPARK-25919 
>>>> <https://issues.apache.org/jira/browse/SPARK-25919>][HIVE-11771]
>>>> Date value corrupts when tables are "ParquetHiveSerDe" formatted and target
>>>> table is Partitioned
>>>>
>>>> [SPARK-12014 
>>>> <https://issues.apache.org/jira/browse/SPARK-12014>][HIVE-11100]
>>>> Spark SQL query containing semicolon is bro

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Marcelo Vanzin
The metastore interactions in Spark are currently based on APIs that
are in the Hive exec jar; so that makes it not possible to have Spark
work with Hadoop 3 until the exec jar is upgraded.

It could be possible to re-implement those interactions based solely
on the metastore client Hive publishes; but that would be a lot of
work IIRC.

I can't comment on how many people use Hive serde tables (although I
do know they use it, just not how extensively), but that's not the
only reason why Spark currently requires the hive-exec jar.
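
For reference, the metastore client version is a separate knob from the
built-in execution Hive being discussed here. A minimal sketch of configuring
it follows; the version string and jar path are illustrative values, and the
set of accepted versions depends on the Spark release:

    import org.apache.spark.sql.SparkSession

    // Sketch only: the values below are illustrative.
    val spark = SparkSession.builder()
      .appName("metastore-client-version")
      .enableHiveSupport()
      // Version of the Hive metastore the client should talk to.
      .config("spark.sql.hive.metastore.version", "2.3.3")
      // Where to load the metastore client JARs from: "builtin", "maven",
      // or a classpath such as the (made-up) one below.
      .config("spark.sql.hive.metastore.jars", "/opt/hive-2.3.3/lib/*")
      .getOrCreate()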

On Tue, Jan 15, 2019 at 10:03 AM Xiao Li  wrote:
>
> Let me take my words back. To read/write a table, Spark users do not use the 
> Hive execution JARs, unless they explicitly create the Hive serde tables. 
> Actually, I want to understand the motivation and use cases why your usage 
> scenarios need to create Hive serde tables instead of our Spark native tables?
>
> BTW, we are still using Hive metastore as our metadata store. This does not 
> require the Hive execution JAR upgrade, based on my understanding. Users can 
> upgrade it to the newer version of Hive metastore.
>
> Felix Cheung  于2019年1月15日周二 上午9:56写道:
>>
>> And we are super 100% dependent on Hive...
>>
>>
>> 
>> From: Ryan Blue 
>> Sent: Tuesday, January 15, 2019 9:53 AM
>> To: Xiao Li
>> Cc: Yuming Wang; dev
>> Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
>>
>> How do we know that most Spark users are not using Hive? I wouldn't be 
>> surprised either way, but I do want to make sure we aren't making decisions 
>> based on any one person's (or one company's) experience about what "most" 
>> Spark users do.
>>
>> On Tue, Jan 15, 2019 at 9:44 AM Xiao Li  wrote:
>>>
>>> Hi, Yuming,
>>>
>>> Thank you for your contributions! The community aims at reducing the 
>>> dependence on Hive. Currently, most of Spark users are not using Hive. The 
>>> changes looks risky to me.
>>>
>>> To support Hadoop 3.x, we just need to resolve this JIRA: 
>>> https://issues.apache.org/jira/browse/HIVE-16391
>>>
>>> Cheers,
>>>
>>> Xiao
>>>
>>> Yuming Wang  于2019年1月15日周二 上午8:41写道:
>>>>
>>>> Dear Spark Developers and Users,
>>>>
>>>>
>>>>
>>>> Hyukjin and I plan to upgrade the built-in Hive from1.2.1-spark2 to 2.3.4 
>>>> to solve some critical issues, such as support Hadoop 3.x, solve some ORC 
>>>> and Parquet issues. This is the list:
>>>>
>>>> Hive issues:
>>>>
>>>> [SPARK-26332][HIVE-10790] Spark sql write orc table on viewFS throws 
>>>> exception
>>>>
>>>> [SPARK-25193][HIVE-12505] insert overwrite doesn't throw exception when 
>>>> drop old data fails
>>>>
>>>> [SPARK-26437][HIVE-13083] Decimal data becomes bigint to query, unable to 
>>>> query
>>>>
>>>> [SPARK-25919][HIVE-11771] Date value corrupts when tables are 
>>>> "ParquetHiveSerDe" formatted and target table is Partitioned
>>>>
>>>> [SPARK-12014][HIVE-11100] Spark SQL query containing semicolon is broken 
>>>> in Beeline
>>>>
>>>>
>>>>
>>>> Spark issues:
>>>>
>>>> [SPARK-23534] Spark run on Hadoop 3.0.0
>>>>
>>>> [SPARK-20202] Remove references to org.spark-project.hive
>>>>
>>>> [SPARK-18673] Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop 
>>>> version
>>>>
>>>> [SPARK-24766] CreateHiveTableAsSelect and InsertIntoHiveDir won't generate 
>>>> decimal column stats in parquet
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Since the code for the hive-thriftserver module has changed too much for 
>>>> this upgrade, I split it into two PRs for easy review.
>>>>
>>>> The first PR does not contain the changes of hive-thriftserver. Please 
>>>> ignore the failed test in hive-thriftserver.
>>>>
>>>> The second PR is complete changes.
>>>>
>>>>
>>>>
>>>> I have created a Spark distribution for Apache Hadoop 2.7, you might 
>>>> download it viaGoogle Drive or Baidu Pan.
>>>>
>>>> Please help review and test. Thanks.
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix



-- 
Marcelo




Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Felix Cheung
One common case we have is a custom input format.

In any case, even when the Hive metastore is protocol compatible, we should
still upgrade or replace the Hive jar from a fork, as Sean says, from an ASF
release process standpoint. Unless there is a plan for removing Hive
integration (all of it) from the Spark core project...
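
For illustration, a minimal sketch of the "custom input format" case: a Hive
serde table whose serde and input/output formats are spelled out explicitly.
The table name is hypothetical, and Hive's standard text classes stand in for
a genuinely custom format:

    import org.apache.spark.sql.SparkSession

    // Sketch only: needs a Hive-enabled Spark build; the table name is a placeholder.
    val spark = SparkSession.builder()
      .appName("custom-input-format")
      .enableHiveSupport()
      .getOrCreate()

    // Reading and writing this table goes through the Hive serde path, which
    // is why the built-in Hive execution JARs cannot simply be dropped.
    spark.sql("""
      CREATE TABLE IF NOT EXISTS custom_fmt_tbl (id INT, name STRING)
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
      STORED AS
        INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
        OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    """)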



From: Xiao Li 
Sent: Tuesday, January 15, 2019 10:03 AM
To: Felix Cheung
Cc: rb...@netflix.com; Yuming Wang; dev
Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

Let me take my words back. To read/write a table, Spark users do not use the 
Hive execution JARs, unless they explicitly create the Hive serde tables. 
Actually, I want to understand the motivation and use cases why your usage 
scenarios need to create Hive serde tables instead of our Spark native tables?

BTW, we are still using Hive metastore as our metadata store. This does not 
require the Hive execution JAR upgrade, based on my understanding. Users can 
upgrade it to the newer version of Hive metastore.

Felix Cheung <felixcheun...@hotmail.com> 于2019年1月15日周二 上午9:56写道:
And we are super 100% dependent on Hive...



From: Ryan Blue 
Sent: Tuesday, January 15, 2019 9:53 AM
To: Xiao Li
Cc: Yuming Wang; dev
Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

How do we know that most Spark users are not using Hive? I wouldn't be 
surprised either way, but I do want to make sure we aren't making decisions 
based on any one person's (or one company's) experience about what "most" Spark 
users do.

On Tue, Jan 15, 2019 at 9:44 AM Xiao Li <gatorsm...@gmail.com> wrote:
Hi, Yuming,

Thank you for your contributions! The community aims at reducing the dependence 
on Hive. Currently, most of Spark users are not using Hive. The changes looks 
risky to me.

To support Hadoop 3.x, we just need to resolve this JIRA: 
https://issues.apache.org/jira/browse/HIVE-16391

Cheers,

Xiao

Yuming Wang <wgy...@gmail.com> 于2019年1月15日周二 上午8:41写道:
Dear Spark Developers and Users,

Hyukjin and I plan to upgrade the built-in Hive from 1.2.1-spark2
<https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2> to 2.3.4
<https://github.com/apache/hive/releases/tag/rel%2Frelease-2.3.4> to solve
some critical issues, such as supporting Hadoop 3.x and fixing some ORC and
Parquet issues. This is the list:
Hive issues:
[SPARK-26332<https://issues.apache.org/jira/browse/SPARK-26332>][HIVE-10790] 
Spark sql write orc table on viewFS throws exception
[SPARK-25193<https://issues.apache.org/jira/browse/SPARK-25193>][HIVE-12505] 
insert overwrite doesn't throw exception when drop old data fails
[SPARK-26437<https://issues.apache.org/jira/browse/SPARK-26437>][HIVE-13083] 
Decimal data becomes bigint to query, unable to query
[SPARK-25919<https://issues.apache.org/jira/browse/SPARK-25919>][HIVE-11771] 
Date value corrupts when tables are "ParquetHiveSerDe" formatted and target 
table is Partitioned
[SPARK-12014<https://issues.apache.org/jira/browse/SPARK-12014>][HIVE-11100] 
Spark SQL query containing semicolon is broken in Beeline

Spark issues:
[SPARK-23534<https://issues.apache.org/jira/browse/SPARK-23534>] Spark run on 
Hadoop 3.0.0
[SPARK-20202<https://issues.apache.org/jira/browse/SPARK-20202>] Remove 
references to org.spark-project.hive
[SPARK-18673<https://issues.apache.org/jira/browse/SPARK-18673>] Dataframes 
doesn't work on Hadoop 3.x; Hive rejects Hadoop version
[SPARK-24766<https://issues.apache.org/jira/browse/SPARK-24766>] 
CreateHiveTableAsSelect and InsertIntoHiveDir won't generate decimal column 
stats in parquet


Since the code for the hive-thriftserver module has changed too much for this 
upgrade, I split it into two PRs for easy review.
The first PR<https://github.com/apache/spark/pull/23552> does not contain the 
changes of hive-thriftserver. Please ignore the failed test in 
hive-thriftserver.
The second PR<https://github.com/apache/spark/pull/23553> contains the complete changes.

I have created a Spark distribution for Apache Hadoop 2.7; you might download 
it via Google 
Drive<https://drive.google.com/open?id=1cq2I8hUTs9F4JkFyvRfdOJ5BlxV0ujgt> 
or Baidu Pan<https://pan.baidu.com/s/1b090Ctuyf1CDYS7c0puBqQ>.
Please help review and test. Thanks.


--
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Xiao Li
Let me take my words back. To read or write a table, Spark users do not use
the Hive execution JARs unless they explicitly create Hive serde tables.
Actually, I want to understand the motivation and use cases: why do your
scenarios need Hive serde tables instead of our Spark native tables?

BTW, we are still using the Hive metastore as our metadata store. Based on my
understanding, this does not require upgrading the Hive execution JARs; users
can upgrade to a newer version of the Hive metastore independently.

Felix Cheung wrote on Tue, Jan 15, 2019 at 9:56 AM:

> And we are super 100% dependent on Hive...
>
>
> --
> *From:* Ryan Blue 
> *Sent:* Tuesday, January 15, 2019 9:53 AM
> *To:* Xiao Li
> *Cc:* Yuming Wang; dev
> *Subject:* Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
>
> How do we know that most Spark users are not using Hive? I wouldn't be
> surprised either way, but I do want to make sure we aren't making decisions
> based on any one person's (or one company's) experience about what "most"
> Spark users do.
>
> On Tue, Jan 15, 2019 at 9:44 AM Xiao Li  wrote:
>
>> Hi, Yuming,
>>
>> Thank you for your contributions! The community aims at reducing the
>> dependence on Hive. Currently, most Spark users are not using Hive. The
>> changes look risky to me.
>>
>> To support Hadoop 3.x, we just need to resolve this JIRA:
>> https://issues.apache.org/jira/browse/HIVE-16391
>>
>> Cheers,
>>
>> Xiao
>>
>> Yuming Wang wrote on Tue, Jan 15, 2019 at 8:41 AM:
>>
>>> Dear Spark Developers and Users,
>>>
>>>
>>>
>>> Hyukjin and I plan to upgrade the built-in Hive from 1.2.1-spark2
>>> <https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2> to 2.3.4
>>> <https://github.com/apache/hive/releases/tag/rel%2Frelease-2.3.4> to solve
>>> some critical issues, such as supporting Hadoop 3.x and fixing some ORC
>>> and Parquet issues. Here is the list:
>>>
>>> *Hive issues*:
>>>
>>> [SPARK-26332 
>>> <https://issues.apache.org/jira/browse/SPARK-26332>][HIVE-10790]
>>> Spark sql write orc table on viewFS throws exception
>>>
>>> [SPARK-25193 
>>> <https://issues.apache.org/jira/browse/SPARK-25193>][HIVE-12505]
>>> insert overwrite doesn't throw exception when drop old data fails
>>>
>>> [SPARK-26437 
>>> <https://issues.apache.org/jira/browse/SPARK-26437>][HIVE-13083]
>>> Decimal data becomes bigint to query, unable to query
>>>
>>> [SPARK-25919 
>>> <https://issues.apache.org/jira/browse/SPARK-25919>][HIVE-11771]
>>> Date value corrupts when tables are "ParquetHiveSerDe" formatted and target
>>> table is Partitioned
>>>
>>> [SPARK-12014 
>>> <https://issues.apache.org/jira/browse/SPARK-12014>][HIVE-11100]
>>> Spark SQL query containing semicolon is broken in Beeline
>>>
>>>
>>>
>>> *Spark issues*:
>>>
>>> [SPARK-23534 <https://issues.apache.org/jira/browse/SPARK-23534>] Spark
>>> run on Hadoop 3.0.0
>>>
>>> [SPARK-20202 <https://issues.apache.org/jira/browse/SPARK-20202>]
>>> Remove references to org.spark-project.hive
>>>
>>> [SPARK-18673 <https://issues.apache.org/jira/browse/SPARK-18673>]
>>> Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
>>>
>>> [SPARK-24766 <https://issues.apache.org/jira/browse/SPARK-24766>]
>>> CreateHiveTableAsSelect and InsertIntoHiveDir won't generate decimal column
>>> stats in parquet
>>>
>>>
>>>
>>>
>>>
>>> Since the code for the *hive-thriftserver* module has changed too much
>>> for this upgrade, I split it into two PRs for easy review.
>>>
>>> The first PR <https://github.com/apache/spark/pull/23552> does not
>>> contain the changes of hive-thriftserver. Please ignore the failed test in
>>> hive-thriftserver.
>>>
>>> The second PR <https://github.com/apache/spark/pull/23553> contains the
>>> complete changes.
>>>
>>>
>>>
>>> I have created a Spark distribution for Apache Hadoop 2.7; you might
>>> download it via Google Drive
>>> <https://drive.google.com/open?id=1cq2I8hUTs9F4JkFyvRfdOJ5BlxV0ujgt> or Baidu
>>> Pan <https://pan.baidu.com/s/1b090Ctuyf1CDYS7c0puBqQ>.
>>>
>>> Please help review and test. Thanks.
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Sean Owen
Unless it's going away entirely, and I don't think it is, we at least
have to do this to get off the fork of Hive that's being used now.
I do think we want to keep Hive from getting into the core though --
see comments on PR.

On Tue, Jan 15, 2019 at 11:44 AM Xiao Li  wrote:
>
> Hi, Yuming,
>
> Thank you for your contributions! The community aims at reducing the 
> dependence on Hive. Currently, most Spark users are not using Hive. The 
> changes look risky to me.
>
> To support Hadoop 3.x, we just need to resolve this JIRA: 
> https://issues.apache.org/jira/browse/HIVE-16391
>
> Cheers,
>
> Xiao
>
> Yuming Wang wrote on Tue, Jan 15, 2019 at 8:41 AM:
>>
>> Dear Spark Developers and Users,
>>
>>
>>
>> Hyukjin and I plan to upgrade the built-in Hive from 1.2.1-spark2 to 2.3.4 
>> to solve some critical issues, such as supporting Hadoop 3.x and fixing some 
>> ORC and Parquet issues. Here is the list:
>>
>> Hive issues:
>>
>> [SPARK-26332][HIVE-10790] Spark sql write orc table on viewFS throws 
>> exception
>>
>> [SPARK-25193][HIVE-12505] insert overwrite doesn't throw exception when drop 
>> old data fails
>>
>> [SPARK-26437][HIVE-13083] Decimal data becomes bigint to query, unable to 
>> query
>>
>> [SPARK-25919][HIVE-11771] Date value corrupts when tables are 
>> "ParquetHiveSerDe" formatted and target table is Partitioned
>>
>> [SPARK-12014][HIVE-11100] Spark SQL query containing semicolon is broken in 
>> Beeline
>>
>>
>>
>> Spark issues:
>>
>> [SPARK-23534] Spark run on Hadoop 3.0.0
>>
>> [SPARK-20202] Remove references to org.spark-project.hive
>>
>> [SPARK-18673] Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop 
>> version
>>
>> [SPARK-24766] CreateHiveTableAsSelect and InsertIntoHiveDir won't generate 
>> decimal column stats in parquet
>>
>>
>>
>>
>>
>> Since the code for the hive-thriftserver module has changed too much for 
>> this upgrade, I split it into two PRs for easy review.
>>
>> The first PR does not contain the changes of hive-thriftserver. Please 
>> ignore the failed test in hive-thriftserver.
>>
>> The second PR contains the complete changes.
>>
>>
>>
>> I have created a Spark distribution for Apache Hadoop 2.7, you might 
>> download it via Google Drive or Baidu Pan.
>>
>> Please help review and test. Thanks.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Felix Cheung
And we are super 100% dependent on Hive...



From: Ryan Blue 
Sent: Tuesday, January 15, 2019 9:53 AM
To: Xiao Li
Cc: Yuming Wang; dev
Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

How do we know that most Spark users are not using Hive? I wouldn't be 
surprised either way, but I do want to make sure we aren't making decisions 
based on any one person's (or one company's) experience about what "most" Spark 
users do.

On Tue, Jan 15, 2019 at 9:44 AM Xiao Li <gatorsm...@gmail.com> wrote:
Hi, Yuming,

Thank you for your contributions! The community aims at reducing the dependence 
on Hive. Currently, most Spark users are not using Hive. The changes look 
risky to me.

To support Hadoop 3.x, we just need to resolve this JIRA: 
https://issues.apache.org/jira/browse/HIVE-16391

Cheers,

Xiao

Yuming Wang <wgy...@gmail.com> wrote on Tue, Jan 15, 2019 at 8:41 AM:
Dear Spark Developers and Users,

Hyukjin and I plan to upgrade the built-in Hive from 
1.2.1-spark2<https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2> to 
2.3.4<https://github.com/apache/hive/releases/tag/rel%2Frelease-2.3.4> to solve 
some critical issues, such as supporting Hadoop 3.x and fixing some ORC and 
Parquet issues. Here is the list:
Hive issues:
[SPARK-26332<https://issues.apache.org/jira/browse/SPARK-26332>][HIVE-10790] 
Spark sql write orc table on viewFS throws exception
[SPARK-25193<https://issues.apache.org/jira/browse/SPARK-25193>][HIVE-12505] 
insert overwrite doesn't throw exception when drop old data fails
[SPARK-26437<https://issues.apache.org/jira/browse/SPARK-26437>][HIVE-13083] 
Decimal data becomes bigint to query, unable to query
[SPARK-25919<https://issues.apache.org/jira/browse/SPARK-25919>][HIVE-11771] 
Date value corrupts when tables are "ParquetHiveSerDe" formatted and target 
table is Partitioned
[SPARK-12014<https://issues.apache.org/jira/browse/SPARK-12014>][HIVE-11100] 
Spark SQL query containing semicolon is broken in Beeline

Spark issues:
[SPARK-23534<https://issues.apache.org/jira/browse/SPARK-23534>] Spark run on 
Hadoop 3.0.0
[SPARK-20202<https://issues.apache.org/jira/browse/SPARK-20202>] Remove 
references to org.spark-project.hive
[SPARK-18673<https://issues.apache.org/jira/browse/SPARK-18673>] Dataframes 
doesn't work on Hadoop 3.x; Hive rejects Hadoop version
[SPARK-24766<https://issues.apache.org/jira/browse/SPARK-24766>] 
CreateHiveTableAsSelect and InsertIntoHiveDir won't generate decimal column 
stats in parquet


Since the code for the hive-thriftserver module has changed too much for this 
upgrade, I split it into two PRs for easy review.
The first PR<https://github.com/apache/spark/pull/23552> does not contain the 
changes of hive-thriftserver. Please ignore the failed test in 
hive-thriftserver.
The second PR<https://github.com/apache/spark/pull/23553> contains the complete changes.

I have created a Spark distribution for Apache Hadoop 2.7; you might download 
it via Google 
Drive<https://drive.google.com/open?id=1cq2I8hUTs9F4JkFyvRfdOJ5BlxV0ujgt> or 
Baidu Pan<https://pan.baidu.com/s/1b090Ctuyf1CDYS7c0puBqQ>.
Please help review and test. Thanks.


--
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Felix Cheung
Resolving https://issues.apache.org/jira/browse/HIVE-16391 means to keep Spark 
on Hive 1.2?

I’m not sure that is reducing the dependency on Hive: Hive is still there, and it’s 
a very old Hive. IMO the risk increases the longer we stay on this. (And it’s been 
years.)

Looking at the two PRs, they don’t seem very drastic to me, except for the thrift 
server. Is there another, better approach to the thrift server?



From: Xiao Li 
Sent: Tuesday, January 15, 2019 9:44 AM
To: Yuming Wang
Cc: dev
Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

Hi, Yuming,

Thank you for your contributions! The community aims at reducing the dependence 
on Hive. Currently, most Spark users are not using Hive. The changes look 
risky to me.

To support Hadoop 3.x, we just need to resolve this JIRA: 
https://issues.apache.org/jira/browse/HIVE-16391

Cheers,

Xiao

Yuming Wang <wgy...@gmail.com> wrote on Tue, Jan 15, 2019 at 8:41 AM:
Dear Spark Developers and Users,

Hyukjin and I plan to upgrade the built-in Hive from 
1.2.1-spark2<https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2> to 
2.3.4<https://github.com/apache/hive/releases/tag/rel%2Frelease-2.3.4> to solve 
some critical issues, such as supporting Hadoop 3.x and fixing some ORC and 
Parquet issues. Here is the list:
Hive issues:
[SPARK-26332<https://issues.apache.org/jira/browse/SPARK-26332>][HIVE-10790] 
Spark sql write orc table on viewFS throws exception
[SPARK-25193<https://issues.apache.org/jira/browse/SPARK-25193>][HIVE-12505] 
insert overwrite doesn't throw exception when drop old data fails
[SPARK-26437<https://issues.apache.org/jira/browse/SPARK-26437>][HIVE-13083] 
Decimal data becomes bigint to query, unable to query
[SPARK-25919<https://issues.apache.org/jira/browse/SPARK-25919>][HIVE-11771] 
Date value corrupts when tables are "ParquetHiveSerDe" formatted and target 
table is Partitioned
[SPARK-12014<https://issues.apache.org/jira/browse/SPARK-12014>][HIVE-11100] 
Spark SQL query containing semicolon is broken in Beeline

Spark issues:
[SPARK-23534<https://issues.apache.org/jira/browse/SPARK-23534>] Spark run on 
Hadoop 3.0.0
[SPARK-20202<https://issues.apache.org/jira/browse/SPARK-20202>] Remove 
references to org.spark-project.hive
[SPARK-18673<https://issues.apache.org/jira/browse/SPARK-18673>] Dataframes 
doesn't work on Hadoop 3.x; Hive rejects Hadoop version
[SPARK-24766<https://issues.apache.org/jira/browse/SPARK-24766>] 
CreateHiveTableAsSelect and InsertIntoHiveDir won't generate decimal column 
stats in parquet


Since the code for the hive-thriftserver module has changed too much for this 
upgrade, I split it into two PRs for easy review.
The first PR<https://github.com/apache/spark/pull/23552> does not contain the 
changes of hive-thriftserver. Please ignore the failed test in 
hive-thriftserver.
The second PR<https://github.com/apache/spark/pull/23553> contains the complete changes.

I have created a Spark distribution for Apache Hadoop 2.7, you might download 
it via Google 
Drive<https://drive.google.com/open?id=1cq2I8hUTs9F4JkFyvRfdOJ5BlxV0ujgt> or 
Baidu Pan<https://pan.baidu.com/s/1b090Ctuyf1CDYS7c0puBqQ>.
Please help review and test. Thanks.


Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Ryan Blue
How do we know that most Spark users are not using Hive? I wouldn't be
surprised either way, but I do want to make sure we aren't making decisions
based on any one person's (or one company's) experience about what "most"
Spark users do.

On Tue, Jan 15, 2019 at 9:44 AM Xiao Li  wrote:

> Hi, Yuming,
>
> Thank you for your contributions! The community aims at reducing the
> dependence on Hive. Currently, most Spark users are not using Hive. The
> changes look risky to me.
>
> To support Hadoop 3.x, we just need to resolve this JIRA:
> https://issues.apache.org/jira/browse/HIVE-16391
>
> Cheers,
>
> Xiao
>
> Yuming Wang wrote on Tue, Jan 15, 2019 at 8:41 AM:
>
>> Dear Spark Developers and Users,
>>
>>
>>
>> Hyukjin and I plan to upgrade the built-in Hive from 1.2.1-spark2
>>  to 2.3.4
>>  to
>> solve some critical issues, such as supporting Hadoop 3.x and fixing some ORC
>> and Parquet issues. Here is the list:
>>
>> *Hive issues*:
>>
>> [SPARK-26332 ][HIVE-10790]
>> Spark sql write orc table on viewFS throws exception
>>
>> [SPARK-25193 ][HIVE-12505]
>> insert overwrite doesn't throw exception when drop old data fails
>>
>> [SPARK-26437 ][HIVE-13083]
>> Decimal data becomes bigint to query, unable to query
>>
>> [SPARK-25919 ][HIVE-11771]
>> Date value corrupts when tables are "ParquetHiveSerDe" formatted and target
>> table is Partitioned
>>
>> [SPARK-12014 ][HIVE-11100]
>> Spark SQL query containing semicolon is broken in Beeline
>>
>>
>>
>> *Spark issues*:
>>
>> [SPARK-23534 ] Spark
>> run on Hadoop 3.0.0
>>
>> [SPARK-20202 ] Remove
>> references to org.spark-project.hive
>>
>> [SPARK-18673 ]
>> Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
>>
>> [SPARK-24766 ]
>> CreateHiveTableAsSelect and InsertIntoHiveDir won't generate decimal column
>> stats in parquet
>>
>>
>>
>>
>>
>> Since the code for the *hive-thriftserver* module has changed too much
>> for this upgrade, I split it into two PRs for easy review.
>>
>> The first PR  does not
>> contain the changes of hive-thriftserver. Please ignore the failed test in
>> hive-thriftserver.
>>
>> The second PR contains the complete
>> changes.
>>
>>
>>
>> I have created a Spark distribution for Apache Hadoop 2.7, you might
>> download it via Google Drive
>>  or Baidu
>> Pan .
>>
>> Please help review and test. Thanks.
>>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Xiao Li
Hi, Yuming,

Thank you for your contributions! The community aims at reducing the
dependence on Hive. Currently, most Spark users are not using Hive. The
changes look risky to me.

To support Hadoop 3.x, we just need to resolve this JIRA:
https://issues.apache.org/jira/browse/HIVE-16391

Cheers,

Xiao

Yuming Wang wrote on Tue, Jan 15, 2019 at 8:41 AM:

> Dear Spark Developers and Users,
>
>
>
> Hyukjin and I plan to upgrade the built-in Hive from 1.2.1-spark2
>  to 2.3.4
>  to
> solve some critical issues, such as supporting Hadoop 3.x and fixing some ORC
> and Parquet issues. Here is the list:
>
> *Hive issues*:
>
> [SPARK-26332 ][HIVE-10790]
> Spark sql write orc table on viewFS throws exception
>
> [SPARK-25193 ][HIVE-12505]
> insert overwrite doesn't throw exception when drop old data fails
>
> [SPARK-26437 ][HIVE-13083]
> Decimal data becomes bigint to query, unable to query
>
> [SPARK-25919 ][HIVE-11771]
> Date value corrupts when tables are "ParquetHiveSerDe" formatted and target
> table is Partitioned
>
> [SPARK-12014 ][HIVE-11100]
> Spark SQL query containing semicolon is broken in Beeline
>
>
>
> *Spark issues*:
>
> [SPARK-23534 ] Spark
> run on Hadoop 3.0.0
>
> [SPARK-20202 ] Remove
> references to org.spark-project.hive
>
> [SPARK-18673 ]
> Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
>
> [SPARK-24766 ]
> CreateHiveTableAsSelect and InsertIntoHiveDir won't generate decimal column
> stats in parquet
>
>
>
>
>
> Since the code for the *hive-thriftserver* module has changed too much
> for this upgrade, I split it into two PRs for easy review.
>
> The first PR  does not
> contain the changes of hive-thriftserver. Please ignore the failed test in
> hive-thriftserver.
>
> The second PR contains the complete
> changes.
>
>
>
> I have created a Spark distribution for Apache Hadoop 2.7, you might
> download it via Google Drive
>  or Baidu
> Pan .
>
> Please help review and test. Thanks.
>


[DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Yuming Wang
Dear Spark Developers and Users,



Hyukjin and I plan to upgrade the built-in Hive from 1.2.1-spark2
 to 2.3.4
 to solve
some critical issues, such as supporting Hadoop 3.x and fixing some ORC and
Parquet issues. Here is the list:

*Hive issues*:

[SPARK-26332 ][HIVE-10790]
Spark sql write orc table on viewFS throws exception

[SPARK-25193 ][HIVE-12505]
insert overwrite doesn't throw exception when drop old data fails

[SPARK-26437 ][HIVE-13083]
Decimal data becomes bigint to query, unable to query

[SPARK-25919 ][HIVE-11771]
Date value corrupts when tables are "ParquetHiveSerDe" formatted and target
table is Partitioned

[SPARK-12014 ][HIVE-11100]
Spark SQL query containing semicolon is broken in Beeline



*Spark issues*:

[SPARK-23534 ] Spark run
on Hadoop 3.0.0

[SPARK-20202 ] Remove
references to org.spark-project.hive

[SPARK-18673 ]
Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version

[SPARK-24766 ]
CreateHiveTableAsSelect and InsertIntoHiveDir won't generate decimal column
stats in parquet





Since the code for the *hive-thriftserver* module has changed too much for
this upgrade, I split it into two PRs for easy review.

The first PR  does not contain
the changes of hive-thriftserver. Please ignore the failed test in
hive-thriftserver.

The second PR contains the complete
changes.



I have created a Spark distribution for Apache Hadoop 2.7, you might
download it via Google Drive
 or Baidu
Pan .

Please help review and test. Thanks.