Re: Feature request: split dataset based on condition

2019-02-04 Thread Andrew Melo
Hi Ryan,

On Mon, Feb 4, 2019 at 12:17 PM Ryan Blue  wrote:
>
> To partition by a condition, you would need to create a column with the 
> result of that condition. Then you would partition by that column. The sort 
> option would also work here.

We actually do something similar to filter on physics properties: we run
a Python UDF to create a column and then filter on that column. Doing
something similar with sort/partition would also require a shuffle,
though, right?

>
> I don't think that there is much of a use case for this. You have a set of 
> conditions on which to partition your data, and partitioning is already 
> supported. The idea to use conditions to create separate data frames would 
> actually make that harder because you'd need to create and name tables for 
> each one.

At the end, however, we do need separate dataframes for each of these
subsamples, unless there's something basic I'm missing in how the
partitioning works. After the input datasets are split into signal and
background regions, we still need to perform further (different)
computations on each of the subsamples. E.g., for subsamples with
exactly 2 electrons, we'll need to calculate the sum of their 4-d
momenta, while samples with <2 electrons will need to subtract two
different physical quantities -- several more steps before we get to
the point where we'll histogram the different subsamples for the
outputs.
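
For concreteness, our current flow is roughly the sketch below (input path,
column names and cuts are all made up, and it's Scala rather than the pyspark
we actually use, but the shape is the same):

// Rough sketch only -- paths, columns and thresholds are illustrative.
import org.apache.spark.sql.functions._

val events  = spark.read.parquet("/path/to/events")
val flagged = events.withColumn("nElectrons", size(col("electrons")))

// disjoint subsamples that then go through different downstream computations
val twoEle     = flagged.filter(col("nElectrons") === 2)  // sum the 4-momenta
val lessTwoEle = flagged.filter(col("nElectrons") < 2)    // subtract the other quantities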

Cheers
Andrew


>
> On Mon, Feb 4, 2019 at 9:16 AM Andrew Melo  wrote:
>>
>> Hello Ryan,
>>
>> On Mon, Feb 4, 2019 at 10:52 AM Ryan Blue  wrote:
>> >
>> > Andrew, can you give us more information about why partitioning the output 
>> > data doesn't work for your use case?
>> >
>> > It sounds like all you need to do is to create a table partitioned by A 
>> > and B, then you would automatically get the divisions you want. If what 
>> > you're looking for is a way to scale the number of combinations then you 
>> > can use formats that support more partitions, or you could sort by the 
>> > fields and rely on Parquet row group pruning to filter out data you don't 
>> > want.
>> >
>>
>> TBH, I don't understand what that would look like in pyspark and what
>> the consequences would be. Looking at the docs, there doesn't appear
>> to be syntax for partitioning on a condition (most of our conditions
>> are of the form 'X > 30'). The use of Spark is still somewhat new in
>> our field, so it's possible we're not using it correctly.
>>
>> Cheers
>> Andrew
>>
>> > rb
>> >
>> > On Mon, Feb 4, 2019 at 8:33 AM Andrew Melo  wrote:
>> >>
>> >> Hello
>> >>
>> >> On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini  wrote:
>> >> >
>> >> > I've seen many applications that need to split a dataset into multiple
>> >> > datasets based on some conditions. As there is no method to do it in one
>> >> > place, developers use the filter method multiple times. I think it can be
>> >> > useful to have a method to split a dataset based on a condition in one
>> >> > iteration, something like the partition method of Scala (of course Scala's
>> >> > partition just splits a list into two lists, but something more general
>> >> > can be more useful).
>> >> > If you think it can be helpful, I can create a Jira issue and work on it
>> >> > to send a PR.
>> >>
>> >> This would be a really useful feature for our use case (processing
>> >> collision data from the LHC). We typically want to take some sort of
>> >> input and split into multiple disjoint outputs based on some
>> >> conditions. E.g. if we have two conditions A and B, we'll end up with
>> >> 4 outputs (AB, !AB, A!B, !A!B). As we add more conditions, the
>> >> combinatorics explode like 2^n, when we could produce them all up
>> >> front with this "multi filter" (or however it would be called).
>> >>
>> >> Cheers
>> >> Andrew
>> >>
>> >> >
>> >> > Best Regards
>> >> > Moein
>> >> >
>> >> > --
>> >> >
>> >> > Moein Hosseini
>> >> > Data Engineer
>> >> > mobile: +98 912 468 1859
>> >> > site: www.moein.xyz
>> >> > email: moein...@gmail.com
>> >> >
>> >>
>> >> -
>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>
>> >
>> >
>> > --
>> > Ryan Blue
>> > Software Engineer
>> > Netflix
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Feature request: split dataset based on condition

2019-02-04 Thread Thakrar, Jayesh
Just wondering if this is what you are implying, Ryan (example only):

val data = (dataset to be partitioned)

val splitCondition =
s"""
CASE
   WHEN …. THEN ….
   WHEN …. THEN …..
  END partition_condition
"""
val partitionedData = data.withColumn("partitionColumn", expr(splitCondition))

In this case there might be a need to cache/persist the partitionedData dataset 
to avoid recomputation as each "partition" is processed (e.g. saved, etc.) 
later on, correct?
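
Filled in with made-up conditions (just to make the sketch concrete, and with
the persist to avoid the recomputation mentioned above), the full pattern I
have in mind would be something like:

import org.apache.spark.sql.functions.{col, expr}

// illustrative conditions only
val splitCondition =
  """
  CASE
    WHEN x > 30 AND y > 30 THEN 'both'
    WHEN x > 30            THEN 'x_only'
    WHEN y > 30            THEN 'y_only'
    ELSE 'neither'
  END
  """

val partitionedData = data
  .withColumn("partitionColumn", expr(splitCondition))
  .persist()   // avoid recomputing the parent when each subset is used separately

// one-pass write, one directory per condition value ...
partitionedData.write.partitionBy("partitionColumn").parquet("/tmp/out")

// ... or peel off per-condition DataFrames for further processing
val both = partitionedData.filter(col("partitionColumn") === "both")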

From: Ryan Blue 
Reply-To: 
Date: Monday, February 4, 2019 at 12:16 PM
To: Andrew Melo 
Cc: Moein Hosseini , dev 
Subject: Re: Feature request: split dataset based on condition

To partition by a condition, you would need to create a column with the result 
of that condition. Then you would partition by that column. The sort option 
would also work here.

I don't think that there is much of a use case for this. You have a set of 
conditions on which to partition your data, and partitioning is already 
supported. The idea to use conditions to create separate data frames would 
actually make that harder because you'd need to create and name tables for each 
one.

On Mon, Feb 4, 2019 at 9:16 AM Andrew Melo <andrew.m...@gmail.com> wrote:
Hello Ryan,

On Mon, Feb 4, 2019 at 10:52 AM Ryan Blue <rb...@netflix.com> wrote:
>
> Andrew, can you give us more information about why partitioning the output 
> data doesn't work for your use case?
>
> It sounds like all you need to do is to create a table partitioned by A and 
> B, then you would automatically get the divisions you want. If what you're 
> looking for is a way to scale the number of combinations then you can use 
> formats that support more partitions, or you could sort by the fields and 
> rely on Parquet row group pruning to filter out data you don't want.
>

TBH, I don't understand what that would look like in pyspark and what
the consequences would be. Looking at the docs, there doesn't appear to
be syntax for partitioning on a condition (most of our conditions
are of the form 'X > 30'). The use of Spark is still somewhat new in
our field, so it's possible we're not using it correctly.

Cheers
Andrew

> rb
>
> On Mon, Feb 4, 2019 at 8:33 AM Andrew Melo <andrew.m...@gmail.com> wrote:
>>
>> Hello
>>
>> On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini <moein...@gmail.com> wrote:
>> >
>> > I've seen many applications that need to split a dataset into multiple
>> > datasets based on some conditions. As there is no method to do it in one
>> > place, developers use the filter method multiple times. I think it can be
>> > useful to have a method to split a dataset based on a condition in one
>> > iteration, something like the partition method of Scala (of course Scala's
>> > partition just splits a list into two lists, but something more general can
>> > be more useful).
>> > If you think it can be helpful, I can create a Jira issue and work on it to
>> > send a PR.
>>
>> This would be a really useful feature for our use case (processing
>> collision data from the LHC). We typically want to take some sort of
>> input and split into multiple disjoint outputs based on some
>> conditions. E.g. if we have two conditions A and B, we'll end up with
>> 4 outputs (AB, !AB, A!B, !A!B). As we add more conditions, the
>> combinatorics explode like 2^n, when we could produce them all up
>> front with this "multi filter" (or however it would be called).
>>
>> Cheers
>> Andrew
>>
>> >
>> > Best Regards
>> > Moein
>> >
>> > --
>> >
>> > Moein Hosseini
>> > Data Engineer
>> > mobile: +98 912 468 1859
>> > site: www.moein.xyz
>> > email: moein...@gmail.com
>> >
>>
>> -
>> To unsubscribe e-mail: 
>> dev-unsubscr...@spark.apache.org
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix


--
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-02-04 Thread Felix Cheung
Likely need a shim (which we should have anyway) because of namespace/import 
changes.

I’m huge +1 on this.



From: Hyukjin Kwon 
Sent: Monday, February 4, 2019 12:27 PM
To: Xiao Li
Cc: Sean Owen; Felix Cheung; Ryan Blue; Marcelo Vanzin; Yuming Wang; dev
Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

I should check the details and feasibility myself, but to me it sounds fine
if it doesn't need extra big effort.

On Tue, 5 Feb 2019, 4:15 am Xiao Li <gatorsm...@gmail.com> wrote:
Yes. When our support/integration with Hive 2.x becomes stable, we can do it in 
Hadoop 2.x profile too, if needed. The whole proposal is to minimize the risk 
and ensure the release stability and quality.

Hyukjin Kwon <gurwls...@gmail.com> 于2019年2月4日周一 下午12:01写道:
Xiao, to check if I understood correctly, do you mean the below?

1. Use our fork with Hadoop 2.x profile for now, and use Hive 2.x with Hadoop 
3.x profile.
2. Make another newer version of thrift server by Hive 2.x(?) in Spark side.
3. Target the transition to Hive 2.x completely and slowly later in the future.



2019년 2월 5일 (화) 오전 1:16, Xiao Li <gatorsm...@gmail.com>님이 작성:
To reduce the impact and risk of upgrading Hive execution JARs, we can just 
upgrade the built-in Hive to 2.x when using the profile of Hadoop 3.x. The 
support of Hadoop 3 will be still experimental in our next release. That means, 
the impact and risk are very minimal for most users who are still using Hadoop 
2.x profile.

The code changes in Spark thrift server are massive. It is risky and hard to 
review. The original code of our Spark thrift server is from Hive-service 
1.2.1. To reduce the risk of the upgrade, we can inline the new version. In the 
future, we can completely get rid of the thrift server, and build our own 
high-performant JDBC server.

Does this proposal sound good to you?

In the last two weeks, Yuming was trying this proposal. Now, he is on vacation. 
In China, today is already the lunar New Year. I would not expect he will reply 
this email in the next 7 days.

Cheers,

Xiao



Sean Owen <sro...@gmail.com> 于2019年2月4日周一 上午7:56写道:
I was unclear from this thread what the objection to these PRs is:

https://github.com/apache/spark/pull/23552
https://github.com/apache/spark/pull/23553

Would we like to specifically discuss whether to merge these or not? I
hear support for it, concerns about continuing to support Hive too,
but I wasn't clear whether those concerns specifically argue against
these PRs.


On Fri, Feb 1, 2019 at 2:03 PM Felix Cheung <felixcheun...@hotmail.com> wrote:
>
> What’s the update and next step on this?
>
> We have real users getting blocked by this issue.
>
>
> 
> From: Xiao Li <gatorsm...@gmail.com>
> Sent: Wednesday, January 16, 2019 9:37 AM
> To: Ryan Blue
> Cc: Marcelo Vanzin; Hyukjin Kwon; Sean Owen; Felix Cheung; Yuming Wang; dev
> Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
>
> Thanks for your feedbacks!
>
> Working with Yuming to reduce the risk of stability and quality. Will keep 
> you posted when the proposal is ready.
>
> Cheers,
>
> Xiao
>
> Ryan Blue <rb...@netflix.com> 于2019年1月16日周三 上午9:27写道:
>>
>> +1 for what Marcelo and Hyukjin said.
>>
>> In particular, I agree that we can't expect Hive to release a version that 
>> is now more than 3 years old just to solve a problem for Spark. Maybe that 
>> would have been a reasonable ask instead of publishing a fork years ago, but 
>> I think this is now Spark's problem.
>>
>> On Tue, Jan 15, 2019 at 9:02 PM Marcelo Vanzin <van...@cloudera.com> wrote:
>>>
>>> +1 to that. HIVE-16391 by itself means we're giving up things like
>>> Hadoop 3, and we're also putting the burden on the Hive folks to fix a
>>> problem that we created.
>>>
>>> The current PR is basically a Spark-side fix for that bug. It does
>>> mean also upgrading Hive (which gives us Hadoop 3, yay!), but I think
>>> it's really the right path to take here.
>>>
>>> On Tue, Jan 15, 2019 at 6:32 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>> >
>>> > Resolving HIVE-16391 means Hive to release 1.2.x that contains the fixes 
>>> > of our Hive fork (correct me if I am mistaken).
>>> >
>>> > Just to be honest by myself and as a personal opinion, that basically 
>>> > says Hive to take care of Spark's dependency.
>>> > Hive looks going ahead for 3.1.x and no one would use the newer release 
>>> > of 1.2.x. In practice, Spark doesn't make a release 1.6.x anymore for 
>>> > instance,
>>> >
>>> > Frankly, my impression was that it's, honestly, our mistake to fix. Since 
>>> > Spark community is big enough, I was thinking we should try to fix it by 
>>> > ourselves first.
>>> > I am not saying upgrading is the only way to get through this but I think 
>>> > we should at least try first, and see what's next.
>>> >
>>> > It does, yes, sound more risky to upgrade it in our side but I think 

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-02-04 Thread Hyukjin Kwon
I should check the details and feasibility myself, but to me it sounds
fine if it doesn't need extra big effort.

On Tue, 5 Feb 2019, 4:15 am Xiao Li  wrote:

> Yes. When our support/integration with Hive 2.x becomes stable, we can do
> it in Hadoop 2.x profile too, if needed. The whole proposal is to minimize
> the risk and ensure the release stability and quality.
>
> Hyukjin Kwon  于2019年2月4日周一 下午12:01写道:
>
>> Xiao, to check if I understood correctly, do you mean the below?
>>
>> 1. Use our fork with Hadoop 2.x profile for now, and use Hive 2.x with
>> Hadoop 3.x profile.
>> 2. Make another newer version of thrift server by Hive 2.x(?) in Spark
>> side.
>> 3. Target the transition to Hive 2.x completely and slowly later in the
>> future.
>>
>>
>>
>> 2019년 2월 5일 (화) 오전 1:16, Xiao Li 님이 작성:
>>
>>> To reduce the impact and risk of upgrading Hive execution JARs, we can
>>> just upgrade the built-in Hive to 2.x when using the profile of Hadoop 3.x.
>>> The support of Hadoop 3 will be still experimental in our next release.
>>> That means, the impact and risk are very minimal for most users who are
>>> still using Hadoop 2.x profile.
>>>
>>> The code changes in Spark thrift server are massive. It is risky and
>>> hard to review. The original code of our Spark thrift server is from
>>> Hive-service 1.2.1. To reduce the risk of the upgrade, we can inline the
>>> new version. In the future, we can completely get rid of the thrift server,
>>> and build our own high-performant JDBC server.
>>>
>>> Does this proposal sound good to you?
>>>
>>> In the last two weeks, Yuming was trying this proposal. Now, he is on
>>> vacation. In China, today is already the lunar New Year. I would not expect
>>> he will reply this email in the next 7 days.
>>>
>>> Cheers,
>>>
>>> Xiao
>>>
>>>
>>>
>>> Sean Owen  于2019年2月4日周一 上午7:56写道:
>>>
 I was unclear from this thread what the objection to these PRs is:

 https://github.com/apache/spark/pull/23552
 https://github.com/apache/spark/pull/23553

 Would we like to specifically discuss whether to merge these or not? I
 hear support for it, concerns about continuing to support Hive too,
 but I wasn't clear whether those concerns specifically argue against
 these PRs.


 On Fri, Feb 1, 2019 at 2:03 PM Felix Cheung 
 wrote:
 >
 > What’s the update and next step on this?
 >
 > We have real users getting blocked by this issue.
 >
 >
 > 
 > From: Xiao Li 
 > Sent: Wednesday, January 16, 2019 9:37 AM
 > To: Ryan Blue
 > Cc: Marcelo Vanzin; Hyukjin Kwon; Sean Owen; Felix Cheung; Yuming
 Wang; dev
 > Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
 >
 > Thanks for your feedbacks!
 >
 > Working with Yuming to reduce the risk of stability and quality. Will
 keep you posted when the proposal is ready.
 >
 > Cheers,
 >
 > Xiao
 >
 > Ryan Blue  于2019年1月16日周三 上午9:27写道:
 >>
 >> +1 for what Marcelo and Hyukjin said.
 >>
 >> In particular, I agree that we can't expect Hive to release a
 version that is now more than 3 years old just to solve a problem for
 Spark. Maybe that would have been a reasonable ask instead of publishing a
 fork years ago, but I think this is now Spark's problem.
 >>
 >> On Tue, Jan 15, 2019 at 9:02 PM Marcelo Vanzin 
 wrote:
 >>>
 >>> +1 to that. HIVE-16391 by itself means we're giving up things like
 >>> Hadoop 3, and we're also putting the burden on the Hive folks to
 fix a
 >>> problem that we created.
 >>>
 >>> The current PR is basically a Spark-side fix for that bug. It does
 >>> mean also upgrading Hive (which gives us Hadoop 3, yay!), but I
 think
 >>> it's really the right path to take here.
 >>>
 >>> On Tue, Jan 15, 2019 at 6:32 PM Hyukjin Kwon 
 wrote:
 >>> >
 >>> > Resolving HIVE-16391 means Hive to release 1.2.x that contains
 the fixes of our Hive fork (correct me if I am mistaken).
 >>> >
 >>> > Just to be honest by myself and as a personal opinion, that
 basically says Hive to take care of Spark's dependency.
 >>> > Hive looks going ahead for 3.1.x and no one would use the newer
 release of 1.2.x. In practice, Spark doesn't make a release 1.6.x anymore
 for instance,
 >>> >
 >>> > Frankly, my impression was that it's, honestly, our mistake to
 fix. Since Spark community is big enough, I was thinking we should try to
 fix it by ourselves first.
 >>> > I am not saying upgrading is the only way to get through this but
 I think we should at least try first, and see what's next.
 >>> >
 >>> > It does, yes, sound more risky to upgrade it in our side but I
 think it's worth to check and try it and see if it's possible.
 >>> > I think this is a standard approach to upgrade the dependency
 than using the fork or letting Hive side 

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-02-04 Thread Xiao Li
Yes. When our support/integration with Hive 2.x becomes stable, we can do
it in Hadoop 2.x profile too, if needed. The whole proposal is to minimize
the risk and ensure the release stability and quality.

Hyukjin Kwon  于2019年2月4日周一 下午12:01写道:

> Xiao, to check if I understood correctly, do you mean the below?
>
> 1. Use our fork with Hadoop 2.x profile for now, and use Hive 2.x with
> Hadoop 3.x profile.
> 2. Make another newer version of thrift server by Hive 2.x(?) in Spark
> side.
> 3. Target the transition to Hive 2.x completely and slowly later in the
> future.
>
>
>
> 2019년 2월 5일 (화) 오전 1:16, Xiao Li 님이 작성:
>
>> To reduce the impact and risk of upgrading Hive execution JARs, we can
>> just upgrade the built-in Hive to 2.x when using the profile of Hadoop 3.x.
>> The support of Hadoop 3 will be still experimental in our next release.
>> That means, the impact and risk are very minimal for most users who are
>> still using Hadoop 2.x profile.
>>
>> The code changes in Spark thrift server are massive. It is risky and hard
>> to review. The original code of our Spark thrift server is from
>> Hive-service 1.2.1. To reduce the risk of the upgrade, we can inline the
>> new version. In the future, we can completely get rid of the thrift server,
>> and build our own high-performant JDBC server.
>>
>> Does this proposal sound good to you?
>>
>> In the last two weeks, Yuming was trying this proposal. Now, he is on
>> vacation. In China, today is already the lunar New Year. I would not expect
>> he will reply this email in the next 7 days.
>>
>> Cheers,
>>
>> Xiao
>>
>>
>>
>> Sean Owen  于2019年2月4日周一 上午7:56写道:
>>
>>> I was unclear from this thread what the objection to these PRs is:
>>>
>>> https://github.com/apache/spark/pull/23552
>>> https://github.com/apache/spark/pull/23553
>>>
>>> Would we like to specifically discuss whether to merge these or not? I
>>> hear support for it, concerns about continuing to support Hive too,
>>> but I wasn't clear whether those concerns specifically argue against
>>> these PRs.
>>>
>>>
>>> On Fri, Feb 1, 2019 at 2:03 PM Felix Cheung 
>>> wrote:
>>> >
>>> > What’s the update and next step on this?
>>> >
>>> > We have real users getting blocked by this issue.
>>> >
>>> >
>>> > 
>>> > From: Xiao Li 
>>> > Sent: Wednesday, January 16, 2019 9:37 AM
>>> > To: Ryan Blue
>>> > Cc: Marcelo Vanzin; Hyukjin Kwon; Sean Owen; Felix Cheung; Yuming
>>> Wang; dev
>>> > Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
>>> >
>>> > Thanks for your feedbacks!
>>> >
>>> > Working with Yuming to reduce the risk of stability and quality. Will
>>> keep you posted when the proposal is ready.
>>> >
>>> > Cheers,
>>> >
>>> > Xiao
>>> >
>>> > Ryan Blue  于2019年1月16日周三 上午9:27写道:
>>> >>
>>> >> +1 for what Marcelo and Hyukjin said.
>>> >>
>>> >> In particular, I agree that we can't expect Hive to release a version
>>> that is now more than 3 years old just to solve a problem for Spark. Maybe
>>> that would have been a reasonable ask instead of publishing a fork years
>>> ago, but I think this is now Spark's problem.
>>> >>
>>> >> On Tue, Jan 15, 2019 at 9:02 PM Marcelo Vanzin 
>>> wrote:
>>> >>>
>>> >>> +1 to that. HIVE-16391 by itself means we're giving up things like
>>> >>> Hadoop 3, and we're also putting the burden on the Hive folks to fix
>>> a
>>> >>> problem that we created.
>>> >>>
>>> >>> The current PR is basically a Spark-side fix for that bug. It does
>>> >>> mean also upgrading Hive (which gives us Hadoop 3, yay!), but I think
>>> >>> it's really the right path to take here.
>>> >>>
>>> >>> On Tue, Jan 15, 2019 at 6:32 PM Hyukjin Kwon 
>>> wrote:
>>> >>> >
>>> >>> > Resolving HIVE-16391 means Hive to release 1.2.x that contains the
>>> fixes of our Hive fork (correct me if I am mistaken).
>>> >>> >
>>> >>> > Just to be honest by myself and as a personal opinion, that
>>> basically says Hive to take care of Spark's dependency.
>>> >>> > Hive looks going ahead for 3.1.x and no one would use the newer
>>> release of 1.2.x. In practice, Spark doesn't make a release 1.6.x anymore
>>> for instance,
>>> >>> >
>>> >>> > Frankly, my impression was that it's, honestly, our mistake to
>>> fix. Since Spark community is big enough, I was thinking we should try to
>>> fix it by ourselves first.
>>> >>> > I am not saying upgrading is the only way to get through this but
>>> I think we should at least try first, and see what's next.
>>> >>> >
>>> >>> > It does, yes, sound more risky to upgrade it in our side but I
>>> think it's worth to check and try it and see if it's possible.
>>> >>> > I think this is a standard approach to upgrade the dependency than
>>> using the fork or letting Hive side to release another 1.2.x.
>>> >>> >
>>> >>> > If we fail to upgrade it for critical or inevitable reasons
>>> somehow, yes, we could find an alternative but that basically means
>>> >>> > we're going to stay in 1.2.x for, at least, a long time (say ..
>>> until Spark 4.0.0?).
>>> >>> 

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-02-04 Thread Hyukjin Kwon
Xiao, to check if I understood correctly, do you mean the below?

1. Use our fork with Hadoop 2.x profile for now, and use Hive 2.x with
Hadoop 3.x profile.
2. Make another newer version of thrift server by Hive 2.x(?) in Spark side.
3. Target the transition to Hive 2.x completely and slowly later in the
future.



2019년 2월 5일 (화) 오전 1:16, Xiao Li 님이 작성:

> To reduce the impact and risk of upgrading Hive execution JARs, we can
> just upgrade the built-in Hive to 2.x when using the profile of Hadoop 3.x.
> The support of Hadoop 3 will be still experimental in our next release.
> That means, the impact and risk are very minimal for most users who are
> still using Hadoop 2.x profile.
>
> The code changes in Spark thrift server are massive. It is risky and hard
> to review. The original code of our Spark thrift server is from
> Hive-service 1.2.1. To reduce the risk of the upgrade, we can inline the
> new version. In the future, we can completely get rid of the thrift server,
> and build our own high-performant JDBC server.
>
> Does this proposal sound good to you?
>
> In the last two weeks, Yuming was trying this proposal. Now, he is on
> vacation. In China, today is already the lunar New Year. I would not expect
> he will reply this email in the next 7 days.
>
> Cheers,
>
> Xiao
>
>
>
> Sean Owen  于2019年2月4日周一 上午7:56写道:
>
>> I was unclear from this thread what the objection to these PRs is:
>>
>> https://github.com/apache/spark/pull/23552
>> https://github.com/apache/spark/pull/23553
>>
>> Would we like to specifically discuss whether to merge these or not? I
>> hear support for it, concerns about continuing to support Hive too,
>> but I wasn't clear whether those concerns specifically argue against
>> these PRs.
>>
>>
>> On Fri, Feb 1, 2019 at 2:03 PM Felix Cheung 
>> wrote:
>> >
>> > What’s the update and next step on this?
>> >
>> > We have real users getting blocked by this issue.
>> >
>> >
>> > 
>> > From: Xiao Li 
>> > Sent: Wednesday, January 16, 2019 9:37 AM
>> > To: Ryan Blue
>> > Cc: Marcelo Vanzin; Hyukjin Kwon; Sean Owen; Felix Cheung; Yuming Wang;
>> dev
>> > Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
>> >
>> > Thanks for your feedbacks!
>> >
>> > Working with Yuming to reduce the risk of stability and quality. Will
>> keep you posted when the proposal is ready.
>> >
>> > Cheers,
>> >
>> > Xiao
>> >
>> > Ryan Blue  于2019年1月16日周三 上午9:27写道:
>> >>
>> >> +1 for what Marcelo and Hyukjin said.
>> >>
>> >> In particular, I agree that we can't expect Hive to release a version
>> that is now more than 3 years old just to solve a problem for Spark. Maybe
>> that would have been a reasonable ask instead of publishing a fork years
>> ago, but I think this is now Spark's problem.
>> >>
>> >> On Tue, Jan 15, 2019 at 9:02 PM Marcelo Vanzin 
>> wrote:
>> >>>
>> >>> +1 to that. HIVE-16391 by itself means we're giving up things like
>> >>> Hadoop 3, and we're also putting the burden on the Hive folks to fix a
>> >>> problem that we created.
>> >>>
>> >>> The current PR is basically a Spark-side fix for that bug. It does
>> >>> mean also upgrading Hive (which gives us Hadoop 3, yay!), but I think
>> >>> it's really the right path to take here.
>> >>>
>> >>> On Tue, Jan 15, 2019 at 6:32 PM Hyukjin Kwon 
>> wrote:
>> >>> >
>> >>> > Resolving HIVE-16391 means Hive to release 1.2.x that contains the
>> fixes of our Hive fork (correct me if I am mistaken).
>> >>> >
>> >>> > Just to be honest by myself and as a personal opinion, that
>> basically says Hive to take care of Spark's dependency.
>> >>> > Hive looks going ahead for 3.1.x and no one would use the newer
>> release of 1.2.x. In practice, Spark doesn't make a release 1.6.x anymore
>> for instance,
>> >>> >
>> >>> > Frankly, my impression was that it's, honestly, our mistake to fix.
>> Since Spark community is big enough, I was thinking we should try to fix it
>> by ourselves first.
>> >>> > I am not saying upgrading is the only way to get through this but I
>> think we should at least try first, and see what's next.
>> >>> >
>> >>> > It does, yes, sound more risky to upgrade it in our side but I
>> think it's worth to check and try it and see if it's possible.
>> >>> > I think this is a standard approach to upgrade the dependency than
>> using the fork or letting Hive side to release another 1.2.x.
>> >>> >
>> >>> > If we fail to upgrade it for critical or inevitable reasons
>> somehow, yes, we could find an alternative but that basically means
>> >>> > we're going to stay in 1.2.x for, at least, a long time (say ..
>> until Spark 4.0.0?).
>> >>> >
>> >>> > I know somehow it happened to be sensitive but to be just literally
>> honest to myself, I think we should make a try.
>> >>> >
>> >>>
>> >>>
>> >>> --
>> >>> Marcelo
>> >>
>> >>
>> >>
>> >> --
>> >> Ryan Blue
>> >> Software Engineer
>> >> Netflix
>>
>


Re: Feature request: split dataset based on condition

2019-02-04 Thread Ryan Blue
To partition by a condition, you would need to create a column with the
result of that condition. Then you would partition by that column. The sort
option would also work here.
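
Concretely, something along these lines (df, the x column and the threshold
are just placeholders):

import org.apache.spark.sql.functions.col

val withCond = df.withColumn("cond", col("x") > 30)
withCond.write.partitionBy("cond").parquet("/tmp/by_condition")
// when reading back, filter(col("cond") === true) only touches the matching directories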

I don't think that there is much of a use case for this. You have a set of
conditions on which to partition your data, and partitioning is already
supported. The idea to use conditions to create separate data frames would
actually make that harder because you'd need to create and name tables for
each one.

On Mon, Feb 4, 2019 at 9:16 AM Andrew Melo  wrote:

> Hello Ryan,
>
> On Mon, Feb 4, 2019 at 10:52 AM Ryan Blue  wrote:
> >
> > Andrew, can you give us more information about why partitioning the
> output data doesn't work for your use case?
> >
> > It sounds like all you need to do is to create a table partitioned by A
> and B, then you would automatically get the divisions you want. If what
> you're looking for is a way to scale the number of combinations then you
> can use formats that support more partitions, or you could sort by the
> fields and rely on Parquet row group pruning to filter out data you don't
> want.
> >
>
> TBH, I don't understand what that would look like in pyspark and what
> the consequences would be. Looking at the docs, there doesn't appear to
> be syntax for partitioning on a condition (most of our conditions
> are of the form 'X > 30'). The use of Spark is still somewhat new in
> our field, so it's possible we're not using it correctly.
>
> Cheers
> Andrew
>
> > rb
> >
> > On Mon, Feb 4, 2019 at 8:33 AM Andrew Melo 
> wrote:
> >>
> >> Hello
> >>
> >> On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini 
> wrote:
> >> >
> >> > I've seen many applications that need to split a dataset into multiple
> datasets based on some conditions. As there is no method to do it in one
> place, developers use the filter method multiple times. I think it can be
> useful to have a method to split a dataset based on a condition in one
> iteration, something like the partition method of Scala (of course Scala's
> partition just splits a list into two lists, but something more general can
> be more useful).
> >> > If you think it can be helpful, I can create a Jira issue and work on
> it to send a PR.
> >>
> >> This would be a really useful feature for our use case (processing
> >> collision data from the LHC). We typically want to take some sort of
> >> input and split into multiple disjoint outputs based on some
> >> conditions. E.g. if we have two conditions A and B, we'll end up with
> >> 4 outputs (AB, !AB, A!B, !A!B). As we add more conditions, the
> >> combinatorics explode like 2^n, when we could produce them all up
> >> front with this "multi filter" (or however it would be called).
> >>
> >> Cheers
> >> Andrew
> >>
> >> >
> >> > Best Regards
> >> > Moein
> >> >
> >> > --
> >> >
> >> > Moein Hosseini
> >> > Data Engineer
> >> > mobile: +98 912 468 1859
> >> > site: www.moein.xyz
> >> > email: moein...@gmail.com
> >> >
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: scheduler braindump: architecture, gotchas, etc.

2019-02-04 Thread John Zhuge
Thx Xiao!

On Mon, Feb 4, 2019 at 9:04 AM Xiao Li  wrote:

> Thank you, Imran!
>
> Also, I attached the slides of "Deep Dive: Scheduler of Apache Spark".
>
> Cheers,
>
> Xiao
>
>
>
> John Zhuge  于2019年2月4日周一 上午8:59写道:
>
>> Thanks Imran!
>>
>> On Mon, Feb 4, 2019 at 8:42 AM Imran Rashid 
>> wrote:
>>
>>> The scheduler has been pretty error-prone and hard to work on, and I
>>> feel like there may be a dwindling core of active experts.  I'm sure its
>>> very discouraging to folks trying to make what seem like simple changes,
>>> and then find they are in a rats nest of complex issues they weren't
>>> expecting.  But for those who are still trying, THANK YOU!  more
>>> involvement and more folks becoming experts is definitely needed.
>>>
>>> I put together a doc going over the architecture of the scheduler, and
>>> things I've seen us get bitten by in the past.  Its sort of a brain dump,
>>> but I'm hopeful it'll help orient new folks to the scheduler.  I also hope
>>> more experts will chime in -- there are places in the doc I know I've
>>> missed things, and called that out, but there are probably even more that
>>> should be discussed, & mistakes I've made.  All input welcome.
>>>
>>>
>>> https://docs.google.com/document/d/1oiE21t-8gXLXk5evo-t-BXpO5Hdcob5D-Ps40hogsp8/edit?usp=sharing
>>>
>>
>>
>> --
>> John Zhuge
>>
>

-- 
John Zhuge


Re: scheduler braindump: architecture, gotchas, etc.

2019-02-04 Thread sujith chacko
Thanks Li and Imran for providing us an overview of one of the more complex
modules in Spark. Excellent sharing.

Regards
Sujith.

On Mon, 4 Feb 2019 at 10:54 PM, Xiao Li  wrote:

> Thank you, Imran!
>
> Also, I attached the slides of "Deep Dive: Scheduler of Apache Spark".
>
> Cheers,
>
> Xiao
>
>
>
> John Zhuge  于2019年2月4日周一 上午8:59写道:
>
>> Thanks Imran!
>>
>> On Mon, Feb 4, 2019 at 8:42 AM Imran Rashid 
>> wrote:
>>
>>> The scheduler has been pretty error-prone and hard to work on, and I
>>> feel like there may be a dwindling core of active experts.  I'm sure its
>>> very discouraging to folks trying to make what seem like simple changes,
>>> and then find they are in a rats nest of complex issues they weren't
>>> expecting.  But for those who are still trying, THANK YOU!  more
>>> involvement and more folks becoming experts is definitely needed.
>>>
>>> I put together a doc going over the architecture of the scheduler, and
>>> things I've seen us get bitten by in the past.  Its sort of a brain dump,
>>> but I'm hopeful it'll help orient new folks to the scheduler.  I also hope
>>> more experts will chime in -- there are places in the doc I know I've
>>> missed things, and called that out, but there are probably even more that
>>> should be discussed, & mistakes I've made.  All input welcome.
>>>
>>>
>>> https://docs.google.com/document/d/1oiE21t-8gXLXk5evo-t-BXpO5Hdcob5D-Ps40hogsp8/edit?usp=sharing
>>>
>>
>>
>> --
>> John Zhuge
>>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


Re: scheduler braindump: architecture, gotchas, etc.

2019-02-04 Thread Parth Gandhi
Thank you Imran, this is quite helpful.

Regards,
Parth Kamlesh Gandhi


On Mon, Feb 4, 2019 at 11:01 AM Rubén Berenguel 
wrote:

> Thanks Imran, will definitely give it a look (even if just out of sheer
> interest on how the sausage is done)
>
> R
>
>
> On 4 February 2019 at 17:59:33, John Zhuge (jzh...@apache.org) wrote:
>
> Thanks Imran!
>
> On Mon, Feb 4, 2019 at 8:42 AM Imran Rashid 
> wrote:
>
>> The scheduler has been pretty error-prone and hard to work on, and I feel
>> like there may be a dwindling core of active experts.  I'm sure its very
>> discouraging to folks trying to make what seem like simple changes, and
>> then find they are in a rats nest of complex issues they weren't
>> expecting.  But for those who are still trying, THANK YOU!  more
>> involvement and more folks becoming experts is definitely needed.
>>
>> I put together a doc going over the architecture of the scheduler, and
>> things I've seen us get bitten by in the past.  Its sort of a brain dump,
>> but I'm hopeful it'll help orient new folks to the scheduler.  I also hope
>> more experts will chime in -- there are places in the doc I know I've
>> missed things, and called that out, but there are probably even more that
>> should be discussed, & mistakes I've made.  All input welcome.
>>
>>
>> https://docs.google.com/document/d/1oiE21t-8gXLXk5evo-t-BXpO5Hdcob5D-Ps40hogsp8/edit?usp=sharing
>>
>
>
> --
> John Zhuge
>
>


Re: Feature request: split dataset based on condition

2019-02-04 Thread Andrew Melo
Hello Ryan,

On Mon, Feb 4, 2019 at 10:52 AM Ryan Blue  wrote:
>
> Andrew, can you give us more information about why partitioning the output 
> data doesn't work for your use case?
>
> It sounds like all you need to do is to create a table partitioned by A and 
> B, then you would automatically get the divisions you want. If what you're 
> looking for is a way to scale the number of combinations then you can use 
> formats that support more partitions, or you could sort by the fields and 
> rely on Parquet row group pruning to filter out data you don't want.
>

TBH, I don't understand what that would look like in pyspark and what
the consequences would be. Looking at the docs, there doesn't appear to
be syntax for partitioning on a condition (most of our conditions
are of the form 'X > 30'). The use of Spark is still somewhat new in
our field, so it's possible we're not using it correctly.

Cheers
Andrew

> rb
>
> On Mon, Feb 4, 2019 at 8:33 AM Andrew Melo  wrote:
>>
>> Hello
>>
>> On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini  wrote:
>> >
>> > I've seen many applications that need to split a dataset into multiple
>> > datasets based on some conditions. As there is no method to do it in one
>> > place, developers use the filter method multiple times. I think it can be
>> > useful to have a method to split a dataset based on a condition in one
>> > iteration, something like the partition method of Scala (of course Scala's
>> > partition just splits a list into two lists, but something more general can
>> > be more useful).
>> > If you think it can be helpful, I can create a Jira issue and work on it to
>> > send a PR.
>>
>> This would be a really useful feature for our use case (processing
>> collision data from the LHC). We typically want to take some sort of
>> input and split into multiple disjoint outputs based on some
>> conditions. E.g. if we have two conditions A and B, we'll end up with
>> 4 outputs (AB, !AB, A!B, !A!B). As we add more conditions, the
>> combinatorics explode like 2^n, when we could produce them all up
>> front with this "multi filter" (or however it would be called).
>>
>> Cheers
>> Andrew
>>
>> >
>> > Best Regards
>> > Moein
>> >
>> > --
>> >
>> > Moein Hosseini
>> > Data Engineer
>> > mobile: +98 912 468 1859
>> > site: www.moein.xyz
>> > email: moein...@gmail.com
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-02-04 Thread Xiao Li
To reduce the impact and risk of upgrading Hive execution JARs, we can just
upgrade the built-in Hive to 2.x when using the profile of Hadoop 3.x. The
support of Hadoop 3 will be still experimental in our next release. That
means, the impact and risk are very minimal for most users who are still
using Hadoop 2.x profile.

The code changes in Spark thrift server are massive. It is risky and hard
to review. The original code of our Spark thrift server is from
Hive-service 1.2.1. To reduce the risk of the upgrade, we can inline the
new version. In the future, we can completely get rid of the thrift server,
and build our own high-performant JDBC server.

Does this proposal sound good to you?

In the last two weeks, Yuming was trying this proposal. Now, he is on
vacation. In China, today is already the lunar New Year. I would not expect
he will reply this email in the next 7 days.

Cheers,

Xiao



Sean Owen  于2019年2月4日周一 上午7:56写道:

> I was unclear from this thread what the objection to these PRs is:
>
> https://github.com/apache/spark/pull/23552
> https://github.com/apache/spark/pull/23553
>
> Would we like to specifically discuss whether to merge these or not? I
> hear support for it, concerns about continuing to support Hive too,
> but I wasn't clear whether those concerns specifically argue against
> these PRs.
>
>
> On Fri, Feb 1, 2019 at 2:03 PM Felix Cheung 
> wrote:
> >
> > What’s the update and next step on this?
> >
> > We have real users getting blocked by this issue.
> >
> >
> > 
> > From: Xiao Li 
> > Sent: Wednesday, January 16, 2019 9:37 AM
> > To: Ryan Blue
> > Cc: Marcelo Vanzin; Hyukjin Kwon; Sean Owen; Felix Cheung; Yuming Wang;
> dev
> > Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
> >
> > Thanks for your feedbacks!
> >
> > Working with Yuming to reduce the risk of stability and quality. Will
> keep you posted when the proposal is ready.
> >
> > Cheers,
> >
> > Xiao
> >
> > Ryan Blue  于2019年1月16日周三 上午9:27写道:
> >>
> >> +1 for what Marcelo and Hyukjin said.
> >>
> >> In particular, I agree that we can't expect Hive to release a version
> that is now more than 3 years old just to solve a problem for Spark. Maybe
> that would have been a reasonable ask instead of publishing a fork years
> ago, but I think this is now Spark's problem.
> >>
> >> On Tue, Jan 15, 2019 at 9:02 PM Marcelo Vanzin 
> wrote:
> >>>
> >>> +1 to that. HIVE-16391 by itself means we're giving up things like
> >>> Hadoop 3, and we're also putting the burden on the Hive folks to fix a
> >>> problem that we created.
> >>>
> >>> The current PR is basically a Spark-side fix for that bug. It does
> >>> mean also upgrading Hive (which gives us Hadoop 3, yay!), but I think
> >>> it's really the right path to take here.
> >>>
> >>> On Tue, Jan 15, 2019 at 6:32 PM Hyukjin Kwon 
> wrote:
> >>> >
> >>> > Resolving HIVE-16391 means Hive to release 1.2.x that contains the
> fixes of our Hive fork (correct me if I am mistaken).
> >>> >
> >>> > Just to be honest by myself and as a personal opinion, that
> basically says Hive to take care of Spark's dependency.
> >>> > Hive looks going ahead for 3.1.x and no one would use the newer
> release of 1.2.x. In practice, Spark doesn't make a release 1.6.x anymore
> for instance,
> >>> >
> >>> > Frankly, my impression was that it's, honestly, our mistake to fix.
> Since Spark community is big enough, I was thinking we should try to fix it
> by ourselves first.
> >>> > I am not saying upgrading is the only way to get through this but I
> think we should at least try first, and see what's next.
> >>> >
> >>> > It does, yes, sound more risky to upgrade it in our side but I think
> it's worth to check and try it and see if it's possible.
> >>> > I think this is a standard approach to upgrade the dependency than
> using the fork or letting Hive side to release another 1.2.x.
> >>> >
> >>> > If we fail to upgrade it for critical or inevitable reasons somehow,
> yes, we could find an alternative but that basically means
> >>> > we're going to stay in 1.2.x for, at least, a long time (say ..
> until Spark 4.0.0?).
> >>> >
> >>> > I know somehow it happened to be sensitive but to be just literally
> honest to myself, I think we should make a try.
> >>> >
> >>>
> >>>
> >>> --
> >>> Marcelo
> >>
> >>
> >>
> >> --
> >> Ryan Blue
> >> Software Engineer
> >> Netflix
>


Re: scheduler braindump: architecture, gotchas, etc.

2019-02-04 Thread John Zhuge
Thanks Imran!

On Mon, Feb 4, 2019 at 8:42 AM Imran Rashid 
wrote:

> The scheduler has been pretty error-prone and hard to work on, and I feel
> like there may be a dwindling core of active experts.  I'm sure its very
> discouraging to folks trying to make what seem like simple changes, and
> then find they are in a rats nest of complex issues they weren't
> expecting.  But for those who are still trying, THANK YOU!  more
> involvement and more folks becoming experts is definitely needed.
>
> I put together a doc going over the architecture of the scheduler, and
> things I've seen us get bitten by in the past.  Its sort of a brain dump,
> but I'm hopeful it'll help orient new folks to the scheduler.  I also hope
> more experts will chime in -- there are places in the doc I know I've
> missed things, and called that out, but there are probably even more that
> should be discussed, & mistakes I've made.  All input welcome.
>
>
> https://docs.google.com/document/d/1oiE21t-8gXLXk5evo-t-BXpO5Hdcob5D-Ps40hogsp8/edit?usp=sharing
>


-- 
John Zhuge


Re: Feature request: split dataset based on condition

2019-02-04 Thread Ryan Blue
Andrew, can you give us more information about why partitioning the output
data doesn't work for your use case?

It sounds like all you need to do is to create a table partitioned by A and
B, then you would automatically get the divisions you want. If what you're
looking for is a way to scale the number of combinations then you can use
formats that support more partitions, or you could sort by the fields and
rely on Parquet row group pruning to filter out data you don't want.
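
A rough sketch of the two options, with made-up column names and paths:

import org.apache.spark.sql.functions.col

// option 1: write a table partitioned by the condition columns
df.withColumn("condA", col("x") > 30)
  .withColumn("condB", col("y") > 30)
  .write.partitionBy("condA", "condB").parquet("/tmp/partitioned")

// option 2: sort by the raw fields so Parquet min/max row-group stats let
// readers skip data; this scales better than creating many small partitions
df.sort("x", "y").write.parquet("/tmp/sorted")
spark.read.parquet("/tmp/sorted").filter(col("x") > 30)  // pushed down, prunes row groups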

rb

On Mon, Feb 4, 2019 at 8:33 AM Andrew Melo  wrote:

> Hello
>
> On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini  wrote:
> >
> > I've seen many applications that need to split a dataset into multiple
> datasets based on some conditions. As there is no method to do it in one
> place, developers use the filter method multiple times. I think it can be
> useful to have a method to split a dataset based on a condition in one
> iteration, something like the partition method of Scala (of course Scala's
> partition just splits a list into two lists, but something more general can
> be more useful).
> > If you think it can be helpful, I can create a Jira issue and work on it
> to send a PR.
>
> This would be a really useful feature for our use case (processing
> collision data from the LHC). We typically want to take some sort of
> input and split into multiple disjoint outputs based on some
> conditions. E.g. if we have two conditions A and B, we'll end up with
> 4 outputs (AB, !AB, A!B, !A!B). As we add more conditions, the
> combinatorics explode like 2^n, when we could produce them all up
> front with this "multi filter" (or however it would be called).
>
> Cheers
> Andrew
>
> >
> > Best Regards
> > Moein
> >
> > --
> >
> > Moein Hosseini
> > Data Engineer
> > mobile: +98 912 468 1859
> > site: www.moein.xyz
> > email: moein...@gmail.com
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
Ryan Blue
Software Engineer
Netflix


scheduler braindump: architecture, gotchas, etc.

2019-02-04 Thread Imran Rashid
The scheduler has been pretty error-prone and hard to work on, and I feel
like there may be a dwindling core of active experts.  I'm sure its very
discouraging to folks trying to make what seem like simple changes, and
then find they are in a rats nest of complex issues they weren't
expecting.  But for those who are still trying, THANK YOU!  more
involvement and more folks becoming experts is definitely needed.

I put together a doc going over the architecture of the scheduler, and
things I've seen us get bitten by in the past.  Its sort of a brain dump,
but I'm hopeful it'll help orient new folks to the scheduler.  I also hope
more experts will chime in -- there are places in the doc I know I've
missed things, and called that out, but there are probably even more that
should be discussed, & mistakes I've made.  All input welcome.

https://docs.google.com/document/d/1oiE21t-8gXLXk5evo-t-BXpO5Hdcob5D-Ps40hogsp8/edit?usp=sharing


Re: Feature request: split dataset based on condition

2019-02-04 Thread Andrew Melo
Hello

On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini  wrote:
>
> I've seen many applications that need to split a dataset into multiple
> datasets based on some conditions. As there is no method to do it in one
> place, developers use the filter method multiple times. I think it can be
> useful to have a method to split a dataset based on a condition in one
> iteration, something like the partition method of Scala (of course Scala's
> partition just splits a list into two lists, but something more general can
> be more useful).
> If you think it can be helpful, I can create a Jira issue and work on it to
> send a PR.

This would be a really useful feature for our use case (processing
collision data from the LHC). We typically want to take some sort of
input and split into multiple disjoint outputs based on some
conditions. E.g. if we have two conditions A and B, we'll end up with
4 outputs (AB, !AB, A!B, !A!B). As we add more conditions, the
combinatorics explode like 2^n, when we could produce them all up
front with this "multi filter" (or however it would be called).
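
To make that concrete, with condA and condB standing in for Column
expressions (and multiFilter being a purely made-up name), today we write:

// what we do today: each output rescans the same input when it is materialized
val ab   = events.filter(condA && condB)    // AB
val nab  = events.filter(!condA && condB)   // !AB
val anb  = events.filter(condA && !condB)   // A!B
val nanb = events.filter(!condA && !condB)  // !A!B

// what a "multi filter" could look like (hypothetical API, nothing like this exists today):
// val Seq(ab, nab, anb, nanb) = events.multiFilter(condA, condB)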

Cheers
Andrew

>
> Best Regards
> Moein
>
> --
>
> Moein Hosseini
> Data Engineer
> mobile: +98 912 468 1859
> site: www.moein.xyz
> email: moein...@gmail.com
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-02-04 Thread Sean Owen
I was unclear from this thread what the objection to these PRs is:

https://github.com/apache/spark/pull/23552
https://github.com/apache/spark/pull/23553

Would we like to specifically discuss whether to merge these or not? I
hear support for it, concerns about continuing to support Hive too,
but I wasn't clear whether those concerns specifically argue against
these PRs.


On Fri, Feb 1, 2019 at 2:03 PM Felix Cheung  wrote:
>
> What’s the update and next step on this?
>
> We have real users getting blocked by this issue.
>
>
> 
> From: Xiao Li 
> Sent: Wednesday, January 16, 2019 9:37 AM
> To: Ryan Blue
> Cc: Marcelo Vanzin; Hyukjin Kwon; Sean Owen; Felix Cheung; Yuming Wang; dev
> Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
>
> Thanks for your feedbacks!
>
> Working with Yuming to reduce the risk of stability and quality. Will keep 
> you posted when the proposal is ready.
>
> Cheers,
>
> Xiao
>
> Ryan Blue  于2019年1月16日周三 上午9:27写道:
>>
>> +1 for what Marcelo and Hyukjin said.
>>
>> In particular, I agree that we can't expect Hive to release a version that 
>> is now more than 3 years old just to solve a problem for Spark. Maybe that 
>> would have been a reasonable ask instead of publishing a fork years ago, but 
>> I think this is now Spark's problem.
>>
>> On Tue, Jan 15, 2019 at 9:02 PM Marcelo Vanzin  wrote:
>>>
>>> +1 to that. HIVE-16391 by itself means we're giving up things like
>>> Hadoop 3, and we're also putting the burden on the Hive folks to fix a
>>> problem that we created.
>>>
>>> The current PR is basically a Spark-side fix for that bug. It does
>>> mean also upgrading Hive (which gives us Hadoop 3, yay!), but I think
>>> it's really the right path to take here.
>>>
>>> On Tue, Jan 15, 2019 at 6:32 PM Hyukjin Kwon  wrote:
>>> >
>>> > Resolving HIVE-16391 means Hive to release 1.2.x that contains the fixes 
>>> > of our Hive fork (correct me if I am mistaken).
>>> >
>>> > Just to be honest by myself and as a personal opinion, that basically 
>>> > says Hive to take care of Spark's dependency.
>>> > Hive looks going ahead for 3.1.x and no one would use the newer release 
>>> > of 1.2.x. In practice, Spark doesn't make a release 1.6.x anymore for 
>>> > instance,
>>> >
>>> > Frankly, my impression was that it's, honestly, our mistake to fix. Since 
>>> > Spark community is big enough, I was thinking we should try to fix it by 
>>> > ourselves first.
>>> > I am not saying upgrading is the only way to get through this but I think 
>>> > we should at least try first, and see what's next.
>>> >
>>> > It does, yes, sound more risky to upgrade it in our side but I think it's 
>>> > worth to check and try it and see if it's possible.
>>> > I think this is a standard approach to upgrade the dependency than using 
>>> > the fork or letting Hive side to release another 1.2.x.
>>> >
>>> > If we fail to upgrade it for critical or inevitable reasons somehow, yes, 
>>> > we could find an alternative but that basically means
>>> > we're going to stay in 1.2.x for, at least, a long time (say .. until 
>>> > Spark 4.0.0?).
>>> >
>>> > I know somehow it happened to be sensitive but to be just literally 
>>> > honest to myself, I think we should make a try.
>>> >
>>>
>>>
>>> --
>>> Marcelo
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org