Re: Feature request: split dataset based on condition

2019-02-04 Thread Andrew Melo
Hi Ryan, On Mon, Feb 4, 2019 at 12:17 PM Ryan Blue wrote: > > To partition by a condition, you would need to create a column with the > result of that condition. Then you would partition by that column. The sort > option would also work here. We actually do something similar to filter based

Re: Feature request: split dataset based on condition

2019-02-04 Thread Thakrar, Jayesh
Just wondering if this is what you are implying, Ryan (example only): val data = (dataset to be partitioned) val splitCondition = s""" CASE WHEN …. THEN …. WHEN …. THEN ….. END partition_condition """ val partitionedData = data.withColumn("partitionColumn",
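Jayesh's partial snippet above can be completed roughly as follows. This is a sketch only: the conditions, column names, and output path are illustrative placeholders, not taken from the thread.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().getOrCreate()
val data = spark.range(100).withColumnRenamed("id", "value")  // stand-in dataset

// Derive a partition column from a CASE expression (conditions are placeholders),
// then partition the written output by that column.
val partitionedData = data.withColumn(
  "partitionColumn",
  expr("CASE WHEN value < 30 THEN 'low' WHEN value < 70 THEN 'mid' ELSE 'high' END")
)

partitionedData.write
  .partitionBy("partitionColumn")
  .parquet("/tmp/partitioned_output")  // placeholder path
```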

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-02-04 Thread Felix Cheung
Likely need a shim (which we should have anyway) because of namespace/import changes. I’m a huge +1 on this. From: Hyukjin Kwon Sent: Monday, February 4, 2019 12:27 PM To: Xiao Li Cc: Sean Owen; Felix Cheung; Ryan Blue; Marcelo Vanzin; Yuming Wang; dev Subject:

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-02-04 Thread Hyukjin Kwon
I should check the details and feasibility myself, but to me it sounds fine if it doesn't need extra big efforts. On Tue, 5 Feb 2019, 4:15 am Xiao Li Yes. When our support/integration with Hive 2.x becomes stable, we can do > it in Hadoop 2.x profile too, if needed. The whole proposal is to

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-02-04 Thread Xiao Li
Yes. When our support/integration with Hive 2.x becomes stable, we can do it in Hadoop 2.x profile too, if needed. The whole proposal is to minimize the risk and ensure the release stability and quality. Hyukjin Kwon wrote on Mon, Feb 4, 2019 at 12:01 PM: > Xiao, to check if I understood correctly, do you

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-02-04 Thread Hyukjin Kwon
Xiao, to check if I understood correctly, do you mean the below? 1. Use our fork with Hadoop 2.x profile for now, and use Hive 2.x with Hadoop 3.x profile. 2. Make another newer version of thrift server by Hive 2.x(?) in Spark side. 3. Target the transition to Hive 2.x completely and slowly later

Re: Feature request: split dataset based on condition

2019-02-04 Thread Ryan Blue
To partition by a condition, you would need to create a column with the result of that condition. Then you would partition by that column. The sort option would also work here. I don't think that there is much of a use case for this. You have a set of conditions on which to partition your data,
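The two options Ryan mentions (a derived condition column, or sorting on it) can be sketched like so. The condition, column names, and paths are illustrative placeholders, not from the thread.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().getOrCreate()
val df = spark.range(100).withColumnRenamed("id", "amount")  // stand-in dataset

// Option 1: materialize the condition as a column and partition the output by it.
val withFlag = df.withColumn("cond", expr("amount > 50"))  // condition is a placeholder
withFlag.write.partitionBy("cond").parquet("/tmp/by_condition")

// Option 2: sort on the condition column so matching rows are clustered
// together within the output rather than split into separate directories.
withFlag.sort("cond").write.parquet("/tmp/sorted_by_condition")
```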

Re: scheduler braindump: architecture, gotchas, etc.

2019-02-04 Thread John Zhuge
Thx Xiao! On Mon, Feb 4, 2019 at 9:04 AM Xiao Li wrote: > Thank you, Imran! > > Also, I attached the slides of "Deep Dive: Scheduler of Apache Spark". > > Cheers, > > Xiao > > > > John Zhuge wrote on Mon, Feb 4, 2019 at 8:59 AM: > >> Thanks Imran! >> >> On Mon, Feb 4, 2019 at 8:42 AM Imran Rashid >> wrote:

Re: scheduler braindump: architecture, gotchas, etc.

2019-02-04 Thread sujith chacko
Thanks Li and Imran for providing us an overview of one of the complex modules in Spark. Excellent sharing. Regards Sujith. On Mon, 4 Feb 2019 at 10:54 PM, Xiao Li wrote: > Thank you, Imran! > > Also, I attached the slides of "Deep Dive: Scheduler of Apache Spark". > > Cheers, > > Xiao > >

Re: scheduler braindump: architecture, gotchas, etc.

2019-02-04 Thread Parth Gandhi
Thank you Imran, this is quite helpful. Regards, Parth Kamlesh Gandhi On Mon, Feb 4, 2019 at 11:01 AM Rubén Berenguel wrote: > Thanks Imran, will definitely give it a look (even if just out of sheer > interest on how the sausage is done) > > R > > > On 4 February 2019 at 17:59:33, John Zhuge

Re: Feature request: split dataset based on condition

2019-02-04 Thread Andrew Melo
Hello Ryan, On Mon, Feb 4, 2019 at 10:52 AM Ryan Blue wrote: > > Andrew, can you give us more information about why partitioning the output > data doesn't work for your use case? > > It sounds like all you need to do is to create a table partitioned by A and > B, then you would automatically

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-02-04 Thread Xiao Li
To reduce the impact and risk of upgrading Hive execution JARs, we can just upgrade the built-in Hive to 2.x when using the profile of Hadoop 3.x. The support of Hadoop 3 will be still experimental in our next release. That means, the impact and risk are very minimal for most users who are still

Re: scheduler braindump: architecture, gotchas, etc.

2019-02-04 Thread John Zhuge
Thanks Imran! On Mon, Feb 4, 2019 at 8:42 AM Imran Rashid wrote: > The scheduler has been pretty error-prone and hard to work on, and I feel > like there may be a dwindling core of active experts. I'm sure it's very > discouraging to folks trying to make what seem like simple changes, and >

Re: Feature request: split dataset based on condition

2019-02-04 Thread Ryan Blue
Andrew, can you give us more information about why partitioning the output data doesn't work for your use case? It sounds like all you need to do is to create a table partitioned by A and B, then you would automatically get the divisions you want. If what you're looking for is a way to scale the
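Ryan's suggestion of a table partitioned by A and B can be sketched as follows, so each (A, B) combination lands in its own directory. Column names and the table name are illustrative placeholders, not from the thread.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr}

val spark = SparkSession.builder().getOrCreate()
val df = spark.range(100)
  .withColumn("A", expr("id % 3"))   // placeholder partition columns
  .withColumn("B", expr("id % 2"))

// Writing partitioned by A and B gives the divisions automatically;
// reading back with a filter on A and B touches only matching directories.
df.write
  .partitionBy("A", "B")
  .saveAsTable("example_table")  // placeholder table name

val subset = spark.table("example_table").where(col("A") === 1 && col("B") === 0)
```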

scheduler braindump: architecture, gotchas, etc.

2019-02-04 Thread Imran Rashid
The scheduler has been pretty error-prone and hard to work on, and I feel like there may be a dwindling core of active experts. I'm sure it's very discouraging to folks trying to make what seem like simple changes, and then find they are in a rat's nest of complex issues they weren't expecting.

Re: Feature request: split dataset based on condition

2019-02-04 Thread Andrew Melo
Hello On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini wrote: > > I've seen many applications need to split a dataset into multiple datasets based > on some conditions. As there is no method to do it in one place, developers > use the filter method multiple times. I think it can be useful to have a method
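The pattern Moein describes, splitting one Dataset with repeated filters, looks roughly like this. The condition and names are illustrative placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().getOrCreate()
val ds = spark.range(100).withColumnRenamed("id", "n")  // stand-in dataset

// Each branch is a separate filter over the same source; without caching,
// each one re-reads (or recomputes) the input.
val evens = ds.filter(col("n") % 2 === 0)
val odds  = ds.filter(col("n") % 2 =!= 0)
```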

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-02-04 Thread Sean Owen
I was unclear from this thread what the objection to these PRs is: https://github.com/apache/spark/pull/23552 https://github.com/apache/spark/pull/23553 Would we like to specifically discuss whether to merge these or not? I hear support for it, concerns about continuing to support Hive too, but