Local Repartition

2015-07-20 Thread Daniel Haviv
Hi, My data is constructed from a lot of small files which results in a lot of partitions per RDD. Is there some way to locally repartition the RDD without shuffling so that all of the partitions that reside on a specific node will become X partitions on the same node ? Thank you. Daniel

Re: Local Repartition

2015-07-20 Thread Doug Balog
Hi Daniel, Take a look at .coalesce() I’ve seen good results by coalescing to num executors * 10, but I’m still trying to figure out the optimal number of partitions per executor. To get the number of executors, sc.getConf.getInt(“spark.executor.instances”,-1) Cheers, Doug On Jul 20, 2015,

Re: Local Repartition

2015-07-20 Thread Daniel Haviv
Thanks Doug, coalesce might invoke a shuffle as well. I don't think what I'm suggesting is a feature but it definitely should be. Daniel On Mon, Jul 20, 2015 at 4:15 PM, Doug Balog d...@balog.net wrote: Hi Daniel, Take a look at .coalesce() I’ve seen good results by coalescing to num

Re: Local Repartition

2015-07-20 Thread Daniel Haviv
to be aware of that as you decide when to call coalesce. Thanks, Silvio From: Daniel Haviv Date: Monday, July 20, 2015 at 4:59 PM To: Doug Balog Cc: user Subject: Re: Local Repartition Thanks Doug, coalesce might invoke a shuffle as well. I don't think what I'm suggesting

Re: Local Repartition

2015-07-20 Thread Silvio Fiorito
To: Doug Balog Cc: user Subject: Re: Local Repartition Thanks Doug, coalesce might invoke a shuffle as well. I don't think what I'm suggesting is a feature but it definitely should be. Daniel On Mon, Jul 20, 2015 at 4:15 PM, Doug Balog d...@balog.netmailto:d...@balog.net wrote: Hi Daniel, Take a look