Hi Daniel,

Coalesce, by default, will not cause a shuffle. With its second parameter set to true it will cause a full shuffle; that is actually what repartition does (it calls coalesce with shuffle = true).
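The idea behind shuffle-free coalesce can be sketched in plain Scala (no Spark dependency). This is only an illustration of the merging behavior discussed in this thread, not Spark's actual implementation; the names Partition and coalesceLocally are hypothetical:

```scala
// Illustrative model of a parent partition and the host it lives on.
case class Partition(id: Int, host: String)

// Sketch of shuffle-free coalescing: partitions are first grouped by host,
// and whole host groups are then assigned to the target slots, so merged
// partitions stay co-located where possible.
def coalesceLocally(parents: Seq[Partition], numPartitions: Int): Seq[Seq[Partition]] = {
  val byHost: Seq[Seq[Partition]] = parents.groupBy(_.host).values.toSeq
  val slots = Array.fill(numPartitions)(Vector.empty[Partition])
  byHost.zipWithIndex.foreach { case (group, i) =>
    slots(i % numPartitions) = slots(i % numPartitions) ++ group
  }
  slots.toSeq.filter(_.nonEmpty)
}
```

Note that when numPartitions drops below the number of hosts, a single merged partition must span hosts, which is exactly the loss of data locality described above.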
It will attempt to keep co-located partitions together (as you describe) on the same executor. What may happen is that you lose data locality if you reduce the number of partitions to fewer than the number of executors. You also obviously reduce parallelism, so keep that in mind when deciding when to call coalesce.

Thanks,
Silvio

From: Daniel Haviv
Date: Monday, July 20, 2015 at 4:59 PM
To: Doug Balog
Cc: user
Subject: Re: Local Repartition

Thanks Doug,
coalesce might invoke a shuffle as well.
I don't think what I'm suggesting exists as a feature yet, but it definitely should.

Daniel

On Mon, Jul 20, 2015 at 4:15 PM, Doug Balog <d...@balog.net> wrote:

Hi Daniel,

Take a look at .coalesce(). I've seen good results by coalescing to num executors * 10, but I'm still trying to figure out the optimal number of partitions per executor.

To get the number of executors:
sc.getConf.getInt("spark.executor.instances", -1)

Cheers,
Doug

> On Jul 20, 2015, at 5:04 AM, Daniel Haviv <daniel.ha...@veracity-group.com> wrote:
>
> Hi,
> My data is constructed from a lot of small files, which results in a lot of partitions per RDD.
> Is there some way to locally repartition the RDD without shuffling, so that all of the partitions that reside on a specific node become X partitions on the same node?
>
> Thank you.
> Daniel
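Doug's rule of thumb can be sketched as a small plain-Scala helper. This is a hedged sketch: targetPartitions is a hypothetical name, and the multiplier of 10 is his empirical starting point from this thread, not a Spark default. Note that getInt above returns the fallback of -1 when spark.executor.instances is unset (e.g. with dynamic allocation), so a guard is worthwhile:

```scala
// Hypothetical helper for Doug's heuristic: coalesce to a small multiple
// of the executor count. The default multiplier of 10 is his rule of
// thumb from this thread, not a Spark constant.
def targetPartitions(numExecutors: Int, partitionsPerExecutor: Int = 10): Int =
  // Guard against the -1 fallback from getInt when the conf key is unset.
  math.max(1, numExecutors * partitionsPerExecutor)
```

In a real job this would be fed by the conf lookup Doug shows, e.g. rdd.coalesce(targetPartitions(sc.getConf.getInt("spark.executor.instances", -1))).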