Hi Daniel,

Coalesce, by default, will not cause a shuffle. Its second parameter, when set 
to true, forces a full shuffle. That is actually what repartition does: it 
calls coalesce with shuffle = true.

Without the shuffle, it will attempt to keep co-located partitions together (as 
you describe) on the same executor. What may happen is that you lose data 
locality if you reduce the partitions to fewer than the number of executors. 
You also obviously reduce parallelism, so you need to be aware of that when 
deciding where to call coalesce.
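
A quick sketch in Scala to show the difference (the input path is just a 
placeholder):

    // RDD built from many small files, hence many small partitions
    val rdd = sc.textFile("/path/to/many/small/files")

    // No shuffle: merges existing partitions, keeping them on the
    // same executor where possible
    val merged = rdd.coalesce(100)

    // Full shuffle: redistributes every record into 100 new partitions
    val reshuffled = rdd.coalesce(100, shuffle = true)

    // Equivalent to the call above
    val repartitioned = rdd.repartition(100)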

Thanks,
Silvio

From: Daniel Haviv
Date: Monday, July 20, 2015 at 4:59 PM
To: Doug Balog
Cc: user
Subject: Re: Local Repartition

Thanks Doug,
coalesce might invoke a shuffle as well.
I don't think what I'm suggesting exists as a feature yet, but it definitely 
should.

Daniel

On Mon, Jul 20, 2015 at 4:15 PM, Doug Balog 
<d...@balog.net> wrote:
Hi Daniel,
Take a look at .coalesce()
I've seen good results by coalescing to num executors * 10, but I'm still 
trying to figure out the optimal number of partitions per executor.
To get the number of executors: sc.getConf.getInt("spark.executor.instances", -1)
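
Putting that together, something like this (rdd stands in for whatever RDD 
you're reducing, and the 10x factor is just my heuristic, not a hard rule):

    // The key may be unset (e.g. with dynamic allocation), hence the -1 default
    val numExecutors = sc.getConf.getInt("spark.executor.instances", -1)
    val coalesced =
      if (numExecutors > 0) rdd.coalesce(numExecutors * 10) else rdd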


Cheers,

Doug

> On Jul 20, 2015, at 5:04 AM, Daniel Haviv 
> <daniel.ha...@veracity-group.com> 
> wrote:
>
> Hi,
> My data is constructed from a lot of small files, which results in a lot of 
> partitions per RDD.
> Is there some way to locally repartition the RDD without shuffling, so that 
> all of the partitions that reside on a specific node become X partitions on 
> the same node?
>
> Thank you.
> Daniel

