Hi,
My data is built from many small files, which results in a large number of
partitions per RDD.
Is there some way to locally repartition the RDD without shuffling, so that
all of the partitions that reside on a specific node become X
partitions on the same node?
Thank you.
Daniel
Hi Daniel,
Take a look at .coalesce()
I’ve seen good results by coalescing to num executors * 10, but I’m still
trying to figure out the
optimal number of partitions per executor.
To get the number of executors: sc.getConf.getInt("spark.executor.instances", -1)
Cheers,
Doug
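Doug's heuristic can be sketched as follows (a minimal illustration in Python rather than the thread's Scala; the constant 10 is only his rule of thumb, not a Spark default, and the fallback for a missing setting is my assumption, since getInt returns the -1 default when spark.executor.instances is unset, e.g. under dynamic allocation):

```python
# Sketch of Doug's heuristic: coalesce to (number of executors) * 10.
PARTITIONS_PER_EXECUTOR = 10  # Doug's rule of thumb, not a Spark constant

def target_partitions(num_executors: int) -> int:
    """Pick a coalesce target from the executor count.

    getInt("spark.executor.instances", -1) yields -1 when the setting is
    absent, so fall back to a single partition rather than a negative count.
    """
    if num_executors <= 0:
        return 1
    return num_executors * PARTITIONS_PER_EXECUTOR

print(target_partitions(4))   # -> 40
```

In a real job this number would then be passed to rdd.coalesce(...).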
Thanks Doug,
coalesce might invoke a shuffle as well (it does when called with shuffle = true).
I don't think what I'm suggesting exists as a feature yet, but it definitely should.
Daniel
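What Daniel is asking for — merging each node's partitions into at most X node-local partitions, with no data leaving the node — can be sketched outside Spark. A minimal illustration (plain Python; local_groups, the host names, and the partition ids are all hypothetical, and Spark's actual coalesce logic is considerably more involved):

```python
from collections import defaultdict
from math import ceil

def local_groups(partition_hosts, x):
    """Group partition ids so each host ends up with at most x merged
    partitions, and every group stays on its original host.

    partition_hosts: dict mapping partition id -> host name.
    Returns: dict mapping host -> list of groups (lists of partition ids).
    """
    by_host = defaultdict(list)
    for pid, host in sorted(partition_hosts.items()):
        by_host[host].append(pid)

    merged = {}
    for host, ids in by_host.items():
        size = max(1, ceil(len(ids) / x))  # ceiling division: at most x groups
        merged[host] = [ids[i:i + size] for i in range(0, len(ids), size)]
    return merged

# 6 small partitions spread over 2 nodes, merged down to at most 2 per node:
hosts = {0: "nodeA", 1: "nodeA", 2: "nodeA", 3: "nodeB", 4: "nodeB", 5: "nodeA"}
print(local_groups(hosts, 2))
# -> {'nodeA': [[0, 1], [2, 5]], 'nodeB': [[3], [4]]}
```

Because each group only ever contains partitions from one host, no cross-node movement is needed — which is the property Daniel wants guaranteed.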
On Mon, Jul 20, 2015 at 4:15 PM, Doug Balog d...@balog.net wrote:
Hi Daniel,
Take a look at .coalesce()
I've seen good results by coalescing to num executors * 10, but I'm still
trying to figure out the optimal number of partitions per executor.
[...] to be aware of that as you decide when to call coalesce.
Thanks,
Silvio
From: Daniel Haviv
Date: Monday, July 20, 2015 at 4:59 PM
To: Doug Balog
Cc: user
Subject: Re: Local Repartition
Thanks Doug,
coalesce might invoke a shuffle as well.
I don't think what I'm suggesting is a feature but it definitely should be.
Daniel
On Mon, Jul 20, 2015 at 4:15 PM, Doug Balog
d...@balog.net wrote:
Hi Daniel,
Take a look at .coalesce()