In general you should first figure out how many task slots are in the
cluster and then repartition the RDD to maybe 2-3x that number. So if you
have 100 slots, then RDDs with a partition count of 100-300 would be normal.
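As a rough sketch of that rule of thumb (plain Python; the helper name and the 25x4 cluster shape are made up for illustration):

```python
def target_partition_range(task_slots, low_factor=1, high_factor=3):
    """Rule of thumb: aim for a partition count between roughly
    1x and 3x the total number of task slots in the cluster."""
    return (task_slots * low_factor, task_slots * high_factor)

# Total slots = executors * cores per executor,
# e.g. 25 executors with 4 cores each -> 100 slots.
slots = 25 * 4
low, high = target_partition_range(slots)
# -> (100, 300)
```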

But the size of each partition also matters. You want a task to operate on
a partition for at least 200 ms, but no longer than around 20 seconds.

Even if you have 100 slots, it can be fine to have an RDD with 10,000
partitions if you've read in a large file.
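To see why a large file naturally produces that many partitions, here's a hedged sketch: the helper name is hypothetical, and the 128 MB target mirrors a common HDFS block size (an assumption, not anything mandated by Spark):

```python
def partitions_for_file(file_size_bytes, target_partition_bytes=128 * 1024 * 1024):
    """Hypothetical helper: enough partitions so each holds roughly one
    target-sized chunk (128 MB here, mirroring a common HDFS block size)."""
    # Ceiling division so a partial trailing chunk still gets a partition.
    return max(1, -(-file_size_bytes // target_partition_bytes))

# A 1.25 TB input split into 128 MB chunks lands at 10,000 partitions,
# even on a cluster with only 100 task slots.
n = partitions_for_file(1_250 * 1024**3)
# -> 10000
```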

So don't repartition your RDD to match the number of Worker JVMs; instead,
align it to the total number of task slots across the Executors.

If you're running on a single node, shuffle operations become almost free
(because there's no network movement), so don't read too much into any
performance metrics you've collected there when extrapolating to what may
happen at scale.


On Monday, February 23, 2015, Deep Pradhan <pradhandeep1...@gmail.com>
wrote:

> Hi,
> If I repartition my data by a factor equal to the number of worker
> instances, will the performance be better or worse?
> As far as I understand, the performance should be better, but in my case
> it is becoming worse.
> I have a single node standalone cluster, is it because of this?
> Am I guaranteed to have a better performance if I do the same thing in a
> multi-node cluster?
>
> Thank You
>
