Hello, I am facing performance issues with reduceByKey. I know this topic has already been covered, but I did not really find an answer to my question.
I am using reduceByKey to remove entries with identical keys, with (a, b) => a as the reduce function. This seems like a fairly straightforward use of reduceByKey, but performance on moderately large RDDs (a few tens of millions of lines) is very poor, far from what you can reach with single-machine tools such as R.

I have read in other threads that reduceByKey always shuffles the entire dataset. Is that true? If so, does it mean that a custom partitioning could not help? In my case, I could fairly easily guarantee that two identical keys always end up on the same partition, so one option would be to use mapPartitions and reimplement the reduce locally (see the sketch at the end of this message), but I would like to know if there are simpler / more elegant alternatives.

Thanks for your help,
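For reference, here is a minimal sketch of the two approaches I mention above (toy data; names are illustrative, not my actual job):

```scala
import org.apache.spark.sql.SparkSession

object DedupSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dedup-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Toy stand-in for the real RDD of (key, record) pairs.
    val rdd = sc.parallelize(Seq(("k1", "a"), ("k2", "b"), ("k1", "c")))

    // Current approach: keep one arbitrary value per key via reduceByKey.
    val deduped = rdd.reduceByKey((a, b) => a)

    // Alternative I'm considering, assuming identical keys are already
    // co-located on the same partition: deduplicate each partition
    // locally with mapPartitions, avoiding a shuffle entirely.
    val dedupedLocally = rdd.mapPartitions(iter => {
      val seen = scala.collection.mutable.HashSet.empty[String]
      iter.filter { case (k, _) => seen.add(k) } // keep first occurrence of each key
    }, preservesPartitioning = true)

    dedupedLocally.collect().foreach(println)
    spark.stop()
  }
}
```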