Hello,

I am running into performance issues with reduceByKey. I know this topic
has already been covered, but I did not really find answers to my question.

I am using reduceByKey to remove entries with identical keys, with
(a, b) => a as the reduce function. This seems like a fairly
straightforward use of reduceByKey, but performance on moderately large
RDDs (some tens of millions of lines) is very poor, far from what you can
reach with single-machine computing packages such as R.
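For concreteness, my code looks roughly like the following sketch (the
file name and key extraction are simplified placeholders, not my actual
pipeline):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("dedup-sketch"))
    // hypothetical input: tab-separated lines, key in the first field
    val pairs = sc.textFile("input.tsv")
      .map(line => (line.split('\t')(0), line))
    // keep the first value seen for each key; duplicates are dropped
    val deduped = pairs.reduceByKey((a, b) => a)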

I have read in other threads on this topic that reduceByKey always
shuffles the whole dataset. Is that true? That would mean a custom
partitioning could not help, right? In my case, I could fairly easily
guarantee that identical keys always end up in the same partition, so one
option would be to use mapPartitions and reimplement the reduction
locally, but I would like to know if there are simpler / more elegant
alternatives.
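For illustration, here is a minimal sketch of the mapPartitions
workaround I have in mind (assuming identical keys are already co-located
on one partition, so deduplication can happen entirely locally):

    // per-partition dedup: no shuffle, keeps the first occurrence of each key
    val dedupedLocally = pairs.mapPartitions({ iter =>
      val seen = scala.collection.mutable.HashSet.empty[String]
      iter.filter { case (key, _) => seen.add(key) } // add returns false on duplicates
    }, preservesPartitioning = true)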

Thanks for your help,
