Hello, I am facing performance issues with reduceByKey. I know this topic has already been covered, but I did not really find an answer to my question.
I am using reduceByKey to remove entries with identical keys, with (a, b) => a as the reduce function. This seems like a fairly straightforward use of reduceByKey, but performance on moderately large RDDs (a few tens of millions of lines) is very poor, far from what you can reach with single-machine tools such as R.

I have read in other threads that reduceByKey always shuffles the entire dataset. Is that true? If so, does it mean that a custom partitioning could not help? In my case, I could fairly easily guarantee that two identical keys always end up on the same partition, so one option would be to use mapPartitions and reimplement the reduce locally (see the sketch at the end of this message), but I would like to know if there are simpler / more elegant alternatives.

Thanks for your help,
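For reference, here is a minimal sketch of the two approaches I mention above (toy data; names are illustrative, not my actual job):

```scala
import org.apache.spark.sql.SparkSession

object DedupSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dedup-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Toy stand-in for the real RDD of (key, record) pairs.
    val rdd = sc.parallelize(Seq(("k1", "a"), ("k2", "b"), ("k1", "c")))

    // Current approach: keep one arbitrary value per key via reduceByKey.
    val deduped = rdd.reduceByKey((a, b) => a)

    // Alternative I'm considering, assuming identical keys are already
    // co-located on the same partition: deduplicate each partition
    // locally with mapPartitions, avoiding a shuffle entirely.
    val dedupedLocally = rdd.mapPartitions(iter => {
      val seen = scala.collection.mutable.HashSet.empty[String]
      iter.filter { case (k, _) => seen.add(k) } // keep first occurrence of each key
    }, preservesPartitioning = true)

    dedupedLocally.collect().foreach(println)
    spark.stop()
  }
}
```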