I need to remove objects with duplicate keys, but I need the whole object. Objects which have the same key are not necessarily equal, though (but I can drop any one of the objects that share a key).
2014-09-13 12:50 GMT+02:00 Sean Owen <so...@cloudera.com>:

> If you are just looking for distinct keys, .keys.distinct() should be
> much better.
>
> On Sat, Sep 13, 2014 at 10:46 AM, Julien Carme <julien.ca...@gmail.com>
> wrote:
> > Hello,
> >
> > I am facing performance issues with reduceByKey. I know that this topic
> > has already been covered, but I did not really find answers to my question.
> >
> > I am using reduceByKey to remove entries with identical keys, using
> > (a, b) => a as the reduce function. It seems to be a relatively
> > straightforward use of reduceByKey, but performance on moderately big
> > RDDs (some tens of millions of lines) is very low, far from what you can
> > reach with single-server computing packages like R, for example.
> >
> > I have read on other threads on the topic that reduceByKey always
> > shuffles the whole data. Is that true? If so, it means that custom
> > partitioning could not help, right? In my case, I could relatively
> > easily guarantee that two identical keys would always be on the same
> > partition, so an option could be to use mapPartitions and reimplement
> > the reduce locally, but I would like to know if there are simpler /
> > more elegant alternatives.
> >
> > Thanks for your help,
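For what it's worth, here is a minimal sketch of the "local reduce" idea from the quoted message: a function that keeps only the first record seen for each key within one partition's iterator. `LocalDedup` and `dedupByKey` are hypothetical names, and plain Scala collections stand in here for what would run inside `rdd.mapPartitions(dedupByKey _, preservesPartitioning = true)`, assuming identical keys are already co-located on the same partition:

```scala
import scala.collection.mutable

object LocalDedup {
  // Keeps the first (k, v) pair seen for each key in this partition's
  // iterator; later pairs with an already-seen key are dropped. This
  // mirrors reduceByKey((a, b) => a) but without any shuffle, which is
  // only correct if equal keys never span two partitions.
  def dedupByKey[K, V](records: Iterator[(K, V)]): Iterator[(K, V)] = {
    val seen = mutable.HashSet.empty[K]
    // HashSet.add returns false when the key was already present,
    // so filter keeps exactly the first occurrence per key.
    records.filter { case (k, _) => seen.add(k) }
  }

  def main(args: Array[String]): Unit = {
    val partition = Iterator(("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5))
    println(dedupByKey(partition).toList) // first value per key survives
  }
}
```

Since the iterator is consumed lazily, this never materializes the whole partition, only the set of keys seen so far; passing `preservesPartitioning = true` tells Spark the key distribution is unchanged, so downstream keyed operations can avoid another shuffle.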