ReduceByKey performance optimisation

2014-09-13 Thread Julien Carme
Hello, I am facing performance issues with reduceByKey. In know that this topic has already been covered but I did not really find answers to my question. I am using reduceByKey to remove entries with identical keys, using, as reduce function, (a,b) => a. It seems to be a relatively straightforwa

Re: ReduceByKey performance optimisation

2014-09-13 Thread Sean Owen
If you are just looking for distinct keys, .keys.distinct() should be much better. On Sat, Sep 13, 2014 at 10:46 AM, Julien Carme wrote: > Hello, > > I am facing performance issues with reduceByKey. In know that this topic has > already been covered but I did not really find answers to my questio

Re: ReduceByKey performance optimisation

2014-09-13 Thread Julien Carme
I need to remove objects with duplicate key, but I need the whole object. Object which have the same key are not necessarily equal, though (but I can dump any on the ones that have identical key). 2014-09-13 12:50 GMT+02:00 Sean Owen : > If you are just looking for distinct keys, .keys.distinct()

Re: ReduceByKey performance optimisation

2014-09-13 Thread Gary Malouf
You need something like: val x: RDD[MyAwesomeObject] x.map(obj => obj.fieldtobekey -> obj).reduceByKey { case (l, _) => l } Does that make sense? On Sat, Sep 13, 2014 at 7:28 AM, Julien Carme wrote: > I need to remove objects with duplicate key, but I need the whole object. > Object which ha

Re: ReduceByKey performance optimisation

2014-09-13 Thread Sean Owen
This is more concise: x.groupBy(obj.fieldtobekey).values.map(_.head) ... but I doubt it's faster. If all objects with the same fieldtobekey are within the same partition, then yes I imagine your biggest speedup comes from exploiting that. How about ... x.mapPartitions(_.map(obj => (obj.fieldtob

Re: ReduceByKey performance optimisation

2014-09-13 Thread Julien Carme
OK, mapPartition seems to be the way to go. Thanks for the help! Le 13 sept. 2014 16:41, "Sean Owen" a écrit : > This is more concise: > > x.groupBy(obj.fieldtobekey).values.map(_.head) > > ... but I doubt it's faster. > > If all objects with the same fieldtobekey are within the same > partition