I need to remove objects with duplicate keys, but I need the whole object. Objects which have the same key are not necessarily equal, though (but I can drop any one of the objects that share a key).
2014-09-13 12:50 GMT+02:00 Sean Owen <so...@cloudera.com>:

> If you are just looking for distinct keys, .keys.distinct() should be
> much better.
>
> On Sat, Sep 13, 2014 at 10:46 AM, Julien Carme <julien.ca...@gmail.com>
> wrote:
> > Hello,
> >
> > I am facing performance issues with reduceByKey. I know that this topic
> > has already been covered, but I did not really find answers to my question.
> >
> > I am using reduceByKey to remove entries with identical keys, using
> > (a, b) => a as the reduce function. It seems to be a relatively
> > straightforward use of reduceByKey, but performance on moderately big
> > RDDs (some tens of millions of lines) is very low, far from what you can
> > reach with single-server computing packages like R, for example.
> >
> > I have read on other threads on the topic that reduceByKey always
> > shuffles the whole data. Is that true? If so, it means that custom
> > partitioning could not help, right? In my case, I could relatively
> > easily guarantee that two identical keys would always be on the same
> > partition, so an option could be to use mapPartitions and reimplement
> > the reduce locally, but I would like to know if there are simpler /
> > more elegant alternatives.
> >
> > Thanks for your help,
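For what it's worth, here is a minimal sketch of the "local reduce" idea from the quoted message: a function that keeps only the first record seen for each key within one partition's iterator. `LocalDedup` and `dedupByKey` are hypothetical names, and plain Scala collections stand in here for what would run inside `rdd.mapPartitions(dedupByKey _, preservesPartitioning = true)`, assuming identical keys are already co-located on the same partition:

```scala
import scala.collection.mutable

object LocalDedup {
  // Keeps the first (k, v) pair seen for each key in this partition's
  // iterator; later pairs with an already-seen key are dropped. This
  // mirrors reduceByKey((a, b) => a) but without any shuffle, which is
  // only correct if equal keys never span two partitions.
  def dedupByKey[K, V](records: Iterator[(K, V)]): Iterator[(K, V)] = {
    val seen = mutable.HashSet.empty[K]
    // HashSet.add returns false when the key was already present,
    // so filter keeps exactly the first occurrence per key.
    records.filter { case (k, _) => seen.add(k) }
  }

  def main(args: Array[String]): Unit = {
    val partition = Iterator(("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5))
    println(dedupByKey(partition).toList) // first value per key survives
  }
}
```

Since the iterator is consumed lazily, this never materializes the whole partition, only the set of keys seen so far; passing `preservesPartitioning = true` tells Spark the key distribution is unchanged, so downstream keyed operations can avoid another shuffle.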