Hi Dave,

You can replace groupByKey with reduceByKey to improve performance in some
cases. reduceByKey performs a map-side combine, which can reduce network
I/O and shuffle size, whereas groupByKey performs no map-side combine at
all.
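For example, to sum values per key (a minimal sketch; the sample data and
variable names here are made up for illustration):

  // Hypothetical (key, value) pairs reused in the examples below.
  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

  // groupByKey shuffles every value, then sums on the reduce side:
  val sums1 = pairs.groupByKey().mapValues(_.sum)

  // reduceByKey folds values together on the map side first, so far
  // less data crosses the network:
  val sums2 = pairs.reduceByKey(_ + _)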

combineByKey is more general than aggregateByKey. In fact, aggregateByKey,
reduceByKey, and groupByKey are all implemented in terms of combineByKey.
aggregateByKey is similar to reduceByKey, but it lets you supply an initial
(zero) value for the aggregation, and the accumulator type can differ from
the value type.
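For instance (a sketch, reusing the pairs RDD from above):

  // Per-key max, seeded with Int.MinValue as the initial value.
  val maxPerKey = pairs.aggregateByKey(Int.MinValue)(
    (m, v) => math.max(m, v),      // seqOp: runs map side, within a partition
    (m1, m2) => math.max(m1, m2))  // combOp: merges results across partitions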

As the name suggests, aggregateByKey is suited to computing per-key
aggregations such as sum, average, etc. The rule of thumb is that the extra
computation spent on the map-side combine should reduce the amount of data
sent to other nodes and to the driver. If your function satisfies this
rule, you should probably use aggregateByKey.
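A per-key average is a typical case (a sketch under the same assumptions
as above): each partition is pre-folded down to one small (sum, count)
pair per key, and only those pairs are shuffled:

  // Accumulate (sum, count) per key, then divide at the end.
  val avgs = pairs
    .aggregateByKey((0, 0))(
      (acc, v) => (acc._1 + v, acc._2 + 1),   // seqOp: fold one value in
      (a, b) => (a._1 + b._1, a._2 + b._2))   // combOp: merge partial results
    .mapValues { case (sum, count) => sum.toDouble / count }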

combineByKey is the most general and gives you the flexibility to specify
whether you'd like to perform a map-side combine. However, it is more
complex to use: at a minimum, you need to implement three functions:
createCombiner, mergeValue, and mergeCombiners.
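Here is the same per-key average written directly against combineByKey (a
sketch; there is also an overload that additionally takes a Partitioner
and a mapSideCombine flag):

  val avgs2 = pairs.combineByKey(
      (v: Int) => (v, 1),               // createCombiner: first value seen for a key
      (acc: (Int, Int), v: Int) =>      // mergeValue: fold another value into a combiner
        (acc._1 + v, acc._2 + 1),
      (a: (Int, Int), b: (Int, Int)) => // mergeCombiners: merge combiners across partitions
        (a._1 + b._1, a._2 + b._2))
    .mapValues { case (sum, count) => sum.toDouble / count }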

Hope this helps!
Liquan

On Sun, Sep 28, 2014 at 11:59 PM, David Rowe <davidr...@gmail.com> wrote:

> Hi All,
>
> After some hair pulling, I've reached the realisation that an operation I
> am currently doing via:
>
> myRDD.groupByKey.mapValues(func)
>
> should be done more efficiently using aggregateByKey or combineByKey. Both
> of these methods would do, and they seem very similar to me in terms of
> their function.
>
> My question is, what are the differences between these two methods (other
> than the slight differences in their type signatures)? Under what
> circumstances should I use one or the other?
>
> Thanks
>
> Dave


-- 
Liquan Pei
Department of Physics
University of Massachusetts Amherst
