Hi Dave,

You can replace groupByKey with reduceByKey to improve performance in some cases. reduceByKey performs a map-side combine, which can reduce network I/O and shuffle size, whereas groupByKey does not perform a map-side combine.
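To make the map-side combine concrete, here is a minimal sketch in plain Python (no Spark involved; the two partitions and the sum function are made up purely for illustration) of why reduceByKey ships less data across the network than groupByKey:

```python
from collections import defaultdict
from operator import add

# Two hypothetical map-side partitions of (key, value) pairs.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("b", 5), ("b", 6)],
]

def group_by_key(parts):
    """groupByKey semantics: every (key, value) pair is shuffled as-is."""
    shuffled = [pair for part in parts for pair in part]  # all 6 records cross the network
    out = defaultdict(list)
    for k, v in shuffled:
        out[k].append(v)
    return dict(out), len(shuffled)

def reduce_by_key(parts, func):
    """reduceByKey semantics: values are combined within each partition first
    (the map-side combine), so fewer records are shuffled."""
    combined_parts = []
    for part in parts:
        local = {}
        for k, v in part:
            local[k] = func(local[k], v) if k in local else v
        combined_parts.append(list(local.items()))
    shuffled = [pair for part in combined_parts for pair in part]  # only 4 records
    out = {}
    for k, v in shuffled:
        out[k] = func(out[k], v) if k in out else v
    return out, len(shuffled)

grouped, g_count = group_by_key(partitions)
summed, r_count = reduce_by_key(partitions, add)
print(g_count, r_count)  # 6 vs 4: the map-side combine shrinks the shuffle
print(summed)            # {'a': 8, 'b': 13}
```

Both approaches arrive at the same per-key result, but reduceByKey shuffles one partially combined record per key per partition instead of every raw value.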
combineByKey is more general than aggregateByKey. In fact, aggregateByKey, reduceByKey, and groupByKey are all implemented in terms of combineByKey. aggregateByKey is similar to reduceByKey, but you can provide an initial (zero) value when performing the aggregation. As the name suggests, aggregateByKey is suitable for computing per-key aggregations such as sum, avg, etc. The rule of thumb here is that the extra computation spent on the map-side combine should reduce the amount of data sent to other nodes and the driver. If your function satisfies this rule, you should probably use aggregateByKey.

combineByKey is more general still, and gives you the flexibility to specify whether you'd like to perform the map-side combine at all. However, it is more complex to use. At a minimum, you need to implement three functions: createCombiner, mergeValue, and mergeCombiners.

Hope this helps!

Liquan

On Sun, Sep 28, 2014 at 11:59 PM, David Rowe <davidr...@gmail.com> wrote:
> Hi All,
>
> After some hair pulling, I've reached the realisation that an operation I
> am currently doing via:
>
> myRDD.groupByKey.mapValues(func)
>
> should be done more efficiently using aggregateByKey or combineByKey. Both
> of these methods would do, and they seem very similar to me in terms of
> their function.
>
> My question is, what are the differences between these two methods (other
> than the slight differences in their type signatures)? Under what
> circumstances should I use one or the other?
>
> Thanks
>
> Dave

--
Liquan Pei
Department of Physics
University of Massachusetts Amherst