Thanks Liquan, that was really helpful.

On Mon, Sep 29, 2014 at 5:54 PM, Liquan Pei <liquan...@gmail.com> wrote:
> Hi Dave,
>
> You can replace groupByKey with reduceByKey to improve performance in some
> cases. reduceByKey performs a map-side combine, which can reduce network IO
> and shuffle size, whereas groupByKey performs no map-side combine.
>
> combineByKey is more general than aggregateByKey. In fact, aggregateByKey,
> reduceByKey, and groupByKey are all implemented in terms of combineByKey.
> aggregateByKey is similar to reduceByKey, but you can provide an initial
> value when performing the aggregation.
>
> As the name suggests, aggregateByKey is suited to computing aggregations
> per key, such as sum, avg, etc. The rule here is that the extra computation
> spent on the map-side combine should reduce the amount of data sent to
> other nodes and to the driver. If your function satisfies this rule, you
> should probably use aggregateByKey.
>
> combineByKey is more general, and gives you the flexibility to specify
> whether you'd like to perform a map-side combine. However, it is more
> complex to use: at a minimum, you need to implement three functions:
> createCombiner, mergeValue, and mergeCombiners.
>
> Hope this helps!
> Liquan
>
> On Sun, Sep 28, 2014 at 11:59 PM, David Rowe <davidr...@gmail.com> wrote:
>
>> Hi All,
>>
>> After some hair pulling, I've reached the realisation that an operation I
>> am currently doing via:
>>
>> myRDD.groupByKey.mapValues(func)
>>
>> should be done more efficiently using aggregateByKey or combineByKey.
>> Both of these methods would do, and they seem very similar to me in terms
>> of their function.
>>
>> My question is, what are the differences between these two methods (other
>> than the slight differences in their type signatures)? Under what
>> circumstances should I use one or the other?
>>
>> Thanks
>>
>> Dave
>
> --
> Liquan Pei
> Department of Physics
> University of Massachusetts Amherst
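
To make the contrast in the thread concrete, here is a minimal, self-contained Scala sketch that computes a per-key mean three ways. The sample data, object name, and the mean computation are illustrative assumptions, not from the thread; only the groupByKey/aggregateByKey/combineByKey usage reflects the discussion above.

import org.apache.spark.{SparkConf, SparkContext}

object ByKeyExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("byKeyExample").setMaster("local[*]"))

    // Hypothetical sample data for illustration.
    val myRDD = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))

    // groupByKey + mapValues: every value is shuffled before the function
    // runs, so there is no map-side combine.
    val viaGroup = myRDD.groupByKey().mapValues(vs => vs.sum / vs.size)

    // aggregateByKey: an initial (zero) value plus two functions; the
    // map-side combine shrinks the shuffle to one (sum, count) pair per
    // key per partition.
    val viaAggregate = myRDD
      .aggregateByKey((0.0, 0))(
        (acc, v) => (acc._1 + v, acc._2 + 1),   // fold a value into the accumulator
        (a, b) => (a._1 + b._1, a._2 + b._2))   // merge two accumulators
      .mapValues { case (sum, count) => sum / count }

    // combineByKey: the general form the others are built on; the three
    // functions are createCombiner, mergeValue, and mergeCombiners.
    val viaCombine = myRDD
      .combineByKey(
        (v: Double) => (v, 1),
        (acc: (Double, Int), v: Double) => (acc._1 + v, acc._2 + 1),
        (a: (Double, Int), b: (Double, Int)) => (a._1 + b._1, a._2 + b._2))
      .mapValues { case (sum, count) => sum / count }

    viaAggregate.collect().foreach(println)  // (a,2.0), (b,2.0)
    sc.stop()
  }
}

All three produce the same result; the aggregateByKey and combineByKey versions avoid shipping every raw value across the network, which is the rule of thumb Liquan describes above.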