Hi,

I have an RDD[(Long, MyData)] where I want to compute various functions on
the lists of MyData items that share the same key (these will in general be
rather short lists, around 10 items per key).

Naturally I was thinking of groupByKey() but was a bit intimidated by the
warning: "This operation may be very expensive. If you are grouping in
order to perform an aggregation (such as a sum or average) over each key,
using PairRDDFunctions.aggregateByKey or PairRDDFunctions.reduceByKey will
provide much better performance."
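
For a simple aggregation I can see how that would look; here is a rough,
untested sketch of a per-key average with aggregateByKey (assuming MyData
has a numeric field `value`, which is just a placeholder name):

  // untested sketch: per-key average without materializing the full group
  val avgPerKey = rdd
    .aggregateByKey((0.0, 0L))(
      (acc, d) => (acc._1 + d.value, acc._2 + 1),  // fold one MyData into the per-partition accumulator
      (a, b) => (a._1 + b._1, a._2 + b._2)         // merge accumulators across partitions
    )
    .mapValues { case (sum, count) => sum / count }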

Now I don't know (yet) whether all of the functions I want to compute can be
expressed in this way, and I was wondering just *how much* more expensive
we are talking about. Say I have something like

  rdd.zipWithIndex.map(kv => (kv._2/10, kv._1)).groupByKey(),

i.e. the items that end up grouped together will, in 99% of cases, already
live in the same partition (do they?). Does this change the performance?
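
In case it matters, I was planning to check that co-location myself with
something like the following (untested; it just counts, for each key, how
many partitions its items are spread over before the shuffle):

  val keyed = rdd.zipWithIndex.map(kv => (kv._2 / 10, kv._1))
  val partitionsPerKey = keyed
    .mapPartitionsWithIndex((pid, it) => it.map(kv => (kv._1, pid)))
    .distinct()
    .countByKey()   // key -> number of distinct partitions holding that key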

Also, if my operations depend on the order of the items in the original RDD
(say, string concatenation), how can I make sure that the original order is
respected?
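
What I have in mind so far is to carry the original index along into the
value and sort each group by it before applying the order-sensitive
function; a rough, untested sketch (with a placeholder string field `text`
on MyData):

  val concatenated = rdd.zipWithIndex
    .map { case ((origKey, data), idx) => (idx / 10, (idx, data)) }
    .groupByKey()
    .mapValues { group =>
      // restore the original order within each group before concatenating
      group.toSeq.sortBy(_._1).map { case (_, data) => data.text }.mkString
    }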

Thanks
Tobias
