Hi, I have an RDD[(Long, MyData)] where I want to compute various functions on the lists of MyData items that share the same key (these will in general be rather short lists, around 10 items per key).
Naturally I was thinking of groupByKey(), but I was a bit intimidated by the warning in the docs: "This operation may be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using PairRDDFunctions.aggregateByKey or PairRDDFunctions.reduceByKey will provide much better performance."

Now I don't know (yet) whether all of the functions I want to compute can be expressed that way, and I was wondering *how much* more expensive we are talking about. Say I have something like rdd.zipWithIndex.map(kv => (kv._2 / 10, kv._1)).groupByKey(), i.e. the items that end up grouped together will 99% of the time live in the same partition (do they?). Does that change the performance?

Also, if my operations depend on the order of items in the original RDD (say, string concatenation), how can I make sure that order is respected?

Thanks
Tobias
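To make the question concrete, here is a minimal plain-Scala sketch (no Spark dependency, just local collections) of the two operations I mean. The names GroupingSketch, groupByBucket, and concatInOrder are made up for illustration; the idea for the order-dependent case is to carry the zipWithIndex index along in the value and sort on it within each group, which I assume would port to aggregateByKey over (index, value) pairs on a real RDD:

```scala
object GroupingSketch {
  type MyData = String  // hypothetical stand-in for the real payload type

  // Bucket consecutive items under the same key: simulates
  // rdd.zipWithIndex.map(kv => (kv._2 / 10, kv._1)).groupByKey()
  def groupByBucket(data: Seq[MyData], size: Long = 10): Map[Long, Seq[MyData]] =
    data.zipWithIndex
      .map { case (v, i) => (i.toLong / size, v) }
      .groupBy(_._1)
      .map { case (k, kvs) => (k, kvs.map(_._2)) }

  // Order-dependent aggregation (string concatenation): keep the original
  // index in the value, sort on it inside each group, then combine.
  def concatInOrder(data: Seq[MyData], size: Long = 10): Map[Long, String] =
    data.zipWithIndex
      .map { case (v, i) => (i.toLong / size, (i.toLong, v)) }
      .groupBy(_._1)
      .map { case (k, kvs) => (k, kvs.map(_._2).sortBy(_._1).map(_._2).mkString) }
}
```

For example, GroupingSketch.concatInOrder(Seq("a", "b", "c", "d")) yields Map(0 -> "abcd"), which is the per-key result I'd like to get from the RDD while respecting the original order.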