Hi,

On Wed, Jan 14, 2015 at 12:11 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
> Now I don't know (yet) if all of the functions I want to compute can be
> expressed in this way and I was wondering about *how much* more expensive
> we are talking about.
OK, it seems that even on a local machine (with no network overhead), the
groupByKey version is about 5 times slower than any of the other functions
(reduceByKey, combineByKey etc.):

  val rdd = sc.parallelize(1 to 5000000)
  val withKeys = rdd.zipWithIndex.map(kv => (kv._2 / 10, kv._1))
  withKeys.cache()
  withKeys.count

  // around 850-1100 ms
  for (i <- 1 to 5) yield {
    val start = System.currentTimeMillis
    withKeys.reduceByKey(_ + _).count()
    System.currentTimeMillis - start
  }

  // around 800-1100 ms
  for (i <- 1 to 5) yield {
    val start = System.currentTimeMillis
    withKeys.combineByKey((x: Int) => x,
      (x: Int, y: Int) => x + y,
      (x: Int, y: Int) => x + y).count()
    System.currentTimeMillis - start
  }

  // around 1500-1900 ms
  for (i <- 1 to 5) yield {
    val start = System.currentTimeMillis
    withKeys.foldByKey(0)(_ + _).count()
    System.currentTimeMillis - start
  }

  // around 1400-1800 ms
  for (i <- 1 to 5) yield {
    val start = System.currentTimeMillis
    withKeys.aggregateByKey(0)(_ + _, _ + _).count()
    System.currentTimeMillis - start
  }

  // around 5500-6200 ms
  for (i <- 1 to 5) yield {
    val start = System.currentTimeMillis
    withKeys.groupByKey().mapValues(_.reduceLeft(_ + _)).count()
    System.currentTimeMillis - start
  }

Tobias
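For intuition on where the slowdown comes from, here is a minimal sketch using
plain Scala collections (no Spark, so this only illustrates the shape of the two
approaches, not Spark's shuffle behavior): the groupByKey-style path first
materializes every value for a key before reducing, while the reduceByKey-style
path keeps only one running total per key.

  // groupByKey-style: build all per-key value lists first, then reduce them
  val pairs = (1 to 100000).map(i => (i % 10, i))
  val grouped = pairs.groupBy(_._1).map { case (k, vs) =>
    (k, vs.map(_._2.toLong).sum)  // all values for k are in memory here
  }

  // reduceByKey-style: fold values into one accumulator per key as we go
  val reduced = pairs.foldLeft(Map.empty[Int, Long]) { case (acc, (k, v)) =>
    acc.updated(k, acc.getOrElse(k, 0L) + v)
  }
  // both produce the same sums, but only the first holds every value at once

In Spark the gap is larger still, because reduceByKey/combineByKey can combine
values map-side before the shuffle, whereas groupByKey ships every value over
the wire.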