Hi,

On Wed, Jan 14, 2015 at 12:11 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
> Now I don't know (yet) if all of the functions I want to compute can be
> expressed in this way and I was wondering about *how much* more expensive
> we are talking about.
OK, it seems that even on a local machine (with no network overhead), the
groupByKey version is about 5 times slower than any of the other functions
(reduceByKey, combineByKey etc.):

  val rdd = sc.parallelize(1 to 5000000)
  val withKeys = rdd.zipWithIndex.map(kv => (kv._2 / 10, kv._1))
  withKeys.cache()
  withKeys.count

  // around 850-1100 ms
  for (i <- 1 to 5) yield {
    val start = System.currentTimeMillis
    withKeys.reduceByKey(_ + _).count()
    System.currentTimeMillis - start
  }

  // around 800-1100 ms
  for (i <- 1 to 5) yield {
    val start = System.currentTimeMillis
    withKeys.combineByKey((x: Int) => x,
      (x: Int, y: Int) => x + y,
      (x: Int, y: Int) => x + y).count()
    System.currentTimeMillis - start
  }

  // around 1500-1900 ms
  for (i <- 1 to 5) yield {
    val start = System.currentTimeMillis
    withKeys.foldByKey(0)(_ + _).count()
    System.currentTimeMillis - start
  }

  // around 1400-1800 ms
  for (i <- 1 to 5) yield {
    val start = System.currentTimeMillis
    withKeys.aggregateByKey(0)(_ + _, _ + _).count()
    System.currentTimeMillis - start
  }

  // around 5500-6200 ms
  for (i <- 1 to 5) yield {
    val start = System.currentTimeMillis
    withKeys.groupByKey().mapValues(_.reduceLeft(_ + _)).count()
    System.currentTimeMillis - start
  }

Tobias
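For intuition on where the slowdown comes from, here is a minimal sketch using
plain Scala collections (no Spark, so this only illustrates the shape of the two
approaches, not Spark's shuffle behavior): the groupByKey-style path first
materializes every value for a key before reducing, while the reduceByKey-style
path keeps only one running total per key.

  // groupByKey-style: build all per-key value lists first, then reduce them
  val pairs = (1 to 100000).map(i => (i % 10, i))
  val grouped = pairs.groupBy(_._1).map { case (k, vs) =>
    (k, vs.map(_._2.toLong).sum)  // all values for k are in memory here
  }

  // reduceByKey-style: fold values into one accumulator per key as we go
  val reduced = pairs.foldLeft(Map.empty[Int, Long]) { case (acc, (k, v)) =>
    acc.updated(k, acc.getOrElse(k, 0L) + v)
  }
  // both produce the same sums, but only the first holds every value at once

In Spark the gap is larger still, because reduceByKey/combineByKey can combine
values map-side before the shuffle, whereas groupByKey ships every value over
the wire.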