Specifically, reduceByKey expects a commutative and associative reduce function, and it automatically applies that function locally on each partition before the shuffle, so it acts like a "combiner" in MapReduce terms - http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
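For illustration, here is a minimal spark-shell-style sketch (the RDD and its values are made up for the example, and an existing SparkContext `sc` is assumed) of what that map-side combining buys you compared to groupByKey:

    // Assumes an existing SparkContext `sc` (e.g. inside spark-shell).
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("a", 1)))

    // reduceByKey: (_ + _) runs map-side first, so each partition ships at most
    // one record per key across the network (the "combiner" behaviour).
    val summed = pairs.reduceByKey(_ + _)

    // groupByKey: every (key, value) record crosses the network unchanged, and
    // the values are only materialised as a collection on the reducer side.
    val grouped = pairs.groupByKey().mapValues(_.sum)

    summed.collect()   // Array((a,3), (b,1)), up to ordering
    grouped.collect()  // same result, but with a larger shuffle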
On Thu, Aug 7, 2014 at 8:15 AM, Cheng Lian <lian.cs....@gmail.com> wrote:

> The point is that in many cases the operation passed to reduceByKey
> aggregates data into a much smaller size, say + and * for integers. String
> concatenation doesn't actually "shrink" data, so in your case
> rdd.reduceByKey(_ ++ _) and rdd.groupByKey suffer from a similar performance
> issue. In general, don't do these unless you have to.
>
> And in Konstantin's case, I guess he knows what he's doing. At least we
> can't know whether we can help to optimize without further information
> about the "business logic" being provided.
>
> On Aug 7, 2014, at 10:22 PM, chutium <teng....@gmail.com> wrote:
>
> > A long time ago, at Spark Summit 2013, Patrick Wendell said in his talk
> > about performance
> > (http://spark-summit.org/talk/wendell-understanding-the-performance-of-spark-applications/)
> > that reduceByKey will be more efficient than groupByKey... he mentioned
> > that groupByKey "copies all data over the network".
> >
> > Is that still true? Which one should we choose? Because actually we can
> > replace every groupByKey with a reduceByKey.
> >
> > For example, if we want to use groupByKey on an RDD[String, String] to get
> > an RDD[String, Seq[String]],
> >
> > we can also do it with reduceByKey:
> > first, map the RDD[String, String] to an RDD[String, Seq[String]],
> > then call reduceByKey(_ ++ _) on this RDD[String, Seq[String]].
> >
> > --
> > View this message in context:
> > http://apache-spark-user-list.1001560.n3.nabble.com/reduceByKey-to-get-all-associated-values-tp11645p11652.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
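To make the quoted point concrete, here is a hypothetical sketch (the input RDD is invented for illustration, and an existing SparkContext `sc` is assumed) of the rewrite chutium describes, showing why it is no cheaper than groupByKey:

    // Assumes an existing SparkContext `sc`.
    val rdd = sc.parallelize(Seq(("k1", "a"), ("k1", "b"), ("k2", "c")))

    // The pattern from the quoted mail: wrap each value in a one-element Seq,
    // then concatenate. Every value still crosses the network, so this shuffles
    // as much data as groupByKey and adds many tiny Seq allocations on top.
    val viaReduce = rdd.mapValues(Seq(_)).reduceByKey(_ ++ _)

    // The direct equivalent.
    val viaGroup = rdd.groupByKey()

    // If per-key collections are genuinely needed, aggregateByKey expresses the
    // same thing without the one-element wrappers, but it cannot shrink the
    // shuffle either; only a truly reducing function (e.g. _ + _) can.
    val viaAggregate = rdd.aggregateByKey(Vector.empty[String])(_ :+ _, _ ++ _)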