Specifically, reduceByKey expects a commutative and associative reduce
operation, and will automatically apply it locally on each partition before
the shuffle, which means it acts like a "combiner" in MapReduce terms -
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
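
A minimal sketch of the contrast (assuming an existing SparkContext named
sc; the sample data is made up):

    import org.apache.spark.SparkContext._  // pair-RDD functions

    val pairs = sc.parallelize(Seq(("a", 1), ("a", 1), ("b", 1)))

    // reduceByKey computes partial sums on each partition first
    // (map-side combine), so only one (key, partialSum) pair per key
    // and partition crosses the network.
    val counts = pairs.reduceByKey(_ + _)   // ("a", 2), ("b", 1)

    // groupByKey shuffles every individual value across the network.
    val groups = pairs.groupByKey()         // ("a", Iterable(1, 1)), ("b", Iterable(1))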




On Thu, Aug 7, 2014 at 8:15 AM, Cheng Lian <lian.cs....@gmail.com> wrote:

> The point is that in many cases the operation passed to reduceByKey
> aggregates data into a much smaller size, say + and * for integers. String
> concatenation doesn’t actually “shrink” data, so in your case
> rdd.reduceByKey(_ ++ _) and rdd.groupByKey suffer from similar performance
> issues. In general, don’t do these unless you have to.
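>
> A rough illustration of the difference (hypothetical data, assuming a
> SparkContext named sc):
>
>     val nums = sc.parallelize(Seq(("k", 1), ("k", 2), ("k", 3)))
>     nums.reduceByKey(_ + _)  // each key collapses to a single Int - data shrinks
>
>     val strs = sc.parallelize(Seq(("k", "aa"), ("k", "bb")))
>     strs.reduceByKey(_ + _)  // concatenation - output is as large as the input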
>
> And in Konstantin’s case, I guess he knows what he’s doing. At least we
> can’t tell whether we can help optimize without further information about
> the “business logic”.
>
> On Aug 7, 2014, at 10:22 PM, chutium <teng....@gmail.com> wrote:
>
> > a long time ago, at Spark Summit 2013, Patrick Wendell said in his talk
> > about performance
> > (
> > http://spark-summit.org/talk/wendell-understanding-the-performance-of-spark-applications/
> > )
> >
> > that reduceByKey will be more efficient than groupByKey... he mentioned
> > that groupByKey "copies all data over network".
> >
> > is that still true? which one should we choose? because actually we can
> > replace every groupByKey with reduceByKey
> >
> > for example, if we want to use groupByKey on an RDD[(String, String)] to
> > get an RDD[(String, Seq[String])],
> >
> > we can also do it with reduceByKey:
> > first, map the RDD[(String, String)] to an RDD[(String, Seq[String])],
> > then call reduceByKey(_ ++ _) on that RDD, as sketched below:
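> >
> > a sketch of that pattern (assuming an RDD[(String, String)] named rdd):
> >
> >     // wrap each value in a one-element Seq, then concatenate per key
> >     val grouped = rdd.mapValues(v => Seq(v)).reduceByKey(_ ++ _)
> >     // grouped: RDD[(String, Seq[String])]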
> >
> >
> >
>
