All the *ByKey aggregations perform an efficient shuffle and preserve partitioning on the output. If all you need is to call reduceByKey, then don't bother with groupBy. You should use groupBy only if you really need all the datapoints for a key in order to run a very custom operation.
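To make the difference concrete, here is a minimal pure-Python model (not Spark code; partition contents and function names are illustrative) of why reduceByKey shuffles less data than grouping: it combines values within each partition before the shuffle (map-side combine), so only one record per distinct key per partition crosses the network, while grouping ships every record as-is.

```python
from collections import defaultdict

# Two input partitions of (key, value) records.
partitions = [
    [("a", 1), ("a", 2), ("b", 3)],
    [("a", 4), ("b", 5), ("b", 6)],
]

def shuffled_records_groupbykey(parts):
    # A groupByKey-style shuffle ships every (key, value) record as-is.
    return sum(len(p) for p in parts)

def shuffled_records_reducebykey(parts, combine):
    # A reduceByKey-style shuffle first combines per partition
    # (map-side combine), then ships one record per distinct key
    # per partition.
    total = 0
    for p in parts:
        local = defaultdict(list)
        for k, v in p:
            local[k].append(v)
        combined = {k: combine(vs) for k, vs in local.items()}
        total += len(combined)
    return total

print(shuffled_records_groupbykey(partitions))        # 6 records shuffled
print(shuffled_records_reducebykey(partitions, sum))  # 4 records shuffled
```

The gap grows with the number of values per key, which is why the docs steer you toward reduceByKey/aggregateByKey for aggregations.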
From the docs:

"Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance."

What you should worry about in more complex pipelines is whether you're actually preserving the partitioner between stages. For example, when you use a custom partitioner between a partitionBy and an updateStateByKey, or when you use .map or .flatMap instead of .mapValues and .flatMapValues (map and flatMap may change the keys, so Spark discards the partitioner; mapValues and flatMapValues cannot change them, so it is preserved).

By the way, learn to use the Spark UI to understand the DAG / execution plan, and try to navigate the source code. I found the comments and the various preservesPartitioning options very educational.

-adrian

On 9/23/15, 8:43 AM, "swetha" <swethakasire...@gmail.com> wrote:

>Hi,
>
>How to make Group By more efficient? Is it recommended to use a custom
>partitioner and then do a Group By? And can we use a custom partitioner and
>then use a reduceByKey for optimization?
>
>Thanks,
>Swetha
>
>--
>View this message in context:
>http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-Group-By-reduceByKey-more-efficient-tp24780.html
>Sent from the Apache Spark User List mailing list archive at Nabble.com.
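The map-versus-mapValues point above can be illustrated with a small pure-Python model of hash partitioning (not Spark code; integer keys and function names are illustrative stand-ins for HashPartitioner behavior). A mapValues-style transform cannot touch keys, so every record stays in the partition its key maps to; a map-style transform may rewrite keys, invalidating the placement, which is why Spark drops the partitioner after .map and .flatMap.

```python
NUM_PARTITIONS = 4

def partition_of(key):
    # Stand-in for a hash partitioner; Python ints hash to themselves.
    return key % NUM_PARTITIONS

def is_well_partitioned(parts):
    # True if every record sits in the partition its key maps to.
    return all(partition_of(k) == i
               for i, p in enumerate(parts)
               for k, _ in p)

# Hash-partition some records, as partitionBy would.
records = [(0, "a"), (1, "b"), (2, "c"), (5, "d")]
parts = [[] for _ in range(NUM_PARTITIONS)]
for k, v in records:
    parts[partition_of(k)].append((k, v))
assert is_well_partitioned(parts)

# mapValues-style transform: keys untouched, partitioning still valid,
# so a following *ByKey stage needs no reshuffle.
mapped_values = [[(k, v * 2) for k, v in p] for p in parts]
assert is_well_partitioned(mapped_values)

# map-style transform that rewrites keys: records are now in the wrong
# partitions, so the partitioner can no longer be trusted and the next
# *ByKey stage must reshuffle.
mapped_keys = [[(k + 1, v) for k, v in p] for p in parts]
assert not is_well_partitioned(mapped_keys)
```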