All the *ByKey aggregations combine values map-side before the shuffle and set a partitioner on their output. If all you need is to call reduceByKey, don't bother with groupBy. Only use groupBy (or groupByKey) when you really need every data point for a key in one place, for a very custom operation.
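
Roughly, for a per-key sum it looks like this (a sketch with made-up data, plain Scala RDD API; sc is an existing SparkContext):

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

  // reduceByKey combines values map-side before the shuffle, so far less data moves
  val sums = pairs.reduceByKey(_ + _)                        // Array((a,4), (b,6)) after collect()

  // groupByKey ships every value across the network; only worth it when you
  // really need the full set of values per key for a custom operation
  val grouped = pairs.groupByKey().mapValues(_.toSeq.sorted)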


From the docs:

Note: If you are grouping in order to perform an aggregation (such as a sum or 
average) over each key, using reduceByKey or aggregateByKey will yield much 
better performance. 
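
A per-key average is the classic case for aggregateByKey. Here's a sketch (made-up values): the accumulator is a (sum, count) pair, so only two numbers per key cross the shuffle.

  val scores = sc.parallelize(Seq(("a", 10.0), ("a", 20.0), ("b", 5.0)))

  val avgByKey = scores
    .aggregateByKey((0.0, 0L))(
      (acc, v) => (acc._1 + v, acc._2 + 1),         // fold one value into the per-partition accumulator
      (x, y)   => (x._1 + y._1, x._2 + y._2)        // merge accumulators across partitions
    )
    .mapValues { case (sum, count) => sum / count } // Array((a,15.0), (b,5.0)) after collect()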


What you should worry about in more complex pipelines is whether you are actually 
preserving the partitioner between stages: for example, when you use a custom 
partitioner between a partitionBy and an updateStateByKey, or when you use .map 
or .flatMap instead of .mapValues and .flatMapValues. The *Values variants cannot 
change the keys, so Spark keeps the existing partitioner; plain .map and .flatMap 
drop it, and the next stage will shuffle again.
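
The difference is easy to see by checking .partitioner (a sketch, made-up keys and values):

  import org.apache.spark.HashPartitioner

  val byUser = sc.parallelize(Seq(("u1", 1), ("u2", 2), ("u1", 3)))
    .partitionBy(new HashPartitioner(8))

  // mapValues cannot touch the keys, so the partitioner survives
  byUser.mapValues(_ * 10).partitioner                    // Some(HashPartitioner@...)

  // map might change the keys, so Spark drops the partitioner and the next
  // *ByKey operation has to shuffle all over again
  byUser.map { case (k, v) => (k, v * 10) }.partitioner   // None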

By the way, learn to use the Spark UI to understand the DAG / execution plan, 
and try to navigate the source code - I found the comments and the various 
preservesPartitioning flags very educational.
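
One of those flags in action: mapPartitions drops the partitioner by default, but if your function leaves the keys alone you can say so (a sketch, reusing the byUser RDD from the snippet above):

  val bumped = byUser.mapPartitions(
    iter => iter.map { case (k, v) => (k, v + 1) },
    preservesPartitioning = true
  )
  bumped.partitioner                                      // still Some(HashPartitioner@...)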

-adrian





On 9/23/15, 8:43 AM, "swetha" <swethakasire...@gmail.com> wrote:

>Hi,
>
>How to make Group By more efficient? Is it recommended to use a custom
>partitioner and then do a Group By? And can we use a custom partitioner and
>then use a reduceByKey for optimization?
>
>
>Thanks,
>Swetha
>
>
>
