All the *ByKey aggregations perform an efficient shuffle and preserve
partitioning on the output. If all you need is to call reduceByKey, then don't
bother with groupByKey. Use groupByKey only if you really need all the data
points for a key for a very custom operation.
From the docs:
Note: If you are grouping in order to perform an aggregation (such as a sum or
average) over each key, using reduceByKey or aggregateByKey will yield much
better performance.
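
For example, a per-key sum can be written either way; reduceByKey combines
values map-side before the shuffle. A minimal sketch, assuming an existing
SparkContext `sc` and made-up sample data:

    // Hypothetical pair RDD for illustration.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // groupByKey ships every value across the network, then sums per key.
    val slowSums = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey combines values within each partition first, so much
    // less data crosses the network in the shuffle.
    val fastSums = pairs.reduceByKey(_ + _)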
What you should worry about in more complex pipelines is whether you're
actually preserving the partitioner between stages: for example, if you use a
custom partitioner between a partitionBy and an updateStateByKey, or if you
use .map or .flatMap instead of .mapValues and .flatMapValues. Because .map
and .flatMap may change the keys, Spark drops the partitioner on their output.
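
Here is a sketch of that difference, reusing `pairs` from above (the
partition count is arbitrary):

    import org.apache.spark.HashPartitioner

    // Explicitly hash-partition the RDD into 8 partitions.
    val byKey = pairs.partitionBy(new HashPartitioner(8))

    // mapValues cannot change the keys, so the HashPartitioner is kept and
    // a later reduceByKey on `kept` needs no shuffle.
    val kept = byKey.mapValues(_ * 2)

    // map could change the keys, so Spark discards the partitioner, even
    // though this particular function leaves the keys untouched.
    val lost = byKey.map { case (k, v) => (k, v * 2) }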
By the way, learn to use the Spark UI to understand the DAG / execution plan,
and try to navigate the source code - I found the comments and the various
preservesPartitioning options very educational.
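
You can also check this directly from the shell; every RDD exposes its
partitioner as an Option (names refer to the sketches above):

    println(byKey.partitioner)  // Some(org.apache.spark.HashPartitioner@...)
    println(kept.partitioner)   // still Some(...): mapValues preserved it
    println(lost.partitioner)   // None: map dropped it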
-adrian
On 9/23/15, 8:43 AM, "swetha" <swethakasire...@gmail.com> wrote:
>Hi,
>
>How can we make groupBy more efficient? Is it recommended to use a custom
>partitioner and then do a groupBy? And can we use a custom partitioner and
>then use a reduceByKey for optimization?
>
>
>Thanks,
>Swetha