When you call groupByKey(), try providing the number of partitions, e.g.
groupByKey(100), depending on your data/cluster size.
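
For example, a minimal sketch (the pair RDD, the 100-partition count, and the
write placeholder are my assumptions, not details from your job):

import org.apache.spark.{SparkConf, SparkContext}

object GroupByKeyPartitionsExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("groupByKey-partitions"))

    // Hypothetical pair RDD standing in for whatever feeds the shuffle.
    val pairs = sc.parallelize(1 to 1000000).map(i => (i % 1000, i))

    // Passing an explicit partition count spreads the shuffle read across
    // that many tasks instead of relying on the default parallelism.
    val grouped = pairs.groupByKey(100)

    // Placeholder for the per-partition processing / HDFS write described
    // in the original question.
    grouped.foreachPartition { iter =>
      iter.foreach { case (key, values) => println(s"$key -> ${values.size}") }
    }

    sc.stop()
  }
}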

Thanks
Best Regards

On Wed, Nov 12, 2014 at 6:45 AM, ankits <ankitso...@gmail.com> wrote:

> I'm running a job that uses groupByKey(), so it generates a lot of shuffle
> data. It then processes this and writes files to HDFS in a foreachPartition
> block. Looking at the foreachPartition stage details in the web console,
> all but one executor is idle (SUCCESS in 50-60ms), and one is RUNNING with
> a huge shuffle read and takes a long time to finish.
>
> Can someone explain why the read is all on one node and how to parallelize
> this better?
>
>
>
