When you call groupByKey(), try providing the number of partitions, e.g. groupByKey(100), depending on your data/cluster size.
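Something like this (a minimal sketch; the input path and the tab-separated key extraction are placeholders, not your actual job):

    import org.apache.spark.{SparkConf, SparkContext}

    object GroupByKeyPartitions {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("groupByKey-partitions"))

        // Hypothetical input: build (key, value) pairs from HDFS lines.
        val pairs = sc.textFile("hdfs:///input/data")
          .map(line => (line.split("\t")(0), line))

        // Pass an explicit partition count so the shuffle output is spread
        // across more reducers instead of the default parallelism.
        val grouped = pairs.groupByKey(100)

        grouped.foreachPartition { iter =>
          // process this partition's groups and write to HDFS here
          iter.foreach(_ => ())
        }

        sc.stop()
      }
    }

One caveat: groupByKey still routes all values for a given key to a single partition, so if one key dominates your data, raising the partition count alone won't balance the shuffle read.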
Thanks
Best Regards

On Wed, Nov 12, 2014 at 6:45 AM, ankits <ankitso...@gmail.com> wrote:
> I'm running a job that uses groupByKey(), so it generates a lot of shuffle
> data. Then it processes this and writes files to HDFS in a forEachPartition
> block. Looking at the forEachPartition stage details in the web console, all
> but one executor is idle (SUCCESS in 50-60ms), and one is RUNNING with a
> huge shuffle read and takes a long time to finish.
>
> Can someone explain why the read is all on one node and how to parallelize
> this better?