I have a large list of edges as a 5000-partition RDD. Now, I'm doing a simple but shuffle-heavy operation:
    val g = Graph.fromEdges(edges, ...).partitionBy(...)
    val subs = Graph(g.collectEdges(...), g.edges).collectNeighbors()
    subs.saveAsObjectFile("hdfs://...")

The job gets divided into 9 stages. My cluster has 3 workers on the same local network. Spark 1.5.0 is generally much faster, and the first several stages do run at full load, but starting from one of the stages a single machine suddenly grabs 99% of the tasks, while the others take only as many tasks as they have cores and then sit idle until that one machine finishes everything. Interestingly, on Spark 1.3.1 every stage gets its tasks distributed evenly among the cluster machines. I suspect this could be a bug in 1.5.0.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Uneven-distribution-of-tasks-among-workers-in-Spark-GraphX-1-5-0-tp24763.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
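
P.S. In case it helps anyone reproduce this, here is the same pipeline with the elided arguments filled in. The particular PartitionStrategy, EdgeDirection, and default vertex value below are illustrative placeholders, not necessarily the exact ones I use:

    import org.apache.spark.graphx._

    // Build the graph from the edge RDD and repartition its edges.
    // EdgePartition2D is one of GraphX's built-in strategies (illustrative choice).
    val g = Graph.fromEdges(edges, defaultValue = 0)
      .partitionBy(PartitionStrategy.EdgePartition2D)

    // collectEdges and collectNeighbors both take an EdgeDirection and
    // both trigger shuffles; direction here is again an illustrative choice.
    val subs = Graph(g.collectEdges(EdgeDirection.Either), g.edges)
      .collectNeighbors(EdgeDirection.Either)

    subs.saveAsObjectFile("hdfs://...")

The uneven distribution shows up regardless of which of these shuffle-heavy steps the affected stage belongs to.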