Hello all,

I upgraded from Spark 1.3.1 to 1.4.0, but I'm experiencing a massive drop in performance in the application I'm running. I've (somewhat) reproduced this behaviour in the attached file.
My current Spark setup may not be optimal for this reproduction, but in this test Spark 1.4.0 takes 12 minutes to complete, while 1.3.1 finishes in 8 minutes. I've found that when you combine subtraction and sampling of JavaRDDs (see the attached reproduction test), tasks do not seem to be properly distributed among the workers once you perform additional operations on the data. I derive this from the admin view, where I can clearly see that in 1.4.0 the tasks are distributed differently: one task holds almost all the data, while the other tasks are tiny. <http://apache-spark-user-list.1001560.n3.nabble.com/file/n23858/1.jpg>

Do any of you know of changes in 1.4.0 that could explain this behaviour? When I submit the same application to Spark 1.3.1, the tasks are distributed uniformly, and the application is therefore much quicker.

Thanks,
Gisle

ReproduceHang.java <http://apache-spark-user-list.1001560.n3.nabble.com/file/n23858/ReproduceHang.java>

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Tasks-unevenly-distributed-in-Spark-1-4-0-tp23858.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
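For anyone who wants the shape of the problem without running the attachment: below is a toy model in plain Java (no Spark; the class name and record counts are made up for illustration) of the skew I'm observing. After the "subtract" step, nearly all surviving records sit in a single partition, so whichever task processes that partition ends up with almost all the data while the others finish instantly:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class SkewDemo {
    // Model 4 partitions of 1000 records each, then remove ("subtract")
    // a set of records; return how many records survive per partition.
    static List<Integer> survivingSizes() {
        List<List<Integer>> partitions = new ArrayList<>();
        for (int p = 0; p < 4; p++) {
            int base = p * 1000;
            partitions.add(IntStream.range(base, base + 1000)
                                    .boxed().collect(Collectors.toList()));
        }
        // The removed set happens to cover almost everything outside the
        // first partition -- this is the skewed case I see in 1.4.0.
        Set<Integer> toRemove = IntStream.range(990, 4000)
                                         .boxed().collect(Collectors.toSet());
        return partitions.stream()
                .map(part -> (int) part.stream()
                                       .filter(x -> !toRemove.contains(x))
                                       .count())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // One partition keeps nearly all surviving data; the rest are empty.
        System.out.println(SkewDemo.survivingSizes());
    }
}
```

Running this prints [990, 0, 0, 0]: one partition with essentially everything, three with nothing, which matches the task sizes in the screenshot above. In 1.3.1 the equivalent real job looks more like [250, 245, 248, 247].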