Uneven distribution of tasks among workers in Spark/GraphX 1.5.0

2015-09-22 Thread dmytro
I have a large list of edges as a 5000-partition RDD. Now I'm doing a simple
but shuffle-heavy operation:

val g = Graph.fromEdges(edges, ...).partitionBy(...)                  // repartition the edges
val subs = Graph(g.collectEdges(...), g.edges).collectNeighbors(...)  // shuffle-heavy step
subs.saveAsObjectFile("hdfs://...")
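
For reference, a hypothetical fully-specified version of the above (the "..."
arguments of the real job are omitted on purpose; EdgePartition2D and
EdgeDirection.Either below are illustrative stand-ins, not necessarily what
was actually used):

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Assumes edges: RDD[Edge[Int]]; the strategy and direction are
// hypothetical stand-ins for the "..." arguments above.
val g = Graph.fromEdges(edges, defaultValue = 0)
  .partitionBy(PartitionStrategy.EdgePartition2D)
val subs = Graph(g.collectEdges(EdgeDirection.Either), g.edges)
  .collectNeighbors(EdgeDirection.Either)
subs.saveAsObjectFile("hdfs://...")   // path truncated as in the original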

The job gets divided into 9 stages. My cluster has 3 workers on the same
local network. Even though Spark 1.5.0 is much faster overall and the first
several stages run at full load, starting from one of the stages a single
machine suddenly grabs 99% of the tasks, while the others take only as many
tasks as they have cores and then wait until that one machine finishes
everything. Interestingly, on Spark 1.3.1 all stages get their tasks
distributed evenly among the cluster machines. I suspect this could be a bug
in 1.5.0.
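
A quick way to rule out skew in the data itself before blaming the scheduler
(a sketch, assuming a live SparkContext and the edges RDD from the snippet
above):

// Count records per partition; a few oversized partitions would
// explain one machine doing nearly all the work.
val sizes = edges
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
  .collect()
  .sortBy(p => -p._2)

sizes.take(10).foreach { case (idx, n) => println(s"partition $idx: $n edges") }

If the counts come out even, the imbalance is presumably on the
scheduling/shuffle side rather than in the data, which is what would point to
a regression in 1.5.0.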





Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-09-22 Thread dmytro
Could it be that your data is skewed? Do you have variable-length column
types?
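
A quick way to eyeball key skew in 1.3 (a sketch; df and the "key" column
are placeholders for your table and whatever column you join or group on):

import org.apache.spark.sql.functions.desc

// If a few keys own most of the rows, the shuffle partitions holding
// them balloon, and the executors processing them spend their time in GC.
df.groupBy("key")
  .count()
  .orderBy(desc("count"))
  .show(20)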


