Hi everyone,

I have a small Scala test project which uses GraphX and for some reason has
extreme scheduler delay when executed on the cluster. The problem is not
related to the cluster configuration, as other GraphX applications run
without any issue.
I have attached the source code (MatrixTest.scala
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n28162/MatrixTest.scala>).
It creates a kind of GraphGenerators.gridGraph
<http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.util.GraphGenerators$>
(but with diagonal edges too) using data from a matrix inside the Map class.
Only four lines actually involve GraphX itself: creating a VertexRDD,
creating an EdgeRDD, creating a Graph, and then calling graph.edges.count.
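
For readers without the attachment, the GraphX steps look roughly like the
sketch below. It is not the attached MatrixTest.scala: the object name
GridSketch, the gridEdges helper, the 100 x 100 grid size, and the Int
attributes are all assumptions made for illustration, and it uses plain RDDs
passed to Graph(...) rather than explicit VertexRDD/EdgeRDD constructors.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object GridSketch {
  // Hypothetical helper: directed edges of a rows x cols grid, linking each
  // cell to its right, lower, lower-right and lower-left neighbours.
  def gridEdges(rows: Int, cols: Int): Seq[(Long, Long)] = {
    def id(r: Int, c: Int): Long = r.toLong * cols + c
    for {
      r <- 0 until rows
      c <- 0 until cols
      (dr, dc) <- Seq((0, 1), (1, 0), (1, 1), (1, -1))
      nr = r + dr
      nc = c + dc
      if nr < rows && nc >= 0 && nc < cols
    } yield (id(r, c), id(nr, nc))
  }

  def main(args: Array[String]): Unit = {
    // Falls back to local mode when no master is supplied (assumption).
    val conf = new SparkConf()
      .setAppName("GridSketch")
      .setIfMissing("spark.master", "local[*]")
    val sc = new SparkContext(conf)
    try {
      val rows = 100 // assumed grid size
      val cols = 100

      // Vertices: one per grid cell, with a placeholder Int attribute.
      val vertices: RDD[(VertexId, Int)] =
        sc.parallelize((0L until rows.toLong * cols).map(id => (id, 0)))

      // Edges: built on the driver here, then distributed.
      val edges: RDD[Edge[Int]] =
        sc.parallelize(gridEdges(rows, cols).map { case (s, d) => Edge(s, d, 1) })

      val graph: Graph[Int, Int] = Graph(vertices, edges)

      // The action that triggers the job.
      println(s"edge count: ${graph.edges.count()}")
    } finally {
      sc.stop()
    }
  }
}
```

In this sketch both collections are built on the driver and handed to
sc.parallelize; whether the attached program does the same is not implied.
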
As you can see on the Spark History Server
<http://cdhdns-mn0.westeurope.cloudapp.azure.com:18088/history/application_1480677653852_0050/jobs/>,
the task has a very significant scheduler delay. The logs (also attached:
MatrixTest.log
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n28162/MatrixTest.log>)
contain the following warning: "WARN scheduler.TaskSetManager: Stage 0
contains a task of very large size (2905 KB). The maximum recommended task
size is 100 KB."
This also happens with .aggregateMessages.collect and Pregel. I have tested
with Spark 1.6 and 2.0, different levels of parallelism, different numbers of
executors, etc., but the scheduler delay is still there and becomes more
extreme as the number of vertices and edges grows.

Does anyone have any idea as to what could be the source of the issue?
Thank you!
