Hi all,

I have been testing GraphX on the soc-LiveJournal1 network from the SNAP repository. I am currently running on c3.8xlarge EC2 instances, which have 32 cores and 60GB RAM per node. So far I have run SSSP, PageRank, and WCC on 1-, 4-, and 8-node clusters.
The issues I am having, which are present for all three algorithms, are that (1) GraphX does not get any faster between 4 and 8 nodes and (2) GraphX seems to be heavily unbalanced, with some machines doing the majority of the computation.

PageRank (20 iterations) is the worst. For 1-node, 4-node, and 8-node clusters I get the following wall-clock runtimes: 192s, 154s, and 154s. This result is potentially understandable, though the times are significantly worse than the results in the paper https://amplab.cs.berkeley.edu/wp-content/uploads/2014/02/graphx.pdf, where this algorithm ran in ~75s on a worse cluster.

My main concern is that the computation seems to be heavily unbalanced. I measured the CPU time of all the processes associated with GraphX during its execution, and for a 4-node cluster it yielded the following CPU times (one per machine): 724s, 697s, 2216s, 694s. Is this normal? Should I expect a more even distribution of work across machines?

I am using the stock PageRank code found here: https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala. I use the configurations "spark.executor.memory=40g" and "spark.cores.max=128" for the 4-node case. I also set the number of edge partitions to 64.

Could you please let me know whether these results are reasonable, or whether I am doing something wrong? I really appreciate the help.

Thanks,
Steve
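For reference, the scaling and imbalance figures quoted above reduce to two simple ratios: speedup relative to the 1-node run, and the busiest machine's CPU time relative to the mean. The sketch below works them out in Python from the numbers in this message only. The function names (`speedup`, `imbalance`, `busiest_share`) are made up for illustration and are not part of Spark or GraphX. With these inputs, the 4-node speedup comes out at roughly 1.25x, and the busiest machine uses about twice the mean CPU time.

```python
# Numbers reported in this message; nothing here is newly measured.
runtimes_s = {1: 192.0, 4: 154.0, 8: 154.0}   # nodes -> PageRank wall-clock seconds
cpu_times_s = [724.0, 697.0, 2216.0, 694.0]   # per-machine CPU seconds, 4-node run


def speedup(runtimes, baseline_nodes=1):
    """Speedup of each cluster size relative to the baseline run."""
    base = runtimes[baseline_nodes]
    return {nodes: base / t for nodes, t in runtimes.items()}


def imbalance(cpu_times):
    """Busiest machine's CPU time divided by the mean (1.0 means perfectly even)."""
    mean = sum(cpu_times) / len(cpu_times)
    return max(cpu_times) / mean


def busiest_share(cpu_times):
    """Fraction of the total CPU time spent on the busiest machine."""
    return max(cpu_times) / sum(cpu_times)


if __name__ == "__main__":
    for nodes, s in sorted(speedup(runtimes_s).items()):
        print(f"{nodes} node(s): speedup {s:.2f}x")
    print(f"imbalance (max / mean): {imbalance(cpu_times_s):.2f}")
    print(f"busiest machine share:  {busiest_share(cpu_times_s):.2%}")
```

An imbalance ratio near 1.0 would indicate evenly spread work. The ratio only summarizes the CPU-time measurements; it says nothing about why one machine is busier.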