I'm having problems running GraphX on the Twitter graph on a cluster with 4 nodes, each node having over 100GB of RAM and 32 virtual cores.
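Concretely, my job looks roughly like the sketch below (the path, app name, and use of static PageRank are placeholders for my actual setup, not exact code):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader

object TwitterGraphJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("twitter-graphx")
      // one executor per node, 32 cores each --> 128 cores total
      .set("spark.executor.memory", "32g")
    val sc = new SparkContext(conf)

    // 128 partitions to match the 128 cores in the cluster;
    // the path below is a placeholder for my local (non-HDFS) edge file
    val graph = GraphLoader.edgeListFile(
      sc, "file:///data/twitter.edge", minEdgePartitions = 128)

    // built-in algorithm, 15 iterations
    val ranks = graph.staticPageRank(15).vertices
    println(ranks.count())
    sc.stop()
  }
}
```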
I do have a pre-installed Spark version (built against Hadoop 2.3, because it didn't compile on my system), but I'm loading my graph file from local disk, without HDFS. The Twitter graph is around 25GB, and I'm loading it with GraphLoader.edgeListFile(.., .., minEdgePartitions = 128). I assume that 128 partitions is optimal because that's the total number of cores I have.

I've started one executor on each node, each with >100GB of RAM and spark.executor.memory=32g, to enjoy full parallelism: 4 workers, each with one executor, each executor using 32 cores --> 128 cores, 128 partitions.

Is there any configuration that can replicate the results given in this paper: http://arxiv.org/pdf/1402.2394v1.pdf ? The paper states a running time of under 500s, but I can't get any results after more than 1 hour. I'm running the built-in algorithm with 15 iterations. Am I using too many partitions? Is there a bottleneck I can't see?

I turned on GC logging; it looks like this:

  3.785: [GC [PSYoungGen: 62914560K->7720K(73400320K)] 62914560K->7792K(241172480K), 0.0151130 secs] [Times: user=0.27 sys=0.02, real=0.02 secs]
  9.209: [GC [PSYoungGen: 62922280K->1943393K(73400320K)] 62922352K->1943473K(241172480K), 0.6108790 secs] [Times: user=5.95 sys=8.04, real=0.62 secs]
  13.316: [GC [PSYoungGen: 64857953K->4283906K(73400320K)] 64858033K->4283994K(241172480K), 1.1567380 secs] [Times: user=10.84 sys=15.73, real=1.16 secs]
  17.931: [GC [PSYoungGen: 67198466K->6808418K(73400320K)] 67198554K->6808514K(241172480K), 1.9807690 secs] [Times: user=16.21 sys=29.29, real=1.99 secs]
  26.112: [GC [PSYoungGen: 69722978K->7211955K(73400320K)] 69723074K->7212059K(241172480K), 2.1325980 secs] [Times: user=15.66 sys=33.33, real=2.14 secs]
  64.833: [GC [PSYoungGen: 70126515K->5378991K(74105216K)] 70126619K->5379103K(241877376K), 0.3315500 secs] [Times: user=7.53 sys=0.00, real=0.33 secs]

In the stderr log, I wonder about these lines:

  INFO HadoopRDD: Input split: file:/.../twitter.edge:25065160704+33554432
  INFO HadoopRDD: Input split: file:/.../twitter.edge:19696451584+33554432
  ...

Why does it even split the file as a HadoopRDD?

I also wonder about this error:

  ERROR BlockFetcherIterator$BasicBlockFetcherIterator: Could not get block(s) from ConnectionManagerId($SPARK_MASTER,59331)
  java.io.IOException: sendMessageReliably failed without being ACK'd

Any help would be highly appreciated.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-twitter-tp19222.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.