I'm having problems running GraphX on the Twitter graph on a cluster with 4 nodes, each node having over 100GB of RAM and 32 virtual cores.
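Concretely, my job looks roughly like the sketch below (the path, app name, and use of static PageRank are placeholders for my actual setup, not exact code):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader

object TwitterGraphJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("twitter-graphx")
      // one executor per node, 32 cores each --> 128 cores total
      .set("spark.executor.memory", "32g")
    val sc = new SparkContext(conf)

    // 128 partitions to match the 128 cores in the cluster;
    // the path below is a placeholder for my local (non-HDFS) edge file
    val graph = GraphLoader.edgeListFile(
      sc, "file:///data/twitter.edge", minEdgePartitions = 128)

    // built-in algorithm, 15 iterations
    val ranks = graph.staticPageRank(15).vertices
    println(ranks.count())
    sc.stop()
  }
}
```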
I do have a pre-installed Spark version (built against Hadoop 2.3, because it didn't compile on my system), but I'm loading my graph file from local disk, without HDFS. The Twitter graph is around 25GB, and I'm loading it with GraphLoader.edgeListFile(.., .., minEdgePartitions = 128). I assume that 128 partitions is optimal because that's the total number of cores I have.

I've started one executor on each node, each with >100GB of RAM and spark.executor.memory=32g, to enjoy full parallelism: 4 workers, each with one executor, each executor using 32 cores --> 128 cores, 128 partitions.

Is there any configuration that can replicate the results given in this paper: http://arxiv.org/pdf/1402.2394v1.pdf ? The paper states a running time of under 500s, but I can't get any results after more than 1 hour. I'm running the built-in algorithm with 15 iterations. Am I using too many partitions? Is there a bottleneck I can't see?

I turned on GC logging; it looks like this:

  3.785: [GC [PSYoungGen: 62914560K->7720K(73400320K)] 62914560K->7792K(241172480K), 0.0151130 secs] [Times: user=0.27 sys=0.02, real=0.02 secs]
  9.209: [GC [PSYoungGen: 62922280K->1943393K(73400320K)] 62922352K->1943473K(241172480K), 0.6108790 secs] [Times: user=5.95 sys=8.04, real=0.62 secs]
  13.316: [GC [PSYoungGen: 64857953K->4283906K(73400320K)] 64858033K->4283994K(241172480K), 1.1567380 secs] [Times: user=10.84 sys=15.73, real=1.16 secs]
  17.931: [GC [PSYoungGen: 67198466K->6808418K(73400320K)] 67198554K->6808514K(241172480K), 1.9807690 secs] [Times: user=16.21 sys=29.29, real=1.99 secs]
  26.112: [GC [PSYoungGen: 69722978K->7211955K(73400320K)] 69723074K->7212059K(241172480K), 2.1325980 secs] [Times: user=15.66 sys=33.33, real=2.14 secs]
  64.833: [GC [PSYoungGen: 70126515K->5378991K(74105216K)] 70126619K->5379103K(241877376K), 0.3315500 secs] [Times: user=7.53 sys=0.00, real=0.33 secs]

In the stderr log, I wonder about these lines:

  INFO HadoopRDD: Input split: file:/.../twitter.edge:25065160704+33554432
  INFO HadoopRDD: Input split: file:/.../twitter.edge:19696451584+33554432
  ...

Why does it even split the file as a HadoopRDD?

I also wonder about this error:

  ERROR BlockFetcherIterator$BasicBlockFetcherIterator: Could not get block(s) from ConnectionManagerId($SPARK_MASTER,59331)
  java.io.IOException: sendMessageReliably failed without being ACK'd

Any help would be highly appreciated.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-twitter-tp19222.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.