I was attempting to run the GraphX triangle count method on a ~2B-edge graph (the Friendster dataset from SNAP). I have access to a 60-node cluster with 90 GB of memory and 30 vcores per node, and I am running into memory issues.


I am using 1000 partitions with the RandomVertexCut partition strategy. Here's my submit script:

spark-submit --executor-cores 5 --num-executors 100 --executor-memory 32g \
  --driver-memory 6g --conf spark.yarn.executor.memoryOverhead=8000 \
  --conf "spark.executor.extraJavaOptions=-XX:-UseGCOverheadLimit" \
  trianglecount_2.10-1.0.jar
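
For context, the driver itself is just the stock GraphX triangle count recipe, roughly the sketch below (the HDFS path and object name are placeholders, not the exact contents of the jar; I load with canonicalOrientation = true since TriangleCount expects edges in canonical direction):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

object TriangleCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TriangleCount"))

    // Load the Friendster edge list into 1000 edge partitions; TriangleCount
    // expects edges in canonical direction (srcId < dstId).
    val graph = GraphLoader
      .edgeListFile(sc, "hdfs:///data/friendster/edges.txt",
        canonicalOrientation = true, numEdgePartitions = 1000)
      .partitionBy(PartitionStrategy.RandomVertexCut)

    // Per-vertex triangle counts; each triangle is counted once at each of
    // its three vertices, so the global total is the sum divided by 3.
    val counts = graph.triangleCount().vertices
    val total = counts.map(_._2.toLong).reduce(_ + _) / 3
    println(s"Total triangles: $total")

    sc.stop()
  }
}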

There was one particular stage that shuffled 3.7 TB:

Active Stages (1)

Stage Id: 11
Description: mapPartitions at VertexRDDImpl.scala:218
Submitted: 2015/11/12 01:33:06
Duration: 7.3 min
Tasks (Succeeded/Total): 316/344
Input: 22.6 GB
Shuffle Read: 57.0 GB
Shuffle Write: 3.7 TB
The subsequent stage fails while reading the shuffle, at around the half-terabyte mark, with java.lang.OutOfMemoryError: Java heap space:


Active Stages (1)

Stage Id: 12
Description: mapPartitions at GraphImpl.scala:235
Submitted: 2015/11/12 01:41:25
Duration: 5.2 min
Tasks (Succeeded/Total): 0/1000
Input: 26.3 GB
Shuffle Read: 533.8 GB




Compared to the cluster used in the benchmarking paper (http://arxiv.org/pdf/1402.2394v1.pdf) on the Twitter dataset (2.5B edges), the resources I am providing for the job seem reasonable. Can anyone point out any optimizations or other tweaks I need to make to get this to work?

Thanks!
Vinod
