Hey all,

I’m currently trying to run connected components using GraphX on a large graph
(~1.8B vertices and ~3B edges, most of them self edges where the only edge
that exists for vertex v is v->v) on EMR using 50 m3.xlarge nodes. As the
program runs I’m seeing each iteration take longer and longer to complete.
This seems counterintuitive to me, especially since the shuffle read/write
amounts decrease with each iteration; I would think that as more and more
vertices converged, the iterations should take less time. If I run on up to
150 of the 500 part files (stored on S3) it finishes in about 12 minutes, but
with all the data I’ve let it run for up to 4 hours and it still doesn’t
complete. Does anyone have ideas for approaches to troubleshooting this, Spark
parameters that might need to be tuned, etc.?
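For context, my mental model of what connectedComponents is doing is min-label
propagation: each vertex repeatedly adopts the smallest component id among
itself and its neighbors, and only vertices whose label changed do work in the
next round. This is just a toy sketch in plain Python of that idea (not the
GraphX implementation), but it's why I'd expect vertices with only a self edge
to converge immediately and per-iteration work to shrink:

```python
def connected_components(vertices, edges):
    # Start every vertex in its own component, labeled by its own id.
    label = {v: v for v in vertices}

    # Build an undirected adjacency map; self edges (v -> v) contribute
    # nothing to propagation, so a vertex with only a self edge is done
    # after round one.
    nbrs = {v: set() for v in vertices}
    for s, d in edges:
        if s != d:
            nbrs[s].add(d)
            nbrs[d].add(s)

    # Only vertices whose label changed last round send messages.
    active = set(vertices)
    while active:
        changed = set()
        for v in active:
            for u in nbrs[v]:
                # Neighbor adopts the smaller component id.
                if label[v] < label[u]:
                    label[u] = label[v]
                    changed.add(u)
        active = changed
    return label

# Two components: {1, 2, 3}, and {4} which has only a self edge.
print(connected_components([1, 2, 3, 4], [(1, 2), (2, 3), (4, 4)]))
```

Given that model, the active set should only ever shrink, which is why the
growing per-iteration times surprise me.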

Best Regards,

Jeffrey Picard
