Hey all, I’m currently trying to run connected components using GraphX on a large graph (~1.8b vertices and ~3b edges, most of them are self edges where the only edge that exists for vertex v is v->v) on emr using 50 m3.xlarge nodes. As the program runs I’m seeing each iteration take longer and longer to complete, this seems counter intuitive to me, especially since I am seeing the shuffle read/write amounts decrease with each iteration. I would think that as more and more vertices converged the iterations should take a shorter amount of time. I can run on up to 150 of the 500 part files (stored on s3) and it finishes in about 12 minutes, but with all the data I’ve let it run up to 4 hours and it still doesn’t complete. Does anyone have ideas for approaches to trouble shooting this, spark parameters that might need to be tuned, etc?
Best Regards, Jeffrey Picard