Github user wesolowskim commented on the issue: https://github.com/apache/spark/pull/14137

That's how I see it. First you have the work graph defined and marked to cache:

```scala
var sccWorkGraph = graph.mapVertices { case (vid, _) => (vid, false) }.cache()
```

then it is materialized:

```scala
var numVertices = sccWorkGraph.numVertices
```

It is also materialized on every iteration of both loops:

```scala
while (sccWorkGraph.numVertices > 0 && iter < numIter) {
```

and

```scala
while (sccWorkGraph.numVertices < numVertices)
```

Now, looking inside the loop, sccWorkGraph is defined on top of its previous "version":

```scala
sccWorkGraph = sccWorkGraph.outerJoinVertices(sccWorkGraph.outDegrees) {
  (vid, data, degreeOpt) => if (degreeOpt.isDefined) data else (vid, true)
}.outerJoinVertices(sccWorkGraph.inDegrees) {
  (vid, data, degreeOpt) => if (degreeOpt.isDefined) data else (vid, true)
}.cache()
```

Let's assume we don't have cache(). We end up at

```scala
while (sccWorkGraph.numVertices < numVertices)
```

where numVertices is `graph.vertices.count()`. On the first iteration sccWorkGraph has just been computed and we move on. On the second iteration we have to compute the sccWorkGraph that was created in the first iteration and then compute the second one that is based on it, and so on. To me it looks like this (in brackets is the iteration the RDD comes from):

```
0: sccWorkGraph[0].vertices.count()
1: sccWorkGraph[0]->sccWorkGraph[1].vertices.count()
2: sccWorkGraph[0]->sccWorkGraph[1]->sccWorkGraph[2].vertices.count()
3: ...
```

If I understand it correctly, with each iteration and no cache, count forces the whole lineage to be recomputed, whereas if the previous sccWorkGraph is cached, only the last few stages have to be computed. With caches it looks like this:

```
0: sccWorkGraph[0].cache().vertices.count()
1: sccWorkGraph[0]->sccWorkGraph[1].cache().vertices.count()
2: sccWorkGraph[0]->sccWorkGraph[1]->sccWorkGraph[2].cache().vertices.count()
```

The same story goes for sccGraph, which depends on the previous iteration's sccGraph, which depends on finalVertices, which depends on sccWorkGraph. If I had infinite memory I would have all intermediate sccWorkGraph RDDs cached and materializing sccGraph would be quite quick, but since I don't, it is extremely slow. The problem is not visible inside the run() method directly because sccGraph is not materialized there.

I hope that clarifies something. Without some kind of fix, the current version of strongly connected components is only sufficient for graphs that are small relative to the available memory.
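To make the recomputation pattern concrete, here is a minimal, self-contained sketch. It is not the actual GraphX SCC code and not the fix proposed in this PR; it only mimics the pattern of redefining a cached working RDD on top of its previous version and counting it each iteration. The names (`work`, `LineageGrowthSketch`) and the explicit `unpersist` of the previous iteration are illustrative assumptions, not part of the original algorithm:

```scala
// Minimal sketch of the "redefine on top of the previous version, then count" pattern.
// Assumed names and the unpersist call are illustrative only.
import org.apache.spark.sql.SparkSession

object LineageGrowthSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LineageGrowthSketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    var work = sc.parallelize(1 to 1000000).cache() // analogous to the initial sccWorkGraph
    var size = work.count()                         // materializes iteration 0

    var iter = 0
    while (size > 0 && iter < 10) {
      val prev = work
      // Each iteration defines the new RDD on top of the previous one,
      // so the lineage is work[0] -> work[1] -> ... -> work[iter].
      work = prev.filter(_ % 2 == 0).cache()
      // With cache(), only the newest step is computed here; without it,
      // this count() would re-execute the whole chain back to iteration 0.
      size = work.count()
      // Free the previous iteration's cached data once the new one is materialized
      // (an assumption for this sketch, to keep memory bounded).
      prev.unpersist(blocking = false)
      iter += 1
    }

    println(s"final size = $size after $iter iterations")
    spark.stop()
  }
}
```

Commenting out the `cache()` call in the sketch reproduces the behaviour described above: every `count()` walks the entire lineage again, so each iteration gets slower than the last.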