Github user wesolowskim commented on the issue:

    https://github.com/apache/spark/pull/14137
  
    That's how I see it:
    First you have workGraph defined and marked to cache:
    
    `var sccWorkGraph = graph.mapVertices { case (vid, _) => (vid, false) }.cache()`
    
    Then it is materialized:
    
    `var numVertices = sccWorkGraph.numVertices`
    
    It is also materialized on every iteration of both loops:
    
    `while (sccWorkGraph.numVertices > 0 && iter < numIter) {`
    
    and
    
    `while (sccWorkGraph.numVertices < numVertices)`
    
    Now looking inside the loop:
    sccWorkGraph is redefined in terms of its previous "version":
    
    `sccWorkGraph = sccWorkGraph.outerJoinVertices(sccWorkGraph.outDegrees) {
              (vid, data, degreeOpt) => if (degreeOpt.isDefined) data else (vid, true)
            }.outerJoinVertices(sccWorkGraph.inDegrees) {
              (vid, data, degreeOpt) => if (degreeOpt.isDefined) data else (vid, true)
            }.cache()`
    
    Let's assume we don't call cache(). We end up here:
    `while (sccWorkGraph.numVertices < numVertices)`
    
    numVertices is `graph.vertices.count()`
    
    On the first iteration sccWorkGraph has just been computed, so we move on.
    On the second iteration we have to recompute the sccWorkGraph that was created in the first iteration and then compute the second one, which is based on the first, and so on.
    
    It looks to me like this (the bracketed number is the iteration the RDD comes from):
    0: sccWorkGraph[0].vertices.count()
    1: sccWorkGraph[0]->sccWorkGraph[1].vertices.count()
    2: sccWorkGraph[0]->sccWorkGraph[1]->sccWorkGraph[2].vertices.count()
    3:...
    
    If I understand it correctly, without cache each count() on an iteration forces recomputation of the whole lineage, whereas with the previous sccWorkGraph cached only the last few stages have to be computed.
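    The lineage growth above can be sketched with plain RDDs (a minimal illustration, not the actual SCC code; the loop body and names are made up for the example):

    ```scala
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SparkSession

    object LineageSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("lineage").getOrCreate()
        val sc = spark.sparkContext

        // Without cache(): each count() replays the entire chain of maps
        // built so far, so iteration i costs roughly i passes over the data.
        var uncached: RDD[Long] = sc.parallelize(0L until 1000L)
        for (i <- 1 to 5) {
          uncached = uncached.map(_ + 1)   // lineage grows by one step
          uncached.count()                 // recomputes from iteration 0
        }

        // With cache(): count() materializes the new RDD, and the next
        // iteration reads the cached partitions instead of replaying lineage.
        var cached: RDD[Long] = sc.parallelize(0L until 1000L).cache()
        for (i <- 1 to 5) {
          cached = cached.map(_ + 1).cache()
          cached.count()                   // only the newest map stage runs
        }

        spark.stop()
      }
    }
    ```

    Note that even in the cached variant the lineage graph itself keeps growing; caching only avoids re-executing it as long as the cached partitions stay in memory.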
    
    With caches it looks like this:
    0: sccWorkGraph[0].cache().vertices.count()
    1: sccWorkGraph[0]->sccWorkGraph[1].cache().vertices.count()
    2: sccWorkGraph[0]->sccWorkGraph[1]->sccWorkGraph[2].cache().vertices.count()
    
    The same story goes for sccGraph, which depends on the previous iteration's sccGraph, which depends on finalVertices, which depends on sccWorkGraph. If I had infinite memory I would have all intermediate sccWorkGraph RDDs cached and materializing sccGraph would be quite quick, but since I don't, it is extremely slow.
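    One common remedy for this kind of loop (a hedged sketch under my own assumptions, not necessarily what this PR does) is to materialize each new iteration, unpersist the previous one, and periodically checkpoint to truncate the lineage so neither memory nor the lineage graph grows without bound. The names and the checkpoint interval below are illustrative only:

    ```scala
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SparkSession

    object CheckpointSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("checkpoint").getOrCreate()
        val sc = spark.sparkContext
        sc.setCheckpointDir("/tmp/spark-checkpoints")  // illustrative path

        var data: RDD[Long] = sc.parallelize(0L until 1000L).cache()
        for (i <- 1 to 20) {
          val prev = data
          data = prev.map(_ + 1).cache()
          if (i % 5 == 0) data.checkpoint() // cut the lineage every few iterations
          data.count()                      // materialize (and checkpoint, if marked)
          prev.unpersist()                  // free the previous iteration's cache
        }

        spark.stop()
      }
    }
    ```

    With this pattern only the current iteration stays cached, and after each checkpoint a recomputation has to replay at most a few stages instead of the whole history.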
    
    The problem is not visible inside the run() method directly because sccGraph is not materialized there.
    
    I hope that clarifies things. Without some kind of fix, the current version of strongly connected components is suitable only for graphs that are small relative to the available memory.

