Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/14137
  
    Oh good point, it does get materialized after `cache()` already because 
`numVertices` will call `count`. That does mean there's more than one call to 
evaluate the RDD, and that quite changes things. I see why `sccWorkGraph` works 
that way.
    
    I'm still a little hazy on `sccGraph` because the same isn't true of it in 
the current code. I agree that the dependency on `finalVertices` is relevant. 
As you say, the underlying `sccWorkGraph`s are cached. I think there does end 
up being a problem because it still means all of the lineage of cached 
`sccWorkGraph` evaluate at once and try to cache.
    
    I think your change is probably the right thing, then. What we should 
really do is add `unpersist` calls in the right places for `sccGraph` and 
`sccWorkGraph`, which is tricky. That would fully optimize this.
    
    (No my point was that the variable `sccGraphCountVertices` itself doesn't 
do anything. You don't need to store this value. `.count` does something of 
course.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to