Github user srowen commented on the issue: https://github.com/apache/spark/pull/14137 Oh good point, it does get materialized after `cache()` already because `numVertices` will call `count`. That does mean there's more than one call to evaluate the RDD, and that quite changes things. I see why `sccWorkGraph` works that way. I'm still a little hazy on `sccGraph` because the same isn't true of it in the current code. I agree that the dependency on `finalVertices` is relevant. As you say, the underlying `sccWorkGraph`s are cached. I think there does end up being a problem because it still means all of the lineage of cached `sccWorkGraph` evaluate at once and try to cache. I think your change is probably the right thing, then. What we should really do is add `unpersist` calls in the right places for `sccGraph` and `sccWorkGraph`, which is tricky. That would fully optimize this. (No my point was that the variable `sccGraphCountVertices` itself doesn't do anything. You don't need to store this value. `.count` does something of course.)
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org