Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/14137
  
    You are right, the code already had a `.cache()` call without a matching 
`.unpersist()`. It eventually gets cleaned up, but it's not good form, and it 
probably didn't do much either, except perhaps break the lineage.
    
    The problem with `cache()` and `unpersist()` is that if nothing 
materializes the RDD between the two calls, the pair does nothing. Yes, I know 
that's why you forced materialization with an extra call to `count()`.
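
    As a minimal sketch of that pattern (the names and data below are purely 
illustrative, not taken from this PR):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Illustrative only; `sc` is the usual SparkContext (e.g. from spark-shell).
def cacheDemo(sc: SparkContext): Unit = {
  val vertices: RDD[(Long, Int)] = sc.parallelize(Seq((1L, 0), (2L, 0), (3L, 0)))

  vertices.cache()      // lazy: only marks the RDD for caching
  vertices.unpersist()  // nothing materialized it in between, so nothing was ever cached

  vertices.cache()
  vertices.count()      // an action computes the RDD and actually stores its blocks
  vertices.unpersist()  // releases those blocks again
}
```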
    
    The thing is, caching by itself can't make the underlying computation 
faster, and none of the RDDs that are now cached are used a second time 
(right?). I think something else is at work here.
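
    To put that point as a sketch (again hypothetical code, not this PR's): 
caching only pays for itself when the same RDD is read a second time.

```scala
import org.apache.spark.SparkContext

// Illustrative only: the first action pays the full computation cost whether
// or not cache() was called; only a second action over the same RDD benefits.
def reuseDemo(sc: SparkContext): Unit = {
  val counts = sc.textFile("hdfs:///some/hypothetical/input")
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1L))
    .reduceByKey(_ + _)

  counts.cache()
  val total = counts.count()                       // computes the lineage and fills the cache
  val singles = counts.filter(_._2 == 1L).count()  // this second pass is what the cache speeds up
  counts.unpersist()

  println(s"$total distinct words, $singles of them appearing only once")
}
```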
    
    One guess is that the current calls to `cache()`, which don't seem quite 
right, are the problem: the whole lineage is computed and every RDD is 
persisted at once, competing with other cached data, evicting it, and 
generally wasting time, because those RDDs aren't reused at all (I think). If 
that's true, then actually removing all of the caching should also help a lot; 
if not, then that's not it.
    
    (PS: `sccGraphCountVertices` here is superfluous; you can omit it.)

