Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    @srowen 
    There is a problem with how `bcNewCenters` is handled in `KMeans`.
    Looking at the loop logic closely: each iteration should destroy the broadcast variable `bcNewCenters` created in the *previous* iteration, not the one created in the current iteration. This mirrors what is already done for `costs: RDD`, where a `preCosts` variable holds on to the RDD from the previous iteration.
    I have updated the code accordingly; a sketch of the pattern follows.
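
    For reference, here is a minimal sketch of that pattern. The data, the inline cost computation and the center update are placeholders, not the real KMeans internals; only the broadcast lifecycle matters:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Sketch: destroy the broadcast from the *previous* iteration, not the current one,
// mirroring how `preCosts` keeps a handle on the previous `costs` RDD.
def iterate(sc: SparkContext, data: RDD[Double], iterations: Int): Unit = {
  var centers: Array[Double] = Array(0.0)                       // placeholder seeding
  var costs: RDD[Double] = data.map(_ => Double.PositiveInfinity)
  var bcPrevCenters: Broadcast[Array[Double]] = null            // broadcast from the previous loop

  for (_ <- 0 until iterations) {
    val bcNewCenters = sc.broadcast(centers)
    val preCosts = costs
    // The new `costs` is defined lazily against bcNewCenters, so that broadcast
    // must stay alive until the RDDs built on top of it have been materialized.
    costs = data.zip(preCosts).map { case (p, c) =>
      math.min(bcNewCenters.value.map(ctr => (p - ctr) * (p - ctr)).min, c)
    }.persist(StorageLevel.MEMORY_AND_DISK)
    val sumCosts = costs.sum()                                  // action: materializes the new `costs`

    if (bcPrevCenters != null) {
      bcPrevCenters.destroy()                                   // drop the broadcast from the previous loop
    }
    bcPrevCenters = bcNewCenters
    preCosts.unpersist(blocking = false)

    centers = Array(sumCosts / data.count())                    // placeholder center update, not real KMeans
  }
  if (bcPrevCenters != null) bcPrevCenters.destroy()            // clean up the last broadcast
}
```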
    
    The second point is about what `broadcast.unpersist` actually means. Consider this scenario: there is an RDD lineage, and in the normal case it executes successfully, so in the code we can unpersist a broadcast variable promptly once it looks unused. But if an exception happens, Spark can recover by recomputing the broken RDD from its lineage, and that recomputation may re-use the broadcast variable we already unpersisted. That is fine, because an unpersisted broadcast can be re-sent from the driver; but if we simply destroy it, the broadcast variable cannot be recovered and the recovery will fail.
    
    So I think the safe place to call `broadcast.destroy` is after an action on the RDD has executed successfully and the whole RDD lineage that references the broadcast is no longer needed. A toy illustration of the unpersist/destroy difference is below.
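
    As a toy illustration of that unpersist/destroy difference (made-up data and names, not from the KMeans code):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Toy sketch of unpersist vs destroy under lineage recomputation.
def demo(sc: SparkContext): Unit = {
  val bcFactor: Broadcast[Int] = sc.broadcast(10)
  val scaled = sc.parallelize(1 to 5).map(_ * bcFactor.value) // lazily references the broadcast

  scaled.count()                       // action: lineage executed once

  bcFactor.unpersist(blocking = false) // executors drop cached copies, but the driver can
                                       // re-send the value if `scaled` is ever recomputed
  scaled.collect()                     // fine: recomputation re-fetches the broadcast

  bcFactor.destroy()                   // all copies are removed for good
  // scaled.collect()                  // would now fail, because the lineage still
                                       // references the destroyed broadcast
}
```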

