Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/14333

@srowen The handling of `bcNewCenters` in `KMeans` has a problem. Looking at the code logic in detail, in each loop we should destroy the broadcast variable `bcNewCenters` generated in the *previous* loop, not the one generated in the current loop. This mirrors what is done for the `costs: RDD`, which uses a `preCosts` variable to hold the value from the previous loop. I have updated the code accordingly.

On the second question, about the meaning of `broadcast.unpersist`: I think there is another scenario to consider. Suppose we have an RDD lineage. In the normal case it executes successfully, and in the code we can unpersist a broadcast variable as soon as it becomes useless. But if an exception occurs, Spark can recover by recomputing the broken RDD from its lineage, and in that case it may re-use the broadcast variable we already unpersisted. If we simply destroy it instead, the broadcast variable can no longer be re-fetched, so the recovery will fail. Therefore I think the safe place to call `broadcast.destroy` is after an action on the RDD has executed successfully and the whole RDD lineage is no longer needed.
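To make the pattern concrete, here is a minimal sketch of the loop structure being described. This is *not* the actual `KMeans` source; names such as `BroadcastLoopSketch`, `computeNewCenters`, and `bcPrevNewCenters` are illustrative placeholders, and the update step is stubbed out. The point is only the ordering: run the action for the current iteration first, then destroy the broadcast from the previous iteration.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

object BroadcastLoopSketch {

  // Hypothetical stand-in for one KMeans-style update step. The real
  // computation doesn't matter here; what matters is that it triggers
  // an action on `data` that reads the broadcast centers.
  def computeNewCenters(
      data: RDD[Array[Double]],
      bcCenters: Broadcast[Array[Array[Double]]]): Array[Array[Double]] = {
    // ... assignment + aggregation would go here; take() forces an action.
    data.take(bcCenters.value.length)
  }

  def run(sc: SparkContext, data: RDD[Array[Double]], maxIterations: Int): Unit = {
    var centers: Array[Array[Double]] = data.take(2)
    var bcPrevNewCenters: Broadcast[Array[Array[Double]]] = null

    for (_ <- 0 until maxIterations) {
      val bcNewCenters = sc.broadcast(centers)

      // An action runs here, so the stages that read bcNewCenters in
      // this iteration complete before we destroy anything.
      centers = computeNewCenters(data, bcNewCenters)

      // Destroy the broadcast from the PREVIOUS iteration, mirroring the
      // `costs`/`preCosts` pattern: once the current action succeeds, no
      // lineage that might still be recomputed references the previous
      // broadcast, so destroy() is safe. Destroying bcNewCenters itself
      // here would break recovery of any persisted RDD whose lineage
      // still depends on it.
      if (bcPrevNewCenters != null) {
        bcPrevNewCenters.destroy()
      }
      bcPrevNewCenters = bcNewCenters
    }

    // After the loop, the last broadcast is no longer needed by any
    // lineage we may still recompute, so it can be destroyed too.
    if (bcPrevNewCenters != null) {
      bcPrevNewCenters.destroy()
    }
  }
}
```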