GitHub user tsliwowicz opened a pull request: https://github.com/apache/spark/pull/2854

Block Manager - Double Register Crash

In long-running contexts, we encountered a double register without a remove in between. The cause is unknown; we assume a temporary network issue. Because the second register carries a BlockManagerId with a different port, blockManagerInfo.contains() returns false while blockManagerIdByExecutor returns Some. This inconsistency is caught by a conditional that calls System.exit(1), which is a huge robustness issue for us.

The fix: when this happens during register, simply remove the old id from both maps. This mimics the behavior of expireDeadHosts(): we clean up the maps locally before trying to add the new entries.

Also added some logging for register and unregister.

https://issues.apache.org/jira/browse/SPARK-4006

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/taboola/spark branch-0.9.2-block-mgr-removal

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2854.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2854

----
commit efd93f2026ddc427e84fa03e8a595ded2b1a81ce
Author: Tal Sliwowicz <ta...@taboola.com>
Date:   2014-10-12T08:35:20Z

    In long running contexts, we encountered the situation of double register without a remove in between. The cause for that is unknown, and assumed a temp network issue.

    However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.

    The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.

    Also - added some logging for register and unregister.

commit 81d69f088e421b19e47495d06e8b187a0ec29075
Author: Tal Sliwowicz <ta...@taboola.com>
Date:   2014-10-12T08:41:53Z

    fixed comment

----
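
For readers who want to see the register-time cleanup from the pull request description in isolation, here is a minimal sketch in Python. It is a toy model, not the Spark code: only the two map names (blockManagerInfo, blockManagerIdByExecutor) come from the description, while the BlockManagerRegistry class, its methods, the field names, and the sample hosts and ports are assumptions made for illustration.

```python
from typing import Dict, NamedTuple, Optional


class BlockManagerId(NamedTuple):
    """Hypothetical stand-in for Spark's BlockManagerId."""
    executor_id: str
    host: str
    port: int


class BlockManagerRegistry:
    """Toy model of the two maps named in the description (assumed shape)."""

    def __init__(self) -> None:
        # Keyed by the full id (executor, host, port).
        self.block_manager_info: Dict[BlockManagerId, dict] = {}
        # Keyed by executor id only.
        self.block_manager_id_by_executor: Dict[str, BlockManagerId] = {}

    def remove(self, bm_id: BlockManagerId) -> None:
        """Drop an id from both maps; missing entries are ignored."""
        self.block_manager_info.pop(bm_id, None)
        if self.block_manager_id_by_executor.get(bm_id.executor_id) == bm_id:
            del self.block_manager_id_by_executor[bm_id.executor_id]

    def register(self, bm_id: BlockManagerId, max_mem: int) -> Optional[BlockManagerId]:
        """Register bm_id and return the stale id that was evicted, if any."""
        if bm_id in self.block_manager_info:
            return None  # exact duplicate: nothing to do
        stale = self.block_manager_id_by_executor.get(bm_id.executor_id)
        if stale is not None:
            # Same executor, different id (for example a new port). Per the
            # description, the old code path exits the process here; the
            # proposed fix cleans up the old id from both maps instead.
            self.remove(stale)
        self.block_manager_id_by_executor[bm_id.executor_id] = bm_id
        self.block_manager_info[bm_id] = {"max_mem": max_mem}
        return stale


registry = BlockManagerRegistry()
first = BlockManagerId("exec-1", "host-a", 40001)
second = BlockManagerId("exec-1", "host-a", 40002)  # same executor, new port
registry.register(first, 1024)
evicted = registry.register(second, 1024)
```

After the second call, the model holds only the new id in both maps and reports the old one as evicted, which is the point where the described logging for register and unregister would be useful.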