GitHub user markhamstra opened a pull request: https://github.com/apache/spark/pull/1360

    SPARK-2425 Don't kill a still-running Application because of some misbehaving Executors

    Introduces a LOADING -> RUNNING ApplicationState transition and prevents Master from removing an Application with RUNNING Executors.

    Two basic changes:

    1) Instead of allowing MAX_NUM_RETRY abnormal Executor exits over the entire lifetime of the Application, allow that many since any Executor successfully began running the Application;
    2) Don't remove the Application while Master still thinks that there are RUNNING Executors.

    This should be fine as long as the ApplicationInfo doesn't believe any Executors are forever RUNNING when they are not. I think that any non-RUNNING Executors will eventually no longer be RUNNING in Master's accounting, but another set of eyes should confirm that. This PR also doesn't try to detect which nodes have gone rogue or to kill off bad Workers, so repeatedly failing Executors will continue to fail and fill up log files with failure reports as long as the Application keeps running.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/markhamstra/spark SPARK-2425

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1360.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1360

----
commit 5b85534d376d682b7e1f97f98acd532a305349f8
Author: Mark Hamstra <markhams...@gmail.com>
Date:   2014-07-09T23:02:43Z

    SPARK-2425 introduce LOADING -> RUNNING ApplicationState transition and prevent Master from removing Application with RUNNING Executors

----
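
For readers who want to see the two changes described in the pull request in miniature, here is a rough sketch in Python. It is not the Spark Master code: the class `AppSketch`, the method `on_executor_state`, and the value chosen for `MAX_NUM_RETRY` are all hypothetical, and the model tracks only a per-application failure counter and a map of executor states.

```python
from dataclasses import dataclass, field
from enum import Enum

# Assumed value, for illustration only.
MAX_NUM_RETRY = 10


class ExecState(Enum):
    LOADING = "LOADING"
    RUNNING = "RUNNING"
    FAILED = "FAILED"


@dataclass
class AppSketch:
    """Hypothetical stand-in for the Master's per-application bookkeeping."""
    executors: dict = field(default_factory=dict)  # executor id -> ExecState
    retry_count: int = 0
    removed: bool = False

    def on_executor_state(self, exec_id: str, state: ExecState) -> None:
        if state is ExecState.FAILED:
            # An abnormal exit: forget the executor and count the failure.
            self.executors.pop(exec_id, None)
            self.retry_count += 1
            has_running = any(
                s is ExecState.RUNNING for s in self.executors.values()
            )
            # Change 2: never remove the application while some
            # executor is still believed to be RUNNING.
            if self.retry_count >= MAX_NUM_RETRY and not has_running:
                self.removed = True
            return

        self.executors[exec_id] = state
        if state is ExecState.RUNNING:
            # Change 1: failures are counted since the most recent
            # successful start, not over the application's lifetime.
            self.retry_count = 0
```

In this sketch, an application with one healthy executor survives any number of failures elsewhere, while an application whose executors never reach RUNNING is removed after `MAX_NUM_RETRY` consecutive failures. As the description notes, this is only safe if no executor is recorded as RUNNING forever after it has actually stopped.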