Github user squito commented on the pull request: https://github.com/apache/spark/pull/8760#issuecomment-161100737

@kayousterhout this is pretty important for users running larger clusters, e.g. a few hundred nodes. We've seen cases where some weird misconfiguration on a node leads to all tasks failing on that node. The job will eventually succeed with the existing blacklist behavior, but with even one bad node you get so many task failure messages that it's very hard to make sense of what is going on -- apps with thousands of tasks per stage and tens of stages generate tons of task failures. This leads to users complaining that things are broken, when things are actually "ok" in the sense that the jobs eventually complete successfully, and it also leaves users unable to find the "real" failures hidden in all the bogus messages. Cluster managers don't help here -- the problem arises when there is a gap between the manager's notion of a "healthy" node and what this particular Spark application needs (e.g., one particular library fails to load, or a node has bad memory).

There is also an efficiency advantage to blacklisting the entire executor, rather than just letting tasks fail on your bad executors. In general the tasks will fail quickly and just get rescheduled (I've only seen one case where the tasks did not fail almost immediately), so maybe this is more minor, but it seems like the right thing to do. Beyond the initial task scheduling, this PR adds the ability for the cluster manager to avoid giving out more executors on the bad node. It also opens up the ability for Spark to actively request a new executor -- e.g., if you request 5 executors and one is bad, rather than running with 80% of your requested resources, you should just throw away that executor and get another one (if the cluster has capacity). Though that is not directly included in this patch -- in earlier discussion @mwws thought it best to push that off to follow-up work.
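To make the executor- and node-level behavior above concrete, here is a minimal, language-agnostic sketch (in Python, deliberately not Spark's actual Scala internals) of the kind of bookkeeping involved: an executor is blacklisted after repeated task failures, and a whole node is blacklisted once enough distinct executors on it have gone bad. The class name and thresholds (`BlacklistTracker`, `MAX_FAILURES_PER_EXECUTOR`, `MAX_BAD_EXECUTORS_PER_NODE`) are illustrative assumptions, not the PR's API or configuration keys.

```python
from collections import defaultdict

# Illustrative thresholds -- not the PR's actual configuration keys.
MAX_FAILURES_PER_EXECUTOR = 2   # task failures before an executor is blacklisted
MAX_BAD_EXECUTORS_PER_NODE = 2  # bad executors before the whole node is blacklisted


class BlacklistTracker:
    """Hypothetical sketch of executor- and node-level blacklisting."""

    def __init__(self):
        self.failures = defaultdict(int)           # executor_id -> failure count
        self.bad_executors = set()                 # blacklisted executor ids
        self.executors_on_node = defaultdict(set)  # node -> executor ids
        self.bad_nodes = set()                     # blacklisted nodes

    def register_executor(self, executor_id, node):
        self.executors_on_node[node].add(executor_id)

    def task_failed(self, executor_id, node):
        self.failures[executor_id] += 1
        if self.failures[executor_id] >= MAX_FAILURES_PER_EXECUTOR:
            self.bad_executors.add(executor_id)
        # Node-level blacklisting: one misconfigured node takes out every
        # executor it hosts, so stop scheduling on it entirely once several
        # of its executors have been blacklisted.
        bad_here = self.executors_on_node[node] & self.bad_executors
        if len(bad_here) >= MAX_BAD_EXECUTORS_PER_NODE:
            self.bad_nodes.add(node)

    def can_schedule_on(self, executor_id, node):
        return executor_id not in self.bad_executors and node not in self.bad_nodes
```

With this shape, the cluster-manager integration described above amounts to consulting `bad_nodes` before granting new executors on a node, and the proposed follow-up work would additionally request a replacement executor whenever one lands in `bad_executors`.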
I also like the idea of minimizing user configuration, but I also recognize that we are working from a position of some ignorance about the failure modes people encounter and how they want them handled. E.g., I imagine that everyone would want to use the `ExecutorAndNode` strategy and not change any other configuration, but perhaps some users want the old behavior and more knobs. That's also why I'm reluctant to change the default behavior (though I'm not drawing a hard line either). After providing some options we could try to collect more feedback on what works.

You've got a good point about only blacklisting executors across jobs if the tasks succeed elsewhere. I was initially only thinking about a linear Spark application -- in that case, if some tasks in a job fail everywhere, then the job fails and the Spark app is done in any case. But of course apps may try to recover from failed jobs, or even run completely independent jobs, in a "job server" style deployment.

Another point you made on another PR is that we have to ensure we don't prevent any progress from being made if all executors get blacklisted and there is nowhere for any tasks to run. We should make sure we have some sensible behavior there, e.g., fail the job.
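That last safety concern can be sketched as a simple check: before leaving a task set to wait indefinitely, verify whether every live executor is blacklisted for it, and fail fast with a clear error instead of hanging. This is a hypothetical Python sketch of the idea only; the function and exception names (`abort_if_completely_blacklisted`, `NoProgressError`) are made up for illustration and are not Spark's API.

```python
class NoProgressError(Exception):
    """Raised when a task has nowhere left to run (hypothetical name)."""


def abort_if_completely_blacklisted(pending_tasks, live_executors, is_blacklisted):
    """Fail fast when blacklisting leaves a task with no schedulable executor.

    pending_tasks:   ids of tasks still waiting to run
    live_executors:  ids of currently registered executors
    is_blacklisted:  callable (task_id, executor_id) -> bool
    """
    for task in pending_tasks:
        schedulable = [e for e in live_executors if not is_blacklisted(task, e)]
        if not schedulable:
            # Every executor is blacklisted for this task: no progress is
            # possible, so abort with an explicit error rather than letting
            # the job wait forever.
            raise NoProgressError(
                f"Task {task} cannot run: all {len(live_executors)} "
                "executors are blacklisted for it")
```

A scheduler could run a check like this whenever the blacklist grows or an executor is lost, so the failure surfaces as a job-level error with an actionable message instead of a silent hang.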