Github user squito commented on the issue: https://github.com/apache/spark/pull/15249

@mridulm on yarn's bad disk detection -- yes, you are right, it is a very rudimentary check for bad disks. It really can't catch everything (and we've seen that in practice). I was just pointing out at least one case where you know some executors will be good and some won't. You certainly still need node-level blacklisting.

On the bigger topic of what to do about the timeouts -- I'm now thinking that we should really treat the legacy "spark.scheduler.executorTaskBlacklistTime" as orthogonal to the new "spark.blacklist.*". The new feature is about dealing w/ resources that are bad for a long period of time (e.g., hardware failure). The old feature was about trying to cope w/ resource contention. I may have been using (abusing) it to deal w/ bad hardware, but that is only b/c it was the only thing there was. Trying to shoe-horn resource contention back into this at the 11th hour might be a mistake.

Perhaps it makes the most sense to just leave the old feature in, beside this one. It'll still be undocumented (and I'll remove the logic that ties the configs together), so it can still wait for a cleaner fix, but existing use cases aren't broken. Maybe that fix is short timeouts for taskset-level blacklisting, or maybe it's something else entirely. When I put the old feature back, I can update names & add comments to make this distinction clear.
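To make the "orthogonal configs" idea concrete, here is a rough sketch of how the two families might sit side by side in `spark-defaults.conf`. The legacy key is the one named in this thread; the `spark.blacklist.*` keys follow the naming scheme discussed around this PR, but the exact keys and defaults here are assumptions, not a statement of the final API:

```
# Sketch only -- key names under spark.blacklist.* are assumed, not final.

# Legacy, undocumented taskset-level blacklist: how long (in ms) a task is
# barred from being rescheduled on an executor where it already failed.
# Short timeouts here are aimed at transient resource contention.
spark.scheduler.executorTaskBlacklistTime   5000

# New feature: long-lived blacklisting aimed at persistently bad resources
# (e.g. hardware failure), configured independently of the legacy key.
spark.blacklist.enabled                     true
```

The point of the sketch is that the two knobs would tune different failure modes (short-lived contention vs. long-lived bad hardware), so neither config needs to be derived from the other once the tying logic is removed.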