Github user mridulm commented on the issue: https://github.com/apache/spark/pull/15249

@squito Thanks for clarifying, that makes some of the choices clearer! A few points to better understand:

a) The timeout does not seem to be enforced in this PR, which means it is not compatible with the earlier blacklisting we supported (which was primarily there to handle executor shutdown, transient resource issues, etc.).

b) If I am not wrong, is the motivation for also blacklisting at the executor level (instead of only at the node level) to handle cases where an executor is using a resource (disk/GPU/etc.) that is broken or full, causing tasks to fail only on that executor, while other executors on the same node are fine?

c) If the node or executor is not seeing a transient resource issue but is in a more permanent failure state, should we think about blacklisting it at the driver level?

c.1) A follow-up to that would be to push this info to YARN/Mesos/Standalone, to blacklist future acquisition of executors on that node until it is 'resolved' (via a timeout? some other notification?).

I might have missed some of the context behind the change (particularly given it was spun off from an earlier PR). Thanks!
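For context on point (a), the earlier timeout-based executor blacklisting and the newer fine-grained blacklist work are driven by different configuration properties. A minimal sketch of the relevant `spark-defaults.conf` entries, assuming the property names from Spark's configuration documentation of that era (the values shown are illustrative, not recommendations):

```
# Legacy per-task blacklisting: how long (ms) a (task, executor) pair stays
# blacklisted after a failure; this is the timeout behavior point (a) refers to.
spark.scheduler.executorTaskBlacklistTime   3600000

# Newer blacklist mechanism introduced around this PR series (hedged: exact
# names per the Spark 2.1+ configuration docs):
spark.blacklist.enabled                     true
# Failures per task before that task is blacklisted on a given executor/node.
spark.blacklist.task.maxTaskAttemptsPerExecutor  1
spark.blacklist.task.maxTaskAttemptsPerNode      2
# Failures within a stage before a whole executor or node is blacklisted for it.
spark.blacklist.stage.maxFailedTasksPerExecutor  2
spark.blacklist.stage.maxFailedExecutorsPerNode  2
```

The question raised in the comment is that the newer properties above lacked a `spark.blacklist.timeout`-style expiry in this PR, whereas the legacy mechanism was purely timeout-based.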