GitHub user mridulm commented on the issue:

    https://github.com/apache/spark/pull/15249
  
    @squito Thanks for clarifying, that makes some of the choices clearer!
    
    A few points to better understand:
    
    a) The timeout does not seem to be enforced in this PR, which means it is 
not compatible with the earlier blacklisting we supported (which was primarily 
to handle executor shutdown, transient resource issues, etc.).
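
    For reference, the earlier timeout-based behavior I have in mind works 
roughly like this (a simplified, hypothetical sketch, not Spark's actual 
implementation): a failure blacklists the executor for a fixed window, and the 
executor automatically becomes schedulable again once the window elapses, which 
tolerates transient problems.

```python
import time


class TimeoutBlacklist:
    """Hypothetical sketch of timeout-based executor blacklisting.

    An executor is blacklisted for `timeout_secs` after a failure and
    automatically becomes schedulable again once the timeout elapses,
    tolerating transient issues (executor restart, temporary resource
    pressure, etc.).
    """

    def __init__(self, timeout_secs):
        self.timeout_secs = timeout_secs
        # executor id -> wall-clock time at which its blacklist entry expires
        self._expiry = {}

    def record_failure(self, executor_id, now=None):
        now = time.time() if now is None else now
        self._expiry[executor_id] = now + self.timeout_secs

    def is_blacklisted(self, executor_id, now=None):
        now = time.time() if now is None else now
        expiry = self._expiry.get(executor_id)
        if expiry is None:
            return False
        if now >= expiry:
            # Entry has expired: the executor is eligible for tasks again.
            del self._expiry[executor_id]
            return False
        return True
```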
    
    
    b) If I am not wrong, is the motivation to also blacklist at the executor 
level (instead of only at the node level) to handle cases where an executor is 
using a resource (disk/GPU/etc.) that is broken or full, causing tasks to fail 
only on that executor, while other executors on the node are fine?
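
    To make sure I understand that motivation, a minimal sketch of what I mean 
by the two levels (hypothetical names and thresholds, not Spark's actual 
implementation): failures are tracked per executor, and a node is only 
blacklisted once enough of its executors have gone bad, so a single broken 
executor does not take out its healthy neighbors.

```python
from collections import defaultdict

# Hypothetical thresholds, chosen only for illustration.
EXECUTOR_FAILURE_THRESHOLD = 2  # task failures before an executor is blacklisted
NODE_EXECUTOR_THRESHOLD = 2     # bad executors before the whole node is blacklisted


class LevelledBlacklist:
    """Sketch of executor-level vs node-level blacklisting."""

    def __init__(self, executor_to_node):
        # Static mapping: executor id -> node it runs on.
        self.executor_to_node = executor_to_node
        self.failures = defaultdict(int)
        self.bad_executors = set()
        self.bad_nodes = set()

    def record_task_failure(self, executor_id):
        self.failures[executor_id] += 1
        if self.failures[executor_id] >= EXECUTOR_FAILURE_THRESHOLD:
            self.bad_executors.add(executor_id)
            node = self.executor_to_node[executor_id]
            # Escalate to the node only when several of its executors are bad.
            bad_on_node = sum(
                1 for e in self.bad_executors
                if self.executor_to_node[e] == node
            )
            if bad_on_node >= NODE_EXECUTOR_THRESHOLD:
                self.bad_nodes.add(node)

    def can_schedule(self, executor_id):
        return (executor_id not in self.bad_executors
                and self.executor_to_node[executor_id] not in self.bad_nodes)
```

    With this model, an executor with a full disk gets blacklisted on its own 
while siblings on the same node keep receiving tasks, which seems to be the 
case this PR is targeting.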
    
    
    c) If the node or executor is not seeing a transient resource issue, but is 
in a more permanent failure state, should we think about blacklisting it at the 
driver level?
    
    
    c.1) A follow-up on that would be to push this info to YARN/Mesos/standalone, 
to blacklist future acquisition of executors on that node until it is 
'resolved' (via a timeout? some other notification?).
    
    I might have missed some of the context behind the change (particularly 
given it was spun off from an earlier PR). Thanks!


