GitHub user kayousterhout commented on the issue:

    https://github.com/apache/spark/pull/15249
  
    @tgravescs no decision here yet.
    
    @mridulm the main question for (2), though, is whether the consequences are 
a deal-breaker.  It doesn't seem disastrous if a task has to run on a non-local 
machine instead of being retried on a machine where it already failed but 
might succeed later on.  The task also seems more likely to complete sooner 
on another machine than it would by re-running, after a delay, on a machine 
where it already failed.  What are the situations you're most concerned about 
with the new approach?
    
    If we leave the existing mechanism in, one concern (besides the additional 
complexity) is the interaction between the new host-level blacklisting and the 
old executor-level blacklisting.  Consider this scenario: the executor-level 
timeout keeps tasks from being retried on the same executor for some period 
of time, so they run on other executors on the same host.  If they fail there 
too, the host gets permanently blacklisted, and the fact that the executor 
blacklist would eventually re-allow the task becomes irrelevant.  I think we'd 
need to change the old executor blacklist timeout to be a host blacklist 
timeout for this to work.
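
    To make that interaction concrete, here is a rough toy model in Python. It 
is only a sketch of the scenario described above, not Spark's scheduler code: 
the class `ToyBlacklist`, the constants `EXECUTOR_TIMEOUT` and 
`HOST_FAILURE_THRESHOLD`, and the tick-based clock are all hypothetical names 
and values made up for illustration.

```python
from collections import defaultdict

# Hypothetical values, chosen only for illustration.
EXECUTOR_TIMEOUT = 5        # ticks a task is kept off an executor after failing there
HOST_FAILURE_THRESHOLD = 2  # distinct failed executors before the host is blacklisted


class ToyBlacklist:
    """Toy model: timed executor-level blacklist plus permanent host-level blacklist."""

    def __init__(self):
        self.executor_blocked_until = {}          # executor -> tick when its timeout expires
        self.failed_executors = defaultdict(set)  # host -> executors where the task failed
        self.blacklisted_hosts = set()            # never cleared in this toy model

    def can_run(self, host, executor, now):
        # The host-level blacklist wins over any expired executor-level timeout.
        if host in self.blacklisted_hosts:
            return False
        return now >= self.executor_blocked_until.get(executor, 0)

    def record_failure(self, host, executor, now):
        self.executor_blocked_until[executor] = now + EXECUTOR_TIMEOUT
        self.failed_executors[host].add(executor)
        if len(self.failed_executors[host]) >= HOST_FAILURE_THRESHOLD:
            self.blacklisted_hosts.add(host)


def simulate():
    bl = ToyBlacklist()

    # Tick 0: the task fails on exec1, which starts exec1's timeout.
    bl.record_failure("hostA", "exec1", now=0)

    # Tick 1: exec1 is still timed out, so the task goes to exec2 on the same
    # host and fails there as well. That second failure blacklists the host.
    assert not bl.can_run("hostA", "exec1", now=1)
    assert bl.can_run("hostA", "exec2", now=1)
    bl.record_failure("hostA", "exec2", now=1)

    # Once exec1's timeout has elapsed, the task is still not schedulable there.
    later = EXECUTOR_TIMEOUT
    return {
        "executor_timeout_expired": later >= bl.executor_blocked_until["exec1"],
        "can_run_on_exec1": bl.can_run("hostA", "exec1", now=later),
    }


if __name__ == "__main__":
    print(simulate())
```

    In this model the executor timeout for `exec1` expires, yet the task still 
cannot run there because the host entry never expires, which is why a timeout 
at the host level would be needed for the retry-after-delay behavior to have 
any effect.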

