Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/15249
  
    The question to me comes down to how many temporary resource issues you expect 
and how often. At some point, if it's just from that much skew, you should 
probably fix your configs; whether you succeed or fail before hitting the max 
task failures would come down to luck (based on other tasks finishing in the 
executor). If it's a transient network issue, a retry could work; if it's a 
long-lived network issue, I'd expect a lot of things to fail.
    
    Unfortunately I don't have enough data on how often this happens, or what 
exactly happens in Spark, to know what would definitely help.  I know that on MR, 
allowing multiple attempts on the same node sometimes works for at least some 
of these temporary conditions.  I would think the same would apply to Spark, 
although as mentioned, the difference might be in container re-launch time 
vs. just sending the task retry. Perhaps we should just change the defaults to 
allow more than one task attempt per executor and increase 
maxTaskAttemptsPerNode accordingly.  Then let's run with that and see what 
happens, so we can get more data and enhance from there. If we want to be 
paranoid, leave the existing blacklisting functionality in as a fallback in 
case the new behavior doesn't work.
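    
    Just to make the suggestion concrete, here is a rough sketch of the kind of 
defaults I mean. The exact config keys are whatever this PR lands on; the names 
and values below are illustrative, not final:
    
    ```scala
    import org.apache.spark.SparkConf
    
    // Illustrative only: allow a second attempt on the same executor before
    // blacklisting it for the task, and give a bit more headroom per node.
    val conf = new SparkConf()
      .set("spark.blacklist.task.maxTaskAttemptsPerExecutor", "2")
      .set("spark.blacklist.task.maxTaskAttemptsPerNode", "3")
    ```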
    
    Also, if we are worried about resource limits, i.e., blacklisting too many 
executors/nodes for too long, then perhaps we should add a fail-safe like a 
max % blacklisted: once you hit that percentage, you don't blacklist anymore.
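    
    Something along these lines (names and threshold are hypothetical, just to 
illustrate the fail-safe idea, not code from this PR):
    
    ```scala
    // Sketch of the fail-safe: stop adding to the blacklist once too large a
    // fraction of the cluster is already blacklisted.
    def shouldBlacklist(
        numBlacklistedExecutors: Int,
        totalExecutors: Int,
        maxBlacklistedFraction: Double = 0.5): Boolean = {
      totalExecutors > 0 &&
        numBlacklistedExecutors.toDouble / totalExecutors < maxBlacklistedFraction
    }
    ```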

