wangshengjie created SPARK-42766:
------------------------------------

             Summary: YarnAllocator should filter excluded nodes when launching 
allocated containers
                 Key: SPARK-42766
                 URL: https://issues.apache.org/jira/browse/SPARK-42766
             Project: Spark
          Issue Type: Improvement
          Components: YARN
    Affects Versions: 3.3.2
            Reporter: wangshengjie


In our production environment, we hit the following issue:

Suppose we request 10 containers from nodeA and nodeB. The first response from 
YARN returns 5 containers from nodeA and nodeB, and nodeA is then excluded 
(blacklisted). The second response from YARN may still return some containers 
on nodeA, and YarnAllocator launches them. When those containers (executors) 
start up and send a register request to the driver, the request is rejected, 
and the failure is counted toward 
{code:java}
spark.yarn.max.executor.failures {code}
, and will eventually cause the application to fail with:
{code:java}
Max number of executor failures ($maxNumExecutorFailures) reached{code}
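
To illustrate the proposed behavior, here is a minimal sketch of filtering allocated containers against the current set of excluded hosts before launching them. It is not Spark's actual YarnAllocator code: the class, the AllocatedContainer stand-in, and splitByExcludedHosts are hypothetical names, and releasing the filtered containers back to YARN is an assumption about what should happen to them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; none of these names come from Spark or the YARN client API.
class ExcludedNodeFilter {

    /** Simplified stand-in for a YARN container: an id and the host it was allocated on. */
    static final class AllocatedContainer {
        final String id;
        final String host;

        AllocatedContainer(String id, String host) {
            this.id = id;
            this.host = host;
        }
    }

    /** Result of splitting one allocation response. */
    static final class Split {
        final List<AllocatedContainer> toLaunch = new ArrayList<>();
        // Assumption: containers on excluded hosts are handed back to YARN, not launched.
        final List<AllocatedContainer> toRelease = new ArrayList<>();
    }

    /**
     * Splits the containers returned by the resource manager into those that are
     * safe to launch and those that landed on a host excluded since the request.
     */
    static Split splitByExcludedHosts(List<AllocatedContainer> allocated,
                                      Set<String> excludedHosts) {
        Split split = new Split();
        for (AllocatedContainer container : allocated) {
            if (excludedHosts.contains(container.host)) {
                split.toRelease.add(container);
            } else {
                split.toLaunch.add(container);
            }
        }
        return split;
    }

    public static void main(String[] args) {
        // Second response from YARN, received after nodeA was excluded.
        List<AllocatedContainer> secondResponse = List.of(
                new AllocatedContainer("container_01", "nodeA"),
                new AllocatedContainer("container_02", "nodeB"),
                new AllocatedContainer("container_03", "nodeA"));

        Split split = splitByExcludedHosts(secondResponse, Set.of("nodeA"));
        System.out.println("launch: " + split.toLaunch.size()
                + ", release: " + split.toRelease.size());
    }
}
```

With a check like this, executors would never be started on nodeA after it is excluded, so their rejected registrations would not count toward the executor failure limit.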
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

