wangshengjie created SPARK-42766:
------------------------------------

             Summary: YarnAllocator should filter excluded nodes when launching allocated containers
                 Key: SPARK-42766
                 URL: https://issues.apache.org/jira/browse/SPARK-42766
             Project: Spark
          Issue Type: Improvement
          Components: YARN
    Affects Versions: 3.3.2
            Reporter: wangshengjie

In our production environment, we hit the following issue:

Suppose we request 10 containers from nodeA and nodeB. The first response from YARN returns 5 containers from nodeA and nodeB, and then nodeA is excluded (blacklisted). The second response from YARN may still return some containers on nodeA, and YarnAllocator launches them. When those containers (executors) start up and send a register request to the driver, the request is rejected. Each such failure is counted toward
{code:java}
spark.yarn.max.executor.failures
{code}
and this can cause the application to fail with:
{code:java}
Max number of executor failures ($maxNumExecutorFailures) reached{code}
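
The proposed behaviour can be illustrated with a minimal sketch. This is not Spark's actual YarnAllocator code: the class `ExcludedNodeFilter`, the `AllocatedContainer` and `Split` types, and the method `splitByExcludedNodes` are all hypothetical names introduced for illustration. The idea is only that containers allocated on an excluded host are set aside before launch (here, assumed to be released back to YARN) instead of being started and later rejected by the driver.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; none of these types exist in Spark or Hadoop.
class ExcludedNodeFilter {

    /** Stand-in for a YARN container: an id plus the host it was allocated on. */
    static final class AllocatedContainer {
        final String id;
        final String host;

        AllocatedContainer(String id, String host) {
            this.id = id;
            this.host = host;
        }
    }

    /** Result of partitioning one allocation response. */
    static final class Split {
        final List<AllocatedContainer> toLaunch = new ArrayList<>();
        final List<AllocatedContainer> toRelease = new ArrayList<>();
    }

    /**
     * Partitions allocated containers by whether their host is currently excluded.
     * Containers on excluded hosts go to toRelease so they are never launched.
     */
    static Split splitByExcludedNodes(List<AllocatedContainer> allocated,
                                      Set<String> excludedHosts) {
        Split split = new Split();
        for (AllocatedContainer c : allocated) {
            if (excludedHosts.contains(c.host)) {
                split.toRelease.add(c);
            } else {
                split.toLaunch.add(c);
            }
        }
        return split;
    }

    public static void main(String[] args) {
        // nodeA was excluded after the first response; the second response
        // still contains a container on nodeA.
        List<AllocatedContainer> secondResponse = Arrays.asList(
                new AllocatedContainer("c1", "nodeA"),
                new AllocatedContainer("c2", "nodeB"));
        Split split = splitByExcludedNodes(secondResponse, Collections.singleton("nodeA"));
        System.out.println("launch=" + split.toLaunch.size()
                + " release=" + split.toRelease.size());
    }
}
```

With this kind of filter in place, the container on nodeA would never reach the executor registration step, so it would not add to the executor failure count. Whether the filtered containers should be released immediately or handled some other way is a design choice left open here.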