[ https://issues.apache.org/jira/browse/SPARK-42766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-42766:
------------------------------------

    Assignee: Apache Spark

> YarnAllocator should filter excluded nodes when launching allocated containers
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-42766
>                 URL: https://issues.apache.org/jira/browse/SPARK-42766
>             Project: Spark
>          Issue Type: Improvement
>          Components: YARN
>    Affects Versions: 3.3.2
>            Reporter: wangshengjie
>            Assignee: Apache Spark
>            Priority: Major
>
> In our production environment, we hit an issue like this:
> If we request 10 containers from nodeA and nodeB, the first response from
> YARN returns 5 containers from nodeA and nodeB, and then nodeA is
> blacklisted. The second response from YARN may still return some containers
> on nodeA and launch them, but when those containers (executors) start up and
> send a register request to the driver, they are rejected. Each rejection is
> counted toward
> {code:java}
> spark.yarn.max.executor.failures{code}
> which can cause the application to fail with:
> {code:java}
> Max number of executor failures ($maxNumExecutorFailures) reached{code}

-- 
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
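The improvement described above amounts to checking each allocated container's host against the current excluded-node set before launching an executor on it. A minimal sketch of that idea follows; it does not use Spark's actual YarnAllocator API, and the class and method names (ExcludedNodeFilter, filterLaunchable) are illustrative assumptions, not Spark code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch only (not Spark's YarnAllocator): partition freshly
// allocated containers by host so that containers on excluded (blacklisted)
// nodes are skipped instead of launched. In the real fix, skipped containers
// would be released back to YARN rather than started, so their executors
// never register with the driver and never count toward
// spark.yarn.max.executor.failures.
public class ExcludedNodeFilter {

    /**
     * Returns the hosts on which it is still safe to launch executors,
     * i.e. allocated hosts that are not in the excluded-node set.
     */
    public static List<String> filterLaunchable(List<String> allocatedHosts,
                                                Set<String> excludedNodes) {
        List<String> launchable = new ArrayList<>();
        for (String host : allocatedHosts) {
            if (!excludedNodes.contains(host)) {
                launchable.add(host); // safe to launch an executor here
            }
            // Containers on excluded hosts are dropped here; the allocator
            // would release them back to YARN instead of launching them.
        }
        return launchable;
    }
}
```

In the scenario from the report, a second YARN response containing containers on nodeA (already excluded) would be filtered down to only the nodeB containers before any executor is launched.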