Prabhu Joseph created SPARK-22199:
-------------------------------------

             Summary: Spark Job on YARN fails with executors "Slave registration failed"
                 Key: SPARK-22199
                 URL: https://issues.apache.org/jira/browse/SPARK-22199
             Project: Spark
          Issue Type: Bug
          Components: YARN
    Affects Versions: 1.6.3
            Reporter: Prabhu Joseph

A Spark job on YARN failed after reaching the maximum number of executor failures. ApplicationMaster logs:

{code}
17/09/28 04:18:27 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: Max number of executor failures (3) reached)
{code}

The failed container logs show "Slave registration failed: Duplicate executor ID", whereas the driver logs show that it had already removed those executors because they were idle for spark.dynamicAllocation.executorIdleTimeout.

Executor logs:

{code}
17/09/28 04:18:26 ERROR CoarseGrainedExecutorBackend: Slave registration failed: Duplicate executor ID: 122
{code}

Driver logs:

{code}
17/09/28 04:18:21 INFO ExecutorAllocationManager: Removing executor 122 because it has been idle for 60 seconds (new desired total will be 133)
{code}

There are two issues here:

1. The executor error message "Slave registration failed: Duplicate executor ID" is misleading, as the actual cause is that the executor was removed for being idle.
2. The job failed because executors were idle for spark.dynamicAllocation.executorIdleTimeout.
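
The correlation described above (an executor ID that the driver removed as idle and that later fails registration as a duplicate) can be checked mechanically. The following is only a minimal sketch in Python, not part of Spark: the function names `extract_ids` and `idle_removed_duplicates` are hypothetical, and the regular expressions assume the exact log wording quoted in this report, which may differ in other Spark versions.

```python
import re

# Patterns assume the log wording quoted in this report (Spark 1.6.3).
REMOVED_IDLE = re.compile(
    r"ExecutorAllocationManager: Removing executor (\d+) because it has been idle"
)
DUPLICATE_ID = re.compile(
    r"Slave registration failed: Duplicate executor ID: (\d+)"
)


def extract_ids(pattern, lines):
    """Collect the executor IDs captured by `pattern` across all log lines."""
    ids = set()
    for line in lines:
        match = pattern.search(line)
        if match:
            ids.add(int(match.group(1)))
    return ids


def idle_removed_duplicates(driver_lines, executor_lines):
    """Return sorted executor IDs that the driver removed as idle and that
    also reported 'Duplicate executor ID' in the executor logs."""
    removed = extract_ids(REMOVED_IDLE, driver_lines)
    duplicates = extract_ids(DUPLICATE_ID, executor_lines)
    return sorted(removed & duplicates)


# Sample input: the two log lines quoted in this report.
driver_log = [
    "17/09/28 04:18:21 INFO ExecutorAllocationManager: Removing executor 122 "
    "because it has been idle for 60 seconds (new desired total will be 133)",
]
executor_log = [
    "17/09/28 04:18:26 ERROR CoarseGrainedExecutorBackend: "
    "Slave registration failed: Duplicate executor ID: 122",
]

print(idle_removed_duplicates(driver_log, executor_log))
```

Run against the full driver and container logs, a non-empty result would suggest that the "Duplicate executor ID" failures line up with idle-timeout removals rather than with genuinely duplicated executors.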