Hua Liu created SPARK-20564:
-------------------------------

Summary: A lot of executor failures when the executor number is more than 2000
Key: SPARK-20564
URL: https://issues.apache.org/jira/browse/SPARK-20564
Project: Spark
Issue Type: Improvement
Components: Deploy
Affects Versions: 2.1.0, 1.6.2
Reporter: Hua Liu
When we ran more than 2000 executors in a Spark application, we noticed that a large number of executors could not connect to the driver and were consequently marked as failed. In some cases the number of failed executors reached twice the requested executor count, so applications retried and sometimes eventually failed.

The root cause is that YarnAllocator requests all missing containers every spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, YarnAllocator can ask for and receive 2000 containers in a single request and then launch them all at once. These thousands of executors then try to retrieve Spark properties from, and register with, the driver. However, the driver handles executor registration, stop, removal, and Spark property retrieval in a single thread, and it cannot process such a large number of RPCs within a short period of time. As a result, some executors fail to retrieve Spark properties and/or to register. These executors are marked as failed, which triggers executor removal and further overloads the driver, causing even more executor failures.

This patch adds an extra configuration, spark.yarn.launchContainer.count.simultaneously, which caps the maximum number of containers the driver can request and launch per spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of executors grows steadily, the number of executor failures is reduced, and applications reach the desired number of executors faster.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
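The throttling behavior described above can be illustrated with a minimal sketch. This is not Spark's actual YarnAllocator code; the function names and the simulation are hypothetical, and the sketch only assumes the semantics stated in the issue: each heartbeat cycle requests at most a capped number of the missing containers instead of all of them.

```python
# Hypothetical sketch (not Spark source): cap how many containers are
# requested per heartbeat, as spark.yarn.launchContainer.count.simultaneously
# is described to do.

def containers_to_request(target, launched, cap):
    """Containers to ask for in this heartbeat cycle: at most `cap`
    of the still-missing containers."""
    missing = target - launched
    return min(missing, cap) if missing > 0 else 0

def heartbeats_to_target(target, cap):
    """Simulate heartbeat cycles until the target executor count is reached."""
    launched = 0
    cycles = 0
    while launched < target:
        launched += containers_to_request(target, launched, cap)
        cycles += 1
    return cycles

# Uncapped: all 2000 containers requested in one burst.
print(containers_to_request(2000, 0, 2000))   # 2000
# Capped at 500: the request is spread over 4 heartbeat cycles.
print(heartbeats_to_target(2000, 500))        # 4
```

With the default 3-second heartbeat interval and a cap of 500, for example, 2000 executors would ramp up over roughly four cycles (about 12 seconds) rather than registering with the driver all at once, keeping the driver's single RPC-handling thread from being flooded.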