[ https://issues.apache.org/jira/browse/SPARK-20564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15996165#comment-15996165 ]
Apache Spark commented on SPARK-20564: -------------------------------------- User 'mariahualiu' has created a pull request for this issue: https://github.com/apache/spark/pull/17854 > a lot of executor failures when the executor number is more than 2000 > --------------------------------------------------------------------- > > Key: SPARK-20564 > URL: https://issues.apache.org/jira/browse/SPARK-20564 > Project: Spark > Issue Type: Improvement > Components: Deploy > Affects Versions: 1.6.2, 2.1.0 > Reporter: Hua Liu > > When we used more than 2000 executors in a spark application, we noticed a > large number of executors cannot connect to driver and as a result they were > marked as failed. In some cases, the failed executor number reached twice of > the requested executor count and thus applications retried and may eventually > fail. > This is because that YarnAllocator requests all missing containers every > spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, > YarnAllocator can ask for and get over 2000 containers in one request, and > then launch them almost simultaneously. These thousands of executors try to > retrieve spark props and register with driver within seconds. However, driver > handles executor registration, stop, removal and spark props retrieval in one > thread, and it can not handle such a large number of RPCs within a short > period of time. As a result, some executors cannot retrieve spark props > and/or register. These failed executors are then marked as failed, causing > executor removal and aggravating the overloading of driver, which leads to > more executor failures. > This patch adds an extra configuration > spark.yarn.launchContainer.count.simultaneously, which caps the maximal > number of containers that driver can ask for in every > spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of > executors grows steadily. The number of executor failures is reduced and > applications can reach the desired number of executors faster. -- This message was sent by Atlassian JIRA (v6.3.15#6346) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org