[ 
https://issues.apache.org/jira/browse/SPARK-20564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20564:
------------------------------------

    Assignee:     (was: Apache Spark)

> a lot of executor failures when the executor number is more than 2000
> ---------------------------------------------------------------------
>
>                 Key: SPARK-20564
>                 URL: https://issues.apache.org/jira/browse/SPARK-20564
>             Project: Spark
>          Issue Type: Improvement
>          Components: Deploy
>    Affects Versions: 1.6.2, 2.1.0
>            Reporter: Hua Liu
>
> When we used more than 2000 executors in a spark application, we noticed a 
> large number of executors cannot connect to driver and as a result they were 
> marked as failed. In some cases, the failed executor number reached twice of 
> the requested executor count and thus applications retried and may eventually 
> fail.
> This is because that YarnAllocator requests all missing containers every 
> spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, 
> YarnAllocator can ask for and get over 2000 containers in one request, and 
> then launch them almost simultaneously. These thousands of executors try to 
> retrieve spark props and register with driver within seconds. However, driver 
> handles executor registration, stop, removal and spark props retrieval in one 
> thread, and it can not handle such a large number of RPCs within a short 
> period of time. As a result, some executors cannot retrieve spark props 
> and/or register. These failed executors are then marked as failed, causing 
> executor removal and aggravating the overloading of driver, which leads to 
> more executor failures. 
> This patch adds an extra configuration 
> spark.yarn.launchContainer.count.simultaneously, which caps the maximal 
> number of containers that driver can ask for in every 
> spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of 
> executors grows steadily. The number of executor failures is reduced and 
> applications can reach the desired number of executors faster.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to