Hua Liu created SPARK-20564:
-------------------------------

             Summary: Large number of executor failures when the executor count 
exceeds 2000
                 Key: SPARK-20564
                 URL: https://issues.apache.org/jira/browse/SPARK-20564
             Project: Spark
          Issue Type: Improvement
          Components: Deploy
    Affects Versions: 2.1.0, 1.6.2
            Reporter: Hua Liu


When we used more than 2000 executors in a Spark application, we noticed that a 
large number of executors could not connect to the driver and were consequently 
marked as failed. In some cases, the number of failed executors reached twice 
the requested executor count, so applications retried and sometimes eventually 
failed.

This happens because YarnAllocator requests all missing containers every 
spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, 
YarnAllocator can ask for and receive 2000 containers in a single request and 
then launch them all. These thousands of executors then try to retrieve Spark 
properties and register with the driver. However, the driver handles executor 
registration, stop, removal, and Spark properties retrieval in a single thread, 
and it cannot process such a large number of RPCs within a short period of 
time. As a result, some executors fail to retrieve Spark properties and/or 
register. These executors are marked as failed, triggering executor removal, 
which further overloads the driver and causes even more executor failures.

This patch adds an extra configuration, 
spark.yarn.launchContainer.count.simultaneously, which caps the number of 
containers the driver can request and launch in each 
spark.yarn.scheduler.heartbeat.interval-ms interval. As a result, the number of 
executors grows steadily, executor failures drop, and applications reach the 
desired executor count faster.
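The throttling described above can be sketched as follows. This is a minimal 
illustration, not the actual YarnAllocator patch; the helper name and the 
maxLaunchPerHeartbeat parameter are hypothetical stand-ins for the proposed 
spark.yarn.launchContainer.count.simultaneously setting:

```scala
// Illustrative sketch of per-heartbeat container throttling (not the actual
// patch code). `maxLaunchPerHeartbeat` stands in for the proposed
// spark.yarn.launchContainer.count.simultaneously configuration value.
object ContainerThrottle {
  // Cap how many of the missing containers are requested in one heartbeat.
  def containersToRequest(missing: Int, maxLaunchPerHeartbeat: Int): Int =
    math.min(missing, math.max(0, maxLaunchPerHeartbeat))

  def main(args: Array[String]): Unit = {
    // With 2000 executors missing and a cap of 100 per 3-second heartbeat,
    // the executor count ramps up over ~20 heartbeats instead of all at once,
    // spreading registration RPCs out so the driver's single handler thread
    // is not overwhelmed.
    var missing = 2000
    var heartbeats = 0
    while (missing > 0) {
      missing -= containersToRequest(missing, 100)
      heartbeats += 1
    }
    println(heartbeats)
  }
}
```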



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
