[jira] [Updated] (SPARK-20564) a lot of executor failures when the executor number is more than 2000

2017-05-05 Thread Hua Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hua Liu updated SPARK-20564:

Priority: Minor  (was: Major)

> a lot of executor failures when the executor number is more than 2000
> ----------------------------------------------------------------------
>
> Key: SPARK-20564
> URL: https://issues.apache.org/jira/browse/SPARK-20564
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.6.2, 2.1.0
>Reporter: Hua Liu
>Priority: Minor
>
> When we used more than 2000 executors in a Spark application, we noticed that a 
> large number of executors could not connect to the driver and, as a result, were 
> marked as failed. In some cases, the number of failed executors reached twice the 
> requested executor count, so applications retried and could eventually fail.
> This happens because YarnAllocator requests all missing containers every 
> spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, 
> YarnAllocator can ask for and receive over 2000 containers in one request and 
> then launch them almost simultaneously. These thousands of executors all try to 
> retrieve the Spark properties and register with the driver within seconds. 
> However, the driver handles executor registration, stop, removal, and Spark 
> properties retrieval in a single thread, and it cannot handle such a large 
> number of RPCs in a short period of time. As a result, some executors cannot 
> retrieve the Spark properties and/or register. These executors are then marked 
> as failed, which triggers executor removal, further overloads the driver, and 
> leads to even more executor failures.
> This patch adds an extra configuration, 
> spark.yarn.launchContainer.count.simultaneously, which caps the number of 
> containers the driver can ask for in each 
> spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of 
> executors grows steadily, the number of executor failures is reduced, and 
> applications reach the desired number of executors faster.
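[Editor's note] The configuration named above is introduced only by the proposed patch and is not a standard Spark setting; the patch itself is not attached to this thread. If the patch were applied, setting it would presumably look like any other spark.yarn.* option. A minimal Scala sketch, with an arbitrary example cap of 200 containers per round:

  import org.apache.spark.SparkConf

  // Hypothetical usage: the last key exists only if the patch from this issue
  // is applied; it is not part of stock Spark 1.6.x / 2.1.x.
  val conf = new SparkConf()
    .setAppName("large-executor-job")
    .set("spark.executor.instances", "3000")
    .set("spark.yarn.scheduler.heartbeat.interval-ms", "3000")
    .set("spark.yarn.launchContainer.count.simultaneously", "200")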



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20564) a lot of executor failures when the executor number is more than 2000

2017-05-02 Thread Hua Liu (JIRA)
Hua Liu created SPARK-20564:
---

 Summary: a lot of executor failures when the executor number is 
more than 2000
 Key: SPARK-20564
 URL: https://issues.apache.org/jira/browse/SPARK-20564
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 2.1.0, 1.6.2
Reporter: Hua Liu


When we used more than 2000 executors in a Spark application, we noticed that a 
large number of executors could not connect to the driver and, as a result, were 
marked as failed. In some cases, the number of failed executors reached twice the 
requested executor count, so applications retried and could eventually fail.

This happens because YarnAllocator requests all missing containers every 
spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, 
YarnAllocator can ask for and receive 2000 containers in one request and then 
launch them. These thousands of executors all try to retrieve the Spark 
properties and register with the driver. However, the driver handles executor 
registration, stop, removal, and Spark properties retrieval in a single thread, 
and it cannot handle such a large number of RPCs in a short period of time. As a 
result, some executors cannot retrieve the Spark properties and/or register. 
These executors are then marked as failed, which triggers executor removal, 
further overloads the driver, and causes more executor failures.

This patch adds an extra configuration, 
spark.yarn.launchContainer.count.simultaneously, which caps the number of 
containers the driver can ask for and launch in each 
spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of executors 
grows steadily, the number of executor failures is reduced, and applications 
reach the desired number of executors faster.
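[Editor's note] The mechanism above is described only in prose here. As a rough illustration of the proposed throttling, the following minimal Scala sketch caps how many containers are requested per heartbeat round. All names (ThrottledAllocator, maxLaunchPerHeartbeat, containersToRequest) are invented for illustration and do not correspond to the actual YarnAllocator code or to the patch.

  // Illustrative sketch only, not the real allocator.
  class ThrottledAllocator(targetExecutors: Int, maxLaunchPerHeartbeat: Int) {
    // Containers already requested or running (simplified bookkeeping).
    private var runningOrPending: Int = 0

    // Called once per heartbeat round; returns how many containers to ask YARN for now.
    def containersToRequest(): Int = {
      val missing = targetExecutors - runningOrPending
      val toRequest = math.max(0, math.min(missing, maxLaunchPerHeartbeat))
      runningOrPending += toRequest
      toRequest
    }
  }

  // Example: 3000 desired executors, at most 200 new containers per 3-second round,
  // so the executor count ramps up over ~15 rounds instead of in a single burst.
  // val allocator = new ThrottledAllocator(targetExecutors = 3000, maxLaunchPerHeartbeat = 200)
  // allocator.containersToRequest()  // 200 per round until the target is reached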



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org