[ 
https://issues.apache.org/jira/browse/SPARK-12516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-12516:
--------------------------------
    Description: 
Failure of a NodeManager (NM) makes all the executors running on that NM exit 
silently.

In the current implementation of YarnSchedulerBackend, the driver receives an 
onDisconnected event when an executor is lost and then asks the AM for the loss 
reason. The AM holds that query open until the RM reports back the status of the 
lost container, and then replies to the driver. When an NM fails, the RM cannot 
detect the failure until the NM liveness timeout expires (10 minutes by default), 
so the driver's loss-reason query times out first (120 seconds). After that 
timeout the executor state on the driver side is cleaned up, but the AM keeps its 
state until the NM heartbeat timeout. This can introduce the unexpected behaviors 
described in the two scenarios below.
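
Both timeouts mentioned above are configurable. The snippet below is only a 
reference sketch of where each one lives, assuming the usual defaults; it is not 
a fix, and the YARN property is set in yarn-site.xml rather than through Spark.

{code:scala}
import org.apache.spark.SparkConf

// Illustration only, assuming default values for both settings.
object LossReasonTimeouts {
  // Driver-side ask timeout: bounds how long the driver waits for the AM's
  // loss-reason reply. spark.rpc.askTimeout falls back to spark.network.timeout,
  // whose default is 120s.
  val conf = new SparkConf().set("spark.rpc.askTimeout", "120s")

  // RM-side NM liveness timeout, configured in yarn-site.xml (shown as a note):
  //   yarn.nm.liveness-monitor.expiry-interval-ms = 600000   (10 minutes)
}
{code}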

---

* When dynamic allocation is disabled, the executor count on the driver side is 
lower than the count on the AM side during the window between the two timeouts 
(from 120 seconds to 10 minutes), and cannot be ramped back up to the expected 
number until the RM detects the NM failure and marks the related containers as 
completed. A small sketch of this divergence follows the example below.

{quote}
For example, the target executor number is 10, spread over 5 NMs (2 executors 
per NM). When 1 NM fails, its 2 executors are lost. After the driver-side query 
timeout, the executor count on the driver side is 8, but on the AM side it is 
still 10, so the AM will not request additional containers until its own count 
drops to 8 (after 10 minutes).
{quote}
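
A tiny, self-contained Scala sketch of that bookkeeping divergence (plain 
arithmetic, not Spark code; all names are made up for illustration):

{code:scala}
// Static allocation example from above: target 10 executors, 5 NMs x 2 each.
object StaticAllocationDivergence extends App {
  val target = 10
  val executorsPerNm = 2

  var driverCount = target   // executors the driver believes are alive
  var amCount     = target   // executors the AM believes are alive

  def amRequests: Int = target - amCount   // extra containers the AM will ask for

  // t = 120s: the driver's loss-reason query times out and it drops 2 executors.
  driverCount -= executorsPerNm
  println(s"after 120s : driver=$driverCount, am=$amCount, AM requests $amRequests")
  // -> driver=8, am=10, AM requests 0 (it still thinks everything is running)

  // t = 10min: the RM declares the NM dead and reports its containers completed.
  amCount -= executorsPerNm
  println(s"after 10min: driver=$driverCount, am=$amCount, AM requests $amRequests")
  // -> driver=8, am=8, AM requests 2 and the application recovers
}
{code}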

---

* When dynamic allocation is enabled, the target executor number is maintained 
on both the driver and the AM side and synced between them. The driver-side count 
is correct again after the driver query timeout (120 seconds), but the AM-side 
count stays wrong until the NM failure is detected (10 minutes). In that window 
the actual executor number is less than the calculated one, as sketched after 
the example below.

{quote}
For example, the current executor count on the driver side is N and on the AM 
side it is M, so M - N executors have been silently lost.

When the executor number needs to ramp up to A, the actual number obtained will 
be A - (M - N).

When the executor number needs to be brought down to B, the actual number will 
be max(0, B - (M - N)). When the actual number of executors reaches 0, the whole 
application hangs, and it only recovers if the driver requests more resources or 
after the 10-minute timeout.
{quote}
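
The same arithmetic as a runnable sketch (illustration only; the helper name is 
made up):

{code:scala}
// N = executors the driver knows about, M = executors the AM still assumes
// alive, so M - N executors are silently lost during the divergence window.
object DynamicAllocationShortfall extends App {
  /** Executors actually obtained when the driver asks the AM for `requested`. */
  def actualExecutors(requested: Int, driverCount: Int, amCount: Int): Int = {
    val lost = amCount - driverCount          // M - N
    math.max(0, requested - lost)
  }

  val (n, m) = (8, 10)                        // driver sees 8, AM still sees 10
  println(actualExecutors(12, n, m))          // ramp up to 12  -> only 10 arrive
  println(actualExecutors(2, n, m))           // ramp down to 2 -> 0, the app hangs
}
{code}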
---

Possible solutions:

* Sync the actual executor number from the driver to the AM after the RPC 
timeout (120 seconds), and also clean up the related state in the AM.
* Whenever the loss reason of an executor is queried, check the status of the 
relevant NM first; if the NM connection is lost, directly mark the related 
containers as failed (a rough sketch of this idea follows).
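
A rough, hedged sketch of that second idea, not the actual Spark or YARN API: 
probe the hosting NM before blocking on the RM, and fail fast if it is 
unreachable. probeNodeManager and askRm below are hypothetical placeholders.

{code:scala}
object FailFastLossReason {
  sealed trait LossReason
  case object NodeManagerLost extends LossReason             // NM gone: fail the container now
  case class ReportedByRm(exitStatus: Int) extends LossReason

  // probeNodeManager and askRm are hypothetical stand-ins for the real calls.
  def lossReasonFor(containerHost: String,
                    probeNodeManager: String => Boolean,
                    askRm: () => Int): LossReason = {
    if (!probeNodeManager(containerHost)) NodeManagerLost    // no 120s wait, no stale AM state
    else ReportedByRm(askRm())                               // normal path via the RM report
  }
}
{code}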


> Properly handle NM failure situation for Spark on Yarn
> ------------------------------------------------------
>
>                 Key: SPARK-12516
>                 URL: https://issues.apache.org/jira/browse/SPARK-12516
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.6.0
>            Reporter: Saisai Shao
>


