[ https://issues.apache.org/jira/browse/YARN-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hao Zhu updated YARN-3571:
--------------------------
Description:

A detailed analysis is in item "3 Will AM re-blacklist NMs after ignoring-blacklist event happens?" of this post: http://www.openkb.info/2015/05/when-will-application-master-blacklist.html

The current behavior is: if a NodeManager has ever been blacklisted before, it will not be blacklisted again after the ignore-blacklist event happens; otherwise, it will be blacklisted. However, I think the right behavior should be: the AM can re-blacklist NMs even after ignoring-blacklist has happened once.

The relevant logic is in containerFailedOnHost(String hostName) of RMContainerRequestor.java:

{code}
protected void containerFailedOnHost(String hostName) {
  if (!nodeBlacklistingEnabled) {
    return;
  }
  if (blacklistedNodes.contains(hostName)) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Host " + hostName + " is already blacklisted.");
    }
    return; //already blacklisted
  }
  // ... (rest of the method elided)
{code}

The reason for this behavior is described in item 2 of the same post: when the ignoring-blacklist event happens, the AM only asks the RM to clear "blacklistAdditions"; it does not clear its own "blacklistedNodes" variable. This behavior may cause the whole job/application to fail if a previously blacklisted NM is released from the blacklist after the ignoring-blacklist event happens.

Imagine a serial murderer being released from prison just because the prison is 33% full, and, horribly, he/she can never be put in prison again; only new murderers will be imprisoned.

Example to reproduce:

Test 1: One node (h4) has an issue; the other 3 nodes are healthy.
The job failed with the AM logs below:

{code}
[root@h1 container_1430425729977_0006_01_000001]# egrep -i 'failures on node|blacklist|FATAL' syslog
2015-05-02 18:38:41,246 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: nodeBlacklistingEnabled:true
2015-05-02 18:38:41,246 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: blacklistDisablePercent is 1
2015-05-02 18:39:07,249 FATAL [IPC Server handler 3 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000002_0 - exited : java.io.IOException: Spill failed
2015-05-02 18:39:07,297 INFO [Thread-49] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: 1 failures on node h4.poc.com
2015-05-02 18:39:07,950 FATAL [IPC Server handler 16 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000008_0 - exited : java.io.IOException: Spill failed
2015-05-02 18:39:07,954 INFO [Thread-49] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: 2 failures on node h4.poc.com
2015-05-02 18:39:08,148 FATAL [IPC Server handler 17 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000007_0 - exited : java.io.IOException: Spill failed
2015-05-02 18:39:08,152 INFO [Thread-49] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: 3 failures on node h4.poc.com
2015-05-02 18:39:08,152 INFO [Thread-49] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: Blacklisted host h4.poc.com
2015-05-02 18:39:08,561 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: Update the blacklist for application_1430425729977_0006: blacklistAdditions=1 blacklistRemovals=0
2015-05-02 18:39:08,561 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: Ignore blacklisting set to true. Known: 4, Blacklisted: 1, 25%
2015-05-02 18:39:09,563 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: Update the blacklist for application_1430425729977_0006: blacklistAdditions=0 blacklistRemovals=1
2015-05-02 18:39:32,912 FATAL [IPC Server handler 19 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000002_1 - exited : java.io.IOException: Spill failed
2015-05-02 18:39:35,076 FATAL [IPC Server handler 1 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000009_0 - exited : java.io.IOException: Spill failed
2015-05-02 18:39:35,133 FATAL [IPC Server handler 5 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000008_1 - exited : java.io.IOException: Spill failed
2015-05-02 18:39:57,308 FATAL [IPC Server handler 17 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000002_2 - exited : java.io.IOException: Spill failed
2015-05-02 18:40:00,174 FATAL [IPC Server handler 10 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000009_1 - exited : java.io.IOException: Spill failed
2015-05-02 18:40:00,227 FATAL [IPC Server handler 12 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000007_1 - exited : java.io.IOException: Spill failed
2015-05-02 18:40:22,905 FATAL [IPC Server handler 3 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000018_0 - exited : java.io.IOException: Spill failed
2015-05-02 18:40:24,413 FATAL [IPC Server handler 19 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000009_2 - exited : java.io.IOException: Spill failed
2015-05-02 18:40:26,086 FATAL [IPC Server handler 16 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000002_3 - exited : java.io.IOException: Spill failed
{code}

From the above logs, we can see that node h4 got blacklisted after 3 task failures. Immediately after that, the ignoring-blacklist event happened, and node h4 could never be blacklisted again. When task 1430425729977_0006_m_000002 had failed 4 times, the whole job failed.
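The proposed behavior can be illustrated with a minimal, hypothetical sketch of the AM-side bookkeeping. The class, the onIgnoreBlacklisting() method, and the failure threshold below are illustrative and not the real RMContainerRequestor API; the point is only that clearing the local blacklistedNodes set (and failure counts) when the ignore-blacklist event fires would let the AM re-blacklist a node that keeps failing:

{code}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the AM-side blacklist bookkeeping.
public class BlacklistTracker {
    private static final int MAX_TASK_FAILURES_PER_NODE = 3; // illustrative threshold
    private final Map<String, Integer> failures = new HashMap<>();
    private final Set<String> blacklistedNodes = new HashSet<>();

    // Called when a container fails on a host; blacklists after the threshold.
    public void containerFailedOnHost(String hostName) {
        if (blacklistedNodes.contains(hostName)) {
            return; // already blacklisted
        }
        int count = failures.merge(hostName, 1, Integer::sum);
        if (count >= MAX_TASK_FAILURES_PER_NODE) {
            blacklistedNodes.add(hostName);
        }
    }

    // Proposed fix: when the ignore-blacklist event fires, clear the local
    // bookkeeping too, so a node can be re-blacklisted if it keeps failing.
    public void onIgnoreBlacklisting() {
        blacklistedNodes.clear();
        failures.clear();
    }

    public boolean isBlacklisted(String hostName) {
        return blacklistedNodes.contains(hostName);
    }

    public static void main(String[] args) {
        BlacklistTracker t = new BlacklistTracker();
        for (int i = 0; i < 3; i++) t.containerFailedOnHost("h4.poc.com");
        System.out.println(t.isBlacklisted("h4.poc.com")); // true: blacklisted after 3 failures
        t.onIgnoreBlacklisting();                          // ignore-blacklist event fires
        for (int i = 0; i < 3; i++) t.containerFailedOnHost("h4.poc.com");
        System.out.println(t.isBlacklisted("h4.poc.com")); // true: re-blacklisted under the fix
    }
}
{code}

With the current code, the second println would effectively print false, because blacklistedNodes is never cleared and the early return in containerFailedOnHost hides the new failures.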
> AM does not re-blacklist NMs after ignoring-blacklist event happens?
> --------------------------------------------------------------------
>
>                 Key: YARN-3571
>                 URL: https://issues.apache.org/jira/browse/YARN-3571
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager, resourcemanager
>    Affects Versions: 2.5.1
>            Reporter: Hao Zhu

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)