Hao Zhu created YARN-3571:
-----------------------------

             Summary: AM does not re-blacklist NMs after ignoring-blacklist event happens?
                 Key: YARN-3571
                 URL: https://issues.apache.org/jira/browse/YARN-3571
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager, resourcemanager
    Affects Versions: 2.5.1
            Reporter: Hao Zhu
A detailed analysis is in item "3 Will AM re-blacklist NMs after ignoring-blacklist event happens?" of the link below:
http://www.openkb.info/2015/05/when-will-application-master-blacklist.html

The current behavior is: if a NodeManager has ever been blacklisted before, it will not be blacklisted again after ignore-blacklisting happens; otherwise, it will be blacklisted.

The code logic is in the function containerFailedOnHost(String hostName) of RMContainerRequestor.java:
{code}
protected void containerFailedOnHost(String hostName) {
  if (!nodeBlacklistingEnabled) {
    return;
  }
  if (blacklistedNodes.contains(hostName)) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Host " + hostName + " is already blacklisted.");
    }
    return; //already blacklisted
  }
  // ... (rest of the method elided)
}
{code}

The reason for this behavior is in item 2 of the same link: when ignore-blacklisting happens, the AM only asks the RM to clear "blacklistAdditions"; it does not clear its own "blacklistedNodes" variable.

This behavior may cause the whole job/application to fail if the previously blacklisted NM is released after the ignore-blacklisting event happens. Imagine a serial murderer released from prison just because the prison is 33% full, and, horrifyingly, he/she can never be put in prison again. Only new murderers will be put in prison.

Example to prove it:
Test 1: One node (h4) has an issue; the other 3 nodes are healthy.
The job failed with the below AM logs:
{code}
[root@h1 container_1430425729977_0006_01_000001]# egrep -i 'failures on node|blacklist|FATAL' syslog
2015-05-02 18:38:41,246 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: nodeBlacklistingEnabled:true
2015-05-02 18:38:41,246 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: blacklistDisablePercent is 1
2015-05-02 18:39:07,249 FATAL [IPC Server handler 3 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000002_0 - exited : java.io.IOException: Spill failed
2015-05-02 18:39:07,297 INFO [Thread-49] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: 1 failures on node h4.poc.com
2015-05-02 18:39:07,950 FATAL [IPC Server handler 16 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000008_0 - exited : java.io.IOException: Spill failed
2015-05-02 18:39:07,954 INFO [Thread-49] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: 2 failures on node h4.poc.com
2015-05-02 18:39:08,148 FATAL [IPC Server handler 17 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000007_0 - exited : java.io.IOException: Spill failed
2015-05-02 18:39:08,152 INFO [Thread-49] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: 3 failures on node h4.poc.com
2015-05-02 18:39:08,152 INFO [Thread-49] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: Blacklisted host h4.poc.com
2015-05-02 18:39:08,561 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: Update the blacklist for application_1430425729977_0006: blacklistAdditions=1 blacklistRemovals=0
2015-05-02 18:39:08,561 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: Ignore blacklisting set to true. Known: 4, Blacklisted: 1, 25%
2015-05-02 18:39:09,563 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: Update the blacklist for application_1430425729977_0006: blacklistAdditions=0 blacklistRemovals=1
2015-05-02 18:39:32,912 FATAL [IPC Server handler 19 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000002_1 - exited : java.io.IOException: Spill failed
2015-05-02 18:39:35,076 FATAL [IPC Server handler 1 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000009_0 - exited : java.io.IOException: Spill failed
2015-05-02 18:39:35,133 FATAL [IPC Server handler 5 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000008_1 - exited : java.io.IOException: Spill failed
2015-05-02 18:39:57,308 FATAL [IPC Server handler 17 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000002_2 - exited : java.io.IOException: Spill failed
2015-05-02 18:40:00,174 FATAL [IPC Server handler 10 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000009_1 - exited : java.io.IOException: Spill failed
2015-05-02 18:40:00,227 FATAL [IPC Server handler 12 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000007_1 - exited : java.io.IOException: Spill failed
2015-05-02 18:40:22,905 FATAL [IPC Server handler 3 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000018_0 - exited : java.io.IOException: Spill failed
2015-05-02 18:40:24,413 FATAL [IPC Server handler 19 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000009_2 - exited : java.io.IOException: Spill failed
2015-05-02 18:40:26,086 FATAL [IPC Server handler 16 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_000002_3 - exited : java.io.IOException: Spill failed
{code}

From the above logs, we can see that node h4 got blacklisted after 3 task failures. Immediately after that, the ignore-blacklisting event happened, and node h4 was then never blacklisted again. When task attempt 1430425729977_0006_m_000002 failed 4 times, the whole job failed.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
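The failure mode shown in the logs can be sketched as a minimal stand-alone simulation. This is a hypothetical class whose names loosely mirror RMContainerRequestor, not the actual MapReduce code; it only illustrates why an un-cleared "blacklistedNodes" set prevents re-blacklisting after ignore-blacklisting:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical simulation of the AM-side blacklist state (NOT the real class).
public class BlacklistSimulation {
    static final int MAX_FAILURES = 3;                 // failure threshold per node
    Set<String> blacklistedNodes = new HashSet<>();    // AM-side memory: never cleared
    Set<String> blacklistAdditions = new HashSet<>();  // pending additions to report to the RM
    Map<String, Integer> failures = new HashMap<>();

    void containerFailedOnHost(String hostName) {
        if (blacklistedNodes.contains(hostName)) {
            return; // "already blacklisted" short-circuit, even after ignore-blacklisting
        }
        int count = failures.merge(hostName, 1, Integer::sum);
        if (count >= MAX_FAILURES) {
            blacklistedNodes.add(hostName);
            blacklistAdditions.add(hostName); // would be sent to the RM on the next heartbeat
        }
    }

    void ignoreBlacklisting() {
        // The reported bug: only the pending additions are cleared;
        // blacklistedNodes keeps the host, so it can never be re-added.
        blacklistAdditions.clear();
    }

    public static void main(String[] args) {
        BlacklistSimulation am = new BlacklistSimulation();
        for (int i = 0; i < 3; i++) {
            am.containerFailedOnHost("h4.poc.com");    // 3 failures -> blacklisted
        }
        System.out.println("blacklisted after 3 failures: "
            + am.blacklistedNodes.contains("h4.poc.com"));
        am.ignoreBlacklisting();                       // 25% blacklisted crosses the 33% threshold
        am.containerFailedOnHost("h4.poc.com");        // 4th failure on the same node
        System.out.println("re-reported to RM: "
            + am.blacklistAdditions.contains("h4.poc.com"));
    }
}
```

Running this prints "blacklisted after 3 failures: true" and then "re-reported to RM: false": after ignore-blacklisting, the fourth failure on h4 hits the early return and the node is never reported to the RM again, which matches the logs above.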