[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-10-07 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209597#comment-17209597
 ] 

Jim Brennan commented on YARN-10393:


Thanks [~adam.antal]!

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.0.4, 3.2.2, 3.4.0, 3.3.1, 3.1.5, 2.10.2
>
> Attachments: YARN-10393-branch-2.10.001.patch, 
> YARN-10393-branch-2.10.001.patch, YARN-10393.001.patch, YARN-10393.002.patch, 
> YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This is a bug we saw multiple times on Hadoop 2.6.2, and the following 
> analysis is based on the core dump, logs, and code from 2017 with Hadoop 2.6.2. 
> We have not seen it since 2.9 in our environment, but that is largely because 
> of the RPC retry policy change and other changes; unless I have missed 
> something, it is still possible with the current code.
> *High-level description:*
>  We have seen a starving-mapper issue several times: the MR job was stuck in a 
> livelock and could not make any progress. The queue was full, so the pending 
> mapper could not get any resources to continue, and the application master 
> failed to preempt the reducer, leaving the job stuck. The application master 
> did not preempt the reducer because a leaked container remained in its set of 
> assigned mappers: the node manager had failed to report the completed 
> container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. The heartbeat request was in fact handled by the resource 
> manager, but the node manager failed to receive the response. Let's assume 
> heartBeatResponseId=$hid in the node manager. With our current configuration, 
> the next heartbeat will be sent 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileD
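
The step above hinges on the responseId handshake between the node manager's status updater and the resource manager. Below is a minimal, self-contained sketch of that handshake under simplified assumptions; the classes ResourceTrackerStub, StatusUpdaterStub and HeartbeatHandshakeDemo are hypothetical stand-ins, not the actual NodeStatusUpdaterImpl / ResourceTrackerService code. It illustrates the window the report describes: the RM handles a heartbeat, the response is lost to a connection reset, and the NM's retry with the stale responseId is treated as a duplicate, so any payload carried only on that retry is ignored.

{code:java}
// Hypothetical stand-ins only; these are NOT the Hadoop NodeStatusUpdaterImpl /
// ResourceTrackerService implementations.
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

class NodeHeartbeatResponse {
  final int responseId;
  NodeHeartbeatResponse(int responseId) { this.responseId = responseId; }
}

/** Simplified RM side of the node heartbeat. */
class ResourceTrackerStub {
  private int lastResponseId = 0;
  private NodeHeartbeatResponse lastResponse = new NodeHeartbeatResponse(0);

  synchronized NodeHeartbeatResponse nodeHeartbeat(int nmResponseId,
      List<String> completedContainers) {
    if (nmResponseId == lastResponseId) {
      // Normal case: consume the completed-container report and bump the id.
      System.out.println("RM processed: " + completedContainers);
      lastResponseId++;
      lastResponse = new NodeHeartbeatResponse(lastResponseId);
      return lastResponse;
    }
    if (nmResponseId + 1 == lastResponseId) {
      // The NM is repeating a heartbeat the RM already handled (the NM never
      // saw the reply): return the cached response and ignore the new payload.
      System.out.println("RM saw duplicate heartbeat, ignored: " + completedContainers);
      return lastResponse;
    }
    // A real RM would ask the NM to resync/re-register at this point.
    throw new IllegalStateException("responseId too far behind");
  }
}

/** Simplified NM status-updater loop (runs every ~10s per the report above). */
class StatusUpdaterStub {
  int heartBeatResponseId = 0;   // the $hid mentioned in the description

  void heartbeatOnce(ResourceTrackerStub rm, List<String> completed,
      boolean responseLost) {
    try {
      NodeHeartbeatResponse resp = rm.nodeHeartbeat(heartBeatResponseId, completed);
      if (responseLost) {
        // "Connection reset by peer": the RM handled the request, but the
        // response never reached the NM, so $hid is not advanced here.
        throw new IOException("Connection reset by peer");
      }
      heartBeatResponseId = resp.responseId;
    } catch (IOException e) {
      // The next heartbeat, ~10s later, reuses the stale responseId and will
      // hit the duplicate-heartbeat branch on the RM side.
    }
  }
}

public class HeartbeatHandshakeDemo {
  public static void main(String[] args) {
    ResourceTrackerStub rm = new ResourceTrackerStub();
    StatusUpdaterStub nm = new StatusUpdaterStub();

    // Heartbeat whose response is lost: the RM processes it, the NM keeps $hid.
    nm.heartbeatOnce(rm, Arrays.asList("containerA"), true);
    // Retry with the stale $hid: whatever this heartbeat carries is ignored.
    nm.heartbeatOnce(rm, Arrays.asList("containerB"), false);
  }
}
{code}

In the real system the bookkeeping around completed containers is more involved than this, so treat the sketch only as an illustration of how a lost heartbeat response can open a window in which a completed-container report is dropped.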

[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-10-07 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209595#comment-17209595
 ] 

Adam Antal commented on YARN-10393:
---

Committed to branch-2.10. Thanks [~Jim_Brennan].


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-10-07 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209517#comment-17209517
 ] 

Hadoop QA commented on YARN-10393:
--

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 20m 
39s{color} |  | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} |  | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} |  | {color:green} The patch does not contain any @author tags. 
{color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} |  | {color:green} The patch appears to include 1 new or modified 
test files. {color} |
|| || || || {color:brown} branch-2.10 Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 
 1s{color} |  | {color:green} branch-2.10 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
25s{color} |  | {color:green} branch-2.10 passed with JDK Oracle 
Corporation-1.7.0_95-b00 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
19s{color} |  | {color:green} branch-2.10 passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~16.04-b01 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
46s{color} |  | {color:green} branch-2.10 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
3s{color} |  | {color:green} branch-2.10 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
40s{color} |  | {color:green} branch-2.10 passed with JDK Oracle 
Corporation-1.7.0_95-b00 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
38s{color} |  | {color:green} branch-2.10 passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~16.04-b01 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  2m  
6s{color} |  | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
3s{color} |  | {color:green} branch-2.10 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 1s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
23s{color} |  | {color:green} the patch passed with JDK Oracle 
Corporation-1.7.0_95-b00 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
23s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
7s{color} |  | {color:green} the patch passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~16.04-b01 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
7s{color} |  | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 25s{color} | 
[/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt|https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/217/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt]
 | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:
 The patch generated 3 new + 115 unchanged - 0 fixed = 118 total (was 115) 
{color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
46s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} |  | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
30s{color} |  | {color:green} the patch passed with JDK Oracle 
Corporation-1.7.0_95-b00 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
26s{color} |  | {color:green} the patch passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~16.04-b01 {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
55s{color} |  | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} || ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 18m 
39s{color} |  | {color:green} hadoop-yarn-server-nodemanager in the pat

[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-10-07 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209483#comment-17209483
 ] 

Adam Antal commented on YARN-10393:
---

Reuploaded the patch for branch-2.10; pending Jenkins.


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-10-06 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208867#comment-17208867
 ] 

Jim Brennan commented on YARN-10393:


[~adam.antal] I have uploaded a patch for branch-2.10.


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-10-05 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208236#comment-17208236
 ] 

Jim Brennan commented on YARN-10393:


I will upload a patch for branch-2.10.

 


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-10-05 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208104#comment-17208104
 ] 

Jim Brennan commented on YARN-10393:


Thanks [~adam.antal] for committing, and thanks [~wzzdreamer] for finding this 
bug!


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-10-05 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17207941#comment-17207941
 ] 

Adam Antal commented on YARN-10393:
---

Cherry-picked to branch-3.3, branch-3.2, branch-3.1, branch-3.0. I had 
conflicts with the branch-2.10 patch; if you'd like to commit to that branch, 
please upload a patch for it.

Thanks for the conversation and the solution [~Jim_Brennan] [~wzzdreamer] 
[~yuanbo].


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-10-05 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17207912#comment-17207912
 ] 

Adam Antal commented on YARN-10393:
---

Committed to trunk; I will cherry-pick this to the other branches now.


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-10-04 Thread Yuanbo Liu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17207806#comment-17207806
 ] 

Yuanbo Liu commented on YARN-10393:
---

+1


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-10-04 Thread zhenzhao wang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17207797#comment-17207797
 ] 

zhenzhao wang commented on YARN-10393:
--

+1, LGTM, thanks.


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-10-01 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205385#comment-17205385
 ] 

Adam Antal commented on YARN-10393:
---

Agreed, +1 to v2 patch.

Any comments [~yuanbo], [~wzzdreamer]?


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-30 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204926#comment-17204926
 ] 

Hadoop QA commented on YARN-10393:
--

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  2m 
10s{color} |  | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} |  | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} |  | {color:green} The patch does not contain any @author tags. 
{color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} |  | {color:green} The patch appears to include 1 new or modified 
test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 38m 
 9s{color} |  | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
48s{color} |  | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
46s{color} |  | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
40s{color} |  | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
4s{color} |  | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
24m  7s{color} |  | {color:green} branch has no errors when building and 
testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
53s{color} |  | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
49s{color} |  | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  2m 
19s{color} |  | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
15s{color} |  | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
57s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
53s{color} |  | {color:green} the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
53s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
44s{color} |  | {color:green} the patch passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
44s{color} |  | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 35s{color} | 
[/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt|https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/205/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt]
 | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:
 The patch generated 1 new + 120 unchanged - 0 fixed = 121 total (was 120) 
{color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
54s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} |  | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
18m 54s{color} |  | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
40s{color} |  | {color:green} the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
42s{color} |  | {color:green} the patch passed with JDK Private 
Build-

[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-30 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204821#comment-17204821
 ] 

Jim Brennan commented on YARN-10393:


[~adam.antal] patch 002 adds the log line you suggested.

I did not try to fix this checkstyle issue:
{noformat}
./hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java:1300:
@Override:5: Method length is 161 lines (max allowed is 150). [MethodLength]
{noformat}
I don't think it makes sense to move part of this change out of that method, nor to shrink the multi-line comment I added. I also think it is beyond the scope of this change to refactor the method just to reduce its length.
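
As a rough illustration of the guarded clear plus the suggested log line, here is a self-contained sketch. The names {{missedHeartbeat}} and {{pendingCompletedContainers}} follow the discussion in this thread; this is not the real {{NodeStatusUpdaterImpl}}, whose run() method is far longer (which is what trips the MethodLength check above).
{code:java}
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.logging.Logger;

/** Toy model of the guarded clear discussed in this thread (sketch only). */
public class StatusUpdaterSketch {
  private static final Logger LOG = Logger.getLogger("StatusUpdaterSketch");

  private final Set<String> pendingCompletedContainers = new HashSet<>();
  private boolean missedHeartbeat = false;

  /** One iteration of the heartbeat loop. */
  void heartbeatOnce(Set<String> newlyCompleted) {
    // Keep accumulating completed containers until the RM has seen them.
    pendingCompletedContainers.addAll(newlyCompleted);
    try {
      sendHeartbeat(pendingCompletedContainers);
      if (!missedHeartbeat) {
        // The previous response was received, so these containers were reported.
        pendingCompletedContainers.clear();
      } else {
        // The previous response was lost and the RM may have treated this
        // heartbeat as a duplicate, so keep the containers and resend them.
        LOG.info("Skip clearing pending completed containers due to missed"
            + " heartbeat.");
      }
      missedHeartbeat = false;
    } catch (IOException e) {
      // The response never arrived; we do not know what the RM processed.
      missedHeartbeat = true;
    }
  }

  /** Stand-in for the real nodeHeartbeat RPC; may fail with a reset connection. */
  private void sendHeartbeat(Set<String> completed) throws IOException {
    // no-op in this sketch
  }
}
{code}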



[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-30 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204739#comment-17204739
 ] 

Jim Brennan commented on YARN-10393:


Thanks [~adam.antal]!  I will update the patch.  Sorry for missing the logging 
- I should have read back through the earlier comments.


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-30 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204736#comment-17204736
 ] 

Adam Antal commented on YARN-10393:
---

Hi [~Jim_Brennan],

Thanks for the patch, overall it looks good. As I wrote in my earlier review, I'd suggest adding some extra logging when we skip a heartbeat to make it explicit, and there's one last checkstyle issue to be taken care of. Otherwise we're fine.


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-29 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204370#comment-17204370
 ] 

Hadoop QA commented on YARN-10393:
--

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
45s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} No case conflicting files 
found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} The patch does not contain any 
@author tags. {color} |
| {color:green}+1{color} | {color:green} {color} | {color:green}  0m  0s{color} 
| {color:green}test4tests{color} | {color:green} The patch appears to include 1 
new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 
 4s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
22s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
21s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
37s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
49s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m  8s{color} | {color:green}{color} | {color:green} branch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
38s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
36s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m 
30s{color} | {color:blue}{color} | {color:blue} Used deprecated FindBugs 
config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
28s{color} | {color:green}{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
43s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
14s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
14s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
8s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
8s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 29s{color} | 
{color:orange}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/203/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt{color}
 | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:
 The patch generated 1 new + 120 unchanged - 0 fixed = 121 total (was 120) 
{color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
39s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace 
issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 52s{color} | {color:green}{color} | {color:green} patch has no errors when 
building and testing our client artif

[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-29 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204272#comment-17204272
 ] 

Jim Brennan commented on YARN-10393:


Thanks [~wzzdreamer]!  I will put up a full patch soon.


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-26 Thread zhenzhao wang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17202738#comment-17202738
 ] 

zhenzhao wang commented on YARN-10393:
--

[~Jim_Brennan] Also, feel free to reassign the ticket to yourself if you are interested - you have been contributing more to the discussion and the solution recently.


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-26 Thread zhenzhao wang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17202737#comment-17202737
 ] 

zhenzhao wang commented on YARN-10393:
--

[~Jim_Brennan] Sorry, I missed the message. Thanks a lot for all the discussion and suggestions. Feel free to put up the patch.


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-23 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201040#comment-17201040
 ] 

Jim Brennan commented on YARN-10393:


[~wzzdreamer], if you would prefer, I can put up a patch for this.  Let me know 
if you want me to do that.  We wouldn't be discussing this at all without your 
detailed analysis of the problem.

 


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-17 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197981#comment-17197981
 ] 

Jim Brennan commented on YARN-10393:


[~wzzdreamer], did you want to put up a new patch (or PR) based on the draft.2 patch?  It will need to address [~adam.antal]'s comment above, fix the existing tests, and maybe add another test for this case.

One other change I would make, which is not in the draft.2 patch, is to add this line to removeOrTrackCompletedContainersFromContext(), as you had in your original PR:
{noformat}
  pendingCompletedContainers.remove(containerId);
{noformat}
It may be mostly redundant to the conditional 
{{pendingCompletedContainers.clear()}} in the calling run() method, but I think 
it makes sense to do a remove here because these are containers for which we 
have affirmatively received an ack from the RM.  It also "fixes" at least one 
unit test.
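
To make that concrete, here is a minimal sketch of the idea, with types simplified to {{String}} instead of {{ContainerId}}; it is not the actual {{NodeStatusUpdaterImpl#removeOrTrackCompletedContainersFromContext}} code.
{code:java}
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Rough sketch only: drop RM-acknowledged containers from both places. */
public class AckedContainerCleanupSketch {
  // Containers the NM still tracks in its context.
  private final Map<String, Object> nmContextContainers = new ConcurrentHashMap<>();
  // Completed containers waiting to be acknowledged by the RM.
  private final Set<String> pendingCompletedContainers = new HashSet<>();

  void removeOrTrackCompletedContainersFromContext(List<String> ackedByRm) {
    for (String containerId : ackedByRm) {
      // The RM has affirmatively acknowledged this completed container...
      nmContextContainers.remove(containerId);
      // ...so it no longer needs to be re-reported on the next heartbeat.
      pendingCompletedContainers.remove(containerId);
    }
  }
}
{code}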

 


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-17 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197715#comment-17197715
 ] 

Adam Antal commented on YARN-10393:
---

Thanks for the valuable comments [~Jim_Brennan] [~yuanbo] [~wzzdreamer].

I went through the reasoning and I agree with the solution. As Jim explained, {{lastHeartBeatID}} is indeed not the right approach, but the conditional clear based on the missed heartbeat looks good to me. To be honest, I don't feel I can give a confident +1 to this patch myself, but since we have a draft patch now, maybe it's time to involve some senior folks to give it a +1.

Anyway, the draft looks good - one comment is to add DEBUG- or INFO-level logging when {{missedHeartbeat}} is true - something like this:
{code:java}
if (!missedHeartbeat) {
  pendingCompletedContainers.clear();
} else {
  LOG.info("Skip clearing pending completed containers due to missed heartbeat.");
}
missedHeartbeat = false;
{code}
Of course, you can figure this out from the logs anyway, but let's keep it explicit - it may also help when investigating such cases in the future.


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-14 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195694#comment-17195694
 ] 

Jim Brennan commented on YARN-10393:


[~wzzdreamer], [~adam.antal], what do you think of the draft patches?  It would be good to come to a resolution on this.

 


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-11 Thread Yuanbo Liu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17194592#comment-17194592
 ] 

Yuanbo Liu commented on YARN-10393:
---

[~Jim_Brennan] Thanks for your comments.
{quote}My one issue with it is that it will not return the most up to date 
status for containers on the heartbeat after the missed one...
{quote}
Great point. Your comments remind me that I was wrong; I was thinking about 
this situation:
{quote}[heartbeat:1, container:1] -> RM

NM does not get the response

[heartbeat:1, container:1, container:2] -> RM

RM thinks it is a duplicated heartbeat and never processes container 2.
{quote}
I thought container 2 would never get updated if the heartbeat response went 
missing.  Your patch makes sense to me.
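(For readers following the thread: the RM-side duplicate check that produces this behavior looks roughly like the sketch below. This is a simplified illustration of the flow, not the actual ResourceTrackerService code, and buildNewResponse is a hypothetical helper.)
{code:java}
// Simplified illustration: why a heartbeat resent with the same responseId
// is ignored. Not the real ResourceTrackerService code; buildNewResponse is
// a hypothetical helper standing in for the normal processing path.
NodeHeartbeatResponse handleHeartbeat(NodeStatus remoteNodeStatus, RMNode rmNode) {
  NodeHeartbeatResponse lastResponse = rmNode.getLastNodeHeartBeatResponse();
  if (remoteNodeStatus.getResponseId() + 1 == lastResponse.getResponseId()) {
    // Same responseId as the request already processed: the RM assumes this
    // is a duplicate, replays the previous response, and never looks at the
    // newly completed container (container 2 in the example above).
    return lastResponse;
  }
  // Otherwise the status is treated as new: completed containers are
  // forwarded and a response carrying responseId + 1 is recorded.
  return buildNewResponse(remoteNodeStatus, rmNode);
}
{code}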

 
{quote}Another question is how would these two proposals behave if the NM 
misses multiple heartbeats in a row?
{quote}
I believe if that happens, the cluster or the NM may already be in an unusable 
state; continuously resending completed containers does not seem like a bad idea?

 


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-11 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17194402#comment-17194402
 ] 

Jim Brennan commented on YARN-10393:


[~yuanbo] thanks for uploading your patch and restarting the conversation.

There are two ways we can miss a heartbeat: the RM never gets it, or the NM 
doesn't get the response.  Existing code already handles the first case, because 
pendingCompletedContainers doesn't get cleared due to the exception.  This 
Jira is dealing with the second case, where the heartbeat after a missed 
heartbeat is seen as a duplicate by the RM, so any newly completed containers 
never get processed by the RM (because the NM doesn't know it was treated as a 
duplicate by the RM, and we clear pendingCompletedContainers).
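(To make that loss concrete, here is a stripped-down sketch of the NM status-updater loop. It is illustrative only and not the actual NodeStatusUpdaterImpl code; in the real code the clear() sits at the end of removeOrTrackCompletedContainersFromContext(), and buildRequest() here is a hypothetical helper.)
{code:java}
// Illustrative sketch of the NM heartbeat loop (not the real code): it shows
// where newly completed containers are lost when the previous response was missed.
while (shouldRun) {
  NodeStatus status = getNodeStatus(lastHeartbeatID); // fills pendingCompletedContainers
  try {
    NodeHeartbeatResponse response =
        resourceTracker.nodeHeartbeat(buildRequest(status)); // buildRequest is illustrative
    lastHeartbeatID = response.getResponseId();
    // If the previous response was lost, the RM treats this request as a
    // duplicate and ignores its completed containers, but the NM clears
    // pendingCompletedContainers anyway, so they are never re-reported.
    pendingCompletedContainers.clear();
  } catch (Exception e) {
    // Response lost: pendingCompletedContainers is NOT cleared, so the next
    // heartbeat resends the completed containers (the case that already works).
    LOG.error("Caught exception in status-updater", e);
  }
  sleep(heartbeatIntervalMs);
}
{code}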

I believe your draft patch will work and it improves on the original PR by 
limiting the change to only completed container statuses.  My one issue with it 
is that it will not return the most up to date status for containers on the 
heartbeat after the missed one.   In the case described by this Jira, where the 
RM got the request but the NM failed to get the response, this has no impact 
because this one will be treated as a duplicate by the RM.  But if the RM never 
got the last request, then we are deferring the most recent container statuses 
until the next heartbeat.  We are also deferring adding newly completed 
containers to recentlyStoppedContainers.

I put up another patch, draft.2, to show what I was proposing.  The main 
downside to the draft.2 patch is that if the one after the missed heartbeat is 
not seen as a duplicate by the RM, it will resend completed containers from this 
heartbeat on the next one.

Another question is how would these two proposals behave if the NM misses 
multiple heartbeats in a row?


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-10 Thread Yuanbo Liu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17193975#comment-17193975
 ] 

Yuanbo Liu commented on YARN-10393:
---

Also, we should avoid adding new containers to the same heartbeat id, as 
[~wzzdreamer] has clarified in the description. Resending old containers is 
unavoidable and changing the protocol is not a good idea, so we could use 
pendingCompletedContainers to fix it.
I've attached a draft patch for this issue so that we can speed up and conclude 
our ideas. [~wzzdreamer], feel free to attach a new PR if you have one.

[~wzzdreamer] [~Jim_Brennan] [~adam.antal] 
Any comments are welcome. 


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-02 Thread zhenzhao wang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17189046#comment-17189046
 ] 

zhenzhao wang commented on YARN-10393:
--

And one more thing to clarify. The following code in the current PR could be 
avoided, because getNodeStatus() is only called when the heartbeatId changes, so 
pendingCompletedContainers won't be updated on a retry (getNodeStatus() is not 
called twice). I added it because I wanted an extra safeguard: it keeps sending 
completed containers until they are confirmed in a response, which could prevent 
potential errors from RM or RM-AM communication. But as [~Jim_Brennan] 
pointed out, it might cause duplicate reports for the same completed containers.

{quote}

pendingCompletedContainers.remove(containerId); 

{quote}


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-01 Thread zhenzhao wang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17189009#comment-17189009
 ] 

zhenzhao wang commented on YARN-10393:
--

Thanks all for the great discussion. 

As stated earlier, I guess the problem could be discussed in two aspects:

{quote}
 # The RM and the NM have a different understanding of the heartbeat. The RM 
uses the heartbeatId to distinguish heartbeats; however, the NM might generate 
different requests with the same heartbeat id on heartbeat failure.
 # The cache for containers inside the NM is not maintained correctly on 
heartbeat failure.

{quote}

The first problem can lead to multiple missing report fields. The potentially 
missing fields include completed containers (which leads to the live lock in 
this case), increasedContainers (I didn't dig into the impact though), etc. It 
also means that people had better be aware of this when they want to add new 
heartbeat fields in the future. I hope we could fix it too, but I agree with the 
concern about changing the protocol. So if we don't want to fix it in this Jira, 
we should keep track of it. What do you think? [~Jim_Brennan]

As for the second problem, it's directly related to the missing completed 
container issue. [~Jim_Brennan] proposed a good approach, and [~yuanbo] 
[~adam.antal] also made good points. We can't clear pendingCompletedContainers 
on the first successful response after a failure. The marker approach works, 
while the heartbeatId comparison approach wouldn't.


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-01 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188591#comment-17188591
 ] 

Jim Brennan commented on YARN-10393:


{quote}BTW, the normal calculation between requestID and responseID would be 
 responseID == requestID + 1
{noformat}
// code placeholder
if (responseId != lastHeartbeatID + 1) {
   pendingCompletedContainers.clear();
}
{noformat}
{quote}
Good point [~yuanbo]. But this is the reverse of what we want. We want to do 
the clear() when we get the expected responseID, and we want to skip the 
clear() when we don't get the expected responseId.

But your comment made me realize that my suggestion won't work as is. The NM's 
lastHeartBeatID is only updated when we get a response back from the RM. On the 
missed heartbeat, we don't update it. In the case of a duplicate, the RM 
re-sends the last response again, which will still be lastHeartBeatID + 1. I 
think we need to base this on a flag that we set when we get an exception from 
the request attempt. So something like:
{noformat}
logAggregationReportForAppsTempList.clear();
lastHeartbeatID = response.getResponseId();
// Don't clear if the previous heartbeat was missed.
if (!missedHeartbeat) {
  pendingCompletedContainers.clear();
}
missedHeartbeat = false;
...
} catch (Exception e) {

  // TODO Better error handling. Thread can die with the rest of the
  // NM still running.
  LOG.error("Caught exception in status-updater", e);
  missedHeartbeat = true;
} finally {
{noformat}
On the missed heartbeat, we don't clear {{pendingCompletedContainers}} because 
of the exception, and by adding this change, we will prevent clearing it until 
the heartbeat after we get a response back from the RM. I think the downside is 
that if the RM never got the missed heartbeat (if it failed on the send vs the 
response), then the one after the missed one might not be a duplicate, and we 
will potentially re-send completed containers to the RM on the next heartbeat. 
But that should happen only once in this case. I don't think there is any way 
for the NM to tell if the RM's response is a duplicate or not.

What do you think?


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-01 Thread Yuanbo Liu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188358#comment-17188358
 ] 

Yuanbo Liu commented on YARN-10393:
---

[~adam.antal] Thanks for your comments. I misunderstood [~Jim_Brennan] 's 
solution.  
+1 for that. 

BTW, the normal calculation between requestID and responseID would be 

responseID == requestID + 1
{code:java}
// code placeholder
if (responseId != lastHeartbeatID + 1) {
   pendingCompletedContainers.clear();
}
{code}
correct me if I'm wrong.

 

[~wzzdreamer] What are your thoughts about the solution?


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-01 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188306#comment-17188306
 ] 

Adam Antal commented on YARN-10393:
---

Thanks for the discussion above, I think we've seen several approaches there.

In my opinion our best solution so far is this one:
bq. Another option would be to remove pendingCompletedContainers.clear() from 
the end of removeOrTrackCompletedContainersFromContext() - it doesn't really 
belong there anyway, and do it directly in the run()) method. By doing that, 
you can make it conditional on the heartbeatId changing. Something like: [...]

Also that would generally resolve all heartbeat processing problems. The 
concrete bug report was about the completed containers, but this is a more 
general solution for this bug, and it's quite elegant.

Regarding the two ways you mentioned [~yuanbo], I'd prefer option 2:
bq. 2、Resending container id from recentlyStoppedContainers periodically (maybe 
1 mins?), once it get response from RM, it will be removed from 
recentlyStoppedContainers and never get retried again.

I believe option 1 does not guarantee to resolve anything in the case of e.g. an 
intermittent network disruption, where the same thing happens: even if we retry 
3 times, we may end up in the same situation. If there's a solution that doesn't 
depend on the assumption that the connection issue resolves before the retries 
are exhausted, I think we should go there: it makes Hadoop more resilient rather 
than less resilient (because otherwise we depend on that assumption). Obviously 
everything is configurable (you can make the number of retries arbitrarily 
large), and you cannot achieve 100% resilience, but I find solution 2 logically 
better.

However, even option 2 provides a less general solution than the one mentioned 
by [~Jim_Brennan]. I would be most comfortable with combining the two 
solutions, but that might be too much for this issue. Let's elaborate on this a 
bit. What do others think?


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-01 Thread Yuanbo Liu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188276#comment-17188276
 ] 

Yuanbo Liu commented on YARN-10393:
---

[~adam.antal] sorry to interrupt, any thoughts about this issue?


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-01 Thread Yuanbo Liu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188270#comment-17188270
 ] 

Yuanbo Liu commented on YARN-10393:
---

[~Jim_Brennan] Thanks for the reply.

Basically I'm open to these two options:

1. Change the code that sends the heartbeat so that it is sent again if the 
response is not received (like [~wzzdreamer] did in the PR), but I'd prefer to 
introduce some max-retry code to guard against a potential infinite loop.

2. Resend container ids from recentlyStoppedContainers periodically (maybe 
every 1 minute?); once a container is acknowledged in an RM response, it is 
removed from recentlyStoppedContainers and never retried again (a rough sketch 
follows).
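(The sketch below shows what option 2 could look like. It is purely illustrative: the scheduler, the ackedByRM set, the helper methods, and the one-minute period are assumptions for discussion, not part of any attached patch.)
{code:java}
// Illustrative only: periodically re-queue completed container ids from
// recentlyStoppedContainers until the RM acknowledges them.
scheduler.scheduleAtFixedRate(() -> {
  synchronized (recentlyStoppedContainers) {
    for (ContainerId id : recentlyStoppedContainers.keySet()) {
      if (!ackedByRM.contains(id)) {
        pendingCompletedContainers.put(id, buildCompletedStatus(id)); // hypothetical helper
      }
    }
  }
}, 1, 1, TimeUnit.MINUTES);

// On each heartbeat response, drop whatever the RM acknowledged.
void onHeartbeatResponse(NodeHeartbeatResponse response) {
  for (ContainerId id : ackedContainerIds(response)) { // however the ack is surfaced
    ackedByRM.add(id);
    recentlyStoppedContainers.remove(id);
  }
}
{code}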


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-08-31 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188026#comment-17188026
 ] 

Jim Brennan commented on YARN-10393:


Thanks for the discussion [~yuanbo] and [~wzzdreamer]!
{quote}
I guess this will end up completed containers being sent multiple times if we 
just remove line 613
{quote}
Yeah.  I think that effectively defeats the purpose of [YARN-2997].  I was 
thinking about whether we could replace the {{!isContainerRecentlyStopped()}} 
with {{!pendingContainersToRemove.contains(containerId)}}, but I think that has 
a similar problem.  We would keep reporting the completed container to the RM 
until the AM acks the completion, or until it is removed from the NM context.
{quote}
What about this? We keep pendingCompletedContainers.clear() unchanged. Let's 
remove completed containers in the heartbeat request from the 
cache(recentlyStoppedContainers) before sending the heartbeat. Then we added 
the acked container back to the cache. From a high level, it is like to update 
the cache only if the heartbeat succeeded with a response.
{quote}
I think that might cause more problems than it solves.  The 
recentlyStoppedContainers cache is important for NM restart state, and 
{{isContainerRecentlyStopped()}} is called in several places in the code.

Another option would be to remove {{pendingCompletedContainers.clear()}} from 
the end of {{removeOrTrackCompletedContainersFromContext()}} - it doesn't 
really belong there anyway, and do it directly in the {{run()}} method.  By 
doing that, you can make it conditional on the heartbeatId changing.  Something 
like:
{noformat}
int responseId = response.getResponseId();
if (responseId != lastHeartbeatID) {
  pendingCompletedContainers.clear();
}
lastHeartbeatID = responseId;
{noformat}


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-08-20 Thread zhenzhao wang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180989#comment-17180989
 ] 

zhenzhao wang commented on YARN-10393:
--

Thanks [~Jim_Brennan] [~yuanbo] for the comment!
??It seems to me that the change you made to 
NodeStatusUpdaterImpl.removeOrTrackCompletedContainersFromContext() is all that 
is required to ensure that the completed container status is not lost. I don't 
think you need to change the RM/NM protocol to manually resend the last 
NodeHeartbeatRequest again. As you noted, the RPC retry logic is already doing 
that. Also note that there is a lot of other state in that request, so I am not 
sure of the implications of not sending the most recent status for all that 
other state. Changing the protocol seems scary.??
[~Jim_Brennan] I guess the RM side assumes the heartbeatId is the unique 
identifier of a heartbeat. The old logic of generating a heartbeat couldn't 
guarantee this: it might generate a new request and update the cache even when 
the heartbeatId didn't change. My intent is to make sure the NM generates a new 
request only when the heartbeatId changes. This semantic guarantee is more 
important than the retry and could help prevent other errors. E.g. a running 
container could also be lost in this case; it's just that it would be reported 
again in the next heartbeat. I agree that this change is scary, but I guess 
fixing it is even more meaningful than fixing the cache problem itself.
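
For illustration, a minimal sketch of that guarantee (hypothetical names, not 
the actual patch): the request is rebuilt only when the last acknowledged 
responseId has advanced; otherwise the identical request is resent.
{code:java}
// Illustrative sketch only, not the YARN-10393 patch; all names are hypothetical.
import java.util.function.IntFunction;

class HeartbeatRequestCache<R> {
  private R cachedRequest;
  private int cachedForResponseId = Integer.MIN_VALUE;

  R getRequest(int lastAckedResponseId, IntFunction<R> requestBuilder) {
    if (cachedRequest == null || lastAckedResponseId != cachedForResponseId) {
      // The RM acknowledged the previous heartbeat, so build a fresh request.
      cachedRequest = requestBuilder.apply(lastAckedResponseId);
      cachedForResponseId = lastAckedResponseId;
    }
    // Otherwise the previous response was lost: resend the identical request,
    // so the heartbeat id and the reported container set stay consistent.
    return cachedRequest;
  }
}
{code}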

??But the change you made in removeOrTrackCompletedContainersFromContext() 
seems to go directly to the problem.  The current code is always clearing 
pendingCompletedContainers at the end of that function.  I've read through 
YARN-2997 and it seems like this was a late addition to the patch, but it is 
not clear to me why it was added.??
[~Jim_Brennan] Yeah, I meant to remove the entry from the cache only after the 
completed container is acked by the RM. But the potential leak is a reasonable 
concern. [~yuanbo] also pointed it out and suggested a solution.
??This would be a potential memory leak if we remove 
"pendingCompletedContainers.clear()".
I'd suggest that removing "!isContainerRecentlyStopped(containerId)" in 
NodeStatusUpdaterImpl.java[line: 613] would be good to fix this issue.

if (!isContainerRecentlyStopped(containerId)) {
 pendingCompletedContainers.put(containerId, containerStatus);
}
Completed containers will be cached in 10mins(default value) until it timeouts 
or gets response from heartbeat. And 10mins cache for completed container is 
long enough for retrying sending requests through heartbeat (default interval 
is 10s).??
I guess this will end up with completed containers being sent multiple times if 
we just remove line 613.


What about this? We keep pendingCompletedContainers.clear() unchanged. Before 
sending the heartbeat, we remove the completed containers in the request from 
the cache (recentlyStoppedContainers). Then, after the response, we add the 
acked containers back to the cache. At a high level, this is like updating the 
cache only if the heartbeat succeeded with a response.
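
To make the idea concrete, here is a minimal sketch using plain Java collections 
instead of the real NodeStatusUpdaterImpl types; recentlyStopped and 
sendHeartbeat are hypothetical stand-ins:
{code:java}
import java.util.*;

// Sketch only: the cache is updated after the response arrives, so a lost
// response cannot permanently hide a completed container.
class HeartbeatCacheSketch {
  final Map<String, Long> recentlyStopped = new HashMap<>(); // containerId -> stop time

  void heartbeatOnce(List<String> completedContainerIds) {
    // 1. Make completed containers reportable again even if this heartbeat fails.
    completedContainerIds.forEach(recentlyStopped::remove);

    // 2. Send the heartbeat; assume the response carries the ids the RM acked.
    List<String> acked = sendHeartbeat(completedContainerIds);

    // 3. Only acked containers go (back) into the cache, so they are not
    //    re-reported on the next heartbeat.
    long now = System.currentTimeMillis();
    acked.forEach(id -> recentlyStopped.put(id, now));
  }

  List<String> sendHeartbeat(List<String> completed) {
    // Placeholder for ResourceTracker.nodeHeartbeat(); here the RM acks everything.
    return completed;
  }
}
{code}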








[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-08-19 Thread Yuanbo Liu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180959#comment-17180959
 ] 

Yuanbo Liu commented on YARN-10393:
---

[~Jim_Brennan]

Thanks for the comments.

> I think the concern is that if we remove that 
> pendingCompletedContainers.clear()

Removing "pendingCompletedContainers.clear()" would introduce a potential memory 
leak. I'd suggest that removing the "!isContainerRecentlyStopped(containerId)" 
check in NodeStatusUpdaterImpl.java [line 613] would be enough to fix this issue.
{code:java}
if (!isContainerRecentlyStopped(containerId)) {
  pendingCompletedContainers.put(containerId, containerStatus);
}
{code}
Completed containers will be cached for 10 minutes (the default value) until the 
entry times out or a heartbeat response acknowledges them. And a 10-minute cache 
for a completed container is long enough to keep retrying the report through the 
heartbeat (the default interval is 10s).
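
As a hedged sketch of that suggestion (plain collections and hypothetical names, 
not the actual NM code): without the recently-stopped guard, a completed 
container keeps landing in the pending map on every heartbeat until the RM acks 
it, and the 10-minute expiry bounds the caches.
{code:java}
import java.util.*;

class ResendUntilAckedSketch {
  static final long EXPIRY_MS = 10 * 60 * 1000L;                 // default 10 minutes

  final Map<String, String> pendingCompleted = new HashMap<>();  // containerId -> status
  final Map<String, Long> recentlyStopped = new HashMap<>();     // containerId -> stop time

  // Called while building each heartbeat; note there is no
  // isContainerRecentlyStopped() guard, so the put() simply repeats
  // (idempotently, since this is a map) until the container is acked.
  void trackCompleted(String containerId, String status) {
    pendingCompleted.put(containerId, status);
  }

  // Called with the container ids acknowledged in the heartbeat response.
  void onAck(Collection<String> ackedIds) {
    ackedIds.forEach(pendingCompleted::remove);
  }

  // Periodic cleanup keeps the recently-stopped cache from growing without bound.
  void expireOld(long now) {
    recentlyStopped.values().removeIf(stoppedAt -> now - stoppedAt > EXPIRY_MS);
  }
}
{code}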


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-08-19 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180612#comment-17180612
 ] 

Jim Brennan commented on YARN-10393:


 Thanks [~wzzdreamer] for submitting this with such a detailed analysis.  While 
I agree with your analysis, I am not sure I agree with the solution.

It seems to me that the change you made to 
NodeStatusUpdaterImpl.removeOrTrackCompletedContainersFromContext() is all that 
is required to ensure that the completed container status is not lost.  I don't 
think you need to change the RM/NM protocol to manually resend the last 
NodeHeartbeatRequest again.  As you noted, the RPC retry logic is already doing 
that.  Also note that there is a lot of other state in that request, so I am 
not sure of the implications of not sending the most recent status for all that 
other state.  Changing the protocol seems scary.

But the change you made in removeOrTrackCompletedContainersFromContext() seems 
to go directly to the problem.  The current code is always clearing 
pendingCompletedContainers at the end of that function.  I've read through 
[YARN-2997] and it seems like this was a late addition to the patch, but it is 
not clear to me why it was added.

I think the concern is that if we remove that 
pendingCompletedContainers.clear(), there is a potential to leak entries in 
pendingCompletedContainers.  The question is whether it is possible to miss an 
ack in the heartbeat response for a completed container.
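
For readers following along, here is a heavily paraphrased sketch of the flow in 
question (plain collections, not the verbatim NodeStatusUpdaterImpl source), 
showing where an unacked status can be dropped:
{code:java}
import java.util.*;

class PendingCompletedSketch {
  final Map<String, String> pendingCompleted = new HashMap<>(); // containerId -> status
  final Set<String> recentlyStopped = new HashSet<>();

  // Build the statuses for one heartbeat: a completed container is staged into
  // the pending map and marked "recently stopped", so later heartbeats skip it.
  List<String> getContainerStatuses(Map<String, String> completedNow) {
    completedNow.forEach((id, status) -> {
      if (!recentlyStopped.contains(id)) {
        pendingCompleted.put(id, status);
        recentlyStopped.add(id);
      }
    });
    return new ArrayList<>(pendingCompleted.values());
  }

  // Handle the heartbeat response (acked containers are also removed from the
  // NM context in the real code; that part is not modeled here).
  void removeOrTrackCompletedContainersFromContext(List<String> ackedIds) {
    // The unconditional clear() is the line in question: a status that was
    // staged but never acked is dropped here, and the recentlyStopped mark
    // above keeps it from ever being staged again.
    pendingCompleted.clear();
  }
}
{code}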



[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-08-13 Thread zhenzhao wang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177462#comment-17177462
 ] 

zhenzhao wang commented on YARN-10393:
--

[~adam.antal] This is a great question. 
First, it's not like we upgraded to 2.9.2 and the problem was gone. We stopped 
seeing new cases reported while we were still running 2.6.x. This was because we 
have a stand-alone policing service that can kill long-running mappers/reducers 
or the job itself. I guess all the users whose job patterns were prone to this 
problem had adopted that service to prevent it.
Second, I guess this is also because of the default retry policy change. Here's 
the code that creates the RM proxy in 2.6; I don't see any retry when a proxy 
invocation fails.

{code:java}
  public <T> ProtocolProxy<T> getProxy(Class<T> protocol, long clientVersion,
      InetSocketAddress addr, UserGroupInformation ticket,
      Configuration conf, SocketFactory factory,
      int rpcTimeout, RetryPolicy connectionRetryPolicy,
      AtomicBoolean fallbackToSimpleAuth) throws IOException {

    if (connectionRetryPolicy != null) {
      throw new UnsupportedOperationException(
          "Not supported: connectionRetryPolicy=" + connectionRetryPolicy);
    }

    T proxy = (T) Proxy.newProxyInstance(protocol.getClassLoader(),
        new Class[] { protocol }, new Invoker(protocol, addr, ticket, conf,
            factory, rpcTimeout, fallbackToSimpleAuth));
    return new ProtocolProxy<T>(protocol, proxy, true);
  }


// Invoker.java
@Override
public Object invoke(Object proxy, Method method, Object[] args)
  throws Throwable {
  long startTime = 0;
  if (LOG.isDebugEnabled()) {
startTime = Time.now();
  }
  TraceScope traceScope = null;
  if (Trace.isTracing()) {
traceScope = Trace.startSpan(
method.getDeclaringClass().getCanonicalName() +
"." + method.getName());
  }
  ObjectWritable value;
  try {
value = (ObjectWritable)
  client.call(RPC.RpcKind.RPC_WRITABLE, new Invocation(method, args),
remoteId, fallbackToSimpleAuth);
  } finally {
if (traceScope != null) traceScope.close();
  }
  if (LOG.isDebugEnabled()) {
long callTime = Time.now() - startTime;
LOG.debug("Call: " + method.getName() + " " + callTime);
  }
  return value.get();
}
{code}
And in 2.9, the RMProxy default retry policy is the following: up to 15 minutes 
with a fixed 30-second sleep, so the client can do lots of retries.
{code:java}
  retryPolicy =
      RetryPolicies.retryUpToMaximumTimeWithFixedSleep(
          rmConnectWaitMS,              // default: 15 * 60 * 1000 ms
          rmConnectionRetryIntervalMS,  // default: 30 * 1000 ms
          TimeUnit.MILLISECONDS);
{code}
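
For context, this kind of policy is attached to the client proxy via Hadoop's 
RetryProxy, so a failed invocation such as a lost heartbeat response is retried 
transparently. A hedged sketch follows; the stand-in Pinger interface is 
hypothetical, and the real ResourceTracker wiring in RMProxy/ServerProxy is more 
involved than this.
{code:java}
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;
import org.apache.hadoop.io.retry.RetryProxy;

class RetryWiringSketch {
  // Hypothetical stand-in for the RPC protocol interface (e.g. ResourceTracker).
  interface Pinger {
    String ping(String msg);
  }

  static Pinger withRetries(Pinger rawProxy) {
    RetryPolicy retryPolicy = RetryPolicies.retryUpToMaximumTimeWithFixedSleep(
        15 * 60 * 1000L,   // rmConnectWaitMS default: 15 minutes
        30 * 1000L,        // rmConnectionRetryIntervalMS default: 30 seconds
        TimeUnit.MILLISECONDS);
    // RetryProxy wraps the raw proxy so every invocation is retried per the policy.
    return (Pinger) RetryProxy.create(Pinger.class, rawProxy, retryPolicy);
  }
}
{code}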





[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-08-13 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176984#comment-17176984
 ] 

Adam Antal commented on YARN-10393:
---

Nice finding [~wzzdreamer] and thorough explanation. Let me check the PR as 
well.

One question: what do you think is the reason that this bug did not appear after 
the upgrade to 2.9? I mean, the possibility of the stuck mapper is still in the 
protocol, so it could still have appeared.


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-08-13 Thread Yuanbo Liu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176820#comment-17176820
 ] 

Yuanbo Liu commented on YARN-10393:
---

[~wzzdreamer] Thanks for your patch; I've made some comments on it.


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-08-12 Thread zhenzhao wang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176740#comment-17176740
 ] 

zhenzhao wang commented on YARN-10393:
--

[~bibinchundatt] [~adam.antal][~Jim_Brennan][~jdonofrio][~aceric] Could you 
please help with the review? Thanks


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-08-12 Thread zhenzhao wang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176467#comment-17176467
 ] 

zhenzhao wang commented on YARN-10393:
--

I could see two issues here:
# The RM and the NM have different understandings of a heartbeat. The RM uses 
the heartbeatId to distinguish heartbeats; however, the NM might generate 
different requests with the same heartbeat id on heartbeat failure.
# The cache of containers inside the NM is not maintained correctly on heartbeat 
failure.

I submitted a PR: https://github.com/apache/hadoop/pull/2204. I tried to make 
fewer code changes. However, I'd say some of the cache structures 
(recentlyStoppedContainers, pendingCompletedContainers) the NM uses are kind of 
complex and error-prone. E.g. the cache is updated inside getContainerStatuses() 
regardless of whether the heartbeat succeeds, and this happens before the 
heartbeat request is even sent. I'd suggest it may be worth a refactor in the 
future.

[~templedf][~yuanbo]I'd appreciate it if you could help with the review. Thanks!

Note that this patch is not tested in our production Hadoop clusters yet.



[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-08-10 Thread Yuanbo Liu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175148#comment-17175148
 ] 

Yuanbo Liu commented on YARN-10393:
---

Thanks for opening this issue; we happened to hit a similar situation on 
hadoop-2.7.0. The mapper lost its heartbeat and never finished. Currently we 
just use "mapred fail-task" to put those mappers into the failed state and 
re-execute them. Looking forward to your patch!

 
