[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhenzhao wang updated YARN-10393:
---------------------------------
    Description: 
This is a bug we have seen multiple times on Hadoop 2.4.x. The following 
analysis is based on a core dump, logs, and code from 2017 with Hadoop 2.4.x. 
We have not seen it after 2.6 in our environment, but that is because of the RPC 
retry policy change and other changes; unless I missed something, the issue is 
still possible even with the current code.

*High-level description:*
We have seen a starving-mapper issue several times: the MR job gets stuck in a 
live-lock state and cannot make any progress. The queue is full, so the pending 
mapper cannot get any resources to continue, and the application master fails to 
preempt a reducer, which leaves the job stuck. The application master does not 
preempt a reducer because a leaked container remains in its assigned mappers: 
the node manager failed to report the completed container to the resource 
manager.

*Detailed steps:*
# Container_1501226097332_249991_01_000199 was assigned to 
attempt_1501226097332_249991_m_000095_0 on 2017-08-08 16:00:00,417.
{code:java}
appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container 
container_1501226097332_249991_01_000199 to 
attempt_1501226097332_249991_m_000095_0
{code}
# The container finished at 2017-08-08 16:02:53,313.
{code:java}
yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: 
Container container_1501226097332_249991_01_000199 transitioned from RUNNING to 
EXITED_WITH_SUCCESS
yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
 Cleaning up container container_1501226097332_249991_01_000199
{code}
# The NodeStatusUpdater got an exception during the heartbeat at 2017-08-08 
16:07:04,238. The heartbeat request was in fact handled by the resource manager; 
however, the node manager failed to receive the response. Let's assume 
heartBeatResponseId=$hid in the node manager. With our current configuration, 
the next heartbeat will be sent 10s later.
{code:java}
2017-08-08 16:07:04,238 ERROR 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
exception in status-updater
java.io.IOException: Failed on local exception: java.io.IOException: Connection 
reset by peer; Host Details : local host is: ; destination host is: XXXXXXX
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
        at org.apache.hadoop.ipc.Client.call(Client.java:1472)
        at org.apache.hadoop.ipc.Client.call(Client.java:1399)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
        at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
        at 
org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
        at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
        at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
        at 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:197)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
        at 
org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
        at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at java.io.FilterInputStream.read(FilterInputStream.java:133)
        at java.io.FilterInputStream.read(FilterInputStream.java:133)
        at 
org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:513)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at 
org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1071)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:966)
{code}
# NodeStatusUpdaterImpl tries to send another heartbeat 10s after the last 
exception. Container_1501226097332_249991_01_000199 was added to the heartbeat 
request as a completed container for the first time. We can confirm this from 
the timestamp in recentlyStoppedContainers@NodeStatusUpdaterImpl.
{code:java}
java.util.LinkedHashMap$Entry@0x7889caca0
http://localhost:7000/object/0x7889cacc8
Value: 1502209034239
Time added to recentlyStoppedContainers:
1502208434239 = 1502209034239 - 600000
$ date -d @1502208434
Tue Aug  8 16:07:14 UTC 2017
{code}
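For clarity, the arithmetic above can be reproduced with a small sketch. It assumes the map value seen in the heap dump is the tracking deadline, i.e. the time the entry was added plus durationToTrackStoppedContainers (600000 ms in our setup); the class and variable names below are illustrative, not NM code.
{code:java}
// Hedged sketch of the heap-dump arithmetic above (illustrative, not NM code).
// Assumption: the LinkedHashMap value is "time added + durationToTrackStoppedContainers".
public class StoppedContainerTimestampSketch {
  public static void main(String[] args) {
    long trackedUntilMillis = 1502209034239L;              // value seen in the heap dump
    long durationToTrackStoppedContainersMillis = 600000L; // 10 minutes in our setup
    long addedAtMillis = trackedUntilMillis - durationToTrackStoppedContainersMillis;
    // Prints 2017-08-08T16:07:14.239Z, i.e. ~10s after the heartbeat exception at 16:07:04.
    System.out.println(java.time.Instant.ofEpochMilli(addedAtMillis));
  }
}
{code}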
# The RM treated the request as a duplicate heartbeat because $hid was the same 
as the responseId it had received last time. So it returned the last response 
generated from the previous heartbeat and did not process the request; in other 
words, it did not add container_1501226097332_249991_01_000199 as a completed 
container to its own data structures.
{code:java}
// 3. Check if it's a 'fresh' heartbeat i.e. not duplicate heartbeat
NodeHeartbeatResponse lastNodeHeartbeatResponse =
    rmNode.getLastNodeHeartBeatResponse();
if (remoteNodeStatus.getResponseId() + 1 == lastNodeHeartbeatResponse
    .getResponseId()) {
  LOG.info("Received duplicate heartbeat from node "
      + rmNode.getNodeAddress() + " responseId="
      + remoteNodeStatus.getResponseId());
  return lastNodeHeartbeatResponse;
}
{code}
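To make the duplicate check concrete, here is a minimal model of the responseId handshake with illustrative values (the real state lives in NodeStatusUpdaterImpl and on the RM side; this is not RM code). Because the NM never saw the RM's reply, it retries with the same responseId, the check above fires, and the node status carried by the retry, including the completed container, is never processed.
{code:java}
// Minimal model of the duplicate-heartbeat branch above (illustrative, not RM code).
public class DuplicateHeartbeatSketch {
  public static void main(String[] args) {
    int nmResponseId = 41;      // $hid stored by the NM; the reply carrying 42 was lost
    int rmLastResponseId = 42;  // RM already processed heartbeat 41 and cached response 42

    // The NM retries with the same responseId it used before the connection reset.
    int remoteResponseId = nmResponseId;
    if (remoteResponseId + 1 == rmLastResponseId) {
      // The RM replays the cached response and skips processing the node status,
      // so the completed container carried by this retry is silently dropped.
      System.out.println("duplicate heartbeat: replaying cached response " + rmLastResponseId);
    }
  }
}
{code}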
# The node manager received the response and cleared 
container_1501226097332_249991_01_000199 from 
pendingCompletedContainers@NodeStatusUpdaterImpl. However, 
container_1501226097332_249991_01_000199 is still in 
recentlyStoppedContainers@NodeStatusUpdaterImpl, so the node manager will never 
report container_1501226097332_249991_01_000199 as a completed container in a 
heartbeat request again.
{code:java}
if (containerStatus.getState() == ContainerState.COMPLETE) {
  if (isApplicationStopped(applicationId)) {
    if (LOG.isDebugEnabled()) {
      LOG.debug(applicationId + " is completing, " + " remove "
          + containerId + " from NM context.");
    }
    context.getContainers().remove(containerId);
    pendingCompletedContainers.put(containerId, containerStatus);
  } else {
    if (!isContainerRecentlyStopped(containerId)) {
      pendingCompletedContainers.put(containerId, containerStatus);
      // Adding to finished containers cache. Cache will keep it around at
      // least for #durationToTrackStoppedContainers duration. In the
      // subsequent call to stop the container will get removed from cache.
      addCompletedContainer(containerId);
    }
  }
} else {
  containerStatuses.add(containerStatus);
}
{code}
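The interaction between the two maps can be summarised with a small model (again illustrative, not the actual NodeStatusUpdaterImpl): once the response to the "lost" heartbeat clears pendingCompletedContainers, the recently-stopped guard above prevents the container from ever being queued again.
{code:java}
// Illustrative model of how the completed container is lost (not NM code).
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class LostCompletedContainerSketch {
  static Set<String> recentlyStoppedContainers = new HashSet<>();
  static Map<String, String> pendingCompletedContainers = new HashMap<>();

  // Mirrors the guarded branch above: only queue the status if it is not already tracked.
  static void queueCompleted(String id) {
    if (!recentlyStoppedContainers.contains(id)) {
      pendingCompletedContainers.put(id, "COMPLETE");
      recentlyStoppedContainers.add(id);
    }
  }

  public static void main(String[] args) {
    String id = "container_1501226097332_249991_01_000199";
    queueCompleted(id);                 // heartbeat N: status goes into the request
    // The RM treats heartbeat N as a duplicate and ignores the status, but the NM
    // still clears its pending map when the (replayed) response arrives.
    pendingCompletedContainers.clear();
    queueCompleted(id);                 // heartbeat N+1: the guard blocks re-queuing
    System.out.println(pendingCompletedContainers.isEmpty()); // true -> RM never learns
  }
}
{code}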
# The application master relies on getResources() to get completed containers 
and remove them from assignedRequests. Since the container will never be 
reported to the RM by the node manager again, the application master cannot 
remove the attempt associated with the container, and the preemption calculation 
returns false in the following code, causing the starving mapper.
{code:java}
boolean preemptReducesIfNeeded() {
  if (reduceResourceRequest.equals(Resources.none())) {
    return false; // no reduces
  }
  if (assignedRequests.maps.size() > 0) {
    // there are assigned mappers
    return false;
  }
{code}
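As a result, the leaked attempt keeps assignedRequests.maps non-empty forever, so the method above returns false on every scheduling pass. A trivial illustration with hypothetical values (not MR code):
{code:java}
// Trivial illustration of the live-lock condition (hypothetical values, not MR code).
public class PreemptionShortCircuitSketch {
  public static void main(String[] args) {
    java.util.Map<String, String> assignedMaps = new java.util.HashMap<>();
    assignedMaps.put("attempt_1501226097332_249991_m_000095_0",
        "container_1501226097332_249991_01_000199"); // leaked; never removed

    boolean reducesPending = true; // reduceResourceRequest != Resources.none()
    boolean preempt = reducesPending && assignedMaps.isEmpty(); // always false
    System.out.println("preempt reducers? " + preempt);         // job stays live-locked
  }
}
{code}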



> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --------------------------------------------------------------------------------------------------
>
>                 Key: YARN-10393
>                 URL: https://issues.apache.org/jira/browse/YARN-10393
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager, yarn
>            Reporter: zhenzhao wang
>            Assignee: zhenzhao wang
>            Priority: Major
>