[ https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180989#comment-17180989 ]
zhenzhao wang edited comment on YARN-10393 at 8/20/20, 7:03 AM:
----------------------------------------------------------------

Thanks [~Jim_Brennan] [~yuanbo] for the comments!
{quote}It seems to me that the change you made to NodeStatusUpdaterImpl.removeOrTrackCompletedContainersFromContext() is all that is required to ensure that the completed container status is not lost. I don't think you need to change the RM/NM protocol to manually resend the last NodeHeartbeatRequest again. As you noted, the RPC retry logic is already doing that. Also note that there is a lot of other state in that request, so I am not sure of the implications of not sending the most recent status for all that other state. Changing the protocol seems scary.{quote}
[~Jim_Brennan] I guess the RM side assumes heartbeatId is the unique identification of a heartbeat. The old logic of generating a heartbeat couldn't guarantee this: it might generate a new request and update the cache even when the heartbeatId didn't change. My intent is to make sure the NM generates a new request only when the heartbeatId changes. This semantic guarantee is more important than the retry itself and could help prevent other errors. E.g. a running container's status can also be lost in this case; it is just reported again in the next heartbeat. I agree that this change is scary, but I guess fixing it is even more meaningful than fixing the cache problem itself.
{quote}But the change you made in removeOrTrackCompletedContainersFromContext() seems to go directly to the problem. The current code is always clearing pendingCompletedContainers at the end of that function. I've read through YARN-2997 and it seems like this was a late addition to the patch, but it is not clear to me why it was added.{quote}
[~Jim_Brennan] Yeah, I mean to remove the entry from the cache only when the completed container is acked by the RM. But the potential memory leak is a reasonable concern. [~yuanbo] also pointed it out with a suggested solution.
{quote}This would be a potential memory leak if we remove "pendingCompletedContainers.clear()". I'd suggest that removing "!isContainerRecentlyStopped(containerId)" in NodeStatusUpdaterImpl.java [line: 613] would be good to fix this issue.
{code:java}
if (!isContainerRecentlyStopped(containerId)) {
  pendingCompletedContainers.put(containerId, containerStatus);
}
{code}
Completed containers will be cached for 10 minutes (the default value) until the entry times out or a heartbeat response is received. And a 10-minute cache for completed containers is long enough for retrying requests through the heartbeat (the default interval is 10s).{quote}
I guess completed containers would end up being sent multiple times if we just remove the check at line 613. What about this? We keep pendingCompletedContainers.clear() unchanged. We remove the completed containers carried in the heartbeat request from the cache (recentlyStoppedContainers) before sending the heartbeat, then add the acked containers back to the cache afterwards. At a high level, the cache is only updated once the heartbeat succeeds with a response.
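To make the proposal above concrete, here is a minimal, self-contained sketch. The class and method names are hypothetical and only illustrate the idea; this is not the actual NodeStatusUpdaterImpl code. The completed containers being reported are pulled out of recentlyStoppedContainers before the heartbeat is sent, and are only put back once the RM has acked the heartbeat, so a lost response leaves them reportable in the next request.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: treat recentlyStoppedContainers as "acked by RM"
// state, so it is updated only after a successful heartbeat round trip.
public class HeartbeatCacheSketch {
    // containerId -> time the entry expires (simplified from the NM cache)
    private final Map<String, Long> recentlyStoppedContainers = new LinkedHashMap<>();
    private final long trackDurationMs = 600_000L; // 10 minutes, the default

    // Before sending: pull the containers we are about to report out of the
    // cache, so a lost response does not leave them marked as already reported.
    public Set<String> prepareHeartbeat(Set<String> completedInRequest) {
        Set<String> toReport = new LinkedHashSet<>();
        for (String id : completedInRequest) {
            recentlyStoppedContainers.remove(id);
            toReport.add(id);
        }
        return toReport;
    }

    // After the RM acked the heartbeat: now it is safe to remember the
    // containers as stopped, so they are not reported again.
    public void onHeartbeatAcked(Set<String> ackedCompleted, long nowMs) {
        for (String id : ackedCompleted) {
            recentlyStoppedContainers.put(id, nowMs + trackDurationMs);
        }
    }

    public boolean isContainerRecentlyStopped(String id) {
        return recentlyStoppedContainers.containsKey(id);
    }
}
```

The point of the sketch: if the heartbeat response is lost, onHeartbeatAcked never runs, so the completed container is still absent from the cache and will be picked up again by the next heartbeat.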
> MR job live lock caused by completed state container leak in heartbeat between node manager and RM
> --------------------------------------------------------------------------------------------------
>
>                 Key: YARN-10393
>                 URL: https://issues.apache.org/jira/browse/YARN-10393
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager, yarn
>    Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 3.4.0
>            Reporter: zhenzhao wang
>            Assignee: zhenzhao wang
>            Priority: Major
>
> This was a bug we had seen multiple times on Hadoop 2.6.2, and the following analysis is based on the core dump, logs, and code collected in 2017 with Hadoop 2.6.2. We hadn't seen it after 2.9 in our environment; however, that was because of the RPC retry policy change and other changes. There is still a possibility even with the current code, if I didn't miss anything.
>
> *High-level description:*
> We had seen a starving-mapper issue several times. The MR job got stuck in a live-lock state and couldn't make any progress. The queue was full, so the pending mapper couldn't get any resource to continue, and the application master failed to preempt the reducer, thus causing the job to be stuck. The reason the application master didn't preempt the reducer was that there was a leaked container among the assigned mappers: the node manager failed to report the completed container to the resource manager.
>
> *Detailed steps:*
> # Container_1501226097332_249991_01_000199 was assigned to attempt_1501226097332_249991_m_000095_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1501226097332_249991_01_000199 to attempt_1501226097332_249991_m_000095_0
> {code}
> # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1501226097332_249991_01_000199 transitioned from RUNNING to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1501226097332_249991_01_000199
> {code}
> # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 16:07:04,238. In fact, the heartbeat request was actually handled by the resource manager; however, the node manager failed to receive the response. Let's assume heartBeatResponseId=$hid in the node manager. According to our configuration, the next heartbeat would be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: Connection reset by peer; Host Details : local host is: ; destination host is: XXXXXXX
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1472)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>         at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
>         at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
>         at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>         at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
>         at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>         at java.io.FilterInputStream.read(FilterInputStream.java:133)
>         at java.io.FilterInputStream.read(FilterInputStream.java:133)
>         at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:513)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
>         at java.io.DataInputStream.readInt(DataInputStream.java:387)
>         at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1071)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:966)
> {code}
> # NodeStatusUpdaterImpl tried to send another heartbeat 10s after the last exception. Container_1501226097332_249991_01_000199 was added to the heartbeat request as a completed container for the first time. We can confirm this from the timestamp in recentlyStoppedContainers@NodeStatusUpdaterImpl.
> {code:java}
> java.util.LinkedHashMap$Entry@0x7889caca0
> http://localhost:7000/object/0x7889cacc8
> Value 1502209034239
> Time added to recently stopped:
> 1502208434239 = 1502209034239 - 600000
> $ date -d @1502208434
> Tue Aug 8 16:07:14 UTC 2017
> {code}
> # The RM treated the request as a duplicate heartbeat because $hid was the same as the heartBeatId it received last time. So it returned the last response generated from the previous heartbeat and didn't handle the request; in other words, it didn't add container_1501226097332_249991_01_000199 as a completed container to its own data structures.
> {code:java}
> // 3. Check if it's a 'fresh' heartbeat i.e. not duplicate heartbeat
> NodeHeartbeatResponse lastNodeHeartbeatResponse = rmNode.getLastNodeHeartBeatResponse();
> if (remoteNodeStatus.getResponseId() + 1 == lastNodeHeartbeatResponse.getResponseId()) {
>   LOG.info("Received duplicate heartbeat from node "
>       + rmNode.getNodeAddress() + " responseId="
>       + remoteNodeStatus.getResponseId());
>   return lastNodeHeartbeatResponse;
> {code}
> # The node manager received the response and cleared container_1501226097332_249991_01_000199 from pendingCompletedContainers@NodeStatusUpdaterImpl. However, container_1501226097332_249991_01_000199 was still in recentlyStoppedContainers@NodeStatusUpdaterImpl, so the node manager won't report container_1501226097332_249991_01_000199 again as a completed container in any later heartbeat request.
> {code:java}
> if (containerStatus.getState() == ContainerState.COMPLETE) {
>   if (isApplicationStopped(applicationId)) {
>     if (LOG.isDebugEnabled()) {
>       LOG.debug(applicationId + " is completing, " + " remove "
>           + containerId + " from NM context.");
>     }
>     context.getContainers().remove(containerId);
>     pendingCompletedContainers.put(containerId, containerStatus);
>   } else {
>     if (!isContainerRecentlyStopped(containerId)) {
>       pendingCompletedContainers.put(containerId, containerStatus);
>       // Adding to finished containers cache. Cache will keep it around at
>       // least for #durationToTrackStoppedContainers duration. In the
>       // subsequent call to stop the container will get removed from cache.
>       addCompletedContainer(containerId);
>     }
>   }
> } else {
>   containerStatuses.add(containerStatus);
> }
> {code}
> # The application master relies on getResources() to get completed containers and remove them from assignedRequests. As the container will never be reported to the RM by the node manager again, the application master can't remove the attempt associated with the container, and the preemption calculation returns false due to the following code, thus causing the starving mapper.
> {code:java}
> boolean preemptReducesIfNeeded() {
>   if (reduceResourceRequest.equals(Resources.none())) {
>     return false; // no reduces
>   }
>   if (assignedRequests.maps.size() > 0) {
>     // there are assigned mappers
>     return false;
>   }
> {code}
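The failure mode in steps 3–5 can be illustrated with a toy model of the RM-side duplicate check. The classes below are hypothetical and self-contained, not the real ResourceTrackerService code: when the response to heartbeat $hid is lost and the NM sends a new request with the same responseId carrying a newly completed container, the RM replays the cached response and the completed container is silently dropped.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the duplicate-heartbeat check quoted in step 5 above.
// Names are illustrative; this is not the actual RM implementation.
public class DuplicateHeartbeatDemo {
    public static class Response {
        public final int responseId;
        public Response(int responseId) { this.responseId = responseId; }
    }

    public static class ToyRM {
        private Response lastResponse = new Response(0);
        public final List<String> knownCompleted = new ArrayList<>();

        // Mirrors the RM check: a request whose responseId + 1 equals the last
        // response id is treated as a duplicate and is not processed at all,
        // so any completed containers it carries are ignored.
        public Response heartbeat(int responseId, List<String> completed) {
            if (responseId + 1 == lastResponse.responseId) {
                return lastResponse; // replay cached response, drop new state
            }
            knownCompleted.addAll(completed);
            lastResponse = new Response(responseId + 1);
            return lastResponse;
        }
    }
}
```

In this model, the first heartbeat is processed but its response is "lost"; the retried heartbeat with the same responseId, now carrying the completed container, is classified as a duplicate, so the RM never learns about the completion, which is exactly the leak described in the steps above.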