[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

Jim Brennan (Jira) Tue, 01 Sep 2020 09:07:18 -0700


    [ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188591#comment-17188591
 ]


Jim Brennan commented on YARN-10393:
------------------------------------

{quote}BTW, the normal calculation between requestID and responseID would be 
 responseID == requestID + 1
{noformat}
// code placeholder
if (responseId != lastHeartbeatID + 1) {
   pendingCompletedContainers.clear();
}
{noformat}
{quote}
Good point [~yuanbo]. But this is the reverse of what we want. We want to do 
the clear() when we get the expected responseID, and we want to skip the 
clear() when we don't get the expected responseId.

But your comment made me realize that my suggestion won't work as is. The NM's 
lastHeartBeatID is only updated when we get a response back from the RM. On the 
missed heartbeat, we don't update it. In the case of a duplicate, the RM 
re-sends the last response again, which will still be lastHeartBeatID + 1. I 
think we need to base this on a flag that we set when we get an exception from 
the request attempt. So something like:
{noformat}
            logAggregationReportForAppsTempList.clear();
            lastHeartbeatID = response.getResponseId();
            // Don't clear if the previous heartbeat was missed,
            if (!missedHearbeat) {
              pendingCompletedContainers.clear();
            }
            missedHearbeat = false;
...
        } catch (Exception e) {

          // TODO Better error handling. Thread can die with the rest of the
          // NM still running.
          LOG.error("Caught exception in status-updater", e);
          missedHearbeat = true;
        } finally {
{noformat}
On the missed hearbeat, we don't clear {{pendingCompletedContainers}} because 
of the exception, and by adding this change, we will prevent clearing it until 
the heartbeat after we get a response back from the RM. I think the downside is 
that if the RM never got the missed heartbeat (if it failed on the send vs the 
response), then the one after the missed on might not be a duplicate, and we 
will potentially re-send completed containers to the RM on the next hearbeat. 
But that should happen only once in this case. I don't think there is any way 
for the NM to tell if the RM's response is a duplicate or not.

What do you think?

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --------------------------------------------------------------------------------------------------
>
>                 Key: YARN-10393
>                 URL: https://issues.apache.org/jira/browse/YARN-10393
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager, yarn
>    Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>            Reporter: zhenzhao wang
>            Assignee: zhenzhao wang
>            Priority: Major
>
> This was a bug we had seen multiple times on Hadoop 2.6.2. And the following 
> analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.6.2. 
> We hadn't seen it after 2.9 in our env. However, it was because of the RPC 
> retry policy change and other changes. There's still a possibility even with 
> the current code if I didn't miss anything.
> *High-level description:*
>  We had seen a starving mapper issue several times. The MR job stuck in a 
> live lock state and couldn't make any progress. The queue is full so the 
> pending mapper can’t get any resource to continue, and the application master 
> failed to preempt the reducer, thus causing the job to be stuck. The reason 
> why the application master didn’t preempt the reducer was that there was a 
> leaked container in assigned mappers. The node manager failed to report the 
> completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_000095_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_000095_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater go an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. In fact, the heartbeat request is actually handled by resource 
> manager, however, the node manager failed to receive the response. Let’s 
> assume the heartBeatResponseId=$hid in node manager. According to our current 
> configuration, next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXXXXXX
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1472)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>         at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
>         at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
>         at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>         at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
>         at 
> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
>         at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>         at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>         at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>         at java.io.FilterInputStream.read(FilterInputStream.java:133)
>         at java.io.FilterInputStream.read(FilterInputStream.java:133)
>         at 
> org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:513)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
>         at java.io.DataInputStream.readInt(DataInputStream.java:387)
>         at 
> org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1071)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:966)
> {code}
>  # NodeStatusUpdaterImpl try to send another heart beart 10s after last 
> exception. Container_1501226097332_249991_01_000199 was added to the 
> heartbeat request as a completed container for the first time. We can confirm 
> this point from the timestamp in 
> recentlyStoppedContainers@NodeStatusUpdaterImpl.
> {code:java}
> Java.util.LinkedHashMap$Entry@0x7889caca0
> http://localhost:7000/object/0x7889cacc8
> Value 1502209034239 
> Time add to recently stopped 
> 1502208434239 = 1502209034239 - 600000
> $date -d@1502208434
> Tue Aug  8 16:07:14 UTC 2017
> {code}
>  # RM thought the request as duplication heartbeat as the $hid is the same as 
> the heartBeatId it received last time. So it returned the last response 
> generated from previous heartbeat and didn’t handle the request, in other 
> words, it didn't add container_1501226097332_249991_01_000199 as a completed 
> container to its own data structure.
> {code:java}
> // 3. Check if it's a 'fresh' heartbeat i.e. not duplicate heartbeat
> NodeHeartbeatResponse lastNodeHeartbeatResponse = 
> rmNode.getLastNodeHeartBeatResponse();
> if (remoteNodeStatus.getResponseId() + 1 == lastNodeHeartbeatResponse
>    .getResponseId()) {
>  LOG.info("Received duplicate heartbeat from node "
>      + rmNode.getNodeAddress()+ " responseId=" + 
> remoteNodeStatus.getResponseId());
>  return lastNodeHeartbeatResponse;
> {code}
>  # Node manager received the response and clear 
> container_1501226097332_249991_01_000199 from 
> pendingCompletedContainers@nodeStatusUpdaterImpl. However, 
> container_1501226097332_249991_01_000199 is still in 
> recentlyStoppedContainers@NodeStautsUpdaterImp. So the container won’t report 
> container_1501226097332_249991_01_000199 again as a complete container in the 
> heartbeat request.
> {code:java}
>              if (containerStatus.getState() == ContainerState.COMPLETE) {
>  if (isApplicationStopped(applicationId)) {
>    if (LOG.isDebugEnabled()) {
>      LOG.debug(applicationId + " is completing, " + " remove "
>          + containerId + " from NM context.");
>    }
>    context.getContainers().remove(containerId);
>    pendingCompletedContainers.put(containerId, containerStatus);
>  } else {
>    if (!isContainerRecentlyStopped(containerId)) {
>      pendingCompletedContainers.put(containerId, containerStatus);
>      // Adding to finished containers cache. Cache will keep it around at
>      // least for #durationToTrackStoppedContainers duration. In the
>      // subsequent call to stop the container will get removed from cache.
>      addCompletedContainer(containerId);
>    }
>  }
> } else {
>  containerStatuses.add(containerStatus);
> }
> {code}
>  # The application master relies on getResourses() to get completed 
> containers and remove them from assignedRequests. As the container will never 
> be reported to the RM by nodemanger again, thus application master can’t 
> remove the attempt associated with the container. And the preemption 
> calculation returns false due to following codes, thus causing the starving 
> mapper.
> {code:java}
> boolean preemptReducesIfNeeded() {
>  if (reduceResourceRequest.equals(Resources.none())) {
>    return false; // no reduces
>  }
>  if (assignedRequests.maps.size() > 0) {
>    // there are assigned mappers
>    return false;
>  }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

Reply via email to