[ https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17207912#comment-17207912 ]
Adam Antal commented on YARN-10393:
-----------------------------------

Committed to trunk, I will cherry-pick this to other branches now.

> MR job live lock caused by completed state container leak in heartbeat between node manager and RM
> ---------------------------------------------------------------------------------------------------
>
>                 Key: YARN-10393
>                 URL: https://issues.apache.org/jira/browse/YARN-10393
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager, yarn
>    Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 3.4.0
>            Reporter: zhenzhao wang
>            Assignee: Jim Brennan
>            Priority: Major
>         Attachments: YARN-10393.001.patch, YARN-10393.002.patch, YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This was a bug we had seen multiple times on Hadoop 2.6.2. The following analysis is based on the core dump, logs, and code from 2017, with Hadoop 2.6.2. We haven't seen it after 2.9 in our environment, but that is because of the RPC retry policy change and other changes; unless I missed something, it is still possible even with the current code.
>
> *High-level description:*
> We had seen a starving-mapper issue several times: the MR job got stuck in a live lock and couldn't make any progress. The queue was full, so the pending mapper couldn't get any resources to continue, and the application master failed to preempt a reducer, leaving the job stuck. The application master didn't preempt a reducer because a container had leaked among the assigned mappers: the node manager failed to report the completed container to the resource manager.
>
> *Detailed steps:*
> # Container_1501226097332_249991_01_000199 was assigned to attempt_1501226097332_249991_m_000095_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1501226097332_249991_01_000199 to attempt_1501226097332_249991_m_000095_0
> {code}
> # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1501226097332_249991_01_000199 transitioned from RUNNING to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1501226097332_249991_01_000199
> {code}
> # The NodeStatusUpdater got an exception in the heartbeat at 2017-08-08 16:07:04,238. In fact, the heartbeat request was actually handled by the resource manager; the node manager just failed to receive the response. Let's say the node manager's heartBeatResponseId was $hid at this point. With our configuration, the next heartbeat would be sent 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: Connection reset by peer; Host Details : local host is: ; destination host is: XXXXXXX
>     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1472)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>     at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
>     at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
>     at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>     at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Connection reset by peer
>     at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>     at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
>     at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
>     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>     at java.io.FilterInputStream.read(FilterInputStream.java:133)
>     at java.io.FilterInputStream.read(FilterInputStream.java:133)
>     at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:513)
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
>     at java.io.DataInputStream.readInt(DataInputStream.java:387)
>     at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1071)
>     at org.apache.hadoop.ipc.Client$Connection.run(Client.java:966)
> {code}
> # NodeStatusUpdaterImpl tried to send another heartbeat 10s after the last exception. Container_1501226097332_249991_01_000199 was added to the heartbeat request as a completed container for the first time. We can confirm this from the timestamp in recentlyStoppedContainers@NodeStatusUpdaterImpl (the arithmetic is spelled out in the sketch right after this step).
> {code:java}
> Java.util.LinkedHashMap$Entry@0x7889caca0
> http://localhost:7000/object/0x7889cacc8
> Value 1502209034239
> Time added to recently stopped:
> 1502208434239 = 1502209034239 - 600000
> $ date -d@1502208434
> Tue Aug 8 16:07:14 UTC 2017
> {code}
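> To double-check that arithmetic: the value stored in recentlyStoppedContainers is the time the container was added plus the durationToTrackStoppedContainers window (600000 ms here), so subtracting the window recovers the moment the container was first reported as completed. Below is a minimal sketch of that check, assuming only the 600000 ms window used in the subtraction above; the class name is made up and nothing in it is Hadoop code.
> {code:java}
> import java.time.Instant;
>
> // Hypothetical helper, just to verify the heap-dump arithmetic in the step above.
> public class StoppedContainerTimeCheck {
>     // durationToTrackStoppedContainers window (600000 ms in this cluster)
>     private static final long TRACK_WINDOW_MS = 600_000L;
>
>     public static void main(String[] args) {
>         // Value stored in recentlyStoppedContainers for container_1501226097332_249991_01_000199
>         long storedValueMs = 1_502_209_034_239L;
>         // The stored value is "time added + tracking window", so:
>         long addedAtMs = storedValueMs - TRACK_WINDOW_MS;   // 1502208434239
>         System.out.println("Added to recentlyStoppedContainers at " + Instant.ofEpochMilli(addedAtMs));
>         // Prints 2017-08-08T16:07:14.239Z, about 10s after the 16:07:04 heartbeat
>         // exception, i.e. exactly when the retried heartbeat went out.
>     }
> }
> {code}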
> # The RM treated the request as a duplicate heartbeat, because $hid was the same as the heartBeatId it had received last time. It therefore returned the response generated for the previous heartbeat and did not process the request; in other words, it did not add container_1501226097332_249991_01_000199 as a completed container to its own data structures.
> {code:java}
> // 3. Check if it's a 'fresh' heartbeat i.e. not duplicate heartbeat
> NodeHeartbeatResponse lastNodeHeartbeatResponse = rmNode.getLastNodeHeartBeatResponse();
> if (remoteNodeStatus.getResponseId() + 1 == lastNodeHeartbeatResponse
>     .getResponseId()) {
>   LOG.info("Received duplicate heartbeat from node "
>       + rmNode.getNodeAddress() + " responseId=" + remoteNodeStatus.getResponseId());
>   return lastNodeHeartbeatResponse;
> {code}
> # The node manager received the response and cleared container_1501226097332_249991_01_000199 from pendingCompletedContainers@NodeStatusUpdaterImpl. However, the container was still in recentlyStoppedContainers@NodeStatusUpdaterImpl, so the node manager would never report container_1501226097332_249991_01_000199 as a completed container in a heartbeat request again.
> {code:java}
> if (containerStatus.getState() == ContainerState.COMPLETE) {
>   if (isApplicationStopped(applicationId)) {
>     if (LOG.isDebugEnabled()) {
>       LOG.debug(applicationId + " is completing, " + " remove "
>           + containerId + " from NM context.");
>     }
>     context.getContainers().remove(containerId);
>     pendingCompletedContainers.put(containerId, containerStatus);
>   } else {
>     if (!isContainerRecentlyStopped(containerId)) {
>       pendingCompletedContainers.put(containerId, containerStatus);
>       // Adding to finished containers cache. Cache will keep it around at
>       // least for #durationToTrackStoppedContainers duration. In the
>       // subsequent call to stop the container will get removed from cache.
>       addCompletedContainer(containerId);
>     }
>   }
> } else {
>   containerStatuses.add(containerStatus);
> }
> {code}
> # The application master relies on getResources() to learn about completed containers and remove them from assignedRequests. Since the node manager would never report the container to the RM again, the application master could not remove the attempt associated with it, and the preemption check kept returning false because of the following code, which caused the starving mapper. The whole sequence is condensed into the toy model after this list.
> {code:java}
> boolean preemptReducesIfNeeded() {
>   if (reduceResourceRequest.equals(Resources.none())) {
>     return false; // no reduces
>   }
>   if (assignedRequests.maps.size() > 0) {
>     // there are assigned mappers
>     return false;
>   }
> {code}
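> Putting steps 3-7 together: one lost heartbeat response plus the duplicate-responseId check is all it takes. The toy model below is a simplified stand-in (made-up classes, not the real NodeStatusUpdaterImpl or RM code) that replays the sequence: the response to heartbeat N is lost, the retry reuses responseId N and carries the completed container, the RM discards it as a duplicate, and the NM then clears pendingCompletedContainers while recentlyStoppedContainers keeps the container from ever being reported again.
> {code:java}
> import java.util.ArrayList;
> import java.util.HashSet;
> import java.util.List;
> import java.util.Set;
>
> public class HeartbeatLeakDemo {
>
>     /** Stand-in for the RM side of heartbeat handling. */
>     static class Rm {
>         int lastResponseId = 0;                        // responseId of the last response built
>         final Set<String> completedSeen = new HashSet<>();
>
>         int heartbeat(int nmResponseId, List<String> completed) {
>             if (nmResponseId + 1 == lastResponseId) {  // the duplicate check quoted in step 5
>                 return lastResponseId;                 // cached response, nothing processed
>             }
>             completedSeen.addAll(completed);           // normal path: record completed containers
>             lastResponseId = nmResponseId + 1;
>             return lastResponseId;
>         }
>     }
>
>     /** Stand-in for the NM side of building the node status. */
>     static class Nm {
>         int lastHeartBeatId = 0;
>         final Set<String> recentlyStopped = new HashSet<>();
>         final List<String> pendingCompleted = new ArrayList<>();
>
>         List<String> buildStatus(String completedContainer) {
>             if (completedContainer != null && !recentlyStopped.contains(completedContainer)) {
>                 pendingCompleted.add(completedContainer);
>                 recentlyStopped.add(completedContainer);   // addCompletedContainer() in step 6
>             }
>             return new ArrayList<>(pendingCompleted);
>         }
>     }
>
>     public static void main(String[] args) {
>         Rm rm = new Rm();
>         Nm nm = new Nm();
>         String c = "container_1501226097332_249991_01_000199";
>
>         // Heartbeat N: processed by the RM, but the response is lost (step 3),
>         // so the NM keeps its old responseId.
>         rm.heartbeat(nm.lastHeartBeatId, nm.buildStatus(null));
>
>         // Heartbeat N retried 10s later, now carrying the completed container (step 4).
>         // The RM sees the same responseId and discards it as a duplicate (step 5).
>         int resp = rm.heartbeat(nm.lastHeartBeatId, nm.buildStatus(c));
>         nm.lastHeartBeatId = resp;
>         nm.pendingCompleted.clear();                   // NM clears pending once it gets a response (step 6)
>
>         // Later heartbeats: the container sits in recentlyStopped, so it is never re-reported.
>         rm.heartbeat(nm.lastHeartBeatId, nm.buildStatus(c));
>
>         // Prints false: the RM never learns the container completed, the AM keeps the map
>         // attempt in assignedRequests, and preemptReducesIfNeeded() keeps returning false (step 7).
>         System.out.println("RM saw completion? " + rm.completedSeen.contains(c));
>     }
> }
> {code}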