[ https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204370#comment-17204370 ]
Hadoop QA commented on YARN-10393: ---------------------------------- | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Logfile || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 45s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} {color} | {color:green} 0m 0s{color} | {color:green}test4tests{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 4s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 22s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 21s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 37s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 49s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 8s{color} | {color:green}{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 38s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 36s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 30s{color} | {color:blue}{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 28s{color} | {color:green}{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 43s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 14s{color} | {color:green}{color} | {color:green} the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 14s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 8s{color} | {color:green}{color} | {color:green} the patch passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 8s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 29s{color} | {color:orange}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/203/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 1 new + 120 unchanged - 0 fixed = 121 total (was 120) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 39s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 52s{color} | {color:green}{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s{color} | {color:green}{color} | {color:green} the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s{color} | {color:green}{color} | {color:green} the patch passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 36s{color} | {color:green}{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 23m 20s{color} | {color:green}{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 33s{color} | {color:green}{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 92m 9s{color} | {color:black}{color} | {color:black}{color} | \\ \\ || Subsystem || Report/Notes || | Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/203/artifact/out/Dockerfile | | JIRA Issue | YARN-10393 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13012322/YARN-10393.001.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 6820fa580055 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | personality/hadoop.sh | | git revision | trunk / a7a1f1541afe8ae9f98817d45bae2f30f78c1271 | | Default Java | Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 | | Test Results | https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/203/testReport/ | | Max. process+thread count | 452 (vs. ulimit of 5500) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager | | Console output | https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/203/console | | versions | git=2.17.1 maven=3.6.0 findbugs=4.0.6 | | Powered by | Apache Yetus 0.13.0-SNAPSHOT https://yetus.apache.org | This message was automatically generated. > MR job live lock caused by completed state container leak in heartbeat > between node manager and RM > -------------------------------------------------------------------------------------------------- > > Key: YARN-10393 > URL: https://issues.apache.org/jira/browse/YARN-10393 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, yarn > Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, > 3.4.0 > Reporter: zhenzhao wang > Assignee: Jim Brennan > Priority: Major > Attachments: YARN-10393.001.patch, YARN-10393.draft.2.patch, > YARN-10393.draft.patch > > > This was a bug we had seen multiple times on Hadoop 2.6.2. And the following > analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.6.2. > We hadn't seen it after 2.9 in our env. However, it was because of the RPC > retry policy change and other changes. There's still a possibility even with > the current code if I didn't miss anything. > *High-level description:* > We had seen a starving mapper issue several times. The MR job stuck in a > live lock state and couldn't make any progress. The queue is full so the > pending mapper can’t get any resource to continue, and the application master > failed to preempt the reducer, thus causing the job to be stuck. The reason > why the application master didn’t preempt the reducer was that there was a > leaked container in assigned mappers. The node manager failed to report the > completed container to the resource manager. > *Detailed steps:* > > # Container_1501226097332_249991_01_000199 was assigned to > attempt_1501226097332_249991_m_000095_0 on 2017-08-08 16:00:00,417. > {code:java} > appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned > container container_1501226097332_249991_01_000199 to > attempt_1501226097332_249991_m_000095_0 > {code} > # The container finished on 2017-08-08 16:02:53,313. > {code:java} > yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: > Container container_1501226097332_249991_01_000199 transitioned from RUNNING > to EXITED_WITH_SUCCESS > yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Cleaning up container container_1501226097332_249991_01_000199 > {code} > # The NodeStatusUpdater go an exception in the heartbeat on 2017-08-08 > 16:07:04,238. In fact, the heartbeat request is actually handled by resource > manager, however, the node manager failed to receive the response. Let’s > assume the heartBeatResponseId=$hid in node manager. According to our current > configuration, next heartbeat will be 10s later. > {code:java} > 2017-08-08 16:07:04,238 ERROR > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught > exception in status-updater > java.io.IOException: Failed on local exception: java.io.IOException: > Connection reset by peer; Host Details : local host is: ; destination host > is: XXXXXXX > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1472) > at org.apache.hadoop.ipc.Client.call(Client.java:1399) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source) > at > org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80) > at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source) > at > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Connection reset by peer > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:197) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384) > at > org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57) > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142) > at > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) > at > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at > org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:513) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:235) > at java.io.BufferedInputStream.read(BufferedInputStream.java:254) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at > org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1071) > at org.apache.hadoop.ipc.Client$Connection.run(Client.java:966) > {code} > # NodeStatusUpdaterImpl try to send another heart beart 10s after last > exception. Container_1501226097332_249991_01_000199 was added to the > heartbeat request as a completed container for the first time. We can confirm > this point from the timestamp in > recentlyStoppedContainers@NodeStatusUpdaterImpl. > {code:java} > Java.util.LinkedHashMap$Entry@0x7889caca0 > http://localhost:7000/object/0x7889cacc8 > Value 1502209034239 > Time add to recently stopped > 1502208434239 = 1502209034239 - 600000 > $date -d@1502208434 > Tue Aug 8 16:07:14 UTC 2017 > {code} > # RM thought the request as duplication heartbeat as the $hid is the same as > the heartBeatId it received last time. So it returned the last response > generated from previous heartbeat and didn’t handle the request, in other > words, it didn't add container_1501226097332_249991_01_000199 as a completed > container to its own data structure. > {code:java} > // 3. Check if it's a 'fresh' heartbeat i.e. not duplicate heartbeat > NodeHeartbeatResponse lastNodeHeartbeatResponse = > rmNode.getLastNodeHeartBeatResponse(); > if (remoteNodeStatus.getResponseId() + 1 == lastNodeHeartbeatResponse > .getResponseId()) { > LOG.info("Received duplicate heartbeat from node " > + rmNode.getNodeAddress()+ " responseId=" + > remoteNodeStatus.getResponseId()); > return lastNodeHeartbeatResponse; > {code} > # Node manager received the response and clear > container_1501226097332_249991_01_000199 from > pendingCompletedContainers@nodeStatusUpdaterImpl. However, > container_1501226097332_249991_01_000199 is still in > recentlyStoppedContainers@NodeStautsUpdaterImp. So the container won’t report > container_1501226097332_249991_01_000199 again as a complete container in the > heartbeat request. > {code:java} > if (containerStatus.getState() == ContainerState.COMPLETE) { > if (isApplicationStopped(applicationId)) { > if (LOG.isDebugEnabled()) { > LOG.debug(applicationId + " is completing, " + " remove " > + containerId + " from NM context."); > } > context.getContainers().remove(containerId); > pendingCompletedContainers.put(containerId, containerStatus); > } else { > if (!isContainerRecentlyStopped(containerId)) { > pendingCompletedContainers.put(containerId, containerStatus); > // Adding to finished containers cache. Cache will keep it around at > // least for #durationToTrackStoppedContainers duration. In the > // subsequent call to stop the container will get removed from cache. > addCompletedContainer(containerId); > } > } > } else { > containerStatuses.add(containerStatus); > } > {code} > # The application master relies on getResourses() to get completed > containers and remove them from assignedRequests. As the container will never > be reported to the RM by nodemanger again, thus application master can’t > remove the attempt associated with the container. And the preemption > calculation returns false due to following codes, thus causing the starving > mapper. > {code:java} > boolean preemptReducesIfNeeded() { > if (reduceResourceRequest.equals(Resources.none())) { > return false; // no reduces > } > if (assignedRequests.maps.size() > 0) { > // there are assigned mappers > return false; > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org