[ https://issues.apache.org/jira/browse/YARN-9238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754946#comment-16754946 ]
lujie commented on YARN-9238:
-----------------------------

Hi [~cheersyang], I have deleted the unused code and added a log message to indicate this case.

> We get a wrong attempt by an appAttemptId when AM crash at some point
> ----------------------------------------------------------------------
>
>                 Key: YARN-9238
>                 URL: https://issues.apache.org/jira/browse/YARN-9238
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: lujie
>            Assignee: lujie
>            Priority: Critical
>         Attachments: YARN-9238_1.patch, YARN-9238_2.patch, hadoop-test-resourcemanager-hadoop11.log
>
>
> We have found a data race that can lead to an odd situation.
> See org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.OpportunisticAMSProcessor.allocate {color:#ff0000}(code1){color}:
> {code:java}
> // Allocate OPPORTUNISTIC containers.
> 171. SchedulerApplicationAttempt appAttempt =
> 172.     ((AbstractYarnScheduler)rmContext.getScheduler())
> 173.         .getApplicationAttempt(appAttemptId);
> 174.
> 175. OpportunisticContainerContext oppCtx =
> 176.     appAttempt.getOpportunisticContainerContext();
> 177. oppCtx.updateNodeList(getLeastLoadedNodes());
> {code}
> If the current AM (attempt ID appattempt_0) crashes just before code1#171, and code1#171~173 then runs to look up the appAttempt by appattempt_0, the returned appAttempt should represent the current AM (appattempt_0). Instead, we found that the returned appAttempt represents the new AM, whose attempt ID is appattempt_1. This new appAttempt has not yet initialized its oppCtx, so an NPE occurs at code1#177.
> {code:java}
> java.lang.NullPointerException
>   at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService$OpportunisticAMSProcessor.allocate(OpportunisticContainerAllocatorAMService.java:177)
>   at org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
>   at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:424)
>   at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
>   at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2830)
> {code}
> So why does the old appAttempt disappear, and why do we look up by the old appattempt_0 but get the new appAttempt?
> We have found the reason. The code below ({color:#ff0000}code2{color}) is the body of getApplicationAttempt, which is called at code1#173:
> {code:java}
> 399. public T getApplicationAttempt(ApplicationAttemptId applicationAttemptId) {
> 400.   SchedulerApplication<T> app = applications.get(
> 401.       applicationAttemptId.getApplicationId());
> 402.   return app == null ? null : app.getCurrentAppAttempt();
> 403. }
> {code}
> When the old AM crashes, a new AM and a new appAttempt are created. The currentAttempt of the app is then set to the new appAttempt (see code3).
> So code2#402 returns the new appAttempt.
> If the AM crashes at the very start of the allocate function (code1), the bug does not occur because of ApplicationDoesNotExistInCacheException. If the AM crashes after code1, everything is also fine.
> We should add a check that the returned appAttempt has the same ID as the given ID.
> A patch is coming soon!
> {color:#ff0000}code3{color}
> {code:java}
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplication.setCurrentAppAttempt(T currentAttempt){
>   this.currentAttempt = currentAttempt;
> }
> {code}
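
The check proposed in the description can be illustrated outside Hadoop with a small, self-contained sketch. This is not the actual patch: the classes `Attempt` and `App`, the method `getMatchingAttempt`, and the string/int IDs are hypothetical stand-ins for SchedulerApplicationAttempt, SchedulerApplication, and ApplicationAttemptId. The sketch only assumes what code2 shows, namely that the lookup is keyed by application ID and returns whatever attempt is current.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class AttemptLookupSketch {

    /** Hypothetical stand-in for SchedulerApplicationAttempt. */
    static final class Attempt {
        final String appId;
        final int attemptNo;
        // Stand-in for the OpportunisticContainerContext; null until initialized.
        volatile Object oppCtx;

        Attempt(String appId, int attemptNo) {
            this.appId = appId;
            this.attemptNo = attemptNo;
        }
    }

    /** Hypothetical stand-in for SchedulerApplication. */
    static final class App {
        volatile Attempt currentAttempt;
    }

    static final Map<String, App> applications = new ConcurrentHashMap<>();

    /** Mirrors code2: the attempt number is ignored, only the app ID is used. */
    static Attempt getApplicationAttempt(String appId, int attemptNo) {
        App app = applications.get(appId);
        return app == null ? null : app.currentAttempt;
    }

    /** Guarded variant: returns null when the current attempt is not the requested one. */
    static Attempt getMatchingAttempt(String appId, int attemptNo) {
        Attempt attempt = getApplicationAttempt(appId, attemptNo);
        if (attempt == null || attempt.attemptNo != attemptNo) {
            return null;
        }
        return attempt;
    }

    public static void main(String[] args) {
        // Attempt 0 is running and has its context initialized.
        App app = new App();
        Attempt old = new Attempt("app_1", 0);
        old.oppCtx = new Object();
        app.currentAttempt = old;
        applications.put("app_1", app);

        // The AM crashes; attempt 1 becomes current before its context is set up.
        app.currentAttempt = new Attempt("app_1", 1);

        // A late request that still carries attempt 0 gets attempt 1 back.
        Attempt unguarded = getApplicationAttempt("app_1", 0);
        System.out.println("unguarded lookup returned attempt " + unguarded.attemptNo
            + ", oppCtx initialized: " + (unguarded.oppCtx != null));

        // The guarded lookup rejects the stale request instead.
        Attempt guarded = getMatchingAttempt("app_1", 0);
        System.out.println("guarded lookup returned: "
            + (guarded == null ? "null (stale request rejected)" : "attempt " + guarded.attemptNo));
    }
}
```

With a guard like this, a caller such as allocate can treat a null result as a stale request and log it, rather than dereferencing a context that was never initialized. How the real patch reports this case is not shown here.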