[ https://issues.apache.org/jira/browse/YARN-9238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
lujie updated YARN-9238:
------------------------
    Summary: We get a wrong attempt by an appAttemptId when AM crash at some point  (was: An Data Race can make we get a wrong attempt by an appAttemptId)

> We get a wrong attempt by an appAttemptId when AM crash at some point
> ----------------------------------------------------------------------
>
>                 Key: YARN-9238
>                 URL: https://issues.apache.org/jira/browse/YARN-9238
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: lujie
>            Priority: Critical
>
> We have found a data race that can lead to an odd situation.
> See
> org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.OpportunisticAMSProcessor.allocate:
> {code:java}
> // Allocate OPPORTUNISTIC containers.
> 171.  SchedulerApplicationAttempt appAttempt =
> 172.      ((AbstractYarnScheduler)rmContext.getScheduler())
> 173.          .getApplicationAttempt(appAttemptId);
> 174.
> 175.  OpportunisticContainerContext oppCtx =
> 176.      appAttempt.getOpportunisticContainerContext();
> 177.  oppCtx.updateNodeList(getLeastLoadedNodes());
> {code}
> If we crash the current AM (its attempt id is appattempt_0) just before
> line 171, then when lines 171~173 go on to look up the appAttempt by
> appattempt_0, the returned appAttempt should represent the current AM.
> But we found that the appAttempt represents the new AM, whose attempt id
> is appattempt_1. This appAttempt for the new AM has not yet initialized
> its oppCtx, so an NPE happens at line 177.
> {code:java}
> java.lang.NullPointerException
>   at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService$OpportunisticAMSProcessor.allocate(OpportunisticContainerAllocatorAMService.java:177)
>   at org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
>   at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:424)
>   at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
>   at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2830)
> {code}
> We have found why a lookup with the old appattempt_0 returns the new
> appAttempt that represents the new AM. Below is the body of
> getApplicationAttempt, which is called at line 173:
> {code:java}
> 399.  public T getApplicationAttempt(ApplicationAttemptId applicationAttemptId) {
> 400.    SchedulerApplication<T> app = applications.get(
> 401.        applicationAttemptId.getApplicationId());
> 402.    return app == null ? null : app.getCurrentAppAttempt();
> 403.  }
> {code}
> The lookup uses only the application id and ignores the attempt id. When
> the old AM crashes, the currentAppAttempt of the app is set to the new
> appAttempt that represents the new AM, so line 402 returns the new
> appAttempt.
> We should add a check that the returned appAttempt has the same id as the
> given id.
> A patch will come soon!
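
The check proposed in the issue can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the actual patch: the class AttemptLookupSketch, its Attempt type, and the methods startAttempt and getApplicationAttemptChecked are hypothetical stand-ins, not Hadoop APIs. It shows that a lookup keyed only by application id returns the newer attempt for a stale attempt id, while a guarded lookup returns null instead.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified stand-ins for the scheduler types. These are NOT the real Hadoop
// classes; every name here is hypothetical and chosen for illustration only.
class AttemptLookupSketch {

  /** An attempt, identified by its application id and attempt number. */
  static final class Attempt {
    final String appId;
    final int attemptNo;

    Attempt(String appId, int attemptNo) {
      this.appId = appId;
      this.attemptNo = attemptNo;
    }
  }

  // Like the scheduler's application map, this keeps only the CURRENT attempt
  // of each application, keyed by application id.
  private final Map<String, Attempt> currentAttempts = new ConcurrentHashMap<>();

  /** Starts a new attempt, replacing the previous one (as after an AM crash). */
  void startAttempt(String appId, int attemptNo) {
    currentAttempts.put(appId, new Attempt(appId, attemptNo));
  }

  /** Mirrors the quoted lookup: the attempt number in the request is ignored. */
  Attempt getApplicationAttempt(String appId, int attemptNo) {
    return currentAttempts.get(appId);
  }

  /** Guarded lookup: returns null unless the current attempt matches the id. */
  Attempt getApplicationAttemptChecked(String appId, int attemptNo) {
    Attempt current = currentAttempts.get(appId);
    return (current != null && current.attemptNo == attemptNo) ? current : null;
  }

  public static void main(String[] args) {
    AttemptLookupSketch scheduler = new AttemptLookupSketch();
    scheduler.startAttempt("app", 0);
    scheduler.startAttempt("app", 1); // attempt 0 crashed, attempt 1 took over

    // A late request that still carries the old attempt number 0:
    Attempt unguarded = scheduler.getApplicationAttempt("app", 0);
    Attempt guarded = scheduler.getApplicationAttemptChecked("app", 0);

    System.out.println("unguarded lookup returned attempt " + unguarded.attemptNo);
    System.out.println("guarded lookup returned " + guarded);
  }
}
```

With a guard of this kind the caller still has to handle the null result, for example by rejecting the allocate call from the stale attempt. How the real fix handles that case is left to the patch.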