lujie created YARN-9238:
---------------------------

             Summary: An huge Data Race can make we get a wrong attempt by an appAttemptId
                 Key: YARN-9238
                 URL: https://issues.apache.org/jira/browse/YARN-9238
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: lujie

We have found a data race that can lead to an odd situation. See org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.OpportunisticAMSProcessor.allocate:

{code:java}
      // Allocate OPPORTUNISTIC containers.
171.  SchedulerApplicationAttempt appAttempt =
172.      ((AbstractYarnScheduler)rmContext.getScheduler())
173.          .getApplicationAttempt(appAttemptId);
174.
175.  OpportunisticContainerContext oppCtx =
176.      appAttempt.getOpportunisticContainerContext();
177.  oppCtx.updateNodeList(getLeastLoadedNodes());
{code}

Suppose the current AM (attempt id appattempt_0) crashes just before line 171. When lines 171~173 then look up the appAttempt for appattempt_0, the returned appAttempt should represent that AM. Instead, we found that it represents the new AM, whose attempt id is appattempt_1. The appAttempt for the new AM has not yet initialized its oppCtx, so an NPE happens at line 177:

{code:java}
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService$OpportunisticAMSProcessor.allocate(OpportunisticContainerAllocatorAMService.java:177)
    at org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
    at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:424)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
    at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2830)
{code}

We have found why a lookup with the old appattempt_0 returns the new appAttempt that represents the new AM. Below is the body of getApplicationAttempt, which is called at line 173:

{code:java}
399.  public T getApplicationAttempt(ApplicationAttemptId applicationAttemptId) {
400.    SchedulerApplication<T> app = applications.get(
401.        applicationAttemptId.getApplicationId());
402.    return app == null ? null : app.getCurrentAppAttempt();
403.  }
{code}

The lookup uses only the application id and ignores the attempt id it was given. When the old AM crashes, the CurrentAppAttempt of app is set to the new appAttempt that represents the new AM, so line 402 returns the new appAttempt. We should add a check that the returned appAttempt has the same id as the given id. A patch will follow soon.
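
The proposed check can be sketched as follows. This is only an illustration under simplifying assumptions, not the actual patch: AttemptLookupSketch, Attempt, App, Scheduler, getUnchecked and getChecked are hypothetical stand-ins, and none of them are Hadoop classes or methods. The sketch compares the attempt number of the current attempt with the requested one and returns null on a mismatch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Simplified, hypothetical model of the lookup described in this issue.
 * None of these types are real Hadoop classes; they only mirror its shape.
 */
public class AttemptLookupSketch {

  /** Stand-in for an application attempt; only the ids matter here. */
  static final class Attempt {
    final String appId;
    final int attemptNumber;

    Attempt(String appId, int attemptNumber) {
      this.appId = appId;
      this.attemptNumber = attemptNumber;
    }
  }

  /** Stand-in for a scheduler application that tracks its current attempt. */
  static final class App {
    private volatile Attempt current;

    App(Attempt first) {
      this.current = first;
    }

    Attempt getCurrent() {
      return current;
    }

    void setCurrent(Attempt next) {
      this.current = next;
    }
  }

  /** Stand-in for the scheduler's application table. */
  static final class Scheduler {
    private final Map<String, App> applications = new ConcurrentHashMap<>();

    void register(String appId, Attempt first) {
      applications.put(appId, new App(first));
    }

    /** Models an AM crash: a new attempt replaces the current one. */
    void failOver(String appId, Attempt next) {
      applications.get(appId).setCurrent(next);
    }

    /** Mirrors the reported behavior: the attempt number is ignored. */
    Attempt getUnchecked(String appId, int attemptNumber) {
      App app = applications.get(appId);
      return app == null ? null : app.getCurrent();
    }

    /** Sketch of the proposed check: return null on an id mismatch. */
    Attempt getChecked(String appId, int attemptNumber) {
      App app = applications.get(appId);
      if (app == null) {
        return null;
      }
      // Read the current attempt once so the comparison and the return
      // value refer to the same object even if a failover happens now.
      Attempt current = app.getCurrent();
      return (current != null && current.attemptNumber == attemptNumber)
          ? current : null;
    }
  }

  public static void main(String[] args) {
    Scheduler scheduler = new Scheduler();
    scheduler.register("app_1", new Attempt("app_1", 0));
    // The old AM crashes; attempt 1 becomes the current attempt.
    scheduler.failOver("app_1", new Attempt("app_1", 1));

    Attempt unchecked = scheduler.getUnchecked("app_1", 0);
    Attempt checked = scheduler.getChecked("app_1", 0);
    System.out.println("unchecked lookup of attempt 0 returned attempt "
        + unchecked.attemptNumber);
    System.out.println("checked lookup of attempt 0 returned " + checked);
  }
}
```

In this model the unchecked lookup for attempt 0 hands back attempt 1 after the failover, while the checked lookup returns null. A caller such as the code at lines 171~177 would then also need to handle a null appAttempt before dereferencing it.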