[jira] [Commented] (YARN-9238) We get a wrong attempt by an appAttemptId when AM crash at some point

lujie (JIRA) Tue, 29 Jan 2019 04:34:19 -0800


    [ 
https://issues.apache.org/jira/browse/YARN-9238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754946#comment-16754946
 ]


lujie commented on YARN-9238:
-----------------------------

Hi [~cheersyang],

I have deleted the unused code and added a log message to indicate this case.

> We get a wrong attempt  by an appAttemptId when AM crash at some point
> ----------------------------------------------------------------------
>
>                 Key: YARN-9238
>                 URL: https://issues.apache.org/jira/browse/YARN-9238
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: lujie
>            Assignee: lujie
>            Priority: Critical
>         Attachments: YARN-9238_1.patch, YARN-9238_2.patch, 
> hadoop-test-resourcemanager-hadoop11.log
>
>
> We have found a data race that can lead to an odd situation.
> See 
> org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.OpportunisticAMSProcessor.allocate{color:#ff0000}:(code1){color}
> {code:java}
>      // Allocate OPPORTUNISTIC containers.
> 171.  SchedulerApplicationAttempt appAttempt =
> 172.    ((AbstractYarnScheduler)rmContext.getScheduler())
> 173.      .getApplicationAttempt(appAttemptId);
> 174.
> 175.  OpportunisticContainerContext oppCtx =
> 176.    appAttempt.getOpportunisticContainerContext();
> 177.  oppCtx.updateNodeList(getLeastLoadedNodes());
> {code}
> If we crash the current AM (its attempt ID is appattempt_0) just before 
> code1#171, then code1#171~173 still executes and looks up the appAttempt by 
> appattempt_0, so the obtained appAttempt should represent the current AM. But 
> we found that the obtained appAttempt represents the new AM, and its 
> attempt ID is appattempt_1. This appAttempt has not yet initialized its oppCtx, 
> so an NPE happens at line code1#177.
> {code:java}
> java.lang.NullPointerException
> at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService$OpportunisticAMSProcessor.allocate(OpportunisticContainerAllocatorAMService.java:177)
> at org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
> at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:424)
> at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2830)
> {code}
> So why does the old appAttempt disappear, and why do we look up the old 
> appattempt_0 but get the new appAttempt?
> We have found the reason. The code below ({color:#ff0000}code2{color}) is the 
> body of getApplicationAttempt, which is called at code1#173:
> {code:java}
> 399. public T getApplicationAttempt(ApplicationAttemptId applicationAttemptId) {
> 400.   SchedulerApplication<T> app = applications.get(
> 401.      applicationAttemptId.getApplicationId());
> 402.   return app == null ? null : app.getCurrentAppAttempt();
> 403. }
> {code}
> When the old AM crashes, a new AM and a new appAttempt are created. The 
> currentAttempt of the app is then set to the new appAttempt (see code3), so 
> code2#402 returns the new appAttempt. 
> If the AM crashes at the head of the allocate function (code1), the bug does 
> not occur, due to ApplicationDoesNotExistInCacheException. If the AM crashes 
> after code1, everything is also OK.
> We should add a check that the obtained appAttempt has the same ID as the 
> given ID.
> A patch is coming soon!
> {color:#ff0000}code3{color}
> {code:java}
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplication.setCurrentAppAttempt(T currentAttempt) {
>     this.currentAttempt = currentAttempt;
> }
> {code}
>  
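The lookup behavior described in the issue, and the proposed ID check, can be illustrated with a minimal, self-contained sketch. Everything here is a hypothetical stand-in, not the real YARN API and not the contents of the attached patches: `AttemptLookupSketch`, `Attempt`, `App`, and `getApplicationAttemptChecked` are invented names, attempt IDs are reduced to an integer, and returning null on a mismatch is only one assumed way a caller could detect the stale ID.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model of the scheduler lookup; not the real YARN classes.
final class AttemptLookupSketch {

  /** Stand-in for SchedulerApplicationAttempt: only carries its identity. */
  static final class Attempt {
    final String appId;
    final int attemptNo;

    Attempt(String appId, int attemptNo) {
      this.appId = appId;
      this.attemptNo = attemptNo;
    }
  }

  /** Stand-in for SchedulerApplication: remembers only the current attempt. */
  static final class App {
    private volatile Attempt currentAttempt;

    void setCurrentAppAttempt(Attempt attempt) {
      this.currentAttempt = attempt;
    }

    Attempt getCurrentAppAttempt() {
      return currentAttempt;
    }
  }

  private final Map<String, App> applications = new ConcurrentHashMap<>();

  /** Registers a new attempt, replacing the app's current one (as in code3). */
  void addAttempt(String appId, int attemptNo) {
    applications.computeIfAbsent(appId, k -> new App())
        .setCurrentAppAttempt(new Attempt(appId, attemptNo));
  }

  /** Mirrors code2: the attempt number is ignored, the current attempt is returned. */
  Attempt getApplicationAttempt(String appId, int attemptNo) {
    App app = applications.get(appId);
    return app == null ? null : app.getCurrentAppAttempt();
  }

  /** Assumed form of the proposed check: reject an attempt whose ID differs. */
  Attempt getApplicationAttemptChecked(String appId, int attemptNo) {
    Attempt current = getApplicationAttempt(appId, attemptNo);
    if (current == null || current.attemptNo != attemptNo) {
      return null; // the requested attempt is gone or has been replaced
    }
    return current;
  }

  public static void main(String[] args) {
    AttemptLookupSketch scheduler = new AttemptLookupSketch();
    scheduler.addAttempt("app_1", 0);
    scheduler.addAttempt("app_1", 1); // the AM of attempt 0 crashed, attempt 1 took over

    Attempt unchecked = scheduler.getApplicationAttempt("app_1", 0);
    Attempt checked = scheduler.getApplicationAttemptChecked("app_1", 0);
    System.out.println("unchecked lookup returned attempt " + unchecked.attemptNo);
    System.out.println("checked lookup returned " + checked);
  }
}
```

In this model the unchecked lookup for attempt 0 hands back attempt 1, which corresponds to the situation that leads to the NPE at code1#177, while the checked lookup returns null so the caller can log the mismatch and stop before touching the new attempt's uninitialized state.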



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
