[jira] [Updated] (YARN-9238) We get a wrong attempt by an appAttemptId when AM crash at some point

lujie (JIRA) Fri, 25 Jan 2019 05:30:16 -0800


     [ 
https://issues.apache.org/jira/browse/YARN-9238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


lujie updated YARN-9238:
------------------------
    Summary: We get a wrong attempt  by an appAttemptId when AM crash at some 
point  (was: An Data Race can make we get a wrong attempt  by an appAttemptId)

> We get a wrong attempt  by an appAttemptId when AM crash at some point
> ----------------------------------------------------------------------
>
>                 Key: YARN-9238
>                 URL: https://issues.apache.org/jira/browse/YARN-9238
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: lujie
>            Priority: Critical
>
> We have foud a data race that can make an odd situation.
> See 
> org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.OpportunisticAMSProcessor.allocate:
> {code:java}
>      // Allocate OPPORTUNISTIC containers.
> 171.  SchedulerApplicationAttempt appAttempt =
> 172.    ((AbstractYarnScheduler)rmContext.getScheduler())
> 173.      .getApplicationAttempt(appAttemptId);
> 174.
> 175.  OpportunisticContainerContext oppCtx =
> 176.  appAttempt.getOpportunisticContainerContext();
> 177.  oppCtx.updateNodeList(getLeastLoadedNodes());
> {code}
> if we just crash the current AM(its attemptid is appattempt_0)just before 
> line171, when the code of line 171~173 continue to execute to get the 
> appAttempt by appattempt_0, the appAttempt  should represents the  currenct 
> AM. But we found that the  appAttempt  represents to  the new AM and its 
> attempid is appattempt_1. This appAttempt that represents  the new AM  has 
> not init its oppCtx, so NPE happnes at line 177.
> {code:java}
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService$OpportunisticAMSProcessor.allocate(OpportunisticContainerAllocatorAMService.java:177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:424)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2830)
> {code}
> We have found the reason about we use old appattempt_0 but get the new 
> appAttempt that represent to new AM. Below is the function body of 
> getApplicationAttempt  at line 173
> {code:java}
> 399. public T getApplicationAttempt(ApplicationAttemptId 
> applicationAttemptId) {
> 400   SchedulerApplication<T> app = applications.get(
> 401      applicationAttemptId.getApplicationId());
> 402   return app == null ? null : app.getCurrentAppAttempt();
> 403  }
> {code}
> when old AM Crash,  the CurrentAppAttempt of app will be setted as the new 
> appAttempt that presentes the new AM. So the code line 402 will return the 
> new appAttempt. 
> We shoud add the check: whether the the getted appAttempt have the same id as 
> given id.
> patch comes soon!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Updated] (YARN-9238) We get a wrong attempt by an appAttemptId when AM crash at some point

Reply via email to