[ 
https://issues.apache.org/jira/browse/YARN-9238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lujie updated YARN-9238:
------------------------
    Description: 
We have found a data race that can lead to an odd situation.

See 
org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.OpportunisticAMSProcessor.allocate{color:#ff0000}:(code1){color}
{code:java}
     // Allocate OPPORTUNISTIC containers.
171.  SchedulerApplicationAttempt appAttempt =
172.    ((AbstractYarnScheduler)rmContext.getScheduler())
173.      .getApplicationAttempt(appAttemptId);
174.
175.  OpportunisticContainerContext oppCtx =
176.  appAttempt.getOpportunisticContainerContext();
177.  oppCtx.updateNodeList(getLeastLoadedNodes());
{code}
If we crash the current AM (its attempt id is appattempt_0) just before 
code1#171, then when code1#171~173 executes to look up the appAttempt by 
appattempt_0, the obtained appAttempt should represent the current AM. But 
we found that the obtained appAttempt represents the new AM, whose attempt id 
is appattempt_1. This appAttempt has not yet initialized its oppCtx, so an NPE 
occurs at code1#177.
{code:java}
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService$OpportunisticAMSProcessor.allocate(OpportunisticContainerAllocatorAMService.java:177)
at org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:424)
at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2830)
{code}
So why does the old appAttempt disappear, and why do we look up with the old 
appattempt_0 but get the new appAttempt?

We have found the reason. The code below ({color:#ff0000}code2{color}) is the 
body of getApplicationAttempt, which is called at code1#173:
{code:java}
399. public T getApplicationAttempt(ApplicationAttemptId applicationAttemptId) {
400.   SchedulerApplication<T> app = applications.get(
401.      applicationAttemptId.getApplicationId());
402.   return app == null ? null : app.getCurrentAppAttempt();
403. }
{code}
When the old AM crashes, a new AM and a new appAttempt are created. The 
currentAttempt of the app is then set to the new appAttempt, which represents 
the new AM (see code3). So code2#402 returns the new appAttempt, even though 
the lookup was made with the old attempt id.
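
To make the lookup behaviour in code2 concrete, here is a minimal single-threaded sketch. It is not YARN code: the class AttemptLookupSketch, its nested App class, the register helper, and the use of plain strings for ids are all hypothetical simplifications. It only illustrates that a lookup keyed by application id returns whatever attempt is current, regardless of the attempt id that was asked for.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified stand-in for the scheduler lookup; not YARN code. */
class AttemptLookupSketch {

  /** Models an application that only remembers its current attempt. */
  static final class App {
    private String currentAttemptId;

    void setCurrentAppAttempt(String attemptId) {
      this.currentAttemptId = attemptId;
    }

    String getCurrentAppAttempt() {
      return currentAttemptId;
    }
  }

  // Keyed by application id only, like the applications map used in code2.
  private final Map<String, App> applications = new HashMap<>();

  /** Hypothetical helper: records attemptId as the current attempt of appId. */
  void register(String appId, String attemptId) {
    applications.computeIfAbsent(appId, k -> new App()).setCurrentAppAttempt(attemptId);
  }

  /** Mirrors the shape of code2: requestedAttemptId is never compared with anything. */
  String getApplicationAttempt(String appId, String requestedAttemptId) {
    App app = applications.get(appId);
    return app == null ? null : app.getCurrentAppAttempt();
  }
}
```

In this model, registering appattempt_0 and then appattempt_1 for the same application and looking up with appattempt_0 yields appattempt_1, which is the mismatch described above.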

If the AM crashes just before code1, the bug does not happen because an 
ApplicationDoesNotExistInCacheException is thrown. If the AM crashes after 
code1, everything is also fine.

We should add a check that the obtained appAttempt has the same id as the 
given id.
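
The proposed check could look roughly like the sketch below. This is an illustration under assumptions, not the actual patch: AttemptIdGuardSketch, its Attempt class, and matchingAttemptOrNull are hypothetical names, and ids are modelled as plain strings. How the caller should react to a mismatch (for example, by rejecting the request) is left open here.

```java
import java.util.Objects;

/** Hypothetical, simplified model of the proposed id check; not YARN code. */
class AttemptIdGuardSketch {

  /** Stand-in for a scheduler attempt: an id plus a context that may still be null. */
  static final class Attempt {
    final String attemptId;
    final String oppCtx; // null models "oppCtx has not been initialized yet"

    Attempt(String attemptId, String oppCtx) {
      this.attemptId = attemptId;
      this.oppCtx = oppCtx;
    }
  }

  /**
   * Returns the looked-up attempt only when its id equals the requested id.
   * A null result tells the caller that the requested attempt is no longer current.
   */
  static Attempt matchingAttemptOrNull(Attempt current, String requestedAttemptId) {
    if (current == null || !Objects.equals(current.attemptId, requestedAttemptId)) {
      return null;
    }
    return current;
  }
}
```

With such a guard, a request carrying appattempt_0 would never be handed the uninitialized appattempt_1 object, so the dereference of oppCtx would not be reached for a stale id.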

A patch is coming soon!

{color:#ff0000}code3{color}
{code:java}
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplication.setCurrentAppAttempt(T currentAttempt){
    this.currentAttempt = currentAttempt;
}
{code}
 

  was:
We have found a data race that can lead to an odd situation.

See 
org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.OpportunisticAMSProcessor.allocate{color:#ff0000}:(code1){color}
{code:java}
     // Allocate OPPORTUNISTIC containers.
171.  SchedulerApplicationAttempt appAttempt =
172.    ((AbstractYarnScheduler)rmContext.getScheduler())
173.      .getApplicationAttempt(appAttemptId);
174.
175.  OpportunisticContainerContext oppCtx =
176.  appAttempt.getOpportunisticContainerContext();
177.  oppCtx.updateNodeList(getLeastLoadedNodes());
{code}
If we crash the current AM (its attempt id is appattempt_0) just at 
code1#171, then when code1#171~173 executes to look up the appAttempt by 
appattempt_0, the obtained appAttempt should represent the current AM. But 
we found that the obtained appAttempt represents the new AM, whose attempt id 
is appattempt_1. This appAttempt has not yet initialized its oppCtx, so an NPE 
occurs at code1#177.
{code:java}
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService$OpportunisticAMSProcessor.allocate(OpportunisticContainerAllocatorAMService.java:177)
at org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:424)
at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2830)
{code}
So why does the old appAttempt disappear, and why do we look up with the old 
appattempt_0 but get the new appAttempt?

We have found the reason. The code below ({color:#ff0000}code2{color}) is the 
body of getApplicationAttempt, which is called at code1#173:
{code:java}
399. public T getApplicationAttempt(ApplicationAttemptId applicationAttemptId) {
400.   SchedulerApplication<T> app = applications.get(
401.      applicationAttemptId.getApplicationId());
402.   return app == null ? null : app.getCurrentAppAttempt();
403. }
{code}
When the old AM crashes, a new AM and a new appAttempt are created. The 
currentAttempt of the app is then set to the new appAttempt, which represents 
the new AM (see code3). So code2#402 returns the new appAttempt, even though 
the lookup was made with the old attempt id.

If the AM crashes just before code1, the bug does not happen because an 
ApplicationDoesNotExistInCacheException is thrown. If the AM crashes after 
code1, everything is also fine.

We should add a check that the obtained appAttempt has the same id as the 
given id.

A patch is coming soon!

{color:#ff0000}code3{color}
{code:java}
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplication.setCurrentAppAttempt(T currentAttempt){
    this.currentAttempt = currentAttempt;
}
{code}
 


> We get a wrong attempt  by an appAttemptId when AM crash at some point
> ----------------------------------------------------------------------
>
>                 Key: YARN-9238
>                 URL: https://issues.apache.org/jira/browse/YARN-9238
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: lujie
>            Assignee: lujie
>            Priority: Critical
>         Attachments: YARN-9238_1.patch, 
> hadoop-test-resourcemanager-hadoop11.log
>


