[ 
https://issues.apache.org/jira/browse/YARN-9238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lujie updated YARN-9238:
------------------------
    Description: 
We have foud a data race that can make an odd situation.

See 
org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.OpportunisticAMSProcessor.allocate:
{code:java}
     // Allocate OPPORTUNISTIC containers.
171.  SchedulerApplicationAttempt appAttempt =
172.    ((AbstractYarnScheduler)rmContext.getScheduler())
173.      .getApplicationAttempt(appAttemptId);
174.
175.  OpportunisticContainerContext oppCtx =
176.  appAttempt.getOpportunisticContainerContext();
177.  oppCtx.updateNodeList(getLeastLoadedNodes());
{code}
if we just crash the current AM(its attemptid is appattempt_0)just before 
line171, when the code of line 171~173 continue to execute to get the 
appAttempt by appattempt_0, the appAttempt  should represents the  currenct AM. 
But we found that the  appAttempt  represents to  the new AM and its attempid 
is appattempt_1. This appAttempt that represents  the new AM  has not init its 
oppCtx, so NPE happnes at line 177.
{code:java}
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService$OpportunisticAMSProcessor.allocate(OpportunisticContainerAllocatorAMService.java:177)
at 
org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
at 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:424)
at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at 
org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2830)
{code}
We have found the reason why we use old appattempt_0 but get the new appAttempt 
that represent to new AM. Below is the function body of getApplicationAttempt  
at line 173
{code:java}
399. public T getApplicationAttempt(ApplicationAttemptId applicationAttemptId) {
400   SchedulerApplication<T> app = applications.get(
401      applicationAttemptId.getApplicationId());
402   return app == null ? null : app.getCurrentAppAttempt();
403  }
{code}
when old AM Crash,  the CurrentAppAttempt of app will be setted as the new 
appAttempt that presentes the new AM. So the code line 402 will return the new 
appAttempt. 

We shoud add the check: whether the the getted appAttempt have the same id as 
given id.

patch comes soon!

 

  was:
We have foud a data race that can make an odd situation.

See 
org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.OpportunisticAMSProcessor.allocate:
{code:java}
     // Allocate OPPORTUNISTIC containers.
171.  SchedulerApplicationAttempt appAttempt =
172.    ((AbstractYarnScheduler)rmContext.getScheduler())
173.      .getApplicationAttempt(appAttemptId);
174.
175.  OpportunisticContainerContext oppCtx =
176.  appAttempt.getOpportunisticContainerContext();
177.  oppCtx.updateNodeList(getLeastLoadedNodes());
{code}
if we just crash the current AM(its attemptid is appattempt_0)just before 
line171, when the code of line 171~173 continue to execute to get the 
appAttempt by appattempt_0, the appAttempt  should represents the  currenct AM. 
But we found that the  appAttempt  represents to  the new AM and its attempid 
is appattempt_1. This appAttempt that represents  the new AM  has not init its 
oppCtx, so NPE happnes at line 177.
{code:java}
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService$OpportunisticAMSProcessor.allocate(OpportunisticContainerAllocatorAMService.java:177)
at 
org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
at 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:424)
at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at 
org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2830)
{code}
We have found the reason why we use old appattempt_0 but get the new appAttempt 
that represent to new AM. Below is the function body of getApplicationAttempt  
at line 173

 
{code:java}
399. public T getApplicationAttempt(ApplicationAttemptId applicationAttemptId) {
400   SchedulerApplication<T> app = applications.get(
401      applicationAttemptId.getApplicationId());
402   return app == null ? null : app.getCurrentAppAttempt();
403  }
{code}
when old AM Crash,  the CurrentAppAttempt of app will be setted as the new 
appAttempt that presentes the new AM. So the code line 402 will return the new 
appAttempt. 

We shoud add the check: whether the the getted appAttempt have the same id as 
given id.

patch comes soon!

 


> An huge Data Race can make we get a wrong attempt  by an appAttemptId 
> ----------------------------------------------------------------------
>
>                 Key: YARN-9238
>                 URL: https://issues.apache.org/jira/browse/YARN-9238
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: lujie
>            Priority: Critical
>
> We have foud a data race that can make an odd situation.
> See 
> org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.OpportunisticAMSProcessor.allocate:
> {code:java}
>      // Allocate OPPORTUNISTIC containers.
> 171.  SchedulerApplicationAttempt appAttempt =
> 172.    ((AbstractYarnScheduler)rmContext.getScheduler())
> 173.      .getApplicationAttempt(appAttemptId);
> 174.
> 175.  OpportunisticContainerContext oppCtx =
> 176.  appAttempt.getOpportunisticContainerContext();
> 177.  oppCtx.updateNodeList(getLeastLoadedNodes());
> {code}
> if we just crash the current AM(its attemptid is appattempt_0)just before 
> line171, when the code of line 171~173 continue to execute to get the 
> appAttempt by appattempt_0, the appAttempt  should represents the  currenct 
> AM. But we found that the  appAttempt  represents to  the new AM and its 
> attempid is appattempt_1. This appAttempt that represents  the new AM  has 
> not init its oppCtx, so NPE happnes at line 177.
> {code:java}
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService$OpportunisticAMSProcessor.allocate(OpportunisticContainerAllocatorAMService.java:177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:424)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2830)
> {code}
> We have found the reason why we use old appattempt_0 but get the new 
> appAttempt that represent to new AM. Below is the function body of 
> getApplicationAttempt  at line 173
> {code:java}
> 399. public T getApplicationAttempt(ApplicationAttemptId 
> applicationAttemptId) {
> 400   SchedulerApplication<T> app = applications.get(
> 401      applicationAttemptId.getApplicationId());
> 402   return app == null ? null : app.getCurrentAppAttempt();
> 403  }
> {code}
> when old AM Crash,  the CurrentAppAttempt of app will be setted as the new 
> appAttempt that presentes the new AM. So the code line 402 will return the 
> new appAttempt. 
> We shoud add the check: whether the the getted appAttempt have the same id as 
> given id.
> patch comes soon!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to