Prabhu Joseph created YARN-10873:
------------------------------------

             Summary: Graceful Decommission ignores launched containers and 
gets deactivated before timeout
                 Key: YARN-10873
                 URL: https://issues.apache.org/jira/browse/YARN-10873
             Project: Hadoop YARN
          Issue Type: Bug
          Components: RM
    Affects Versions: 3.3.1
            Reporter: Prabhu Joseph
            Assignee: Srinivas S T


A node under Graceful Decommission is deactivated before the timeout expires 
even though containers have been launched on it.

On a status update from a node that is in DECOMMISSIONING, the RM transitions 
the node to DECOMMISSIONED before the timeout if there are no running 
applications. The set of running applications is built from the container 
statuses reported by the NodeManager, so a container that has been launched 
but not yet reported is not counted. We have observed containers being 
launched at the NodeManager at the same time as the ResourceManager 
forcefully decommissions the node.

This affects Livy interactive jobs, which support only one application 
attempt.

Suggested fix: check FiCaSchedulerNode to determine whether the node has any 
launched containers before deciding whether to forcefully decommission it.

{code}
  public static class StatusUpdateWhenHealthyTransition implements
      MultipleArcTransition<RMNodeImpl, RMNodeEvent, NodeState> {
    @Override
    public NodeState transition(RMNodeImpl rmNode, RMNodeEvent event) {
      .....
      if (isNodeDecommissioning) {
        List<ApplicationId> keepAliveApps = statusEvent.getKeepAliveAppIds();
        // Only running and keep-alive applications are checked here;
        // containers launched but not yet reported are not considered.
        if (rmNode.runningApplications.isEmpty() &&
            (keepAliveApps == null || keepAliveApps.isEmpty())) {
          RMNodeImpl.deactivateNode(rmNode, NodeState.DECOMMISSIONED);
          return NodeState.DECOMMISSIONED;
        }
      }
      .....
    }
  }
{code}
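
The check suggested above could be sketched roughly as follows. This is a 
simplified, self-contained model and not ResourceManager code: the class 
DecommissionCheck, the method canDeactivate and the launchedContainers 
parameter are hypothetical stand-ins. In the RM the count would have to come 
from the scheduler-side view of the node (for example FiCaSchedulerNode), and 
how that is wired in is an assumption left open here.

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified model of the decommission decision. */
final class DecommissionCheck {

  /**
   * Returns true only when the node has no running applications, no
   * keep-alive applications and no containers launched on it.
   *
   * launchedContainers is an assumed input: it stands in for a count taken
   * from the scheduler's view of the node rather than from NodeManager
   * container statuses.
   */
  static boolean canDeactivate(Set<String> runningApps,
      List<String> keepAliveApps, int launchedContainers) {
    boolean noRunningApps = runningApps.isEmpty();
    boolean noKeepAliveApps =
        keepAliveApps == null || keepAliveApps.isEmpty();
    boolean noLaunchedContainers = launchedContainers == 0;
    return noRunningApps && noKeepAliveApps && noLaunchedContainers;
  }

  public static void main(String[] args) {
    Set<String> noApps = Collections.<String>emptySet();
    // AM container launched by the RM, not yet reported by the NodeManager:
    // the node must stay in DECOMMISSIONING.
    System.out.println(canDeactivate(noApps, null, 1)); // false
    // Nothing left on the node: safe to move to DECOMMISSIONED.
    System.out.println(canDeactivate(noApps, null, 0)); // true
  }
}
```

With the existing condition, the first case would deactivate the node, since 
only the two application checks are evaluated.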


*ResourceManager Logs:*
{code}
2021-06-16 08:45:04,140 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching masterappattempt_1623830067124_0382_000001
2021-06-16 08:45:04,141 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting up container Container: [ContainerId: container_1623830067124_0382_01_000001, AllocationRequestId: 0, Version: 0, NodeId: node1:34753, NodeHttpAddress: 927a9ef942b24b1eaa0e99c39d4e73f90224b902983:8042, Resource: <memory:29696, vCores:4>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.1.2.3:34753 }, ExecutionType: GUARANTEED, ] for AM appattempt_1623830067124_0382_000001
2021-06-16 08:45:04,141 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Create AMRMToken for ApplicationAttempt: appattempt_1623830067124_0382_000001
2021-06-16 08:45:04,141 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Creating password for appattempt_1623830067124_0382_000001
2021-06-16 08:45:04,154 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done launching container Container: [ContainerId: container_1623830067124_0382_01_000001, AllocationRequestId: 0, Version: 0, NodeId: node1:34753, NodeHttpAddress: 927a9ef942b24b1eaa0e99c39d4e73f90224b902983:8042, Resource: <memory:29696, vCores:4>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.1.2.3:34753 }, ExecutionType: GUARANTEED, ] for AM appattempt_1623830067124_0382_000001

2021-06-16 08:45:04,776 INFO org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully decommission node node1:34753 with state RUNNING
2021-06-16 08:45:04,776 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node node1:34753 in DECOMMISSIONING.
2021-06-16 08:45:04,776 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: node1:34753 Node Transitioned from RUNNING to DECOMMISSIONING
2021-06-16 08:45:05,131 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node node1:34753 as it is now DECOMMISSIONED
2021-06-16 08:45:05,131 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: node1:34753 Node Transitioned from DECOMMISSIONING to DECOMMISSIONED
2021-06-16 08:45:05,131 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1623830067124_0382_01_000001 Container Transitioned from ACQUIRED to KILLED
{code}
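
To make the timing in the logs above concrete, the small sketch below 
computes the gaps between three of the logged timestamps. The class 
DecommissionTimeline and its helper millisBetween are illustrative names 
only; the timestamps are copied from the log lines above.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Illustrative helper for reading the log timeline; not part of YARN. */
final class DecommissionTimeline {

  // Timestamp layout used in the ResourceManager log lines above.
  private static final DateTimeFormatter LOG_TS =
      DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

  /** Milliseconds elapsed between two log timestamps. */
  static long millisBetween(String from, String to) {
    return Duration.between(
        LocalDateTime.parse(from, LOG_TS),
        LocalDateTime.parse(to, LOG_TS)).toMillis();
  }

  public static void main(String[] args) {
    String amLaunched = "2021-06-16 08:45:04,154";      // Done launching container
    String decommissioning = "2021-06-16 08:45:04,776"; // RUNNING to DECOMMISSIONING
    String decommissioned = "2021-06-16 08:45:05,131";  // DECOMMISSIONING to DECOMMISSIONED

    // AM container launch finished shortly before decommissioning began.
    System.out.println(millisBetween(amLaunched, decommissioning));     // 622
    // The node left DECOMMISSIONING almost immediately.
    System.out.println(millisBetween(decommissioning, decommissioned)); // 355
  }
}
```

The node was moved to DECOMMISSIONED 355 ms after entering DECOMMISSIONING, 
and the AM container launched 622 ms earlier was killed in the ACQUIRED 
state.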


