[ 
https://issues.apache.org/jira/browse/YARN-11421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Dixit updated YARN-11421:
----------------------------------
    Description: 
During Graceful Decommission, a Node gets deactivated before timeout even 
though there are launched containers on that node.

We have observed cases where the graceful decommission signal is sent to a
node at the same time as containers are launched on its NodeManager. In such
cases the ResourceManager moves the node from the Decommissioning to the
Decommissioned state because launched containers are not checked in
DecommissioningNodesWatcher.

We suggest waiting for
yarn.resourcemanager.decommissioning-nodes-watcher.delay-ms to elapse before
marking the node ready to be decommissioned. A value of 0 means no delay. This
expire interval should not be configured higher than RM_AM_EXPIRY_INTERVAL_MS.

  was:
During Graceful Decommission, a Node gets deactivated before timeout even 
though there are launched containers on that node.

We have observed cases where the graceful decommission signal is sent to a
node at the same time as containers are launched on its NodeManager. In such
cases the ResourceManager moves the node from the Decommissioning to the
Decommissioned state because launched containers are not checked in
DeactivateNodeTransition.

We suggest waiting for the AM liveliness timeout to elapse before marking the
node ready to be decommissioned. This behavior will be gated behind the flag
decommissioning-nodes-watcher.delayed-removal.allowed


> Graceful Decommission ignores launched containers and gets deactivated before 
> timeout
> -------------------------------------------------------------------------------------
>
>                 Key: YARN-11421
>                 URL: https://issues.apache.org/jira/browse/YARN-11421
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.2.1, 3.3.1, 3.3.4
>            Reporter: Abhishek Dixit
>            Priority: Major
>              Labels: pull-request-available
>
> During Graceful Decommission, a Node gets deactivated before timeout even 
> though there are launched containers on that node.
> We have observed cases where the graceful decommission signal is sent to a
> node at the same time as containers are launched on its NodeManager. In such
> cases the ResourceManager moves the node from the Decommissioning to the
> Decommissioned state because launched containers are not checked in
> DecommissioningNodesWatcher.
> We suggest waiting for
> yarn.resourcemanager.decommissioning-nodes-watcher.delay-ms to elapse
> before marking the node ready to be decommissioned. A value of 0 means no
> delay. This expire interval should not be configured higher than
> RM_AM_EXPIRY_INTERVAL_MS.
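
The proposed delay check can be sketched roughly as follows. This is only an
illustration of the behavior described in the issue, not the code from the
patch: the class DecommissionDelaySketch and both of its methods are
hypothetical names, capping the delay at the AM expiry interval is an
assumption about how the "should not be configured higher" rule might be
enforced, and the 600000 ms AM expiry value is just an example.

```java
public final class DecommissionDelaySketch {

    // Hypothetical helper: a delay of 0 (or less) disables the wait. Capping
    // at the AM expiry interval is an assumption; the issue only says the
    // delay should not be configured higher than that interval.
    static long effectiveDelayMs(long configuredDelayMs, long amExpiryIntervalMs) {
        if (configuredDelayMs <= 0) {
            return 0L;
        }
        return Math.min(configuredDelayMs, amExpiryIntervalMs);
    }

    // Hypothetical check: a decommissioning node is ready only when it reports
    // no running containers AND the delay since decommissioning started has
    // elapsed, which leaves time for just-launched containers to be reported.
    static boolean isReadyToDecommission(int runningContainers,
                                         long decommissioningStartMs,
                                         long nowMs,
                                         long delayMs) {
        if (runningContainers > 0) {
            return false;
        }
        return nowMs - decommissioningStartMs >= delayMs;
    }

    public static void main(String[] args) {
        long amExpiryMs = 600_000L;                           // example value
        long delayMs = effectiveDelayMs(30_000L, amExpiryMs); // 30000

        // 10 s after decommissioning started: still inside the delay window.
        System.out.println(isReadyToDecommission(0, 0L, 10_000L, delayMs)); // false
        // 30 s after: the delay has elapsed and no containers are running.
        System.out.println(isReadyToDecommission(0, 0L, 30_000L, delayMs)); // true
        // Containers still running: never ready, regardless of elapsed time.
        System.out.println(isReadyToDecommission(2, 0L, 30_000L, delayMs)); // false
    }
}
```

With a delay of 0 the second condition is always satisfied, so the check
reduces to the existing "no running containers" test.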


