[ https://issues.apache.org/jira/browse/YARN-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991069#comment-14991069 ]
Bikas Saha commented on YARN-2047:
----------------------------------

>From the description it seems like the original scope was making sure that a
>lost NM's containers are marked expired by the RM even across RM restart. For
>that, won't it be enough to save a dead/decommissioned NM's info in the state
>store? Upon restart, repopulate the decommissioned/dead status from the state
>store. It can take appropriate action at that time - e.g. cancelling an AM's
>containers for those NMs when the AM re-registers or asking those NMs to
>restart and re-register if they heartbeat again.

If this is a required action then it would also imply that saving such nodes would be a critical state change operation. So, e.g., the decommission command from the admin should not complete until the store has been updated. Is that the case?

> RM should honor NM heartbeat expiry after RM restart
> ----------------------------------------------------
>
>                 Key: YARN-2047
>                 URL: https://issues.apache.org/jira/browse/YARN-2047
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Bikas Saha
>
> After the RM restarts, it forgets about existing NM's (and their potentially
> decommissioned status too). After restart, the RM cannot maintain the
> contract to the AM's that a lost NM's containers will be marked finished
> within the expiry time.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
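
For illustration only, the "persist before the command completes" idea raised in the comment above could be sketched roughly as follows. This is a minimal sketch and not the RM's actual code: NodeStatus, NodeStateStore, InMemoryNodeStateStore and NodeTracker are hypothetical names invented here, and the in-memory store merely stands in for a durable state store.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical types for illustration; none of these are real YARN classes.
enum NodeStatus { DECOMMISSIONED, LOST }

/** Stand-in for a durable state store. storeNodeStatus must return only once the write is durable. */
interface NodeStateStore {
    void storeNodeStatus(String nodeId, NodeStatus status);
    Map<String, NodeStatus> loadNodeStatuses();
}

/** In-memory implementation, used only so the sketch can run. */
class InMemoryNodeStateStore implements NodeStateStore {
    private final Map<String, NodeStatus> persisted = new HashMap<>();

    @Override
    public synchronized void storeNodeStatus(String nodeId, NodeStatus status) {
        persisted.put(nodeId, status);
    }

    @Override
    public synchronized Map<String, NodeStatus> loadNodeStatuses() {
        return new HashMap<>(persisted);
    }
}

class NodeTracker {
    private final NodeStateStore store;
    private final Map<String, NodeStatus> inactiveNodes = new HashMap<>();

    NodeTracker(NodeStateStore store) {
        this.store = store;
        // On (re)start, repopulate the dead/decommissioned status from the store.
        inactiveNodes.putAll(store.loadNodeStatuses());
    }

    /** Persist first, then update memory, so the caller returns only after the store is updated. */
    synchronized void markInactive(String nodeId, NodeStatus status) {
        store.storeNodeStatus(nodeId, status);
        inactiveNodes.put(nodeId, status);
    }

    /** True if a heartbeating node is known to be inactive and should restart and re-register. */
    synchronized boolean mustResync(String nodeId) {
        return inactiveNodes.containsKey(nodeId);
    }

    /** Given containerId -> nodeId as reported by a re-registering AM, pick containers on inactive nodes. */
    synchronized Set<String> containersToCancel(Map<String, String> containerToNode) {
        Set<String> toCancel = new HashSet<>();
        for (Map.Entry<String, String> entry : containerToNode.entrySet()) {
            if (inactiveNodes.containsKey(entry.getValue())) {
                toCancel.add(entry.getKey());
            }
        }
        return toCancel;
    }
}
```

Constructing a second NodeTracker over the same store simulates a restart: a node marked inactive before the restart is still reported by mustResync afterwards. Writing to the store before touching the in-memory map is the design choice under discussion; it makes the admin command slower but avoids acknowledging a decommission that a restart would forget.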