[ https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14312525#comment-14312525 ]
Junping Du commented on YARN-914:
---------------------------------

Thanks for the review and comments, [~xgong], [~jlowe] and [~mingma]!

bq. I believe this is about the configuration synchronization between multiple RM nodes. Please take a look at https://issues.apache.org/jira/browse/YARN-1666, and https://issues.apache.org/jira/browse/YARN-1611

Thanks for pointing this out. Sounds like most of that problem has already been resolved; good to know. :)

bq. Do we really need to handle the "LRS containers" and "short-term containers" differently? There are lots of different cases we need to take care. I think that we can just use the same way to handle both.

I haven't thought this through yet. IMO, the benefit of this feature is to provide a reasonable time window for running applications to get a chance to finish before nodes get decommissioned. Given the endless life cycle of LRS containers, I don't see the benefit of keeping LRS containers running until the timeout; it only delays the decommission process. Or do we assume the AM can react on behalf of its LRS containers when it gets notified? Maybe for the first step we can do the same thing for LRS and non-LRS containers to keep it simple, but I think we should keep an open mind on this.

bq. Maybe we need to track the timeout at RM side and NM side. RM can stop NM if the timeout is reached but it does not receive the "decommission complete" from NM.

Sounds reasonable, given that communication between the NM and RM can break. However, as Jason Lowe proposed below, we could track this on the RM side only. Thoughts?

bq. For transferring knowledge to the standby RM, we could persist the graceful decomm node list to the state store.

Yes. Sounds like most of the work has already been done in YARN-1666 (decommission node list) and YARN-1611 (timeout value), as @Xuan mentioned above. The only remaining work here is to keep track of the start time of each decommissioning node, isn't it?

bq. I agree with Xuan that so far I don't see a need to treat LRS and normal containers separately. Either a container exits before the decommission timeout or it doesn't.

Just as we want the decommission to happen before the timeout if all containers and apps are finished, we don't want to pay an unnecessary time cost by delaying the decommission process, do we? However, it could be a different story if we think the delay helps LRS applications. Anyway, as mentioned above, it should be fine to keep the same behavior for both for now, but I think we need to keep it in mind.

bq. Just to be clear, the NM is already tracking which applications are active on a node and is reporting these to the RM on heartbeats (see NM context and NodeStatusUpdaterImpl appTokenKeepAliveMap). The DecommissionService doesn't need to explicitly track the apps itself as this is already being done.

Yes. The diagram includes not only the new components but also existing ones. Thanks for the reminder, though.

bq. As for doing this RM side or NM side, I think it can simplify things if we do this on the RM side. The RM already needs to know about graceful decommission to avoid scheduling new apps/containers on the node. Also the NM is heartbeating active apps back to the RM, so it's easy for the RM to track which apps are still active on a particular node. If the RMNodeImpl state machine sees that it's in the decommissioning state and all apps/containers have completed then it can transition to the decommissioned state. For timeouts the RM can simply set a timer-delivered event to the RMNode when the graceful decommission starts, and the RMNode can act accordingly when the timer event arrives, killing containers etc. Actually I'm not sure the NM needs to know about graceful decommission at all, which IMHO simplifies the design since only one daemon needs to participate and be knowledgeable of the feature. The NM would simply see the process as a reduction in container assignments until eventually containers are killed and the RM tells it that it's decommissioned.

That makes sense. In addition, I think the RMNodes don't even have to track time themselves (in the worst case, thousands of threads would need to access the time); instead, we can have something like a DecommissionTimeoutMonitor derived from AbstractLivelinessMonitor. When it detects a timeout, it can send a decommission-timeout event to the RMNode so that the node shutdown happens. Also, I agree that the NM may not need to be aware of the decommission-in-progress state at all.
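To make that idea a bit more concrete, here is a minimal sketch modeled on the existing AMLivelinessMonitor/NMLivelinessMonitor pattern. The configuration key and the DECOMMISSION_TIMEOUT event type are hypothetical placeholders for this discussion, not existing YARN APIs:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.NodeId;
import org.apache.hadoop.yarn.event.Dispatcher;
import org.apache.hadoop.yarn.event.Event;
import org.apache.hadoop.yarn.event.EventHandler;
import org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeEvent;
import org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeEventType;
import org.apache.hadoop.yarn.util.AbstractLivelinessMonitor;
import org.apache.hadoop.yarn.util.SystemClock;

/**
 * Sketch only: a timeout monitor for decommissioning nodes, modeled on
 * AMLivelinessMonitor. The RM registers a node here when it enters the
 * DECOMMISSIONING state and unregisters it once all of its apps/containers
 * have finished; expire() fires only if the timeout elapses first.
 */
public class DecommissionTimeoutMonitor
    extends AbstractLivelinessMonitor<NodeId> {

  // Hypothetical configuration key/default for the graceful decommission timeout.
  public static final String DECOMMISSION_TIMEOUT_MS =
      "yarn.resourcemanager.decommission.timeout-ms";
  public static final int DEFAULT_DECOMMISSION_TIMEOUT_MS = 3600 * 1000;

  private final EventHandler<Event> dispatcher;

  public DecommissionTimeoutMonitor(Dispatcher d) {
    super(DecommissionTimeoutMonitor.class.getName(), new SystemClock());
    this.dispatcher = d.getEventHandler();
  }

  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    super.serviceInit(conf);
    int timeoutMs = conf.getInt(DECOMMISSION_TIMEOUT_MS,
        DEFAULT_DECOMMISSION_TIMEOUT_MS);
    setExpireInterval(timeoutMs);
    setMonitorInterval(timeoutMs / 3);
  }

  @Override
  protected void expire(NodeId nodeId) {
    // DECOMMISSION_TIMEOUT is a hypothetical RMNodeEventType: the RMNode state
    // machine would react by killing the remaining containers and moving the
    // node from DECOMMISSIONING to DECOMMISSIONED.
    dispatcher.handle(
        new RMNodeEvent(nodeId, RMNodeEventType.DECOMMISSION_TIMEOUT));
  }
}
{code}

Whoever drives the graceful decommission on the RM side would then call register(nodeId) when the node enters DECOMMISSIONING and unregister(nodeId) if its last app/container completes before the timeout.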
bq. To clarify decomm node list, it appears there are two things, one is the decomm request list; another one is the run time state of the decomm nodes. From Xuan's comment it appears we want to put the request in HDFS and leverage FileSystemBasedConfigurationProvider to read it at run time. Given it is considered configuration, that seems a good fit. Jason mentioned the state store, that can be used to track the run time state of the decomm. This is necessary given we plan to introduce timeout for graceful decommission.

However, if we assume ResourceOption's overcommitTimeout state is stored in the state store for the RM failover case as part of YARN-291, then the new active RM can just replay the state transition. If so, it seems we don't need to persist the decommission runtime state to the state store. I think we can assume the timeout (the value specified in the yarn properties) is consistent across an RM failover, and that we need to track each node's decommissioning start time. It reminds me that putting the timeout value in yarn-site.xml may not be a good idea if we want to update it at runtime, as the RM only loads the yarn-site configuration at startup. However, I think we can override this value with a new parameter on the refresh-nodes command line. Thoughts?

bq. Alternatively we can remove graceful decommission timeout for YARN layer and let external decommission script handle that. If the script considers the graceful decommission takes too long, it can ask YARN to do the immediate decommission.

Aha, that makes things much simpler. However, if we don't have that script, the decommission could look like an endless process. Maybe we shouldn't assume such an external script will always be there, no?

bq. BTW, it appears fair scheduler doesn't support ConfigurationProvider.

Not sure; I need to check it later. Maybe some FairScheduler folks can comment here.

bq. Recommission is another scenario. It can happen when node is in decommissioned state or decommissioned_in_progress state.

Agree. It should already be addressed in the document.
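On the failover point above (persist only each node's decommissioning start time and assume the timeout value itself is consistent across RMs), here is a rough sketch of what recovery on the new active RM could look like. Everything here is hypothetical and builds on the monitor sketch earlier; it is not an existing RMStateStore API:

{code:java}
import java.util.Map;
import org.apache.hadoop.yarn.api.records.NodeId;
import org.apache.hadoop.yarn.event.Event;
import org.apache.hadoop.yarn.event.EventHandler;
import org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeEvent;
import org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeEventType;
import org.apache.hadoop.yarn.util.Clock;

/**
 * Sketch only: after failover, the new active RM re-arms the decommission
 * timeout from the persisted start time of each decommissioning node and
 * the (assumed consistent) timeout value from configuration.
 */
public class DecommissionRecoveryHelper {

  private final DecommissionTimeoutMonitor monitor;   // sketch above
  private final EventHandler<Event> dispatcher;
  private final Clock clock;

  public DecommissionRecoveryHelper(DecommissionTimeoutMonitor monitor,
      EventHandler<Event> dispatcher, Clock clock) {
    this.monitor = monitor;
    this.dispatcher = dispatcher;
    this.clock = clock;
  }

  /**
   * @param startTimes nodeId -> wall-clock time (ms) at which the graceful
   *                   decommission started, as read back from the state store
   * @param timeoutMs  the configured graceful decommission timeout
   */
  public void recover(Map<NodeId, Long> startTimes, long timeoutMs) {
    long now = clock.getTime();
    for (Map.Entry<NodeId, Long> e : startTimes.entrySet()) {
      long remaining = timeoutMs - (now - e.getValue());
      if (remaining <= 0) {
        // The window already elapsed while failing over: finish the
        // decommission right away (hypothetical event type).
        dispatcher.handle(
            new RMNodeEvent(e.getKey(), RMNodeEventType.DECOMMISSION_TIMEOUT));
      } else {
        // Put the node back under the timeout monitor. AbstractLivelinessMonitor
        // uses one shared expire interval, so honoring the exact per-node
        // remaining time would need a small extension to the monitor; this
        // sketch glosses over that detail.
        monitor.register(e.getKey());
      }
    }
  }
}
{code}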
> Support graceful decommission of nodemanager
> --------------------------------------------
>
>                 Key: YARN-914
>                 URL: https://issues.apache.org/jira/browse/YARN-914
>             Project: Hadoop YARN
>          Issue Type: Improvement
>    Affects Versions: 2.0.4-alpha
>            Reporter: Luke Lu
>            Assignee: Junping Du
>         Attachments: Gracefully Decommission of NodeManager (v1).pdf
>
>
> When NMs are decommissioned for non-fault reasons (capacity change etc.), it's desirable to minimize the impact to running applications.
> Currently if a NM is decommissioned, all running containers on the NM need to be rescheduled on other NMs. Further more, for finished map tasks, if their map output are not fetched by the reducers of the job, these map tasks will need to be rerun as well.
> We propose to introduce a mechanism to optionally gracefully decommission a node manager.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)