kyungwan nam created SLIDER-1253:
------------------------------------

             Summary: All containers are killed when decommission a NM which the AM is placed.
                 Key: SLIDER-1253
                 URL: https://issues.apache.org/jira/browse/SLIDER-1253
             Project: Slider
          Issue Type: Bug
    Affects Versions: Slider 0.92
            Reporter: kyungwan nam

Once a NodeManager is decommissioned, the RM immediately releases the containers running on that NodeManager, and a new app attempt is launched if the released container is the AM.

RM log:
{code}
2017-11-30 09:11:31,351 INFO rmnode.RMNodeImpl (RMNodeImpl.java:transition(734)) - Deactivating Node host1:45454 as it is now DECOMMISSIONED
2017-11-30 09:11:31,351 INFO rmnode.RMNodeImpl (RMNodeImpl.java:handle(424)) - host1:45454 Node Transitioned from RUNNING to DECOMMISSIONED
2017-11-30 09:11:31,352 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(384)) - container_e12_1487083747959_0214_01_000001 Container Transitioned from RUNNING to KILLED
2017-11-30 09:11:31,352 ERROR ahs.RMApplicationHistoryWriter (RMApplicationHistoryWriter.java:handleWritingApplicationHistoryEvent(214)) - Error when storing the finish data of container container_e12_1487083747959_0214_01_000001
2017-11-30 09:11:31,352 INFO fica.FiCaSchedulerApp (FiCaSchedulerApp.java:containerCompleted(123)) - Completed container: container_e12_1487083747959_0214_01_000001 in state: KILLED event:KILL
2017-11-30 09:11:31,352 INFO resourcemanager.RMAuditLogger (RMAuditLogger.java:logSuccess(106)) - USER=user1 OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1487083747959_0214 CONTAINERID=container_e12_1487083747959_0214_01_000001
2017-11-30 09:11:31,352 INFO scheduler.SchedulerNode (SchedulerNode.java:releaseContainer(217)) - Released container container_e12_1487083747959_0214_01_000001 of capacity <memory:1024, vCores:1> on host host1:45454, which currently has 0 containers, <memory:0, vCores:0> used and <memory:120000, vCores:24> available, release resources=true
2017-11-30 09:11:31,352 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:rememberTargetTransitionsAndStoreState(1169)) - Updating application attempt appattempt_1487083747959_0214_000001 with final state: FAILED, and exit status: -100
...
2017-11-30 09:11:31,354 INFO rmapp.RMAppImpl (RMAppImpl.java:handle(721)) - application_1487083747959_0214 State change from RUNNING to ACCEPTED
2017-11-30 09:11:31,354 INFO capacity.CapacityScheduler (CapacityScheduler.java:doneApplicationAttempt(818)) - Application Attempt appattempt_1487083747959_0214_000001 is done. finalState=FAILED
2017-11-30 09:11:31,354 INFO resourcemanager.ApplicationMasterService (ApplicationMasterService.java:registerAppAttempt(675)) - Registering app attempt : appattempt_1487083747959_0214_000002
{code}

At this point the old AM has not actually been stopped yet, so an ApplicationAttemptNotFoundException can occur in it because the AM and the RM now refer to different app attempts. As a result, the AMRMClientAsync.onShutdownRequest callback is called.

AM log for container_e12_1487083747959_0214_01_000001:
{code}
17/11/30 09:11:32 INFO appmaster.SliderAppMaster: Shutdown Request received
17/11/30 09:11:32 INFO impl.AMRMClientAsyncImpl: Shutdown requested. Stopping callback.
17/11/30 09:11:32 INFO appmaster.SliderAppMaster: SliderAppMasterApi.stopCluster: Shutdown requested from RM
17/11/30 09:11:32 INFO appmaster.SliderAppMaster: Triggering shutdown of the AM: stop: exit code = 0, SUCCEEDED: Shutdown requested from RM;
17/11/30 09:11:32 INFO appmaster.SliderAppMaster: Process has exited with exit code 0 mapped to 0 -ignoring
17/11/30 09:11:32 INFO appmaster.SliderAppMaster: Setting stopInitiated flag to true
17/11/30 09:11:32 INFO appmaster.SliderAppMaster: Container release timeout in millis = 0
17/11/30 09:11:32 INFO state.AppState: Releasing 11 containers
{code}

Currently, the entire application is stopped in the onShutdownRequest callback.
{code}
  public void onShutdownRequest() {
    LOG_YARN.info("Shutdown Request received");
    // Queues a stop of the whole Slider application, not just this AM.
    ActionStopSlider stopSlider = new ActionStopSlider("stop",
        EXIT_SUCCESS,
        FinalApplicationStatus.SUCCEEDED,
        "Shutdown requested from RM");
    stopSlider.setExitReason(SliderExitReason.YARN_ERROR);
    signalAMComplete(stopSlider);
  }
{code}

I think this callback should stop only the AM instead of stopping the entire application.
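
To make the difference between the two behaviours concrete, here is a minimal, self-contained sketch. It is a toy model only: AmOnlyShutdownSketch, ModelAppState, stopEntireApplication and stopAmOnly are hypothetical names invented for illustration, not Slider or YARN classes, and this is not a proposed patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: none of these types are Slider or YARN classes.
final class AmOnlyShutdownSketch {

    /** Hypothetical stand-in for the AM's container bookkeeping. */
    static final class ModelAppState {
        private final List<String> liveContainers = new ArrayList<>();
        private boolean amRunning = true;

        ModelAppState(List<String> containers) {
            liveContainers.addAll(containers);
        }

        /** Behaviour described in the report: release every container, then stop. */
        int stopEntireApplication() {
            int released = liveContainers.size();
            liveContainers.clear();
            amRunning = false;
            return released;
        }

        /** Suggested behaviour: stop this AM only and leave the containers alone. */
        int stopAmOnly() {
            amRunning = false;
            return 0; // nothing is released
        }

        int liveContainerCount() {
            return liveContainers.size();
        }

        boolean isAmRunning() {
            return amRunning;
        }
    }

    public static void main(String[] args) {
        List<String> containers = Arrays.asList("c2", "c3", "c4");

        ModelAppState current = new ModelAppState(containers);
        int releasedNow = current.stopEntireApplication();
        System.out.println("stop application: released " + releasedNow
                + ", still live " + current.liveContainerCount());

        ModelAppState proposed = new ModelAppState(containers);
        int releasedProposed = proposed.stopAmOnly();
        System.out.println("stop AM only: released " + releasedProposed
                + ", still live " + proposed.liveContainerCount());
    }
}
```

Whether the containers left running are then picked up by the next app attempt depends on YARN and Slider behaviour that this sketch does not model.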