kyungwan nam created SLIDER-1253:
------------------------------------

             Summary: All containers are killed when decommissioning the NM on 
which the AM is placed.
                 Key: SLIDER-1253
                 URL: https://issues.apache.org/jira/browse/SLIDER-1253
             Project: Slider
          Issue Type: Bug
    Affects Versions: Slider 0.92
            Reporter: kyungwan nam


Once a NodeManager is decommissioned, the RM immediately releases the 
containers running on it, and a new app attempt is launched if one of the 
released containers is the AM.

RM log.
{code}
2017-11-30 09:11:31,351 INFO  rmnode.RMNodeImpl 
(RMNodeImpl.java:transition(734)) - Deactivating Node host1:45454 as it is now 
DECOMMISSIONED
2017-11-30 09:11:31,351 INFO  rmnode.RMNodeImpl (RMNodeImpl.java:handle(424)) - 
host1:45454 Node Transitioned from RUNNING to DECOMMISSIONED
2017-11-30 09:11:31,352 INFO  rmcontainer.RMContainerImpl 
(RMContainerImpl.java:handle(384)) - container_e12_1487083747959_0214_01_000001 
Container Transitioned from RUNNING to KILLED
2017-11-30 09:11:31,352 ERROR ahs.RMApplicationHistoryWriter 
(RMApplicationHistoryWriter.java:handleWritingApplicationHistoryEvent(214)) - 
Error when storing the finish data of container 
container_e12_1487083747959_0214_01_000001
2017-11-30 09:11:31,352 INFO  fica.FiCaSchedulerApp 
(FiCaSchedulerApp.java:containerCompleted(123)) - Completed container: 
container_e12_1487083747959_0214_01_000001 in state: KILLED event:KILL
2017-11-30 09:11:31,352 INFO  resourcemanager.RMAuditLogger 
(RMAuditLogger.java:logSuccess(106)) - USER=user1    OPERATION=AM Released 
Container TARGET=SchedulerApp     RESULT=SUCCESS  
APPID=application_1487083747959_0214    
CONTAINERID=container_e12_1487083747959_0214_01_000001
2017-11-30 09:11:31,352 INFO  scheduler.SchedulerNode 
(SchedulerNode.java:releaseContainer(217)) - Released container 
container_e12_1487083747959_0214_01_000001 of capacity <memory:1024, vCores:1> 
on host host1:45454, which currently has 0 containers, <memory:0, vCores:0> 
used and <memory:120000, vCores:24> available, release resources=true
2017-11-30 09:11:31,352 INFO  attempt.RMAppAttemptImpl 
(RMAppAttemptImpl.java:rememberTargetTransitionsAndStoreState(1169)) - Updating 
application attempt appattempt_1487083747959_0214_000001 with final state: 
FAILED, and exit status: -100
...
2017-11-30 09:11:31,354 INFO  rmapp.RMAppImpl (RMAppImpl.java:handle(721)) - 
application_1487083747959_0214 State change from RUNNING to ACCEPTED
2017-11-30 09:11:31,354 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:doneApplicationAttempt(818)) - Application Attempt 
appattempt_1487083747959_0214_000001 is done. finalState=FAILED
2017-11-30 09:11:31,354 INFO  resourcemanager.ApplicationMasterService 
(ApplicationMasterService.java:registerAppAttempt(675)) - Registering app 
attempt : appattempt_1487083747959_0214_000002
{code}


At this point the original AM, whose process has not actually been stopped 
yet, can hit an ApplicationAttemptNotFoundException because its app attempt no 
longer matches the one registered with the RM.
As a result, the AMRMClientAsync.onShutdownRequest callback is invoked.
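
To make the sequence concrete, here is a toy model of the attempt mismatch. It is 
only a sketch: ToyRm, AttemptNotFound and heartbeatOnce are hypothetical 
stand-ins and not Hadoop or Slider classes. It assumes the RM rejects a 
heartbeat from any attempt other than the current one, and that the client 
side turns that rejection into a shutdown callback.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of an attempt mismatch; every name here is hypothetical. */
public class AttemptMismatchSketch {

    /** Stand-in for the "attempt not found" error described in the report. */
    public static class AttemptNotFound extends Exception {
        public AttemptNotFound(String message) {
            super(message);
        }
    }

    /** Stand-in for the RM: it accepts heartbeats only from the current attempt. */
    public static class ToyRm {
        public int currentAttempt = 1;

        public void heartbeat(int attemptId) throws AttemptNotFound {
            if (attemptId != currentAttempt) {
                throw new AttemptNotFound("unknown attempt " + attemptId);
            }
        }
    }

    /** Sends one heartbeat and returns the names of the callbacks it triggered. */
    public static List<String> heartbeatOnce(ToyRm rm, int myAttempt) {
        List<String> callbacks = new ArrayList<>();
        try {
            rm.heartbeat(myAttempt);
            callbacks.add("heartbeatAccepted");
        } catch (AttemptNotFound e) {
            // Assumption: a rejected attempt is surfaced as a shutdown request.
            callbacks.add("onShutdownRequest");
        }
        return callbacks;
    }

    public static void main(String[] args) {
        ToyRm rm = new ToyRm();
        System.out.println(heartbeatOnce(rm, 1));
        // The RM moves on to attempt 2 while the old AM process is still alive.
        rm.currentAttempt = 2;
        System.out.println(heartbeatOnce(rm, 1));
    }
}
```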

AM log for container_e12_1487083747959_0214_01_000001
{code}
17/11/30 09:11:32 INFO appmaster.SliderAppMaster: Shutdown Request received
17/11/30 09:11:32 INFO impl.AMRMClientAsyncImpl: Shutdown requested. Stopping 
callback.
17/11/30 09:11:32 INFO appmaster.SliderAppMaster: 
SliderAppMasterApi.stopCluster: Shutdown requested from RM
17/11/30 09:11:32 INFO appmaster.SliderAppMaster: Triggering shutdown of the 
AM: stop:  exit code = 0, SUCCEEDED: Shutdown requested from RM;
17/11/30 09:11:32 INFO appmaster.SliderAppMaster: Process has exited with exit 
code 0 mapped to 0 -ignoring
17/11/30 09:11:32 INFO appmaster.SliderAppMaster: Setting stopInitiated flag to 
true
17/11/30 09:11:32 INFO appmaster.SliderAppMaster: Container release timeout in 
millis = 0
17/11/30 09:11:32 INFO state.AppState: Releasing 11 containers
{code}

Currently, the entire application is stopped in the onShutdownRequest callback.
{code}
   public void onShutdownRequest() {
     LOG_YARN.info("Shutdown Request received");
     ActionStopSlider stopSlider = new ActionStopSlider("stop", EXIT_SUCCESS,
         FinalApplicationStatus.SUCCEEDED, "Shutdown requested from RM");
     stopSlider.setExitReason(SliderExitReason.YARN_ERROR);
     signalAMComplete(stopSlider);
   }
{code}

I think only the AM should be stopped here, rather than the entire application.
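
The difference between the two behaviours can be sketched with a toy model. 
This is not Slider code and not a proposed patch: AmState, 
stopEntireApplication and stopAmOnly are hypothetical names, and the model 
only assumes that "stopping the application" means releasing every container 
and reporting a final status, while "stopping the AM" means the AM process 
exits and leaves the containers alone.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model, not Slider code: stop the whole application vs. stop only the AM. */
public class ShutdownSketch {

    /** Hypothetical stand-in for the state an AM keeps track of. */
    public static final class AmState {
        public final List<String> liveContainers = new ArrayList<>();
        public String reportedFinalStatus = null; // null: nothing reported
        public boolean amExited = false;
    }

    /** Behaviour described in the report: the shutdown request stops everything. */
    public static void stopEntireApplication(AmState state) {
        state.liveContainers.clear();            // all containers are released
        state.reportedFinalStatus = "SUCCEEDED"; // application is marked finished
        state.amExited = true;
    }

    /** Suggested behaviour: only this AM instance goes away. */
    public static void stopAmOnly(AmState state) {
        // Containers keep running and no final status is reported,
        // so a later attempt could still manage them (assumption of this model).
        state.amExited = true;
    }

    public static void main(String[] args) {
        AmState current = new AmState();
        current.liveContainers.add("container_a");
        current.liveContainers.add("container_b");
        stopEntireApplication(current);
        System.out.println("stop application: " + current.liveContainers.size()
            + " containers left, status " + current.reportedFinalStatus);

        AmState suggested = new AmState();
        suggested.liveContainers.add("container_a");
        suggested.liveContainers.add("container_b");
        stopAmOnly(suggested);
        System.out.println("stop AM only: " + suggested.liveContainers.size()
            + " containers left, status " + suggested.reportedFinalStatus);
    }
}
```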


