kyungwan nam created SLIDER-1253:
------------------------------------
Summary: All containers are killed when decommissioning the NM on which
the AM is placed.
Key: SLIDER-1253
URL: https://issues.apache.org/jira/browse/SLIDER-1253
Project: Slider
Issue Type: Bug
Affects Versions: Slider 0.92
Reporter: kyungwan nam
Once a nodemanager is decommissioned, the RM immediately releases the
containers running on that nodemanager, and a new appattempt is launched if
the released container is the AM.
RM log.
{code}
2017-11-30 09:11:31,351 INFO rmnode.RMNodeImpl
(RMNodeImpl.java:transition(734)) - Deactivating Node host1:45454 as it is now
DECOMMISSIONED
2017-11-30 09:11:31,351 INFO rmnode.RMNodeImpl (RMNodeImpl.java:handle(424)) -
host1:45454 Node Transitioned from RUNNING to DECOMMISSIONED
2017-11-30 09:11:31,352 INFO rmcontainer.RMContainerImpl
(RMContainerImpl.java:handle(384)) - container_e12_1487083747959_0214_01_000001
Container Transitioned from RUNNING to KILLED
2017-11-30 09:11:31,352 ERROR ahs.RMApplicationHistoryWriter
(RMApplicationHistoryWriter.java:handleWritingApplicationHistoryEvent(214)) -
Error when storing the finish data of container
container_e12_1487083747959_0214_01_000001
2017-11-30 09:11:31,352 INFO fica.FiCaSchedulerApp
(FiCaSchedulerApp.java:containerCompleted(123)) - Completed container:
container_e12_1487083747959_0214_01_000001 in state: KILLED event:KILL
2017-11-30 09:11:31,352 INFO resourcemanager.RMAuditLogger
(RMAuditLogger.java:logSuccess(106)) - USER=user1 OPERATION=AM Released
Container TARGET=SchedulerApp RESULT=SUCCESS
APPID=application_1487083747959_0214
CONTAINERID=container_e12_1487083747959_0214_01_000001
2017-11-30 09:11:31,352 INFO scheduler.SchedulerNode
(SchedulerNode.java:releaseContainer(217)) - Released container
container_e12_1487083747959_0214_01_000001 of capacity <memory:1024, vCores:1>
on host host1:45454, which currently has 0 containers, <memory:0, vCores:0>
used and <memory:120000, vCores:24> available, release resources=true
2017-11-30 09:11:31,352 INFO attempt.RMAppAttemptImpl
(RMAppAttemptImpl.java:rememberTargetTransitionsAndStoreState(1169)) - Updating
application attempt appattempt_1487083747959_0214_000001 with final state:
FAILED, and exit status: -100
...
2017-11-30 09:11:31,354 INFO rmapp.RMAppImpl (RMAppImpl.java:handle(721)) -
application_1487083747959_0214 State change from RUNNING to ACCEPTED
2017-11-30 09:11:31,354 INFO capacity.CapacityScheduler
(CapacityScheduler.java:doneApplicationAttempt(818)) - Application Attempt
appattempt_1487083747959_0214_000001 is done. finalState=FAILED
2017-11-30 09:11:31,354 INFO resourcemanager.ApplicationMasterService
(ApplicationMasterService.java:registerAppAttempt(675)) - Registering app
attempt : appattempt_1487083747959_0214_000002
{code}
At this point the old AM, whose container has not actually been released yet,
can hit ApplicationAttemptNotFoundException because its appattempt differs from
the one the RM now tracks.
As a result, the AMRMClientAsync.onShutdownRequest callback is called.
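The sequence described here can be modelled with a minimal, self-contained sketch. It does not use the real YARN classes: ToyResourceManager, AttemptNotFound, Handler and heartbeat are hypothetical stand-ins, and the only behaviour assumed is that a heartbeat from a stale attempt is rejected and that the rejection is turned into the shutdown callback.

```java
import java.util.ArrayList;
import java.util.List;

class StaleAttemptSketch {
    /** Hypothetical local stand-in for ApplicationAttemptNotFoundException. */
    static class AttemptNotFound extends RuntimeException {
        AttemptNotFound(String msg) { super(msg); }
    }

    /** Toy RM that only tracks the current attempt number of one application. */
    static class ToyResourceManager {
        private int currentAttempt = 1;

        /** Decommissioning the AM's node fails the attempt and registers the next one. */
        void decommissionAmNode() { currentAttempt++; }

        /** Heartbeat endpoint: rejects any attempt that is no longer current. */
        void allocate(int attemptId) {
            if (attemptId != currentAttempt) {
                throw new AttemptNotFound("attempt " + attemptId
                        + " is not the current attempt " + currentAttempt);
            }
        }
    }

    /** Hypothetical callback interface with the one method of interest. */
    interface Handler { void onShutdownRequest(); }

    /** One heartbeat: a stale-attempt rejection becomes a shutdown callback. */
    static void heartbeat(ToyResourceManager rm, int attemptId, Handler handler) {
        try {
            rm.allocate(attemptId);
        } catch (AttemptNotFound e) {
            handler.onShutdownRequest();
        }
    }

    /** Replays the scenario and returns the callbacks that fired. */
    static List<String> run() {
        List<String> events = new ArrayList<>();
        ToyResourceManager rm = new ToyResourceManager();
        Handler handler = () -> events.add("onShutdownRequest");
        heartbeat(rm, 1, handler);   // attempt 1 is current: no callback
        rm.decommissionAmNode();     // RM moves on to attempt 2
        heartbeat(rm, 1, handler);   // old AM is still alive and heartbeats again
        return events;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Only the second heartbeat triggers the callback, which corresponds to the "Shutdown Request received" line in the AM log.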
AM log for container_e12_1487083747959_0214_01_000001
{code}
17/11/30 09:11:32 INFO appmaster.SliderAppMaster: Shutdown Request received
17/11/30 09:11:32 INFO impl.AMRMClientAsyncImpl: Shutdown requested. Stopping
callback.
17/11/30 09:11:32 INFO appmaster.SliderAppMaster:
SliderAppMasterApi.stopCluster: Shutdown requested from RM
17/11/30 09:11:32 INFO appmaster.SliderAppMaster: Triggering shutdown of the
AM: stop: exit code = 0, SUCCEEDED: Shutdown requested from RM;
17/11/30 09:11:32 INFO appmaster.SliderAppMaster: Process has exited with exit
code 0 mapped to 0 -ignoring
17/11/30 09:11:32 INFO appmaster.SliderAppMaster: Setting stopInitiated flag to
true
17/11/30 09:11:32 INFO appmaster.SliderAppMaster: Container release timeout in
millis = 0
17/11/30 09:11:32 INFO state.AppState: Releasing 11 containers
{code}
Currently, the entire application is stopped in the onShutdownRequest callback.
{code}
public void onShutdownRequest() {
  LOG_YARN.info("Shutdown Request received");
  ActionStopSlider stopSlider = new ActionStopSlider("stop", EXIT_SUCCESS,
      FinalApplicationStatus.SUCCEEDED, "Shutdown requested from RM");
  stopSlider.setExitReason(SliderExitReason.YARN_ERROR);
  signalAMComplete(stopSlider);
}
{code}
I think only the AM should be stopped here, rather than the entire application.
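A rough sketch of the difference between the two behaviours, under stated assumptions: ToyAppMaster, stopApplication, stopAmOnly and the boolean flag are hypothetical names for illustration only, not Slider's API and not a proposed patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AmOnlyShutdownSketch {
    /** Toy AM holding the IDs of the containers it manages (all names hypothetical). */
    static class ToyAppMaster {
        final List<String> liveContainers;
        boolean amStopped = false;

        ToyAppMaster(List<String> containers) {
            this.liveContainers = new ArrayList<>(containers);
        }

        /** Behaviour reported in the issue: release every container, then exit. */
        void stopApplication() {
            liveContainers.clear();
            amStopped = true;
        }

        /** Suggested behaviour: exit this AM process and leave the containers alone. */
        void stopAmOnly() {
            amStopped = true;
        }

        /** Shutdown callback; the flag selects between the two behaviours. */
        void onShutdownRequest(boolean amOnly) {
            if (amOnly) {
                stopAmOnly();
            } else {
                stopApplication();
            }
        }
    }

    public static void main(String[] args) {
        ToyAppMaster current = new ToyAppMaster(Arrays.asList("c2", "c3", "c4"));
        current.onShutdownRequest(false);
        System.out.println("stop application: "
                + current.liveContainers.size() + " containers left");

        ToyAppMaster suggested = new ToyAppMaster(Arrays.asList("c2", "c3", "c4"));
        suggested.onShutdownRequest(true);
        System.out.println("stop AM only: "
                + suggested.liveContainers.size() + " containers left");
    }
}
```

In a real cluster, the containers left running would be picked up by the next appattempt only if the application was submitted so that containers are kept across attempts. That is assumed here and has not been checked against Slider 0.92.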
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)