[ https://issues.apache.org/jira/browse/SPARK-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Min Zhou updated SPARK-4069:
----------------------------
    Summary: [SPARK-YARN] ApplicationMaster should release all executors' containers before unregistering itself from Yarn RM (was: [SPARK-YARN] ApplicationMaster should releases all executors' containers before unregistering itself from Yarn RM)

> [SPARK-YARN] ApplicationMaster should release all executors' containers
> before unregistering itself from Yarn RM
> ----------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-4069
> URL: https://issues.apache.org/jira/browse/SPARK-4069
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 1.1.0
> Reporter: Min Zhou
>
> Currently, the ApplicationMaster in YARN mode simply unregisters itself from
> the YARN master, a.k.a. the ResourceManager. It never releases the executors'
> containers before that. When the ResourceManager faces this scenario, it
> decides to kill all of the executors' containers, so the ResourceManager log
> looks like this:
> {noformat}
> 2014-10-22 23:39:09,903 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
> Processing event for appattempt_1414003182949_0004_000001 of type UNREGISTERED
> 2014-10-22 23:39:09,903 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
> appattempt_1414003182949_0004_000001 State change from RUNNING to FINAL_SAVING
> 2014-10-22 23:39:09,903 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating
> application application_1414003182949_0004 with final state: FINISHING
> 2014-10-22 23:39:09,903 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
> application_1414003182949_0004 State change from RUNNING to FINAL_SAVING
> 2014-10-22 23:39:09,903 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
> Processing event for appattempt_1414003182949_0004_000001 of type
> ATTEMPT_UPDATE_SAVED
> 2014-10-22 23:39:09,903
INFO
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing
> info for app: application_1414003182949_0004
> 2014-10-22 23:39:09,903 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
> appattempt_1414003182949_0004_000001 State change from FINAL_SAVING to
> FINISHING
> 2014-10-22 23:39:09,903 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
> application_1414003182949_0004 State change from FINAL_SAVING to FINISHING
> 2014-10-22 23:39:10,485 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
> Processing event for appattempt_1414003182949_0004_000001 of type
> CONTAINER_FINISHED
> 2014-10-22 23:39:10,485 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
> container_1414003182949_0004_01_000001 Container Transitioned from RUNNING to
> COMPLETED
> 2014-10-22 23:39:10,485 INFO
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService:
> Unregistering app attempt : appattempt_1414003182949_0004_000001
> 2014-10-22 23:39:10,485 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp:
> Completed container: container_1414003182949_0004_01_000001 in state:
> COMPLETED event:FINISHED
> 2014-10-22 23:39:10,485 INFO
> org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore:
> Finish information of container container_1414003182949_0004_01_000001 is
> written
> 2014-10-22 23:39:10,485 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
> appattempt_1414003182949_0004_000001 State change from FINISHING to FINISHED
> 2014-10-22 23:39:10,485 INFO
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=akim
> OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS
> APPID=application_1414003182949_0004
> CONTAINERID=container_1414003182949_0004_01_000001
> 2014-10-22 23:39:10,485 INFO
org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
> Stored the finish data of container container_1414003182949_0004_01_000001
> 2014-10-22 23:39:10,485 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode:
> Released container container_1414003182949_0004_01_000001 of capacity
> <memory:3072, vCores:1> on host host1, which currently has 0 containers,
> <memory:0, vCores:0> used and <memory:241901, vCores:32> available, release
> resources=true
> 2014-10-22 23:39:10,485 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
> application_1414003182949_0004 State change from FINISHING to FINISHED
> 2014-10-22 23:39:10,485 INFO
> org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore:
> Finish information of application attempt
> appattempt_1414003182949_0004_000001 is written
> 2014-10-22 23:39:10,485 INFO
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=akim
> OPERATION=Application Finished - Succeeded TARGET=RMAppManager
> RESULT=SUCCESS APPID=application_1414003182949_0004
> 2014-10-22 23:39:10,485 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
> Application attempt appattempt_1414003182949_0004_000001 released container
> container_1414003182949_0004_01_000001 on node: host: host2:8041
> #containers=0 available=<memory:241901, vCores:32> used=<memory:0, vCores:0>
> with event: FINISHED
> 2014-10-22 23:39:10,485 INFO
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
> Stored the finish data of application attempt
> appattempt_1414003182949_0004_000001
> 2014-10-22 23:39:10,485 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
> Application appattempt_1414003182949_0004_000001 is done.
finalState=FINISHED
> 2014-10-22 23:39:10,486 INFO
> org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore:
> Finish information of application application_1414003182949_0004 is written
> 2014-10-22 23:39:10,486 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
> container_1414003182949_0004_01_000019 Container Transitioned from RUNNING to
> KILLED
> {noformat}
> Although this does not affect the job's final success status, the log will
> confuse users.
> If we run a Spark job on YARN 2.4.1 with the timeline server enabled, we
> get errors in the ResourceManager's log:
> {noformat}
> 2014-10-22 23:39:10,637 ERROR
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
> Error when storing the finish data of container
> container_1414003182949_0004_01_000019
> 2014-10-22 23:39:10,637 ERROR
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
> Error when storing the finish data of container
> container_1414003182949_0004_01_000017
> 2014-10-22 23:39:10,637 ERROR
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
> Error when storing the finish data of container
> container_1414003182949_0004_01_000009
> 2014-10-22 23:39:10,637 ERROR
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
> Error when storing the finish data of container
> container_1414003182949_0004_01_000010
> 2014-10-22 23:39:10,637 ERROR
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
> Error when storing the finish data of container
> container_1414003182949_0004_01_000012
> 2014-10-22 23:39:10,637 ERROR
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
> Error when storing the finish data of container
> container_1414003182949_0004_01_000003
> 2014-10-22 23:39:10,637 ERROR
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
> Error when
storing the finish data of container
> container_1414003182949_0004_01_000005
> 2014-10-22 23:39:10,637 ERROR
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
> Error when storing the finish data of container
> container_1414003182949_0004_01_000004
> 2014-10-22 23:39:10,637 ERROR
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
> Error when storing the finish data of container
> container_1414003182949_0004_01_000015
> 2014-10-22 23:39:10,637 ERROR
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
> Error when storing the finish data of container
> container_1414003182949_0004_01_000018
> 2014-10-22 23:39:10,637 ERROR
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
> Error when storing the finish data of container
> container_1414003182949_0004_01_000013
> 2014-10-22 23:39:10,637 ERROR
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
> Error when storing the finish data of container
> container_1414003182949_0004_01_000008
> 2014-10-22 23:39:10,637 ERROR
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
> Error when storing the finish data of container
> container_1414003182949_0004_01_000014
> 2014-10-22 23:39:10,637 ERROR
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
> Error when storing the finish data of container
> container_1414003182949_0004_01_000007
> 2014-10-22 23:39:10,638 ERROR
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
> Error when storing the finish data of container
> container_1414003182949_0004_01_000002
> {noformat}
> This is because the application finishes before its containers are
> terminated. Once the executors' containers are killed, the ResourceManager
> tries to record the containers' finish events, but it cannot find a writer
> because the application has already finished.
> {noformat}
> java.io.IOException: History file of application
> application_1414003182949_0003 is not opened
> org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore.getHistoryFileWriter(FileSystemApplicationHistoryStore.java:643)
> org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore.containerFinished(FileSystemApplicationHistoryStore.java:532)
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter.handleWritingApplicationHistoryEvent(RMApplicationHistoryWriter.java:203)
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter$ForwardingEventHandler.handle(RMApplicationHistoryWriter.java:297)
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter$ForwardingEventHandler.handle(RMApplicationHistoryWriter.java:292)
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
> java.lang.Thread.run(Thread.java:745)
> {noformat}
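
The ordering problem described in this issue can be illustrated with a small toy model. This is only a sketch under stated assumptions: it is not YARN or Spark code, and every name in it (ToyHistoryStore, ToyResourceManager, run, the application and container ids) is hypothetical. It assumes, as the report suggests, that the history writer for an application is closed once the application finishes, so a container finish event that arrives afterwards cannot be stored. In the real YARN client API the two steps would presumably correspond to AMRMClient.releaseAssignedContainer and AMRMClient.unregisterApplicationMaster, but the sketch does not call them.

```python
from dataclasses import dataclass, field


class HistoryFileNotOpened(Exception):
    """Toy stand-in for the IOException shown in the stack trace."""


@dataclass
class ToyHistoryStore:
    """Hypothetical store: one open 'writer' per unfinished application."""
    open_apps: set = field(default_factory=set)
    stored: list = field(default_factory=list)
    errors: list = field(default_factory=list)

    def application_started(self, app_id):
        self.open_apps.add(app_id)

    def container_finished(self, app_id, container_id):
        if app_id not in self.open_apps:
            raise HistoryFileNotOpened(
                f"History file of application {app_id} is not opened")
        self.stored.append(container_id)

    def application_finished(self, app_id):
        # Assumption: finishing the application closes its writer.
        self.open_apps.discard(app_id)


@dataclass
class ToyResourceManager:
    """Hypothetical RM that tracks live containers per application."""
    store: ToyHistoryStore
    running: dict = field(default_factory=dict)

    def register(self, app_id, containers):
        self.store.application_started(app_id)
        self.running[app_id] = list(containers)

    def release(self, app_id, container_id):
        # The AM hands a container back while the application is still open.
        self.running[app_id].remove(container_id)
        self._record_finish(app_id, container_id)

    def unregister(self, app_id):
        leftover = self.running.pop(app_id)
        self.store.application_finished(app_id)
        # Containers the AM never released are killed after the app finished.
        for container_id in leftover:
            self._record_finish(app_id, container_id)

    def _record_finish(self, app_id, container_id):
        try:
            self.store.container_finished(app_id, container_id)
        except HistoryFileNotOpened as exc:
            self.store.errors.append(str(exc))


def run(release_first):
    """Return (finish events stored, errors logged) for one toy application."""
    store = ToyHistoryStore()
    rm = ToyResourceManager(store)
    app_id, executors = "app_0004", ["c_02", "c_03", "c_04"]
    rm.register(app_id, executors)
    if release_first:
        for container_id in executors:
            rm.release(app_id, container_id)
    rm.unregister(app_id)
    return len(store.stored), len(store.errors)
```

In this model, run(False) mirrors the reported behaviour (unregister only), and run(True) mirrors the change the summary asks for (release every executor container, then unregister). Only the second ordering records all finish events without errors.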