[ 
https://issues.apache.org/jira/browse/SPARK-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4069:
-----------------------------------

    Assignee: Apache Spark

> [SPARK-YARN] ApplicationMaster should release all executors' containers 
> before unregistering itself from Yarn RM
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-4069
>                 URL: https://issues.apache.org/jira/browse/SPARK-4069
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.1.0
>            Reporter: Min Zhou
>            Assignee: Apache Spark
>
> Currently, the ApplicationMaster in YARN mode simply unregisters itself from 
> the YARN master, a.k.a. the ResourceManager. It never releases the executors' 
> containers before that. When the ResourceManager encounters this situation, 
> it kills all of the executors' containers, so its log looks like the 
> following:
> {noformat}
> 2014-10-22 23:39:09,903 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> Processing event for appattempt_1414003182949_0004_000001 of type UNREGISTERED
> 2014-10-22 23:39:09,903 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1414003182949_0004_000001 State change from RUNNING to FINAL_SAVING
> 2014-10-22 23:39:09,903 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating 
> application application_1414003182949_0004 with final state: FINISHING
> 2014-10-22 23:39:09,903 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
> application_1414003182949_0004 State change from RUNNING to FINAL_SAVING
> 2014-10-22 23:39:09,903 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> Processing event for appattempt_1414003182949_0004_000001 of type 
> ATTEMPT_UPDATE_SAVED
> 2014-10-22 23:39:09,903 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing 
> info for app: application_1414003182949_0004
> 2014-10-22 23:39:09,903 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1414003182949_0004_000001 State change from FINAL_SAVING to 
> FINISHING
> 2014-10-22 23:39:09,903 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
> application_1414003182949_0004 State change from FINAL_SAVING to FINISHING
> 2014-10-22 23:39:10,485 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> Processing event for appattempt_1414003182949_0004_000001 of type 
> CONTAINER_FINISHED
> 2014-10-22 23:39:10,485 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_1414003182949_0004_01_000001 Container Transitioned from RUNNING to 
> COMPLETED
> 2014-10-22 23:39:10,485 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
> Unregistering app attempt : appattempt_1414003182949_0004_000001
> 2014-10-22 23:39:10,485 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp: 
> Completed container: container_1414003182949_0004_01_000001 in state: 
> COMPLETED event:FINISHED
> 2014-10-22 23:39:10,485 INFO 
> org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore:
>  Finish information of container container_1414003182949_0004_01_000001 is 
> written
> 2014-10-22 23:39:10,485 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1414003182949_0004_000001 State change from FINISHING to FINISHED
> 2014-10-22 23:39:10,485 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=akim   
> OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS  
> APPID=application_1414003182949_0004    
> CONTAINERID=container_1414003182949_0004_01_000001
> 2014-10-22 23:39:10,485 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Stored the finish data of container container_1414003182949_0004_01_000001
> 2014-10-22 23:39:10,485 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: 
> Released container container_1414003182949_0004_01_000001 of capacity 
> <memory:3072, vCores:1> on host host1, which currently has 0 containers, 
> <memory:0, vCores:0> used and <memory:241901, vCores:32> available, release 
> resources=true
> 2014-10-22 23:39:10,485 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
> application_1414003182949_0004 State change from FINISHING to FINISHED
> 2014-10-22 23:39:10,485 INFO 
> org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore:
>  Finish information of application attempt 
> appattempt_1414003182949_0004_000001 is written
> 2014-10-22 23:39:10,485 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=akim   
> OPERATION=Application Finished - Succeeded      TARGET=RMAppManager     
> RESULT=SUCCESS  APPID=application_1414003182949_0004
> 2014-10-22 23:39:10,485 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Application attempt appattempt_1414003182949_0004_000001 released container 
> container_1414003182949_0004_01_000001 on node: host: host2:8041 
> #containers=0 available=<memory:241901, vCores:32> used=<memory:0, vCores:0> 
> with event: FINISHED
> 2014-10-22 23:39:10,485 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Stored the finish data of application attempt 
> appattempt_1414003182949_0004_000001
> 2014-10-22 23:39:10,485 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Application appattempt_1414003182949_0004_000001 is done. finalState=FINISHED
> 2014-10-22 23:39:10,486 INFO 
> org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore:
>  Finish information of application application_1414003182949_0004 is written
> 2014-10-22 23:39:10,486 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_1414003182949_0004_01_000019 Container Transitioned from RUNNING to 
> KILLED
> {noformat}
> Although this does not change the job's final status, which is still 
> reported as succeeded, the log is confusing to users. 
> If we run a Spark job on YARN 2.4.1 with the timeline server enabled, we 
> also get errors in the ResourceManager's log:
> {noformat}
> 2014-10-22 23:39:10,637 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Error when storing the finish data of container 
> container_1414003182949_0004_01_000019
> 2014-10-22 23:39:10,637 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Error when storing the finish data of container 
> container_1414003182949_0004_01_000017
> 2014-10-22 23:39:10,637 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Error when storing the finish data of container 
> container_1414003182949_0004_01_000009
> 2014-10-22 23:39:10,637 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Error when storing the finish data of container 
> container_1414003182949_0004_01_000010
> 2014-10-22 23:39:10,637 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Error when storing the finish data of container 
> container_1414003182949_0004_01_000012
> 2014-10-22 23:39:10,637 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Error when storing the finish data of container 
> container_1414003182949_0004_01_000003
> 2014-10-22 23:39:10,637 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Error when storing the finish data of container 
> container_1414003182949_0004_01_000005
> 2014-10-22 23:39:10,637 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Error when storing the finish data of container 
> container_1414003182949_0004_01_000004
> 2014-10-22 23:39:10,637 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Error when storing the finish data of container 
> container_1414003182949_0004_01_000015
> 2014-10-22 23:39:10,637 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Error when storing the finish data of container 
> container_1414003182949_0004_01_000018
> 2014-10-22 23:39:10,637 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Error when storing the finish data of container 
> container_1414003182949_0004_01_000013
> 2014-10-22 23:39:10,637 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Error when storing the finish data of container 
> container_1414003182949_0004_01_000008
> 2014-10-22 23:39:10,637 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Error when storing the finish data of container 
> container_1414003182949_0004_01_000014
> 2014-10-22 23:39:10,637 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Error when storing the finish data of container 
> container_1414003182949_0004_01_000007
> 2014-10-22 23:39:10,638 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Error when storing the finish data of container 
> container_1414003182949_0004_01_000002
> {noformat}
> This is because the application finishes before its containers are 
> terminated. When the executors' containers are killed, the ResourceManager 
> tries to store the finish data for each container's finish event, but it 
> cannot find a history writer because the application has already finished:
> {noformat}
> java.io.IOException: History file of application 
> application_1414003182949_0003 is not opened
>     
> org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore.getHistoryFileWriter(FileSystemApplicationHistoryStore.java:643)
>     
> org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore.containerFinished(FileSystemApplicationHistoryStore.java:532)
>     
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter.handleWritingApplicationHistoryEvent(RMApplicationHistoryWriter.java:203)
>     
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter$ForwardingEventHandler.handle(RMApplicationHistoryWriter.java:297)
>     
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter$ForwardingEventHandler.handle(RMApplicationHistoryWriter.java:292)
>     
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>     
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
>     java.lang.Thread.run(Thread.java:745)
> {noformat}
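
The ordering the issue title asks for (release every executor container, then unregister) can be sketched as below. This is only an illustration under stated assumptions, not Spark's actual fix: RmClient, RecordingRmClient and AmShutdown are hypothetical names invented here, and container IDs and the final status are plain strings. Hadoop's real AMRMClient does expose releaseAssignedContainer(ContainerId) and unregisterApplicationMaster(FinalApplicationStatus, String, String), which the hypothetical interface loosely mirrors.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the ResourceManager-facing client; not the Hadoop or Spark API. */
interface RmClient {
    void releaseAssignedContainer(String containerId);
    void unregisterApplicationMaster(String finalStatus);
}

/** Fake client that only records the order of the calls it receives. */
class RecordingRmClient implements RmClient {
    final List<String> calls = new ArrayList<>();

    @Override
    public void releaseAssignedContainer(String containerId) {
        calls.add("release:" + containerId);
    }

    @Override
    public void unregisterApplicationMaster(String finalStatus) {
        calls.add("unregister:" + finalStatus);
    }
}

/** Hypothetical shutdown helper showing the proposed ordering. */
final class AmShutdown {
    private AmShutdown() {}

    /** Releases every executor container first, and only then unregisters. */
    static void releaseThenUnregister(RmClient rm,
                                      List<String> executorContainerIds,
                                      String finalStatus) {
        for (String id : executorContainerIds) {
            rm.releaseAssignedContainer(id);
        }
        // Unregister last, so no executor container is still held at this point.
        rm.unregisterApplicationMaster(finalStatus);
    }
}
```

Calling AmShutdown.releaseThenUnregister with a RecordingRmClient and two container IDs records both release calls before the single unregister call. With the real client, released containers are, to my understanding, only reported to the ResourceManager on the next allocate heartbeat, so an actual patch would probably need to account for that as well.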



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
