[ https://issues.apache.org/jira/browse/SPARK-13642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192828#comment-15192828 ]
Apache Spark commented on SPARK-13642:
--------------------------------------

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/11690

Properly handle signal kill of ApplicationMaster
------------------------------------------------

                 Key: SPARK-13642
                 URL: https://issues.apache.org/jira/browse/SPARK-13642
             Project: Spark
          Issue Type: Bug
          Components: YARN
    Affects Versions: 1.6.0
            Reporter: Saisai Shao
            Assignee: Saisai Shao

Currently, when running Spark on YARN in yarn-cluster mode, the default final application status is SUCCEEDED; if any exception occurs, the final status is changed to FAILED, which triggers a reattempt if possible.

This is fine in the normal case, but there is a race condition when the AM receives a signal (SIGTERM) while no exception has occurred. In that situation the shutdown hook is invoked and marks the application as finished successfully, so there is no further attempt.

From Spark's perspective the application has actually failed and needs another attempt, but from YARN's perspective it finished successfully.

This can happen when a NodeManager fails: the NM failure sends SIGTERM to the AM, and the AM should mark the attempt as failed and rerun, not unregister.

To increase the chance of hitting this race condition, here is code to reproduce it:

{code}
val sc = ... // create the SparkContext as usual
Thread.sleep(30000L)
sc.parallelize(1 to 100).collect()
{code}

If the AM is killed during the sleep, no exception is thrown, so from YARN's point of view the application finished successfully, while from Spark's point of view it should be reattempted.
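The mechanism behind the misreport is that JVM shutdown hooks run on SIGTERM as well as on normal exit, so a hook by itself cannot tell a clean finish from an external kill. A minimal standalone sketch of the race (plain Scala with hypothetical status strings, not the actual ApplicationMaster code):

{code}
object ShutdownRaceDemo {
  // Hypothetical stand-in for the AM's "user class finished" state.
  @volatile private var userClassFinished = false

  def main(args: Array[String]): Unit = {
    // Shutdown hooks run on normal exit AND on SIGTERM (signal 15).
    sys.addShutdownHook {
      if (userClassFinished) {
        println("final status: SUCCEEDED") // genuine success
      } else {
        // A hook that unconditionally reports success also takes the
        // success path on SIGTERM -- this is the race described above.
        println("killed before user code finished -- should be FAILED")
      }
    }

    Thread.sleep(30000L) // run `kill -TERM <pid>` during this window
    userClassFinished = true
  }
}
{code}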
The log normally looks like this:

{noformat}
16/03/03 12:44:19 INFO ContainerManagementProtocolProxy: Opening proxy : 192.168.0.105:45454
16/03/03 12:44:21 INFO YarnClusterSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (192.168.0.105:57461) with ID 2
16/03/03 12:44:21 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.0.105:57462 with 511.1 MB RAM, BlockManagerId(2, 192.168.0.105, 57462)
16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (192.168.0.105:57467) with ID 1
16/03/03 12:44:23 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.0.105:57468 with 511.1 MB RAM, BlockManagerId(1, 192.168.0.105, 57468)
16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
16/03/03 12:44:23 INFO YarnClusterScheduler: YarnClusterScheduler.postStartHook done
16/03/03 12:44:39 ERROR ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
16/03/03 12:44:39 INFO SparkContext: Invoking stop() from shutdown hook
16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/metrics/json,null}
16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage/kill,null}
16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/api,null}
16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/,null}
16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/static,null}
16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors/threadDump/json,null}
16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors/threadDump,null}
16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors/json,null}
16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors,null}
16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/environment/json,null}
16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/environment,null}
16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/rdd/json,null}
16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/rdd,null}
16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/json,null}
16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage,null}
16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/pool/json,null}
16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/pool,null}
16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage/json,null}
16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage,null}
16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/json,null}
16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages,null}
16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/jobs/job/json,null}
16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/jobs/job,null}
16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/jobs/json,null}
16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/jobs,null}
16/03/03 12:44:39 INFO SparkUI: Stopped Spark web UI at http://192.168.0.105:57452
16/03/03 12:44:39 INFO YarnAllocator: Driver requested a total number of 0 executor(s).
16/03/03 12:44:39 INFO YarnAllocator: Canceling requests for 0 executor containers
16/03/03 12:44:39 INFO YarnClusterSchedulerBackend: Shutting down all executors
16/03/03 12:44:39 WARN YarnAllocator: Expected to find pending requests, but found none.
16/03/03 12:44:39 INFO YarnClusterSchedulerBackend: Asking each executor to shut down
16/03/03 12:44:39 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices (serviceOption=None, services=List(), started=false)
16/03/03 12:44:39 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/03/03 12:44:39 INFO MemoryStore: MemoryStore cleared
16/03/03 12:44:39 INFO BlockManager: BlockManager stopped
16/03/03 12:44:39 INFO BlockManagerMaster: BlockManagerMaster stopped
16/03/03 12:44:39 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/03/03 12:44:39 INFO SparkContext: Successfully stopped SparkContext
16/03/03 12:44:39 INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0, (reason: Shutdown hook called before final status was reported.)
16/03/03 12:44:39 INFO ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED (diag message: Shutdown hook called before final status was reported.)
16/03/03 12:44:39 INFO AMRMClientImpl: Waiting for application to be successfully unregistered.
16/03/03 12:44:39 INFO ApplicationMaster: Deleting staging directory .sparkStaging/application_1456975940304_0004
16/03/03 12:44:40 INFO ShutdownHookManager: Shutdown hook called
{noformat}

So basically, I think the application should be marked as SUCCEEDED only after the user class has finished; otherwise, especially in the signal-stop scenario, it would be better to mark the attempt as failed and try again (except when the application is explicitly killed by YARN).
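One way to express that policy (a sketch of the idea only; the actual fix is in the linked pull request, and the status strings and helper names below are illustrative, not Spark's real ApplicationMaster API) is to let the final status be reported exactly once, by whichever of the user class or the shutdown hook gets there first, and to skip unregistering on an early kill so YARN counts the attempt as failed and reattempts:

{code}
import java.util.concurrent.atomic.AtomicBoolean

object AmFinalStatusSketch {
  private val finalStatusReported = new AtomicBoolean(false)

  def reportFinalStatus(status: String): Unit = {
    // Only the first caller wins: either the user class on completion,
    // or the shutdown hook on an early SIGTERM.
    if (finalStatusReported.compareAndSet(false, true)) {
      if (status == "SUCCEEDED") {
        println(s"unregister with $status") // YARN treats the app as done
      } else {
        // Skipping unregister leaves the attempt marked as failed,
        // so YARN can schedule another attempt.
        println(s"do NOT unregister; final status $status")
      }
    }
  }

  def main(args: Array[String]): Unit = {
    sys.addShutdownHook {
      // Wins only if the JVM is shut down (e.g. by SIGTERM) before
      // the user class completes.
      reportFinalStatus("FAILED")
    }
    runUserClass()                 // stand-in for the driver's user code
    reportFinalStatus("SUCCEEDED") // reached only on normal completion
  }

  private def runUserClass(): Unit = Thread.sleep(30000L)
}
{code}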