[ 
https://issues.apache.org/jira/browse/SPARK-13642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-13642:
--------------------------------
    Summary: Different finishing state between Spark and YARN  (was: 
Inconsistent finishing state between driver and AM )

> Different finishing state between Spark and YARN
> ------------------------------------------------
>
>                 Key: SPARK-13642
>                 URL: https://issues.apache.org/jira/browse/SPARK-13642
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.6.0
>            Reporter: Saisai Shao
>
> Currently, when running Spark on YARN in yarn-cluster mode, the default final 
> application state is "SUCCEEDED". If an exception occurs, the final state is 
> changed to "FAILED" and a reattempt is triggered if possible.
> This is fine in the normal case, but there is a race condition when the AM 
> receives a signal (SIGTERM) and no exception has occurred. In that situation, 
> the shutdown hook is invoked and marks the application as finished 
> successfully, so no further attempt is made.
> In such a situation, from Spark's point of view the application has failed and 
> needs another attempt, but from YARN's point of view it has finished 
> successfully.
> This can happen when a NodeManager fails: the NM failure sends SIGTERM to the 
> AM, and the AM should mark this attempt as failed and rerun, rather than 
> invoke unregister.
> To increase the chance of hitting this race condition, here is code that 
> reproduces it:
> {code}
> val sc = new SparkContext(new SparkConf())  // any SparkContext creation will do
> Thread.sleep(30000L)
> sc.parallelize(1 to 100).collect()
> {code}
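> Delivering SIGTERM to the AM (for example via the NM failure described above) 
> during the 30-second sleep hits this window.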
> If the AM receives the SIGTERM while the application is sleeping, no exception 
> is thrown, so from YARN's point of view the application finished successfully, 
> but from Spark's point of view it should be reattempted.
> The log typically looks like this:
> {noformat}
> 16/03/03 12:44:19 INFO ContainerManagementProtocolProxy: Opening proxy : 
> 192.168.0.105:45454
> 16/03/03 12:44:21 INFO YarnClusterSchedulerBackend: Registered executor 
> NettyRpcEndpointRef(null) (192.168.0.105:57461) with ID 2
> 16/03/03 12:44:21 INFO BlockManagerMasterEndpoint: Registering block manager 
> 192.168.0.105:57462 with 511.1 MB RAM, BlockManagerId(2, 192.168.0.105, 57462)
> 16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: Registered executor 
> NettyRpcEndpointRef(null) (192.168.0.105:57467) with ID 1
> 16/03/03 12:44:23 INFO BlockManagerMasterEndpoint: Registering block manager 
> 192.168.0.105:57468 with 511.1 MB RAM, BlockManagerId(1, 192.168.0.105, 57468)
> 16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: SchedulerBackend is ready 
> for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
> 16/03/03 12:44:23 INFO YarnClusterScheduler: 
> YarnClusterScheduler.postStartHook done
> 16/03/03 12:44:39 ERROR ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
> 16/03/03 12:44:39 INFO SparkContext: Invoking stop() from shutdown hook
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/metrics/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/stages/stage/kill,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/api,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/static,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/executors/threadDump/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/executors/threadDump,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/executors/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/executors,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/environment/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/environment,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/storage/rdd/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/storage/rdd,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/storage/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/storage,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/stages/pool/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/stages/pool,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/stages/stage/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/stages/stage,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/stages/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/stages,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/jobs/job/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/jobs/job,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/jobs/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/jobs,null}
> 16/03/03 12:44:39 INFO SparkUI: Stopped Spark web UI at 
> http://192.168.0.105:57452
> 16/03/03 12:44:39 INFO YarnAllocator: Driver requested a total number of 0 
> executor(s).
> 16/03/03 12:44:39 INFO YarnAllocator: Canceling requests for 0 executor 
> containers
> 16/03/03 12:44:39 INFO YarnClusterSchedulerBackend: Shutting down all 
> executors
> 16/03/03 12:44:39 WARN YarnAllocator: Expected to find pending requests, but 
> found none.
> 16/03/03 12:44:39 INFO YarnClusterSchedulerBackend: Asking each executor to 
> shut down
> 16/03/03 12:44:39 INFO SchedulerExtensionServices: Stopping 
> SchedulerExtensionServices
> (serviceOption=None,
>  services=List(),
>  started=false)
> 16/03/03 12:44:39 INFO MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!
> 16/03/03 12:44:39 INFO MemoryStore: MemoryStore cleared
> 16/03/03 12:44:39 INFO BlockManager: BlockManager stopped
> 16/03/03 12:44:39 INFO BlockManagerMaster: BlockManagerMaster stopped
> 16/03/03 12:44:39 INFO 
> OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> 16/03/03 12:44:39 INFO SparkContext: Successfully stopped SparkContext
> 16/03/03 12:44:39 INFO ApplicationMaster: Final app status: SUCCEEDED, 
> exitCode: 0, (reason: Shutdown hook called before final status was reported.)
> 16/03/03 12:44:39 INFO ApplicationMaster: Unregistering ApplicationMaster 
> with SUCCEEDED (diag message: Shutdown hook called before final status was 
> reported.)
> 16/03/03 12:44:39 INFO AMRMClientImpl: Waiting for application to be 
> successfully unregistered.
> 16/03/03 12:44:39 INFO ApplicationMaster: Deleting staging directory 
> .sparkStaging/application_1456975940304_0004
> 16/03/03 12:44:40 INFO ShutdownHookManager: Shutdown hook called
> {noformat}
> So basically, I think the application should only be marked as "SUCCEEDED" 
> after the user class has finished; otherwise, especially in the 
> signal-stopped scenario, it would be better to mark the attempt as failed and 
> try again (except when the application is explicitly killed by YARN).
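> A minimal standalone sketch of that idea, assuming illustrative names 
> (userClassFinished, unregister) rather than Spark's actual ApplicationMaster 
> API: the shutdown hook reports SUCCEEDED only if the user class has actually 
> completed; otherwise it reports FAILED so that YARN can schedule another 
> attempt.
> {code}
> import java.util.concurrent.atomic.AtomicBoolean
> 
> // Illustrative sketch only: unregister() stands in for the real call that
> // reports the final application status to YARN.
> object FinalStatusSketch {
>   private val userClassFinished = new AtomicBoolean(false)
> 
>   private def unregister(finalStatus: String, diag: String): Unit =
>     println(s"Unregistering with $finalStatus ($diag)")
> 
>   def main(args: Array[String]): Unit = {
>     sys.addShutdownHook {
>       if (userClassFinished.get()) {
>         unregister("SUCCEEDED", "user class completed normally")
>       } else {
>         // A SIGTERM (e.g. from an NM failure) arrived before the user class
>         // finished: report FAILED so YARN counts this attempt as failed and
>         // can schedule another one.
>         unregister("FAILED", "shutdown before user class finished")
>       }
>     }
> 
>     // Stand-in for the user class: a SIGTERM during this sleep hits the race.
>     Thread.sleep(30000L)
>     userClassFinished.set(true)
>   }
> }
> {code}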



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
