[ https://issues.apache.org/jira/browse/SPARK-10568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901164#comment-14901164 ]
Matt Cheah commented on SPARK-10568:
------------------------------------

Yup!

> Error thrown in stopping one component in SparkContext.stop() doesn't allow
> other components to be stopped
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-10568
>                 URL: https://issues.apache.org/jira/browse/SPARK-10568
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.4.1
>            Reporter: Matt Cheah
>            Priority: Minor
>
> When I shut down a Java process that is running a SparkContext, it invokes a
> shutdown hook that eventually calls SparkContext.stop(), and inside
> SparkContext.stop() each individual component (DiskBlockManager,
> SchedulerBackend, etc.) is stopped. If an exception is thrown while stopping
> one of these components, none of the remaining components are stopped cleanly
> either. This caused problems when I stopped a Java process running a
> SparkContext in yarn-client mode, because the YarnSchedulerBackend was never
> stopped properly.
>
> The steps I ran are as follows:
> 1. Create one job which fills the cluster
> 2. Kick off another job which creates a SparkContext
> 3. Kill the Java process with the SparkContext in #2
> 4. The job remains in the YARN UI as ACCEPTED
>
> Looking in the logs we see the following:
> {code}
> 2015-09-07 10:32:43,446 ERROR [Thread-3] o.a.s.u.Utils - Uncaught exception in thread Thread-3
> java.lang.NullPointerException: null
>     at org.apache.spark.storage.DiskBlockManager.org$apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:162) ~[spark-core_2.10-1.4.1.jar:1.4.1]
>     at org.apache.spark.storage.DiskBlockManager$$anonfun$addShutdownHook$1.apply$mcV$sp(DiskBlockManager.scala:144) ~[spark-core_2.10-1.4.1.jar:1.4.1]
>     at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2308) ~[spark-core_2.10-1.4.1.jar:1.4.1]
> {code}
> I think what's going on is that when we kill the application in the queued
> state, it tries to run the SparkContext.stop() method on the driver and stop
> each component. It dies trying to stop the DiskBlockManager, which hasn't
> been initialized yet because the application is still waiting to be scheduled
> by the YARN RM. As a result YarnClient.stop() is never invoked, leaving the
> application stuck in the ACCEPTED state.
>
> Because of what appear to be bugs in the YARN scheduler, entering this state
> prevents the YARN scheduler from scheduling any more jobs until we manually
> remove this application via the YARN CLI. We can tackle the YARN stuck state
> separately, but ensuring that every component gets at least a chance to stop
> when a SparkContext stops seems like a good idea. Of course we can still
> throw and/or log exceptions for everything that goes wrong at the end of
> stopping the context.
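For illustration only, a minimal Scala sketch of the per-component guard
suggested above. The Stoppable trait, tryStop helper, and stopAll method are
hypothetical names for this example, not Spark's actual code:

{code}
import scala.util.control.NonFatal

// Hypothetical stand-in for the components SparkContext.stop() shuts down;
// this is not Spark's API, just an illustration of the guard pattern.
trait Stoppable { def stop(): Unit }

object StopAllComponents {
  // Run one component's stop() and log (rather than propagate) any non-fatal
  // error, so a failure in one component cannot prevent the rest from stopping.
  private def tryStop(name: String)(body: => Unit): Unit = {
    try body catch {
      case NonFatal(e) =>
        System.err.println(s"Exception while stopping $name: $e")
    }
  }

  // Stop every component in order, regardless of earlier failures.
  def stopAll(components: Seq[(String, Stoppable)]): Unit = {
    components.foreach { case (name, c) => tryStop(name)(c.stop()) }
  }
}
{code}

With this kind of guard around each stop call, a NullPointerException from an
uninitialized DiskBlockManager would be logged, and the remaining components
(including the YARN scheduler backend) would still get their chance to stop.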