[ https://issues.apache.org/jira/browse/SPARK-10568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901164#comment-14901164 ]
Matt Cheah commented on SPARK-10568:
------------------------------------

Yup!

> Error thrown in stopping one component in SparkContext.stop() doesn't allow
> other components to be stopped
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-10568
>                 URL: https://issues.apache.org/jira/browse/SPARK-10568
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.4.1
>            Reporter: Matt Cheah
>            Priority: Minor
>
> When I shut down a Java process that is running a SparkContext, it invokes a
> shutdown hook that eventually calls SparkContext.stop(), and inside
> SparkContext.stop() each individual component (DiskBlockManager,
> SchedulerBackend, etc.) is stopped. If an exception is thrown while stopping
> one of these components, none of the remaining components are stopped cleanly
> either. This caused problems when I stopped a Java process running a
> SparkContext in yarn-client mode, because the YarnSchedulerBackend was never
> stopped properly.
>
> The steps I ran are as follows:
> 1. Create one job which fills the cluster
> 2. Kick off another job which creates a SparkContext
> 3. Kill the Java process with the SparkContext in #2
> 4. The job remains in the YARN UI as ACCEPTED
>
> Looking in the logs we see the following:
> {code}
> 2015-09-07 10:32:43,446 ERROR [Thread-3] o.a.s.u.Utils - Uncaught exception in thread Thread-3
> java.lang.NullPointerException: null
>     at org.apache.spark.storage.DiskBlockManager.org$apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:162) ~[spark-core_2.10-1.4.1.jar:1.4.1]
>     at org.apache.spark.storage.DiskBlockManager$$anonfun$addShutdownHook$1.apply$mcV$sp(DiskBlockManager.scala:144) ~[spark-core_2.10-1.4.1.jar:1.4.1]
>     at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2308) ~[spark-core_2.10-1.4.1.jar:1.4.1]
> {code}
> I think what's going on is that when we kill the application in the queued
> state, it tries to run the SparkContext.stop() method on the driver and stop
> each component. It dies trying to stop the DiskBlockManager, which hasn't
> been initialized yet because the application is still waiting to be scheduled
> by the YARN RM. As a result YarnClient.stop() is never invoked, leaving the
> application stuck in the ACCEPTED state.
>
> Because of what appear to be bugs in the YARN scheduler, entering this state
> prevents the YARN scheduler from scheduling any more jobs until we manually
> remove this application via the YARN CLI. We can tackle the YARN stuck state
> separately, but ensuring that every component gets at least a chance to stop
> when a SparkContext stops seems like a good idea. Of course we can still
> throw and/or log exceptions for everything that goes wrong at the end of
> stopping the context.
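For illustration only, a minimal Scala sketch of the per-component guard
suggested above. The Stoppable trait, tryStop helper, and stopAll method are
hypothetical names for this example, not Spark's actual code:

{code}
import scala.util.control.NonFatal

// Hypothetical stand-in for the components SparkContext.stop() shuts down;
// this is not Spark's API, just an illustration of the guard pattern.
trait Stoppable { def stop(): Unit }

object StopAllComponents {
  // Run one component's stop() and log (rather than propagate) any non-fatal
  // error, so a failure in one component cannot prevent the rest from stopping.
  private def tryStop(name: String)(body: => Unit): Unit = {
    try body catch {
      case NonFatal(e) =>
        System.err.println(s"Exception while stopping $name: $e")
    }
  }

  // Stop every component in order, regardless of earlier failures.
  def stopAll(components: Seq[(String, Stoppable)]): Unit = {
    components.foreach { case (name, c) => tryStop(name)(c.stop()) }
  }
}
{code}

With this kind of guard around each stop call, a NullPointerException from an
uninitialized DiskBlockManager would be logged, and the remaining components
(including the YARN scheduler backend) would still get their chance to stop.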