[
https://issues.apache.org/jira/browse/SPARK-10568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-10568.
----------------------------------
Resolution: Incomplete
> Error thrown in stopping one component in SparkContext.stop() doesn't allow
> other components to be stopped
> ----------------------------------------------------------------------------------------------------------
>
> Key: SPARK-10568
> URL: https://issues.apache.org/jira/browse/SPARK-10568
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.4.1
> Reporter: Matt Cheah
> Priority: Minor
> Labels: bulk-closed
>
> When I shut down a Java process that is running a SparkContext, it invokes a
> shutdown hook that eventually calls SparkContext.stop(), and inside
> SparkContext.stop() each individual component (DiskBlockManager, Scheduler
> Backend) is stopped. If an exception is thrown while stopping one of these
> components, none of the remaining components are stopped cleanly either. This
> caused problems when I stopped a Java process running a Spark context in
> yarn-client mode, because failing to stop YarnSchedulerBackend properly
> leaves the application stuck in YARN.
> The steps I ran are as follows:
> 1. Create one job which fills the cluster
> 2. Kick off another job which creates a Spark Context
> 3. Kill the Java process with the Spark Context in #2
> 4. The job remains in the YARN UI as ACCEPTED
> Looking in the logs we see the following:
> {code}
> 2015-09-07 10:32:43,446 ERROR [Thread-3] o.a.s.u.Utils - Uncaught exception
> in thread Thread-3
> java.lang.NullPointerException: null
> at
> org.apache.spark.storage.DiskBlockManager.org$apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:162)
> ~[spark-core_2.10-1.4.1.jar:1.4.1]
> at
> org.apache.spark.storage.DiskBlockManager$$anonfun$addShutdownHook$1.apply$mcV$sp(DiskBlockManager.scala:144)
> ~[spark-core_2.10-1.4.1.jar:1.4.1]
> at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2308)
> ~[spark-core_2.10-1.4.1.jar:1.4.1]
> {code}
> I think what's going on is that when we kill the application in the queued
> state, it tries to run the SparkContext.stop() method on the driver and stop
> each component. It dies trying to stop the DiskBlockManager, because that
> component hasn't been initialized yet - the application is still waiting to
> be scheduled by the YARN RM - and as a result YarnClient.stop() is never
> invoked, leaving the application stuck in the ACCEPTED state.
> Because of what appear to be bugs in the YARN scheduler, entering this state
> leaves the YARN scheduler unable to schedule any more jobs until we manually
> remove this application via the YARN CLI. We can tackle the stuck YARN state
> separately, but ensuring that every component gets at least a chance to stop
> when a SparkContext stops seems like a good idea. Of course, we can still
> throw and/or log exceptions for everything that went wrong at the end of
> stopping the context.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]