GitHub user CodingCat opened a pull request:
https://github.com/apache/spark/pull/186
SPARK-1235: fail all jobs when DAGScheduler crashes for some reason
https://spark-project.atlassian.net/browse/SPARK-1235
In the current implementation, a running job will hang if the
DAGScheduler crashes for some reason (e.g., eventProcessActor throws an exception
in receive()).
The reason is that the actor is automatically restarted when an exception
thrown during processing is not handled (standard Akka behaviour), while the
JobWaiters keep waiting for the completion of tasks that will never finish.
In this patch, I override the preRestart hook of the actor to fail
all running jobs before the DAGScheduler restarts.
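The approach can be sketched roughly as follows (a minimal illustration only, not the actual patch; the `DAGScheduler` type stub and the `failAllRunningJobs` helper name are hypothetical):

```scala
import akka.actor.Actor

// Hypothetical stand-in for Spark's scheduler; in the real code base the
// DAGScheduler tracks active jobs and their JobWaiters.
class DAGScheduler {
  def failAllRunningJobs(reason: Throwable): Unit = {
    // Assumed helper: mark every active job as failed so its JobWaiter
    // is notified instead of blocking forever.
  }
  def processEvent(event: Any): Unit = { /* may throw */ }
}

class DAGSchedulerEventProcessActor(dagScheduler: DAGScheduler) extends Actor {

  // Akka invokes preRestart when receive() threw an unhandled exception,
  // just before replacing the actor instance. Cleaning up here prevents
  // JobWaiters from hanging on jobs that can no longer complete.
  override def preRestart(reason: Throwable, message: Option[Any]): Unit = {
    dagScheduler.failAllRunningJobs(reason)
    super.preRestart(reason, message)
  }

  def receive = {
    case event => dagScheduler.processEvent(event) // a throw here triggers the restart path
  }
}
```

Without this hook, Akka's default restart silently discards the actor's in-flight state, which is exactly why the waiting JobWaiters were never notified.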
Thanks to @kayousterhout and @markhamstra for the hints in JIRA.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/CodingCat/spark SPARK-1235
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/186.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #186
----
commit b417b763b3dec602b1262ec4f28460181d32e5ff
Author: CodingCat <[email protected]>
Date: 2014-03-20T04:59:52Z
fail all jobs when DAGScheduler crashes for some reason
----