GitHub user kayousterhout opened a pull request: https://github.com/apache/spark/pull/309
[SPARK-1397] [WIP] Notify SparkListeners when stages fail or are cancelled.

[I wanted to post this for folks to comment on, but it depends on (and thus includes the changes in) a currently outstanding PR, #305. You can look at just the second commit, https://github.com/kayousterhout/spark-1/commit/93f08baf731b9eaf5c9792a5373560526e2bccac, to see only the changes relevant to this PR.]

Previously, when stages failed or were cancelled, the SparkListener was notified only indirectly through SparkListenerJobEnd, where we sometimes passed in a single stage that failed. This worked before job cancellation, because jobs could only fail due to a single stage failure. With job cancellation, however, multiple running stages can fail when a job is cancelled. Right now this is not handled correctly, which results in stages getting stuck in the "Running Stages" window of the UI even though they are dead.

This PR changes the SparkListenerStageCompleted event to a SparkListenerStageEnded event, and uses this event to tell SparkListeners when stages fail in addition to when they complete successfully (an illustrative sketch of the listener-side handling follows the commit log below).

This change is NOT publicly backward compatible, for two reasons. First, it changes the SparkListener interface. We could alternatively add a new event, SparkListenerStageFailed, and keep the existing SparkListenerStageCompleted; however, that would be less consistent with the listener events for tasks and jobs ending, and would lead to some code duplication in listeners (because failed and completed stages are handled in similar ways). Note that I haven't finished updating the JSON code to handle the new event correctly, because I'm waiting for feedback on whether this is a good or bad idea (hence the "WIP"). Second, it changes the publicly visible JobWaiter.jobFailed() method to no longer include a stage that caused the failure. I think this part of the change should definitely stay, because with cancellation (as described above) a failure isn't necessarily caused by a single stage.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kayousterhout/spark-1 stage_cancellation

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/309.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #309

----

commit 6aae3b119dd259330f9530a27af84c9575967132
Author: Kay Ousterhout <kayousterh...@gmail.com>
Date: 2014-04-02T18:14:53Z

    Properly cleanup DAGScheduler on job cancellation.

    Previously, when jobs were cancelled, not all of the state in the DAGScheduler was cleaned up, leading to a slow memory leak in the DAGScheduler. As we expose easier ways to cancel jobs, it becomes more important to fix these issues.

    This commit adds three tests. "run shuffle with map stage failure" is a new test that exercises this functionality more thoroughly; it passes on both the old and new versions of the code. "trivial job cancellation" fails on the old code because not all state was cleaned up correctly when jobs were cancelled (we didn't remove the job from resultStageToJob). "failure of stage used by two jobs" fails on the old code because taskScheduler.cancelTasks wasn't called for one of the stages (see test comments).

commit 93f08baf731b9eaf5c9792a5373560526e2bccac
Author: Kay Ousterhout <kayousterh...@gmail.com>
Date: 2014-04-02T22:54:29Z

    Notify SparkListeners when stages fail or are cancelled.
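To make the shape of the proposed change concrete, here is a minimal, self-contained Scala sketch of the listener-side bookkeeping this PR is after. All of the names below (SparkListenerStageEnded, StageEndReason, StageSucceeded, StageFailed, RunningStagesTracker) are hypothetical stand-ins modeled on the description above, not the actual Spark API. The point is that a single stage-end event lets a listener clear a stage from its "running" set whether the stage succeeded or failed, which is exactly the bookkeeping that breaks when cancelled stages never get a completion event.

    // Hypothetical sketch only: these types are illustrative stand-ins for
    // the unified stage-end event described above, not the real Spark API.
    sealed trait StageEndReason
    case object StageSucceeded extends StageEndReason
    final case class StageFailed(reason: String) extends StageEndReason

    final case class SparkListenerStageEnded(stageId: Int, endReason: StageEndReason)

    // A UI-style listener that tracks which stages are running. Routing both
    // success and failure through one event keeps the "running" set
    // consistent even when a cancelled job kills several stages at once.
    final class RunningStagesTracker {
      private val running = scala.collection.mutable.Set.empty[Int]

      def onStageSubmitted(stageId: Int): Unit = running += stageId

      def onStageEnded(event: SparkListenerStageEnded): Unit = {
        // Whatever the reason, the stage is no longer running. This removal
        // is the step that is skipped today for failed and cancelled stages.
        running -= event.stageId
        event.endReason match {
          case StageSucceeded      => println(s"stage ${event.stageId} completed")
          case StageFailed(reason) => println(s"stage ${event.stageId} failed: $reason")
        }
      }

      def runningStages: Set[Int] = running.toSet
    }

This also shows the consistency argument made above: as with the task and job end events, the outcome travels inside a single end event rather than determining which of two near-duplicate events fires.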
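For the second backward-compatibility point, the change to JobWaiter.jobFailed(), here is a similarly hypothetical before/after sketch. The PR text only says the method "no longer include[s] a stage", so the signatures below are reconstructed assumptions, not the real method; Stage is an opaque placeholder type.

    // Reconstructed for illustration only; the real signatures may differ.
    final class Stage(val id: Int)

    trait OldJobListener {
      // Assumed old shape: the failure notification carried the single
      // stage blamed for the failure.
      def jobFailed(exception: Exception, failedStage: Option[Stage]): Unit
    }

    trait NewJobListener {
      // Proposed shape: with cancellation, several running stages can fail
      // at once, so no single stage can be singled out as the cause.
      def jobFailed(exception: Exception): Unit
    }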