GitHub user kayousterhout opened a pull request:

    https://github.com/apache/spark/pull/309

    [SPARK-1397] [WIP] Notify SparkListeners when stages fail or are cancelled.

    [I wanted to post this for folks to comment but it depends on (and thus
    includes the changes in) a currently outstanding PR, #305.  You can look
    at just the second commit:
    https://github.com/kayousterhout/spark-1/commit/93f08baf731b9eaf5c9792a5373560526e2bccac
    to see just the changes relevant to this PR]
    
    Previously, when stages fail or get cancelled, the SparkListener is only
    notified indirectly through the SparkListenerJobEnd, where we sometimes
    pass in a single stage that failed.  This worked before job cancellation,
    because jobs would only fail due to a single stage failure.  However, with
    job cancellation, multiple running stages can fail when a job gets
    cancelled.  Right now, this is not handled correctly, which results in
    stages that get stuck in the “Running Stages” window in the UI even though
    they’re dead.
    
    This PR changes the SparkListenerStageCompleted event to a
    SparkListenerStageEnded event, and uses this event to tell SparkListeners
    when stages fail in addition to when they complete successfully.  This
    change is NOT publicly backward compatible for two reasons.  First, it
    changes the SparkListener interface.  We could alternatively add a new
    event, SparkListenerStageFailed, and keep the existing
    SparkListenerStageCompleted.  However, this is less consistent with the
    listener events for tasks / jobs ending, and will result in some code
    duplication for listeners (because failed and completed stages are handled
    in similar ways).  Note that I haven’t finished updating the JSON code to
    correctly handle the new event because I’m waiting for feedback on whether
    this is a good or bad idea (hence the “WIP”).
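    
    To make the shape of the proposed interface concrete, here is a rough
    Scala sketch of what a listener could look like if a single stage-ended
    event carries an optional failure reason.  The names used here
    (SparkListenerStageEnded, onStageEnded, failureReason, StageInfoSketch)
    are placeholders based on the description above, not the final API:
    
        // Sketch only: the event and field names below are placeholders for
        // the proposed API, not the actual Spark classes.
        case class StageInfoSketch(stageId: Int, name: String,
                                   failureReason: Option[String])
        case class SparkListenerStageEnded(stageInfo: StageInfoSketch)
    
        trait StageEndedListener {
          // One callback covers successful completion, failure, and
          // cancellation, which is what lets the UI move stages out of
          // "Running Stages" in every case.
          def onStageEnded(event: SparkListenerStageEnded): Unit
        }
    
        class LoggingStageListener extends StageEndedListener {
          override def onStageEnded(event: SparkListenerStageEnded): Unit = {
            event.stageInfo.failureReason match {
              case None         => println(s"Stage ${event.stageInfo.stageId} completed")
              case Some(reason) => println(s"Stage ${event.stageInfo.stageId} ended: $reason")
            }
          }
        }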
    
    It is also not backwards compatible because it changes the publicly
    visible JobWaiter.jobFailed() method to no longer include a stage that
    caused the failure.  I think this change should definitely stay, because
    with cancellation (as described above), a failure isn’t necessarily caused
    by a single stage.
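    
    For illustration, the jobFailed() change amounts to something like the
    following; the parameter and type names are illustrative stand-ins rather
    than the exact definitions in the patch:
    
        // Illustrative only -- these are not the exact Spark signatures.
        object JobFailedSignatureSketch {
          final case class StageSketch(id: Int)
    
          // Before: the notification named the single stage blamed for the
          // failure.
          trait OldJobListener {
            def jobFailed(exception: Exception, failedStage: StageSketch): Unit
          }
    
          // After: a cancelled job may have several running stages, so no
          // single stage is singled out; only the exception is reported.
          trait NewJobListener {
            def jobFailed(exception: Exception): Unit
          }
        }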

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kayousterhout/spark-1 stage_cancellation

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/309.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #309
    
----
commit 6aae3b119dd259330f9530a27af84c9575967132
Author: Kay Ousterhout <kayousterh...@gmail.com>
Date:   2014-04-02T18:14:53Z

    Properly cleanup DAGScheduler on job cancellation.
    
    Previously, when jobs were cancelled, not all of the state in the
    DAGScheduler was cleaned up, leading to a slow memory leak in the
    DAGScheduler.  As we expose easier ways to cancel jobs, it's more
    important to fix these issues.
    
    This commit adds 3 tests.  “run shuffle with map stage failure” is
    a new test to more thoroughly test this functionality, and passes on
    both the old and new versions of the code.  “trivial job
    cancellation” fails on the old code because all state wasn’t cleaned
    up correctly when jobs were cancelled (we didn’t remove the job from
    resultStageToJob).  “failure of stage used by two jobs” fails on the
    old code because taskScheduler.cancelTasks wasn’t called for one of
    the stages (see test comments).
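    
    Not part of the patch, but for context, the kind of user-facing
    cancellation that ends up relying on this cleanup looks roughly like the
    sketch below (using the public setJobGroup / cancelJobGroup methods on
    SparkContext; the job, group name, and timing are arbitrary):
    
        import org.apache.spark.{SparkConf, SparkContext}
    
        object CancelJobGroupSketch {
          def main(args: Array[String]): Unit = {
            val conf = new SparkConf().setMaster("local[2]").setAppName("cancel-sketch")
            val sc = new SparkContext(conf)
    
            // Submit a deliberately slow job from a separate thread, tagged
            // with a job group so it can be cancelled from the main thread.
            val worker = new Thread(new Runnable {
              override def run(): Unit = {
                sc.setJobGroup("slow-group", "a job we intend to cancel")
                try {
                  sc.parallelize(1 to 1000000, 4).map { i => Thread.sleep(1); i }.count()
                } catch {
                  case _: Exception => ()  // cancellation surfaces as a failed job
                }
              }
            })
            worker.start()
    
            Thread.sleep(2000)               // let some stages start running
            sc.cancelJobGroup("slow-group")  // cancels every running stage of the job
            worker.join()
            sc.stop()
          }
        }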

commit 93f08baf731b9eaf5c9792a5373560526e2bccac
Author: Kay Ousterhout <kayousterh...@gmail.com>
Date:   2014-04-02T22:54:29Z

    Notify SparkListeners when stages fail or are cancelled.
    
    Previously, when stages fail or get cancelled, the SparkListener is only
    notified indirectly through the SparkListenerJobEnd, where we sometimes
    pass in a single stage that failed.  This worked before job cancellation,
    because jobs would only fail due to a single stage failure.  However, with
    job cancellation, multiple running stages can fail when a job gets
    cancelled.  Right now, this is not handled correctly, which results in
    stages that get stuck in the “Running Stages” window in the UI even though
    they’re dead.
    
    This PR changes the SparkListenerStageCompleted event to a
    SparkListenerStageEnded event, and uses this event to tell SparkListeners
    when stages fail in addition to when they complete successfully.  This
    change is NOT publicly backward compatible for two reasons.  First, it
    changes the SparkListener interface.  We could alternatively add a new
    event, SparkListenerStageFailed, and keep the existing
    SparkListenerStageCompleted.  However, this is less consistent with the
    listener events for tasks / jobs ending, and will result in some code
    duplication for listeners (because failed and completed stages are handled
    in similar ways).  Note that I haven’t finished updating the JSON code to
    correctly handle the new event because I’m waiting for feedback on whether
    this is a good or bad idea (hence the “WIP”).
    
    It is also not backwards compatible because it changes the publicly
    visible JobWaiter.jobFailed() method to no longer include a stage that
    caused the failure.  I think this change should definitely stay, because
    with cancellation (as described above), a failure isn’t necessarily caused
    by a single stage.

----

