Github user kayousterhout commented on the pull request:

    https://github.com/apache/spark/pull/1940#issuecomment-55364614
  
    @andrewor14 I think you're right that there's a deeper problem here.  I 
haven't tested this, but here's what I think is going on:
    
    (1) In TaskSchedulerImpl.cancelTasks(), the killTask() call throws an 
UnsupportedOperationException, as is logged 
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L194).
  As a result, tsm.abort() never gets called, so the TaskSetManager still 
thinks everything is hunky-dory.
    (2) Slowly the rest of the tasks fail, triggering the handleFailedTask() 
code in TaskSetManager.  The TSM doesn't realize the task set is effectively 
dead because abort() was never called.
    (3) Now, what I would expect to happen is that the code here: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L605
 would trigger the task to be re-launched.  Eventually, a task would fail 4 
times and the stage would get killed.  This isn't exactly the right behavior, 
but still wouldn't lead to a hang.  It might be good to understand why that 
isn't happening.
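    
    To make the expectation in (3) concrete, here is a rough, untested 
sketch of the retry accounting I have in mind. FakeTaskSetState and its 
members are made-up stand-ins, not the real TaskSetManager; the limit of 4 
is the number mentioned above (in Spark it is controlled by 
spark.task.maxFailures).

```scala
object RetrySketch {
  // Hypothetical stand-in for the per-task failure bookkeeping; NOT Spark's class.
  final class FakeTaskSetState(numTasks: Int, maxTaskFailures: Int) {
    private val failures = Array.fill(numTasks)(0)
    var aborted: Boolean = false

    // Returns true if the task should be re-queued, false once the limit is hit
    // and the whole task set is marked as aborted.
    def handleFailedTask(index: Int): Boolean = {
      failures(index) += 1
      if (failures(index) >= maxTaskFailures) {
        aborted = true
        false
      } else {
        true
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val state = new FakeTaskSetState(numTasks = 2, maxTaskFailures = 4)
    // Fail task 0 four times and record whether each failure led to a retry.
    val retried = (1 to 4).map(_ => state.handleFailedTask(0))
    println(s"retry decisions: $retried, aborted: ${state.aborted}")
  }
}
```

    With this accounting, repeated failures of a single task end in an abort 
rather than a hang, which is why the observed hang is surprising.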
    
    Regardless of what's going on with (3), I think the right way to fix this 
is to move the tsm.abort() call here: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L196
 up to before we try to kill the task.  That way, regardless of whether 
killTask() is successful, we'll mark the task set as aborted and send all the 
appropriate events.
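    
    A minimal, untested sketch of why the ordering matters. FakeTaskSet, 
FakeBackend and NoKillBackend are hypothetical stand-ins I made up for 
illustration; they are not the real TaskSchedulerImpl, TaskSetManager or 
SchedulerBackend, and the real cancelTasks() does more than this.

```scala
// Hypothetical stand-ins; these are NOT Spark's real classes.
final class FakeTaskSet(val id: String) {
  var aborted: Boolean = false
  def abort(reason: String): Unit = { aborted = true }
}

trait FakeBackend {
  def killTask(taskId: Long): Unit
}

// A backend that cannot kill tasks, mimicking the failure described in (1).
object NoKillBackend extends FakeBackend {
  def killTask(taskId: Long): Unit =
    throw new UnsupportedOperationException("killTask not supported")
}

object CancelOrdering {
  // Ordering as described in (1): kill first, abort afterwards.
  def killThenAbort(ts: FakeTaskSet, running: Seq[Long], backend: FakeBackend): Unit = {
    running.foreach(backend.killTask)
    ts.abort("cancelled") // never reached if killTask throws
  }

  // Proposed ordering: abort first, then try to kill.
  def abortThenKill(ts: FakeTaskSet, running: Seq[Long], backend: FakeBackend): Unit = {
    ts.abort("cancelled")
    running.foreach(backend.killTask)
  }

  // Runs one cancel variant against the non-killing backend, swallows the
  // exception, and reports whether the task set ended up marked as aborted.
  def abortedAfter(cancel: (FakeTaskSet, Seq[Long], FakeBackend) => Unit): Boolean = {
    val ts = new FakeTaskSet("0.0")
    try cancel(ts, Seq(1L, 2L), NoKillBackend)
    catch { case _: UnsupportedOperationException => () }
    ts.aborted
  }

  def main(args: Array[String]): Unit = {
    println(s"kill-then-abort marks aborted: ${abortedAfter(killThenAbort)}")
    println(s"abort-then-kill marks aborted: ${abortedAfter(abortThenKill)}")
  }
}
```

    Something along these lines could also serve as the starting point for 
the reproducing unit test: assert that the task set is marked as aborted even 
when the backend refuses to kill tasks.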
    
    Also, whoever fixes this should definitely add a unit test!! It would be 
great to add a short unit test to show the problem first, so it's easier for 
others to reproduce, and then deal with the fix.

