Github user kayousterhout commented on the pull request:

    https://github.com/apache/spark/pull/1940#issuecomment-55364614

@andrewor14 I think you're right that there's a deeper problem here. I haven't tested this but here's what I think is going on:

(1) In TaskSchedulerImpl.cancelTasks(), the killTask call throws an unsupported operation exception, as is logged (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L194). As a result, tsm.abort() never gets called. So, the TaskSetManager still thinks everything is hunky dory.

(2) Slowly the rest of the tasks fail, triggering the handleFailedTask() code in TaskSetManager. The TSM doesn't realize the task set is effectively dead because abort() was never called.

(3) Now, what I would expect to happen is that the code here: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L605 would trigger the task to be re-launched. Eventually, a task would fail 4 times and the stage would get killed. This isn't exactly the right behavior, but still wouldn't lead to a hang. It might be good to understand why that isn't happening.

Regardless of what's going on with (3), I think the right way to fix this is to move the tsm.abort() call here: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L196 up to before we try to kill the task. That way, regardless of whether killTask() is successful, we'll mark the task set as aborted and send all the appropriate events.

Also, whoever fixes this should definitely add a unit test!! It would be great to add a short unit test to show the problem first, so it's easier for others to reproduce, and then deal with the fix.
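
To make the ordering argument concrete, here is a minimal toy model of the two call orders. It is only a sketch and not Spark code: `ToyTaskSetManager`, `ToyBackend`, `cancelKillFirst`, `cancelAbortFirst` and `threw` are hypothetical names invented for illustration, and the backend is assumed to throw `UnsupportedOperationException` from `killTask` on every call, as described in (1).

```scala
// Toy model only: every class and method here is a hypothetical stand-in,
// not the real TaskSchedulerImpl / TaskSetManager code.
object AbortBeforeKillSketch {

  // Stand-in for a TaskSetManager: it only remembers whether abort() ran.
  final class ToyTaskSetManager(val runningTaskIds: Seq[Long]) {
    var aborted: Boolean = false
    def abort(reason: String): Unit = { aborted = true }
  }

  // Stand-in for a scheduler backend that cannot kill tasks (assumption).
  final class ToyBackend {
    def killTask(taskId: Long): Unit =
      throw new UnsupportedOperationException(s"cannot kill task $taskId")
  }

  // Order described in (1): kill first, abort afterwards.
  def cancelKillFirst(tsm: ToyTaskSetManager, backend: ToyBackend): Unit = {
    tsm.runningTaskIds.foreach(backend.killTask)
    tsm.abort("stage cancelled") // not reached when killTask throws
  }

  // Proposed order: abort first, then attempt the kills.
  def cancelAbortFirst(tsm: ToyTaskSetManager, backend: ToyBackend): Unit = {
    tsm.abort("stage cancelled")
    tsm.runningTaskIds.foreach(backend.killTask)
  }

  // Runs a cancel call the way a caller that merely logs the failure would,
  // and reports whether the unsupported-operation exception was thrown.
  def threw(cancel: => Unit): Boolean =
    try { cancel; false } catch { case _: UnsupportedOperationException => true }

  def main(args: Array[String]): Unit = {
    val backend = new ToyBackend
    val killFirst = new ToyTaskSetManager(Seq(1L, 2L))
    val abortFirst = new ToyTaskSetManager(Seq(1L, 2L))

    println(s"kill first:  threw=${threw(cancelKillFirst(killFirst, backend))}, " +
      s"aborted=${killFirst.aborted}")
    println(s"abort first: threw=${threw(cancelAbortFirst(abortFirst, backend))}, " +
      s"aborted=${abortFirst.aborted}")
  }
}
```

In this toy model both orders still surface the exception, but only the abort-first order leaves the task set marked as aborted. A real unit test would need to exercise the actual scheduler classes with a backend that does not support killing tasks; the sketch only illustrates why the order of the two calls matters.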