[ 
https://issues.apache.org/jira/browse/SPARK-17064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15547073#comment-15547073
 ] 

Mridul Muralidharan commented on SPARK-17064:
---------------------------------------------


I agree, interrupting a thread can (and usually does) have unintended side 
effects.
On the plus side, our current model does mean that any actual processing of a 
tuple will cause the task to get killed ...

Perhaps we can check this state in shuffle fetch as well, to further avoid 
expensive prefix operations (data fetch, etc.)?
This will of course not catch the case of user code doing something extremely 
long-running and expensive while processing a single tuple - but then there is 
no guarantee thread interruption would work in that case anyway.
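
The cooperative model described above can be sketched roughly as follows (hypothetical names, not Spark's actual code): the task loop checks a volatile `killed` flag before processing each tuple, so a kill takes effect at the next tuple boundary without relying on Thread.interrupt:

```java
import java.util.Iterator;

// Sketch of cooperative cancellation: the scheduler sets a flag, and the
// task observes it at the next tuple boundary. The same check could sit in
// front of a shuffle fetch to skip the expensive data fetch entirely.
class CooperativeTask {
    private volatile boolean killed = false;   // set asynchronously on cancel

    void kill() { killed = true; }

    long run(Iterator<Integer> tuples) {
        long processed = 0;
        while (tuples.hasNext()) {
            if (killed) {
                // Abort before touching the next tuple; a long-running
                // per-tuple computation would still not notice the flag.
                throw new RuntimeException("task killed");
            }
            tuples.next();                     // "process" the tuple
            processed++;
        }
        return processed;
    }
}
```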

> Reconsider spark.job.interruptOnCancel
> --------------------------------------
>
>                 Key: SPARK-17064
>                 URL: https://issues.apache.org/jira/browse/SPARK-17064
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler, Spark Core
>            Reporter: Mark Hamstra
>
> There is a frequent need or desire in Spark to cancel already running Tasks.  
> This has been recognized for a very long time (see, e.g., the ancient TODO 
> comment in the DAGScheduler: "Cancel running tasks in the stage"), but we've 
> never had more than an incomplete solution.  Killing running Tasks at the 
> Executor level has been implemented by interrupting the threads running the 
> Tasks (taskThread.interrupt in o.a.s.scheduler.Task#kill). Since 
> https://github.com/apache/spark/commit/432201c7ee9e1ea1d70a6418cbad1c5ad2653ed3
>  addressing https://issues.apache.org/jira/browse/SPARK-1582, interrupting 
> Task threads in this way has only been possible if interruptThread is true, 
> and that typically comes from the setting of the interruptOnCancel property 
> in the JobGroup, which in turn typically comes from the setting of 
> spark.job.interruptOnCancel.  Because of concerns over 
> https://issues.apache.org/jira/browse/HDFS-1208 and the possibility of nodes 
> being marked dead when a Task thread is interrupted, the default value of the 
> boolean has been "false" -- i.e. by default we do not interrupt Tasks already 
> running on an Executor even when the Task has been canceled in the DAGScheduler, 
> the Stage has been aborted, or the Job has been killed, etc.
> There are several issues resulting from this current state of affairs, and 
> they each probably need to spawn their own JIRA issue and PR once we decide 
> on an overall strategy here.  Among those issues:
> * Is HDFS-1208 still an issue, or has it been resolved adequately in the HDFS 
> versions that Spark now supports so that we can set the default value of 
> spark.job.interruptOnCancel to "true" or eliminate this boolean flag entirely?
> * Even if interrupting Task threads is no longer an issue for HDFS, is it 
> still enough of an issue for non-HDFS usage (e.g. Cassandra) so that we still 
> need protection similar to what the current default value of 
> spark.job.interruptOnCancel provides?
> * If interrupting Task threads isn't safe enough, what should we do instead?
> * Once we have a safe mechanism to stop and clean up after already executing 
> Tasks, there is still the question of whether we _should_ end executing 
> Tasks.  While that is likely a good thing to do in cases where individual 
> Tasks are lightweight in terms of resource usage, at least in some cases not 
> all running Tasks should be ended: https://github.com/apache/spark/pull/12436 
>  That means that we probably need to continue to make allowing Task 
> interruption configurable at the Job or JobGroup level (and we need better 
> documentation explaining how and when to allow interruption or not.)
> * There is one place in the current code 
> (TaskSetManager#handleSuccessfulTask) that hard codes interruptThread to 
> "true".  This should be fixed, and similar misuses of killTask should be 
> denied in pull requests until this issue is adequately resolved.
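
As a rough illustration of the interruptThread guard discussed above (hypothetical class and method names, not Spark's actual Task implementation): kill always records cancellation, but the thread is only interrupted when the caller opted in, as spark.job.interruptOnCancel does at the JobGroup level:

```java
// Sketch of a kill guarded by an interruptThread flag. The killed flag is
// always set (cooperative path); Thread.interrupt() is only issued when
// the equivalent of spark.job.interruptOnCancel is true.
class KillableTask implements Runnable {
    private volatile boolean killed = false;
    private volatile Thread taskThread;

    public void run() {
        taskThread = Thread.currentThread();
        // ... a real task body would poll `killed` (and interrupt status) ...
    }

    void kill(boolean interruptThread) {
        killed = true;                        // always visible to the task loop
        Thread t = taskThread;
        if (interruptThread && t != null) {
            t.interrupt();                    // opt-in forcible path only
        }
    }

    boolean isKilled() { return killed; }
}
```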



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
