[ https://issues.apache.org/jira/browse/SPARK-17064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Reynold Xin closed SPARK-17064.
-------------------------------
    Resolution: Won't Fix

Marking this as won't fix for now. We can still continue to discuss it here.

> Reconsider spark.job.interruptOnCancel
> --------------------------------------
>
>                 Key: SPARK-17064
>                 URL: https://issues.apache.org/jira/browse/SPARK-17064
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler, Spark Core
>            Reporter: Mark Hamstra
>
> There is a frequent need or desire in Spark to cancel already running Tasks.
> This has been recognized for a very long time (see, e.g., the ancient TODO
> comment in the DAGScheduler: "Cancel running tasks in the stage"), but we've
> never had more than an incomplete solution. Killing running Tasks at the
> Executor level has been implemented by interrupting the threads running the
> Tasks (taskThread.interrupt in o.a.s.scheduler.Task#kill.) Since
> https://github.com/apache/spark/commit/432201c7ee9e1ea1d70a6418cbad1c5ad2653ed3
> addressing https://issues.apache.org/jira/browse/SPARK-1582, interrupting
> Task threads in this way has only been possible if interruptThread is true,
> and that typically comes from the setting of the interruptOnCancel property
> in the JobGroup, which in turn typically comes from the setting of
> spark.job.interruptOnCancel. Because of concerns over
> https://issues.apache.org/jira/browse/HDFS-1208 and the possibility of nodes
> being marked dead when a Task thread is interrupted, the default value of the
> boolean has been "false" -- i.e. by default we do not interrupt Tasks already
> running on an Executor even when the Task has been canceled in the
> DAGScheduler, or the Stage has been aborted, or the Job has been killed, etc.
>
> There are several issues resulting from this current state of affairs, and
> they each probably need to spawn their own JIRA issue and PR once we decide
> on an overall strategy here.
> Among those issues:
>
> * Is HDFS-1208 still an issue, or has it been resolved adequately in the HDFS
> versions that Spark now supports so that we can set the default value of
> spark.job.interruptOnCancel to "true" or eliminate this boolean flag entirely?
> * Even if interrupting Task threads is no longer an issue for HDFS, is it
> still enough of an issue for non-HDFS usage (e.g. Cassandra) so that we still
> need protection similar to what the current default value of
> spark.job.interruptOnCancel provides?
> * If interrupting Task threads isn't safe enough, what should we do instead?
> * Once we have a safe mechanism to stop and clean up after already executing
> Tasks, there is still the question of whether we _should_ end executing
> Tasks. While that is likely a good thing to do in cases where individual
> Tasks are lightweight in terms of resource usage, at least in some cases not
> all running Tasks should be ended: https://github.com/apache/spark/pull/12436
> That means that we probably need to continue to make allowing Task
> interruption configurable at the Job or JobGroup level (and we need better
> documentation explaining how and when to allow interruption or not.)
> * There is one place in the current code
> (TaskSetManager#handleSuccessfulTask) that hard codes interruptThread to
> "true". This should be fixed, and similar misuses of killTask be denied in
> pull requests until this issue is adequately resolved.
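
To make the trade-off in the quoted report concrete, here is a minimal sketch in plain Java (no Spark dependency) of a kill path that always sets a cooperative flag and interrupts the task thread only when interruptThread is true. KillableTask, runAndKill, blockMillis and the outcome strings are hypothetical names invented for this illustration. This is not Spark's actual Task class, only an approximation of the behaviour the report describes.

```java
import java.util.concurrent.CountDownLatch;

/** Hypothetical stand-in for a running task; NOT Spark's actual Task class. */
final class KillableTask implements Runnable {
    private volatile boolean killed = false;   // cooperative flag, set on every kill
    private volatile String outcome = "running";
    private volatile Thread taskThread;        // thread currently executing run()
    private final CountDownLatch started = new CountDownLatch(1);
    private final long blockMillis;

    KillableTask(long blockMillis) {
        this.blockMillis = blockMillis;
    }

    @Override
    public void run() {
        taskThread = Thread.currentThread();
        started.countDown();
        try {
            // Simulates a blocking call (e.g. I/O) that only an interrupt can cut short.
            Thread.sleep(blockMillis);
            // Cooperative check: reached only once the blocking call returns by itself.
            outcome = killed ? "stopped-at-checkpoint" : "completed";
        } catch (InterruptedException e) {
            outcome = "interrupted";
        }
    }

    /** Always sets the flag; interrupts the task thread only if asked to. */
    void kill(boolean interruptThread) {
        killed = true;
        Thread t = taskThread;
        if (interruptThread && t != null) {
            t.interrupt();
        }
    }

    /** Starts a task, kills it once it is running, and reports how it ended. */
    static String runAndKill(boolean interruptThread, long blockMillis) {
        KillableTask task = new KillableTask(blockMillis);
        Thread thread = new Thread(task, "task-thread");
        thread.start();
        try {
            task.started.await();        // wait until the task thread is registered
            task.kill(interruptThread);
            thread.join();               // wait for the task to finish either way
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "caller-interrupted";
        }
        return task.outcome;
    }
}
```

In this sketch, runAndKill(true, 5000) ends almost immediately with "interrupted", while runAndKill(false, 200) has to wait out the whole blocking call before the flag is noticed. That is the behaviour the default of "false" buys: killed work keeps its thread until it next checks the flag, in exchange for never interrupting a thread that may be in the middle of client I/O.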