Josh Rosen created SPARK-18761:
----------------------------------

             Summary: Uncancellable / unkillable tasks may starve jobs of 
resources
                 Key: SPARK-18761
                 URL: https://issues.apache.org/jira/browse/SPARK-18761
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
            Reporter: Josh Rosen
            Assignee: Josh Rosen


Spark's current task cancellation / task killing mechanism is "best effort" in 
the sense that some tasks may not be interruptible and may not respond to their 
"killed" flags being set. If a significant fraction of a cluster's task slots 
are occupied by tasks that have been marked as killed but are still running, 
new jobs and tasks can be starved of resources because those zombie tasks 
continue to hold their slots.

I propose to address this problem by introducing a "task reaper" mechanism in 
executors to monitor tasks after they are marked for killing in order to 
periodically re-attempt the task kill, capture and log stack traces / warnings 
if tasks do not exit in a timely manner, and, optionally, kill the entire 
executor JVM if cancelled tasks cannot be killed within some timeout.
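
To make the proposed behaviour concrete, here is a minimal sketch of such a 
reaper loop in plain Java. It is illustrative only and is not Spark's 
implementation: the names TaskReaperSketch, reap, Outcome, pollMs and timeoutMs 
are all hypothetical, and the sketch assumes that a task runs on a single 
thread and that "killing" it means interrupting that thread.

```java
import java.util.concurrent.TimeUnit;

// Illustrative sketch only: every class, method, and parameter name here is
// hypothetical and does not come from Spark.
final class TaskReaperSketch {

    /** Result of one reaping attempt. */
    enum Outcome { TASK_EXITED, TIMED_OUT }

    /**
     * Re-interrupts taskThread every pollMs milliseconds until it exits or
     * until timeoutMs milliseconds have elapsed. While the thread is still
     * alive, its current stack trace is logged as a warning.
     */
    static Outcome reap(Thread taskThread, long pollMs, long timeoutMs)
            throws InterruptedException {
        if (pollMs <= 0 || timeoutMs <= 0) {
            // join(0) would wait forever, so reject non-positive intervals.
            throw new IllegalArgumentException("pollMs and timeoutMs must be positive");
        }
        final long start = System.nanoTime();
        final long timeoutNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (taskThread.isAlive()) {
            if (System.nanoTime() - start >= timeoutNanos) {
                // The caller decides what to do next, e.g. exit the JVM.
                return Outcome.TIMED_OUT;
            }
            taskThread.interrupt();      // re-attempt the kill
            taskThread.join(pollMs);     // wait at most one polling interval
            if (taskThread.isAlive()) {
                StringBuilder msg = new StringBuilder(
                        "Killed task " + taskThread.getName() + " is still running:");
                for (StackTraceElement frame : taskThread.getStackTrace()) {
                    msg.append(System.lineSeparator()).append("    at ").append(frame);
                }
                System.err.println(msg);
            }
        }
        return Outcome.TASK_EXITED;
    }

    public static void main(String[] args) throws InterruptedException {
        // A cooperative task: it stops as soon as it is interrupted.
        Thread task = new Thread(() -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                // Interrupted: fall through and let the thread end.
            }
        }, "cooperative-task");
        task.start();
        System.out.println(reap(task, 100, 5_000));
    }
}
```

In this sketch the reaper only reports TIMED_OUT; whether to then kill the 
whole JVM is left to the caller, which matches the "optionally" in the proposal 
above.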



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
