Josh Rosen created SPARK-3289:
---------------------------------

             Summary: Prevent complete job failures due to rescheduling of 
failing tasks on buggy machines
                 Key: SPARK-3289
                 URL: https://issues.apache.org/jira/browse/SPARK-3289
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
            Reporter: Josh Rosen


Some users have reported issues where a task fails due to an environment / 
configuration issue on some machine, then the task is reattempted _on that same 
buggy machine_ until the entire job fails because that single task has 
failed too many times.

To guard against this, maybe we should add some randomization in how we 
reschedule failed tasks.
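
As a rough illustration of the idea, here is a minimal sketch in Python. It is not Spark's scheduler code: the class `RetryPlacement` and its methods are hypothetical names invented for this example. It also assumes slightly more than plain randomization: it remembers which hosts a task has already failed on and picks a random host from the remaining ones, falling back to any host only when the task has failed everywhere.

```python
import random
from collections import defaultdict


class RetryPlacement:
    """Hypothetical helper: picks a host for a task retry.

    Not part of Spark; a sketch of randomized rescheduling that also
    avoids hosts where the same task has already failed (an assumption
    beyond the plain randomization proposed in the issue).
    """

    def __init__(self, hosts, rng=None):
        self.hosts = list(hosts)
        self.rng = rng or random.Random()
        # task id -> set of hosts on which that task has failed
        self.failed_hosts = defaultdict(set)

    def record_failure(self, task_id, host):
        """Remember that task_id failed on host."""
        self.failed_hosts[task_id].add(host)

    def pick_host(self, task_id):
        """Return a random host, preferring ones with no prior failure."""
        candidates = [
            h for h in self.hosts if h not in self.failed_hosts[task_id]
        ]
        # If the task has failed on every host, fall back to all hosts
        # so the retry is still scheduled somewhere.
        if not candidates:
            candidates = self.hosts
        return self.rng.choice(candidates)
```

With this placement, a task that failed on one misconfigured machine would be retried elsewhere, so its retry budget would not be spent entirely on that machine.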



