Josh Rosen created SPARK-3289:
---------------------------------

Summary: Prevent complete job failures due to rescheduling of failing tasks on buggy machines
Key: SPARK-3289
URL: https://issues.apache.org/jira/browse/SPARK-3289
Project: Spark
Issue Type: Bug
Components: Spark Core
Reporter: Josh Rosen

Some users have reported issues where a task fails due to an environment / configuration issue on some machine, then the task is reattempted _on that same buggy machine_ until the entire job fails because that single task has failed too many times.

To guard against this, maybe we should add some randomization in how we reschedule failed tasks.
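
To make the proposal concrete, here is a minimal sketch of one way randomized rescheduling could look. It is not Spark's scheduler code or API: the function `pick_retry_host` and its parameters `hosts`, `failed_hosts`, and `rng` are hypothetical names chosen for illustration. It also goes one step beyond plain randomization by assuming the scheduler remembers which hosts a task has already failed on and skips them when alternatives exist.

```python
import random


def pick_retry_host(hosts, failed_hosts, rng=random):
    """Choose a host for a task retry (illustrative sketch, not Spark's API).

    hosts:        all hosts currently available for scheduling (hypothetical input)
    failed_hosts: hosts on which this task has already failed (hypothetical input)
    rng:          any object with a choice() method, e.g. random.Random(seed)
    """
    if not hosts:
        raise ValueError("no hosts available for scheduling")

    # Prefer hosts where this task has not failed yet.
    candidates = [h for h in hosts if h not in failed_hosts]

    # If the task has failed on every host, fall back to the full list
    # so the retry is still schedulable.
    if not candidates:
        candidates = list(hosts)

    # Pick uniformly at random so retries do not keep landing on one machine.
    return rng.choice(candidates)


if __name__ == "__main__":
    hosts = ["host-a", "host-b", "host-c"]
    # The task already failed on host-a, so the retry goes to host-b or host-c.
    print(pick_retry_host(hosts, failed_hosts={"host-a"}))
```

Even without the `failed_hosts` bookkeeping, a uniformly random choice over `hosts` would make it unlikely that every retry of a task lands on the same single bad machine, which is the failure mode described above.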