speculative task failure can kill jobs
--------------------------------------
Key: HADOOP-979
URL: https://issues.apache.org/jira/browse/HADOOP-979
Project: Hadoop
Issue Type: Bug
Components: mapred
Affects Versions: 0.11.0
Reporter: Owen O'Malley
Fix For: 0.12.0
We had a case where the random writer example job was killed by speculative
execution. The sequence of events was:
task_0001_m_000123_0 -> starts
task_0001_m_000123_1 -> starts and fails because attempt 0 is creating the file
task_0001_m_000123_2 -> starts and fails because attempt 0 is creating the file
task_0001_m_000123_3 -> starts and fails because attempt 0 is creating the file
task_0001_m_000123_4 -> starts and fails because attempt 0 is creating the file
job_0001 was then killed because map_000123 had failed 4 times, even though
attempt 0 was still running. Based on this, I think we should change the
scheduling so that:
1. Each task is allowed at most 1 speculative attempt.
2. A TIP (task in progress) doesn't kill the job until it has 4 failures AND
   its last running attempt has also failed.
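
As a rough illustration of the two rules above, here is a minimal sketch in
Java. The class and method names are hypothetical and are not Hadoop's actual
scheduler code. It also assumes that "1 speculative attempt" means one over the
lifetime of the TIP, which is only one possible reading of rule 1.

```java
// Hypothetical sketch of the proposed rules; not Hadoop's real classes or API.
final class TipPolicy {
    // Failure limit taken from the report: a TIP with 4 failures kills the job.
    static final int MAX_FAILURES = 4;
    // Rule 1 (assumed to mean: per TIP, over its whole lifetime).
    static final int MAX_SPECULATIVE_ATTEMPTS = 1;

    // Rule 1: allow a new speculative attempt only if the limit is not used up.
    static boolean maySpeculate(int speculativeAttemptsLaunched) {
        return speculativeAttemptsLaunched < MAX_SPECULATIVE_ATTEMPTS;
    }

    // Behavior described in the report: 4 failures kill the job, even if
    // another attempt of the same TIP is still running.
    static boolean killsJobAsReported(int failedAttempts) {
        return failedAttempts >= MAX_FAILURES;
    }

    // Rule 2: 4 failures kill the job only once no attempt is left running.
    static boolean killsJobProposed(int failedAttempts, int runningAttempts) {
        return failedAttempts >= MAX_FAILURES && runningAttempts == 0;
    }
}
```

In the scenario above, attempt 0 is still running after the fourth failure, so
killsJobProposed(4, 1) is false and the job survives, while
killsJobAsReported(4) is true. Under rule 1 as read here, attempts 2 through 4
would not have been launched at all.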
Thoughts?