[ https://issues.apache.org/jira/browse/HADOOP-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12470393 ]
Arun C Murthy commented on HADOOP-979:
--------------------------------------
Another point to ponder is that we might be too aggressive in launching
speculative attempts. One way around this might be to use a larger backoff for
each successive attempt, ensuring we don't launch too many of them too quickly.
To illustrate:
task_1_0 (of tip_1) is launched
task_1_1 is launched when tip_1's progress falls behind other tips by x%
task_1_2 is launched when tip_1's progress falls behind other tips by (x + x/8)%
task_1_3 is launched when tip_1's progress falls behind other tips by (x + x/4)%
task_1_4 is launched when tip_1's progress falls behind other tips by (x + x/2)%
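
A rough sketch of that schedule, to make the arithmetic concrete. The function
names are made up for illustration and are not existing JobTracker code, and
the rule that the extra margin doubles each time (x/8, x/4, x/2) is only my
reading of the pattern above:

```python
def speculation_thresholds(x, max_attempts=5):
    """Lag (in percent) a TIP must show before each attempt may be launched.

    Index 0 is the original attempt, which needs no lag; index 1 is the
    first speculative attempt, and so on.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    thresholds = [0.0]                 # task_1_0 is launched unconditionally
    if max_attempts > 1:
        thresholds.append(float(x))    # task_1_1: behind by x%
    extra = x / 8.0                    # assumed: margin starts at x/8 ...
    for _ in range(2, max_attempts):
        thresholds.append(x + extra)
        extra *= 2                     # ... and doubles for each later attempt
    return thresholds


def may_launch_next_attempt(attempts_launched, lag_percent, x, max_attempts=5):
    """True if the TIP lags far enough behind to justify one more attempt."""
    if attempts_launched >= max_attempts:
        return False
    thresholds = speculation_thresholds(x, max_attempts)
    return lag_percent >= thresholds[attempts_launched]
```

With x = 16, for example, the second speculative attempt would need the TIP to
be 18% behind and the fourth would need 24%.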
Thoughts?
Of course, this could be a future enhancement to Owen's proposal; for now it
makes sense to get speculative execution working reliably.
> speculative task failure can kill jobs
> --------------------------------------
>
> Key: HADOOP-979
> URL: https://issues.apache.org/jira/browse/HADOOP-979
> Project: Hadoop
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.11.0
> Reporter: Owen O'Malley
> Fix For: 0.12.0
>
>
> We had a case where the random writer example was killed by speculative
> execution. It happened like this:
> task_0001_m_000123_0 -> starts
> task_0001_m_000123_1 -> starts and fails because attempt 0 is creating the file
> task_0001_m_000123_2 -> starts and fails because attempt 0 is creating the file
> task_0001_m_000123_3 -> starts and fails because attempt 0 is creating the file
> task_0001_m_000123_4 -> starts and fails because attempt 0 is creating the file
> job_0001 is killed because map_000123 failed 4 times.
>
> From this experience, I think we should change the scheduling so that:
> 1. Tasks are only allowed 1 speculative attempt.
> 2. TIPs don't kill jobs until they have 4 failures AND the last task under
> that tip fails.
> Thoughts?
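
For reference, the two rules in the description could be modelled roughly as
below. This is a toy sketch, not the actual TaskInProgress code: the class and
method names are hypothetical, and I am assuming that rule 1 means "at most one
speculative attempt running alongside the original" and that "the last task
under that tip fails" means "no attempt of the TIP is still running".

```python
MAX_TIP_FAILURES = 4  # failure limit quoted in the issue description


class TipState:
    """Toy bookkeeping for one TIP (hypothetical; not Hadoop's class)."""

    def __init__(self):
        self.running_attempts = 0
        self.failures = 0

    def launch_attempt(self):
        self.running_attempts += 1

    def can_launch_speculative(self):
        # Rule 1 (as assumed above): only speculate when exactly one
        # attempt is running, so at most one speculative attempt coexists.
        return self.running_attempts == 1

    def record_failure(self):
        if self.running_attempts == 0:
            raise ValueError("no running attempt to fail")
        self.running_attempts -= 1
        self.failures += 1

    def should_kill_job(self):
        # Rule 2: enough failures AND nothing left running under this TIP.
        return self.failures >= MAX_TIP_FAILURES and self.running_attempts == 0
```

Replaying the random writer case against this model, four failed speculative
attempts leave failures at 4 while attempt 0 is still running, so the job is
not killed.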