[ http://issues.apache.org/jira/browse/HADOOP-142?page=all ]
Doug Cutting resolved HADOOP-142:
---------------------------------
Resolution: Fixed
I just committed this. Thanks, Owen!
> failed tasks should be rescheduled on different hosts after other jobs
> ----------------------------------------------------------------------
>
> Key: HADOOP-142
> URL: http://issues.apache.org/jira/browse/HADOOP-142
> Project: Hadoop
> Type: Improvement
> Components: mapred
> Versions: 0.1.1
> Reporter: Owen O'Malley
> Assignee: Owen O'Malley
> Fix For: 0.2
> Attachments: no-repeat-failures.patch
>
> Currently when tasks fail, they are usually rerun immediately on the same
> host. This causes problems in a couple of ways.
> 1. The task is more likely to fail again on the same host.
> 2. If there is cleanup code (such as clearing pendingCreates), it does not
> always run immediately, leading to cascading failures.
> For a first pass, I propose that when a task fails, we start the scan for new
> tasks to launch at the following task of the same type (within that job). So
> if maps[99] fails, when we are looking to assign new map tasks from this job,
> we scan maps[100]...maps[N], then maps[0]...maps[99].
> A more involved change would avoid running tasks on nodes where they have
> failed before. This is a little tricky, because you don't want to prevent
> re-execution of tasks on one-node clusters, and the job tracker needs to
> schedule one task tracker at a time.
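The circular scan proposed above can be sketched as follows. This is a minimal illustrative sketch, not Hadoop's actual JobTracker code; the function and variable names are invented for the example:

```python
def pick_next_task(status, start_after):
    """Return the index of the next pending task, scanning from the
    task just past `start_after` and wrapping around to the start.

    `status` is a list mapping task index -> "pending", "running",
    or "done".  (Illustrative only; not the real scheduling logic.)
    """
    n = len(status)
    for offset in range(1, n + 1):
        i = (start_after + offset) % n
        if status[i] == "pending":
            return i
    return None  # nothing left to launch

# Example: maps[99] failed in a 200-map job, so the scan starts at
# maps[100].  Another pending task at maps[150] is picked before the
# failed maps[99] comes around again.
status = ["done"] * 200
status[99] = "pending"   # the failed task, back in the pending state
status[150] = "pending"  # a different pending task further along
print(pick_next_task(status, 99))  # prints 150
```

Because the scan wraps, the failed task is still retried when nothing else is pending, which keeps the scheme safe on single-node clusters.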
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira