[ https://issues.apache.org/jira/browse/HADOOP-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492243 ]
Hadoop QA commented on HADOOP-1278:
-----------------------------------
+1
http://issues.apache.org/jira/secure/attachment/12356397/HADOOP-1278_20070427_1.patch
applied and successfully tested against trunk revision r533013.
Test results:
http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/89/testReport/
Console output:
http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/89/console
> Fix the per-job tasktracker 'blacklist'
> ---------------------------------------
>
> Key: HADOOP-1278
> URL: https://issues.apache.org/jira/browse/HADOOP-1278
> Project: Hadoop
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.12.3
> Reporter: Arun C Murthy
> Assigned To: Arun C Murthy
> Fix For: 0.13.0
>
> Attachments: HADOOP-1278_20070427_1.patch
>
>
> Today, whenever a tracker is 'lost', all the tasks which ever ran on it are
> counted as failures and the tracker is added to the blacklist, which
> ensures that the particular TT is *never* considered for allocating new tasks
> unless *all* tasktrackers are on the list. This results in an ugly situation
> where a majority of nodes in the cluster are on the blacklist and hence idle,
> while the other TTs are maxed out.
> The proposal is two-fold:
> a) Don't count *all* tasks which ever ran on the TT; count the loss as a
> 'single' task failure, which means that each 'lost' tracker uses up
> 20% of the '5 failures == blacklisted' quota.
> b) Stop adding nodes to the blacklist once a certain percentage of the
> cluster, say 25%, is already on the blacklist. Adding more than that would
> just delay the inevitable, i.e. there is something horrendously wrong with the
> cluster, and we might as well fail the job early and noisily.
> Thoughts?
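
For illustration only, the two-part proposal quoted above could be sketched roughly as follows. This is not the code in the attached patch and not Hadoop's actual JobInProgress logic; the class name JobBlacklistSketch and all of its methods are hypothetical, and the constants simply reuse the "5 failures" quota and the "say 25%" cap from the description.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical per-job tasktracker blacklist; a sketch, not Hadoop code. */
final class JobBlacklistSketch {
    // The '5 failures == blacklisted' quota mentioned in the issue.
    static final int FAILURES_TO_BLACKLIST = 5;
    // The "say 25%" cap from the issue (an example value, not a Hadoop default).
    static final double MAX_BLACKLIST_FRACTION = 0.25;

    private final int clusterSize;
    private final Map<String, Integer> failures = new HashMap<>();
    private final Set<String> blacklist = new HashSet<>();

    JobBlacklistSketch(int clusterSize) {
        this.clusterSize = clusterSize;
    }

    /** An ordinary task failure reported by a live tracker. */
    void taskFailed(String tracker) {
        addFailure(tracker);
    }

    /** (a) A lost tracker counts as ONE failure, however many tasks ran on it. */
    void trackerLost(String tracker) {
        addFailure(tracker);
    }

    private void addFailure(String tracker) {
        int count = failures.merge(tracker, 1, Integer::sum);
        // (b) Stop growing the blacklist once the cap is reached.
        if (count >= FAILURES_TO_BLACKLIST && !blacklistFull()) {
            blacklist.add(tracker);
        }
    }

    /** True once the cap fraction of the cluster is already blacklisted. */
    boolean blacklistFull() {
        return blacklist.size() >= MAX_BLACKLIST_FRACTION * clusterSize;
    }

    boolean isBlacklisted(String tracker) {
        return blacklist.contains(tracker);
    }

    public static void main(String[] args) {
        JobBlacklistSketch job = new JobBlacklistSketch(8);
        job.trackerLost("tt1"); // uses up 1/5 = 20% of the quota
        System.out.println("tt1 blacklisted: " + job.isBlacklisted("tt1"));
        System.out.println("blacklist full: " + job.blacklistFull());
    }
}
```

In this sketch a caller could treat blacklistFull() as the signal to fail the job early, in line with point b); whether the job should fail at that point or merely stop blacklisting is left open here, as it is in the description.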