[ https://issues.apache.org/jira/browse/HADOOP-2676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12625512#action_12625512 ]

Arun C Murthy commented on HADOOP-2676:
---------------------------------------

I'm a little leery about putting this intelligence in the TaskTracker - 
particularly since we don't have a good story on how the blacklist would 
survive TaskTracker restarts. Also, we probably need to be careful about 
_which_ tasks failed: for example, one job's tasks might fail because they 
need excessive amounts of temporary disk-space, while every other job might 
be just fine... 

I propose we instead keep this centrally, with the JobTracker. We can keep 
the per-job blacklist as it is today; in addition, the JobTracker can track 
the per-job lists and add wonky TaskTrackers to a persistent global list once 
a TaskTracker has been blacklisted by more than two or three jobs. This also 
has the advantage of maintaining state centrally, which is easier to manage, 
persist etc. We would also need to add 'admin' commands to let a human 
operator list, add and remove TaskTrackers from the global blacklist 
(presumably after the faulty machine has been fixed).
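
To make the idea concrete, here is a rough sketch of the JobTracker-side 
bookkeeping I have in mind. All the names below are made up for illustration 
(this is not an existing Hadoop class), and the threshold would presumably be 
a config knob rather than a constant:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical sketch of a JobTracker-side global blacklist: a tracker
 * is promoted to the persistent global list once enough distinct jobs
 * have blacklisted it via the existing per-job mechanism.
 */
public class GlobalBlacklist {

  // Number of distinct jobs that must blacklist a TaskTracker before
  // it lands on the global list; would be a configuration property.
  private static final int JOB_BLACKLIST_THRESHOLD = 3;

  // TaskTracker name -> ids of the jobs that blacklisted it.
  private final Map<String, Set<String>> jobBlacklists =
      new HashMap<String, Set<String>>();

  // TaskTrackers currently on the persistent global blacklist.
  private final Set<String> globalBlacklist = new HashSet<String>();

  /** Called whenever a job blacklists a tracker (the existing per-job path). */
  public synchronized void recordJobBlacklisting(String trackerName,
                                                 String jobId) {
    Set<String> jobs = jobBlacklists.get(trackerName);
    if (jobs == null) {
      jobs = new HashSet<String>();
      jobBlacklists.put(trackerName, jobs);
    }
    jobs.add(jobId);
    if (jobs.size() >= JOB_BLACKLIST_THRESHOLD) {
      globalBlacklist.add(trackerName);
      // Persisting the list (so it survives JobTracker restarts)
      // would happen here.
    }
  }

  /** Consulted before scheduling tasks on a tracker. */
  public synchronized boolean isGloballyBlacklisted(String trackerName) {
    return globalBlacklist.contains(trackerName);
  }

  /** Admin command: clear a tracker after the machine has been fixed. */
  public synchronized void unBlacklist(String trackerName) {
    globalBlacklist.remove(trackerName);
    jobBlacklists.remove(trackerName);
  }
}

Counting distinct jobs (rather than raw task failures) should keep a single 
pathological job from condemning a healthy node, which addresses the 
disk-space case above.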

Thoughts?

> Maintaining cluster information across multiple job submissions
> ---------------------------------------------------------------
>
>                 Key: HADOOP-2676
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2676
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.15.2
>            Reporter: Lohit Vijayarenu
>            Assignee: dhruba borthakur
>
> Could we have a way to maintain cluster state across multiple job submissions?
> Consider a scenario where we run multiple jobs in iteration on a cluster, back 
> to back. The nature of the job is the same, but the input/output might differ. 
> Now, if a node is blacklisted in one iteration of the job run, it would be 
> useful to maintain this information and blacklist the node for the next 
> iteration of the job as well. 
> Another situation we saw is that if each iteration has fewer failures than 
> mapred.map.max.attempts, some nodes are never marked for blacklisting. But if 
> we consider two or three iterations together, these nodes fail all jobs and 
> should be taken out of the cluster. This hampers the overall performance 
> of the job.
> Could we have config variables, something which matches a job type (provided 
> by the user), and maintain the cluster status for that job type alone? 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.