[
https://issues.apache.org/jira/browse/HADOOP-2676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12624173#action_12624173
]
dhruba borthakur commented on HADOOP-2676:
------------------------------------------
We have seen a similar instance of this one. There were 5 bad nodes that
filled up the userlog directories of their tasks, which caused every task
scheduled on those tasktrackers to fail. That is OK by itself, but because
each task failed immediately, the tasktracker freed up and pulled in more and
more tasks, all of which failed. There were a few small jobs (each with about
4 tasks); all of their tasks got pulled onto the same set of bad tasktrackers
and the entire jobs failed. This kept occurring again and again until the bad
nodes were fixed.
One option would be to add some intelligence to the JT to blacklist
tasktrackers across jobs: if a particular tasktracker consistently fails
tasks, the JT could allocate fewer and fewer tasks to it. Another option
would be to blacklist a tasktracker for a fixed period of time once a fixed
number of tasks from a threshold number of jobs have failed on it.
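A rough sketch of that second option, assuming the JT keeps a small in-memory
structure per tasktracker (the class name, thresholds, and method names below
are made up for illustration and are not actual JobTracker code):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CrossJobBlacklist {
  // Illustrative thresholds only; real values would come from configuration.
  private static final int JOB_THRESHOLD = 3;                        // distinct jobs with failures
  private static final long BLACKLIST_WINDOW_MS = 60L * 60L * 1000L; // 1 hour

  // tracker name -> set of job IDs that saw task failures on that tracker
  private final Map<String, Set<String>> failedJobsPerTracker = new HashMap<String, Set<String>>();
  // tracker name -> time (ms) at which it was blacklisted
  private final Map<String, Long> blacklistedAt = new HashMap<String, Long>();

  public synchronized void recordTaskFailure(String tracker, String jobId) {
    Set<String> jobs = failedJobsPerTracker.get(tracker);
    if (jobs == null) {
      jobs = new HashSet<String>();
      failedJobsPerTracker.put(tracker, jobs);
    }
    jobs.add(jobId);
    // Once failures span enough distinct jobs, blacklist the tracker.
    if (jobs.size() >= JOB_THRESHOLD && !blacklistedAt.containsKey(tracker)) {
      blacklistedAt.put(tracker, System.currentTimeMillis());
    }
  }

  // Called before assigning a task: skip trackers still inside the window.
  public synchronized boolean isBlacklisted(String tracker, long now) {
    Long since = blacklistedAt.get(tracker);
    if (since == null) {
      return false;
    }
    if (now - since > BLACKLIST_WINDOW_MS) {
      // Window expired: reset the counts and give the tracker another chance.
      blacklistedAt.remove(tracker);
      failedJobsPerTracker.remove(tracker);
      return false;
    }
    return true;
  }
}

The fixed window is what keeps a flaky-but-since-repaired node from being shut
out forever: once it expires, the tracker's counts are reset and it can take
tasks again.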
> Maintaining cluster information across multiple job submissions
> ---------------------------------------------------------------
>
> Key: HADOOP-2676
> URL: https://issues.apache.org/jira/browse/HADOOP-2676
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.15.2
> Reporter: Lohit Vijayarenu
>
> Could we have a way to maintain cluster state across multiple job
> submissions? Consider a scenario where we run multiple jobs on a cluster
> back to back, in iterations. The nature of each job is the same, but the
> input/output might differ.
> Now, if a node is blacklisted in one iteration of the job run, it would be
> useful to keep this information and blacklist the node for the next
> iteration as well.
> Another situation we saw: if a node has fewer than mapred.map.max.attempts
> failures in each individual iteration, it is never marked for blacklisting.
> But taken over two or three iterations, such nodes fail tasks in every job
> and should be taken out of the cluster. This hampers the overall performance
> of the jobs.
> Could we have config variables, something that matches a job type (provided
> by the user) and maintains the cluster status for that job type alone?
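As a small illustration of that config-variable idea, the driver of each
iteration could tag the job with a user-provided category that the JT keys its
per-type cluster state on. The property name "mapred.job.category" below is
purely hypothetical and does not exist in Hadoop; JobConf and JobClient are
the standard old-API entry points:

import java.io.IOException;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class IterativeJobDriver {
  public static void runIteration(int iteration) throws IOException {
    JobConf conf = new JobConf(IterativeJobDriver.class);
    conf.setJobName("aggregation-iteration-" + iteration);
    // Hypothetical key, not an existing Hadoop property: it tags every
    // iteration of the same logical job so the JT could maintain blacklist
    // and cluster status per category rather than per job submission.
    conf.set("mapred.job.category", "hourly-aggregation");
    // Input/output paths and mapper/reducer classes omitted from this sketch.
    JobClient.runJob(conf);
  }
}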