[
https://issues.apache.org/jira/browse/HADOOP-2676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12625560#action_12625560
]
Runping Qi commented on HADOOP-2676:
------------------------------------
I don't think blacklist-related info should persist through a TT restart.
I'd like the TT to be more adaptive than just being blacklisted or not.
The TT should adaptively adjust the number of tasks it can run concurrently,
based on its current load and its history.
Before any tasks fail, it may run tasks concurrently up to a configured limit
(the number of slots).
As more and more tasks fail, the limit is decremented.
As more tasks report success, the limit should be incremented.
Once the limit reaches zero, the TT is effectively in the blacklisted state.
You may be able to implement a more sophisticated adaptation policy, but the
basic idea will be similar.
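
A rough sketch of the basic policy, just to make the idea concrete. This is
not existing TaskTracker code: the class and method names (AdaptiveSlotLimit,
taskFailed, taskSucceeded) are made up for illustration, and the step size of
one per task is an assumption.

```java
// Hypothetical illustration only; none of these names exist in Hadoop.
class AdaptiveSlotLimit {
    private final int maxSlots; // configured limit (the number of slots)
    private int limit;          // current cap on concurrently running tasks

    AdaptiveSlotLimit(int maxSlots) {
        if (maxSlots < 0) {
            throw new IllegalArgumentException("maxSlots must be >= 0");
        }
        this.maxSlots = maxSlots;
        this.limit = maxSlots; // before any task fails, all slots are usable
    }

    // Each failed task lowers the cap by one, never below zero.
    synchronized void taskFailed() {
        if (limit > 0) {
            limit--;
        }
    }

    // Each successful task raises the cap by one, never above the configured limit.
    synchronized void taskSucceeded() {
        if (limit < maxSlots) {
            limit++;
        }
    }

    synchronized int getLimit() {
        return limit;
    }

    // A cap of zero means no new tasks can start: equivalent to being blacklisted.
    synchronized boolean isEffectivelyBlacklisted() {
        return limit == 0;
    }
}
```

The state lives only in memory, so a TT restart starts again from the
configured limit. Note that with this simple policy a TT at zero runs no
tasks and so never reports a success; a more sophisticated policy would need
some way to recover from that state.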
> Maintaining cluster information across multiple job submissions
> ---------------------------------------------------------------
>
> Key: HADOOP-2676
> URL: https://issues.apache.org/jira/browse/HADOOP-2676
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.15.2
> Reporter: Lohit Vijayarenu
> Assignee: dhruba borthakur
>
> Could we have a way to maintain cluster state across multiple job submissions?
> Consider a scenario where we run multiple jobs iteratively on a cluster, back
> to back. The nature of the jobs is the same, but the input/output might differ.
> Now, if a node is blacklisted in one iteration of a job run, it would be useful
> to maintain this information and blacklist this node for the next iteration of
> the job as well.
> Another situation we saw: if there are fewer than mapred.map.max.attempts
> failures in each iteration, a few nodes are never marked for blacklisting.
> But if we consider two or three iterations, these nodes fail in all jobs and
> should be taken out of the cluster. Leaving them in hampers the overall
> performance of the job.
> Could we have a config variable that matches a job type (provided
> by the user) and maintains the cluster status for that job type alone?