[ https://issues.apache.org/jira/browse/MAPREDUCE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13973318#comment-13973318 ]
Jason Lowe commented on MAPREDUCE-5844:
---------------------------------------

Moved this to MAPREDUCE since the decision to preempt reducers for mappers is ultimately an MR AM decision and not a YARN decision.

A headroom of zero should mean there is literally no more room in the queue, and I would expect the job to need to take action in those cases to make progress in light of fetch failures (e.g.: think of a scenario where all the other jobs taking up resources are long-running and won't release resources anytime soon). If you are seeing cases where reducers are shot and then immediately relaunched along with the failed maps, that implies either the headroom calculation is wrong or resources happened to be freed right at the time the new containers were requested. Note that there are a number of issues with headroom calculations; see YARN-1198 and related JIRAs.

Assuming those are fixed, there might be some usefulness to a grace period where we wait for other apps to free up resources in the queue, to avoid shooting reducers. A proper value for that probably depends upon how much work would be lost by the reducers in question, how long we can tolerate waiting to try to preserve that work, and how likely it is that another app will free up resources anytime soon. If we wait and still don't get our resources, then that's purely worse than a job that took decisive action as soon as a map retroactively failed and there was no more space left in the queue. Also, if the headroom is zero because a single job has hit user limits within the queue, then waiting serves no purpose -- it has to shoot a reducer in that case to make progress. In that latter case we'd need additional information in the allocate response from the scheduler to know that waiting for resources to be released from other applications in the queue isn't going to work.

It would be good to verify from the RM logs what is happening in your case. If the headroom calculation is wrong then we should fix that; otherwise, if resources are churning quickly, then a grace period before preempting reducers may make sense.

> Reducer Preemption is too aggressive
> ------------------------------------
>
>                 Key: MAPREDUCE-5844
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5844
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Maysam Yabandeh
>            Assignee: Maysam Yabandeh
>
> We observed cases where the reducer preemption makes the job finish much
> later, and the preemption does not seem to be necessary since after
> preemption both the preempted reducer and the mapper are assigned
> immediately--meaning that there was already enough space for the mapper.
> The logic for triggering preemption is in
> RMContainerAllocator::preemptReducesIfNeeded.
> The preemption is triggered if the following is true:
> {code}
> headroom + am * |m| + pr * |r| < mapResourceRequest
> {code}
> where am is the number of assigned mappers, |m| is the mapper size, pr is the
> number of reducers being preempted, and |r| is the reducer size.
> The original idea apparently was that if the headroom is not big enough for
> the new mapper requests, reducers should be preempted. This would work if the
> job were alone in the cluster. Once we have queues, the headroom calculation
> becomes more complicated and would require a separate headroom calculation
> per queue/job.
> So, as a result, the headroom variable is effectively given up on currently:
> *headroom is always set to 0*. What this implies is that preemption becomes
> very aggressive, not considering whether there is enough space for the
> mappers or not.
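To make the trigger and the grace-period idea concrete, here is a minimal, hypothetical Java sketch. It is not the actual RMContainerAllocator code; the class name, field names, and the grace-period value are all illustrative assumptions. The condition simply mirrors headroom + am * |m| + pr * |r| < mapResourceRequest from the description above, with an optional wait before a reducer is shot:

{code}
// Hypothetical, simplified model of the check in
// RMContainerAllocator::preemptReducesIfNeeded, plus the grace period
// discussed in the comment above. All names and values are illustrative.
public class ReducerPreemptionSketch {

  // Resource sizes in MB; in the real AM these come from scheduler/AM state.
  long headroomMb;            // reported headroom (currently always 0)
  int  assignedMaps;          // am
  long mapSizeMb;             // |m|
  int  preemptingReducers;    // pr
  long reduceSizeMb;          // |r|
  long mapResourceRequestMb;  // resources needed by a pending map

  long firstStarvedMapTimeMs = -1;  // when a map request first went unsatisfied
  long preemptionGraceMs = 30_000;  // hypothetical grace period before killing

  boolean shouldPreemptReducer(long nowMs) {
    long availableForMaps = headroomMb
        + (long) assignedMaps * mapSizeMb
        + (long) preemptingReducers * reduceSizeMb;

    if (availableForMaps >= mapResourceRequestMb) {
      firstStarvedMapTimeMs = -1;   // enough room; no need to shoot reducers
      return false;
    }

    // Current behaviour is to preempt immediately at this point. With the
    // headroom pinned at 0, that fires even when the queue has free space,
    // which is the over-aggressive preemption reported in this issue.
    // A grace period would instead wait a while for other apps to release
    // resources before killing a reducer:
    if (firstStarvedMapTimeMs < 0) {
      firstStarvedMapTimeMs = nowMs;
    }
    return nowMs - firstStarvedMapTimeMs >= preemptionGraceMs;
  }
}
{code}

As noted in the comment, any such grace period would still have to be bypassed when the headroom is zero because the job itself has hit user limits within the queue, since waiting for other applications to release resources cannot help in that case.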