[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529644#comment-14529644 ]
Craig Welch commented on YARN-1680:
-----------------------------------

bq. Please leave out the head-room concerns w.r.t node-labels. IIRC, we had tickets at YARN-796 tracking that. It is very likely a completely different solution, so.

I'm not sure that's so - there is already a process of calculating headroom for the labels associated with an application, and the above is an extension of that process to blacklisted nodes to handle the label cases. If we leave it out, the solution won't work for node-labels, even though it can be made to, so that would be a loss.

bq. When I said node-labels above, I meant partitions. Clearly the problem and the corresponding solution will likely be very similar for node-constraints (one type of node-labels). After all, blacklisting is a type of (anti) node-constraint.

It could be modeled that way, but then it would be qualitatively different from the solution for the non-label cases, which is not a good thing...

bq. There is no notion of a cluster-level blacklisting in YARN. We have notions of unhealthy/lost/decommissioned nodes in a cluster. This is what I am referring when I say:

bq. addition/removal at the cluster level

I'm not suggesting/referring to anything other than nodes entering/leaving the cluster.

bq. Coming to the app-level blacklisting, clearly, the solution proposed is better than dead-locks. But blindly reducing the resources corresponding to blacklisted nodes will result in under-utilization (sometimes massively) and over-conservative scheduling requests by apps.

So, that's the point of the recommended approach.
The idea is to detect when it is necessary to recalculate the impact of blacklisting on app headroom - which is when either the app's blacklist has changed or the node composition of the cluster has changed (each of which should be relatively infrequent, certainly in relation to headroom calculation) - and at that time to calculate the impact accurately, adding into the deduction only the resource value of blacklisted nodes which actually exist in the cluster. It isn't "blindly reducing resources", it's doing so accurately, and it should prevent both deadlocks and under-utilization.

bq. One way to resolve this is to get the apps (or optionally in the AMRMClient library) to deduct the resource unusable on blacklisted nodes

It could be moved into the AMs or the client library, but then they would have to do the same sort of thing, and the logic would either need to be duplicated among the AMs or would only be available to those which use the library (do they all?). It's worth considering whether it can be made to cover them all via the library, but I'm not sure this isn't something which should be handled as part of the headroom calculation in the RM, since the RM is meant to provide the headroom accurately and is otherwise aware of the blacklist. Which suggested to me that we already have the blacklist for the application in the RM, available to the scheduler (I'm not sure why that wasn't obvious to me before...), which does appear to be the case and which therefore drops the concerns about adding it - it's already there...

> availableResources sent to applicationMaster in heartbeat should exclude
> blacklistedNodes free memory.
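A minimal sketch of the approach described above, assuming a hypothetical helper (the class and method names here are illustrative, not actual YARN scheduler APIs): the deduction is recomputed only when the app's blacklist or the cluster's node set changes, and it counts only blacklisted nodes that actually exist in the cluster.

```java
import java.util.Map;
import java.util.Set;

// Illustrative sketch only - not a YARN API. Shows the proposed correction:
// subtract from the base headroom the capacity of blacklisted nodes that
// are actually present in the cluster, and nothing else.
public class BlacklistHeadroom {

    /**
     * Recomputed only when the app's blacklist or the cluster's node
     * composition changes (both relatively infrequent events), not on
     * every headroom calculation.
     *
     * @param baseHeadroomMb   headroom before the blacklist correction
     * @param nodeCapacitiesMb current cluster nodes and their capacities
     * @param appBlacklist     node names the app has blacklisted
     */
    static long adjustedHeadroom(long baseHeadroomMb,
                                 Map<String, Long> nodeCapacitiesMb,
                                 Set<String> appBlacklist) {
        long deduction = 0;
        for (String node : appBlacklist) {
            // Only count blacklisted nodes that still exist in the cluster;
            // stale blacklist entries contribute nothing to the deduction.
            Long capacity = nodeCapacitiesMb.get(node);
            if (capacity != null) {
                deduction += capacity;
            }
        }
        return Math.max(0, baseHeadroomMb - deduction);
    }
}
```

Using the numbers from the report below (4 x 8GB nodes, 29GB occupied, NM-4 blacklisted), the base headroom of 3GB would be corrected down to 0, which is what the MRAppMaster needs to see in order to preempt reducers.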
> ------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-1680
>                 URL: https://issues.apache.org/jira/browse/YARN-1680
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: capacityscheduler
>    Affects Versions: 2.2.0, 2.3.0
>         Environment: SuSE 11 SP2 + Hadoop-2.3
>            Reporter: Rohith
>            Assignee: Craig Welch
>         Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch
>
> There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. Cluster slow start is set to 1.
> A job is running; its reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) became unstable (3 map tasks were killed), so the MRAppMaster blacklisted the unstable NodeManager (NM-4). All reducer tasks are now running in the cluster.
> The MRAppMaster does not preempt the reducers because the headRoom used in the reducer-preemption calculation includes the blacklisted node's memory. This makes jobs hang forever (the ResourceManager does not assign any new containers on blacklisted nodes, but the availableResource it returns still counts that memory as cluster free memory).

-- This message was sent by Atlassian JIRA (v6.3.4#6332)