[ https://issues.apache.org/jira/browse/YARN-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209214#comment-14209214 ]
Craig Welch commented on YARN-2848:
-----------------------------------

bq. Thanks for your explanation, I think it is valid to have such a mechanism of course, I am just concerned about the cost.

It sounds like you're under the impression that this is somehow optional/elective - I don't believe it is. Until we implement something along these lines we have known defects ( [YARN-1680], [https://issues.apache.org/jira/browse/YARN-2496?focusedCommentId=14143993&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14143993], [https://issues.apache.org/jira/browse/YARN-796?focusedCommentId=14146321&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14146321] ). One way or another, some capability like this needs to be created, or we need to remove other functionality (headroom, user limits), or continue to have significant defects/shortcomings (which is problematic, and imho not really an option).

bq. The pull model you mentioned is isomorphic to the push model (send events to apps, to which we can also add filters to select which apps to send to). And wrt the pull model, we don't have a dedicated thread for the app to do that. More problematic, if apps cannot handle such events synchronously, we need to prepare an event queue for them.

Not at all - as I've mentioned a couple of times, one option is simply to attach an update indicator to resources which the app can compare against its own to determine whether any action needs to be taken, with the general case expected to be "none". That's where the efficiency of the approach comes in. Of course, the particulars of the implementation are what we need to work out here, but we do not necessarily have to have event queues, and we certainly don't need the apps to handle events synchronously. It's possible to take those approaches, but it is certainly not necessary.

bq. And I think the statement is not always true ...
bq. Since it is possible we change labels on a set of nodes (say 1k nodes), and many applications could run across the 1k nodes, some operation will scan nodes and build information from scratch; it is an O(n * m) operation in very extreme cases.

If all running applications were interested in a label which changed across all nodes in a cluster, some activity would be necessary for them to make adjustments. As a rule, though, this will be very infrequent in comparison to the frequency of allocation requests in the cluster, which is the strength of the approach. Depending on how exactly we model things, it may well not be necessary for all applications to process all nodes of the cluster individually. For example, if we limit nodes to a single label per node, then the per-label totals could be calculated at the cluster level. If not, tracking intersection values for label combinations (if limited) could eliminate the need.

Putting aside possible shortcuts for a moment, however, I suspect the straightforward approach of recalculating only when necessary, at the application level, will actually be fine. It's possible to posit pathological cases which would be problematic there, but it's possible to do that with many things. If the pathological case (a change to labels or nodes of interest to every application at every allocation heartbeat, or a change to the set of cluster nodes on every heartbeat...) is not likely and does not need to be supported (it isn't, and it doesn't...), then infrequent recalculations, only when necessary, should not be problematic. The original approach on [YARN-1680] would have performed that calculation with every allocation request - which we rightly took issue with - but doing so only when needed is considered to be a viable approach (the only realistic one I'm aware of...), which is why we're heading in that direction. The question is how to do that in detail.
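The update-indicator / recalculate-only-when-needed idea discussed above could be sketched roughly as follows. This is a hedged illustration in Java, not existing YARN code: `VersionedClusterResource`, `AppHeadroomTracker`, and the memory-only stand-in for a `Resource` are all hypothetical names. The cluster-side resource carries a monotonically increasing version stamp bumped on any relevant change; on each allocation heartbeat the application compares that stamp against the one it last saw and recomputes headroom only on a mismatch.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical cluster-side resource with an attached update indicator.
class VersionedClusterResource {
    private final AtomicLong version = new AtomicLong(0);
    private volatile long memoryMb; // simplified stand-in for a full Resource

    void update(long newMemoryMb) {
        memoryMb = newMemoryMb;
        version.incrementAndGet();  // bump indicator on every relevant change
    }

    long getVersion() { return version.get(); }
    long getMemoryMb() { return memoryMb; }
}

// Hypothetical app-side tracker: cheap version check per heartbeat,
// full recomputation only when the indicator has moved.
class AppHeadroomTracker {
    private long seenVersion = -1;
    private long cachedHeadroomMb;
    private int recomputeCount = 0; // for illustration only

    long headroom(VersionedClusterResource cluster, long usedMb) {
        long v = cluster.getVersion();
        if (v != seenVersion) {     // pull model: act only if something changed
            cachedHeadroomMb = cluster.getMemoryMb() - usedMb;
            seenVersion = v;
            recomputeCount++;
        }
        return cachedHeadroomMb;    // common case: return cached value as-is
    }

    int getRecomputeCount() { return recomputeCount; }
}
```

In the common case the per-heartbeat cost is a single `long` comparison, independent of cluster size, which is where the efficiency of the approach comes from; no event queue and no synchronous event handling are required.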
The point of this jira is to note that the blacklist problem and the node label problem, in relation to the resources available to an application, are strikingly similar in their needs (they're photo-negatives of one another, effectively...), and so it makes sense to combine them, as sharing is likely to yield both runtime and code efficiency.

> (FICA) Applications should maintain an application specific 'cluster'
> resource to calculate headroom and userlimit
> ------------------------------------------------------------------------------------------------------------------
>
> Key: YARN-2848
> URL: https://issues.apache.org/jira/browse/YARN-2848
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: capacityscheduler
> Reporter: Craig Welch
> Assignee: Craig Welch
>
> Likely solutions to [YARN-1680] (properly handling node and rack blacklisting
> with cluster level node additions and removals) will entail managing an
> application-level "slice" of the cluster resource available to the
> application, for use in accurately calculating the application headroom and
> user limit. There is an assumption that events which impact this resource
> will occur less frequently than the need to calculate headroom, userlimit,
> etc. (a valid assumption, given that the latter occurs per allocation
> heartbeat). Given that, the application should (with assistance from
> cluster-level code...) detect changes to the composition of the cluster
> (node addition, removal) and, when those have occurred, calculate an
> application-specific cluster resource by comparing cluster nodes to its own
> blacklist (both rack and individual node). I think it makes sense to include
> nodelabel considerations in this calculation, as it will be efficient to do
> both at the same time, and the single resource value reflecting both
> constraints could then be used for efficient, frequent headroom and
> userlimit calculations while remaining highly accurate.
> The application would need to be made aware of the nodelabel changes it is
> interested in (labels of interest to the application being applied to or
> removed from nodes). For this purpose, the application submission's
> nodelabel expression would be used to determine the nodelabel impact on the
> resource used to calculate userlimit and headroom. (Cases where the
> application elected to request resources not using the application-level
> label expression are out of scope for this - but for the common use case of
> an application which uses a particular expression throughout, userlimit and
> headroom would be accurate.) This could also provide an overall mechanism
> for handling application-specific resource constraints which might be added
> in the future.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
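The application-specific "slice" computation described in the issue summary above could be sketched along these lines. This is a hedged Java illustration under simplifying assumptions: `NodeInfo`, `AppClusterSlice`, and a single-label (or empty, meaning unlabeled-OK) label expression are all hypothetical, not YARN types, and memory stands in for a full `Resource`. The walk over cluster nodes skips blacklisted hosts and racks and nodes not matching the application's label expression, and it runs only when the update indicator shows the cluster composition or relevant labels have changed, not on every heartbeat.

```java
import java.util.List;
import java.util.Set;

// Illustrative stand-in for a cluster node; not an actual YARN type.
class NodeInfo {
    final String host;
    final String rack;
    final Set<String> labels;
    final long memoryMb;

    NodeInfo(String host, String rack, Set<String> labels, long memoryMb) {
        this.host = host;
        this.rack = rack;
        this.labels = labels;
        this.memoryMb = memoryMb;
    }
}

class AppClusterSlice {
    // Recomputed only when cluster composition or labels of interest change;
    // both blacklist and nodelabel constraints are applied in a single pass.
    static long computeAvailableMb(List<NodeInfo> clusterNodes,
                                   Set<String> blacklistedHosts,
                                   Set<String> blacklistedRacks,
                                   String appLabelExpression) {
        long total = 0;
        for (NodeInfo node : clusterNodes) {
            if (blacklistedHosts.contains(node.host)) continue;  // node blacklist
            if (blacklistedRacks.contains(node.rack)) continue;  // rack blacklist
            // Simplified label matching: one label, or "" meaning no label needed.
            if (!appLabelExpression.isEmpty()
                    && !node.labels.contains(appLabelExpression)) continue;
            total += node.memoryMb;
        }
        return total;
    }
}
```

The resulting value is the single resource figure mentioned in the description: cheap to consult on every headroom/userlimit calculation, with the O(n) scan amortized over the (infrequent) cluster-change events.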