[jira] [Commented] (YARN-2848) (FICA) Applications should maintain an application specific 'cluster' resource to calculate headroom and userlimit
[ https://issues.apache.org/jira/browse/YARN-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14523613#comment-14523613 ] Wangda Tan commented on YARN-2848:
--
[~cwelch], this problem no longer exists after you added {{CapacityHeadroomProvider}}, right? My understanding is that the application-specific resources needed to calculate headroom and userlimit can be added to {{CapacityHeadroomProvider}}.

(FICA) Applications should maintain an application specific 'cluster' resource to calculate headroom and userlimit
--
Key: YARN-2848
URL: https://issues.apache.org/jira/browse/YARN-2848
Project: Hadoop YARN
Issue Type: Improvement
Components: capacityscheduler
Reporter: Craig Welch
Assignee: Craig Welch

Likely solutions to [YARN-1680] (properly handling node and rack blacklisting with cluster-level node additions and removals) will entail managing an application-level slice of the cluster resource available to the application, for use in accurately calculating the application's headroom and user limit. There is an assumption that events which impact this resource will occur less frequently than the need to calculate headroom, userlimit, etc. (a valid assumption, given that the latter occurs per allocation heartbeat). Given that, the application should (with assistance from cluster-level code...) detect changes to the composition of the cluster (node addition, removal) and, when those have occurred, calculate an application-specific cluster resource by comparing cluster nodes against its own blacklist (both rack and individual node). I think it makes sense to include nodelabel considerations in this calculation, as it will be efficient to do both at the same time, and the single resource value reflecting both constraints could then be used for efficient, frequent headroom and userlimit calculations while remaining highly accurate.

The application would need to be made aware of the nodelabel changes it is interested in (the application or removal of labels of interest to the application to/from nodes). For this purpose, the application submission's nodelabel expression would be used to determine the nodelabel impact on the resource used to calculate userlimit and headroom. (Cases where the application elected to request resources not using the application-level label expression are out of scope here, but for the common usecase of an application which uses a particular expression throughout, userlimit and headroom would be accurate.) This could also provide an overall mechanism for handling application-specific resource constraints which might be added in the future.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
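The proposal in the description can be illustrated with a minimal Java sketch. This is not actual YARN code: {{Resource}}, {{Node}}, and {{compute}} below are hypothetical, simplified stand-ins for the scheduler's real types, showing only the core idea of filtering cluster nodes by the application's blacklist and node-label expression when the cluster composition changes.

```java
import java.util.List;
import java.util.Set;

public class AppClusterResource {

    // Minimal stand-ins for YARN's Resource and node abstractions.
    public record Resource(long memoryMb, int vcores) {
        Resource plus(Resource other) {
            return new Resource(memoryMb + other.memoryMb, vcores + other.vcores);
        }
    }

    public record Node(String host, String rack, String label, Resource capacity) {}

    /**
     * Sum the capacity of every node that is neither blacklisted (by host or
     * rack) nor excluded by the app's node-label expression. Intended to be
     * recomputed only when the cluster composition or the blacklist changes,
     * not on every allocation heartbeat.
     */
    public static Resource compute(List<Node> clusterNodes,
                                   Set<String> blacklist,    // hosts and racks
                                   String labelExpression) { // "" = default partition
        Resource total = new Resource(0, 0);
        for (Node n : clusterNodes) {
            if (blacklist.contains(n.host()) || blacklist.contains(n.rack())) {
                continue; // excluded by the app's blacklist
            }
            if (!n.label().equals(labelExpression)) {
                continue; // excluded by the app's node-label expression
            }
            total = total.plus(n.capacity());
        }
        return total;
    }
}
```

The resulting single {{Resource}} value can then back frequent headroom and userlimit calculations cheaply, as the description suggests.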
[jira] [Commented] (YARN-2848) (FICA) Applications should maintain an application specific 'cluster' resource to calculate headroom and userlimit
[ https://issues.apache.org/jira/browse/YARN-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14523629#comment-14523629 ] Craig Welch commented on YARN-2848:
---
The ResourceUsage functionality added in [YARN-3356], [YARN-3099], and [YARN-3092] is effectively an implementation of the approach suggested here, and was also used for [YARN-3463]. Given that, I'm going to close this one. While it has not yet been used to address the blacklist issue with headroom [YARN-1680], that should be handled there in any case.
[jira] [Commented] (YARN-2848) (FICA) Applications should maintain an application specific 'cluster' resource to calculate headroom and userlimit
[ https://issues.apache.org/jira/browse/YARN-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14266590#comment-14266590 ] Chen He commented on YARN-2848:
---
I guess the label is provided by users or applications to choose which nodes to run on, while the blacklist is detected by the system to indicate which nodes are not stable enough to run on. The blacklisted nodes could be regarded as a special label, or a NOT label. However, we would need an extra synchronization process to keep the users'/apps' requests consistent with the set of unstable nodes before making scheduling decisions. YARN-1680 could be a solution before we actually settle the label scope and the synchronization-overhead issue.
[jira] [Commented] (YARN-2848) (FICA) Applications should maintain an application specific 'cluster' resource to calculate headroom and userlimit
[ https://issues.apache.org/jira/browse/YARN-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1420#comment-1420 ] Wangda Tan commented on YARN-2848:
--
[~cwelch], IIUC, this JIRA is to tackle the cases where an app has some special requirements on resource requests (including, but not limited to, blacklisted nodes, node-label expressions, etc.) and the RM wants to return a headroom to the AM that considers such factors. My major concern is that this will add computation complexity on the RM side -- we already have very heavy computation when trying to allocate containers (locality, hierarchy of queues, user-limit, headroom, node-labels). If we try to resolve the problem by handling events (such as node-label changes, blacklist changes, etc.) at the *app level*, it will be very problematic, since some of the operations cannot even be done in O(n) time. So I think if some operation has complexity O(n) (where n can be as large as the number of apps in the cluster), we should be very cautious about it. Any thoughts?
[jira] [Commented] (YARN-2848) (FICA) Applications should maintain an application specific 'cluster' resource to calculate headroom and userlimit
[ https://issues.apache.org/jira/browse/YARN-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208941#comment-14208941 ] Craig Welch commented on YARN-2848:
---
bq. IIUC, this JIRA is to tackle the cases where an app has some special requirements on resource requests (including, but not limited to, blacklisted nodes, node-label expressions, etc.) and the RM wants to return a headroom to the AM that considers such factors.

Well, yes... although the extent to which they are special isn't clear. [YARN-1680] surfaces this as a bug (something of a design miss...) for blacklisting of resources, which has been around for some time. And of course, node labels were recently added, but with an eye to being used -- there's a desire to use them with processes that will want accurate headroom, userlimit, etc. So the problem already exists, as it were; it's not something new we're choosing to introduce. It's rather a way of resolving inconsistencies that exist because of functionality that is perhaps not fully complete with respect to the rest of the system. Insofar as we want applications to work with constraints on the nodes they use, we will need to solve this problem in some way, or do away with headroom and/or user limits as such, which is not a very attractive choice.

bq. My major concern is that this will add computation complexity on the RM side -- we already have very heavy computation when trying to allocate containers (locality, hierarchy of queues, user-limit, headroom, node-labels).

The idea is to minimize the calculation needed during allocation by adjusting resources only as needed, in response to external events, which should be relatively infrequent with respect to any given application.

bq. If we try to resolve the problem by handling events (such as node-label changes, blacklist changes, etc.) at the app level, it will be very problematic, since some of the operations cannot even be done in O(n) time.
bq. So I think if some operation has complexity O(n) (where n can be as large as the number of apps in the cluster), we should be very cautious about it.

The suggestion is not to have the activity which accepts a node-label change, or a node addition or removal, synchronously notify all applications of that change. Rather, applications check for changes relevant to them (changes to the nodes held by a label they care about (label-level info), node additions or removals relevant to their blacklisting (cluster-level info)) and adjust their resource view only when they determine it is necessary to do so. At the level of the cluster handling the addition or removal of a node, or changes to the nodes for a node label, nothing more than an indication of last change for the resources needs to occur; applications simply check for the change indications they care about and take action as needed. It should be as efficient and lightweight as possible, and would not impose any O(n) (where n = #apps in the cluster) operations on any single/synchronous code path.
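The pull model described in this comment can be sketched as a cheap change counter. This is an illustrative assumption, not YARN's actual API: the class and method names below ({{ClusterChangeIndicator}}, {{markChanged}}, {{AppView}}) are hypothetical, and headroom is reduced to a single number for clarity.

```java
import java.util.concurrent.atomic.AtomicLong;

public class ClusterChangeIndicator {

    private final AtomicLong version = new AtomicLong(0);

    // Called by cluster-level code on node add/remove or label change: O(1),
    // with no synchronous fan-out to applications.
    public void markChanged() {
        version.incrementAndGet();
    }

    public long current() {
        return version.get();
    }

    // Per-application view: compares the cluster's last-change indication
    // against the version it last saw, and recomputes only on a mismatch.
    public static class AppView {
        private long seenVersion = -1;
        private long cachedHeadroomMb = 0;
        int recomputes = 0; // exposed only to illustrate how rarely we recompute

        public long headroomMb(ClusterChangeIndicator indicator,
                               java.util.function.LongSupplier recompute) {
            long now = indicator.current();
            if (now != seenVersion) {                     // cluster changed since last check
                cachedHeadroomMb = recompute.getAsLong(); // the expensive part
                seenVersion = now;
                recomputes++;
            }
            return cachedHeadroomMb; // common case: cheap cached read per heartbeat
        }
    }
}
```

Under this sketch, the expensive recomputation happens only when the cluster composition or labels actually change, while every allocation heartbeat pays only an atomic read and a comparison.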
[jira] [Commented] (YARN-2848) (FICA) Applications should maintain an application specific 'cluster' resource to calculate headroom and userlimit
[ https://issues.apache.org/jira/browse/YARN-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209021#comment-14209021 ] Wangda Tan commented on YARN-2848:
--
[~cwelch], thanks for your explanation. I think it is valid to have such a mechanism, of course :) -- I'm just concerned about the cost. The pull model you mentioned is isomorphic to the push model (send events to apps, where we can also add filters to select which apps to send to). And with the pull model, we don't have a dedicated thread per app to do the checking; more problematically, if apps cannot handle such events synchronously, we need to prepare an event queue for them. And I think this statement is not always true:

bq. and would not impose any O(n) (where n = #apps in the cluster) operations on any single/synchronous code path

Since it is possible we change labels on a set of nodes (say 1k nodes), and many applications could run across those 1k nodes, some operations will scan nodes and rebuild information from scratch -- an O(n * m) operation in very extreme cases.
[jira] [Commented] (YARN-2848) (FICA) Applications should maintain an application specific 'cluster' resource to calculate headroom and userlimit
[ https://issues.apache.org/jira/browse/YARN-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209214#comment-14209214 ] Craig Welch commented on YARN-2848:
---
bq. I think it is valid to have such a mechanism, of course :) -- I'm just concerned about the cost.

It sounds like you're under the impression that this is somehow optional/elective -- I don't believe it is. Until we implement something along these lines we have known defects ([YARN-1680], [https://issues.apache.org/jira/browse/YARN-2496?focusedCommentId=14143993&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14143993], [https://issues.apache.org/jira/browse/YARN-796?focusedCommentId=14146321&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14146321]). One way or another, some capability like this needs to be created, or we need to remove other functionality (headroom, userlimits), or continue to have significant defects/shortcomings (which is problematic and, imho, not really an option).

bq. The pull model you mentioned is isomorphic to the push model (send events to apps, where we can also add filters to select which apps to send to). And with the pull model, we don't have a dedicated thread per app to do the checking; more problematically, if apps cannot handle such events synchronously, we need to prepare an event queue for them.

Not at all -- as I've mentioned a couple of times, an option is simply to attach an update indicator to resources, which the app can compare against its own to determine whether any action needs to be taken, with the expected general case being: none. That's where the efficiency of the approach comes in. Of course, the particulars of the implementation are what we need to work out here, but we do not necessarily have to have event queues, and we certainly don't need the apps to handle events synchronously. It's possible to take those approaches, but certainly not necessary.

bq. And I think the statement is not always true ... Since it is possible we change labels on a set of nodes (say 1k nodes), and many applications could run across those 1k nodes, some operations will scan nodes and rebuild information from scratch -- an O(n * m) operation in very extreme cases.

If all running applications were interested in a label which changed across all nodes in a cluster, some activity would be necessary for them to make adjustments. As a rule, this will be very infrequent in comparison to the frequency of allocation requests in the cluster, which is the strength of the approach. Depending on how exactly we model things, it may well not be necessary for all applications to process all nodes of the cluster individually. For example, if we limit nodes to a single label per node, then that could be calculated at the cluster level. If not, tracking intersection values for label combinations (if limited) could eliminate the need. Putting aside possible shortcuts for a moment, however, I suspect the straightforward approach of recalculating only when necessary, at the application level, will actually be fine. It's possible to posit pathological cases which would be problematic there, but it's possible to do that with many things. If the pathological case (a change to labels of interest, or to the set of cluster nodes, for every application at every allocation heartbeat...) is not likely and does not need to be supported (it isn't, and doesn't...), then infrequent recalculations only when necessary should not be problematic.

The original approach on [YARN-1680] would have performed that calculation with every allocation request -- which we rightly took issue with -- but doing so only when needed is considered a viable approach (the only realistic one I'm aware of...), which is why we're heading in that direction; the question is how to do it in detail. The point of this jira is to note that the blacklist problem and the node-label problem, in relation to resources available to the application, are strikingly similar in their needs (they're photo-negatives of one another, effectively...), and so it makes sense to combine them, as sharing is likely to yield both runtime and code efficiency.
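The cluster-level shortcut mentioned above (a single label per node means the label partition's capacity can be maintained at the cluster level, so applications never scan nodes) can be sketched roughly as follows. This is a hypothetical illustration, not YARN code, and it tracks only memory for brevity.

```java
import java.util.HashMap;
import java.util.Map;

public class LabelResourceAggregates {

    // Running total of memory per label partition ("" = default partition).
    private final Map<String, Long> memByLabel = new HashMap<>();

    public void addNode(String label, long memoryMb) {
        memByLabel.merge(label, memoryMb, Long::sum);
    }

    public void removeNode(String label, long memoryMb) {
        memByLabel.merge(label, -memoryMb, Long::sum);
    }

    // Re-labeling a node is a removal from the old partition plus an addition
    // to the new one: O(1) per node, with no per-application work at all.
    public void relabelNode(String oldLabel, String newLabel, long memoryMb) {
        removeNode(oldLabel, memoryMb);
        addNode(newLabel, memoryMb);
    }

    // An application interested in one label reads a single aggregate value
    // instead of scanning every node in the cluster.
    public long memoryForLabel(String label) {
        return memByLabel.getOrDefault(label, 0L);
    }
}
```

Combined with a per-application change indicator, this would keep even the "label changed on 1k nodes" case cheap for applications, at the cost of cluster-level bookkeeping on each node event.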