[ 
https://issues.apache.org/jira/browse/YARN-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209214#comment-14209214
 ] 

Craig Welch commented on YARN-2848:
-----------------------------------

bq. Thanks for your explanation, I think it is valid to have such mechanism of 
course , I just concerned about the cost.

It sounds like you're under the impression that this is somehow 
optional/elective - I don't believe it is.  Until we implement something along 
these lines we have known defects ( [YARN-1680], 
[https://issues.apache.org/jira/browse/YARN-2496?focusedCommentId=14143993&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14143993],
 
[https://issues.apache.org/jira/browse/YARN-796?focusedCommentId=14146321&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14146321]
 ).  One way or another, some capability like this needs to be created, or we 
need to remove other functionality (headroom, userlimits), or continue to have 
significant defects/shortcomings (which is problematic, and imho not really an 
option).

bq. The pull model you mentioned is isomorphic as the push model (send events 
to apps, which we can also add filters to select which apps to send). And wrt 
pull model, we don't have dedicated thread for app to do that. And more 
problematic, if we cannot get apps synchronously handle such events, we need 
prepare a event queue for apps to do that.

Not at all - as I've mentioned a couple of times, an option is simply to attach 
an update indicator to resources which can be compared by the app against its 
own to determine if any action needs to be taken, with the general case 
expected to be that none is needed.  That's where the efficiency of the 
approach comes in.  
Of course, the particulars of the implementation are what we need to work out 
here, but we do not necessarily have to have event queues, and we certainly 
don't need to have the apps synchronously handle events.  It's possible to take 
those approaches, but certainly not necessary.
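
To make the update-indicator option concrete, here is a minimal sketch.  It is 
only an illustration under my own assumptions: ClusterState, AppView and their 
methods are hypothetical names, not existing YARN classes, and a single memory 
figure stands in for the real resource.  The cluster side bumps a version on 
the rare node or label change, and the app side compares that version with the 
one it last saw on each heartbeat.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical cluster-side holder: bumps a version whenever nodes or labels change. */
final class ClusterState {
    private final AtomicLong version = new AtomicLong();
    private volatile long totalMemoryMb;

    /** Called on the (rare) node addition/removal or label change. */
    void update(long newTotalMemoryMb) {
        totalMemoryMb = newTotalMemoryMb; // publish the value first...
        version.incrementAndGet();        // ...then advertise that something changed
    }

    long version() { return version.get(); }
    long totalMemoryMb() { return totalMemoryMb; }
}

/** Hypothetical app-side view: recalculates only when the version it last saw is stale. */
final class AppView {
    private long seenVersion = -1;  // -1 forces a calculation on first use
    private long cachedMemoryMb;
    int recalculations;             // exposed only so the behaviour can be observed

    long clusterMemoryFor(ClusterState cluster) {
        long current = cluster.version();
        if (current != seenVersion) {
            // Rare path: stand-in for the real per-application recalculation.
            cachedMemoryMb = cluster.totalMemoryMb();
            seenVersion = current;
            recalculations++;
        }
        // Common path: one comparison, no event queue, no synchronous event handling.
        return cachedMemoryMb;
    }
}
```

Calling clusterMemoryFor repeatedly between cluster changes costs one 
comparison per call; the recalculation runs once per change the app observes.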

bq. And I think the statement is not always true ... Since it is possible we 
change labels on a set of nodes (say 1k nodes), and many applications could run 
across the 1k nodes, some operation will scan nodes and build information from 
scratch, it is a O ( n * m ) operation in very extreme cases.

If all running applications were interested in a label which changed across all 
nodes in a cluster, some activity would be necessary for them to make 
adjustments.  As a rule, this will be very infrequent in comparison to the 
frequency of allocation requests in the cluster, which is the strength of the 
approach.  Depending on how exactly we model things, it may well not be 
necessary for all applications to process all nodes of the cluster 
individually.  For example, if we limit nodes to a single label per node then 
that could be calculated at a cluster level.  If not, tracking intersection 
values for label combinations (if limited) could eliminate the need.  
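
A rough sketch of the single-label-per-node shortcut follows.  It is 
illustrative only: LabelTotals and its methods are hypothetical names, memory 
stands in for the full resource, and the empty string is my stand-in for "no 
label".  The cluster keeps one running total per label, so an application 
interested in a label reads a single value instead of scanning nodes.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical cluster-level tally, assuming each node carries exactly one (non-null) label. */
final class LabelTotals {
    private final Map<String, Long> memoryByLabel = new HashMap<>();
    private final Map<String, String> labelByNode = new HashMap<>();
    private final Map<String, Long> memoryByNode = new HashMap<>();

    /** Registers a node (replacing any earlier registration) and adds it to its label's total. */
    void addNode(String node, String label, long memoryMb) {
        removeNode(node);
        labelByNode.put(node, label);
        memoryByNode.put(node, memoryMb);
        memoryByLabel.merge(label, memoryMb, Long::sum);
    }

    /** Removes a node, if known, and subtracts it from its label's total. */
    void removeNode(String node) {
        String label = labelByNode.remove(node);
        Long memoryMb = memoryByNode.remove(node);
        if (label != null && memoryMb != null) {
            memoryByLabel.merge(label, -memoryMb, Long::sum);
        }
    }

    /** Moves a known node to a new label; only the two affected totals change. */
    void relabel(String node, String newLabel) {
        Long memoryMb = memoryByNode.get(node);
        if (memoryMb != null) {
            addNode(node, newLabel, memoryMb);
        }
    }

    /** Constant-time lookup used by any application interested in this label. */
    long memoryFor(String label) {
        return memoryByLabel.getOrDefault(label, 0L);
    }
}
```

With this shape a label change on 1k nodes costs 1k small updates at the 
cluster level, independent of how many applications are running.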

Putting aside possible shortcuts for a moment, however, I suspect the 
straightforward approach of recalculation only when necessary at an application 
level will actually be fine - it's possible to posit pathological cases which 
will be problematic there, but it's possible to do that with many things.  If 
the pathological case (a change to the labels or nodes of interest to every 
application at every allocation heartbeat, or a change to the set of cluster 
nodes on every heartbeat...) is not likely and does not need to be supported 
(it isn't and doesn't...), then infrequent recalculations only when necessary 
should not be problematic.  The original approach on [YARN-1680] would have 
performed that calculation with every allocation request - which we rightly 
took issue with - but doing so only when needed is considered to be a viable 
approach (the only realistic one I'm aware of...), which is why we're heading 
in that direction - the question is how to do that in detail.  The point of 
this jira is to note that the blacklist problem and the node label problem in 
relation to resources available to the application are strikingly similar in 
their needs (they're photo-negatives of one another, effectively...), and so it 
makes sense to combine them, as it is likely that sharing would improve both 
runtime and code efficiency.
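
As a sketch of why the two problems combine naturally (hypothetical names 
throughout: NodeInfo and AppClusterSlice are not YARN classes, memory stands in 
for the full resource, and a simple equality check stands in for evaluating a 
label expression), a single pass over the nodes can apply the label as an 
include filter and the blacklist as an exclude filter:

```java
import java.util.List;
import java.util.Set;

/** Hypothetical node record; the field names are illustrative only. */
final class NodeInfo {
    final String host;
    final String rack;
    final String label;
    final long memoryMb;

    NodeInfo(String host, String rack, String label, long memoryMb) {
        this.host = host;
        this.rack = rack;
        this.label = label;
        this.memoryMb = memoryMb;
    }
}

/** Hypothetical helper computing the application-specific "cluster" resource. */
final class AppClusterSlice {
    /**
     * Sums memory over the nodes the application can actually use: the node's
     * label must match the application's label (include filter) and neither
     * the node nor its rack may be blacklisted (exclude filter).
     * Intended to run only when the node set, labels or blacklist change.
     */
    static long usableMemoryMb(List<NodeInfo> nodes, String appLabel,
                               Set<String> blacklistedHosts,
                               Set<String> blacklistedRacks) {
        long total = 0;
        for (NodeInfo n : nodes) {
            boolean labelMatches = appLabel.equals(n.label);
            boolean blacklisted = blacklistedHosts.contains(n.host)
                    || blacklistedRacks.contains(n.rack);
            if (labelMatches && !blacklisted) {
                total += n.memoryMb;
            }
        }
        return total;
    }
}
```

The result would be cached by the application and reused for headroom and 
userlimit on every allocation heartbeat until one of its inputs changes.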


> (FICA) Applications should maintain an application specific 'cluster' 
> resource to calculate headroom and userlimit
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-2848
>                 URL: https://issues.apache.org/jira/browse/YARN-2848
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: capacityscheduler
>            Reporter: Craig Welch
>            Assignee: Craig Welch
>
> Likely solutions to [YARN-1680] (properly handling node and rack blacklisting 
> with cluster level node additions and removals) will entail managing an 
> application-level "slice" of the cluster resource available to the 
> application for use in accurately calculating the application headroom and 
> user limit.  There is an assumption that events which impact this resource 
> will occur less frequently than the need to calculate headroom, userlimit, 
> etc (which is a valid assumption given that the calculation occurs on every 
> allocation heartbeat). 
>  Given that, the application should (with assistance from cluster-level 
> code...) detect changes to the composition of the cluster (node addition, 
> removal) and when those have occurred, calculate an application specific 
> cluster resource by comparing cluster nodes to its own blacklist (both rack 
> and individual node).  I think it makes sense to include nodelabel 
> considerations into this calculation as it will be efficient to do both at 
> the same time and the single resource value reflecting both constraints could 
> then be used for efficient frequent headroom and userlimit calculations while 
> remaining highly accurate.  The application would need to be made aware of 
> nodelabel changes it is interested in (the application or removal of labels 
> of interest to the application to/from nodes).  For this purpose, the 
> application submission's nodelabel expression would be used to determine the 
> nodelabel impact on the resource used to calculate userlimit and headroom 
> (Cases where the application elected to request resources not using the 
> application level label expression are out of scope for this - but for the 
> common usecase of an application which uses a particular expression 
> throughout, userlimit and headroom would be accurate).  This could also provide 
> an overall mechanism for handling application-specific resource constraints 
> which might be added in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
