[ 
https://issues.apache.org/jira/browse/YARN-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243662#comment-15243662
 ] 

Nathan Roberts commented on YARN-4963:
--------------------------------------

Thanks [~leftnoteasy] for the feedback. I agree that it would be a useful 
feature to be able to give some applications better spread, regardless of 
allocation type. Now we just have to figure out how to get there.

My concern is that I don't think we'd want to implement it using the same 
simple approach if it's going to apply to all container types. For example, in 
our case we almost always want NODE_LOCAL and RACK_LOCAL to get scheduled as 
quickly as possible so I'd want the limit to be high, as opposed to OFF_SWITCH 
where I want the limit to be 3-5 to keep a nice balance between scheduling 
performance and clustering. 

The reason this check was introduced in the first place (iirc) was to prevent 
network-heavy applications from loading up on specific nodes. The OFF_SWITCH 
check was a simple way of achieving this at a global level. The feature I think 
you're asking for (please correct me if I misunderstood) is that applications 
should be able to request that container spread be prioritized over timely 
scheduling (kind of like locality delay does today). I completely agree this 
would be a useful knob for applications to have. It is a trade-off though. An 
application that wants really good spread would be sacrificing scheduling 
opportunities that would probably be given to applications behind them in the 
queue (like locality delay).

So maybe there are two things to do:
1) Have the global OFF_SWITCH check to handle the simple case of avoiding too 
many network-heavy applications on a node. 
2) A feature where applications can specify a 
max_containers_assigned_per_node_per_heartbeat. I think this would be checked 
down in LeafQueue.assignContainers().

Even with #2 in place, I don't think #1 could immediately go away because the 
network-heavy applications would need to start properly specifying this limit.

The other approach to get rid of #1 would be when network is a resource. Such 
applications could then request lots of network resource, which should prevent 
clustering.

Does that make any sort of sense?


> capacity scheduler: Make number of OFF_SWITCH assignments per heartbeat 
> configurable
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-4963
>                 URL: https://issues.apache.org/jira/browse/YARN-4963
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: capacityscheduler
>    Affects Versions: 3.0.0, 2.7.2
>            Reporter: Nathan Roberts
>            Assignee: Nathan Roberts
>         Attachments: YARN-4963.001.patch
>
>
> Currently the capacity scheduler will allow exactly 1 OFF_SWITCH assignment 
> per heartbeat. With more and more non MapReduce workloads coming along, the 
> degree of locality is declining, causing scheduling to be significantly 
> slower. It's still important to limit the number of OFF_SWITCH assignments to 
> avoid densely packing OFF_SWITCH containers onto nodes. 
> Proposal is to add a simple config that makes the number of OFF_SWITCH 
> assignments configurable.
> Will upload candidate patch shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to