[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15107009#comment-15107009 ]

Nathan Roberts commented on YARN-1011:
--------------------------------------

bq. Welcome any thoughts/suggestions on handling promotion if we allow 
applications to ask for only guaranteed containers. I'll continue 
brain-storming. We want to have a simple mechanism, if possible; complex 
protocols seem to find a way to hoard bugs.

I agree that we want something simple and this probably doesn’t qualify, but 
below are some thoughts anyway. 

This seems like a difficult problem. Maybe a webex would make sense at some 
point to go over the design and work through some of these issues?

Maybe we need to run two schedulers, conceptually anyway. One of them is 
exactly what we have today, call it the “GUARANTEED” scheduler. The second one 
is responsible for the “OPPORTUNISTIC” space. What I like about this sort of 
approach is that we aren’t changing the way the GUARANTEED scheduler would do 
things. The GUARANTEED scheduler assigns containers in the same order as it 
always has, regardless of whether or not opportunistic containers are being 
allocated in the background. By having separate schedulers, we’re not 
perturbing the way user_limits, capacity limits, reservations, preemption, and 
other scheduler-specific fairness algorithms deal with opportunistic capacity 
(I’m concerned we’ll have lots of bugs in this area). The only difference is 
that the OPPORTUNISTIC side might already be running a container when the 
GUARANTEED scheduler gets around to the same piece of work (the promotion 
problem). What I don't like is that it's obviously not simple.
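
To make the idea a bit more concrete, here is a very rough sketch of the split 
(all class and method names below are made up for illustration; this is not 
existing YARN code or a proposed API):

{code:java}
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the "two schedulers" idea. The GUARANTEED side is
// whatever scheduler runs today, untouched; the OPPORTUNISTIC side is a
// separate policy layered on top of it.
interface GuaranteedScheduler {
  void assignContainers(NodeView node);               // existing behavior, unchanged
}

interface OpportunisticPolicy {
  List<ContainerAsk> pickOpportunistic(NodeView node); // its own fairness/locality rules
}

class NodeView {
  long allocatedMb;   // what the GUARANTEED scheduler has handed out on this node
  long utilizedMb;    // what the running containers are actually using
}

class ContainerAsk { /* app id, resource, locality preference, ... */ }

class DualPassScheduler {
  private final GuaranteedScheduler guaranteed;
  private final OpportunisticPolicy opportunistic;

  DualPassScheduler(GuaranteedScheduler g, OpportunisticPolicy o) {
    guaranteed = g;
    opportunistic = o;
  }

  // One node heartbeat: the GUARANTEED pass runs exactly as it does today;
  // only afterwards does the OPPORTUNISTIC pass get to fill space that is
  // allocated but not actually being used.
  void onNodeHeartbeat(NodeView node) {
    guaranteed.assignContainers(node);
    List<ContainerAsk> extra = (node.utilizedMb < node.allocatedMb)
        ? opportunistic.pickOpportunistic(node)
        : Collections.<ContainerAsk>emptyList();
    // ... launch 'extra' as OPPORTUNISTIC containers on this node
  }
}
{code}

The bullets below go into more detail on how the OPPORTUNISTIC side could 
differ and how promotion would work.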
- The OPPORTUNISTIC scheduler could behave very differently from the GUARANTEED 
scheduler (e.g. it could only consider applications in certain queues, it could 
heavily favor applications with quick running containers, it could randomly 
select applications to fairly use OPPORTUNISTIC space, it could ignore 
reservations, it could ignore user limits, it could work extra hard to get good 
container locality, etc.)
- When the OPPORTUNISTIC scheduler launches a container, it modifies the ask to 
indicate this portion has been launched opportunistically; the size of the ask 
does not change (this means the application needs to be aware that it is 
launching an OPPORTUNISTIC container).
- Like Bikas already mentioned, we have to promote opportunistic containers, 
even if it means shooting an opportunistic one and launching a guaranteed one 
somewhere else.
- If the GUARANTEED scheduler decides to assign a container y to a portion of 
an ask that has already been opportunistically launched with container x, the 
AM is asked to migrate container x to container y. If x and y are on the same 
host, great, the AM asks the NM to convert x to y (mostly bookkeeping); if not, 
the AM kills x and launches y. We probably need a new state to track the 
migration (see the sketch after this list).
- Maybe locality would make the killing of opportunistic containers a rare 
event? If both schedulers are working hard to get locality (e.g. YARN-80 gets 
us to about 80% node local), then it seems like the GUARANTEED scheduler is 
going to usually pick the same nodes as the OPPORTUNISTIC scheduler, resulting 
in very simple container conversions with no lost work.
- I don’t see how we can get away from occasionally shooting an opportunistic 
container so that a guaranteed one can run somewhere else. Given that we want 
opportunistic space to be used for both SLA and non-SLA work, we can’t wait 
around for a low priority opportunistic container on a busy node. Ideally the 
OPPORTUNISTIC scheduler would be good at picking containers that almost never 
get shot. 
- When the GUARANTEED scheduler assigns a container to a node, the 
over-allocate thresholds could be violated; in this case OPPORTUNISTIC 
containers on the node need to be shot. It would be good to avoid this when a 
simple conversion is going to occur anyway. 
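
Sketching the promotion/conversion decision from the migration bullet above 
(the enum values and names are hypothetical, just to show the shape of the 
decision, not an actual implementation):

{code:java}
// Hypothetical promotion step: the GUARANTEED scheduler has assigned
// container y to an ask that is already running opportunistically as x.
enum PromotionAction { CONVERT_IN_PLACE, KILL_AND_RELAUNCH }

class PromotionDecider {
  PromotionAction decide(String hostOfX, String hostOfY) {
    if (hostOfX.equals(hostOfY)) {
      // Same host: the AM asks the NM to convert x into y.
      // Mostly bookkeeping, no work is lost.
      return PromotionAction.CONVERT_IN_PLACE;
    }
    // Different hosts: the AM kills x and launches y, losing x's work.
    // A new "migrating" state on the container would track either path.
    return PromotionAction.KILL_AND_RELAUNCH;
  }
}
{code}

If locality keeps both schedulers picking the same nodes most of the time, the 
KILL_AND_RELAUNCH path should be the rare one.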

Given the complexities of this problem, we're going to experiment with a 
simpler approach of over-allocating up to 2-3X on memory, with the NM shooting 
containers (preemptable containers first) when resources are dangerously low. 
The over-allocation will be dynamic, based on current node usage (when the node 
is idle, no over-allocation; basically, there has to be some evidence that 
over-allocating will be successful before we actually over-allocate). This type 
of approach might not satisfy all use cases, but it might turn out to be very 
simple and mostly effective. We'll report back on how this type of approach 
works out.
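
As a sketch of what "dynamic based on current node usage" could mean (the 
thresholds and names below are illustrative assumptions, not what we'll 
actually implement):

{code:java}
// Hypothetical NM-side policy: only advertise over-allocatable space when
// allocated containers are demonstrably not using what they were given,
// cap it at roughly 2-3X physical memory, and start shooting preemptable
// (opportunistic) containers first when real usage gets dangerously high.
class NodeOverAllocator {
  static final double MAX_OVERALLOCATION = 2.5;   // the "2-3X on memory" cap
  static final double DANGER_UTILIZATION = 0.95;  // point at which the NM preempts

  // Extra memory (beyond what is already allocated) the RM may hand out here.
  long overAllocationHeadroomMb(long physicalMb, long allocatedMb, long usedMb) {
    long idleAllocatedMb = allocatedMb - usedMb;
    // Idle node or fully-used allocations: no evidence that over-allocating
    // would succeed, so don't over-allocate.
    if (allocatedMb == 0 || idleAllocatedMb <= 0) {
      return 0;
    }
    long hardCapMb = (long) (physicalMb * MAX_OVERALLOCATION);
    return Math.max(0, Math.min(idleAllocatedMb, hardCapMb - allocatedMb));
  }

  // When real utilization crosses the danger line, the NM kills
  // opportunistic/preemptable containers before anything else suffers.
  boolean mustPreempt(long physicalMb, long usedMb) {
    return usedMb > physicalMb * DANGER_UTILIZATION;
  }
}
{code}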

> [Umbrella] Schedule containers based on utilization of currently allocated 
> containers
> -------------------------------------------------------------------------------------
>
>                 Key: YARN-1011
>                 URL: https://issues.apache.org/jira/browse/YARN-1011
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Arun C Murthy
>         Attachments: yarn-1011-design-v0.pdf, yarn-1011-design-v1.pdf, 
> yarn-1011-design-v2.pdf
>
>
> Currently RM allocates containers and assumes resources allocated are 
> utilized.
> RM can, and should, get to a point where it measures utilization of allocated 
> containers and, if appropriate, allocate more (speculative?) containers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
