[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15107009#comment-15107009 ]
Nathan Roberts commented on YARN-1011:
--------------------------------------

bq. Welcome any thoughts/suggestions on handling promotion if we allow applications to ask for only guaranteed containers. I'll continue brainstorming. We want to have a simple mechanism, if possible; complex protocols seem to find a way to hoard bugs.

I agree that we want something simple, and this probably doesn't qualify, but below are some thoughts anyway. This seems like a difficult problem. Maybe a webex would make sense at some point to go over the design and work through some of these issues?

Maybe we need to run two schedulers, conceptually anyway. One of them is exactly what we have today; call it the "GUARANTEED" scheduler. The second one is responsible for the "OPPORTUNISTIC" space. What I like about this sort of approach is that we aren't changing the way the GUARANTEED scheduler does things: it assigns containers in the same order as it always has, regardless of whether or not opportunistic containers are being allocated in the background. By having separate schedulers, we're not perturbing the way user limits, capacity limits, reservations, preemption, and other scheduler-specific fairness algorithms deal with opportunistic capacity (I'm concerned we'll have lots of bugs in this area). The only difference is that the OPPORTUNISTIC side might already be running a container when the GUARANTEED scheduler gets around to the same piece of work (the promotion problem). What I don't like is that it's obviously not simple.

- The OPPORTUNISTIC scheduler could behave very differently from the GUARANTEED scheduler (e.g. it could consider only applications in certain queues, heavily favor applications with quick-running containers, randomly select applications to share OPPORTUNISTIC space fairly, ignore reservations, ignore user limits, work extra hard to get good container locality, etc.).
- When the OPPORTUNISTIC scheduler launches a container, it modifies the ask to indicate that this portion has been launched opportunistically; the size of the ask does not change. (This means the application needs to be aware that it is launching an OPPORTUNISTIC container.)
- As Bikas already mentioned, we have to promote opportunistic containers, even if it means shooting an opportunistic one and launching a guaranteed one somewhere else.
- If the GUARANTEED scheduler decides to assign a container y to a portion of an ask that has already been opportunistically satisfied with container x, the AM is asked to migrate container x to container y. If x and y are on the same host, great: the AM asks the NM to convert x to y (mostly bookkeeping); if not, the AM kills x and launches y. We probably need a new state to track the migration.
- Maybe locality would make the killing of opportunistic containers a rare event? If both schedulers are working hard to get locality (e.g. YARN-80 gets us to about 80% node-local), then the GUARANTEED scheduler will usually pick the same nodes as the OPPORTUNISTIC scheduler, resulting in very simple container conversions with no lost work.
- I don't see how we can get away from occasionally shooting an opportunistic container so that a guaranteed one can run somewhere else. Given that we want opportunistic space to be used for both SLA and non-SLA work, we can't wait around for a low-priority opportunistic container on a busy node. Ideally, the OPPORTUNISTIC scheduler would be good at picking containers that almost never get shot.
- When the GUARANTEED scheduler assigns a container to a node, the over-allocate thresholds could be violated; in that case, OPPORTUNISTIC containers on the node need to be shot. It would be good to avoid this when a simple conversion was going to occur anyway.
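To make the promotion decision in the third bullet concrete, here is a minimal sketch of the "same host: convert; different host: kill and relaunch" rule. All class and method names here are illustrative assumptions for discussion, not YARN APIs:

```java
// Hypothetical sketch of resolving a guaranteed assignment against an
// already-running opportunistic container. Not actual YARN code.
public class PromotionSketch {

    enum Action { CONVERT_IN_PLACE, KILL_AND_RELAUNCH }

    static class Container {
        final String id;
        final String host;
        Container(String id, String host) {
            this.id = id;
            this.host = host;
        }
    }

    // If the guaranteed container y lands on the same host as the running
    // opportunistic container x, the AM asks the NM for a cheap in-place
    // conversion (mostly bookkeeping); otherwise x must be killed and y
    // launched elsewhere, losing any work x had done.
    static Action resolvePromotion(Container opportunistic, Container guaranteed) {
        return opportunistic.host.equals(guaranteed.host)
                ? Action.CONVERT_IN_PLACE
                : Action.KILL_AND_RELAUNCH;
    }
}
```

This is where the locality argument above matters: the more often both schedulers pick the same node, the more often this resolves to the cheap CONVERT_IN_PLACE branch.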
Given the complexities of this problem, we're going to experiment with a simpler approach: over-allocating up to 2-3X on memory, with the NM shooting containers (preemptable containers first) when resources are dangerously low. The over-allocation will be dynamic, based on current node usage (when the node is idle, no over-allocation; basically, there has to be some evidence that over-allocating will be successful before we actually over-allocate). This type of approach might not satisfy all use cases, but it might turn out to be very simple and mostly effective. We'll report back on how this type of approach works out.

> [Umbrella] Schedule containers based on utilization of currently allocated
> containers
> -------------------------------------------------------------------------------------
>
> Key: YARN-1011
> URL: https://issues.apache.org/jira/browse/YARN-1011
> Project: Hadoop YARN
> Issue Type: New Feature
> Reporter: Arun C Murthy
> Attachments: yarn-1011-design-v0.pdf, yarn-1011-design-v1.pdf, yarn-1011-design-v2.pdf
>
> Currently RM allocates containers and assumes resources allocated are utilized.
> RM can, and should, get to a point where it measures utilization of allocated containers and, if appropriate, allocate more (speculative?) containers.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
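As a footnote to the simpler experiment described in the last comment (dynamic over-allocation plus NM-side shooting when resources run dangerously low), here is a rough sketch of what such a policy could look like. All names, formulas, and thresholds are illustrative assumptions, not the actual implementation:

```java
// Hypothetical sketch of a dynamic over-allocation policy: idle nodes get
// no extra headroom; busy nodes with allocated-but-unused memory can be
// over-allocated up to maxFactor (e.g. 2.0-3.0) times physical memory.
// Not actual YARN code.
public class OverAllocationSketch {

    // Returns the effective scheduling limit for the node, in MB.
    static long overAllocationLimitMb(long physicalMb, long allocatedMb,
                                      long usedMb, double maxFactor) {
        if (allocatedMb == 0) {
            // Idle node: no evidence that over-allocation would succeed.
            return physicalMb;
        }
        // Evidence of likely success: fraction of allocated memory sitting unused.
        double unusedFraction =
            Math.max(0.0, (allocatedMb - usedMb) / (double) allocatedMb);
        return (long) (physicalMb * (1.0 + (maxFactor - 1.0) * unusedFraction));
    }

    // The NM starts shooting containers (preemptable/opportunistic ones
    // first) once actual usage crosses a danger threshold of physical memory.
    static boolean shouldShoot(long usedMb, long physicalMb, double dangerFraction) {
        return usedMb >= physicalMb * dangerFraction;
    }
}
```

The key property of this shape of policy is that the over-allocation ceiling shrinks back toward physical capacity as containers actually consume their allocations, so the NM's shooting path should be the exception rather than the steady state.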