[jira] [Commented] (YARN-5864) Capacity Scheduler preemption for fragmented cluster
[ https://issues.apache.org/jira/browse/YARN-5864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755233#comment-15755233 ]

Wangda Tan commented on YARN-5864:
----------------------------------

Thanks [~curino] for the quick response! All great points. I will cover them in the doc, and will at least cover the "tunables" related to this feature.

> Capacity Scheduler preemption for fragmented cluster
> ----------------------------------------------------
>
>                 Key: YARN-5864
>                 URL: https://issues.apache.org/jira/browse/YARN-5864
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>         Attachments: YARN-5864.poc-0.patch
>
> YARN-4390 added preemption for reserved containers. However, we found one case
> where a large container cannot be allocated even though all queues are under
> their limits.
> For example, we have:
> {code}
> Two queues, a and b, capacity 50:50
> Two nodes, n1 and n2, each with 50 resource
> Now queue-a uses 10 on n1 and 10 on n2
> queue-b asks for one single container with resource=45.
> {code}
> The container could be reserved on either host, but no preemption will
> happen because all queues are under their limits.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
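The fragmentation scenario in the issue description can be checked with a toy calculation (this is an illustration, not Capacity Scheduler code): the cluster has enough total free resource, but no single node can host the large container, so it is only ever reserved and inter-queue preemption never fires.

```python
# Toy illustration of the scenario from the JIRA description.
NODE_CAPACITY = 50
queue_a_usage = {"n1": 10, "n2": 10}   # queue-a uses 10 on each node
request = 45                           # queue-b's single large container

# Free space per node after queue-a's containers.
free = {node: NODE_CAPACITY - used for node, used in queue_a_usage.items()}
print(free)                            # {'n1': 40, 'n2': 40}

# Total free (80) exceeds the request, but neither node fits it alone,
# so the container can only be reserved. Preemption never triggers
# because queue-a (20/50) and queue-b (0/50) are both under their limits.
fits_somewhere = any(avail >= request for avail in free.values())
print(fits_somewhere)                  # False
```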
[ https://issues.apache.org/jira/browse/YARN-5864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15754101#comment-15754101 ]

Carlo Curino commented on YARN-5864:
------------------------------------

[~wangda] I like the direction of specifying more clearly what happens. I think working on a design doc that spells this out would be very valuable; I am happy to review and brainstorm with you if you think it is useful. (But FYI: I am on parental leave, and traveling abroad until mid-January.)

In writing the document, I think you should address the semantics from all points of view, e.g., which guarantees do I get as a user of any of the queues (not just the one we are preempting in favor of)? It is clear that if I am running over capacity I can be preempted, but what happens if I am (safely?) within my capacity? (This is related to the "abuses" I described before, e.g., one in which I ask for massive containers on the nodes I want, and then resize them down after you have killed everyone in my way.)

Looking further ahead: ideally, the document in which you are starting to capture the semantics of this feature can be expanded to slowly cover all "tunables" of the scheduler, and to explore the many complex interactions among features and the semantics we can derive from them (I bet we might be able to get rid of some redundancies). This could become part of the documentation of YARN. Even nicer would be to codify this with SLS-driven tests (so that no future feature can mess up the semantics you are capturing without us noticing).
[ https://issues.apache.org/jira/browse/YARN-5864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752695#comment-15752695 ]

Wangda Tan commented on YARN-5864:
----------------------------------

Offline discussed with [~vinodkv]. We can give this feature better semantics by adding a queue-priority property. (Credit to [~vinodkv] for the idea.)

The existing scheduler sorts queues by (used-capacity / configured-capacity). But in some cases certain apps/services need to get resources first. For example, we allocate 85% to a production queue and 15% to a test queue. When the production queue is underutilized, we want the scheduler to give resources to the production queue first, regardless of the test queue's utilization.

A rough plan:
# We will assign a priority to each queue under the same parent.
# Each time, the scheduler picks the underutilized queue with the highest priority; if there is no underutilized queue, the scheduler picks the queue with the lowest utilization.
# When we do preemption, if a higher-priority queue has special resource requests (such as very large memory, hard locality, placement constraints, etc.), the scheduler will do relatively *conservative* preemption from lower-priority queues, regardless of utilization.

That is just a rough idea; [~curino], please let us know your comments. I can formalize the design once we generally agree on the approach.
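The queue-selection rule sketched above ("underutilized and highest priority first, else lowest utilization") can be written down as a small toy model. The `Queue` fields, the higher-value-wins priority convention, and `pick_next_queue` are all illustrative assumptions, not YARN APIs:

```python
from dataclasses import dataclass

@dataclass
class Queue:
    name: str
    priority: int      # assumed convention: higher value = served first
    used: float
    guaranteed: float  # configured capacity

    @property
    def utilization(self) -> float:
        return self.used / self.guaranteed

def pick_next_queue(queues):
    """Pick the underutilized queue with the highest priority; if every
    queue is at or over its guarantee, fall back to lowest utilization."""
    under = [q for q in queues if q.utilization < 1.0]
    if under:
        return max(under, key=lambda q: q.priority)
    return min(queues, key=lambda q: q.utilization)

# The production/test example from the comment: production (85%) is
# underutilized, so it is served first regardless of test's utilization.
queues = [
    Queue("production", priority=10, used=40, guaranteed=85),
    Queue("test", priority=1, used=5, guaranteed=15),
]
print(pick_next_queue(queues).name)  # production
```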
[ https://issues.apache.org/jira/browse/YARN-5864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15674481#comment-15674481 ]

Wangda Tan commented on YARN-5864:
----------------------------------

Thanks [~curino] for sharing the Firmament paper. I just read it; it provides a lot of insightful ideas. I believe it can work pretty well for a cluster with a homogeneous workload, but it may not be able to solve the mixed-workload issue, as the paper states:

bq. Firmament shows that a single scheduler can attain scalability, but its MCMF optimization does not trivially admit multiple independent schedulers.

So in my mind, YARN needs a Borg-like architecture so that different kinds of workloads can be scheduled using different pluggable scheduling policies and scorers. Firmament could be one of these scheduling policies.

I agree with your comment that we should define better semantics for the feature; I will think it through again and keep you posted.
[ https://issues.apache.org/jira/browse/YARN-5864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15669047#comment-15669047 ]

Carlo Curino commented on YARN-5864:
------------------------------------

[~wangda] I think we are on the same page on the problem side, and I agree that the scheduling invariants (which were once hard constraints) will eventually look more like soft constraints, which we aim to meet/maximize but are OK to compromise on in some cases. Understanding how to trade one for the other, or how to make decisions that maximize the number/amount of met constraints, is the hard problem. To this purpose I would argue that (2) is structurally better positioned to capture all the tradeoffs in a compact and easy-to-understand way than any combination of heuristics.

That said, how to design (2) in a scalable/fast way is an open problem (an interesting direction recently appeared at OSDI 2016, http://www.firmament.io/; while it is not enough, it has some good ideas we could consider leveraging). So I am proposing it more as a north star than as a short-term proposal for how to tackle this JIRA (or the scheduler's issues in general). On the other hand, (1) is an ongoing activity we can start right away, and we should do it regardless of whether we eventually manage to do something like (2) or not.

Regarding abuses/scope of the feature: I am certain that the initial scenario you are designing for has all the right properties to be safe/reasonable/trusted, but once the feature is out there, people will start using it in the most baroque ways, and some of the issues I alluded to might come up. Having very crisply defined semantics, configuration-validation mechanics (that prevent the worst configuration mistakes), and very tight unit tests are probably our best line of defense.
[ https://issues.apache.org/jira/browse/YARN-5864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15655514#comment-15655514 ]

Tan, Wangda commented on YARN-5864:
-----------------------------------

Thanks [~curino] for sharing these insightful suggestions.

The problem you mention is real: we have put a lot of effort into adding features for various resource constraints (such as limits, node partitions, priorities, etc.), but we have paid less attention to keeping the semantics simple and consistent. I also agree that we need to spend time thinking about what semantics the YARN scheduler should have. For example, the minimum guarantee of CS is that a queue should get at least its configured capacity, but a picky app can make an under-utilized queue wait forever for resources. And, as you mentioned above, a non-preemptable queue can invalidate configured capacity as well.

However, I would argue that the scheduler cannot run perfectly without violating any of the constraints. It is not just a set of formulas that we define and hand to a solver to optimize; it involves a lot of human emotion and preference. For example, a user may not understand, or be glad to accept, why a picky request cannot be allocated even when the queue/cluster has available capacity. And it may not be acceptable in a production cluster that a long-running service for realtime queries cannot be launched because we do not want to kill some less-important batch jobs. My point is: if we can have these rules defined in the doc, and users can see from the UI/logs what happened, we can add them.

To improve these, I think your suggestion (1) will be more helpful and achievable in the short term; we can definitely remove some parameters. For example, the existing user-limit definition is not good enough, and user-limit-factor can always leave a queue unable to fully utilize its capacity. And we can better define these semantics in the doc and UI.
(2) looks beautiful, but it may not solve the root problem directly: the first priority is to make our users happy to accept the result, rather than solving the problem beautifully in mathematics. For example, for the problem in the description of this JIRA, I don't think (2) can produce an allocation without harming other applications. And from an implementation perspective, I'm not sure how a solver-based solution can handle both fast allocation (we want to allocate within milliseconds for interactive queries) and good placement (such as gang scheduling with other constraints like anti-affinity). It seems to me that with option (2) we would sacrifice low latency to get better placement quality.

bq. This opens up many abuses, one that comes to mind ...

Actually, this feature will only be used in a pretty controlled environment: important long-running services run in a separate queue, and the admin/user agrees that the queue can preempt other batch jobs to get new containers. ACLs will be set to keep normal users from running inside these queues; all apps in the queue should be trusted apps such as YARN native services (Slider), Spark, etc. And we can also make sure these apps try their best to respect other apps.

Please advise if you think we can improve the semantics of this feature. Thanks,
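The user-limit-factor problem mentioned above can be illustrated with a deliberately simplified model. This shows only the user-limit-factor term; the real Capacity Scheduler user limit is computed from several interacting inputs, so treat this strictly as a sketch:

```python
# Simplified model (assumption: only the user-limit-factor cap is shown).
# Each user is capped at user_limit_factor * configured capacity, and the
# queue as a whole is capped at its max-capacity. With few active users,
# a small user-limit-factor can keep the queue from using idle cluster
# resources beyond its configured capacity.
def max_queue_usage(capacity, max_capacity, user_limit_factor, active_users):
    per_user_cap = capacity * user_limit_factor
    return min(max_capacity, per_user_cap * active_users)

# Queue: capacity 40, max-capacity 100, default user-limit-factor = 1.
print(max_queue_usage(40, 100, 1.0, active_users=1))  # 40.0: one user cannot
                                                      # push the queue toward
                                                      # its max-capacity
print(max_queue_usage(40, 100, 2.5, active_users=1))  # 100.0
```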
[ https://issues.apache.org/jira/browse/YARN-5864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15653315#comment-15653315 ]

Carlo Curino commented on YARN-5864:
------------------------------------

[~wangda] I understand the need for this feature, but my general concern is that the collection of features in CS have very poorly defined interactions and, worse, violate each other's invariants left, right, and center. For example, non-preemptable queues, when in use, break the fair over-capacity sharing semantics. Similarly, locality and node labels have heavy and not fully clear redundancies, and user-limits / app priorities / request priorities / container types / etc. further complicate this space. The mental model associated with the system is growing disproportionately for both users and operators, and this is a bad sign.

The feature you propose seems to push us further down this slippery slope, where the semantics of what a tenant gets for his/her money are very unclear. Until now, the one invariant we had not violated was: if I paid for capacity C, and I am within capacity C, my containers will not be disturbed (regardless of other tenants' desires). Now a queue may be preempted within its capacity to accommodate another queue's large containers. This opens up many abuses; one that comes to mind:
# I request a large container on node N1,
# preemption kicks out some other tenant,
# I get the container on N1,
# I resize the container on N1 down to a normal-sized container,
# I repeat until I grab all the nodes I want.

Through this little trick, a nasty user can simply bully his/her way onto the nodes he/she wants, regardless of the container size really needed and of his/her capacity standing w.r.t. other tenants. I am sure that if we squint hard enough there is a combination of configurations that can prevent this, but the general concern remains.
Bottom line: I don't want to stand in the way of progress and important features, but I don't see this ending well. I see two paths forward:
# A deep refactoring to make the code manageable, plus an analysis that produces crisp semantics for each of the N! combinations of our features---this should ideally lead to cutting all "nice on the box" features that are rarely/never used or have undefined semantics.
# Keep CS for legacy, and create a new solver-based scheduler for which we can prove clear semantics, and which allows users/operators to have a simple mental model of what the system is supposed to deliver.

(2) is my favorite option, if I had a choice.
[ https://issues.apache.org/jira/browse/YARN-5864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652608#comment-15652608 ]

Wangda Tan commented on YARN-5864:
----------------------------------

The problem in the description is hard because it is difficult to clearly explain why a queue would be preempted even while it is within its limit. So I'm proposing to solve one use case only: in some of our customers' configurations, there are separate queues for long-running services, for example an LLAP queue for LLAP services. LLAP services scale up and down depending on the workload; they ask for containers with lots of resource to make sure the hosts running LLAP daemons are not used by other applications. And we want to allocate containers for such long-running services sooner when they need to scale up.

One quick approach in my mind to handle the use case above:
# Add a new preemption selector (which makes sure this feature can be disabled by configuration).
# Add a white-list of queues for the new selector: only queues in the white list can preempt from other queues.
# When a reserved container from a white-listed queue has existed beyond a configured timeout, look at the node holding the reservation and select containers from non-white-listed queues to preempt.

Thoughts and suggestions? [~curino], [~eepayne], [~sunilg]. Attached a patch for review as well.
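The selector steps above can be sketched as follows. All names, types, and the victim-ordering heuristic here are hypothetical illustrations, not the attached patch:

```python
from dataclasses import dataclass

@dataclass
class Container:
    queue: str
    resource: int

@dataclass
class Node:
    available: int
    containers: list

@dataclass
class Reservation:
    queue: str
    resource: int
    created_at: float

def select_victims(reservation, node, whitelist, timeout_sec, now):
    """Pick containers to preempt on the node holding the reservation."""
    if reservation.queue not in whitelist:
        return []                     # selector acts only for white-listed queues
    if now - reservation.created_at < timeout_sec:
        return []                     # give normal scheduling a chance first
    needed = reservation.resource - node.available
    victims, freed = [], 0
    # Assumed heuristic: prefer small victims to minimize disruption.
    for c in sorted(node.containers, key=lambda c: c.resource):
        if c.queue in whitelist:
            continue                  # never preempt another white-listed queue
        victims.append(c)
        freed += c.resource
        if freed >= needed:
            return victims
    return []                         # cannot satisfy even with preemption

# The scenario from the description: node n1 (50 total) runs two queue-a
# containers of 10 each, leaving 40 free; the LLAP queue holds a 45-unit
# reservation that has outlived the timeout.
n1 = Node(available=40, containers=[Container("a", 10), Container("a", 10)])
res = Reservation(queue="llap", resource=45, created_at=0.0)
print([c.resource for c in select_victims(res, n1, {"llap"}, timeout_sec=30, now=100.0)])  # [10]
```

One queue-a container of 10 is enough to close the 5-unit gap, so only one victim is selected; before the timeout, or for a non-white-listed reservation, the selector stays idle.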