[jira] [Commented] (YARN-5864) Capacity Scheduler preemption for fragmented cluster

2016-12-16 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755233#comment-15755233
 ] 

Wangda Tan commented on YARN-5864:
--

Thanks [~curino] for the quick response!

All great points. I will cover them in the doc, and will at least cover the 
"tunables" related to this feature.

> Capacity Scheduler preemption for fragmented cluster 
> -
>
> Key: YARN-5864
> URL: https://issues.apache.org/jira/browse/YARN-5864
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-5864.poc-0.patch
>
>
> YARN-4390 added preemption for reserved containers. However, we found one case 
> where a large container cannot be allocated even though all queues are under 
> their limits.
> For example, we have:
> {code}
> Two queues, a and b, capacity 50:50
> Two nodes: n1 and n2, each with 50 resource
> Now queue-a uses 10 on n1 and 10 on n2
> queue-b asks for one single container with resource=45.
> {code}
> The container could be reserved on either host (each has only 40 free, which is 
> less than 45), but no preemption will happen because all queues are under their 
> limits.






[jira] [Commented] (YARN-5864) Capacity Scheduler preemption for fragmented cluster

2016-12-16 Thread Carlo Curino (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15754101#comment-15754101
 ] 

Carlo Curino commented on YARN-5864:


[~wangda] I like the direction of specifying more clearly what happens. I think 
working on a design doc that spells this out would be very valuable; I am happy 
to review and brainstorm with you if you think it is useful. (But FYI: I am on 
parental leave and traveling abroad till mid-Jan.)

In writing the document, I think you should in particular address the semantics 
from all points of view, e.g., what guarantees do I get as a user of any of the 
queues (not just the one we are preempting in favor of)? It is clear that if I 
am running over-capacity I can be preempted, but what happens if I am (safely?) 
within my capacity? (This is related to the "abuses" I was describing before, 
e.g., one in which I ask for massive containers on the nodes I want, and then 
resize them down after you have killed anyone in my way.)

Looking further ahead: ideally, the document you are starting, which captures the 
semantics of this feature, can be expanded to slowly cover all "tunables" of the 
scheduler and explore the many complex interactions among features and the 
semantics we can derive from them (I bet we might be able to get rid of some 
redundancies). This could become part of the YARN documentation. Even nicer 
would be to codify this with SLS-driven tests (so that no future feature can 
mess with the semantics you are capturing without us noticing).




[jira] [Commented] (YARN-5864) Capacity Scheduler preemption for fragmented cluster

2016-12-15 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752695#comment-15752695
 ] 

Wangda Tan commented on YARN-5864:
--

Offline discussed with [~vinodkv].

We can give this feature better semantics by adding a queue-priority property. 
(Credit to [~vinodkv] for the idea.)

In the existing scheduler, we sort queues based on (used-capacity / 
configured-capacity). But in some cases, certain apps/services need to get 
resources first. For example, we allocate 85% to a production queue and 15% to a 
test queue. When the production queue is underutilized, we want the scheduler to 
give resources to the production queue first, regardless of the test queue's 
utilization.

A rough plan is: we assign a priority to each queue under the same parent. Each 
time, the scheduler picks the underutilized queue with the highest priority; if 
there is no underutilized queue, the scheduler picks the queue with the lowest 
utilization.
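
As a minimal sketch (assuming a hypothetical QueueInfo holder; names are 
illustrative, not the actual CapacityScheduler classes), the ordering rule could 
look like this:

{code}
// Hypothetical sketch of the proposed ordering, not actual CapacityScheduler code.
// Rule: among underutilized queues (used < configured), pick the one with the
// highest priority; if no queue is underutilized, pick the lowest utilization.
import java.util.Comparator;
import java.util.List;

class QueueInfo {
  final String name;
  final int priority;       // higher value = served first
  final float used;         // absolute used capacity
  final float configured;   // absolute configured capacity

  QueueInfo(String name, int priority, float used, float configured) {
    this.name = name; this.priority = priority;
    this.used = used; this.configured = configured;
  }

  boolean underutilized() { return used < configured; }
  float utilization()     { return used / configured; }
}

class QueueOrdering {
  static QueueInfo pickNext(List<QueueInfo> queues) {
    return queues.stream()
        .filter(QueueInfo::underutilized)
        .max(Comparator.comparingInt(q -> q.priority))               // highest priority wins
        .orElseGet(() -> queues.stream()
            .min(Comparator.comparingDouble(QueueInfo::utilization)) // else least utilized
            .orElse(null));
  }
}
{code}

For the 85/15 example above: production (priority 2, used 30 of 85) would be 
picked before test (priority 1, used 5 of 15), even though test has the lower 
utilization, because production is still underutilized and has the higher 
priority.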

And when we do preemption, if a higher-priority queue has some special resource 
requests, such as very large memory, hard locality, or placement constraints, the 
scheduler will do relatively *conservative* preemption from lower-priority queues 
regardless of their utilization.

That is just a rough idea; [~curino], please let us know your comments. I can 
formalize the design once we generally agree on the approach.




[jira] [Commented] (YARN-5864) Capacity Scheduler preemption for fragmented cluster

2016-11-17 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15674481#comment-15674481
 ] 

Wangda Tan commented on YARN-5864:
--

Thanks [~curino] for sharing the Firmament paper. I just read it; it provides a 
lot of insightful ideas. I believe it can work pretty well for a cluster with a 
homogeneous workload, but it may not be able to solve the mixed-workload issues, 
as it states:

bq. Firmament shows that a single scheduler can attain scalability, but its 
MCMF optimization does not trivially admit multiple independent schedulers. 

So in my mind, for YARN, we need a Borg-like architecture so that different kinds 
of workloads can be scheduled using different pluggable scheduling policies and 
scorers. Firmament could be one of these scheduling policies.

I agree with your comment that we should define better semantics for the feature; 
I will think about it again and keep you posted.




[jira] [Commented] (YARN-5864) Capacity Scheduler preemption for fragmented cluster

2016-11-15 Thread Carlo Curino (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15669047#comment-15669047
 ] 

Carlo Curino commented on YARN-5864:


[~wangda] I think we are on the same page on the problem side, and I agree that 
the scheduling invariants (which were once hard constraints) will eventually 
look more like soft constraints, which we aim to meet/maximize but are ok to 
compromise on in some cases.

Understanding how to trade one for the other, or how to make decisions that 
maximize the number/amount of met constraints, is the hard problem. For this 
purpose I would argue that (2) is structurally better positioned to capture all 
the tradeoffs in a compact and easy-to-understand way than any combination of 
heuristics. That said, how to design (2) in a scalable/fast way is an open 
problem (an interesting direction recently appeared at OSDI 2016, 
http://www.firmament.io/; while it is not enough on its own, it has some good 
ideas we could consider leveraging). So I am proposing it more as a north star 
than as a short-term proposal for how to tackle this JIRA (or the scheduler 
issues in general). On the other hand, (1) is an ongoing activity we can start 
right away, and we should do it regardless of whether we eventually manage to do 
something like (2) or not.

Regarding abuses/scope of the feature: I am certain that the initial scenarios 
you are designing for have all the right properties to be 
safe/reasonable/trusted, but once the feature is out there, people will start 
using it in the most baroque ways, and some of the issues I alluded to might 
come up. Having very crisply defined semantics, configuration-validation 
mechanics (that prevent the worst configuration mistakes), and very tight unit 
tests is probably our best line of defense.






[jira] [Commented] (YARN-5864) Capacity Scheduler preemption for fragmented cluster

2016-11-10 Thread Tan, Wangda (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15655514#comment-15655514
 ] 

Tan, Wangda commented on YARN-5864:
---

Thanks [~curino] for sharing these insightful suggestions.

The problem you mentioned is totally true: we have been putting a lot of effort 
into adding features for various resource constraints (such as limits, node 
partitions, priority, etc.), but we have paid less attention to making the 
semantics easier and more consistent.

I also agree that we need to spend some time thinking about what semantics the 
YARN scheduler should have. For example, the minimum guarantee of CS is that a 
queue should get at least its configured capacity, but a picky app could leave an 
under-utilized queue waiting forever for resources. And, as you mentioned above, 
a non-preemptable queue can invalidate configured capacity as well.

However, I would argue that the scheduler cannot run perfectly without violating 
some of the constraints. It is not just a set of formulas that we define and let 
a solver optimize; it involves a lot of human emotion and preference. For 
example, a user may not understand, or be glad to accept, that a picky request 
cannot be allocated even though the queue/cluster has available capacity. And it 
may not be acceptable on a production cluster that a long-running service for 
realtime queries cannot be launched because we don't want to kill some less 
important batch jobs. My point is: if these rules are defined in the doc, and 
users can tell what happened from the UI/logs, we can add them.

To improve this, I think your suggestion (1) will be more helpful and achievable 
in the short term. We can definitely remove some parameters; for example, the 
existing user-limit definition is not good enough, and user-limit-factor can 
prevent a queue from ever fully utilizing its capacity. And we can better define 
these semantics in the doc and UI.

(2) looks beautiful, but it may not solve the root problem directly: the first 
priority is to make our users happy to accept the behavior rather than to solve 
it beautifully in mathematics. For example, for the problem I put in the 
description of this JIRA, I don't think (2) can make the allocation without 
harming other applications. And from an implementation perspective, I'm not sure 
how a solver-based solution can handle both fast allocation (we want to allocate 
within milliseconds for interactive queries) and good placement (such as gang 
scheduling with other constraints like anti-affinity). It seems to me that with 
option (2) we would sacrifice low latency to get better placement quality.

bq. This opens up many abuses, one that comes to mind ...
Actually, this feature will only be used in a pretty controlled environment: 
important long-running services run in a separate queue, and the admin/user 
agrees that it can preempt other batch jobs to get new containers. ACLs will be 
set to prevent normal users from running in these queues; all apps running in 
the queue should be trusted apps such as YARN native services (Slider), Spark, 
etc. And we can also make sure these apps try their best to respect other apps.
Please advise if you think we can improve the semantics of this feature.

Thanks,




[jira] [Commented] (YARN-5864) Capacity Scheduler preemption for fragmented cluster

2016-11-09 Thread Carlo Curino (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15653315#comment-15653315
 ] 

Carlo Curino commented on YARN-5864:


[~wangda] I understand the need for this feature, but the general concern I have 
is that the collection of features in CS has very poorly defined interactions, 
and, worse, they violate each other's invariants left, right, and center. For 
example, non-preemptable queues, when in use, break the fair over-capacity 
sharing semantics. Similarly, locality and node labels have heavy and not fully 
clear redundancies, and user-limits / app priorities / request priorities / 
container types / etc. further complicate this space. The mental model 
associated with the system is growing disproportionately for both users and 
operators, and this is a bad sign.

The new feature you propose seems to push us further down this slippery slope, 
where the semantics of what a tenant gets for his/her money are very unclear. 
Until this feature, the one invariant we had not yet violated was that if I paid 
for capacity C, and I am within capacity C, my containers will not be disturbed 
(regardless of other tenants' desires). Now a queue may or may not be preempted 
within its capacity to accommodate another queue's large containers.

This opens up many abuses; one that comes to mind:
 # I request a large container on node N1,
 # preemption kicks out some other tenant,
 # I get the container on N1,
 # I reduce the size of the container on N1 to a normal-size container...
 # (I repeat till I grab all the nodes I want).
Through this little trick a nasty user can simply bully his/her way onto the 
nodes he/she wants, regardless of the container size really needed and his/her 
capacity standing w.r.t. other tenants. I am sure that if we squint hard enough 
there is a combination of configurations that can prevent this, but the general 
concern remains.


Bottom line: I don't want to stand in the way of progress and important 
features, but I don't see this ending well.

I see two paths forward:
# a deep refactoring to make the code manageable, and an analysis that produces 
crisp semantics for each of the N! combinations of our features---this should 
ideally lead to cutting all "nice on the box" features that are rarely/never 
used, or have undefined semantics.
# Keep CS for legacy, and create a new solver-based scheduler for which we can 
prove clear semantics, and that allows users/operators to have a simple mental 
model of what the system is supposed to deliver.

(2) would be my favorite option if I had a choice.





[jira] [Commented] (YARN-5864) Capacity Scheduler preemption for fragmented cluster

2016-11-09 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652608#comment-15652608
 ] 

Wangda Tan commented on YARN-5864:
--

The problem in the description is hard because it's hard to clearly explain why 
a queue would be preempted even when it is within its limit.

So I'm proposing to solve only one use case: in some of our customers' 
configurations, there are separate queues for long-running services, for example 
an LLAP queue for LLAP services. LLAP services scale up and down depending on 
the workload, and they ask for containers with lots of resources to make sure 
that hosts running LLAP daemons are not used by other applications.

And we want to allocate containers for such long-running services (LRS) sooner 
when they need to scale up.

There's one quick approach in my mind to handle the use case above (a rough 
sketch follows the list):
- Add a new preemption selector (which makes sure this feature can be disabled 
by configuration).
- Add a white-list of queues for the new selector: only queues in the white-list 
can preempt from other queues.
- When a reserved container from a white-listed queue has been pending beyond a 
configured timeout, look at the node holding the reservation and select 
containers from non-whitelisted queues on that node to preempt.
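
A minimal sketch of that selection logic (assuming hypothetical Reservation and 
RunningContainer holders; names are illustrative, not the actual 
CapacityScheduler preemption classes):

{code}
// Hypothetical sketch of the white-list preemption selector, not actual CS code.
// For a reservation from a white-listed queue that has waited longer than the
// configured timeout, pick containers on the reserving node that belong to
// non-whitelisted queues until enough resource would be freed.
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class Reservation {
  String queue;        // queue that made the reservation
  String nodeId;       // node on which the container is reserved
  long reservedSince;  // timestamp (ms) when the reservation was created
  long resource;       // resource still needed to satisfy the reservation
}

class RunningContainer {
  String queue;        // queue that owns the container
  String nodeId;       // node the container runs on
  long resource;       // resource held by the container
}

class WhitelistPreemptionSelector {
  private final Set<String> whitelistedQueues;
  private final long reservationTimeoutMs;

  WhitelistPreemptionSelector(Set<String> whitelistedQueues, long reservationTimeoutMs) {
    this.whitelistedQueues = whitelistedQueues;
    this.reservationTimeoutMs = reservationTimeoutMs;
  }

  List<RunningContainer> selectCandidates(Reservation r,
                                          List<RunningContainer> containersOnNode,
                                          long now) {
    List<RunningContainer> toPreempt = new ArrayList<>();
    // Only act for white-listed queues whose reservation has timed out.
    if (!whitelistedQueues.contains(r.queue)
        || now - r.reservedSince < reservationTimeoutMs) {
      return toPreempt;
    }
    long freed = 0;
    for (RunningContainer c : containersOnNode) {
      if (c.nodeId.equals(r.nodeId) && !whitelistedQueues.contains(c.queue)) {
        toPreempt.add(c);
        freed += c.resource;
        if (freed >= r.resource) {
          break;   // enough resource would be freed on this node
        }
      }
    }
    return toPreempt;
  }
}
{code}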

Thoughts and suggestions? [~curino], [~eepayne], [~sunilg].

Attached patch for review as well.
