[ 
https://issues.apache.org/jira/browse/YARN-6344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937580#comment-15937580
 ] 

Konstantinos Karanasos edited comment on YARN-6344 at 3/23/17 2:21 AM:
-----------------------------------------------------------------------

As I mentioned, in the patch I uploaded, setting the new parameter 
({{rack-locality-delay}}) to -1 preserves the existing relax locality behavior.
The new functionality kicks in only for positive values of the parameter.
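As a minimal sketch of that gating (hypothetical names, not the actual wiring in the patch):

{code:java}
// Illustrative only: -1 preserves the existing relax-locality behavior,
// while positive values enable the new rack-locality-delay logic.
static boolean useNewRackLocalityLogic(int rackLocalityDelay) {
  if (rackLocalityDelay == -1) {
    return false;              // existing relax-locality behavior
  }
  return rackLocalityDelay > 0; // new functionality only for positive values
}
{code}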

While we are at it, it might be useful to discuss whether the existing behavior 
is still desirable.
Consider a cluster of N nodes and a resource request asking for C containers on 
L different locations (where L is the total number of unique nodes and racks in 
the request).
Currently, rack assignment happens after node-locality-delay missed 
opportunities. The way this works is straightforward.
On the other hand, off-switch assignment happens after L * C / N missed 
opportunities, capped by the size of the cluster.
This means that we tend to allow off-switch assignments faster when: (a) the 
resource request targets a small number of locations, (b) few containers are 
requested, (c) the cluster is large, or a combination thereof.
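For concreteness, here is a minimal, self-contained sketch of the two thresholds 
as described above (illustrative only, not the actual CapacityScheduler code):

{code:java}
// Simplified sketch of the two delays discussed above; L, C and N are as
// defined in the text. Illustrative only, not the actual CapacityScheduler code.
public final class LocalityDelaySketch {

  // Rack-local assignment is allowed after node-locality-delay missed opportunities.
  static long rackThreshold(long nodeLocalityDelay) {
    return nodeLocalityDelay;
  }

  // Off-switch assignment is allowed after L * C / N missed opportunities,
  // capped by the size of the cluster.
  static double offSwitchThreshold(int locations, int containers, int clusterNodes) {
    double loadFactor = (double) locations * containers / clusterNodes; // L * C / N
    return Math.min(clusterNodes, loadFactor);                          // cap at N
  }
}
{code}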

This seems to work well for apps requesting a large number of containers on a 
relatively small cluster. Let's look at some examples:
* On a 100-node cluster, requesting 100+ containers, off-switch assignment is 
dictated by the size of the cluster. This corresponds to a typical MR 
application on a common cluster.
* On a 100-node cluster, requesting 5 containers on 2 nodes of a single rack 
will lead to off-switch assignment after 3 * 5 / 100 = 0.15 missed 
opportunities, i.e., after a single missed opportunity. This seems too 
pessimistic.
* On a 2000-node cluster, for any combination with L * C < 2000 (which should 
be the case more often than not), off-switch assignment happens after a single 
missed opportunity.

Note that most of our applications fall in the third category.
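Plugging the three examples into the sketch above (for instance by adding a 
{{main()}} to it; the L and C values for the third case are assumed for 
illustration):

{code:java}
public static void main(String[] args) {
  // 100-node cluster, 100+ containers on ~all nodes: threshold equals the cluster size.
  System.out.println(LocalityDelaySketch.offSwitchThreshold(100, 100, 100)); // 100.0
  // 100-node cluster, 5 containers on 2 nodes + 1 rack: a single missed opportunity suffices.
  System.out.println(LocalityDelaySketch.offSwitchThreshold(3, 5, 100));     // 0.15
  // 2000-node cluster, e.g. 10 containers on 50 locations (L * C < 2000): again a single miss.
  System.out.println(LocalityDelaySketch.offSwitchThreshold(50, 10, 2000));  // 0.25
}
{code}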

So it seems that the L * C / N load factor is not only too pessimistic, but it 
also effectively rules out rack assignments, since it kicks in too fast: if 
off-switch assignment kicks in after a single missed opportunity, we 
essentially invalidate rack assignments.
One possible way to mitigate this problem would be to multiply this load factor 
by the node-locality-delay when it comes to rack assignments, and by the 
rack-locality-delay when it comes to off-switch assignments.
This way we also "relax" the node-locality-delay, increasing the probability of 
a rack assignment, and we prevent relaxed locality from kicking in too soon.
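A minimal sketch of that idea, as additional methods on the sketch above 
(illustrative only, this is not what the uploaded patch does):

{code:java}
// Scale both delays by the L * C / N load factor, instead of using the load
// factor directly as the off-switch threshold.
static double proposedRackThreshold(double loadFactor, long nodeLocalityDelay) {
  return loadFactor * nodeLocalityDelay;  // rack assignment after this many missed opportunities
}

static double proposedOffSwitchThreshold(double loadFactor, long rackLocalityDelay) {
  return loadFactor * rackLocalityDelay;  // off-switch assignment after this many missed opportunities
}
{code}

For instance, with the 0.15 load factor of the second example and an assumed 
node-locality-delay of 40, rack assignment would be allowed after 0.15 * 40 = 6 
missed opportunities, and off-switch only after 0.15 * rack-locality-delay 
missed opportunities.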

But given that it might affect the behavior of existing applications, I would 
like to hear your opinions before making such a change.



> Rethinking OFF_SWITCH locality in CapacityScheduler
> ---------------------------------------------------
>
>                 Key: YARN-6344
>                 URL: https://issues.apache.org/jira/browse/YARN-6344
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>            Reporter: Konstantinos Karanasos
>            Assignee: Konstantinos Karanasos
>         Attachments: YARN-6344.001.patch
>
>
> When relaxing locality from node to rack, the {{node-locality-delay}} 
> parameter is used: when the scheduling opportunities for a scheduler key 
> exceed the value of this parameter, we relax locality and try to assign the 
> container to a node in the corresponding rack.
> On the other hand, when relaxing locality to off-switch (i.e., assigning the 
> container anywhere in the cluster), we use a {{localityWaitFactor}}, which is 
> computed as the number of outstanding requests for a specific scheduler key 
> divided by the size of the cluster.
> In case of applications that request containers in big batches (e.g., 
> traditional MR jobs), and for relatively small clusters, the 
> localityWaitFactor does not affect relaxing locality much.
> However, in case of applications that request containers in small batches, 
> this load factor takes a very small value, which leads to assigning 
> off-switch containers too soon. This situation is even more pronounced in big 
> clusters.
> For example, if an application requests only one container per request, the 
> locality will be relaxed after a single missed scheduling opportunity.
> The purpose of this JIRA is to rethink the way we are relaxing locality for 
> off-switch assignments.


