[ 
https://issues.apache.org/jira/browse/STORM-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Weathers updated STORM-1766:
---------------------------------
    Description: 
Currently the getBestClustering algorithm for RAS finds the "Best" cluster/rack 
based on which rack has the most available resources this may be insufficient 
and may cause topologies not to be able to be scheduled successfully even 
though there are enough resources to schedule it in the cluster. We attempt to 
find the rack with the most resources by find the rack with the biggest sum of 
available memory + available cpu. This method is not effective since it does 
not consider the number of slots available. This method also fails in 
identifying racks that are not schedulable due to the exhaustion of one of the 
resources either memory, cpu, or slots. The current implementation also tries 
the initial scheduling on one rack and not try to schedule on all the racks 
before giving up which may cause topologies to be failed to be scheduled due to 
the above mentioned shortcomings in the current method. Also the current method 
does not consider failures of workers. When executors of a topology gets 
unassigned and needs to be scheduled again, the current logic in 
getBestClustering may be inadequate if not complete wrong. When executors needs 
to rescheduled due to a fault, getBestClustering will likely return a cluster 
that is different from where the majority of executors from the topology is 
originally scheduling in.

Thus, I propose a different strategy/algorithm to find the "best" cluster. I 
have come up with a ordering strategy I dub subordinate resource availability 
ordering (inspired by Dominant Resource Fairness) that sorts racks by the 
subordinate (not dominant) resource availability.

For example given 4 racks with the following resource availabilities
{code}
//generate some that has alot of memory but little of cpu
rack-3 Avail [ CPU 100.0 MEM 200000.0 Slots 40 ] Total [ CPU 100.0 MEM 200000.0 
Slots 40 ]
//generate some supervisors that are depleted of one resource
rack-2 Avail [ CPU 0.0 MEM 80000.0 Slots 40 ] Total [ CPU 0.0 MEM 80000.0 Slots 
40 ]
//generate some that has a lot of cpu but little of memory
rack-4 Avail [ CPU 6100.0 MEM 10000.0 Slots 40 ] Total [ CPU 6100.0 MEM 10000.0 
Slots 40 ]
//generate another rack of supervisors with less resources than rack-0
rack-1 Avail [ CPU 2000.0 MEM 40000.0 Slots 40 ] Total [ CPU 2000.0 MEM 40000.0 
Slots 40 ]
rack-0 Avail [ CPU 4000.0 MEM 80000.0 Slots 40( ] Total [ CPU 4000.0 MEM 
80000.0 Slots 40 ]
Cluster Overall Avail [ CPU 12200.0 MEM 410000.0 Slots 200 ] Total [ CPU 
12200.0 MEM 410000.0 Slots 200 ]
{code}

It is clear that rack-0 is the best cluster since its the most balanced and can 
potentially schedule the most executors, while rack-2 is the worst rack since 
rack-2 is depleted of cpu resource thus rendering it unschedulable even though 
there are other resources available.

We first calculate the resource availability percentage of all the racks for 
each resource by computing:
{code}
(resource available on rack) / (resource available in cluster)
{code}

We do this calculation to normalize the values otherwise the resource values 
would not be comparable.

So for our example:
{code}
rack-3 Avail [ CPU 0.819672131147541% MEM 48.78048780487805% Slots 20.0% ] 
effective resources: 0.00819672131147541
rack-2 Avail [ 0.0% MEM 19.51219512195122% Slots 20.0% ] effective resources: 
0.0
rack-4 Avail [ CPU 50.0% MEM 2.4390243902439024% Slots 20.0% ] effective 
resources: 0.024390243902439025
rack-1 Avail [ CPU 16.39344262295082% MEM 9.75609756097561% Slots 20.0% ] 
effective resources: 0.0975609756097561
rack-0 Avail [ CPU 32.78688524590164% MEM 19.51219512195122% Slots 20.0% ] 
effective resources: 0.1951219512195122
{code}

The effective resource of a rack, which is also the subordinate resource, is 
computed by: 
{code}
MIN(resource availability percentage of {CPU, Memory, # of free Slots}).
{code}
Then we order the racks by the effective resource.

Thus for our example:
{code}
Sorted rack: [rack-0, rack-1, rack-4, rack-3, rack-2]
{code}
Also to deal with the presence of failures, if a topology is partially 
scheduled, we find the rack with the most scheduled executors for the topology 
and we try to schedule on that rack first.

Thus for the sorting for racks. We first sort by the number of executors 
already scheduled on the rack and then by the subordinate resource availability.

  was:
Currently the getBestClustering algorithm for RAS finds the "Best" cluster/rack 
based on which rack has the most available resources this may be insufficient 
and may cause topologies not to be able to be scheduled successfully even 
though there are enough resources to schedule it in the cluster. We attempt to 
find the rack with the most resources by find the rack with the biggest sum of 
available memory + available cpu. This method is not effective since it does 
not consider the number of slots available. This method also fails in 
identifying racks that are not schedulable due to the exhaustion of one of the 
resources either memory, cpu, or slots. The current implementation also tries 
the initial scheduling on one rack and not try to schedule on all the racks 
before giving up which may cause topologies to be failed to be scheduled due to 
the above mentioned shortcomings in the current method. Also the current method 
does not consider failures of workers. When executors of a topology gets 
unassigned and needs to be scheduled again, the current logic in 
getBestClustering may be inadequate if not complete wrong. When executors needs 
to rescheduled due to a fault, getBestClustering will likely return a cluster 
that is different from where the majority of executors from the topology is 
originally scheduling in.

Thus, I propose a different strategy/algorithm to find the "best" cluster. I 
have come up with a ordering strategy I dub subordinate resource availability 
ordering (inspired by Dominant Resource Fairness) that sorts racks by the 
subordinate (not dominant) resource availability.

For example given 4 racks with the following resource availabilities
{code}
//generate some that has alot of memory but little of cpu
rack-3 Avail [ CPU 100.0 MEM 200000.0 Slots 40 ] Total [ CPU 100.0 MEM 200000.0 
Slots 40 ]
//generate some supervisors that are depleted of one resource
rack-2 Avail [ CPU 0.0 MEM 80000.0 Slots 40 ] Total [ CPU 0.0 MEM 80000.0 Slots 
40 ]
//generate some that has a lot of cpu but little of memory
rack-4 Avail [ CPU 6100.0 MEM 10000.0 Slots 40 ] Total [ CPU 6100.0 MEM 10000.0 
Slots 40 ]
//generate another rack of supervisors with less resources than rack-0
rack-1 Avail [ CPU 2000.0 MEM 40000.0 Slots 40 ] Total [ CPU 2000.0 MEM 40000.0 
Slots 40 ]
rack-0 Avail [ CPU 4000.0 MEM 80000.0 Slots 40( ] Total [ CPU 4000.0 MEM 
80000.0 Slots 40 ]
Cluster Overall Avail [ CPU 12200.0 MEM 410000.0 Slots 200 ] Total [ CPU 
12200.0 MEM 410000.0 Slots 200 ]
{code}

It is clear that rack-0 is the best cluster since its the most balanced and can 
potentially schedule the most executors, while rack-2 is the worst rack since 
rack-2 is depleted of cpu resource thus rendering it unschedulable even though 
there are other resources available.

We first calculate the resource availability percentage of all the racks for 
each resource by computing: (resource available on rack) / (resource available 
in cluster)

We do this calculation to normalize the values otherwise the resource values 
would not be comparable.

So for our example:
{code}
rack-3 Avail [ CPU 0.819672131147541% MEM 48.78048780487805% Slots 20.0% ] 
effective resources: 0.00819672131147541
rack-2 Avail [ 0.0% MEM 19.51219512195122% Slots 20.0% ] effective resources: 
0.0
rack-4 Avail [ CPU 50.0% MEM 2.4390243902439024% Slots 20.0% ] effective 
resources: 0.024390243902439025
rack-1 Avail [ CPU 16.39344262295082% MEM 9.75609756097561% Slots 20.0% ] 
effective resources: 0.0975609756097561
rack-0 Avail [ CPU 32.78688524590164% MEM 19.51219512195122% Slots 20.0% ] 
effective resources: 0.1951219512195122
{code}

The effective resource of a rack, which is also the subordinate resource, is 
computed by: 
{code}
MIN(resource availability percentage of {CPU, Memory, # of free Slots}).
{code}
Then we order the racks by the effective resource.

Thus for our example:
{code}
Sorted rack: [rack-0, rack-1, rack-4, rack-3, rack-2]
{code}
Also to deal with the presence of failures, if a topology is partially 
scheduled, we find the rack with the most scheduled executors for the topology 
and we try to schedule on that rack first.

Thus for the sorting for racks. We first sort by the number of executors 
already scheduled on the rack and then by the subordinate resource availability.


> A better algorithm server rack selection for RAS
> ------------------------------------------------
>
>                 Key: STORM-1766
>                 URL: https://issues.apache.org/jira/browse/STORM-1766
>             Project: Apache Storm
>          Issue Type: Improvement
>            Reporter: Boyang Jerry Peng
>            Assignee: Boyang Jerry Peng
>
> Currently the getBestClustering algorithm for RAS finds the "Best" 
> cluster/rack based on which rack has the most available resources this may be 
> insufficient and may cause topologies not to be able to be scheduled 
> successfully even though there are enough resources to schedule it in the 
> cluster. We attempt to find the rack with the most resources by find the rack 
> with the biggest sum of available memory + available cpu. This method is not 
> effective since it does not consider the number of slots available. This 
> method also fails in identifying racks that are not schedulable due to the 
> exhaustion of one of the resources either memory, cpu, or slots. The current 
> implementation also tries the initial scheduling on one rack and not try to 
> schedule on all the racks before giving up which may cause topologies to be 
> failed to be scheduled due to the above mentioned shortcomings in the current 
> method. Also the current method does not consider failures of workers. When 
> executors of a topology gets unassigned and needs to be scheduled again, the 
> current logic in getBestClustering may be inadequate if not complete wrong. 
> When executors needs to rescheduled due to a fault, getBestClustering will 
> likely return a cluster that is different from where the majority of 
> executors from the topology is originally scheduling in.
> Thus, I propose a different strategy/algorithm to find the "best" cluster. I 
> have come up with a ordering strategy I dub subordinate resource availability 
> ordering (inspired by Dominant Resource Fairness) that sorts racks by the 
> subordinate (not dominant) resource availability.
> For example given 4 racks with the following resource availabilities
> {code}
> //generate some that has alot of memory but little of cpu
> rack-3 Avail [ CPU 100.0 MEM 200000.0 Slots 40 ] Total [ CPU 100.0 MEM 
> 200000.0 Slots 40 ]
> //generate some supervisors that are depleted of one resource
> rack-2 Avail [ CPU 0.0 MEM 80000.0 Slots 40 ] Total [ CPU 0.0 MEM 80000.0 
> Slots 40 ]
> //generate some that has a lot of cpu but little of memory
> rack-4 Avail [ CPU 6100.0 MEM 10000.0 Slots 40 ] Total [ CPU 6100.0 MEM 
> 10000.0 Slots 40 ]
> //generate another rack of supervisors with less resources than rack-0
> rack-1 Avail [ CPU 2000.0 MEM 40000.0 Slots 40 ] Total [ CPU 2000.0 MEM 
> 40000.0 Slots 40 ]
> rack-0 Avail [ CPU 4000.0 MEM 80000.0 Slots 40( ] Total [ CPU 4000.0 MEM 
> 80000.0 Slots 40 ]
> Cluster Overall Avail [ CPU 12200.0 MEM 410000.0 Slots 200 ] Total [ CPU 
> 12200.0 MEM 410000.0 Slots 200 ]
> {code}
> It is clear that rack-0 is the best cluster since its the most balanced and 
> can potentially schedule the most executors, while rack-2 is the worst rack 
> since rack-2 is depleted of cpu resource thus rendering it unschedulable even 
> though there are other resources available.
> We first calculate the resource availability percentage of all the racks for 
> each resource by computing:
> {code}
> (resource available on rack) / (resource available in cluster)
> {code}
> We do this calculation to normalize the values otherwise the resource values 
> would not be comparable.
> So for our example:
> {code}
> rack-3 Avail [ CPU 0.819672131147541% MEM 48.78048780487805% Slots 20.0% ] 
> effective resources: 0.00819672131147541
> rack-2 Avail [ 0.0% MEM 19.51219512195122% Slots 20.0% ] effective resources: 
> 0.0
> rack-4 Avail [ CPU 50.0% MEM 2.4390243902439024% Slots 20.0% ] effective 
> resources: 0.024390243902439025
> rack-1 Avail [ CPU 16.39344262295082% MEM 9.75609756097561% Slots 20.0% ] 
> effective resources: 0.0975609756097561
> rack-0 Avail [ CPU 32.78688524590164% MEM 19.51219512195122% Slots 20.0% ] 
> effective resources: 0.1951219512195122
> {code}
> The effective resource of a rack, which is also the subordinate resource, is 
> computed by: 
> {code}
> MIN(resource availability percentage of {CPU, Memory, # of free Slots}).
> {code}
> Then we order the racks by the effective resource.
> Thus for our example:
> {code}
> Sorted rack: [rack-0, rack-1, rack-4, rack-3, rack-2]
> {code}
> Also to deal with the presence of failures, if a topology is partially 
> scheduled, we find the rack with the most scheduled executors for the 
> topology and we try to schedule on that rack first.
> Thus for the sorting for racks. We first sort by the number of executors 
> already scheduled on the rack and then by the subordinate resource 
> availability.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to