[ https://issues.apache.org/jira/browse/STORM-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15418155#comment-15418155 ]
ASF GitHub Bot commented on STORM-1766:
---------------------------------------

Github user harshach commented on a diff in the pull request:

    https://github.com/apache/storm/pull/1621#discussion_r74524961

    --- Diff: storm-core/src/jvm/org/apache/storm/scheduler/Cluster.java ---
    @@ -103,6 +103,9 @@ public Cluster(Cluster src) {
             this.status.putAll(src.status);
             this.topologyResources.putAll(src.topologyResources);
             this.blackListedHosts.addAll(src.blackListedHosts);
    +        if (src.networkTopography != null) {
    --- End diff --

    Is this supposed to be == null? Why are we creating a new Map if there is one already?


> A better server rack selection algorithm for RAS
> -------------------------------------------------
>
>                 Key: STORM-1766
>                 URL: https://issues.apache.org/jira/browse/STORM-1766
>             Project: Apache Storm
>          Issue Type: Improvement
>            Reporter: Boyang Jerry Peng
>            Assignee: Boyang Jerry Peng
>             Fix For: 2.0.0
>
>
> Currently the getBestClustering algorithm for RAS finds the "best" cluster/rack based on which rack has the most available resources. This may be insufficient and may cause topologies to fail to be scheduled even though there are enough resources in the cluster to schedule them. We attempt to find the rack with the most resources by finding the rack with the biggest sum of available memory + available CPU. This method is not effective since it does not consider the number of available slots, and it also fails to identify racks that are unschedulable because one of the resources (memory, CPU, or slots) is exhausted. The current implementation also tries the initial scheduling on only one rack, rather than trying all the racks before giving up, which may cause topologies to fail to be scheduled due to the shortcomings mentioned above. The current method also does not consider worker failures. When executors of a topology get unassigned and need to be scheduled again, the current logic in getBestClustering may be inadequate if not completely wrong: when executors need to be rescheduled due to a fault, getBestClustering will likely return a rack different from the one where the majority of the topology's executors were originally scheduled.
> Thus, I propose a different strategy/algorithm to find the "best" rack. I have come up with an ordering strategy I dub subordinate resource availability ordering (inspired by Dominant Resource Fairness) that sorts racks by the subordinate (not dominant) resource availability.
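>
> As a minimal sketch of what this ordering could look like (hypothetical Rack class and field names, not the actual patch): each rack's available CPU, memory, and slots are normalized by the cluster-wide availability, the minimum of the three becomes the rack's effective (subordinate) resource, and racks are sorted by executors already placed there and then by effective resource. The worked example below goes through the same calculation on concrete numbers.
> {code}
> import java.util.Comparator;
> import java.util.List;
>
> class Rack {
>     String id;
>     double availCpu;     // available CPU on this rack
>     double availMem;     // available memory on this rack
>     int availSlots;      // free worker slots on this rack
>     int scheduledExecs;  // executors of this topology already on this rack
> }
>
> class SubordinateResourceOrdering {
>     // effective (subordinate) resource = MIN of the rack's availability
>     // normalized by the cluster-wide availability of each resource
>     static double effectiveResource(Rack r, double clusterCpu, double clusterMem, int clusterSlots) {
>         double cpuPct  = clusterCpu   > 0 ? r.availCpu   / clusterCpu   : 0.0;
>         double memPct  = clusterMem   > 0 ? r.availMem   / clusterMem   : 0.0;
>         double slotPct = clusterSlots > 0 ? (double) r.availSlots / clusterSlots : 0.0;
>         return Math.min(cpuPct, Math.min(memPct, slotPct));
>     }
>
>     // Sort racks best-first: prefer racks already hosting executors of the
>     // topology (to keep a partially scheduled topology together), then
>     // prefer the highest effective resource.
>     static void sortRacks(List<Rack> racks, double clusterCpu, double clusterMem, int clusterSlots) {
>         racks.sort(Comparator
>             .comparingInt((Rack r) -> r.scheduledExecs).reversed()
>             .thenComparing(Comparator.comparingDouble(
>                 (Rack r) -> effectiveResource(r, clusterCpu, clusterMem, clusterSlots)).reversed()));
>     }
> }
> {code}
> Taking the minimum normalized resource penalizes racks where any single resource is exhausted, which is why a rack depleted of CPU sorts last even if it still has memory and slots.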
> For example, given 4 racks with the following resource availabilities:
> {code}
> //generate some that have a lot of memory but little CPU
> rack-3 Avail [ CPU 100.0 MEM 200000.0 Slots 40 ] Total [ CPU 100.0 MEM 200000.0 Slots 40 ]
> //generate some supervisors that are depleted of one resource
> rack-2 Avail [ CPU 0.0 MEM 80000.0 Slots 40 ] Total [ CPU 0.0 MEM 80000.0 Slots 40 ]
> //generate some that have a lot of CPU but little memory
> rack-4 Avail [ CPU 6100.0 MEM 10000.0 Slots 40 ] Total [ CPU 6100.0 MEM 10000.0 Slots 40 ]
> //generate another rack of supervisors with fewer resources than rack-0
> rack-1 Avail [ CPU 2000.0 MEM 40000.0 Slots 40 ] Total [ CPU 2000.0 MEM 40000.0 Slots 40 ]
> rack-0 Avail [ CPU 4000.0 MEM 80000.0 Slots 40 ] Total [ CPU 4000.0 MEM 80000.0 Slots 40 ]
> Cluster Overall Avail [ CPU 12200.0 MEM 410000.0 Slots 200 ] Total [ CPU 12200.0 MEM 410000.0 Slots 200 ]
> {code}
> It is clear that rack-0 is the best rack since it is the most balanced and can potentially schedule the most executors, while rack-2 is the worst rack since it is depleted of CPU, rendering it unschedulable even though it still has other resources available.
> We first calculate the resource availability percentage of every rack for each resource by computing:
> {code}
> (resource available on rack) / (resource available in cluster)
> {code}
> We do this calculation to normalize the values; otherwise the resource values would not be comparable.
> So for our example:
> {code}
> rack-3 Avail [ CPU 0.819672131147541% MEM 48.78048780487805% Slots 20.0% ] effective resources: 0.00819672131147541
> rack-2 Avail [ CPU 0.0% MEM 19.51219512195122% Slots 20.0% ] effective resources: 0.0
> rack-4 Avail [ CPU 50.0% MEM 2.4390243902439024% Slots 20.0% ] effective resources: 0.024390243902439025
> rack-1 Avail [ CPU 16.39344262295082% MEM 9.75609756097561% Slots 20.0% ] effective resources: 0.0975609756097561
> rack-0 Avail [ CPU 32.78688524590164% MEM 19.51219512195122% Slots 20.0% ] effective resources: 0.1951219512195122
> {code}
> The effective resource of a rack, which is also the subordinate resource, is computed by:
> {code}
> MIN(resource availability percentage of {CPU, Memory, # of free Slots})
> {code}
> Then we order the racks by their effective resource.
> Thus for our example:
> {code}
> Sorted racks: [rack-0, rack-1, rack-4, rack-3, rack-2]
> {code}
> Also, to deal with the presence of failures, if a topology is partially scheduled, we find the rack with the most executors already scheduled for the topology and try to schedule on that rack first.
> Thus, when sorting the racks, we first sort by the number of executors already scheduled on the rack and then by the subordinate resource availability.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)