[ 
https://issues.apache.org/jira/browse/CASSANDRA-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902082#action_12902082
 ] 

Stu Hood commented on CASSANDRA-1418:
-------------------------------------

* One more component: trunk only implements 'case 2' of Ruhl's. It doesn't 
optimize for case 1, where a node is assuming load from a direct neighbor, and 
doesn't need to fully dump-data/leave-ring. We'll probably want a separate 
subtask to implement this portion.
* Calculation of whether a node should move (based on the ε ratio to it's 
neighbor) should probably change to additionally depend on whether the movement 
would bring both nodes closer to the ideal load ω.

> Automatic, online load balancing
> --------------------------------
>
>                 Key: CASSANDRA-1418
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1418
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Stu Hood
>             Fix For: 0.8
>
>
> h2. Goal
> CASSANDRA-192 began with the intention of implementing full cluster load 
> balancing, but ended up being (wisely) limited to a manual load balancing 
> operation. This issue is an umbrella ticket for finishing the job of 
> implementing automatic, always-on load balancing.
> It is possible to implement very efficient load balancing operations with a 
> single process directing the rebalancing of all nodes, but avoiding such a 
> central process and allowing individual nodes to make their own movement 
> decisions would be ideal.
> h2. Components
> h3. Optimal movements for individual nodes
> h4. Ruhl
> One such approach is the Ruhl algorithm described on 192: 
> https://issues.apache.org/jira/browse/CASSANDRA-192#action_12713079 . But as 
> described, it performs excessive movement for large hotspots, and can take a 
> long time to reach equilibrium. Consider the following ring:
> ||token||load||
> |a|5|
> |c|5|
> |e|5|
> |f|40|
> |k|5|
> Assuming that node 'a' is the first to discover that 'f' is overloaded: it 
> will apply Case 2, and assume half of 'f's load by moving to 'i', leaving 
> both with 20 units. But this is not a optimal movement, because both 'f' and 
> 'a/i' will still be holding data that they will need to give away. 
> Additionally, 'a/i' can't begin giving the data away until it has finished 
> receiving it.
> If node 'e' is the first to discover that 'f' is overloaded, it will apply 
> Case 1, and 'f' will give half of its load to 'e' by moving to 'i'. Again, 
> this is a non-optimal movement, because it will result in both 'e' and 'f/i' 
> holding data that they need to give away.
> h4. Adding load awareness to Ruhl
> Luckily, there appears to be a simple adjustment to the Ruhl algorithm that 
> solves this problem by taking advantage of the fact that Cassandra knows the 
> total load of a cluster, and can use it to calculate the average/ideal load 
> ω. Once node j has decided it should take load from node i (based on the ε 
> value in Ruhl), rather than node j taking 1/2 of the load on node i, it 
> should chose a token such that either i or j ends up with a load within ε*ω 
> of ω.
> Again considering the ring described above, and assuming ε == 1.0, the total 
> load for the 5 nodes is 60, giving a ω of 12. If node 'a' is the first to 
> discover 'f', it will choose to move to 'j' (a token that takes 12 or ω load 
> units from 'f'), leaving 'f' with a load of 28. When combined with the 
> improvement in the next section, this is closer to being an optimal movement, 
> because 'a/j' will at worst have ε*ω of load to give away, and 'f' is in a 
> position to start more movements.
> h3. Automatic load balancing
> Since the Ruhl algorithm only requires a node to make a decision based on 
> itself and one other node, it should be relatively straightforward to add a 
> timer on each node that periodically wakes up and executes the modifiied Ruhl 
> algorithm if it is not already in the process of moving (based on pending 
> ranges).
> Automatic balancing should probably be enabled by default, and should have a 
> configurable per-node bandwidth cap.
> h3. Allowing concurrent moves on a node
> Allowing a node to give away multiple ranges at once allows for the type of 
> quick balancing that is typically only attributed to vnodes. If a node is a 
> hotspot, such as in the example above, the node should be able to quickly 
> dump the load in a manner that causes minimal load on the rest of the 
> cluster. Rather than transferring to 1 target at 10 MB/s, a hotspot can give 
> to 5 targets at 2 MB/s each.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to