[jira] [Comment Edited] (CASSANDRA-7032) Improve vnode allocation

Branimir Lambov (JIRA) Fri, 09 Jan 2015 08:43:56 -0800

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14271501#comment-14271501
 ]


Branimir Lambov edited comment on CASSANDRA-7032 at 1/9/15 4:41 PM:
--------------------------------------------------------------------

A work-in-progress algorithm for selecting vnodes in the replicated case is 
attached. The main idea of the algorithm is to select token positions for each 
new vnode in such a way as to get best improvement in replicated load variance 
(i.e. standard deviation) across nodes and vnodes *1. More specifically, it 
prepares a selection of token positions to try (by picking the middle positions 
between existing vnodes *2), evaluates the expected improvement in variance for 
each selection and chooses the best *3, continuing until all the vnodes of the 
new node have assigned tokens. To improve average performance, the expected 
improvement for all choices is calculated once; for the second and later vnode 
we only recalculate it for the best candidate until we find one that does not 
deteriorate to worse than the next option in the list *4.

Tested with simple factor-3 replication, it maintains the following utilization 
ranges: 
- 1 vnode: 70% - 135%
- 4 vnodes: 80% - 115%
- 16 vnodes: 83 - 106%
- 64 vnodes: 86 - 103%
- 256 vnodes: 87 - 102%

Unlike random allocation, the overutilization does not grow with the number of 
nodes, and a much smaller number of vnodes suffice (4 or 8 vnodes would 
probably be enough for most usecases).

The underutilization for this algorithm is affected less by the number of 
vnodes; this is due to the effect of replication on newly added vnodes: they 
necessarily have to take the share of one fewer vnode replica than the vnode 
they split (regardless of the algorithm we use, if we add a new node to a large 
enough perfectly balanced cluster where all vnodes are responsible for the same 
share of tokens, the new node will necessarily have at most 2/3 (for RF=3) of 
the average load.). This could possibly be improved if we manage to keep enough 
individual tokens with load closer to RF / (RF - 1), which I've yet to try.

The algorithm is implemented in the {{ReplicationAwareTokenDistributor}} in the 
attached file. Running the file simulates the effect of adding nodes using this 
algorithm on a randomly-generated cluster and prints out the minimum and 
maximum per-node and per-token replicated load after each step, as well as the 
standard deviation of the load. Sample results:
{code}
Random generation of 500 nodes with 8 tokens each
Size 500   node max 1.88 min 0.51 stddev 0.22193
Adding 1 node(s) using ReplicationAwareTokenDistributor
Size 501   node max 1.90 min 0.51 stddev 0.21922   Simple 3 replicas
Adding 4 node(s) using ReplicationAwareTokenDistributor
Size 505   node max 1.63 min 0.51 stddev 0.20580   token max 3.72 min 0.01 
stddev 0.58768   Simple 3 replicas
Adding 15 node(s) using ReplicationAwareTokenDistributor
Size 520   node max 1.51 min 0.53 stddev 0.17369   token max 3.83 min 0.01 
stddev 0.54526   Simple 3 replicas
Adding 105 node(s) using ReplicationAwareTokenDistributor
Size 625   node max 1.15 min 0.63 stddev 0.08069   token max 2.73 min 0.00 
stddev 0.40190   Simple 3 replicas
Adding 375 node(s) using ReplicationAwareTokenDistributor
Size 1000   node max 1.08 min 0.84 stddev 0.03041   token max 1.99 min 0.00 
stddev 0.22341   Simple 3 replicas
Losing 1 nodes
Size 999   node max 1.09 min 0.84 stddev 0.03081   token max 1.98 min 0.00 
stddev 0.22429   Simple 3 replicas
Adding 1 node(s) using ReplicationAwareTokenDistributor
Size 1000   node max 1.08 min 0.84 stddev 0.03019   token max 1.99 min 0.00 
stddev 0.22335   Simple 3 replicas
Losing 5 nodes
Size 995   node max 1.17 min 0.83 stddev 0.03380   token max 2.01 min 0.00 
stddev 0.22565   Simple 3 replicas
Adding 5 node(s) using ReplicationAwareTokenDistributor
Size 1000   node max 1.08 min 0.84 stddev 0.03000   token max 1.99 min 0.00 
stddev 0.22181   Simple 3 replicas
Losing 20 nodes
Size 980   node max 1.19 min 0.88 stddev 0.04362   token max 2.44 min 0.00 
stddev 0.23370   Simple 3 replicas
Adding 20 node(s) using ReplicationAwareTokenDistributor
Size 1000   node max 1.08 min 0.89 stddev 0.02962   token max 1.99 min 0.00 
stddev 0.21681   Simple 3 replicas
Losing 125 nodes
Size 875   node max 1.31 min 0.79 stddev 0.08499   token max 2.81 min 0.00 
stddev 0.28763   Simple 3 replicas
Adding 125 node(s) using ReplicationAwareTokenDistributor
Size 1000   node max 1.08 min 0.90 stddev 0.02805   token max 1.85 min 0.00 
stddev 0.19258   Simple 3 replicas
{code}

This is far from finished as it is much slower than I'd like it to be.



Notes / other things I've tried:
 - *1 Only controlling individual vnode load: Because of the replication effect 
mentioned above, the ratio between largest and smallest node has to necessarily 
be at best 3:2 (for RF=3). If we don't control overall node size, about 30% 
over/underutilization is the best we can achieve regardless of vnode count. 
Moreover, when this is applied to a randomly-generated cluster, the first few 
added nodes tend to significantly increase the node overutilization in the 
cluster.
 - *1 Only controlling the load of the whole node: This gets better results, 
but it can enter pathological states where a large token is never split because 
that would create an even bigger new node. The latter happens too often for 
this option to be useable.
 - *2 Picking a choice of e.g. 4 tokens splitting each existing vnode in 1/5 
increments does not appear to give any noticeable improvement over just picking 
the middle for RF>1.
 - *3 The evaluation of the expected improvement needs to include a component 
for the load on the new node as well. To be able to assign vnodes independently 
(otherwise the complexity of the procedure would go out of control) we use the 
following heuristic: assume the new node starts at an optimal size, optimally 
split between the vnodes, and gradually replace this ideal view with the actual 
assigned tokens as they come. An additional weight of filled_vnodes/vnode_count 
is applied to the new node to allow for greater flexibility in choosing the 
first tokens, which helps avoid the pathological cases.
 - *4 This heuristic does not necessarily yield the optimal choice, because 
assigning a token changes the expected improvement for the load of the new node 
for the other candidates and there is a chance that we will pick a candidate 
before reaching a better option that has improved a lot. In practice the 
heuristic appears to give results that are very close to picking the actual 
best candidate.
 



was (Author: blambov):
A work-in-progress algorithm for selecting vnodes in the replicated case is 
attached. The main idea of the algorithm is to select token positions for each 
new vnode in such a way as to get best improvement in replicated load variance 
(i.e. standard deviation) across nodes and vnodes *1. More specifically, it 
prepares a selection of token positions to try (by picking the middle positions 
between existing vnodes *2), evaluates the expected improvement in variance for 
each selection and chooses the best *3, continuing until all the vnodes of the 
new node have assigned tokens. To improve average performance, the expected 
improvement for all choices is calculated once; for the second and later vnode 
we only recalculate it for the best candidate until we find one that does not 
deteriorate to worse than the next option in the list *4.

Tested with simple factor-3 replication, it maintains the following utilization 
ranges: 
- 1 vnode: 70% - 135%
- 4 vnodes: 80% - 115%
- 16 vnodes: 83 - 106%
- 64 vnodes: 86 - 103%
- 256 vnodes: 87 - 102%

Unlike random allocation, the overutilization does not grow with the number of 
nodes, and a much smaller number of vnodes suffice (4 or 8 vnodes would 
probably be enough for most usecases).

The underutilization for this algorithm is affected less by the number of 
vnodes; this is due to the effect of replication on newly added vnodes: they 
necessarily have to take the share of one fewer vnode replica than the vnode 
they split (regardless of the algorithm we use, if we add a new node to a large 
enough perfectly balanced cluster where all vnodes are responsible for the same 
share of tokens, the new node will necessarily have at most 2/3 (for RF=3) of 
the average load.). This could possibly be improved if we manage to keep enough 
individual tokens with load closer to RF / (RF - 1), which I've yet to try.

The algorithm is implemented in the {{ReplicationAwareTokenDistributor}} in the 
attached file. Running the file simulates the effect of adding nodes using this 
algorithm on a randomly-generated cluster and prints out the minimum and 
maximum per-node and per-token replicated load after each step, as well as the 
standard deviation of the load. Sample results:
{code}
Random generation of 500 nodes with 8 tokens each
Size 500   node max 1.88 min 0.51 stddev 0.22193
Adding 1 node(s) using ReplicationAwareTokenDistributor
Size 501   node max 1.90 min 0.51 stddev 0.21922    Simple 3 replicas
Adding 4 node(s) using ReplicationAwareTokenDistributor
Size 505   node max 1.63 min 0.51 stddev 0.20580   token max 3.72 min 0.01 
stddev 0.58768    Simple 3 replicas
Adding 15 node(s) using ReplicationAwareTokenDistributor
Size 520   node max 1.51 min 0.53 stddev 0.17369   token max 3.83 min 0.01 
stddev 0.54526    Simple 3 replicas
Adding 105 node(s) using ReplicationAwareTokenDistributor
Size 625   node max 1.15 min 0.63 stddev 0.08069   token max 2.73 min 0.00 
stddev 0.40190    Simple 3 replicas
Adding 375 node(s) using ReplicationAwareTokenDistributor
Size 1000   node max 1.08 min 0.84 stddev 0.03041   token max 1.99 min 0.00 
stddev 0.22341    Simple 3 replicas
Losing 1 nodes
Size 999   node max 1.09 min 0.84 stddev 0.03081   token max 1.98 min 0.00 
stddev 0.22429    Simple 3 replicas
Adding 1 node(s) using ReplicationAwareTokenDistributor
Size 1000   node max 1.08 min 0.84 stddev 0.03019   token max 1.99 min 0.00 
stddev 0.22335    Simple 3 replicas
Losing 5 nodes
Size 995   node max 1.17 min 0.83 stddev 0.03380   token max 2.01 min 0.00 
stddev 0.22565    Simple 3 replicas
Adding 5 node(s) using ReplicationAwareTokenDistributor
Size 1000   node max 1.08 min 0.84 stddev 0.03000   token max 1.99 min 0.00 
stddev 0.22181    Simple 3 replicas
Losing 20 nodes
Size 980   node max 1.19 min 0.88 stddev 0.04362   token max 2.44 min 0.00 
stddev 0.23370    Simple 3 replicas
Adding 20 node(s) using ReplicationAwareTokenDistributor
Size 1000   node max 1.08 min 0.89 stddev 0.02962   token max 1.99 min 0.00 
stddev 0.21681    Simple 3 replicas
Losing 125 nodes
Size 875   node max 1.31 min 0.79 stddev 0.08499   token max 2.81 min 0.00 
stddev 0.28763    Simple 3 replicas
Adding 125 node(s) using ReplicationAwareTokenDistributor
Size 1000   node max 1.08 min 0.90 stddev 0.02805   token max 1.85 min 0.00 
stddev 0.19258    Simple 3 replicas
{code}

This is far from finished as it is much slower than I'd like it to be.



Notes / other things I've tried:
 - *1 Only controlling individual vnode load: Because of the replication effect 
mentioned above, the ratio between largest and smallest node has to necessarily 
be at best 3:2 (for RF=3). If we don't control overall node size, about 30% 
over/underutilization is the best we can achieve regardless of vnode count. 
Moreover, when this is applied to a randomly-generated cluster, the first few 
added nodes tend to significantly increase the node overutilization in the 
cluster.
 - *1 Only controlling the load of the whole node: This gets better results, 
but it can enter pathological states where a large token is never split because 
that would create an even bigger new node. The latter happens too often for 
this option to be useable.
 - *2 Picking a choice of e.g. 4 tokens splitting each existing vnode in 1/5 
increments does not appear to give any noticeable improvement over just picking 
the middle for RF>1.
 - *3 The evaluation of the expected improvement needs to include a component 
for the load on the new node as well. To be able to assign vnodes independently 
(otherwise the complexity of the procedure would go out of control) we use the 
following heuristic: assume the new node starts at an optimal size, optimally 
split between the vnodes, and gradually replace this ideal view with the actual 
assigned tokens as they come. An additional weight of filled_vnodes/vnode_count 
is applied to the new node to allow for greater flexibility in choosing the 
first tokens, which helps avoid the pathological cases.
 - *4 This heuristic does not necessarily yield the optimal choice, because 
assigning a token changes the expected improvement for the load of the new node 
for the other candidates and there is a chance that we will pick a candidate 
before reaching a better option that has improved a lot. In practice the 
heuristic appears to give results that are very close to picking the actual 
best candidate.
 


> Improve vnode allocation
> ------------------------
>
>                 Key: CASSANDRA-7032
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7032
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>            Assignee: Branimir Lambov
>              Labels: performance, vnodes
>             Fix For: 3.0
>
>         Attachments: TestVNodeAllocation.java, TestVNodeAllocation.java, 
> TestVNodeAllocation.java, TestVNodeAllocation.java
>
>
> It's been known for a little while that random vnode allocation causes 
> hotspots of ownership. It should be possible to improve dramatically on this 
> with deterministic allocation. I have quickly thrown together a simple greedy 
> algorithm that allocates vnodes efficiently, and will repair hotspots in a 
> randomly allocated cluster gradually as more nodes are added, and also 
> ensures that token ranges are fairly evenly spread between nodes (somewhat 
> tunably so). The allocation still permits slight discrepancies in ownership, 
> but it is bound by the inverse of the size of the cluster (as opposed to 
> random allocation, which strangely gets worse as the cluster size increases). 
> I'm sure there is a decent dynamic programming solution to this that would be 
> even better.
> If on joining the ring a new node were to CAS a shared table where a 
> canonical allocation of token ranges lives after running this (or a similar) 
> algorithm, we could then get guaranteed bounds on the ownership distribution 
> in a cluster. This will also help for CASSANDRA-6696.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (CASSANDRA-7032) Improve vnode allocation

Reply via email to