[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16098599#comment-16098599 ]

Branimir Lambov commented on CASSANDRA-7032:
--------------------------------------------

I can't think of a way to do such a reduction on an existing DC. Adding a new node with the same number of vnodes should improve the distribution of tokens within the DC, but adding one with e.g. 8 in a 256-vnode DC will just give it 32-times smaller ownership. Trying to gradually transition will involve a period of severely overloading one or more of the non-converted nodes. It looks like DC-by-DC is your best bet, unfortunately.

> Improve vnode allocation
> ------------------------
>
>                 Key: CASSANDRA-7032
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7032
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Benedict
>            Assignee: Branimir Lambov
>              Labels: performance, vnodes
>             Fix For: 3.0 alpha 1
>
>         Attachments: TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java
>
> It's been known for a little while that random vnode allocation causes hotspots of ownership. It should be possible to improve dramatically on this with deterministic allocation. I have quickly thrown together a simple greedy algorithm that allocates vnodes efficiently, and will repair hotspots in a randomly allocated cluster gradually as more nodes are added, and also ensures that token ranges are fairly evenly spread between nodes (somewhat tunably so). The allocation still permits slight discrepancies in ownership, but it is bound by the inverse of the size of the cluster (as opposed to random allocation, which strangely gets worse as the cluster size increases). I'm sure there is a decent dynamic programming solution to this that would be even better.
> If on joining the ring a new node were to CAS a shared table where a canonical allocation of token ranges lives after running this (or a similar) algorithm, we could then get guaranteed bounds on the ownership distribution in a cluster. This will also help for CASSANDRA-6696.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
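The 32-times figure above follows directly from ownership being roughly proportional to a node's share of the DC's tokens. A minimal sketch of the arithmetic (the class name and the 49-node count are illustrative assumptions, not from the ticket):

```java
// Expected ownership is proportional to a node's share of the DC's tokens,
// assuming tokens end up evenly spread over the ring.
public class VnodeOwnership
{
    public static double ownershipFraction(int nodeTokens, int totalTokens)
    {
        return (double) nodeTokens / totalTokens;
    }

    public static void main(String[] args)
    {
        // e.g. 49 existing nodes at 256 vnodes each, one new node at 8:
        int total = 49 * 256 + 8;
        double oldNode = ownershipFraction(256, total);
        double newNode = ownershipFraction(8, total);
        System.out.println(oldNode / newNode); // ratio of 256 to 8, i.e. ~32
    }
}
```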
[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16097733#comment-16097733 ]

Jay Zhuang commented on CASSANDRA-7032:
---------------------------------------

Hi [~blambov], [~dikanggu], any suggestions for reducing the number of tokens on an existing cluster with live traffic? For example, one solution is to re-create DC by DC with the new algorithm, but that requires tearing down the whole DC. Is it possible to add new nodes one by one with the new algorithm and fewer vnodes, then decommission the older nodes one by one? Will the tokens still be balanced after that?
[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15857895#comment-15857895 ]

burak ibrahim sevindi commented on CASSANDRA-7032:
--------------------------------------------------

Are there any insights on lowering the number of default vnodes in cassandra.yaml?
[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14943678#comment-14943678 ]

Sebastian Estevez commented on CASSANDRA-7032:
----------------------------------------------

Is there a way we can allow this to work without specifying the keyspace? How about using a weighted average of ownership across all existing keyspaces?
[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602624#comment-14602624 ]

Branimir Lambov commented on CASSANDRA-7032:
--------------------------------------------

Yes, if we change the vnode allocation to be done by this algorithm by default. That may be the right thing to do, but perhaps we should be a bit cautious and let it be tested in the real world first as a non-default.
[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601551#comment-14601551 ]

T Jake Luciani commented on CASSANDRA-7032:
-------------------------------------------

Should we consider lowering the number of default vnodes in {{cassandra.yaml}}?
[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595838#comment-14595838 ]

Branimir Lambov commented on CASSANDRA-7032:
--------------------------------------------

Done, including the static access change for the commitlog code. [testall|http://cassci.datastax.com/view/Dev/view/blambov/job/blambov-7032-vnode-assignment-testall/] is OK; [dtest|http://cassci.datastax.com/view/Dev/view/blambov/job/blambov-7032-vnode-assignment-dtest/] includes a failure in {{incremental_repair_test.TestIncRepair.sstable_repairedset_test}}, which last failed ~20 builds ago on trunk.
[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594566#comment-14594566 ]

Benedict commented on CASSANDRA-7032:
-------------------------------------

Ok, it's looking good to me. A couple of final things to do before commit:

# Swap the order of static {access scope} to {access scope} static - this doesn't appear to be in our code style, which looks like an oversight, but there are thousands of occurrences of the latter and only your occurrences of the former :)
#* It may be worth sneaking in the swap for your CL code as well, although I can ninja that separately
# Rebase, modify CHANGES.txt and get clean cassci runs

We should then file a follow-up for improving how users can access this functionality. It may be helpful to introduce a nodetool command for rebalancing the cluster through gradual application of this algorithm to more nodes. It would be neat to potentially introduce a modification to this algorithm for evaluating token _removal_, so that we could reduce the number of vnodes in an existing cluster, adding just the necessary number to get a good result. A simpler goal would be to support bootstrapping a vnode cluster without a keyspace present.
[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582033#comment-14582033 ]

Branimir Lambov commented on CASSANDRA-7032:
--------------------------------------------

[Uploaded|https://github.com/apache/cassandra/compare/trunk...blambov:7032-vnode-assignment] a new version that interprets the token ranges correctly. It also incorporates your changes, and fixes a conversion error in the token size calculations that was reducing the precision to float and rarely resulting in slightly wrong results. The results achieved are pretty close to what I was getting with the older interpretation.
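The precision problem described is the classic float-vs-double trap. A hypothetical illustration (not the actual patch code) of how an accidental float round-trip distorts a token-range size measured as a fraction of the 2^64 Murmur3 ring:

```java
// Hypothetical illustration of the reported precision loss: an accidental
// cast to float drops the mantissa from 53 bits to 24 when a token range
// is measured as a fraction of the 2^64 ring.
public class TokenSizePrecision
{
    private static final double RING = Math.pow(2, 64);

    public static double sizeDouble(long start, long end)
    {
        return (end - start) / RING;
    }

    public static double sizeViaFloat(long start, long end)
    {
        return (float) ((end - start) / RING); // bug: float round-trip
    }
}
```

For most range sizes the two results differ slightly, which is enough to occasionally flip a comparison between near-equal candidates.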
[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577307#comment-14577307 ]

Branimir Lambov commented on CASSANDRA-7032:
--------------------------------------------

bq. the token ranges we're optimising for in this ticket are not those the cluster actually runs on. They start at the token assigned to each endpoint, instead of end

I checked this several times, because the way I understood it complicates things a lot; I do not understand how I still managed to get it wrong. {{TokenAllocation.addOwnership()}} does the wrong thing: it should have picked a representative of the range rather than the start, and then it shows the problem... This changes (simplifies, actually) things a lot; I have to do some extra testing to make sure everything is still working as well or better.
[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573452#comment-14573452 ]

Benedict commented on CASSANDRA-7032:
-------------------------------------

I've pushed some minor suggestions [here|https://github.com/belliottsmith/cassandra/tree/7032-suggestions].

I think the integration / mechanism for employing this functionality could be improved, but that can be delayed to a follow-up ticket. Before that, though, AFAICT the token ranges we're optimising for in this ticket are not those the cluster actually runs on: they _start_ at the token assigned to each endpoint, instead of _end_.

Other than that it LGTM.
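The distinction matters because, on the ring, each token owns the range that ends at it, i.e. (predecessor, token], wrapping around at the extremes. A small sketch of that convention (the helper class is hypothetical, not Cassandra code):

```java
import java.util.TreeSet;

// On the ring, the range owned by `token` is (predecessor, token],
// wrapping around at the ends. Hypothetical helper for illustration.
public class RingRanges
{
    // Returns the (exclusive) start of the range owned by `token`.
    public static long rangeStart(TreeSet<Long> ring, long token)
    {
        Long predecessor = ring.lower(token); // greatest token strictly below
        return predecessor != null ? predecessor : ring.last(); // wrap-around
    }
}
```

Optimising for ranges that start at each token instead silently shifts ownership by one ring segment per endpoint, which is the bug discussed above.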
[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14521484#comment-14521484 ]

Branimir Lambov commented on CASSANDRA-7032:
--------------------------------------------

Merged your changes into the [branch|https://github.com/apache/cassandra/compare/trunk...blambov:7032-vnode-assignment] and further refactored {{evaluateImprovement}}. It should now be much more understandable. I have verified that the running time is not significantly affected and that the results are the same as before (disregarding insignificant changes caused by floating point rounding).

The {{expandable}} property means that the ownership range of the token can be expanded. It is not set in the case where currGroup == newGroup, but I swapped things around in {{findUpdatedReplicationStart}} to make what we do in that case clearer.

On your other concern, group clumpings quickly create very heavy underutilization in the individual tokens, which the algorithm will not accept. A clumping may be a good short-term fix to a heavily overutilized node; I did a quick test, and excluding neighbours of the same group does appear to give somewhat worse results (e.g. the included long test no longer passes). Excluding could potentially also cause problems when the number of groups is close to the replication factor, hence I think we should leave it as it is for now. It is not hard to change if we want to do so in the future.
[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394730#comment-14394730 ]

Benedict commented on CASSANDRA-7032:
-------------------------------------

One other thing to consider optimising for: hash bit distribution. A lot of algorithmic optimisations can be made if we can assume each node has an approximately uniform distribution of hash bits. We should introduce some scoring based on this, e.g. try the best N candidates by our current evaluation, and select the one that delivers the best resulting hash bit distribution.
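One way such a score could look, as a rough sketch (the bucketing-by-top-bits scheme below is my assumption, not from the attached code): bucket a candidate's tokens by their top bits and measure how far the bucket counts deviate from uniform, with a lower score meaning a more uniform hash bit distribution.

```java
// Hypothetical scoring of a token set by how evenly the top `bits` bits
// are distributed across 2^bits buckets (lower = more uniform).
public class HashBitScore
{
    public static double score(long[] tokens, int bits)
    {
        int buckets = 1 << bits;
        int[] counts = new int[buckets];
        for (long t : tokens)
            counts[(int) (t >>> (64 - bits))]++; // unsigned top-bits bucket
        double expected = (double) tokens.length / buckets;
        double sumSq = 0;
        for (int c : counts)
            sumSq += (c - expected) * (c - expected);
        return sumSq / buckets; // variance of bucket counts
    }
}
```

Trying the best N candidates by the current evaluation and breaking ties with this score would match the suggestion above.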
[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392912#comment-14392912 ]

Benedict commented on CASSANDRA-7032:
-------------------------------------

I've pushed a version of this [here|https://github.com/belliottsmith/cassandra/tree/7032-nits] which has a lot of comments and some small refactors that I needed to perform in order to understand the code.

On the whole it seems like a good approach, and carefully optimised. I think some further changes are necessary for improved understanding, as the functions are currently very dense, with a focus on performance even in areas that won't make a tremendous difference. Algorithmically it's good, but IMO we should preferably separate the temporary state maintenance and cleanup from the main goal steps, and avoid inflating every nested property access into its own variable, as it pollutes the method body and makes it hard to track which actions are important. Inside {{evaluateImprovement}} the ownership adjustment calculations should all preferably be split out into separate methods, and if possible the {{unitsChain}} there and the {{prevPopulate}} chain in {{populateTokenInfo}} should use a class like the {{ReplicationVisitor}} I introduced, to extract some of the dense state maintenance and make the logic more declarative.

I'm still a little confused by the {{expandable}} property, both because I don't understand its name and because its definition doesn't seem to match up to how it's used (it's quite likely I'm misunderstanding, though, so could you consider my code comment and explain - preferably with clearer comments around its stated aim).

I think there is a potential small modification to be made as well, which may or may not be helpful. It bothers me that we can in theory create very suboptimal chains for establishing the owners of a token, by having a bunching of groups within the same replication range. Since these should _generally_ lead to worse distribution outcomes this algorithm should generally fight against them, but it is possible to find problematic local minima. It would be nice to pre-exclude any candidates that would lead to an increased clumping of groups. Perhaps even to exclude any tokens that would shadow the ownership range of a partner (same group) if we have more than RF groups, and otherwise to permit a maximum traversal of one partner to define the replication range.
[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380174#comment-14380174 ] Branimir Lambov commented on CASSANDRA-7032: Patch is up for review [here|https://github.com/apache/cassandra/compare/trunk...blambov:7032-vnode-assignment]. It gives the option to specify an allocate_tokens_keyspace when bringing up a node. The node's tokens are then allocated to optimize the load distribution for the replication strategy of that keyspace. The allocation is currently restricted to Murmur3Partitioner and SimpleStrategy or NetworkTopologyStrategy (is there anything else we need to support?). With the latter it cannot deal with cases where the number of racks in the dc is more than one but less than the replication factor, which should not be a common case.

There are a couple of things still left to do or explore, possibly in separate patches:
- add a dtest starting several nodes with allocation
- run a cstar_perf to see if it could show improvement for RF 2 in a 3-node cluster
- optimization of the selection for the first RF nodes in the cluster to guarantee good distribution later (see ReplicationAwareTokenAllocator.testNewCluster)
- (if deemed worthwhile) multiple different replication factors in one datacentre; the current code works ok when asked to allocate alternatingly, but this could be improved if we consider all relevant strategies in parallel
[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14309190#comment-14309190 ] Benedict commented on CASSANDRA-7032: - Before going ahead and integrating it, could we confirm that the algorithm performs well under heterogeneous data centre / rack configurations, by which I mean different numbers of nodes in each data centre / rack (not differing numbers of vnodes), for varying numbers of data centres and racks? And that it, most importantly, evenly distributes load across the disks within any single node? It would be great to perform a simulation over a wide range of state spaces, along each dimension of:
# number of data centres
# size variability of data centres
# number of disks per node
# variability of number of disks (perhaps only varying between data centres)
# number of racks within a data centre
And graph the variability of ownership _by disk_. Getting balanced disks is the really hard thing to do, and is really essential for CASSANDRA-6696. Data centres may also behave unexpectedly for replication; you may own an even amount of the ring, but it may translate into a disproportionately contiguous range within the data centre, so you end up owning an uneven quantity within the data centre.
[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14309383#comment-14309383 ] Branimir Lambov commented on CASSANDRA-7032: The answer relies on a couple of mappings:
- a Cassandra data centre is a separate ring to the algorithm. Data centres have their own allocation of replicas, hence one can only balance load within them, not across multiple DCs. The algorithm can simply ignore everything outside the data centre where the new node is spawned.
- in current Cassandra terms, a node is a node to the algorithm.
- in future Cassandra terms, a disk is a node to the algorithm*.
- a Cassandra rack is a rack to the algorithm if their number within the data centre is equal to or higher than the replication factor; otherwise the Cassandra node is the rack of the algorithm. Implicitly all disks in a machine belong to the same rack; the machine level can be completely ignored if racks are defined.
The only fine point is starred. I know your preference is to split the token space into a single range per disk. I see little benefit in this compared to simply assigning a subset of the machine's vnodes to each disk. The latter is more flexible, and perhaps even easier to achieve when SSTables are per vnode. (Perhaps what's not so easy to see is that each vnode is always responsible for replicating a contiguous range of tokens, and it is easy to convert between a vnode and the token range that it serves; see {{ReplicationStrategy.replicationStart}} in the attached.) Slower disks take a small number of vnodes (4?), faster ones take more (8/16?). Bigger machines have more disks, hence more vnodes, and automatically take a higher load. The algorithm handles heterogeneous rack and node configuration without issues as long as the number of distinct racks (in algorithm terms) is higher than the replication factor.
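The rack-mapping rule above (use the Cassandra rack as the algorithm's rack only when the data centre has at least RF distinct racks, otherwise fall back to per-node racks) can be sketched roughly as follows. This is a minimal illustration, not code from the attached patch; the {{Unit}} type and {{rackOf}} name are hypothetical.

```java
import java.util.*;

// Sketch of the mapping rules above: each DC is balanced in isolation, and a
// Cassandra rack only acts as an allocation "rack" when the DC has at least
// RF distinct racks; otherwise each node becomes its own rack.
public class TopologyMapping {
    record Unit(String node, String rack) {}

    // Decide the allocation-rack of a unit within one data centre.
    static String rackOf(Unit u, int distinctRacksInDc, int replicationFactor) {
        return distinctRacksInDc >= replicationFactor ? u.rack() : u.node();
    }

    public static void main(String[] args) {
        Unit u = new Unit("node1", "rackA");
        // 3 racks, RF=3: the physical rack is the allocation rack
        System.out.println(rackOf(u, 3, 3)); // rackA
        // 2 racks, RF=3: fall back to per-node racks
        System.out.println(rackOf(u, 2, 3)); // node1
    }
}
```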
[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14309401#comment-14309401 ] Branimir Lambov commented on CASSANDRA-7032: I will rename the Node concept in the code; it should be Unit (or something similar) to make it obvious that its meaning can (and will) be different from a Cassandra node.
[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14309403#comment-14309403 ] Benedict commented on CASSANDRA-7032: - Sounds good. I just wanted clarification that these things were being explicitly addressed. It would be helpful still to simulate these characteristics explicitly now, though, and see how well it actually turns out, don't you think? So we can get numbers on how well it will actually perform at various ends of the spectrum. I still have a nagging suspicion there may be subtleties in mapping DCs into the single address space without error, but that's a separate issue I guess, to be addressed as we tie it in. I note I don't have my heart set on any mechanism, so long as we can achieve the balance, and it does sound like your approach will generalise well. The thing I like about this is that we can actually add extra nodes if we expand the number of disks on a node. We will need to think carefully about how we actually expose this node mapping to the user, in general, though, to ensure it is easy to reason about and robust.
[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277652#comment-14277652 ] Branimir Lambov commented on CASSANDRA-7032: No, I have not tried asymmetric configs yet, but I don't foresee significant problems adapting the method to them. Assuming configuration is done by specifying a number of vnodes per node, it should be just a matter of setting the optimal size based on that number; of course it remains to be seen whether that causes some weird behaviour. The speed problem is now mostly solved; I'll post a new version soon and try that out.
[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275343#comment-14275343 ] Benedict commented on CASSANDRA-7032: - Sounds very promising! Have you tried this on asymmetric cluster configs, i.e. with racks or DCs of unequal size? I had a glance at the code and it looks like not, but in case I misread... it would be great to factor this in.
[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243148#comment-14243148 ] Jon Haddad commented on CASSANDRA-7032: --- Was the original 256 number chosen in hopes that it would minimize the chance of imbalance? If so, would this patch result in recommending fewer vnodes, say 16? If so, I imagine that would result in less time-consuming repair as well as improvements to the Spark driver, which, as of the last time I checked, did 1 query per token to achieve data locality. I would assume 16 nodes streaming data to a single one would still achieve the benefits of vnodes, but I'm just picking a number out of the air.
[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241332#comment-14241332 ] Branimir Lambov commented on CASSANDRA-7032: Ignoring replication for the time being (more on that below), and looking at the best thing we can do when we have an existing setup and are trying to add a node, I came up with the following approach.

We can only assign new vnodes, which means that we can only _take away_ load from other nodes, never add to it. On the one hand this means that underutilized nodes are hopeless until the cluster grows enough for their share to become normal. On the other it means that the best thing to do (aiming for the smallest overutilization, i.e. max deviation from the mean) is to take the highest-load nodes and spread their load evenly between them and the new node. Adding a new node gives us _vn_ (its number of vnodes) new tokens to issue, i.e. we can decrease the load of at most _vn_ other nodes. We can pick the _vn_ highest-load ones, but some of them may already have a lower load than the target spread; we thus select the largest _n <= vn_ highest-load nodes such that the spread load _t_, which is their combined load divided by _n+1_, is lower than the load of each individual node. We can then choose how to assign the _vn_ tokens, splitting some of the ranges in these _n_ nodes to reduce the load of each node to _t_. This should also leave the new node with a load of _t_.
The attached code implements a simple version of this, which improves overutilization very quickly with every new node; a typical simulation looks like:
{code}
Random generation of 1000 nodes with 256 tokens each
Size 1000 max 1.24 min 0.80 No replication
Adding 1 node(s) using NoReplicationTokenDistributor
Size 1001 max 1.11 min 0.80 No replication
Adding 9 node(s) using NoReplicationTokenDistributor
Size 1010 max 1.05 min 0.81 No replication
Adding 30 node(s) using NoReplicationTokenDistributor
Size 1040 max 1.02 min 0.83 No replication
Adding 210 node(s) using NoReplicationTokenDistributor
Size 1250 max 1.00 min 1.00 No replication
{code}
It also constructs clusters from empty pretty well. However, when replication is present the load distribution of this allocation does not look good (the added node tends to take much more than it should; one reason for this is that it becomes a replica of the token ranges it splits), which is not unexpected. I am now trying to see how exactly taking replication into account affects the reasoning above. We can still only remove load, but the way splitting affects the loads is no longer as clear.

As far as I can see, the following simplification of Cassandra's replication strategies should suffice for handling the current and planned variations:
* we have units made up of a number of vnodes whose load we want to be able to balance (currently unit==node, but in the future the unit could be smaller, e.g. a disk or core)
* units are bunched up in racks (if racks are not defined, a node is implicitly a rack for its units)
* replicas of data must be placed on the closest higher vnodes that belong to different racks
* the replication strategy specifies the number of replicas and the set of units belonging to each rack
Datacentres are irrelevant as replication is specified within each dc, i.e. we can isolate the vnode allocation to the individual dc.
If disk/core-level allocation is in place, the node boundaries within a rack can be ignored as well. Is there anything I'm missing? [~benedict]: I believe you prefer to split the disk/core workload inside the node by assigning a token range (e.g. the vnodes that intersect with a range corresponding to _1/n_ of the token ring are to be handled by that disk/core). I prefer to just choose _1/n_ of the vnodes, because it lets me directly balance them; do you have any objections to this?
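The selection step described two comments up (grow the set of donor nodes up to _vn_ while the spread load _t_, their combined load divided by _n+1_, stays below every selected node's own load) can be sketched as follows. This is a hypothetical illustration, not the attached implementation; it assumes loads are given as relative ownership sorted in descending order.

```java
import java.util.*;

// Sketch of the donor-selection step: pick the largest n <= vn highest-load
// nodes whose spread load t = (combined load) / (n + 1) stays strictly below
// each selected node's own load, so every donor actually sheds load.
public class SpreadSelection {
    // loadsDesc: relative node loads, sorted descending; vn: the new node's
    // vnode count. Returns how many nodes the new node should take load from.
    static int selectCount(double[] loadsDesc, int vn) {
        int best = 0;
        double sum = 0;
        for (int n = 1; n <= Math.min(vn, loadsDesc.length); n++) {
            sum += loadsDesc[n - 1];
            double t = sum / (n + 1);
            // loads are sorted descending, so checking the smallest selected
            // load suffices to know t is below every selected node's load
            if (t < loadsDesc[n - 1]) best = n;
        }
        return best;
    }

    public static void main(String[] args) {
        double[] loads = {1.3, 1.2, 1.1, 0.9, 0.8}; // relative ownership
        int n = selectCount(loads, 4);
        double sum = 0;
        for (int i = 0; i < n; i++) sum += loads[i];
        // n = 4 is rejected because 4.5/5 = 0.9 is not below the 4th load 0.9
        System.out.println("take from " + n + " nodes, target load " + sum / (n + 1));
    }
}
```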
[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241349#comment-14241349 ] Benedict commented on CASSANDRA-7032: - If you mean for V vnode tokens in ascending order [0..V), and e.g. D disks, the disks would own one of the token lists in the set { [dV/D..(d+1)V/D) : 0 <= d < D }, and you guarantee that the owned range of each list is balanced with the other lists, this seems pretty analogous to the approach I was describing and perfectly reasonable. The main goal is only that once a range or set of vnode tokens has been assigned to a given resource (disk, cpu, node, rack, whatever), that resource never needs to reassign its tokens.
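The partitioning in the comment above, where disk d owns the token sublist [dV/D..(d+1)V/D) of the node's V sorted vnode tokens, can be sketched as a small helper. This is an illustrative sketch only; the class and method names are not from the attached code.

```java
import java.util.*;

// Sketch of the per-disk token-list split: with the node's V vnode tokens in
// ascending order and D disks, disk d owns the sublist [d*V/D, (d+1)*V/D).
public class DiskTokenLists {
    static List<List<Long>> splitByDisk(List<Long> sortedTokens, int disks) {
        int V = sortedTokens.size();
        List<List<Long>> result = new ArrayList<>();
        for (int d = 0; d < disks; d++) {
            // integer division spreads any remainder tokens across the disks
            result.add(new ArrayList<>(sortedTokens.subList(d * V / disks, (d + 1) * V / disks)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Long> tokens = Arrays.asList(10L, 20L, 30L, 40L, 50L, 60L, 70L, 80L);
        System.out.println(splitByDisk(tokens, 4)); // [[10, 20], [30, 40], [50, 60], [70, 80]]
    }
}
```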
[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232857#comment-14232857 ] Benedict commented on CASSANDRA-7032: - Well, NetworkTopologyStrategy already enforces some degree of balance across racks, and absolutely guarantees balance across DCs as far as replication ownership is concerned. It _would_ be nice to migrate this behaviour to the token selection so that we could reason about ownership a bit more clearly (NTS might enforce our general ownership constraints, but having a predictably cheap generation strategy for endpoints would be great, as the amount of state necessary to route queries could shrink dramatically if we could rely on a sequence of adjacent tokens ensuring these properties, for instance), but a simpler goal of simply ensuring that for any given arbitrary slice of the global token range, all nodes have a share of the range that is within epsilon of perfect, should be more than sufficient.

TL;DR: our goal should probably be: for any given arbitrary slice of the global token range, all nodes have a share of the range that is within epsilon* of perfect
* with epsilon probably inversely proportional to the size of the slice
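The slice-balance goal stated above can be made concrete by measuring each node's share within an arbitrary slice. The sketch below is a toy illustration on a non-wrapping integer ring, not code from the ticket; each token is assumed to own the range from the previous token up to and including itself.

```java
import java.util.*;

// Sketch of the stated goal: for an arbitrary slice of the ring, measure each
// node's owned share, which can then be compared against the perfect 1/nodes
// fraction. Toy non-wrapping ring for clarity.
public class SliceBalance {
    // tokens: sorted token -> owning node; each token owns (prevToken, token].
    static Map<String, Long> sharesInSlice(TreeMap<Long, String> tokens, long from, long to) {
        Map<String, Long> owned = new HashMap<>();
        long prev = from;
        // walk the tokens inside the slice; each one closes the range since `prev`
        for (Map.Entry<Long, String> e : tokens.subMap(from, false, to, true).entrySet()) {
            owned.merge(e.getValue(), e.getKey() - prev, Long::sum);
            prev = e.getKey();
        }
        // the first token at or past `to` owns whatever remains of the slice
        Map.Entry<Long, String> next = tokens.ceilingEntry(to);
        if (next != null && prev < to) owned.merge(next.getValue(), to - prev, Long::sum);
        return owned;
    }

    public static void main(String[] args) {
        TreeMap<Long, String> ring = new TreeMap<>();
        ring.put(25L, "A"); ring.put(50L, "B"); ring.put(75L, "A"); ring.put(100L, "B");
        System.out.println(sharesInSlice(ring, 0, 100)); // both nodes own 50
    }
}
```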
[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231419#comment-14231419 ] Branimir Lambov commented on CASSANDRA-7032: [~benedict], can you point me to some more information on the imbalance that is known to appear and its behaviour with an increasing number of nodes? I'd like to get a better understanding of the problem, and how much of it is caused by replica selection rather than token imbalance. It seems to me that the best approach here is to build some replication strategy / network topology awareness into the algorithm, to be able to account for replica selection. This will complicate the algorithm but, in addition to getting better balance, could also improve the time spent finding replicas (e.g. CASSANDRA-6976).
If on joining the ring a new node were to CAS a shared table where a canonical allocation of token ranges lives after running this (or a similar) algorithm, we could then get guaranteed bounds on the ownership distribution in a cluster. This will also help for CASSANDRA-6696. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231425#comment-14231425 ] Benedict commented on CASSANDRA-7032: It's plain old statistics. Have a look at the Java code I attached, which simulates and reports the level of imbalance. Currently we randomly assign the tokens, and this results in some nodes happening to end up with all of their token ranges narrow relative to the other existing tokens, and others wider. Consistent hashing is what Riak uses to achieve balance, which is one approach. Rendezvous hashing is another. But these would likely involve changing the tokens of every node in the cluster on adding a new node. This would be acceptable, but I expect that with the amount of state space to work with, we can design an algorithm that guarantees low bounds of imbalance without having to change the tokens assigned to any existing nodes.
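The imbalance of random assignment is easy to reproduce. Below is a minimal, hypothetical sketch (far simpler than the attached TestVNodeAllocation.java; class and method names are invented for illustration) that places random tokens on a unit ring and reports the worst node's ownership relative to the fair share:

```java
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Sketch: simulate random vnode allocation on a [0,1) ring and measure
// imbalance as (max node ownership) / (fair share, i.e. 1/nodes).
public class RandomVnodeImbalance {
    public static double maxOverFairShare(int nodes, int vnodesPerNode, long seed) {
        Random rnd = new Random(seed);
        // token -> owning node; TreeMap keeps tokens in ring order
        TreeMap<Double, Integer> ring = new TreeMap<>();
        for (int n = 0; n < nodes; n++)
            for (int v = 0; v < vnodesPerNode; v++)
                ring.put(rnd.nextDouble(), n);
        double[] owned = new double[nodes];
        double prev = 0.0;
        for (Map.Entry<Double, Integer> e : ring.entrySet()) {
            // the range ending at a token belongs to that token's node
            owned[e.getValue()] += e.getKey() - prev;
            prev = e.getKey();
        }
        // wrap-around range from the last token back to the first
        owned[ring.firstEntry().getValue()] += 1.0 - prev;
        double max = 0.0;
        for (double o : owned) max = Math.max(max, o);
        return max * nodes;  // max / (1.0 / nodes)
    }

    public static void main(String[] args) {
        for (int nodes : new int[]{10, 100, 1000})
            System.out.printf("nodes=%d vnodes=256 max/fair=%.2f%n",
                              nodes, maxOverFairShare(nodes, 256, 42));
    }
}
```

Running this shows the worst node consistently owning more than its fair share under random assignment, which is the imbalance the deterministic algorithm is meant to bound.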
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231433#comment-14231433 ] Benedict commented on CASSANDRA-7032: I should note that the dovetailing with CASSANDRA-6696 is very important. Acceptable imbalance _per node_ is actually not _too_ tricky to deliver. But ensuring each disk on each node will have a fair share of the pie is a little harder.
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231496#comment-14231496 ] Branimir Lambov commented on CASSANDRA-7032: Per-disk balance is just another level in the hierarchy. Ideally we would like per-disk, per-node, per-rack and per-datacentre balance (configurable by number of vnodes), wouldn't we? Presumably with the highest emphasis on the lower levels. Ignoring replica selection, this all comes for free if we can ensure equal vnode size (e.g. by reassigning all tokens on adding a node). With reassignment it should also be trivial to build the network topology into the token assignment. As I see it there are two separate objectives:
- to build clusters incrementally by introducing and maintaining _some_ imbalance in the ring, which can be exploited to avoid reassignment;
- to improve the balance in existing, probably highly unbalanced clusters, built without the above algorithm in mind.
The former might be a solution to the latter, but that is not necessarily the case. In any case I intend to look at it in isolation first, and then think about how it would apply to existing clusters.
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231559#comment-14231559 ] Jeremiah Jordan commented on CASSANDRA-7032: Reassignment of existing tokens is not really an acceptable way to achieve this, as it requires shuffling the data which belongs to those tokens as well. We have already seen that shuffle is hard to get right and resource intensive, and we removed the previous attempt at it from the code base.
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232062#comment-14232062 ] Tyler Hobbs commented on CASSANDRA-7032: I agree with [~jjordan] here. One of the main motivations for moving to multiple tokens was to avoid expensive rebalance operations when nodes are added or removed. Changing that is likely to be a no-go.
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232076#comment-14232076 ] Branimir Lambov commented on CASSANDRA-7032: Perhaps I wasn't clear enough. Reassigning is not an option I am considering; I mentioned it as a way to imagine what an optimal balance would look like. The point is to get close enough results without it.
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138983#comment-14138983 ] Jonathan Ellis commented on CASSANDRA-7032: Curious what [~soverton] thinks of this.
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139167#comment-14139167 ] Brandon Williams commented on CASSANDRA-7032: It seems that our main problem when choosing a distribution for vnodes is not knowing what the existing distribution in the cluster is. Now that we have shadow gossip (and already use it during bootstrap), we could know this before choosing our own tokens, and be able to achieve a perfect distribution. Of course, for a new cluster where you aren't bootstrapping, it's still worth making algorithmic improvements.
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977190#comment-13977190 ] Tupshin Harper commented on CASSANDRA-7032: It seems like there might be a way to constrain vnode RDF (replication distribution factor) in the general scope of this ticket as well. Call it a nice-to-have, should it present itself.
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968319#comment-13968319 ] Benedict commented on CASSANDRA-7032: Uploaded a new version with debugging info about per-node disk distribution, assuming we were to evenly split the token range across a given number of disks, as suggested for CASSANDRA-6696. It looks pretty promising here too: no more than ~5% stdev of ownership per disk (declining as more disks are added), as a proportion of the total owned range of the node.
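The per-disk measurement described above can be sketched roughly as follows. This is an assumed scheme for illustration, not the CASSANDRA-6696 implementation: walk a node's vnode range sizes in ring order, cut to the next disk each time an equal 1/D share is filled, and report the per-disk stdev as a proportion of the node's total ownership.

```java
// Sketch (assumed scheme): split a node's vnode ranges into D contiguous
// groups of roughly equal total size, then measure per-disk imbalance.
public class DiskSplitSketch {
    public static double stdevShare(double[] rangeSizes, int disks) {
        double total = 0.0;
        for (double r : rangeSizes) total += r;
        double[] perDisk = new double[disks];
        int d = 0;
        double acc = 0.0;
        for (double r : rangeSizes) {
            perDisk[d] += r;
            acc += r;
            // advance to the next disk once this disk's equal share is filled
            while (d < disks - 1 && acc >= total * (d + 1) / disks) d++;
        }
        double mean = total / disks, var = 0.0;
        for (double p : perDisk) var += (p - mean) * (p - mean);
        // stdev of per-disk ownership as a fraction of the node's total
        return Math.sqrt(var / disks) / total;
    }

    public static void main(String[] args) {
        double[] ranges = {0.05, 0.03, 0.07, 0.02, 0.04, 0.06, 0.01, 0.08};
        System.out.printf("stdev/total = %.3f%n", stdevShare(ranges, 4));
    }
}
```

With equal-sized ranges the split is exact and the stdev is zero; the uneven ranges produced by vnode allocation are what drive the ~5% figure quoted above.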
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969100#comment-13969100 ] Jeremiah Jordan commented on CASSANDRA-7032: Make sure rack awareness is part of the algorithm.