RFC: Cassandra Virtual Nodes
* When a host joins the ring it is assigned random tokens, which will result in a portion of an existing partition being assigned to that host.
* When a host leaves the ring it relinquishes its tokens, which will result in its partitions becoming part of the neighbouring partitions.

This is just a basic extension of Cassandra's existing distribution model: instead of having one token per host, there are many tokens per host. It is the same scheme used by libketama[4] for consistent hashing among memcached instances, and is also the original scheme used by Dynamo as described in [3], before they migrated to their current scheme with fixed partitions.

The random assignment of tokens may seem unintuitive, given that currently in Cassandra a random token assignment leads to an unbalanced cluster. With many virtual nodes, a random token assignment leads to load being evenly balanced across the hosts in the cluster with high probability. As the number of virtual nodes is increased, the variance in load across hosts decreases, as demonstrated by simulation in [5].

This scheme has the following properties (where N is the number of hosts and B is the total data stored in the cluster):

* placement metadata size is O(N), which is the same as in Cassandra currently
* partition size is O(B/N), so as data is inserted, if individual partitions become too large then adding nodes to the cluster reduces this
* the strategy shares the following properties in common with Cassandra currently:
** tokens are randomly assigned
** partitioning is determined by placement (and vice-versa)
** no two nodes may share the same token
** when a node leaves the ring, all of its tokens are removed - there is no exchanging of partitions between nodes

One design concern is ensuring that replicas of a key range are not stored on the same physical host, as failure of that host could cause the loss of more than one replica of the data.
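The load balancing claim above can be illustrated with a toy simulation (not Cassandra code; all names and parameters here are illustrative): assign each host some number of random tokens on a hash ring, then measure how evenly the ring is owned as the tokens-per-host count grows.

```python
# Toy simulation of random token assignment with virtual nodes.
# Illustrates that ownership variance shrinks as tokens per host grow.
import random

RING = 2 ** 64  # size of the token space

def build_ring(hosts, tokens_per_host, seed=42):
    """Assign tokens_per_host random tokens to each host; return a
    sorted list of (token, host) pairs."""
    rng = random.Random(seed)
    ring = []
    for host in hosts:
        for _ in range(tokens_per_host):
            ring.append((rng.randrange(RING), host))
    ring.sort()
    return ring

def ownership(ring):
    """Fraction of the ring owned by each host. A token owns the range
    from the previous token (exclusive) up to itself (inclusive)."""
    totals = {}
    prev = ring[-1][0] - RING  # wrap the last range around
    for token, host in ring:
        totals[host] = totals.get(host, 0) + (token - prev)
        prev = token
    return {h: t / RING for h, t in totals.items()}

hosts = [f"host{i}" for i in range(10)]
for t in (1, 16, 256):
    own = ownership(build_ring(hosts, t))
    spread = max(own.values()) / min(own.values())
    print(f"tokens/host={t:4d}  max/min ownership ratio={spread:.2f}")
```

With one token per host the max/min ownership ratio is typically large (an unbalanced cluster); with hundreds of tokens per host it approaches 1, matching the behaviour described in [5].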
This will be achieved by using a placement strategy very similar to the existing NetworkTopologyStrategy, which treats each individual host the same way as NTS treats a rack - that is, replicas are not assigned to two hosts on the same rack.

I will shortly create a ticket in JIRA to track discussion of this design. We have also done some simulation of this scheme to observe the load balancing properties, node size distribution, cluster resizing and so on. I will attach some results of this simulation to the JIRA ticket in due course.

We are keen to get the ball rolling on this and we look forward to your input, ideas and recommendations.

Best Regards,

Sam Overton

[1] Staggering repair: https://issues.apache.org/jira/browse/CASSANDRA-3721
[2] Project Voldemort, Design: http://project-voldemort.com/design.php
[3] Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
[4] Ketama: Consistent Hashing: http://www.audioscrobbler.net/development/ketama/
[5] Consistent Hashing: http://www.lexemetech.com/2007/11/consistent-hashing.html

--
Sam Overton
Acunu | http://www.acunu.com | @acunu
Re: RFC: Cassandra Virtual Nodes
On 17 March 2012 11:15, Radim Kolar h...@filez.com wrote:
> I don't like that every node will have same portion of data.
> 1. We are using nodes with different HW sizes (number of disks)
> 2. especially with ordered partitioner there tends to be hotspots and you must assign smaller portion of data to nodes holding hotspots

Hi Radim,

The number of virtual nodes for each host would be configurable by the user, in much the same way that initial_token is configurable now. A host taking a larger number of virtual nodes (tokens) would have proportionately more of the data. This is how we anticipate support for heterogeneity in cluster hardware.

Sam

--
Sam Overton
Acunu | http://www.acunu.com | @acunu
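The weighting described in this reply can be sketched in the same toy-simulation style (hypothetical code, not Cassandra itself): a host's expected share of the ring is proportional to the number of random tokens it takes.

```python
# Sketch: heterogeneous hardware weighted by token count. A host with
# twice the tokens owns roughly twice the ring. Illustrative only.
import random

RING = 2 ** 64

def expected_share(token_counts, seed=1):
    """token_counts: {host: number_of_tokens}. Returns each host's
    fraction of the ring under one random token assignment."""
    rng = random.Random(seed)
    ring = sorted(
        (rng.randrange(RING), host)
        for host, n in token_counts.items()
        for _ in range(n)
    )
    totals = dict.fromkeys(token_counts, 0)
    prev = ring[-1][0] - RING  # wrap the last range around
    for token, host in ring:
        totals[host] += token - prev
        prev = token
    return {h: t / RING for h, t in totals.items()}

# "big" has twice the disks, so it is given twice the tokens and ends
# up owning roughly half the ring.
shares = expected_share({"small1": 256, "small2": 256, "big": 512})
print(shares)
```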
Re: RFC: Cassandra Virtual Nodes
with more than two replicas increases with declustering [Xin et al. 2004] (declustering meaning increasing DF towards N).

Regards,

--
Sam Overton
Acunu | http://www.acunu.com | @acunu
Re: RFC: Cassandra Virtual Nodes
On 19 March 2012 09:23, Radim Kolar h...@filez.com wrote:
>> Hi Radim,
>> The number of virtual nodes for each host would be configurable by the user, in much the same way that initial_token is configurable now. A host taking a larger number of virtual nodes (tokens) would have proportionately more of the data. This is how we anticipate support for heterogeneity in cluster hardware.
> Yes, but this is good only for random partitioner. For ordered you need to be able to split the token space on highly loaded servers. With virtual tokens it will move load to a random node. What if the random node is also a hotspot node? Administration will be more difficult because you don't know where workload lands after you reduce the number of tokens held by a node.

For OPP we envisage an external management process performing active load balancing. The initial token assignment would be random within some user-specified range corresponding to the range of their keys. The load would then be monitored, and hot-spots would be moved by reassigning virtual nodes to lightly loaded machines, or by introducing new tokens into hot ranges. It makes sense that this would not be a manual process, but there would certainly be more control than just increasing or decreasing the number of tokens assigned to a node.

--
Sam Overton
Acunu | http://www.acunu.com | @acunu
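The kind of decision such an external balancer could make can be sketched as a simple greedy step (entirely hypothetical; none of these names exist in Cassandra): given a measured load per virtual-node range, move the hottest range to the most lightly loaded host.

```python
# Greedy rebalancing sketch for an external management process.
# range_load and range_owner are assumed inputs from monitoring.

def rebalance_step(range_load, range_owner):
    """range_load: {range_id: load}; range_owner: {range_id: host}.
    Returns (range_id, from_host, to_host) for one greedy move of the
    hottest virtual-node range to the coolest host."""
    host_load = {}
    for r, load in range_load.items():
        owner = range_owner[r]
        host_load[owner] = host_load.get(owner, 0) + load
    hottest = max(range_load, key=range_load.get)
    coolest = min(host_load, key=host_load.get)
    return hottest, range_owner[hottest], coolest

move = rebalance_step(
    {"r1": 90, "r2": 10, "r3": 5, "r4": 5},
    {"r1": "A", "r2": "A", "r3": "B", "r4": "C"},
)
print(move)  # -> ('r1', 'A', 'B')
```

A real balancer would also need to respect replica placement constraints and avoid oscillation, but the fine granularity of virtual nodes is what makes single-range moves like this possible at all.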
Re: RFC: Cassandra Virtual Nodes
> For OPP the problem of load balancing is more profound. Now you need vnodes per keyspace because you cannot expect each keyspace to have the same distribution. With three keyspaces you are now unsure as to which one is causing the hotness. I think OPP should just go away.

That's a good point, but isn't that the same problem as trying to balance tokens with OPP currently?
Re: RFC: Cassandra Virtual Nodes
On 20 March 2012 04:35, Vijay vijay2...@gmail.com wrote:
> On Mon, Mar 19, 2012 at 8:24 PM, Eric Evans eev...@acunu.com wrote:
>> I'm guessing you're referring to Rick's proposal about ranges per node?
> May be, what i mean is little more simple than that... We can consider every node having a multiple conservative ranges and moving those ranges for bootstrap etc, instead of finding the mid point etc in the bootstrap code. Once we get that working all the way to the FS/Streaming then we can move those ranges and assign those ranges to nodes in random orders. Hope it makes sense.

I agree that this should be approached in incremental steps. Rick already raised concerns about stability issues which might arise from changing large parts of code at once.

I would anticipate the first step to be, exactly as you suggest, to support multiple tokens per host instead of just one. Presumably in your suggestion you imagine these tokens to define contiguous ranges for a given host, so that the distribution model is the same as before, but bootstrap can be done incrementally. This would be a great first step.

The extension to a virtual node scheme as described previously is then fairly trivial. The only additional change needed is to assign the tokens in some other way which does not restrict the ranges to being contiguous.

--
Sam Overton
Acunu | http://www.acunu.com | @acunu
Re: RFC: Cassandra Virtual Nodes
On 20 March 2012 13:37, Eric Evans eev...@acunu.com wrote:
> On Tue, Mar 20, 2012 at 6:40 AM, Sam Overton s...@acunu.com wrote:
>> On 20 March 2012 04:35, Vijay vijay2...@gmail.com wrote:
>>> May be, what i mean is little more simple than that... We can consider every node having a multiple conservative ranges and moving those ranges for bootstrap etc, instead of finding the mid point etc in the bootstrap code. Once we get that working all the way to the FS/Streaming then we can move those ranges and assign those ranges to nodes in random orders. Hope it makes sense.
>> I agree that this should be approached in incremental steps. Rick already raised concerns about stability issues which might arise from changing large parts of code at once. I would anticipate the first step to be, exactly as you suggest, to support multiple tokens per host instead of just one. Presumably in your suggestion you imagine these tokens to define contiguous ranges for a given host, so that the distribution model is the same as before, but bootstrap can be done incrementally. This would be a great first step. The extension to a virtual node scheme as described previously is then fairly trivial. The only additional change needed is to assign the tokens in some other way which does not restrict the ranges to being contiguous.
> Sounds good to me. What can an upgrading user expect in the way of disruption? What would be required to move an existing cluster from one token per node to virtual nodes? Could this be made transparent?

The disruption for an end-user would be no more than the same rolling upgrade process that they have to go through currently to upgrade to a new version. This is how I envisage it working:

* When a node is upgraded and the new version starts up in an old cluster, it would split its own token range into multiple sub-ranges by assigning itself more tokens within its own range.
* These tokens could then be gossiped to any other new versions in the cluster. The old versions don't need to know about these intermediate tokens because distribution is exactly the same - node ranges are still contiguous.
* Once every node has been upgraded, distribution is still the same as before, but now ranges are split into sub-ranges.
* The benefits of vnodes start to become apparent when adding new nodes to the cluster - a new node bootstrapping would take an even amount of data from each other node, and would not require doubling the cluster to maintain balance.
* As more nodes are added to the cluster, it gets closer to full vnode distribution as more of the original hosts' ranges get reassigned to new nodes.

If the user wants to migrate to full vnode functionality straight away then they can do a rolling migration (decommission/bootstrap). During this migration there would be some imbalance in the cluster, but once all of the old nodes have been migrated, the cluster would be balanced.

--
Sam Overton
Acunu | http://www.acunu.com | @acunu
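The first upgrade step in that email, where a node splits its own contiguous range into sub-ranges without changing distribution, could be sketched as follows (hypothetical Python, not taken from the actual patches):

```python
# Sketch of the migration step: a node owning the contiguous range
# (prev_token, my_token] assigns itself num_tokens evenly spaced
# tokens inside that range. The final token equals my_token, so old
# nodes see an unchanged ring.
RING = 2 ** 64  # size of the token space

def split_own_range(prev_token, my_token, num_tokens):
    width = (my_token - prev_token) % RING  # handles wrap-around
    return [
        (prev_token + width * (i + 1) // num_tokens) % RING
        for i in range(num_tokens)
    ]

tokens = split_own_range(prev_token=1000, my_token=9000, num_tokens=4)
print(tokens)  # -> [3000, 5000, 7000, 9000]; last token unchanged
```

Because the union of the sub-ranges is exactly the original range and the node's outermost token is preserved, old-version nodes that only know about one token per host compute the same placement as before.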
Re: RFC: Cassandra Virtual Nodes
On 19 March 2012 23:41, Peter Schuller peter.schul...@infidyne.com wrote:
>> Using this ring bucket in the CRUSH topology (with the hash function being the identity function) would give the exact same distribution properties as the virtual node strategy that I suggested previously, but of course with much better topology awareness.
> I will have to re-read your original post. I seem to have missed something :) I did, and I may or may not understand what you mean. Are you comparing vnodes + hashing, with CRUSH + pre-partitioning by hash + identity hash as you traverse down the topology tree?

Yes. I was just trying to illustrate that it's not necessary to have CRUSH doing the partitioning and placement of primary replicas. The same functionality can be achieved by having logically separate placement (a ring with virtual nodes) and a replication strategy which implements the CRUSH algorithm for replica placement. I think you agreed with this further down your previous reply anyway - perhaps I was just being too verbose :)

The reason I'm trying to make that distinction is that it will be less work than wholesale replacing the entire distribution logic in Cassandra with CRUSH. I'm not sure if that's exactly what your design is suggesting?

--
Sam Overton
Acunu | http://www.acunu.com | @acunu
Re: RFC: Cassandra Virtual Nodes
There is now a parent ticket for this issue in JIRA:
https://issues.apache.org/jira/browse/CASSANDRA-4119

Comments and contributions are still welcome!

Cheers,

Sam

On 16 March 2012 23:38, Sam Overton s...@acunu.com wrote:
> Hello cassandra-dev,
>
> This is a long email. It concerns a significant change to Cassandra, so deserves a thorough introduction. The summary is: we believe virtual nodes are the way forward. We would like to add virtual nodes to Cassandra and we are asking for comments, criticism and collaboration!
>
> Cassandra's current partitioning scheme is sub-optimal for bootstrap, decommission, repair and re-balance operations, and places the burden on users to properly calculate tokens (a common cause of mistakes), which is a recurring pain-point.
>
> Virtual nodes have a variety of benefits over the one-to-one mapping of host to key range which Cassandra currently supports. Among these benefits are:
>
> * Even load balancing when growing and shrinking the cluster
> A virtual node scheme ensures that all hosts in a cluster have an even portion of the total data, and a new node bootstrapped into the cluster will assume its share of the data. Doubling, or halving, the cluster to ensure even load distribution would no longer be necessary.
>
> * Distributed rebuild
> When sizing a cluster, one of the considerations is the amount of time required to recover from a failed node. This is the exposure time, during which a secondary failure could cause data loss. In order to guarantee an upper bound on the exposure time, the amount of data which can be stored on each host is limited by the amount of time taken to recover the required replica count. At Acunu we have found that the exposure time is frequently the limiting factor which dictates the maximum allowed node size in customers' clusters.
> Using a virtual node scheme, the data stored on one host is not replicated on just RF-1 other physical hosts. Each virtual node is replicated to RF-1 other virtual nodes, which may be on a different set of physical hosts to replicas of other virtual nodes stored on the same host. This means data for one host is replicated evenly across the entire cluster. In the event of a failure, restoring the replica count can then be done in a fully distributed way. Each host in the cluster participates in the rebuild, drastically reducing the exposure time and allowing more data to be stored on a single host while still maintaining an acceptable upper bound on the likelihood of secondary failure. This reduces TCO concerns.
>
> * Greater failure tolerance in streaming
> Operations which require streaming of a large range of data, e.g. bootstrap, decommission, repair, etc., incur a heavy cost if an error (e.g. a dropped network connection) is encountered during the streaming. Currently the whole range must be re-streamed, and this could constitute a very large amount of data. Virtual nodes reduce the impact of streaming failures, since each virtual node is a much smaller range of the key-space, so re-streaming a whole virtual node is a much cheaper process.
>
> * Evenly distributed impact of streaming operations
> Streaming operations such as bootstrap, repair, et al. would involve every node in the cluster. This would distribute the load of these operations across the whole cluster, and could be staggered so that only a small subset of nodes were affected at once, similar to staggered repair[1].
>
> * Possibility for active load balancing
> Load balancing in Cassandra currently involves moving a token to increase/reduce the amount of key-space for which a host is responsible. This only allows load balancing between neighbouring nodes, so it could involve moving more than one token just to redistribute a single overloaded node. Virtual nodes could allow load balancing at a much finer granularity, so heavily loaded portions of the key-space could be redistributed to lighter-loaded hosts by reassigning one or more virtual nodes.
>
> Implementing a virtual node scheme in Cassandra is not an insignificant amount of work, and it will touch a large amount of the codebase related to partitioning, placement, routing, gossip, and so on. We do believe that this is possible to do incrementally, and in such a way that there is an easy upgrade path for pre-virtual-node deployments.
>
> It would not, however, touch the storage layer. The virtual node concept is solely for partitioning and placement, not for segregating the data storage of the host, so all keys for all virtual nodes on a host would be stored in the same SSTables.
>
> We are not proposing the adoption of the same scheme used by Voldemort[2] and described in the Dynamo paper[3]. We feel this scheme is too different from Cassandra's current distribution model to be a viable target for incremental development. Their scheme also fixes the number of virtual nodes for the lifetime of the cluster, which can prove to be a ceiling to scaling the cluster.
Update: Cassandra Virtual Nodes
Hello cassandra-dev,

This is an update on the previous discussion on virtual nodes - for anyone who missed that, an archive is here:
http://www.mail-archive.com/dev@cassandra.apache.org/msg03837.html

Work on our proposed design is being tracked in JIRA:
https://issues.apache.org/jira/browse/CASSANDRA-4119

We feel that our current implementation has sufficient functionality to be useful in its own right, but is still a fairly self-contained set of changes that can be incrementally improved - something which was expressed as desirable in the last discussion. This is a good point for a first round of review and to reach out to anyone interested in testing or contributing.

== Obtaining the source ==

The easiest way to test these changes is to clone our github repo and switch to the topic branch representing the top patch:

    git clone git://github.com/acunu/cassandra.git
    cd cassandra
    git checkout --track remotes/origin/p/4127/01_migration_path

Now just build with ant. To create a cluster with virtual node support, just un-comment/edit the following parameter in cassandra.yaml:

    num_tokens: 256

Make sure you have left initial_token blank. The node should automatically assign itself this many tokens.

You can view the token assignment as usual with "nodetool ring", but this becomes fairly useless with any large number of hosts/tokens, which is why we have added "nodetool clusterinfo", which shows ring ownership without cluttering the output with the token assignment.

If you want to test with a specific token assignment, initial_token now supports a comma-separated list of tokens, which will override the num_tokens setting. This is just for convenience of testing - users should no longer need to manually specify tokens, as balancing is automatic.
== Patches ==

The tickets which are in scope for this round of review are:

https://issues.apache.org/jira/browse/CASSANDRA-4121
https://issues.apache.org/jira/browse/CASSANDRA-4122
https://issues.apache.org/jira/browse/CASSANDRA-4125
https://issues.apache.org/jira/browse/CASSANDRA-4127

Each of these tickets has links to the corresponding patches. The order in which they should be applied is just increasing numerical order of ticket number.

These patches are based on another patch (currently pending review) for CASSANDRA-3881, which was an existing issue blocking virtual nodes:
https://issues.apache.org/jira/browse/CASSANDRA-3881

Links to the individual patches can also be found all in one place on the github wiki:
https://github.com/acunu/cassandra/wiki/CASSANDRA-4119

The current patches in the series are as follows:

* p/4121/01_support_multiple_tokens_per_host - support multiple tokens per host in TokenMetadata (CASSANDRA-4121)
* p/4122/01_bootstrap_decommission - support bootstrapping a node into the ring with multiple tokens, and minor changes for decommission (CASSANDRA-4122)
* p/4122/02_remove_tokens - minor changes to support remove-token with multiple tokens (CASSANDRA-4122)
* p/4125/01_admin_tools - nodetool support (addition of "nodetool clusterinfo" and changes to "nodetool ring") (CASSANDRA-4125)
* p/4127/01_migration_path - support for migration from one token to multiple tokens per node (CASSANDRA-4127)

If you wish to contribute or work with a clone from github, it would be advisable to familiarise yourself with TopGit, the tool we have been using for branch-based patch queue management. We've written up a tutorial here:
https://github.com/acunu/cassandra/wiki/TopGit-Tutorial

== What's left? ==

We haven't included any patches for tickets CASSANDRA-4123 and CASSANDRA-4124, which relate to the replication strategy and repair. Currently, replication and repair just work with the existing patches without any additional changes required. CASSANDRA-4126 relates to testing. We're running virtual node builds through our own test suites, but we will also be writing new tests in addition.

I look forward to your questions and comments!

--
Sam Overton
Acunu | http://www.acunu.com | @acunu
Re: Update: Cassandra Virtual Nodes
On 27 June 2012 01:45, Jonathan Ellis jbel...@gmail.com wrote:
> I've started commenting on the issues, but to take a higher level view, I'd say that it looks a lot more tractable than I thought at first, and props to you and Eric for pushing through with it.

Thanks!

> Besides the four (five including 3881) posted so far, I think 4123 needs to be on the critical path for 1.2; otherwise, you have an excellent chance of seriously reducing durability by deploying vnodes.

Yes, I started on a replication strategy for 4123 which included DF and found that it required a lot more code changes than I had anticipated, so set it aside. This is something we can revisit now that the rest of the patches are there. I intend to write up some of my observations from the first time around.

> In the meantime, 3881 can clearly be committed on its own. I'd be most comfortable committing 4121, 4122, 4125, 4127, 4123 as a group. From what I've seen, review should proceed fairly quickly, so hopefully floating the earlier ones a bit longer won't add much pain.

Agreed, that makes sense.

--
Sam Overton
Acunu | http://www.acunu.com | @acunu
Re: Virtual nodes order preserving partitioning
Hi,

On 25 July 2012 22:52, rektide rekt...@voodoowarez.com wrote:
> Hi all, forgive me if I've missed the discussion: what's the impact of virtual nodes on those using order preserving partitioning who need to do fast range-slice queries?

The implementation currently in trunk allows you to specify tokens manually in initial_token in the yaml. Assuming that you know your key distribution, you can manually select tokens which give you even load distribution.

Currently there is no support in trunk for moving virtual nodes, but this is being worked on:
https://issues.apache.org/jira/browse/CASSANDRA-4445

--
Sam Overton
Acunu | http://www.acunu.com | @acunu
tracing improvements
Hello cassandra-dev,

I would like to continue the momentum on improving Cassandra's tracing, following Mick's excellent work on pluggable tracing and Zipkin support. There are a couple of areas we can improve that would make tracing an even more useful tool for cluster operators to diagnose ongoing issues.

The control we currently have over tracing is coarse and somewhat cumbersome. Enabling tracing from the client for a specific query is fine for application developers, particularly in an environment where Zipkin is being used to trace all parts of the system and show an aggregated view. For an operator investigating an issue, however, this does not always give us the control that we need in order to obtain relevant data. We often need to diagnose an issue without the possibility of making any changes in the client, and often without prior knowledge of which queries at the application level are experiencing poor performance.

Our only other instigator of tracing is nodetool settraceprobability, which only affects a single node and gives us no control over precisely which queries get traced. In practice, it is very difficult to find the relevant queries that we want to investigate, so we have often resorted to bulk loading the traces into an external tool for analysis, and this seems sub-optimal when Cassandra could reduce much of the friction.

I have a few proposals to improve tracing that I'd like to throw out to the mailing list to get feedback before I start implementing.

1. Include trace_probability as a CF-level property, so sampled tracing can be enabled on a per-CF basis, cluster-wide, by changing the CF property.
https://issues.apache.org/jira/browse/CASSANDRA-13154

2. Allow tracing at the CFS level. If we have a misbehaving host, it would be useful to enable sampled tracing at the CFS layer on just that host, so that we can investigate queries landing on that replica, rather than just queries passing through it as a coordinator, as is currently possible.
https://issues.apache.org/jira/browse/CASSANDRA-13155

3. Add an interface allowing for custom filters which can decide whether tracing should be enabled for a given query. This is a similar idea to CASSANDRA-9193 [1], but following the same pattern that we have for IAuthenticator, IEndpointSnitch, ConfigurationLoader et al., where the intention is that useful default implementations are provided, but abstracted in such a way that custom implementations can be written for deployments where a specific type of functionality is required. This would then allow solutions such as CASSANDRA-11012 [2] without any specific support needing to be written in Cassandra.
https://issues.apache.org/jira/browse/CASSANDRA-13156

Thanks for reading!

Regards,

Sam

[1] https://issues.apache.org/jira/browse/CASSANDRA-9193 - Facility to write dynamic code to selectively trigger trace or log for queries
[2] https://issues.apache.org/jira/browse/CASSANDRA-11012 - Allow tracing CQL of a specific client only, based on IP (range)
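To make proposals 1 and 3 concrete, here is a sketch of the kind of filter interface being proposed. This is purely illustrative (the real interface would be Java, and none of these class or method names exist in Cassandra): a probabilistic default implementation with a per-table override, alongside a custom implementation covering the CASSANDRA-11012 use case of tracing only specific client IPs.

```python
# Illustrative sketch of a pluggable trace filter (hypothetical API).
import random

class ProbabilisticTraceFilter:
    """Default-style implementation: trace a random sample of queries,
    with an optional per-table probability override (proposal 1)."""
    def __init__(self, default_probability, per_table=None):
        self.default = default_probability
        self.per_table = per_table or {}  # {(keyspace, table): prob}

    def should_trace(self, keyspace, table, client_ip):
        p = self.per_table.get((keyspace, table), self.default)
        return random.random() < p

class ClientIpTraceFilter:
    """Custom-style implementation: trace only queries from specific
    client addresses (the CASSANDRA-11012 use case)."""
    def __init__(self, ips):
        self.ips = set(ips)

    def should_trace(self, keyspace, table, client_ip):
        return client_ip in self.ips

f = ClientIpTraceFilter({"10.0.0.5"})
print(f.should_trace("ks", "cf", "10.0.0.5"))  # True
print(f.should_trace("ks", "cf", "10.0.0.6"))  # False
```

The point of the pattern is that the coordinator would consult a single should-trace decision point per query, and operators could swap in whichever implementation matches their investigation, without client changes.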