There is now a parent ticket for this issue in JIRA: https://issues.apache.org/jira/browse/CASSANDRA-4119
Comments and contributions are still welcome!

Cheers,
Sam

On 16 March 2012 23:38, Sam Overton <s...@acunu.com> wrote:

Hello cassandra-dev,

This is a long email. It concerns a significant change to Cassandra, so it deserves a thorough introduction.

The summary is: we believe virtual nodes are the way forward. We would like to add virtual nodes to Cassandra and we are asking for comments, criticism and collaboration!

Cassandra's current partitioning scheme is sub-optimal for bootstrap, decommission, repair and re-balance operations, and places the burden on users to properly calculate tokens (a common cause of mistakes), which is a recurring pain point.

Virtual nodes have a variety of benefits over the one-to-one mapping of host to key range which Cassandra currently supports.

Among these benefits are:

* Even load balancing when growing and shrinking the cluster
A virtual node scheme ensures that all hosts in a cluster have an even portion of the total data, and a new node bootstrapped into the cluster will assume its share of the data. Doubling or halving the cluster to ensure even load distribution would no longer be necessary.

* Distributed rebuild
When sizing a cluster, one of the considerations is the amount of time required to recover from a failed node. This is the exposure time, during which a secondary failure could cause data loss. In order to guarantee an upper bound on the exposure time, the amount of data which can be stored on each host is limited by the time taken to recover the required replica count. At Acunu we have found that the exposure time is frequently the limiting factor which dictates the maximum allowed node size in customers' clusters.

Using a virtual node scheme, the data stored on one host is not replicated on just RF-1 other physical hosts. Each virtual node is replicated to RF-1 other virtual nodes, which may be on a different set of physical hosts to the replicas of other virtual nodes stored on the same host. This means data for one host is replicated evenly across the entire cluster.

In the event of a failure, then, restoring the replica count can be done in a fully distributed way. Each host in the cluster participates in the rebuild, drastically reducing the exposure time and allowing more data to be stored on a single host while still maintaining an acceptable upper bound on the likelihood of secondary failure. This reduces TCO concerns (see the simulation sketch after this list of benefits).

* Greater failure tolerance in streaming
Operations which require streaming of a large range of data, e.g. bootstrap, decommission, repair, etc., incur a heavy cost if an error (e.g. a dropped network connection) is encountered during the streaming. Currently the whole range must be re-streamed, and this could constitute a very large amount of data. Virtual nodes reduce the impact of streaming failures, since each virtual node covers a much smaller range of the key-space, so re-streaming a whole virtual node is a much cheaper process.

* Evenly distributed impact of streaming operations
Streaming operations such as bootstrap, repair, et al. would involve every node in the cluster. This would distribute the load of these operations across the whole cluster, and could be staggered so that only a small subset of nodes were affected at once, similar to staggered repair [1].

* Possibility for active load balancing
Load balancing in Cassandra currently involves moving a token to increase or reduce the amount of key-space for which a host is responsible. This only allows load balancing between neighbouring nodes, so it could involve moving more than one token just to redistribute a single overloaded node. Virtual nodes could allow load balancing at a much finer granularity, so heavily loaded portions of the key-space could be redistributed to more lightly-loaded hosts by reassigning one or more virtual nodes.
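To make the load-balancing and distributed-rebuild points above a little more concrete, here is a minimal, self-contained Java sketch. It is not Cassandra code: the class name, the constants and the simplified SimpleStrategy-style "next distinct host clockwise holds the next replica" rule are all assumptions made purely for illustration. It assigns T random tokens per host, measures how far each host's ownership strays from the even 1/N share, and counts how many distinct hosts would hold the next replica of one host's ranges, i.e. how many peers could take part in rebuilding it.

    import java.util.*;

    public class VnodeBalanceSketch {
        public static void main(String[] args) {
            final int HOSTS = 12;   // physical hosts (illustrative)
            final int T = 64;       // tokens (virtual nodes) per host (illustrative)
            Random rnd = new Random(42);

            // token position in [0,1) -> owning host; a sorted map models the ring
            TreeMap<Double, Integer> ring = new TreeMap<>();
            for (int h = 0; h < HOSTS; h++)
                for (int i = 0; i < T; i++)
                    ring.put(rnd.nextDouble(), h);

            // Each token owns the range from the previous token up to itself.
            double[] owned = new double[HOSTS];
            Map<Integer, Set<Integer>> nextReplicaPeers = new HashMap<>();
            double prev = ring.lastKey() - 1.0;            // wrap the last range around 0
            int prevOwner = ring.lastEntry().getValue();
            for (Map.Entry<Double, Integer> e : ring.entrySet()) {
                int owner = e.getValue();
                owned[owner] += e.getKey() - prev;
                // The range ending at the previous token would have its next replica on
                // the owner of this token (approximating "next distinct host clockwise").
                if (owner != prevOwner)
                    nextReplicaPeers.computeIfAbsent(prevOwner, k -> new HashSet<>()).add(owner);
                prev = e.getKey();
                prevOwner = owner;
            }

            double ideal = 1.0 / HOSTS, maxDev = 0;
            for (double share : owned)
                maxDev = Math.max(maxDev, Math.abs(share - ideal) / ideal);
            System.out.printf("max deviation from an even 1/N share: %.1f%%%n", 100 * maxDev);
            System.out.println("distinct peers holding the next replica of host 0's ranges: "
                    + nextReplicaPeers.getOrDefault(0, Collections.emptySet()).size());
        }
    }

Re-running the sketch with T = 1 shows the contrast described above: ownership becomes noticeably skewed and each host's single range has only one next-replica peer, so a rebuild cannot be spread across the cluster.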
Implementing a virtual node scheme in Cassandra is not an insignificant amount of work, and it will touch a large amount of the codebase related to partitioning, placement, routing, gossip, and so on. We do believe that this is possible to do incrementally, and in such a way that there is an easy upgrade path for pre-virtual-node deployments.

It would not, however, touch the storage layer. The virtual node concept is solely for partitioning and placement, not for segregating the data storage of the host, so all keys for all virtual nodes on a host would be stored in the same SSTables.

We are not proposing the adoption of the same scheme used by Voldemort [2] and described in the Dynamo paper [3]. We feel this scheme is too different from Cassandra's current distribution model to be a viable target for incremental development. Their scheme also fixes the number of virtual nodes for the lifetime of the cluster, which can prove to be a ceiling to scaling the cluster if the virtual nodes grow too large.

The proposed design is:
* Assign each host T random tokens.
* A partition is assigned to a host for each of its tokens, where the partition is defined by the interval between a token and the previous token on the ring.
* When a host joins the ring it is assigned T random tokens, which will result in a portion of an existing partition being assigned to that host.
* When a host leaves the ring it relinquishes its tokens, which will result in its partitions becoming part of the neighbouring partitions.

This is just a basic extension of Cassandra's existing distribution model, where instead of having 1 token per host, there are many tokens per host. It is the same scheme used by libketama [4] for consistent hashing among memcached instances, and is also the original scheme used by Dynamo as described in [3] before they migrated to their current scheme with fixed partitions. (A small code sketch of this token scheme follows the list of properties below.)

The random assignment of tokens may seem unintuitive given that currently in Cassandra a random token assignment leads to an unbalanced cluster. With many virtual nodes, a random token assignment leads to load being evenly balanced across the hosts in the cluster with high probability. As the number of virtual nodes is increased, the variance in load across hosts decreases, as demonstrated by simulation in [5].

This scheme has the following properties (where N is the number of hosts and B is the total data stored in the cluster):
* placement metadata size is O(N), which is the same as in Cassandra currently
* partition size is O(B/N), so as data is inserted, if individual partitions become too large then adding nodes to the cluster reduces this
* the strategy shares the following properties in common with Cassandra currently:
** tokens are randomly assigned
** partitioning is determined by placement (and vice-versa)
** no two nodes may share the same token
** when a node leaves the ring, all of its tokens are removed - there is no exchanging of partitions between nodes
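To illustrate how little the proposed design changes about the existing model, here is a hypothetical sketch in Java. The class and method names are invented for illustration and are not Cassandra's API: the ring is still a single sorted token-to-host map, only with T entries per host, routing walks to the next token clockwise, and bootstrap/decommission are plain token insertions and removals, so partitions split and merge implicitly and are never exchanged between hosts.

    import java.util.*;

    public class VnodeRingSketch {
        private final TreeMap<Long, String> ring = new TreeMap<>(); // token -> host
        private final Random rnd = new Random();
        private final int tokensPerHost;                            // T

        public VnodeRingSketch(int tokensPerHost) {
            this.tokensPerHost = tokensPerHost;
        }

        /** Bootstrap: the joining host takes T random tokens, splitting existing partitions. */
        public void addHost(String host) {
            for (int i = 0; i < tokensPerHost; i++)
                ring.put(rnd.nextLong(), host); // token collisions are vanishingly rare; ignored here
        }

        /** Decommission: drop the host's tokens; its partitions merge into the neighbours. */
        public void removeHost(String host) {
            ring.values().removeIf(h -> h.equals(host));
        }

        /** Route a key's token to the host owning the partition it falls in (first replica). */
        public String ownerOf(long keyToken) {
            Map.Entry<Long, String> e = ring.ceilingEntry(keyToken);
            return (e != null ? e : ring.firstEntry()).getValue(); // wrap around the ring
        }
    }

In this sketch a lookup costs O(log(N*T)) and the map holds N*T entries, which for a fixed T is the O(N) placement metadata noted in the properties above; setting T = 1 degenerates to the current one-token-per-host model, which hints at the incremental upgrade path mentioned earlier.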
One design concern is that replicas of a key range should not be stored on the same physical host, as failure of that host could cause the loss of more than one replica of the data. This will be achieved by using a placement strategy very similar to the existing NetworkTopologyStrategy, which treats each individual host the same way as NTS treats a rack - that is, replicas are not assigned to two hosts on the same rack.

I will shortly create a ticket in JIRA to track discussion of this design. We have also done some simulation of this scheme to observe the load balancing properties, node size distribution, cluster resizing and so on. I will attach some results of this simulation to the JIRA ticket in due course.

We are keen to get the ball rolling on this and we look forward to your input, ideas and recommendations.

Best Regards,

Sam Overton

[1] Staggered repair: https://issues.apache.org/jira/browse/CASSANDRA-3721
[2] Project Voldemort, Design: http://project-voldemort.com/design.php
[3] Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
[4] Ketama: Consistent Hashing: http://www.audioscrobbler.net/development/ketama/
[5] Consistent Hashing: http://www.lexemetech.com/2007/11/consistent-hashing.html

--
Sam Overton
Acunu | http://www.acunu.com | @acunu