RFC: Cassandra Virtual Nodes

2012-03-16 Thread Sam Overton
* When a host joins the ring it is assigned a number of random tokens, which
will result in a portion of an existing partition being assigned to that host.
* When a host leaves the ring it relinquishes its tokens which will result
in its partitions becoming part of the neighbouring partitions.

This is just a basic extension of Cassandra's existing distribution model,
where instead of having 1 token per host, there are many tokens per host.
It is the same scheme used by libketama[4] for consistent hashing among
memcached instances, and is also the original scheme used by Dynamo as
described in [3] before they migrated to their current scheme with fixed
partitions.
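
To make the scheme concrete, here is a minimal sketch (plain Java, not
Cassandra's actual implementation) of a ring in which each host owns several
random tokens and a key is placed on the host owning the next token clockwise
from the key's hash:

    import java.util.Map;
    import java.util.Random;
    import java.util.TreeMap;

    // Sketch only: each host is assigned several random tokens; a key maps to
    // the host owning the first token at or after the key's hash (with wrap-around).
    public class VirtualNodeRing
    {
        private final TreeMap<Long, String> ring = new TreeMap<>();
        private final Random random = new Random();

        public void addHost(String host, int numTokens)
        {
            for (int i = 0; i < numTokens; i++)
                ring.put(random.nextLong(), host); // random token assignment
        }

        public void removeHost(String host)
        {
            // a departing host simply relinquishes its tokens; its ranges fall
            // back to the owners of the neighbouring tokens
            ring.values().removeIf(h -> h.equals(host));
        }

        public String hostFor(long keyToken)
        {
            Map.Entry<Long, String> entry = ring.ceilingEntry(keyToken);
            return (entry != null ? entry : ring.firstEntry()).getValue();
        }
    }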

The random assignment of tokens may seem unintuitive given that currently
in Cassandra a random token assignment leads to an unbalanced cluster. With
many virtual nodes, a random token assignment leads to load being evenly
balanced across the hosts in the cluster with high probability. As the
number of virtual nodes is increased, the variance in load across hosts
decreases, as demonstrated by simulation in [5].
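
As a rough illustration of that balancing effect (re-using the VirtualNodeRing
sketch above; the numbers are only indicative and are not the simulation
results from [5]):

    // With 100 hosts and 256 tokens each, the share of randomly hashed keys
    // landing on each host concentrates near 1/100 -- far lower variance than
    // a single random token per host would give.
    VirtualNodeRing ring = new VirtualNodeRing();
    for (int h = 0; h < 100; h++)
        ring.addHost("host" + h, 256);

    Map<String, Integer> counts = new java.util.HashMap<>();
    Random r = new Random();
    for (int i = 0; i < 1_000_000; i++)
        counts.merge(ring.hostFor(r.nextLong()), 1, Integer::sum);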

This scheme has the following properties (where N is the number of hosts
and B is the total data stored in the cluster):
* placement metadata size is O(N), which is the same as in Cassandra
currently
* partition size is O(B/N), so if individual partitions become too large as
data is inserted, adding nodes to the cluster reduces their size
* the strategy has the following properties in common with Cassandra
currently:
** tokens are randomly assigned
** partitioning is determined by placement (and vice-versa)
** no two nodes may share the same token
** when a node leaves the ring, all of its tokens are removed - there is no
exchanging of partitions between nodes

One design concern is ensuring that replicas of a key range are not stored
on the same physical host, as failure of that host could cause the loss of
more than one replica of the data. This will be achieved by using a
placement strategy very similar to the existing NetworkTopologyStrategy,
which treats each individual host the same way that NTS treats a rack: just
as NTS never assigns two replicas to hosts on the same rack, this strategy
never assigns two replicas of a range to virtual nodes on the same host.
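
A sketch of how such a strategy could walk the ring (again illustrative Java
only, not the actual NTS-derived implementation) is:

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.TreeMap;

    // Sketch: choose RF replicas for a token by walking the ring clockwise and
    // skipping virtual nodes whose physical host already holds a replica --
    // analogous to NTS skipping hosts on an already-used rack.
    public class VnodeReplicaSelection
    {
        public static List<String> replicasFor(long token, TreeMap<Long, String> ring, int rf)
        {
            List<String> replicas = new ArrayList<>();
            Set<String> seenHosts = new HashSet<>();

            // tokens at or after the key's token, then wrap around to the start
            List<Long> walk = new ArrayList<>(ring.tailMap(token, true).keySet());
            walk.addAll(ring.headMap(token, false).keySet());

            for (Long t : walk)
            {
                String host = ring.get(t);
                if (seenHosts.add(host))  // at most one replica per physical host
                    replicas.add(host);
                if (replicas.size() == rf)
                    break;
            }
            return replicas;
        }
    }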

I will shortly create a ticket in JIRA to track discussion of this design.
We have also done some simulation of this scheme to observe the load
balancing properties, node size distribution, cluster resizing and so on. I
will attach some results of this simulation to the JIRA ticket in due
course.

We are keen to get the ball rolling on this and we look forward to your
input, ideas and recommendations.

Best Regards,

Sam Overton

[1] Staggering repair: https://issues.apache.org/jira/browse/CASSANDRA-3721
[2] Project Voldemort, Design: http://project-voldemort.com/design.php
[3] Dynamo:
http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
[4] Ketama: Consistent Hashing:
http://www.audioscrobbler.net/development/ketama/
[5] Consistent Hashing:
http://www.lexemetech.com/2007/11/consistent-hashing.html

-- 
Sam Overton
Acunu | http://www.acunu.com | @acunu


Re: RFC: Cassandra Virtual Nodes

2012-03-17 Thread Sam Overton
On 17 March 2012 11:15, Radim Kolar h...@filez.com wrote:

 I don't like that every node will have same portion of data.

 1. We are using nodes with different HW sizes (number of disks)
 2.  especially with ordered partitioner there tends to be hotspots and you
 must assign smaller portion of data to nodes holding hotspots


Hi Radim,

The number of virtual nodes for each host would be configurable by the
user, in much the same way that initial_token is configurable now. A host
taking a larger number of virtual nodes (tokens) would have proportionately
more of the data. This is how we anticipate support for heterogeneity in
cluster hardware.
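
For example (using the num_tokens setting introduced by the implementation
later in this thread; the values are illustrative only), a host with roughly
twice the capacity would simply be given twice as many tokens in
cassandra.yaml:

    # standard host
    num_tokens: 128

    # host with roughly twice the disk/IO capacity
    num_tokens: 256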

Sam

-- 
Sam Overton
Acunu | http://www.acunu.com | @acunu


Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Sam Overton
 with more than two
replicas increases with declustering [Xin et al. 2004]

(declustering meaning increasing DF towards N)


Regards,


--
Sam Overton
Acunu | http://www.acunu.com | @acunu


Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Sam Overton
On 19 March 2012 09:23, Radim Kolar h...@filez.com wrote:


 Hi Radim,

 The number of virtual nodes for each host would be configurable by the
 user, in much the same way that initial_token is configurable now. A host
 taking a larger number of virtual nodes (tokens) would have
 proportionately
 more of the data. This is how we anticipate support for heterogeneity in
 cluster hardware.

 Yes, but this is good only for random partitioner. For ordered you need to
 be able split token space on highly loaded servers. With virtual tokens it
 will move load to random node.
 What if random node will be also hotspot node? Administration will be more
 difficult because you don't know where workload lands after you reduce
 number of tokens held by node.

For OPP we envisage an external management process performing active
load balancing. The initial token assignment would be random within
some user-specified range corresponding to the range of their keys.
The load would then be monitored and hot-spots would be moved by
reassigning virtual nodes to lightly loaded machines, or introducing
new tokens into hot ranges. It makes sense that this would not be a
manual process, but there would certainly be more control than just
increasing or decreasing the number of tokens assigned to a node.

-- 
Sam Overton
Acunu | http://www.acunu.com | @acunu


Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Sam Overton
 For OPP the problem of load balancing is more profound. Now you need
 vnodes per keyspace because you cannot expect each keyspace to have
 the same distribution. With three keyspaces you are now unsure as to
 which one is causing the hotness. I think OPP should just go away.

That's a good point, but isn't that the same problem with trying to
balance tokens with OPP currently?


Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Sam Overton
On 20 March 2012 04:35, Vijay vijay2...@gmail.com wrote:
 On Mon, Mar 19, 2012 at 8:24 PM, Eric Evans eev...@acunu.com wrote:

 I'm guessing you're referring to Rick's proposal about ranges per node?


 May be, what i mean is little more simple than that... We can consider
 every node having a multiple conservative ranges and moving those ranges
 for bootstrap etc, instead of finding the mid point etc in the bootstrap
 code. Once we get that working all the way to the FS/Streaming then we can
 move those ranges and assign those ranges to nodes in random orders. Hope
 it makes sense.

I agree that this should be approached in incremental steps. Rick
already raised concerns about stability issues which might arise from
changing large parts of code at once.

I would anticipate the first step to be, exactly as you suggest, to
support multiple tokens per host instead of just one. Presumably in
your suggestion you imagine these tokens to define contiguous ranges
for a given host, so that the distribution model is the same as
before, but bootstrap can be done incrementally.

This would be a great first step. The extension to a virtual node
scheme as described previously is then fairly trivial. The only
additional change needed is to assign the tokens in some other way
which does not restrict the ranges to being contiguous.

-- 
Sam Overton
Acunu | http://www.acunu.com | @acunu


Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Sam Overton
On 20 March 2012 13:37, Eric Evans eev...@acunu.com wrote:
 On Tue, Mar 20, 2012 at 6:40 AM, Sam Overton s...@acunu.com wrote:
 On 20 March 2012 04:35, Vijay vijay2...@gmail.com wrote:
 May be, what i mean is little more simple than that... We can consider
 every node having a multiple conservative ranges and moving those ranges
 for bootstrap etc, instead of finding the mid point etc in the bootstrap
 code. Once we get that working all the way to the FS/Streaming then we can
 move those ranges and assign those ranges to nodes in random orders. Hope
 it makes sense.

 I agree that this should be approached in incremental steps. Rick
 already raised concerns about stability issues which might arise from
 changing large parts of code at once.

 I would anticipate the first step to be, exactly as you suggest, to
 support multiple tokens per host instead of just one. Presumably in
 your suggestion you imagine these tokens to define contiguous ranges
 for a given host, so that the distribution model is the same as
 before, but bootstrap can be done incrementally.

 This would be a great first step. The extension to a virtual node
 scheme as described previously is then fairly trivial. The only
 additional change needed is to assign the tokens in some other way
 which does not restrict the ranges to being contiguous.

 Sounds good to me.

 What can an upgrading user expect in the way of disruption?  What
 would be required to move an existing cluster from one token per node
 to virtual nodes?  Could this be made transparent?


The disruption for an end-user would be no more than the same rolling
upgrade process that they have to go through currently to upgrade to a
new version.

This is how I envisage it working:
* When a node is upgraded and the new version starts up in an old
cluster, it would split its own token range into multiple sub-ranges
by assigning itself additional tokens within its existing range (see
the sketch after this list)
* These tokens could then be gossiped to any other new versions in the
cluster. The old versions don't need to know about these intermediate
tokens because distribution is exactly the same - node ranges are
still contiguous
* Once every node has been upgraded, distribution is still the same as
before, but now ranges are split into sub-ranges
* The benefits of vnodes start to become apparent when adding new
nodes to the cluster - a new node bootstrapping would take an even
amount of data from each other node and would not require doubling the
cluster to maintain balance
* As more nodes are added to the cluster it gets closer to full vnode
distribution as more of the original hosts' ranges get reassigned to
new nodes
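
A minimal sketch of that first splitting step (illustrative Java, not the
actual migration code; it ignores token wrap-around) might look like this:

    import java.math.BigInteger;
    import java.util.ArrayList;
    import java.util.List;

    // Sketch: a node that owned the single range (prevToken, ownToken] assigns
    // itself numTokens tokens splitting that range into equal, contiguous
    // sub-ranges. The last token produced is the node's original token.
    public class RangeSplitter
    {
        public static List<BigInteger> splitOwnRange(BigInteger prevToken, BigInteger ownToken, int numTokens)
        {
            List<BigInteger> tokens = new ArrayList<>();
            BigInteger width = ownToken.subtract(prevToken);
            for (int i = 1; i <= numTokens; i++)
            {
                BigInteger offset = width.multiply(BigInteger.valueOf(i))
                                         .divide(BigInteger.valueOf(numTokens));
                tokens.add(prevToken.add(offset));
            }
            return tokens;
        }
    }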

If the user wants to migrate to full vnode functionality straight away
then they can do a rolling migration (decommission/bootstrap). During
this migration there would be some imbalance in the cluster, but once
all of the old nodes have been migrated, the cluster would be
balanced.

-- 
Sam Overton
Acunu | http://www.acunu.com | @acunu


Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Sam Overton
On 19 March 2012 23:41, Peter Schuller peter.schul...@infidyne.com wrote:
 Using this ring bucket in the CRUSH topology, (with the hash function
 being the identity function) would give the exact same distribution
 properties as the virtual node strategy that I suggested previously,
 but of course with much better topology awareness.

 I will have to re-read your orignal post. I seem to have missed something :)

 I did, and I may or may not understand what you mean.

 Are you comparing vnodes + hashing, with CRUSH + pre-partitioning by
 hash + identity hash as you traverse down the topology tree?

Yes. I was just trying to illustrate that it's not necessary to have
CRUSH doing the partitioning and placement of primary replicas. The
same functionality can be achieved by having logically separate
placement (a ring with virtual nodes) and a replication strategy which
implements the CRUSH algorithm for replica placement. I think you
agreed with this further down your previous reply anyway, perhaps I
was just being too verbose :)

The reason I'm trying to make that distinction is that it will be
less work than replacing the entire distribution logic in Cassandra
wholesale with CRUSH. I'm not sure whether that's exactly what your
design is suggesting?

-- 
Sam Overton
Acunu | http://www.acunu.com | @acunu


Re: RFC: Cassandra Virtual Nodes

2012-04-10 Thread Sam Overton
There is now a parent ticket for this issue in JIRA:
https://issues.apache.org/jira/browse/CASSANDRA-4119

Comments and contributions are still welcome!

Cheers,

Sam

On 16 March 2012 23:38, Sam Overton s...@acunu.com wrote:
 Hello cassandra-dev,

 This is a long email. It concerns a significant change to Cassandra, so
 deserves a thorough introduction.

 The summary is: we believe virtual nodes are the way forward. We would like
 to add virtual nodes to Cassandra and we are asking for comments, criticism
 and collaboration!

 Cassandra's current partitioning scheme is sub-optimal for bootstrap,
 decommission, repair and re-balance operations, and places the burden on
 users to properly calculate tokens (a common cause of mistakes), which is a
 recurring pain-point.

 Virtual nodes have a variety of benefits over the one-to-one mapping of host
 to key range which Cassandra currently supports.

 Among these benefits are:

 * Even load balancing when growing and shrinking the cluster
 A virtual node scheme ensures that all hosts in a cluster have an even
 portion of the total data, and a new node bootstrapped into the cluster will
 assume its share of the data. Doubling or halving the cluster to ensure
 even load distribution would no longer be necessary.

 * Distributed rebuild
 When sizing a cluster, one of the considerations is the amount of time
 required to recover from a failed node. This is the exposure time, during
 which a secondary failure could cause data loss. In order to guarantee an
 upper bound on the exposure time, the amount of data which can be stored on
 each host is limited by the amount of time taken to recover the required
 replica count. At Acunu we have found that the exposure time is frequently
 the limiting factor which dictates the maximum allowed node size in
 customers' clusters.

 Using a virtual node scheme, the data stored on one host is not replicated
 on just RF-1 other physical hosts. Each virtual node is replicated to RF-1
 other virtual nodes which may be on a different set of physical hosts to
 replicas of other virtual nodes stored on the same host. This means data for
 one host is replicated evenly across the entire cluster.

 In the event of a failure then, restoring the replica count can be done in a
 fully distributed way. Each host in the cluster participates in the rebuild,
 drastically reducing the exposure time, allowing more data to be stored on a
 single host while still maintaining an acceptable upper bound on the
 likelihood of secondary failure. This reduces TCO concerns.

 * Greater failure tolerance in streaming
 Operations which require streaming of a large range of data, eg. bootstrap,
 decommission, repair, etc. incur a heavy cost if an error (eg. dropped
 network connection) is encountered during the streaming. Currently the whole
 range must be re-streamed, and this could constitute a very large amount of
 data. Virtual nodes reduce the impact of streaming failures, since each
 virtual node is a much smaller range of the key-space, so re-streaming a
 whole virtual node is a much cheaper process.

 * Evenly distributed impact of streaming operations
 Streaming operations such as bootstrap, repair, et al. would involve every
 node in the cluster. This would distribute the load of these operations
 across the whole cluster, and could be staggered so that only a small subset
 of nodes were affected at once, similar to staggered repair[1].

 * Possibility for active load balancing
 Load balancing in Cassandra currently involves moving a token to
 increase/reduce the amount of key-space for which a host is responsible.
 This only allows load balancing between neighbouring nodes, so it could
 involve moving more than one token just to redistribute a single overloaded
 node. Virtual nodes could allow load balancing on a much finer granularity,
 so heavily loaded portions of the key-space could be redistributed to
 lighter-loaded hosts by reassigning one or more virtual nodes.


 Implementing a virtual node scheme in Cassandra is not an insignificant
 amount of work, and it will touch a large amount of the codebase related to
 partitioning, placement, routing, gossip, and so on. We do believe that this
 is possible to do incrementally, and in such a way that there is an easy
 upgrade path for pre-virtual-node deployments.

 It would not however touch the storage layer. The virtual node concept is
 solely for partitioning and placement, not for segregating the data storage
 of the host, so all keys for all virtual nodes on a host would be stored in
 the same SSTables.

 We are not proposing the adoption of the same scheme used by Voldemort[2]
 and described in the Dynamo paper[3]. We feel this scheme is too different
 from Cassandra's current distribution model to be a viable target for
 incremental development. Their scheme also fixes the number of virtual nodes
 for the lifetime of the cluster, which can prove to be a ceiling to scaling
 the cluster

Update: Cassandra Virtual Nodes

2012-06-01 Thread Sam Overton
Hello cassandra-dev,

This is an update on the previous discussion on virtual nodes - for
anyone who missed that, an archive is here:
http://www.mail-archive.com/dev@cassandra.apache.org/msg03837.html

Work on our proposed design is being tracked on JIRA:
https://issues.apache.org/jira/browse/CASSANDRA-4119

We feel that our current implementation has sufficient functionality
to be useful in its own right, but is still a fairly self-contained
set of changes that can be incrementally improved - something which
was expressed as desirable in the last discussion. This is a good
point for a first round of review and to reach out to anyone
interested in testing or contributing.

== Obtaining the source ==

The easiest way to test these changes is to clone our github repo and
switch to the topic branch representing the top patch:

 git clone git://github.com/acunu/cassandra.git
 cd cassandra
 git checkout --track remotes/origin/p/4127/01_migration_path

Now just build with ant. To create a cluster with virtual-nodes
support just un-comment/edit the following parameter in
cassandra.yaml:
 num_tokens: 256

Make sure you have left initial_token blank. The node should
automatically assign itself this many tokens. You can view the token
assignment as usual with nodetool ring, but this becomes fairly
useless with any large number of hosts/tokens, which is why we have
added nodetool clusterinfo, which shows ring ownership without
cluttering the output with the token assignment.

If you want to test with a specific token assignment, initial_token
now supports a comma-separated list of tokens which will override the
num_tokens setting. This is just for convenience of testing - users
should no longer need to manually specify tokens as balancing is
automatic.
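
For example (placeholder token values shown; the exact values depend on your
partitioner and key distribution):

    # let the node assign itself 256 random tokens:
    num_tokens: 256

    # or, for testing, override num_tokens with an explicit comma-separated list:
    # initial_token: <token1>,<token2>,<token3>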

== Patches ==

The tickets which are in scope for this round of review are:
https://issues.apache.org/jira/browse/CASSANDRA-4121
https://issues.apache.org/jira/browse/CASSANDRA-4122
https://issues.apache.org/jira/browse/CASSANDRA-4125
https://issues.apache.org/jira/browse/CASSANDRA-4127

Each of these tickets has links to the corresponding patches. They
should be applied in increasing numerical order of ticket number.

These patches are based on another patch (currently pending review)
for CASSANDRA-3881, which was an existing issue blocking virtual
nodes.
https://issues.apache.org/jira/browse/CASSANDRA-3881

Links to the individual patches can also be found all in one place on
the github wiki:
https://github.com/acunu/cassandra/wiki/CASSANDRA-4119

The current patches in the series are as follows:

* p/4121/01_support_multiple_tokens_per_host
support multiple tokens per host in TokenMetadata (CASSANDRA-4121)
* p/4122/01_bootstrap_decommission
support bootstrapping a node into the ring with multiple tokens and
minor changes for decommission (CASSANDRA-4122)
* p/4122/02_remove_tokens
minor changes to support remove-token with multiple tokens (CASSANDRA-4122)
* p/4125/01_admin_tools
nodetool support (addition of nodetool clusterinfo and changes to
nodetool ring) (CASSANDRA-4125)
* p/4127/01_migration_path
support for migration from 1-token to multi-token per node (CASSANDRA-4127)

If you wish to contribute or work with a clone from github then it
would be advisable to familiarise yourself with TopGit, the tool we
have been using for branch-based patch queue management. We've written
up a tutorial here:
https://github.com/acunu/cassandra/wiki/TopGit-Tutorial

== What's left? ==

We haven't included any patches for tickets CASSANDRA-4123 and
CASSANDRA-4124, which relate to the replication strategy and repair.
Replication and repair currently just work with these patches,
without any additional changes required.

CASSANDRA-4126 relates to testing. We're running virtual-node builds
through our own test suites, but we will also be writing new tests.


I look forward to your questions and comments!


-- 
Sam Overton
Acunu | http://www.acunu.com | @acunu


Re: Update: Cassandra Virtual Nodes

2012-06-27 Thread Sam Overton
On 27 June 2012 01:45, Jonathan Ellis jbel...@gmail.com wrote:
 I've started commenting on the issues, but to take a higher level
 view, I'd say that it looks a lot more tractable than I thought at
 first, and props to you and Eric for pushing through with it.

Thanks!

 Besides the four (five including 3881) posted so far, I think 4123
 needs to be on the critical path for 1.2; otherwise, you have an
 excellent chance of seriously reducing durability by deploying vnodes.

Yes, I started on a replication strategy for 4123 which included DF
and found that it required a lot more code changes than I had
anticipated, so I set it aside. This is something we can revisit now
that the rest of the patches are there. I intend to write up some of
my observations from the first time around.

 In the meantime, 3881 can clearly be committed on its own.  I'd be
 most comfortable committing 4121, 4122, 4125, 4127, 4123 as a group.
 From what I've seen, review should proceed fairly quickly, so
 hopefully floating the earlier ones a bit longer won't add much pain.

Agreed, that makes sense.


-- 
Sam Overton
Acunu | http://www.acunu.com | @acunu


Re: Virtual nodes order preserving partitioning

2012-07-26 Thread Sam Overton
Hi,

On 25 July 2012 22:52, rektide rekt...@voodoowarez.com wrote:
 Hi all, forgive me if I've missed the discussion: what's the impact of 
 virtual nodes on
 those using order preserving partitioning  who need to do fast range-slice 
 queries?


The implementation currently in trunk allows you to specify tokens
manually in initial_token of the yaml. Assuming that you know your key
distribution you can manually select tokens which give you even load
distribution. Currently there is no support in trunk for moving
virtual nodes, but this is being worked on:
https://issues.apache.org/jira/browse/CASSANDRA-4445



-- 
Sam Overton
Acunu | http://www.acunu.com | @acunu


tracing improvements

2017-01-25 Thread Sam Overton
Hello cassandra-dev,

I would like to continue the momentum on improving Cassandra's tracing,
following Mick's excellent work on pluggable tracing and Zipkin support.

There are a couple of areas we can improve that would make tracing an even
more useful tool for cluster operators to diagnose ongoing issues.

The control we currently have over tracing is coarse and somewhat cumbersome.
Enabling tracing from the client for a specific query is fine for application
developers, particularly in an environment where Zipkin is being used to trace
all parts of the system and show an aggregated view. For an operator
investigating an issue, however, this does not always give us the control we
need in order to obtain relevant data. We often need to diagnose an issue
without the possibility of making any changes in the client, and often without
prior knowledge of which queries at the application level are experiencing
poor performance.

Our only other means of enabling tracing is nodetool settraceprobability,
which only affects a single node and gives us no control over precisely which
queries get traced. In practice, it is very difficult to find the relevant
queries that we want to investigate, so we have often resorted to bulk loading
the traces into an external tool for analysis, which seems sub-optimal when
Cassandra could reduce much of the friction.

I have a few proposals to improve tracing that I'd like to throw out to
the mailing list to get feedback before I start implementing.

1. Include trace_probability as a CF-level property, so sampled tracing can be
enabled on a per-CF basis, cluster-wide, by changing the CF property.
https://issues.apache.org/jira/browse/CASSANDRA-13154

2. Allow tracing at the CFS level. If we have a misbehaving host, it would be
useful to enable sampled tracing at the CFS layer on just that host, so that we
can investigate queries landing on that replica rather than only the queries
passing through it as a coordinator, as is currently possible.
https://issues.apache.org/jira/browse/CASSANDRA-13155

3. Add an interface allowing custom filters to decide whether tracing should
be enabled for a given query. This is a similar idea to CASSANDRA-9193 [1],
but following the same pattern that we have for IAuthenticator,
IEndpointSnitch, ConfigurationLoader et al., where the intention is that
useful default implementations are provided, but abstracted in such a way that
custom implementations can be written for deployments where a specific type of
functionality is required. This would then allow solutions such as
CASSANDRA-11012 [2] without any specific support needing to be written in
Cassandra.
https://issues.apache.org/jira/browse/CASSANDRA-13156
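
As a rough sketch of what such an interface could look like (the name
ITraceFilter and its signature are hypothetical, not an existing or proposed
Cassandra API):

    import java.net.InetAddress;
    import java.util.concurrent.ThreadLocalRandom;

    // Hypothetical filter interface, following the pluggable pattern of
    // IAuthenticator / IEndpointSnitch: deployments could supply their own
    // implementation, and a probabilistic default preserves today's behaviour.
    public interface ITraceFilter
    {
        // Decide whether the given query, issued by the given client, should be traced.
        boolean shouldTrace(InetAddress client, String keyspace, String table, String query);

        class Probabilistic implements ITraceFilter
        {
            private final double probability;

            public Probabilistic(double probability)
            {
                this.probability = probability;
            }

            public boolean shouldTrace(InetAddress client, String keyspace, String table, String query)
            {
                return ThreadLocalRandom.current().nextDouble() < probability;
            }
        }
    }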

Thanks for reading!
Regards,

Sam


[1] https://issues.apache.org/jira/browse/CASSANDRA-9193 Facility to write
dynamic code to selectively trigger trace or log for queries

[2] https://issues.apache.org/jira/browse/CASSANDRA-11012 Allow tracing CQL of
a specific client only, based on IP (range)