Add snitch and range movements section on Operations
Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo
Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/4d444022
Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/4d444022
Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/4d444022

Branch: refs/heads/trunk
Commit: 4d444022c3ace5928a15b2a54af0f95f4191c0d7
Parents: cad277b
Author: Paulo Motta <pauloricard...@gmail.com>
Authored: Tue Jun 14 17:54:38 2016 -0300
Committer: Sylvain Lebresne <sylv...@datastax.com>
Committed: Wed Jun 15 10:31:17 2016 +0200

----------------------------------------------------------------------
 doc/source/operations.rst | 164 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 160 insertions(+), 4 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/cassandra/blob/4d444022/doc/source/operations.rst
----------------------------------------------------------------------
diff --git a/doc/source/operations.rst b/doc/source/operations.rst
index 8228746..40b3e09 100644
--- a/doc/source/operations.rst
+++ b/doc/source/operations.rst
@@ -24,15 +24,171 @@ Replication Strategies

.. todo:: todo

Snitch
------

In Cassandra, the snitch has two functions:

- it teaches Cassandra enough about your network topology to route requests efficiently.
- it allows Cassandra to spread replicas around your cluster to avoid correlated failures. It does this by grouping
  machines into "datacenters" and "racks." Cassandra will do its best not to have more than one replica on the same
  "rack" (which may not actually be a physical location).

Dynamic snitching
^^^^^^^^^^^^^^^^^

The dynamic snitch monitors read latencies in order to avoid reading from hosts that have slowed down. It is
configured with the following properties in ``cassandra.yaml``:

- ``dynamic_snitch``: whether the dynamic snitch should be enabled or disabled.
- ``dynamic_snitch_update_interval_in_ms``: controls how often to perform the more expensive part of host score
  calculation.
- ``dynamic_snitch_reset_interval_in_ms``: controls how often to reset all host scores, allowing a bad host to
  possibly recover.
- ``dynamic_snitch_badness_threshold``: if set greater than zero and ``read_repair_chance`` is < 1.0, this allows
  'pinning' of replicas to hosts in order to increase cache capacity. The badness threshold controls how much worse
  the pinned host has to be before the dynamic snitch will prefer other replicas over it. This is expressed as a
  double which represents a percentage: a value of 0.2 means Cassandra will continue to prefer the static snitch
  values until the pinned host is 20% worse than the fastest.

An example ``cassandra.yaml`` excerpt combining these settings with the bootstrap options is sketched later in this
document, after the discussion of range streaming.

Snitch classes
^^^^^^^^^^^^^^

The ``endpoint_snitch`` parameter in ``cassandra.yaml`` should be set to the class that implements
``IEndpointSnitch``. It will be wrapped by the dynamic snitch and decides whether two endpoints are in the same data
center or on the same rack. Out of the box, Cassandra provides the following snitch implementations:

GossipingPropertyFileSnitch
  This should be your go-to snitch for production use. The rack and datacenter for the local node are defined in
  ``cassandra-rackdc.properties`` and propagated to other nodes via gossip. If ``cassandra-topology.properties``
  exists, it is used as a fallback, allowing migration from the PropertyFileSnitch.

SimpleSnitch
  Treats Strategy order as proximity. This can improve cache locality when disabling read repair. Only appropriate
  for single-datacenter deployments.


PropertyFileSnitch
  Proximity is determined by rack and data center, which are explicitly configured in
  ``cassandra-topology.properties``.

Ec2Snitch
  Appropriate for EC2 deployments in a single Region. Loads Region and Availability Zone information from the EC2 API.
  The Region is treated as the datacenter, and the Availability Zone as the rack. Only private IPs are used, so this
  will not work across multiple Regions.

Ec2MultiRegionSnitch
  Uses public IPs as ``broadcast_address`` to allow cross-region connectivity (thus, you should set seed addresses to
  the public IP as well). You will need to open the ``storage_port`` or ``ssl_storage_port`` on the public IP firewall
  (for intra-Region traffic, Cassandra will switch to the private IP after establishing a connection).

RackInferringSnitch
  Proximity is determined by rack and data center, which are assumed to correspond to the 3rd and 2nd octet of each
  node's IP address, respectively. Unless this happens to match your deployment conventions, this is best used as an
  example of writing a custom Snitch class, and is provided in that spirit.

Adding, replacing, moving and removing nodes
--------------------------------------------

Bootstrap
^^^^^^^^^

Adding new nodes is called "bootstrapping". The ``num_tokens`` parameter defines the number of virtual nodes (tokens)
the joining node will be assigned during bootstrap. The tokens define the sections of the ring (token ranges) the node
will become responsible for.

Token allocation
~~~~~~~~~~~~~~~~

With the default token allocation algorithm the new node will pick ``num_tokens`` random tokens to become responsible
for. Since tokens are distributed randomly, load distribution improves with a higher number of virtual nodes, but it
also increases token management overhead. The default of 256 virtual nodes should provide a reasonable load balance
with acceptable overhead.

On 3.0+ a new token allocation algorithm was introduced that allocates tokens based on the load of existing virtual
nodes for a given keyspace, and thus yields an improved load distribution with a lower number of tokens. To use this
approach, the new node must be started with the JVM option ``-Dcassandra.allocate_tokens_for_keyspace=<keyspace>``,
where ``<keyspace>`` is the keyspace from which the algorithm can find the load information to optimize token
assignment for.

Manual token assignment
"""""""""""""""""""""""

You may specify a comma-separated list of tokens manually with the ``initial_token`` ``cassandra.yaml`` parameter, and
if that is specified Cassandra will skip the token allocation process. This may be useful when doing token assignment
with an external tool or when restoring a node with its previous tokens.

Range streaming
~~~~~~~~~~~~~~~

After the tokens are allocated, the joining node will pick current replicas of the token ranges it will become
responsible for to stream data from. By default it will stream from the primary replica of each token range in order
to guarantee that data in the new node will be consistent with the current state.

If any replica is unavailable, the consistent bootstrap process will fail. To override this behavior and potentially
miss data from an unavailable replica, set the JVM flag ``-Dcassandra.consistent.rangemovement=false``.
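
As a concrete illustration of the snitch and bootstrap options discussed above, the following sketch shows what the
relevant ``cassandra.yaml`` entries might look like on a node that is about to join a cluster. The parameter names and
JVM flags are the ones described in this section, but the values are only illustrative (they correspond to the usual
shipped defaults at the time of writing) and ``my_keyspace`` is a hypothetical keyspace name::

    # cassandra.yaml excerpt (illustrative values only)
    endpoint_snitch: GossipingPropertyFileSnitch

    # dynamic snitch tuning
    dynamic_snitch: true
    dynamic_snitch_update_interval_in_ms: 100
    dynamic_snitch_reset_interval_in_ms: 600000
    dynamic_snitch_badness_threshold: 0.1

    # number of virtual nodes (tokens) to assign to this node during bootstrap
    num_tokens: 256

To opt in to the load-aware token allocation introduced in 3.0, the joining node would additionally be started with a
JVM option along the lines of the following (exactly how JVM options are passed depends on your installation, e.g.
``jvm.options``, ``cassandra-env.sh`` or your service scripts)::

    -Dcassandra.allocate_tokens_for_keyspace=my_keyspace
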
Resuming failed/hung bootstrap
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On 2.2+, if the bootstrap process fails, it is possible to resume bootstrap from the previously saved state by calling
``nodetool bootstrap resume``. If for some reason the bootstrap hangs or stalls, it may also be resumed by simply
restarting the node. In order to clean up bootstrap state and start fresh, you may set the JVM startup flag
``-Dcassandra.reset_bootstrap_progress=true``.

On lower versions, when the bootstrap process fails it is recommended to wipe the node (remove all the data) and
restart the bootstrap process again.

Manual bootstrapping
~~~~~~~~~~~~~~~~~~~~

It's possible to skip the bootstrapping process entirely and join the ring straight away by setting the hidden
parameter ``auto_bootstrap: false``. This may be useful when restoring a node from a backup or creating a new
datacenter.

Removing nodes
^^^^^^^^^^^^^^

You can take a node out of the cluster with ``nodetool decommission`` on a live node, or ``nodetool removenode`` (run
from any other machine) to remove a dead one. This will assign the ranges the old node was responsible for to other
nodes and replicate the appropriate data there. If decommission is used, the data will stream from the decommissioned
node. If removenode is used, the data will stream from the remaining replicas.

No data is removed automatically from the node being decommissioned, so if you want to put the node back into service
at a different token on the ring, it should be removed manually.

Moving nodes
^^^^^^^^^^^^

When ``num_tokens: 1`` it's possible to move the node position in the ring with ``nodetool move``. Moving is both a
convenience over and more efficient than decommission + bootstrap. After moving a node, ``nodetool cleanup`` should be
run to remove any unnecessary data.

Replacing a dead node
^^^^^^^^^^^^^^^^^^^^^

In order to replace a dead node, start Cassandra with the JVM startup flag
``-Dcassandra.replace_address_first_boot=<dead_node_ip>``. Once this property is enabled the node starts in a
hibernate state, during which all the other nodes will see this node as down.

The replacing node will now start to bootstrap the data from the rest of the nodes in the cluster. The main difference
from the normal bootstrapping of a new node is that the replacing node will not accept any writes during this phase.

Once the bootstrapping is complete the node will be marked "UP"; hinted handoff is relied on to make this node
consistent, since it did not accept writes during the bootstrap.

.. Note:: If the replacement process takes longer than ``max_hint_window_in_ms`` you **MUST** run repair to make the
   replaced node consistent again, since it missed ongoing writes during bootstrapping.

Monitoring progress
^^^^^^^^^^^^^^^^^^^

Bootstrap, replace, move and remove progress can be monitored using ``nodetool netstats``, which will show the
progress of the streaming operations.

Cleanup data after range movements
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As a safety measure, Cassandra does not automatically remove data from nodes that "lose" part of their token range due
to a range movement operation (bootstrap, move, replace). Run ``nodetool cleanup`` on the nodes that lost ranges to
the joining node when you are satisfied that the new node is up and working. If you do not do this the old data will
still be counted against the load on that node.
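
To tie the steps above together, the following shell sketch illustrates how the node replacement workflow described in
this section might look in practice. It is only an illustration: ``10.0.0.12`` is a made-up address for the dead node,
and how the JVM flag is passed (``jvm.options``, ``cassandra-env.sh``, packaging scripts) depends on your
installation::

    # 1. On the replacement machine, start Cassandra with the replace flag,
    #    e.g. by adding this line to jvm.options (hypothetical dead node address):
    #      -Dcassandra.replace_address_first_boot=10.0.0.12

    # 2. Watch streaming progress from any node in the cluster:
    nodetool netstats

    # 3. On 2.2+, if the bootstrap fails or stalls, resume it on the joining node:
    nodetool bootstrap resume

    # 4. If the replacement took longer than max_hint_window_in_ms, run repair on
    #    the replaced node so it catches up on the writes it missed:
    nodetool repair

    # 5. After any range movement, reclaim space on the nodes that lost ranges:
    nodetool cleanup

Repair
------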