Dear Wiki user, You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.
The "Operations" page has been changed by LeifNelson. http://wiki.apache.org/cassandra/Operations?action=diff&rev1=80&rev2=81 -------------------------------------------------- A Cassandra cluster always divides up the key space into ranges delimited by Tokens as described above, but additional replica placement is customizable via IReplicaPlacementStrategy in the configuration file. The standard strategies are * !RackUnawareStrategy: replicas are always placed on the next (in increasing Token order) N-1 nodes along the ring - * !RackAwareStrategy: replica 2 is placed in the first node along the ring the belongs in '''another''' data center than the first; the remaining N-2 replicas, if any, are placed on the first nodes along the ring in the '''same''' rack as the first + * !RackAwareStrategy: replica 2 is placed in the first node along the ring the belongs in '''another''' [[http://www.esds.co.in/data-center-services-india.php|data center]] than the first; the remaining N-2 replicas, if any, are placed on the first nodes along the ring in the '''same''' rack as the first Note that with !RackAwareStrategy, succeeding nodes along the ring should alternate data centers to avoid hot spots. For instance, if you have nodes A, B, C, and D in increasing Token order, and instead of alternating you place A and B in DC1, and C and D in DC2, then nodes C and A will have disproportionately more data on them because they will be the replica destination for every Token range in the other data center. @@ -87, +87 @@ Here's a python program which can be used to calculate new tokens for the nodes. There's more info on the subject at Ben Black's presentation at Cassandra Summit 2010. http://www.datastax.com/blog/slides-and-videos-cassandra-summit-2010 - def tokens(nodes): + . def tokens(nodes): - for x in xrange(nodes): + . for x in xrange(nodes): - print 2 ** 127 / nodes * x + . print 2 ** 127 / nodes * x There's also `nodetool loadbalance`: essentially a convenience over decommission + bootstrap, only instead of telling the target node where to move on the ring it will choose its location based on the same heuristic as Token selection on bootstrap. You should not use this as it doesn't rebalance the entire ring. @@ -104, +104 @@ Cassandra repairs data in two ways: 1. Read Repair: every time a read is performed, Cassandra compares the versions at each replica (in the background, if a low consistency was requested by the reader to minimize latency), and the newest version is sent to any out-of-date replicas. - 1. Anti-Entropy: when `nodetool repair` is run, Cassandra computes a Merkle tree of the data on that node, and compares it with the versions on other replicas, to catch any out of sync data that hasn't been read recently. This is intended to be run infrequently (e.g., weekly) since computing the Merkle tree is relatively expensive in disk i/o and CPU, since it scans ALL the data on the machine (but it is is very network efficient). + 1. Anti-Entropy: when `nodetool repair` is run, Cassandra computes a Merkle tree of the data on that node, and compares it with the versions on other replicas, to catch any out of sync data that hasn't been read recently. This is intended to be run infrequently (e.g., weekly) since computing the Merkle tree is relatively expensive in disk i/o and CPU, since it scans ALL the data on the machine (but it is is very network efficient). 
- Running `nodetool repair`:
- Like all nodetool operations, repair is non-blocking; it sends the command to the given node, but does not wait for the repair to actually finish. You can tell that repair is finished when (a) there are no active or pending tasks in the CompactionManager, and after that when (b) there are no active or pending tasks on o.a.c.concurrent.AE-SERVICE-STAGE, or o.a.c.service.StreamingService.
+ Running `nodetool repair`: Like all nodetool operations, repair is non-blocking; it sends the command to the given node, but does not wait for the repair to actually finish. You can tell that repair is finished when (a) there are no active or pending tasks in the CompactionManager, and after that when (b) there are no active or pending tasks on o.a.c.concurrent.AE-SERVICE-STAGE, or o.a.c.service.StreamingService.

Repair should be run against one machine at a time. (This limitation will be fixed in 0.7.)

=== Frequency of nodetool repair ===

- Unless your application performs no deletes, it is vital that production clusters run `nodetool repair` periodically on all nodes in the cluster. The hard requirement for repair frequency is the value used for GCGraceSeconds (see [[DistributedDeletes]]). Running `nodetool repair` often enough to guarantee that all nodes have performed a repair within a given period GCGraceSeconds long ensures that deletes are not "forgotten" in the cluster.
+ Unless your application performs no deletes, it is vital that production clusters run `nodetool repair` periodically on all nodes in the cluster. The hard requirement for repair frequency is the value used for GCGraceSeconds (see DistributedDeletes). Running `nodetool repair` often enough to guarantee that all nodes have performed a repair within a given period GCGraceSeconds long ensures that deletes are not "forgotten" in the cluster.

Consider how to schedule your repairs. A repair causes additional disk and CPU activity on the nodes participating in it, and it is typically a good idea to spread repairs out over time to minimize the chance of repairs running concurrently on many nodes. A sketch of such a schedule follows below.
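For example, a minimal sketch of a staggered schedule (the node names, safety margin, and helper function are all hypothetical; only `nodetool repair` itself is real):

{{{
# Hypothetical helper: stagger repairs so every node completes a repair
# well within GCGraceSeconds, instead of repairing all nodes at once.
GC_GRACE_SECONDS = 10 * 24 * 60 * 60   # the 10-day default
SAFETY_MARGIN = 0.5                    # finish the cycle in half the window

def repair_offsets(nodes):
    """Return (node, start offset in seconds) pairs for one repair cycle."""
    interval = GC_GRACE_SECONDS * SAFETY_MARGIN / len(nodes)
    return [(node, int(i * interval)) for i, node in enumerate(nodes)]

# e.g. feed these offsets into cron: one `nodetool repair` per node,
# spaced evenly instead of all at once
for node, offset in repair_offsets(["cass1", "cass2", "cass3", "cass4"]):
    print "repair %s at +%d hours" % (node, offset // 3600)
}}}

With four nodes and the default GCGraceSeconds, this starts one repair every 30 hours, so the whole cluster is repaired twice per grace period.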
==== Dealing with the consequences of nodetool repair not running within GCGraceSeconds ====

- If `nodetool repair` has not been run often enough to the point that GCGraceSeconds has passed, you risk forgotten deletes (see [[DistributedDeletes]]). In addition to data popping up that has been deleted, you may see inconsistencies in data returned from different nodes that will not self-heal by read repair or further `nodetool repair`. Some further details on this latter effect are documented in [[https://issues.apache.org/jira/browse/CASSANDRA-1316|CASSANDRA-1316]].
+ If `nodetool repair` has not been run often enough to the point that GCGraceSeconds has passed, you risk forgotten deletes (see DistributedDeletes). In addition to data popping up that has been deleted, you may see inconsistencies in data returned from different nodes that will not self-heal by read repair or further `nodetool repair`. Some further details on this latter effect are documented in [[https://issues.apache.org/jira/browse/CASSANDRA-1316|CASSANDRA-1316]].

There are at least three ways to deal with this scenario.

 1. Treat the node in question as failed, and replace it as described further below.
- 2. To minimize the number of forgotten deletes, first increase GCGraceSeconds across the cluster (rolling restart required), perform a full repair on all nodes, and then change GCGraceSeconds back again. This has the advantage of ensuring tombstones spread as much as possible, minimizing the amount of data that may "pop back up" (forgotten deletes).
+ 1. To minimize the number of forgotten deletes, first increase GCGraceSeconds across the cluster (rolling restart required), perform a full repair on all nodes, and then change GCGraceSeconds back again. This has the advantage of ensuring tombstones spread as much as possible, minimizing the amount of data that may "pop back up" (forgotten deletes).
- 3. Yet another option, which will result in more forgotten deletes than the previous suggestion but is easier to do, is to ensure `nodetool repair` has been run on all nodes and then perform a compaction to expire tombstones. Following this, read repair and regular `nodetool repair` should cause the cluster to converge.
+ 1. Yet another option, which will result in more forgotten deletes than the previous suggestion but is easier to do, is to ensure `nodetool repair` has been run on all nodes and then perform a compaction to expire tombstones. Following this, read repair and regular `nodetool repair` should cause the cluster to converge.

=== Handling failure ===

If a node goes down and comes back up, the ordinary repair mechanisms will be adequate to deal with any inconsistent data. Remember though that if a node misses updates and is not repaired for longer than your configured GCGraceSeconds (default: 10 days), it could have missed remove operations permanently. Unless your application performs no removes, you should wipe its data directory, re-bootstrap it, and removetoken its old entry in the ring (see below).

@@ -147, +144 @@

{{{
Exception in thread "main" java.io.IOException: Cannot run program "ln": java.io.IOException: error=12, Cannot allocate memory
}}}

This is caused by the operating system trying to allocate the child "ln" process a memory space as large as the parent process (the cassandra server), even though '''it's not going to use it'''. So if you have a machine with 8GB of RAM and no swap, and you gave 6GB to the cassandra server, it will fail during this because the operating system wants 12GB of virtual memory before allowing you to create the process.

This error can be worked around by either:

@@ -156, +153 @@

 OR
 * creating a swap file, snapshotting, removing swap file
+ OR
+ * turning on "memory overcommit"

To restore a snapshot:

@@ -199, +198 @@

Running `nodetool cfstats` can provide an overview of each Column Family, along with important metrics to graph for your cluster. For folks who prefer not to deal with JMX clients, there is a JMX-to-REST bridge available at http://code.google.com/p/polarrose-jmx-rest-bridge/

- Important metrics to watch on a per-Column Family basis would be: '''Read Count, Read Latency, Write Count and Write Latency'''. '''Pending Tasks''' tell you if things are backing up. These metrics can also be exposed using any JMX client such as `jconsole`. (See also [[http://simplygenius.com/2010/08/jconsole-via-socks-ssh-tunnel.html]] for how to proxy JConsole to firewalled machines.)
+ Important metrics to watch on a per-Column Family basis would be: '''Read Count, Read Latency, Write Count and Write Latency'''. '''Pending Tasks''' tell you if things are backing up. These metrics can also be exposed using any JMX client such as `jconsole`. (See also http://simplygenius.com/2010/08/jconsole-via-socks-ssh-tunnel.html for how to proxy JConsole to firewalled machines.)
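As an illustration of collecting those numbers for graphing, here is a hedged sketch that scrapes per-Column Family metrics out of `nodetool cfstats` (the output format and nodetool's host flag vary between Cassandra versions, so treat this as a starting point, not a contract):

{{{
# Hypothetical monitoring snippet: pull per-Column Family counters and
# latencies out of `nodetool cfstats` output for graphing/alerting.
import subprocess

WATCHED = ("Read Count", "Read Latency", "Write Count", "Write Latency")

def cfstats(host):
    out = subprocess.Popen(["nodetool", "-host", host, "cfstats"],
                           stdout=subprocess.PIPE).communicate()[0]
    stats, cf = {}, None
    for line in out.splitlines():
        line = line.strip()
        if line.startswith("Column Family:"):
            cf = line.split(":", 1)[1].strip()
            stats[cf] = {}
        elif cf and ":" in line:
            key, value = [part.strip() for part in line.split(":", 1)]
            if key in WATCHED:
                stats[cf][key] = value
    return stats

for cf, metrics in sorted(cfstats("localhost").items()):
    print cf, metrics
}}}

Feed the resulting numbers into whatever graphing system you already use; the same values are available over JMX if you prefer that route.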
You can also use jconsole's MBeans tab to look at PendingTasks for thread pools. If you see one particular thread backing up, this can give you an indication of a problem. One example would be ROW-MUTATION-STAGE indicating that write requests are arriving faster than they can be handled. A more subtle example is the FLUSH stages: if these start backing up, cassandra is accepting writes into memory fast enough, but the sort-and-write-to-disk stages are falling behind.

@@ -229, +228 @@

  FLUSH-WRITER-POOL                 0         0            218
  HINTED-HANDOFF-POOL               0         0            154
}}}

== Monitoring with MX4J ==

- mx4j provides an HTML and HTTP interface to JMX. Starting from version 0.7.0, cassandra lets you hook up mx4j very easily.
+ mx4j provides an HTML and HTTP interface to JMX. Starting from version 0.7.0, cassandra lets you hook up mx4j very easily. To enable mx4j on a Cassandra node:
- To enable mx4j on a Cassandra node:

 * Download mx4j-tools.jar from http://mx4j.sourceforge.net/
 * Add mx4j-tools.jar to the classpath (e.g. under lib/)
 * Start cassandra
- * In the log you should see a message such as Http``Adaptor started on port 8081
+ * In the log you should see a message such as HttpAdaptor started on port 8081
 * To choose a different port (8081 is the default) or a different listen address (0.0.0.0 is not the default), edit conf/cassandra-env.sh and uncomment #MX4J_ADDRESS="-Dmx4jaddress=0.0.0.0" and #MX4J_PORT="-Dmx4jport=8081"

Now browse to http://cassandra:8081/ and use the HTML interface.
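A trivial, hypothetical sanity check that the adaptor is answering (the host name and port are just the defaults from the list above):

{{{
# Minimal liveness check for the mx4j HTTP interface (illustrative only).
import urllib2

MX4J_URL = "http://cassandra:8081/"   # match your MX4J_ADDRESS / MX4J_PORT

try:
    page = urllib2.urlopen(MX4J_URL).read()
    print "mx4j is up; served %d bytes of HTML" % len(page)
except IOError, e:
    print "mx4j not reachable: %s" % e
}}}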