Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.

The "Operations_ZH" page has been changed by jian.huang.
http://wiki.apache.org/cassandra/Operations_ZH

--------------------------------------------------

New page:
== Hardware ==
See [[CassandraHardware_ZH|Cassandra hardware]]

== Performance tuning ==
See [[PerformanceTuning|Cassandra performance tuning]]

== Schema management ==
See [[LiveSchemaUpdates|Live schema updates]] [refers to functionality in 0.7]

== Ring management ==
Each Cassandra server [node] is assigned a unique Token that determines which keys it is the first replica for. If you sort all nodes' Tokens, the Range of keys each node is responsible for is (!PreviousToken, !MyToken], that is, from its predecessor's Token (exclusive) to its own Token (inclusive). The node with the lowest Token is responsible both for all keys less than that Token and for all keys greater than the largest Token; this is called a "wrapping Range."

(Note that there is nothing special about being the "primary" replica, in the sense of being a single point of failure.)

With the default random partitioner, integer Tokens from 0 to 2**127 are assigned at random. Keys are converted to Tokens in this range via MD5 hashing for comparison. (Thus keys are always convertible to Tokens, but the reverse is not always true.)
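
As a rough illustration (a Python sketch, not Cassandra's code; the node names and tokens are invented), a key is hashed to a Token and its first replica follows the (!PreviousToken, !MyToken] rule, with the lowest-token node taking the wrapping Range:

{{{
import hashlib

RING_SIZE = 2 ** 127

def key_to_token(key):
    # MD5-hash the key into the 0 .. 2**127 token space (conceptual only)
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest, "big") % RING_SIZE

def first_replica(token, ring):
    # ring: list of (node_token, node_name) sorted by token.
    # Each node owns (previous_token, its_token]; the lowest-token node
    # also owns the wrapping Range above the highest token.
    for node_token, name in ring:
        if token <= node_token:
            return name
    return ring[0][1]  # wrapped past the highest token

ring = sorted([(4 * 10**37, "node1"), (8 * 10**37, "node2"), (14 * 10**37, "node3")])
print(first_replica(key_to_token("some-row-key"), ring))
}}}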

=== Token selection ===
With random key assignment, a strong hash function distributes keys evenly around the Token ring, but the distribution can still be unbalanced if your Tokens do not divide the range evenly, so you should specify the !InitialToken of your first nodes as `i * (2**127 / N)` for i = 1 .. N.
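
For example, a quick way to print these evenly spaced tokens (a small Python sketch; any equivalent script works) is:

{{{
def initial_tokens(node_count):
    # Evenly spaced tokens: i * (2**127 / N) for i = 1 .. N
    return [i * (2 ** 127 // node_count) for i in range(1, node_count + 1)]

for i, token in enumerate(initial_tokens(4), start=1):
    print("node %d: InitialToken %d" % (i, token))
}}}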

With order preserving partitioners, your key distribution will be 
application-dependent.  You should still take your best guess at specifying 
initial tokens (guided by sampling actual data, if possible), but you will be 
more dependent on active load balancing (see below) and/or adding new nodes to 
hot spots.

Once data is placed on the cluster, the partitioner may not be changed without 
wiping and starting over.

=== Replication ===
A Cassandra cluster always divides up the key space into ranges delimited by 
Tokens as described above, but additional replica placement is customizable via 
IReplicaPlacementStrategy in the configuration file.  The standard strategies 
are

 * !RackUnawareStrategy: replicas are always placed on the next (in increasing 
Token order) N-1 nodes along the ring
 * !RackAwareStrategy: replica 2 is placed on the first node along the ring that belongs in '''another''' data center than the first; the remaining N-2 replicas, if any, are placed on the first nodes along the ring in the '''same''' rack as the first

Note that with !RackAwareStrategy, succeeding nodes along the ring should 
alternate data centers to avoid hot spots.  For instance, if you have nodes A, 
B, C, and D in increasing Token order, and instead of alternating you place A 
and B in DC1, and C and D in DC2, then nodes C and A will have 
disproportionately more data on them because they will be the replica 
destination for every Token range in the other data center.

 * The corollary to this is, if you want to start with a single DC and add 
another later, when you add the second DC you should add as many nodes as you 
have in the first rather than adding a node or two at a time gradually.
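
A rough model of these placement rules in Python (a conceptual sketch, not Cassandra's implementation; the ring order, data-center map, and replication factor are invented, and for simplicity the same data center stands in for the "same rack" part of !RackAwareStrategy):

{{{
def rack_unaware(ring, primary_index, rf):
    # RackUnawareStrategy: primary plus the next RF-1 nodes in Token order
    return [ring[(primary_index + i) % len(ring)] for i in range(rf)]

def rack_aware(ring, primary_index, rf, dc_of):
    # RackAwareStrategy (simplified): second replica is the first node along
    # the ring in another data center; the rest are the next nodes in the
    # same data center as the primary (standing in for "same rack" here).
    primary = ring[primary_index]
    replicas = [primary]
    for i in range(1, len(ring)):
        node = ring[(primary_index + i) % len(ring)]
        if dc_of[node] != dc_of[primary]:
            replicas.append(node)
            break
    for i in range(1, len(ring)):
        if len(replicas) >= rf:
            break
        node = ring[(primary_index + i) % len(ring)]
        if node not in replicas and dc_of[node] == dc_of[primary]:
            replicas.append(node)
    return replicas

# Four hypothetical nodes in increasing Token order, alternating data centers:
ring = ["A", "C", "B", "D"]
dc_of = {"A": "DC1", "B": "DC1", "C": "DC2", "D": "DC2"}
print(rack_unaware(ring, 0, 3))          # ['A', 'C', 'B']
print(rack_aware(ring, 0, 3, dc_of))     # ['A', 'C', 'B']
}}}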

Replication factor is not really intended to be changed in a live cluster 
either, but increasing it may be done if you (a) read at 
ConsistencyLevel.QUORUM or ALL (depending on your existing replication factor) 
to make sure that a replica that actually has the data is consulted, (b) are 
willing to accept downtime while anti-entropy repair runs (see below), or (c) 
are willing to live with some clients potentially being told no data exists if 
they read from the new replica location(s) until repair is done.

The same options apply to changing replication strategy.

Reducing replication factor is easily done and only requires running cleanup 
afterwards to remove extra replicas.

=== Network topology ===
Besides datacenters, you can also tell Cassandra which nodes are in the same 
rack within a datacenter.  Cassandra will use this to route both reads and data 
movement for Range changes to the nearest replicas.  This is configured by a 
user-pluggable !EndpointSnitch class in the configuration file.

!EndpointSnitch is related to, but distinct from, replication strategy itself: 
!RackAwareStrategy needs a properly configured Snitch to place replicas 
correctly, but even absent a Strategy that cares about datacenters, the rest of 
Cassandra will still be location-sensitive.

There is an example of a custom Snitch implementation in 
http://svn.apache.org/repos/asf/cassandra/tags/cassandra-0.6.1/contrib/property_snitch/.

== Range changes ==
=== Bootstrap ===
Adding new nodes is called "bootstrapping."

To bootstrap a node, turn !AutoBootstrap on in the configuration file, and 
start it.

If you explicitly specify an !InitialToken in the configuration, the new node 
will bootstrap to that position on the ring.  Otherwise, it will pick a Token 
that will give it half the keys from the node with the most disk space used, 
that does not already have another node bootstrapping into its Range.
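
Conceptually, taking "half the keys" from the most-loaded node means splitting that node's Range at its midpoint, as in this Python sketch (illustrative only; the real selection also skips Ranges that another node is already bootstrapping into, and the example tokens are made up):

{{{
RING_SIZE = 2 ** 127

def bootstrap_token(previous_token, loaded_node_token):
    # Midpoint of the Range (previous_token, loaded_node_token]:
    # with the random partitioner this takes roughly half of that node's keys.
    width = (loaded_node_token - previous_token) % RING_SIZE
    return (previous_token + width // 2) % RING_SIZE

print(bootstrap_token(10 * 10**36, 90 * 10**36))  # 50 * 10**36
}}}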

Important things to note:

 1. You should wait long enough for all the nodes in your cluster to become 
aware of the bootstrapping node via gossip before starting another bootstrap.  
The new node will log "Bootstrapping" when this is safe, 2 minutes after 
starting.  (90s to make sure it has accurate load information, and 30s waiting 
for other nodes to start sending it inserts happening in its to-be-assumed part 
of the token ring.)
 1. Relating to point 1, you can only bootstrap N nodes at a time with automatic token picking, where N is the size of the existing cluster. If you need to more than double the size of your cluster, you have to wait for the first N nodes to finish, so that your cluster is size 2N, before bootstrapping more nodes. So if your current cluster is 5 nodes and you want to add 7 nodes, bootstrap 5 and let those finish before bootstrapping the last two.
 1. As a safety measure, Cassandra does not automatically remove data from 
nodes that "lose" part of their Token Range to a newly added node.  Run 
`nodetool cleanup` on the source node(s) (neighboring nodes that shared the 
same subrange) when you are satisfied the new node is up and working. If you do 
not do this the old data will still be counted against the load on that node 
and future bootstrap attempts at choosing a location will be thrown off.
 1. When bootstrapping a new node, existing nodes have to divide the key space 
before beginning replication.  This can take awhile, so be patient.
 1. During bootstrap, a node will drop the Thrift port and will not be 
accessible from `nodetool`.
 1. Bootstrap can take many hours when a lot of data is involved.  See 
[[Streaming]] for how to monitor progress.

Cassandra is smart enough to transfer data from the nearest source node(s), if 
your !EndpointSnitch is configured correctly.  So, the new node doesn't need to 
be in the same datacenter as the primary replica for the Range it is 
bootstrapping into, as long as another replica is in the datacenter with the 
new one.

Bootstrap progress can be monitored using `nodetool` with the `streams` 
argument.

During bootstrap, `nodetool` may report that the new node is neither receiving nor sending any streams. This is because the sending node first copies out locally the data it will send to the receiving one, which can be seen on the sending node through the "AntiCompacting... AntiCompacted" log messages.

== Moving or Removing nodes ==
=== Removing nodes entirely ===
You can take a node out of the cluster with `nodetool decommission` (run against the live node itself), or with `nodetool removetoken` (run against any other machine) to remove a dead one.  
This will assign the ranges the old node was responsible for to other nodes, 
and replicate the appropriate data there. If `decommission` is used, the data 
will stream from the decommissioned node. If `removetoken` is used, the data 
will stream from the remaining replicas.

No data is removed automatically from the node being decommissioned, so if you 
want to put the node back into service at a different token on the ring, it 
should be removed manually.

=== Moving nodes ===
`nodetool move`: move the target node to a given Token. Moving is essentially a 
convenience over decommission + bootstrap.

As with bootstrap, see [[Streaming]] for how to monitor progress.

=== Load balancing ===
`nodetool loadbalance`: also essentially a convenience over decommission + 
bootstrap, only instead of telling the target node where to move on the ring it 
will choose its location based on the same heuristic as Token selection on 
bootstrap.

The status of move and balancing operations can be monitored using `nodetool` 
with the `streams` argument.

== Consistency ==
Cassandra allows clients to specify the desired consistency level on reads and 
writes.  (See [[API]].)  If R + W > N, where R, W, and N are respectively the 
read replica count, the write replica count, and the replication factor, all 
client reads will see the most recent write.  Otherwise, readers '''may''' see 
older versions, for periods of typically a few ms; this is called "eventual 
consistency."  See 
http://www.allthingsdistributed.com/2008/12/eventually_consistent.html and 
http://queue.acm.org/detail.cfm?id=1466448 for more.
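
For example, with a replication factor of N = 3, QUORUM writes (W = 2) and QUORUM reads (R = 2) give R + W = 4 > 3, so every read overlaps the latest write on at least one replica. A minimal Python sketch of the rule (not part of Cassandra):

{{{
def reads_see_latest_write(r, w, n):
    # R + W > N guarantees the read set and write set share at least one replica
    return r + w > n

print(reads_see_latest_write(2, 2, 3))  # True:  QUORUM writes + QUORUM reads, N=3
print(reads_see_latest_write(1, 1, 3))  # False: ONE + ONE may read a stale replica
}}}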

See below about consistent backups.

=== Repairing missing or inconsistent data ===
Cassandra repairs data in two ways:

 1. Read Repair: every time a read is performed, Cassandra compares the 
versions at each replica (in the background, if a low consistency was requested 
by the reader to minimize latency), and the newest version is sent to any 
out-of-date replicas.
 1. Anti-Entropy: when `nodetool repair` is run, Cassandra computes a Merkle 
tree of the data on that node, and compares it with the versions on other 
replicas, to catch any out of sync data that hasn't been read recently.  This 
is intended to be run infrequently (e.g., weekly) since computing the Merkle tree is relatively expensive in disk I/O and CPU, because it scans ALL the data on the machine (but it is very network efficient); a rough sketch of the idea follows this list.
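
As a rough illustration (a simplified Python sketch, not Cassandra's Merkle tree code; the bucketing and sample rows are invented), each replica hashes buckets of its rows and only buckets whose digests differ need to be exchanged and repaired:

{{{
import hashlib

def bucket_digests(rows, buckets=8):
    # Hash each bucket of rows into one digest (the "leaves" of a Merkle tree)
    leaves = [hashlib.md5() for _ in range(buckets)]
    for key in sorted(rows):
        leaves[hash(key) % buckets].update(("%s=%s" % (key, rows[key])).encode())
    return [leaf.hexdigest() for leaf in leaves]

def out_of_sync_buckets(rows_a, rows_b, buckets=8):
    # Only buckets with differing digests need their data streamed and repaired
    a, b = bucket_digests(rows_a, buckets), bucket_digests(rows_b, buckets)
    return [i for i in range(buckets) if a[i] != b[i]]

replica1 = {"k1": "v1", "k2": "v2", "k3": "v3"}
replica2 = {"k1": "v1", "k2": "stale", "k3": "v3"}
print(out_of_sync_buckets(replica1, replica2))  # the bucket holding k2
}}}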

Running `nodetool repair`:
Like all nodetool operations, repair is non-blocking; it sends the command to 
the given node, but does not wait for the repair to actually finish.  You can 
tell that repair is finished when (a) there are no active or pending tasks in 
the CompactionManager, and after that when (b) there are no active or pending 
tasks on AE-SERVICE-STAGE.

Repair should be run against one machine at a time.  (This limitation will be 
fixed in 0.7.)

=== Handling failure ===
If a node goes down and comes back up, the ordinary repair mechanisms will be 
adequate to deal with any inconsistent data.  Remember though that if a node is 
down longer than your configured GCGraceSeconds (default: 10 days), it could 
have missed remove operations permanently.  Unless your application performs no 
removes, you should wipe its data directory, re-bootstrap it, and removetoken 
its old entry in the ring (see below).
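
To see why, here is a simplified Python sketch of the tombstone lifecycle (the timeline, helper names, and data are invented): a delete is recorded as a tombstone, tombstones older than GCGraceSeconds are dropped at compaction, and a replica that was down the whole time never learned of the delete, so a later repair resurrects the row.

{{{
GC_GRACE_SECONDS = 10 * 24 * 3600               # default grace period: 10 days

def compact(replica, now):
    # Garbage-collect tombstones (value None) older than GCGraceSeconds
    return {k: (v, ts) for k, (v, ts) in replica.items()
            if not (v is None and now - ts > GC_GRACE_SECONDS)}

def naive_repair(healthy, returning):
    # Keep the newest version of each key seen on either replica
    merged = dict(returning)
    for k, (v, ts) in healthy.items():
        if k not in merged or ts > merged[k][1]:
            merged[k] = (v, ts)
    return merged

replica_a = {"row1": ("data", 0)}               # day 0: both replicas have the row
replica_b = {"row1": ("data", 0)}               # ...then replica_b goes down
replica_a["row1"] = (None, 1 * 24 * 3600)       # day 1: delete -> tombstone on replica_a
replica_a = compact(replica_a, 12 * 24 * 3600)  # day 12: tombstone GC'd after grace period
print(naive_repair(replica_a, replica_b))       # {'row1': ('data', 0)} -- resurrected
}}}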

If a node goes down entirely, then you have two options:

 1. (Recommended approach) Bring up the replacement node with a new IP address, 
and !AutoBootstrap set to true in storage-conf.xml. This will place the 
replacement node in the cluster and find the appropriate position 
automatically. Then the bootstrap process begins. While this process runs, the 
node will not receive reads until finished. Once this process is finished on 
the replacement node, run `nodetool removetoken` once, supplying the token of 
the dead node, and `nodetool cleanup` on each node.
  * You can obtain the dead node's token by running `nodetool ring` on any live node, unless there was some kind of outage and the others came up but not the down one -- in that case, you can retrieve the token from the live nodes' system tables.

 1. (Alternative approach) Bring up a replacement node with the same IP and 
token as the old, and run `nodetool repair`. Until the repair process is 
complete, clients reading only from this node may get no data back.  Using a 
higher !ConsistencyLevel on reads will avoid this.

The reason why you run `nodetool cleanup` on all live nodes is to remove old 
Hinted Handoff writes stored for the dead node.

== Backing up data ==
Cassandra can snapshot data while online using `nodetool snapshot`.  You can 
then back up those snapshots using any desired system, although leaving them 
where they are is probably the option that makes the most sense on large 
clusters.

With some combinations of operating system/JVM you may receive an error related to the inability to create a process during snapshotting, such as this on Linux:

{{{
Exception in thread "main" java.io.IOException: Cannot run program "ln": 
java.io.IOException: error=12, Cannot allocate memory
}}}
Remain calm. The operating system is trying to allocate for the "ln" process a 
memory space as large as the parent process (the cassandra server), even if 
'''it's not going to use it'''. So if you have a machine with 8GB of RAM and no 
swap, and you gave 6 to the cassandra server, it will fail during this because 
the operating system will want 12 GB of memory before allowing you to create the 
process.

This can be worked around, depending on the operating system, either by creating a swap file, taking the snapshot, and then turning the swap off again, or by turning on "memory overcommit". Since the child process's memory is initially the same as the parent's, the operating system will not actually use any new memory until the child performs its `exec("ln")` call; it just refers to the existing pages, so everything works.

Currently, only flushed data is snapshotted (not data that only exists in the 
commitlog).  Run `nodetool flush` first and wait for that to complete, to make 
sure you get '''all''' data in the snapshot.

To revert to a snapshot, shut down the node, clear out the old commitlog and 
sstables, and move the sstables from the snapshot location to the live data 
directory.

=== Consistent backups ===
You can get an eventually consistent backup by flushing all nodes and 
snapshotting; no individual node's backup is guaranteed to be consistent but if 
you restore from that snapshot then clients will get eventually consistent 
behavior as usual.

There is no such thing as a consistent view of the data in the strict sense, 
except in the trivial case of writes with consistency level = ALL.

=== Import / export ===
As an alternative to taking snapshots it's possible to export SSTables to JSON 
format using the `bin/sstable2json` command:

{{{
Usage: sstable2json [-f outfile] <sstable> [-k key [-k key [...]]]
}}}
`bin/sstable2json` accepts as a required argument the full path to an SSTable data file (files ending in -Data.db), and an optional argument for an output file (by default, output is written to stdout). You can also pass the names of specific keys using the `-k` argument to limit what is exported.

Note: If you are not running the exporter on in-place SSTables, there are a 
couple of things to keep in mind.

 1. The corresponding configuration must be present (same as it would be to run 
a node).
 1. SSTables are expected to be in a directory named for the keyspace (same as 
they would be on a production node).

JSON exported SSTables can be "imported" to create new SSTables using 
`bin/json2sstable`:

{{{
Usage: json2sstable -K keyspace -c column_family <json> <sstable>
}}}
`bin/json2sstable` takes arguments for keyspace and column family names, and 
full paths for the JSON input file and the destination SSTable file name.

You can also import pre-serialized rows of data using the BinaryMemtable 
interface.  This is useful for importing via Hadoop or another source where you 
want to do some preprocessing of the data to import.

NOTE: Starting with version 0.7, json2sstable and sstable2json must be run in 
such a way that the schema can be loaded from system tables.  This means that 
cassandra.yaml must be found in the classpath and refer to valid storage 
directories.

== Monitoring ==
Cassandra exposes internal metrics as JMX data.  This is a common standard in 
the JVM world; OpenNMS, Nagios, and Munin at least offer some level of JMX 
support. The specifics of the JMX Interface are documented at JmxInterface.

Running `nodetool cfstats` can provide an overview of each Column Family, and important metrics to graph for your cluster. For those who prefer to deal with non-JMX clients, there is a JMX-to-REST bridge available at http://code.google.com/p/polarrose-jmx-rest-bridge/

Important metrics to watch on a per-Column Family basis would be: '''Read 
Count, Read Latency, Write Count and Write Latency'''. '''Pending Tasks''' tell 
you if things are backing up. These metrics can also be exposed using any JMX 
client such as `jconsole`.

You can also use jconsole and its MBeans tab to look at PendingTasks for the thread pools. If you see one particular thread pool backing up, this can give you an indication of a problem. One example would be ROW-MUTATION-STAGE, indicating that write requests are arriving faster than they can be handled. A more subtle example is the FLUSH stages: if these start backing up, Cassandra is accepting writes into memory quickly enough, but the sort-and-write-to-disk stages are falling behind.

If you are seeing a lot of tasks being built up, your hardware or configuration 
tuning is probably the bottleneck.

Running `nodetool tpstats` will dump all of those thread pools to the console if you don't want to use jconsole. Example:

{{{
Pool Name                    Active   Pending      Completed
FILEUTILS-DELETE-POOL             0         0            119
MESSAGING-SERVICE-POOL            3         4       81594002
STREAM-STAGE                      0         0              3
RESPONSE-STAGE                    0         0       48353537
ROW-READ-STAGE                    0         0          13754
LB-OPERATIONS                     0         0              0
COMMITLOG                         1         0       78080398
GMFD                              0         0        1091592
MESSAGE-DESERIALIZER-POOL         0         0      126022919
LB-TARGET                         0         0              0
CONSISTENCY-MANAGER               0         0           2899
ROW-MUTATION-STAGE                1         2       81719765
MESSAGE-STREAMING-POOL            0         0            129
LOAD-BALANCER-STAGE               0         0              0
FLUSH-SORTER-POOL                 0         0            218
MEMTABLE-POST-FLUSHER             0         0            218
COMPACTION-POOL                   0         0            464
FLUSH-WRITER-POOL                 0         0            218
HINTED-HANDOFF-POOL               0         0            154
}}}
