Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.

The "Operations_ZH" page has been changed by jian.huang.
http://wiki.apache.org/cassandra/Operations_ZH

--------------------------------------------------

New page:
== Hardware ==
See [[CassandraHardware_ZH|Cassandra hardware]]

== Performance tuning ==
See [[PerformanceTuning|Cassandra performance tuning]]

== Schema management ==
See [[LiveSchemaUpdates|Live schema updates]] [refers to functionality in 0.7]

== Ring management ==
Each Cassandra server [node] is assigned a unique Token that determines which keys it is the first replica for. If you sort all nodes' Tokens, the Range of keys each node is responsible for is (!PreviousToken, !MyToken], that is, from its predecessor's Token (exclusive) to its own Token (inclusive). The node with the lowest Token is responsible both for all keys less than that Token and for all keys greater than the largest Token; this is called a "wrapping Range."

(Note that there is nothing special about being the "primary" replica, in the sense of being a single point of failure.)

With the default random partitioner, integer Tokens from 0 to 2**127 are assigned at random. Keys are converted to Tokens in this range via MD5 hashing for comparison. (Thus keys are always convertible to Tokens, but the reverse is not always true.)
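
As a rough illustration (a Python sketch, not Cassandra's code; the node names and tokens are invented), a key is hashed to a Token and its first replica follows the (!PreviousToken, !MyToken] rule, with the lowest-token node taking the wrapping Range:

{{{
import hashlib

RING_SIZE = 2 ** 127

def key_to_token(key):
    # MD5-hash the key into the 0 .. 2**127 token space (conceptual only)
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest, "big") % RING_SIZE

def first_replica(token, ring):
    # ring: list of (node_token, node_name) sorted by token.
    # Each node owns (previous_token, its_token]; the lowest-token node
    # also owns the wrapping Range above the highest token.
    for node_token, name in ring:
        if token <= node_token:
            return name
    return ring[0][1]  # wrapped past the highest token

ring = sorted([(4 * 10**37, "node1"), (8 * 10**37, "node2"), (14 * 10**37, "node3")])
print(first_replica(key_to_token("some-row-key"), ring))
}}}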

=== Token selection ===
With random key assignment, a strong hash function distributes keys evenly around the Token ring, but the distribution can still be unbalanced if your Tokens do not divide the range evenly, so you should specify the !InitialToken of your first nodes as `i * (2**127 / N)` for i = 1 .. N.
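
For example, a quick way to print these evenly spaced tokens (a small Python sketch; any equivalent script works) is:

{{{
def initial_tokens(node_count):
    # Evenly spaced tokens: i * (2**127 / N) for i = 1 .. N
    return [i * (2 ** 127 // node_count) for i in range(1, node_count + 1)]

for i, token in enumerate(initial_tokens(4), start=1):
    print("node %d: InitialToken %d" % (i, token))
}}}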

With order preserving partitioners, your key distribution will be 
application-dependent.  You should still take your best guess at specifying 
initial tokens (guided by sampling actual data, if possible), but you will be 
more dependent on active load balancing (see below) and/or adding new nodes to 
hot spots.

Once data is placed on the cluster, the partitioner may not be changed without 
wiping and starting over.

=== Replication ===
A Cassandra cluster always divides up the key space into ranges delimited by 
Tokens as described above, but additional replica placement is customizable via 
IReplicaPlacementStrategy in the configuration file.  The standard strategies 
are

 * !RackUnawareStrategy: replicas are always placed on the next (in increasing 
Token order) N-1 nodes along the ring
 * !RackAwareStrategy: replica 2 is placed on the first node along the ring that belongs in '''another''' data center than the first; the remaining N-2 replicas, if any, are placed on the first nodes along the ring in the '''same''' rack as the first

Note that with !RackAwareStrategy, succeeding nodes along the ring should 
alternate data centers to avoid hot spots.  For instance, if you have nodes A, 
B, C, and D in increasing Token order, and instead of alternating you place A 
and B in DC1, and C and D in DC2, then nodes C and A will have 
disproportionately more data on them because they will be the replica 
destination for every Token range in the other data center.

 * The corollary to this is, if you want to start with a single DC and add 
another later, when you add the second DC you should add as many nodes as you 
have in the first rather than adding a node or two at a time gradually.
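
A rough model of these placement rules in Python (a conceptual sketch, not Cassandra's implementation; the ring order, data-center map, and replication factor are invented, and for simplicity the same data center stands in for the "same rack" part of !RackAwareStrategy):

{{{
def rack_unaware(ring, primary_index, rf):
    # RackUnawareStrategy: primary plus the next RF-1 nodes in Token order
    return [ring[(primary_index + i) % len(ring)] for i in range(rf)]

def rack_aware(ring, primary_index, rf, dc_of):
    # RackAwareStrategy (simplified): second replica is the first node along
    # the ring in another data center; the rest are the next nodes in the
    # same data center as the primary (standing in for "same rack" here).
    primary = ring[primary_index]
    replicas = [primary]
    for i in range(1, len(ring)):
        node = ring[(primary_index + i) % len(ring)]
        if dc_of[node] != dc_of[primary]:
            replicas.append(node)
            break
    for i in range(1, len(ring)):
        if len(replicas) >= rf:
            break
        node = ring[(primary_index + i) % len(ring)]
        if node not in replicas and dc_of[node] == dc_of[primary]:
            replicas.append(node)
    return replicas

# Four hypothetical nodes in increasing Token order, alternating data centers:
ring = ["A", "C", "B", "D"]
dc_of = {"A": "DC1", "B": "DC1", "C": "DC2", "D": "DC2"}
print(rack_unaware(ring, 0, 3))          # ['A', 'C', 'B']
print(rack_aware(ring, 0, 3, dc_of))     # ['A', 'C', 'B']
}}}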

Replication factor is not really intended to be changed in a live cluster 
either, but increasing it may be done if you (a) read at 
ConsistencyLevel.QUORUM or ALL (depending on your existing replication factor) 
to make sure that a replica that actually has the data is consulted, (b) are 
willing to accept downtime while anti-entropy repair runs (see below), or (c) 
are willing to live with some clients potentially being told no data exists if 
they read from the new replica location(s) until repair is done.

The same options apply to changing replication strategy.

Reducing replication factor is easily done and only requires running cleanup 
afterwards to remove extra replicas.

=== Network topology ===
Besides datacenters, you can also tell Cassandra which nodes are in the same 
rack within a datacenter.  Cassandra will use this to route both reads and data 
movement for Range changes to the nearest replicas.  This is configured by a 
user-pluggable !EndpointSnitch class in the configuration file.

!EndpointSnitch is related to, but distinct from, replication strategy itself: 
!RackAwareStrategy needs a properly configured Snitch to place replicas 
correctly, but even absent a Strategy that cares about datacenters, the rest of 
Cassandra will still be location-sensitive.

There is an example of a custom Snitch implementation in 
http://svn.apache.org/repos/asf/cassandra/tags/cassandra-0.6.1/contrib/property_snitch/.

== Range changes ==
=== Bootstrap ===
Adding new nodes is called "bootstrapping."

To bootstrap a node, turn !AutoBootstrap on in the configuration file, and 
start it.

If you explicitly specify an !InitialToken in the configuration, the new node 
will bootstrap to that position on the ring.  Otherwise, it will pick a Token 
that will give it half the keys from the node with the most disk space used, 
that does not already have another node bootstrapping into its Range.
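
Conceptually, taking "half the keys" from the most-loaded node means splitting that node's Range at its midpoint, as in this Python sketch (illustrative only; the real selection also skips Ranges that another node is already bootstrapping into, and the example tokens are made up):

{{{
RING_SIZE = 2 ** 127

def bootstrap_token(previous_token, loaded_node_token):
    # Midpoint of the Range (previous_token, loaded_node_token]:
    # with the random partitioner this takes roughly half of that node's keys.
    width = (loaded_node_token - previous_token) % RING_SIZE
    return (previous_token + width // 2) % RING_SIZE

print(bootstrap_token(10 * 10**36, 90 * 10**36))  # 50 * 10**36
}}}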

Important things to note:

 1. You should wait long enough for all the nodes in your cluster to become 
aware of the bootstrapping node via gossip before starting another bootstrap.  
The new node will log "Bootstrapping" when this is safe, 2 minutes after 
starting.  (90s to make sure it has accurate load information, and 30s waiting 
for other nodes to start sending it inserts happening in its to-be-assumed part 
of the token ring.)
 1. Relating to point 1, you can only bootstrap N nodes at a time with automatic token picking, where N is the size of the existing cluster. If you need to more than double the size of your cluster, you have to wait for the first N nodes to finish, so that your cluster is size 2N, before bootstrapping more nodes. So if your current cluster is 5 nodes and you want to add 7 nodes, bootstrap 5 and let those finish before bootstrapping the last two.
 1. As a safety measure, Cassandra does not automatically remove data from 
nodes that "lose" part of their Token Range to a newly added node.  Run 
`nodetool cleanup` on the source node(s) (neighboring nodes that shared the 
same subrange) when you are satisfied the new node is up and working. If you do 
not do this the old data will still be counted against the load on that node 
and future bootstrap attempts at choosing a location will be thrown off.
 1. When bootstrapping a new node, existing nodes have to divide the key space 
before beginning replication.  This can take awhile, so be patient.
 1. During bootstrap, a node will drop the Thrift port and will not be 
accessible from `nodetool`.
 1. Bootstrap can take many hours when a lot of data is involved.  See 
[[Streaming]] for how to monitor progress.

Cassandra is smart enough to transfer data from the nearest source node(s), if 
your !EndpointSnitch is configured correctly.  So, the new node doesn't need to 
be in the same datacenter as the primary replica for the Range it is 
bootstrapping into, as long as another replica is in the datacenter with the 
new one.

Bootstrap progress can be monitored using `nodetool` with the `streams` 
argument.

During bootstrap, `nodetool` may report that the new node is neither receiving nor sending any streams. This is because the sending node first copies out locally the data it will send to the receiving one, which can be seen on the sending node through the "AntiCompacting... AntiCompacted" log messages.

== Moving or Removing nodes ==
=== Removing nodes entirely ===
You can take a node out of the cluster with `nodetool decommission` (run against the live node itself), or with `nodetool removetoken` (run against any other machine) to remove a dead one.  
This will assign the ranges the old node was responsible for to other nodes, 
and replicate the appropriate data there. If `decommission` is used, the data 
will stream from the decommissioned node. If `removetoken` is used, the data 
will stream from the remaining replicas.

No data is removed automatically from the node being decommissioned, so if you 
want to put the node back into service at a different token on the ring, it 
should be removed manually.

=== Moving nodes ===
`nodetool move`: move the target node to a given Token. Moving is essentially a 
convenience over decommission + bootstrap.

As with bootstrap, see [[Streaming]] for how to monitor progress.

=== Load balancing ===
`nodetool loadbalance`: also essentially a convenience over decommission + 
bootstrap, only instead of telling the target node where to move on the ring it 
will choose its location based on the same heuristic as Token selection on 
bootstrap.

The status of move and balancing operations can be monitored using `nodetool` 
with the `streams` argument.

== Consistency ==
Cassandra allows clients to specify the desired consistency level on reads and 
writes.  (See [[API]].)  If R + W > N, where R, W, and N are respectively the 
read replica count, the write replica count, and the replication factor, all 
client reads will see the most recent write.  Otherwise, readers '''may''' see 
older versions, for periods of typically a few ms; this is called "eventual 
consistency."  See 
http://www.allthingsdistributed.com/2008/12/eventually_consistent.html and 
http://queue.acm.org/detail.cfm?id=1466448 for more.
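
For example, with a replication factor of N = 3, QUORUM writes (W = 2) and QUORUM reads (R = 2) give R + W = 4 > 3, so every read overlaps the latest write on at least one replica. A minimal Python sketch of the rule (not part of Cassandra):

{{{
def reads_see_latest_write(r, w, n):
    # R + W > N guarantees the read set and write set share at least one replica
    return r + w > n

print(reads_see_latest_write(2, 2, 3))  # True:  QUORUM writes + QUORUM reads, N=3
print(reads_see_latest_write(1, 1, 3))  # False: ONE + ONE may read a stale replica
}}}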

See below about consistent backups.

=== Repairing missing or inconsistent data ===
Cassandra repairs data in two ways:

 1. Read Repair: every time a read is performed, Cassandra compares the 
versions at each replica (in the background, if a low consistency was requested 
by the reader to minimize latency), and the newest version is sent to any 
out-of-date replicas.
 1. Anti-Entropy: when `nodetool repair` is run, Cassandra computes a Merkle 
tree of the data on that node, and compares it with the versions on other 
replicas, to catch any out of sync data that hasn't been read recently.  This 
is intended to be run infrequently (e.g., weekly) since computing the Merkle tree is relatively expensive in disk I/O and CPU, because it scans ALL the data on the machine (but it is very network efficient); a rough sketch of the idea follows this list.
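
As a rough illustration (a simplified Python sketch, not Cassandra's Merkle tree code; the bucketing and sample rows are invented), each replica hashes buckets of its rows and only buckets whose digests differ need to be exchanged and repaired:

{{{
import hashlib

def bucket_digests(rows, buckets=8):
    # Hash each bucket of rows into one digest (the "leaves" of a Merkle tree)
    leaves = [hashlib.md5() for _ in range(buckets)]
    for key in sorted(rows):
        leaves[hash(key) % buckets].update(("%s=%s" % (key, rows[key])).encode())
    return [leaf.hexdigest() for leaf in leaves]

def out_of_sync_buckets(rows_a, rows_b, buckets=8):
    # Only buckets with differing digests need their data streamed and repaired
    a, b = bucket_digests(rows_a, buckets), bucket_digests(rows_b, buckets)
    return [i for i in range(buckets) if a[i] != b[i]]

replica1 = {"k1": "v1", "k2": "v2", "k3": "v3"}
replica2 = {"k1": "v1", "k2": "stale", "k3": "v3"}
print(out_of_sync_buckets(replica1, replica2))  # the bucket holding k2
}}}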

Running `nodetool repair`:
Like all nodetool operations, repair is non-blocking; it sends the command to 
the given node, but does not wait for the repair to actually finish.  You can 
tell that repair is finished when (a) there are no active or pending tasks in 
the CompactionManager, and after that when (b) there are no active or pending 
tasks on AE-SERVICE-STAGE.

Repair should be run against one machine at a time.  (This limitation will be 
fixed in 0.7.)

=== Handling failure ===
If a node goes down and comes back up, the ordinary repair mechanisms will be 
adequate to deal with any inconsistent data.  Remember though that if a node is 
down longer than your configured GCGraceSeconds (default: 10 days), it could 
have missed remove operations permanently.  Unless your application performs no 
removes, you should wipe its data directory, re-bootstrap it, and removetoken 
its old entry in the ring (see below).
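
To see why, here is a simplified Python sketch of the tombstone lifecycle (the timeline, helper names, and data are invented): a delete is recorded as a tombstone, tombstones older than GCGraceSeconds are dropped at compaction, and a replica that was down the whole time never learned of the delete, so a later repair resurrects the row.

{{{
GC_GRACE_SECONDS = 10 * 24 * 3600               # default grace period: 10 days

def compact(replica, now):
    # Garbage-collect tombstones (value None) older than GCGraceSeconds
    return {k: (v, ts) for k, (v, ts) in replica.items()
            if not (v is None and now - ts > GC_GRACE_SECONDS)}

def naive_repair(healthy, returning):
    # Keep the newest version of each key seen on either replica
    merged = dict(returning)
    for k, (v, ts) in healthy.items():
        if k not in merged or ts > merged[k][1]:
            merged[k] = (v, ts)
    return merged

replica_a = {"row1": ("data", 0)}               # day 0: both replicas have the row
replica_b = {"row1": ("data", 0)}               # ...then replica_b goes down
replica_a["row1"] = (None, 1 * 24 * 3600)       # day 1: delete -> tombstone on replica_a
replica_a = compact(replica_a, 12 * 24 * 3600)  # day 12: tombstone GC'd after grace period
print(naive_repair(replica_a, replica_b))       # {'row1': ('data', 0)} -- resurrected
}}}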

If a node goes down entirely, then you have two options:

 1. (Recommended approach) Bring up the replacement node with a new IP address, 
and !AutoBootstrap set to true in storage-conf.xml. This will place the 
replacement node in the cluster and find the appropriate position 
automatically. Then the bootstrap process begins. While this process runs, the 
node will not receive reads until finished. Once this process is finished on 
the replacement node, run `nodetool removetoken` once, supplying the token of 
the dead node, and `nodetool cleanup` on each node.
  * You can obtain the dead node's token by running `nodetool ring` on any live node, unless there was some kind of outage and the others came up but not the down one -- in that case, you can retrieve the token from the live nodes' system tables.

 1. (Alternative approach) Bring up a replacement node with the same IP and 
token as the old, and run `nodetool repair`. Until the repair process is 
complete, clients reading only from this node may get no data back.  Using a 
higher !ConsistencyLevel on reads will avoid this.

The reason why you run `nodetool cleanup` on all live nodes is to remove old 
Hinted Handoff writes stored for the dead node.

== Backing up data ==
Cassandra can snapshot data while online using `nodetool snapshot`.  You can 
then back up those snapshots using any desired system, although leaving them 
where they are is probably the option that makes the most sense on large 
clusters.

With some combinations of operating system/JVM you may receive an error related to the inability to create a process during snapshotting, such as this on Linux:

{{{
Exception in thread "main" java.io.IOException: Cannot run program "ln": 
java.io.IOException: error=12, Cannot allocate memory
}}}
Remain calm. The operating system is trying to allocate for the "ln" process a 
memory space as large as the parent process (the cassandra server), even if 
'''it's not going to use it'''. So if you have a machine with 8GB of RAM and no 
swap, and you gave 6 to the cassandra server, it will fail during this because 
the operating system will want 12 GB of memory before allowing you to create the 
process.

This can be worked around, depending on the operating system, either by creating a swap file, taking the snapshot, and then turning the swap off again, or by turning on "memory overcommit". Since the child process's memory is initially the same as the parent's, the operating system will not actually use any new memory until the child performs its `exec("ln")` call; it just refers to the existing pages, so everything works.

Currently, only flushed data is snapshotted (not data that only exists in the 
commitlog).  Run `nodetool flush` first and wait for that to complete, to make 
sure you get '''all''' data in the snapshot.

To revert to a snapshot, shut down the node, clear out the old commitlog and 
sstables, and move the sstables from the snapshot location to the live data 
directory.

=== Consistent backups ===
You can get an eventually consistent backup by flushing all nodes and 
snapshotting; no individual node's backup is guaranteed to be consistent but if 
you restore from that snapshot then clients will get eventually consistent 
behavior as usual.

There is no such thing as a consistent view of the data in the strict sense, 
except in the trivial case of writes with consistency level = ALL.

=== Import / export ===
As an alternative to taking snapshots it's possible to export SSTables to JSON 
format using the `bin/sstable2json` command:

{{{
Usage: sstable2json [-f outfile] <sstable> [-k key [-k key [...]]]
}}}
`bin/sstable2json` accepts as a required argument the full path to an SSTable data file (files ending in -Data.db), and an optional argument for an output file (by default, output is written to stdout). You can also pass the names of specific keys using the `-k` argument to limit what is exported.

Note: If you are not running the exporter on in-place SSTables, there are a 
couple of things to keep in mind.

 1. The corresponding configuration must be present (same as it would be to run 
a node).
 1. SSTables are expected to be in a directory named for the keyspace (same as 
they would be on a production node).

JSON exported SSTables can be "imported" to create new SSTables using 
`bin/json2sstable`:

{{{
Usage: json2sstable -K keyspace -c column_family <json> <sstable>
}}}
`bin/json2sstable` takes arguments for keyspace and column family names, and 
full paths for the JSON input file and the destination SSTable file name.

You can also import pre-serialized rows of data using the BinaryMemtable 
interface.  This is useful for importing via Hadoop or another source where you 
want to do some preprocessing of the data to import.

NOTE: Starting with version 0.7, json2sstable and sstable2json must be run in 
such a way that the schema can be loaded from system tables.  This means that 
cassandra.yaml must be found in the classpath and refer to valid storage 
directories.

== Monitoring ==
Cassandra exposes internal metrics as JMX data.  This is a common standard in 
the JVM world; OpenNMS, Nagios, and Munin at least offer some level of JMX 
support. The specifics of the JMX Interface are documented at JmxInterface.

Running `nodetool cfstats` can provide an overview of each Column Family, and important metrics to graph for your cluster. For those who prefer to deal with non-JMX clients, there is a JMX-to-REST bridge available at http://code.google.com/p/polarrose-jmx-rest-bridge/

Important metrics to watch on a per-Column Family basis would be: '''Read 
Count, Read Latency, Write Count and Write Latency'''. '''Pending Tasks''' tell 
you if things are backing up. These metrics can also be exposed using any JMX 
client such as `jconsole`.

You can also use jconsole and its MBeans tab to look at PendingTasks for the thread pools. If you see one particular thread pool backing up, this can give you an indication of a problem. One example would be ROW-MUTATION-STAGE, indicating that write requests are arriving faster than they can be handled. A more subtle example is the FLUSH stages: if these start backing up, Cassandra is accepting writes into memory quickly enough, but the sort-and-write-to-disk stages are falling behind.

If you are seeing a lot of tasks being built up, your hardware or configuration 
tuning is probably the bottleneck.

Running `nodetool tpstats` will dump all of those thread pools to the console if you don't want to use jconsole. Example:

{{{
Pool Name                    Active   Pending      Completed
FILEUTILS-DELETE-POOL             0         0            119
MESSAGING-SERVICE-POOL            3         4       81594002
STREAM-STAGE                      0         0              3
RESPONSE-STAGE                    0         0       48353537
ROW-READ-STAGE                    0         0          13754
LB-OPERATIONS                     0         0              0
COMMITLOG                         1         0       78080398
GMFD                              0         0        1091592
MESSAGE-DESERIALIZER-POOL         0         0      126022919
LB-TARGET                         0         0              0
CONSISTENCY-MANAGER               0         0           2899
ROW-MUTATION-STAGE                1         2       81719765
MESSAGE-STREAMING-POOL            0         0            129
LOAD-BALANCER-STAGE               0         0              0
FLUSH-SORTER-POOL                 0         0            218
MEMTABLE-POST-FLUSHER             0         0            218
COMPACTION-POOL                   0         0            464
FLUSH-WRITER-POOL                 0         0            218
HINTED-HANDOFF-POOL               0         0            154
}}}
