Hello,

We recently experienced (pretty severe) data loss after moving our 4-node
Cassandra cluster from one EC2 availability zone to another.  Our strategy
for doing so was as follows (the rough per-node commands are sketched after
the list):

   - One at a time, bring up new nodes in the new availability zone and
   have them join the cluster.
   - One at a time, decommission the old nodes in the old availability zone
   and turn them off (stop the Cassandra process).
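
In case it's relevant, the per-node steps looked roughly like this
(reconstructed from memory; the exact start/stop commands depend on our init
setup):

   # on each new node, after pointing cassandra.yaml at the existing seeds
   sudo service cassandra start    # node bootstraps and joins the ring
   nodetool status                 # wait until the new node shows UN (Up/Normal)

   # then, on one old node at a time
   nodetool decommission           # streams the node's data to the remaining nodes
   nodetool netstats               # watch until streaming finishes
   sudo service cassandra stop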

Everything seemed to work as expected.  As we decommissioned each node, we
checked the logs for messages indicating "yes, this node is done
decommissioning" before turning the node off.

Pretty quickly after the old nodes left the cluster, we started getting
client calls about missing data.

We immediately turned the old nodes back on, and when they rejoined the
cluster *most* of the reported missing data returned.  For the rest of the
missing data, we had to spin up a new cluster from EBS snapshots and copy
it over.

What did we do wrong?

In hindsight, we noticed a few things which may be clues...

   - The new nodes had much lower load after joining the cluster than the
   old ones (3-4 GB as opposed to 10 GB).
   - We have EC2Snitch turned on, although we're using SimpleStrategy for
   replication (the snitch config line is shown after this list).
   - The new nodes showed even ownership (via nodetool status) after
   joining the cluster.
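
For reference, the snitch is set in cassandra.yaml on each node; the relevant
line looks like:

   endpoint_snitch: Ec2Snitch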

Here's more info about our cluster...

   - Cassandra 1.2.10
   - Replication factor of 3 (keyspace definition sketched below)
   - Vnodes with 256 tokens
   - All tables made via CQL
   - Data dirs on EBS (yes, we are aware of the performance implications)
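
In case the exact replication settings matter, the keyspaces were created via
CQL along these lines (keyspace name is a placeholder):

   CREATE KEYSPACE our_keyspace
     WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};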


Thanks for the help.
