We ran repair -pr on each node after we realized there was data loss and
added the 4 original nodes back into the cluster.  That is, once we
realized there was a problem, we ran repair on the 8-node cluster
consisting of the 4 old and 4 new nodes.
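
Roughly what that looked like, sketched as a loop (the host names are
placeholders, and the echo is left in so the loop prints the commands
instead of running them):

```shell
# Hypothetical node addresses -- substitute your own.
NODES="node1 node2 node3 node4 node5 node6 node7 node8"

# repair -pr repairs only the range each node is primary replica for,
# so running it once per node covers the whole ring exactly once.
for HOST in $NODES; do
  echo nodetool -h "$HOST" repair -pr   # drop 'echo' to actually run it
done
```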

We are using quorum reads and writes.

One thing that I didn't mention, and which I now think may be the culprit
after a lot of mailing list reading, is that when we brought the 4 new
nodes into the cluster, each of them had itself listed in its own seeds
list.  I read yesterday that if a node lists itself as a seed, it won't
bootstrap properly.
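
For the curious, this is the cassandra.yaml fragment in question (the
addresses below are made up): on a joining node, seeds should name
existing cluster members only, never the node itself.

```yaml
# cassandra.yaml on a NEW node, say 10.0.1.5 (addresses are hypothetical).
# If the node's own address appears in seeds, it considers itself a seed
# node and will not bootstrap (i.e. it won't stream existing data).
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "10.0.0.1,10.0.0.2"   # existing nodes only, NOT 10.0.1.5
```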

-- C


On Tue, Nov 26, 2013 at 8:14 AM, Janne Jalkanen <janne.jalka...@ecyrd.com> wrote:

>
> That sounds bad!  Did you run repair at any stage?  Which CL are you
> reading with?
>
> /Janne
>
> On 25 Nov 2013, at 19:00, Christopher J. Bottaro <
> cjbott...@academicworks.com> wrote:
>
> Hello,
>
> We recently experienced (pretty severe) data loss after moving our 4 node
> Cassandra cluster from one EC2 availability zone to another.  Our strategy
> for doing so was as follows:
>
>    - One at a time, bring up new nodes in the new availability zone and
>    have them join the cluster.
>    - One at a time, decommission the old nodes in the old availability
>    zone and turn them off (stop the Cassandra process).
>
> Everything seemed to work as expected.  As we decommissioned each node, we
> checked the logs for messages indicating "yes, this node is done
> decommissioning" before turning the node off.
>
> Pretty quickly after the old nodes left the cluster, we started getting
> client calls about data missing.
>
> We immediately turned the old nodes back on and when they rejoined the
> cluster *most* of the reported missing data returned.  For the rest of the
> missing data, we had to spin up a new cluster from EBS snapshots and copy
> it over.
>
> What did we do wrong?
>
> In hindsight, we noticed a few things which may be clues...
>
>    - The new nodes had much lower load after joining the cluster than the
>    old ones (3-4 GB as opposed to 10 GB).
>    - We have Ec2Snitch turned on, although we're using SimpleStrategy for
>    replication.
>    - The new nodes showed even ownership (via nodetool status) after
>    joining the cluster.
>
> Here's more info about our cluster...
>
>    - Cassandra 1.2.10
>    - Replication factor of 3
>    - Vnodes with 256 tokens
>    - All tables made via CQL
>    - Data dirs on EBS (yes, we are aware of the performance implications)
>
>
> Thanks for the help.
>
>
>
