Thanks. Unfortunately, we lost our system logs during all of this (we had the normal logs, but not the system logs) due to an unrelated issue :/
Anyhow, as far as I can tell, we're doing okay.

On Thu, Oct 20, 2016 at 11:18 PM, Jeremiah D Jordan <jeremiah.jor...@gmail.com> wrote:

> The easiest way to figure out what happened is to examine the system log;
> it will tell you what happened. But I'm pretty sure your nodes got new
> tokens during that time.
>
> If you want to get back the data inserted during those 2 hours, you could
> use sstableloader to send all the data from the
> /var/data/cassandra_new/cassandra/* folders back into the cluster, if you
> still have it.
>
> -Jeremiah
>
>
> On Oct 20, 2016, at 3:58 PM, Branton Davis <branton.da...@spanning.com> wrote:
>
> Howdy folks. I asked about this in IRC yesterday, but we're hoping to
> confirm a couple of things for our sanity.
>
> Yesterday, I was performing an operation on a 21-node cluster (vnodes,
> replication factor 3, NetworkTopologyStrategy, and the nodes balanced
> across 3 AZs on AWS EC2). The plan was to swap each node's existing 1TB
> volume (where all cassandra data, including the commitlog, is stored) with
> a 2TB volume. The plan for each node (one at a time) was basically:
>
> - rsync while the node is live (repeated until there were only minor
>   differences from new data)
> - stop cassandra on the node
> - rsync again
> - replace the old volume with the new one
> - start cassandra
>
> However, there was a bug in the rsync command. Instead of copying the
> contents of /var/data/cassandra to /var/data/cassandra_new, it copied them
> to /var/data/cassandra_new/cassandra. So, when cassandra was started after
> the volume swap, there was some behavior similar to bootstrapping a new
> node (data started streaming in from other nodes), but also some behavior
> similar to a node replacement (nodetool status showed the same IP address,
> but a different host ID). This happened with 3 nodes (one from each AZ).
> Those nodes had received 1.4GB, 1.2GB, and 0.6GB of data (whereas the
> normal load for a node is around 500-600GB).
>
> The cluster was in this state for about 2 hours, at which point cassandra
> was stopped on those nodes. Later, I moved the data from the original
> volumes back into place (so it should be back in its original state from
> before the operation) and started cassandra back up.
>
> Finally, the questions. We've accepted the potential loss of data written
> during those two hours, but our primary concern now is what was happening
> with the bootstrapping nodes. Would they have taken on the token ranges of
> the original nodes, or acted like new nodes and received new token ranges?
> If the latter, is it possible that any data moved from the healthy nodes
> to the "new" nodes, or would restarting them with the original data (and
> repairing) put the cluster's token ranges back into a normal state?
>
> Hopefully that was all clear. Thanks in advance for any info!
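For anyone who runs into the same copy bug: the extra nesting described above (/var/data/cassandra_new/cassandra instead of /var/data/cassandra_new) is exactly what rsync does when the source path has no trailing slash. A minimal sketch of the difference, with illustrative flags rather than the exact command that was run:

    # Trailing slash on the source: copies the *contents* of the directory,
    # so the data lands directly under /var/data/cassandra_new/
    rsync -a /var/data/cassandra/ /var/data/cassandra_new/

    # No trailing slash: copies the directory itself, producing the extra
    # nesting /var/data/cassandra_new/cassandra/...
    rsync -a /var/data/cassandra /var/data/cassandra_new/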
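On the sstableloader suggestion, a rough sketch of what that would look like, assuming the copied volume kept the usual data/<keyspace>/<table> layout; the contact points and keyspace/table names below are placeholders, and it would need to be run once per table directory:

    # Stream the SSTables from one table directory back into the live cluster.
    # -d takes a comma-separated list of initial contact points.
    sstableloader -d 10.0.0.10,10.0.0.11 \
        /var/data/cassandra_new/cassandra/data/my_keyspace/my_table-<table_id>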
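And for checking that things are back to normal after restoring the original volumes, a minimal sanity pass using stock nodetool (nothing cluster-specific assumed):

    # Each node should show its original host ID and a normal load (~500-600GB):
    nodetool status

    # Token ownership per endpoint, to confirm the ranges look as expected:
    nodetool ring

    # Then a primary-range repair on each affected node to reconcile any data
    # written while the misconfigured nodes were up:
    nodetool repair -pr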