Hi All,

After running through our backup and restore process FROM our test production TO our staging environment, we are seeing inconsistent reads from the cluster we restored to. We have the same number of nodes in both clusters. For example, when we select data from a column family on the newly restored cluster, the expected data is sometimes returned and sometimes not. These selects are carried out one after another with very little delay. It is almost as if the data only exists on some of the nodes, or perhaps the token ranges are dramatically different. Again, we are using vnodes, so I am not exactly sure how this plays into the equation.
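One way to test the token-range suspicion would be to compare the tokens a node owns in each cluster. The following is only a rough sketch: it assumes that "nodetool ring" prints one row per token with the node address in the first column and the token in the last, and the helper name and the host addresses in the usage comments are hypothetical.

```shell
#!/usr/bin/env bash
# Read `nodetool ring` output on stdin and print, sorted, the tokens listed
# for one node address. Assumes the address is the first column and the
# token is the last column of each row.
tokens_for_node() {
  awk -v addr="$1" '$1 == addr && $NF ~ /^-?[0-9]+$/ { print $NF }' | sort
}

# Usage (hypothetical addresses; run against one node from each cluster):
#   nodetool -h 10.0.0.1 ring | tokens_for_node 10.0.0.1 > cluster-a-node1.tokens
#   nodetool -h 10.1.0.1 ring | tokens_for_node 10.1.0.1 > cluster-b-node1.tokens
#   diff cluster-a-node1.tokens cluster-b-node1.tokens
```

If the two token lists differ, the node that received a given snapshot archive may not own the ranges that the archived data belongs to.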
We are running Cassandra 2.0.2 with vnodes and deploying via Chef. The backup and restore process is currently orchestrated using bash scripts and Chef's distributed SSH. I have outlined the process below for review.

(I) Backup cluster-A (with existing prod data):

1. Run "nodetool flush" on each of the nodes in a 5 node ring.
2. Run "nodetool snapshot keyspace_name" on each of the nodes in a 5 node ring.
3. Archive the snapshot data from the snapshots directory on each node, creating a single archive of the snapshot.
4. Copy the snapshot data archive for each of the nodes to S3.

(II) Restore backup FROM cluster-A TO cluster-B:

*NOTE: cluster-B is a freshly deployed ring with no data, but with a different cluster-name used for staging.

1. Deploy 5 nodes as part of the cluster-B ring.
2. Create the keyspace_name keyspace and column families on cluster-B.
3. Stop Cassandra on all 5 nodes in the cluster-B ring.
4. Clear commit logs on cluster-B with: "rm -f /var/lib/cassandra/commitlog/*"
5. Copy 1 of the 5 snapshot archives from cluster-A to each of the five nodes in the new cluster-B ring.
6. Extract the archives to /var/lib/cassandra/data/keyspace_name, ensuring that the column family directories and associated .db files are in place under /var/lib/cassandra/data/keyspace_name/columnfamily1/ ….etc.
7. Start Cassandra on each of the nodes in cluster-B.
8. Run "nodetool repair" on each of the nodes in cluster-B.

Please let me know if you see any major errors or deviations from best practices which could be contributing to our read inconsistencies. I'll be happy to answer any specific questions you may have regarding our configuration.

Thank you in advance!

Best regards,
-David Laube
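For reference, the per-node backup steps in (I) could be sketched roughly as below. This is a simplified illustration, not the actual scripts: the snapshot tag (passed with -t so the archive step can find the directory), the archive path, the S3 destination, and the use of the AWS CLI for the upload are all assumptions. It defaults to a dry run that only prints the commands.

```shell
#!/usr/bin/env bash
# Per-node backup sketch for section (I). Everything marked "assumed" is
# illustrative and not taken from the original scripts.
set -eu

KEYSPACE="${KEYSPACE:-keyspace_name}"
DATA_DIR="${DATA_DIR:-/var/lib/cassandra/data}"
SNAPSHOT_TAG="${SNAPSHOT_TAG:-backup-$(date +%Y%m%d%H%M%S)}"          # assumed tag format
ARCHIVE="${ARCHIVE:-/tmp/${HOSTNAME:-node}-${SNAPSHOT_TAG}.tar.gz}"   # assumed path
S3_DEST="${S3_DEST:-s3://example-backup-bucket/cassandra}"            # hypothetical bucket
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands instead of running them

# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Bundle every <column family>/snapshots/<tag> directory into one archive.
archive_snapshot() {
  ( cd "$DATA_DIR/$KEYSPACE" && tar -czf "$ARCHIVE" */snapshots/"$SNAPSHOT_TAG" )
}

backup_node() {
  run nodetool flush                                     # step 1
  run nodetool snapshot -t "$SNAPSHOT_TAG" "$KEYSPACE"   # step 2
  run archive_snapshot                                   # step 3
  run aws s3 cp "$ARCHIVE" "$S3_DEST/"                   # step 4 (assumes the AWS CLI)
}

backup_node
```

Setting DRY_RUN=0 would execute the commands on the node; the script would still need to be invoked on each of the 5 nodes by whatever orchestrates the run.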