Hi All,

After running through our backup and restore process FROM our test production 
TO our staging environment, we are seeing inconsistent reads from the cluster 
we restored to. We have the same number of nodes in both clusters. For example, 
we will select data from a column family on the newly restored cluster but 
sometimes the expected data is returned and other times it is not. These 
selects are carried out one after another with very little delay. It is almost 
as if the data only exists on some of the nodes, or perhaps the token ranges 
are dramatically different -- again, we are using vnodes, so I am not exactly
sure how this plays into the equation.
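One way to test the "data only exists on some of the nodes" theory would be to ask the cluster which replicas should own a sample row and then query those nodes directly. The keyspace, column family, and key names below are placeholders for illustration:

```shell
#!/usr/bin/env bash
# Diagnostic sketch (keyspace/column-family/key names are placeholders):
# print the checks we could run to see which nodes should own a sample row,
# so their on-disk data can be compared between cluster-A and cluster-B.
set -eu

CHECKS=(
  "nodetool status keyspace_name"                                # per-node load and effective ownership
  "nodetool getendpoints keyspace_name columnfamily1 some_key"   # replica nodes for one key
)
printf '%s\n' "${CHECKS[@]}"
```

If getendpoints reports replicas that return the row on some runs but not others, that points at the restored SSTables not matching the token ranges the new nodes actually own.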

We are running Cassandra 2.0.2 with vnodes and deploying via Chef. The backup 
and restore process is currently orchestrated using bash scripts and Chef's 
distributed SSH. I have outlined the process below for review. 


(I) Backup cluster-A (with existing prod data):
1. Run "nodetool flush" on each of the nodes in a 5 node ring.
2. Run "nodetool snapshot keyspace_name" on each of the nodes in a 5 node ring.
3. Archive the snapshot data from the snapshots directory in each node, 
creating a single archive of the snapshot.
4. Copy the snapshot data archive for each of the nodes to s3.
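For reference, steps 1-4 can be sketched as a per-node script. The snapshot tag, bucket name, and archive path below are assumptions, not our exact values; the script collects the commands and prints them so the plan can be reviewed before execution:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of backup steps 1-4 on one node; tag, bucket, and
# paths are assumptions. Commands are printed for review, not executed.
set -eu

KEYSPACE="keyspace_name"
TAG="backup_$(date +%Y%m%d)"
DATA_DIR="/var/lib/cassandra/data"
S3_BUCKET="s3://example-backup-bucket"        # assumption
ARCHIVE="/tmp/$(hostname)_${TAG}.tar.gz"

PLAN=()
plan() { PLAN+=("$*"); }

# 1-2. Flush memtables to disk, then snapshot the keyspace.
plan nodetool flush "$KEYSPACE"
plan nodetool snapshot -t "$TAG" "$KEYSPACE"

# 3. Archive this node's snapshot data into a single tarball.
plan tar -czf "$ARCHIVE" -C "$DATA_DIR" "$KEYSPACE"

# 4. Ship the archive to S3 (aws CLI assumed to be installed/configured).
plan aws s3 cp "$ARCHIVE" "$S3_BUCKET/"

printf '%s\n' "${PLAN[@]}"
```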


(II) Restore backup FROM cluster-A  TO  cluster-B:
*NOTE: cluster-B is a freshly deployed ring with no data, but a different 
cluster-name used for staging.

1. Deploy 5 nodes as part of the cluster-B ring. 
2. Create keyspace_name keyspace and column families on cluster-B.
3. Stop Cassandra on all 5 nodes in the cluster-B ring.
4. Clear commit logs on cluster-B with:  "rm -f /var/lib/cassandra/commitlog/*"
5. Copy 1 of the 5 snapshot archives from cluster-A to each of the five nodes 
in the new cluster-B ring.
6. Extract the archives to /var/lib/cassandra/data/keyspace_name, ensuring that 
the column family directories and associated .db files are in place under 
/var/lib/cassandra/data/keyspace_name/columnfamily1/   ….etc.
7. Start Cassandra on each of the nodes in cluster-B.
8. Run "nodetool repair" on each of the nodes in cluster-B.
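The per-node portion of the restore (steps 3-8) can be sketched the same way. The archive name, bucket, service command, and archive layout (keyspace_name/columnfamily dirs at the top level) are assumptions; again, the commands are printed for review rather than executed:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of restore steps 3-8 on one cluster-B node; archive
# name, bucket, and service command are assumptions.
set -eu

KEYSPACE="keyspace_name"
S3_BUCKET="s3://example-backup-bucket"   # assumption
ARCHIVE="node1_backup.tar.gz"            # assumption: one of the 5 archives
DATA_DIR="/var/lib/cassandra/data"

PLAN=()
plan() { PLAN+=("$*"); }

# 3-4. Stop Cassandra, then clear the commit logs (glob kept literal here).
plan service cassandra stop
plan rm -f "/var/lib/cassandra/commitlog/*"

# 5-6. Fetch the archive and extract it so the column family directories
# land under $DATA_DIR/$KEYSPACE/ (archive layout is an assumption).
plan aws s3 cp "$S3_BUCKET/$ARCHIVE" "/tmp/$ARCHIVE"
plan tar -xzf "/tmp/$ARCHIVE" -C "$DATA_DIR"

# 7-8. Start Cassandra and repair the node.
plan service cassandra start
plan nodetool repair

printf '%s\n' "${PLAN[@]}"
```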


Please let me know if you see any major errors or deviation from best practices 
which could be contributing to our read inconsistencies. I'll be happy to 
answer any specific question you may have regarding our configuration. Thank 
you in advance!


Best regards,
-David Laube
