The branch, master has been updated
       via  06ac62f890299021220214327f1b611c3cf00145 (commit)
       via  b1577a11d548479ff1a05702d106af9465921ad4 (commit)
       via  2438f3a4944f7adbcae4cc1b9d5452714244afe7 (commit)
       via  cad3107b12e8392f786f9a758ee38cf3a3d58538 (commit)
       via  feb1d40b21a160737aead22e398f3c34ff3be8de (commit)
      from  4c0cbfbe8b19f2e6fe17093b52c734bec63dd8b7 (commit)
http://gitweb.samba.org/?p=ctdb.git;a=shortlog;h=master

- Log -----------------------------------------------------------------
commit 06ac62f890299021220214327f1b611c3cf00145
Author: Michael Adam <ob...@samba.org>
Date:   Wed Apr 17 13:08:49 2013 +0200

    tests: add a comment to recovery db corruption test

    The comment explains that we use "ctdb stop" and "ctdb continue"
    but should use "ctdb setrecmasterrole off".

    Signed-off-by: Michael Adam <ob...@samba.org>

commit b1577a11d548479ff1a05702d106af9465921ad4
Author: Amitay Isaacs <ami...@gmail.com>
Date:   Thu Apr 11 16:59:36 2013 +1000

    tests: Add a test for subsequent recoveries corrupting databases

    Signed-off-by: Amitay Isaacs <ami...@gmail.com>
    Reviewed-by: Michael Adam <ob...@samba.org>

commit 2438f3a4944f7adbcae4cc1b9d5452714244afe7
Author: Amitay Isaacs <ami...@gmail.com>
Date:   Thu Apr 11 16:58:34 2013 +1000

    tests: Support waiting for "recovered" state in tests

    Signed-off-by: Amitay Isaacs <ami...@gmail.com>
    Reviewed-by: Michael Adam <ob...@samba.org>

commit cad3107b12e8392f786f9a758ee38cf3a3d58538
Author: Michael Adam <ob...@samba.org>
Date:   Wed Apr 3 12:02:59 2013 +0200

    ctdb_call: don't bump the rsn in ctdb_become_dmaster() any more

    This is now done in ctdb_ltdb_store_server(), so this extra bump
    can be spared.

    Signed-off-by: Michael Adam <ob...@samba.org>
    Reviewed-By: Amitay Isaacs <ami...@gmail.com>

commit feb1d40b21a160737aead22e398f3c34ff3be8de
Author: Michael Adam <ob...@samba.org>
Date:   Wed Apr 3 11:40:25 2013 +0200

    Fix a severe recovery bug that can lead to data corruption for SMB clients.

    Problem:
    Recovery can under certain circumstances lead to old record copies
    resurrecting: recovery selects the newest record copy purely by RSN.
    At the end of a recovery, the recovery master is the dmaster for
    all records in all (non-persistent) databases, and the other nodes
    locally hold complete copies of the databases. The bug is that the
    recovery process does not increment the RSN on the recovery master
    at the end of the recovery. Clients acting directly on the recovery
    master will then change a record's content on the recmaster without
    migration and hence without an RSN bump. A subsequent recovery
    therefore cannot tell that the recmaster's copy is newer than the
    copies on the other nodes, since their RSNs are the same. Hence, if
    the recmaster is not node 0 (more precisely, not the active node
    with the lowest node number), the recovery will choose the copies
    from the lower-numbered nodes and stick to those.

    Here is how to reproduce:

    - assume we have a cluster with at least 2 nodes
    - ensure that the recmaster is not node 0 (e.g. with
      "onnode 0 ctdb setrecmasterrole off"); say the recmaster is node 1
    - choose a new database name, say "test1.tdb"
      (make sure it is not yet attached as persistent)
    - choose a key name, say "key1"
    - all cluster nodes should be OK and no recovery should be running
    - now do the following on node 1:

      1. dbwrap_tool test1.tdb store key1 uint32 1
      2. dbwrap_tool test1.tdb fetch key1 uint32   ==> 1
      3. ctdb recover
      4. dbwrap_tool test1.tdb store key1 uint32 2
      5. dbwrap_tool test1.tdb fetch key1 uint32   ==> 2
      6. ctdb recover
      7. dbwrap_tool test1.tdb fetch key1 uint32   ==> 1   ==> BUG
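To make the failure concrete, here is a minimal, standalone C sketch of
the selection rule described above (a model, not ctdb code; the struct,
the function and the RSN values are hypothetical). It assumes recovery
keeps the copy with the strictly highest RSN, with ties going to the
lower-numbered node:

    #include <stdio.h>
    #include <stdint.h>

    struct copy { uint64_t rsn; uint32_t value; };

    /* Scan nodes in ascending order and replace the winner only on a
     * strictly greater RSN, so ties keep the lower-numbered node. */
    static struct copy pick_newest(const struct copy *copies, int n)
    {
        struct copy winner = copies[0];
        int i;
        for (i = 1; i < n; i++) {
            if (copies[i].rsn > winner.rsn) {
                winner = copies[i];
            }
        }
        return winner;
    }

    int main(void)
    {
        struct copy node[2];
        struct copy winner;

        node[1] = (struct copy) { .rsn = 1, .value = 1 }; /* step 1: store on node 1 */
        node[0] = node[1];      /* step 3: recovery copies the record to node 0 */

        node[1].value = 2;      /* step 4: local store on the dmaster, which
                                 * did not bump the RSN before this fix */

        winner = pick_newest(node, 2);  /* step 6: second recovery */
        printf("value after recovery: %u\n", (unsigned)winner.value); /* prints 1 */
        return 0;
    }

With equal RSNs on both nodes, node 0's stale copy wins and value 1
resurrects, exactly as in step 7 above.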
    This is a very severe bug: applied to Samba's locking.tdb database,
    it means that for SMB clients on clustered Samba there is the
    potential of locking oneself out of previously opened files, or,
    even worse, data corruption:

    Case 1: locking out

    - client on recmaster opens file
    - recovery propagates the open file handle (entry in locking.tdb)
      to the other nodes
    - client closes file
    - client opens the same file again
    - recovery resurrects the old copy of the open-file record in
      locking.tdb from a lower-numbered node
    - client closes the file but fails to delete the entry in locking.tdb
    - client tries to open the same file again but fails, since the old
      record locks it out (the client is still connected)

    Case 2: data corruption

    - client1 on recmaster opens file
    - recovery propagates the open file info to the other nodes
    - client1 closes the file and disconnects
    - client2 opens the same file
    - recovery resurrects the old copy of the locking.tdb record, in
      which client2 has no entry but client1 has
    - but client2 believes it still has a handle
    - client3 opens the file and succeeds without conflicting with
      client2 (the detached entry for client1 is discarded because that
      server process does not exist any more)

    => both client2 and client3 believe they have exclusive access to
       the file, and writing creates data corruption

    Fix:

    When storing a record on the dmaster, bump its RSN.

    ctdb_ltdb_store_server() is the central function for storing a
    record to a local tdb from the ctdbd server context, so this is
    also the place where the RSN of the record to be stored should be
    incremented when storing on the dmaster. For the case of record
    migration, this is currently done in ctdb_become_dmaster() in
    ctdb_call.c, but there are other places, such as recovery, where we
    should bump the RSN but currently don't. Moving the RSN increment
    into ctdb_ltdb_store_server() fixes the
    recovery-record-resurrection bug.

    Signed-off-by: Michael Adam <ob...@samba.org>
    Reviewed-By: Amitay Isaacs <ami...@gmail.com>
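In terms of the same hypothetical model, the fix boils down to bumping
the RSN on every local store performed by the dmaster, which is what
the diff below does in ctdb_ltdb_store_server() (again a sketch under
those modeling assumptions, not the actual ctdb code):

    #include <stdio.h>
    #include <stdint.h>

    struct copy { uint64_t rsn; uint32_t value; };

    /* With the fix, a local store on the dmaster always bumps the RSN,
     * so the recmaster's copy outranks stale copies on other nodes. */
    static void store_on_dmaster(struct copy *c, uint32_t new_value)
    {
        c->value = new_value;
        c->rsn++;
    }

    int main(void)
    {
        struct copy node1 = { .rsn = 1, .value = 1 };
        struct copy node0 = node1;    /* state after the first recovery */

        store_on_dmaster(&node1, 2);  /* step 4, now with an RSN bump */

        /* node1.rsn == 2 > node0.rsn == 1, so the second recovery keeps
         * value 2 everywhere and step 7 fetches 2 as expected. */
        printf("recmaster rsn=%u value=%u\n",
               (unsigned)node1.rsn, (unsigned)node1.value);
        return 0;
    }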
-----------------------------------------------------------------------

Summary of changes:
 server/ctdb_call.c                  |    2 +-
 server/ctdb_ltdb_server.c           |    9 ++-
 tests/scripts/integration.bash      |    5 +-
 tests/simple/77_ctdb_db_recovery.sh |  133 +++++++++++++++++++++++++++++++++++
 4 files changed, 144 insertions(+), 5 deletions(-)
 create mode 100755 tests/simple/77_ctdb_db_recovery.sh


Changeset truncated at 500 lines:

diff --git a/server/ctdb_call.c b/server/ctdb_call.c
index 67219c4..a6c6389 100644
--- a/server/ctdb_call.c
+++ b/server/ctdb_call.c
@@ -343,7 +343,7 @@ static void ctdb_become_dmaster(struct ctdb_db_context *ctdb_db,
 	DEBUG(DEBUG_DEBUG,("pnn %u dmaster response %08x\n", ctdb->pnn, ctdb_hash(&key)));
 
 	ZERO_STRUCT(header);
-	header.rsn = rsn + 1;
+	header.rsn = rsn;
 	header.dmaster = ctdb->pnn;
 	header.flags = record_flags;
 
diff --git a/server/ctdb_ltdb_server.c b/server/ctdb_ltdb_server.c
index 0432e49..c5d9be1 100644
--- a/server/ctdb_ltdb_server.c
+++ b/server/ctdb_ltdb_server.c
@@ -125,12 +125,15 @@ static int ctdb_ltdb_store_server(struct ctdb_db_context *ctdb_db,
 	}
 
 	if (keep) {
-		if ((data.dsize == 0) &&
-		    !ctdb_db->persistent &&
+		if (!ctdb_db->persistent &&
 		    (ctdb_db->ctdb->pnn == header->dmaster) &&
 		    !(header->flags & (CTDB_REC_RO_HAVE_DELEGATIONS|CTDB_REC_RO_HAVE_READONLY|CTDB_REC_RO_REVOKING_READONLY|CTDB_REC_RO_REVOKE_COMPLETE)))
 		{
-			schedule_for_deletion = true;
+			header->rsn++;
+
+			if (data.dsize == 0) {
+				schedule_for_deletion = true;
+			}
 		}
 		remove_from_delete_queue = !schedule_for_deletion;
 	}
diff --git a/tests/scripts/integration.bash b/tests/scripts/integration.bash
index 2e5fb37..2ec827c 100644
--- a/tests/scripts/integration.bash
+++ b/tests/scripts/integration.bash
@@ -370,7 +370,7 @@ node_has_status ()
     local pnn="$1"
     local status="$2"
 
-    local bits fpat mpat
+    local bits fpat mpat rpat
     case "$status" in
 	(unhealthy)    bits="?:?:?:1:*" ;;
 	(healthy)      bits="?:?:?:0:*" ;;
@@ -386,6 +386,7 @@ node_has_status ()
 	(unfrozen)     fpat='^[[:space:]]+frozen[[:space:]]+0$' ;;
 	(monon)        mpat='^Monitoring mode:ACTIVE \(0\)$' ;;
 	(monoff)       mpat='^Monitoring mode:DISABLED \(1\)$' ;;
+	(recovered)    rpat='^Recovery mode:NORMAL \(0\)$' ;;
 	*)
 	    echo "node_has_status: unknown status \"$status\""
 	    return 1
@@ -410,6 +411,8 @@ node_has_status ()
 	$CTDB statistics -n "$pnn" | egrep -q "$fpat"
     elif [ -n "$mpat" ] ; then
 	$CTDB getmonmode -n "$pnn" | egrep -q "$mpat"
+    elif [ -n "$rpat" ] ; then
+	$CTDB status -n "$pnn" | egrep -q "$rpat"
     else
 	echo 'node_has_status: unknown mode, neither $bits nor $fpat is set'
 	return 1
diff --git a/tests/simple/77_ctdb_db_recovery.sh b/tests/simple/77_ctdb_db_recovery.sh
new file mode 100755
index 0000000..00fa096
--- /dev/null
+++ b/tests/simple/77_ctdb_db_recovery.sh
@@ -0,0 +1,133 @@
+#!/bin/bash
+
+test_info()
+{
+    cat <<EOF
+Recovery can under certain circumstances lead to old record copies
+resurrecting: Recovery selects the newest record copy purely by RSN. At
+the end of the recovery, the recovery master is the dmaster for all
+records in all (non-persistent) databases. And the other nodes locally
+hold the complete copy of the databases. The bug is that the recovery
+process does not increment the RSN on the recovery master at the end of
+the recovery. Now clients acting directly on the Recovery master will
+directly change a record's content on the recmaster without migration
+and hence without RSN bump.
+So a subsequent recovery can not tell
+the recmaster's copy is newer than the copies on the other nodes, since
+their RSN is the same. Hence, if the recmaster is not node 0 (or more
+precisely not the active node with the lowest node number), the recovery
+will choose copies from nodes with lower number and stick to these.
+
+Steps:
+
+1. Create a test database
+2. Add a record with value value1 on recovery master
+3. Force a recovery
+4. Update the record with value value2 on recovery master
+5. Force a recovery
+6. Fetch the record
+
+Expected results:
+
+* The record should have value value2 and not value1
+
+EOF
+}
+
+. "${TEST_SCRIPTS_DIR}/integration.bash"
+
+ctdb_test_init "$@"
+
+set -e
+
+cluster_is_healthy
+
+# Reset configuration
+ctdb_restart_when_done
+
+#
+# Main test
+#
+TESTDB="rec_test.tdb"
+
+status=0
+
+# Make sure node 0 is not the recovery master
+echo "find out which node is recmaster"
+try_command_on_node -q any $CTDB_TEST_WRAPPER ctdb recmaster
+recmaster="$out"
+if [ "$recmaster" = "0" ]; then
+    echo "node 0 is recmaster, disable recmasterrole on node 0"
+    #
+    # Note:
+    # It should be sufficient to run "ctdb setrecmasterrole off"
+    # on node 0 and wait for election and recovery to finish.
+    # But there were problems related to this in this automatic
+    # test, so for now use "ctdb stop" and "ctdb continue".
+    #
+    echo "stop node 0"
+    try_command_on_node -q 0 $CTDB_TEST_WRAPPER ctdb stop
+    wait_until_node_has_status 0 stopped
+    echo "continue node 0"
+    try_command_on_node -q 0 $CTDB_TEST_WRAPPER ctdb continue
+    wait_until_node_has_status 0 notstopped
+
+    try_command_on_node -q any $CTDB_TEST_WRAPPER ctdb recmaster
+    recmaster="$out"
+    if [ "$recmaster" = "0" ]; then
+	echo "failed to move recmaster to different node"
+	exit 1
+    fi
+fi
+
+echo "Recmaster:$recmaster"
+
+# Create a temporary non-persistent database to test with
+echo "create test database $TESTDB"
+try_command_on_node -q $recmaster $CTDB_TEST_WRAPPER ctdb attach $TESTDB
+
+# Wipe Test database
+echo "wipe test database"
+try_command_on_node -q $recmaster $CTDB_TEST_WRAPPER ctdb wipedb $TESTDB
+
+# Add a record key=test1 data=value1
+echo "store key(test1) data(value1)"
+try_command_on_node -q $recmaster $CTDB_TEST_WRAPPER ctdb writekey $TESTDB test1 value1
+
+# Fetch a record key=test1
+echo "read key(test1)"
+try_command_on_node -q $recmaster $CTDB_TEST_WRAPPER ctdb readkey $TESTDB test1
+echo "$out"
+
+# Do a recovery
+echo "force recovery"
+try_command_on_node -q $recmaster $CTDB_TEST_WRAPPER ctdb recover
+
+wait_until_node_has_status $recmaster recovered
+
+# Add a record key=test1 data=value2
+echo "store key(test1) data(value2)"
+try_command_on_node -q $recmaster $CTDB_TEST_WRAPPER ctdb writekey $TESTDB test1 value2
+
+# Fetch a record key=test1
+echo "read key(test1)"
+try_command_on_node -q $recmaster $CTDB_TEST_WRAPPER ctdb readkey $TESTDB test1
+echo "$out"
+
+# Do a recovery
+echo "force recovery"
+try_command_on_node -q $recmaster $CTDB_TEST_WRAPPER ctdb recover
+
+wait_until_node_has_status $recmaster recovered
+
+# Verify record key=test1
+echo "read key(test1)"
+try_command_on_node $recmaster $CTDB_TEST_WRAPPER ctdb readkey $TESTDB test1
+echo "$out"
+if [ "$out" = "Data: size:6 ptr:[value2]" ]; then
+    echo "GOOD: Recovery did not corrupt database"
+else
+    echo "BAD: Recovery corrupted database"
+    status=1
+fi
+
+exit $status


-- 
CTDB repository