The branch, 2.5 has been updated
       via  b3df79485915d692cf685c812067f55ebf0b5ea1 (commit)
       via  4921fa00ac43d2766609e75f5cc9ac29d9c41a6b (commit)
       via  d4d60ede26b478e9ffd315b338c6ece005296a33 (commit)
      from  7f3613c510c4549e381a78291ca87c76ece91710 (commit)
http://gitweb.samba.org/?p=ctdb.git;a=shortlog;h=2.5

- Log -----------------------------------------------------------------
commit b3df79485915d692cf685c812067f55ebf0b5ea1
Author: Martin Schwenke <mar...@meltin.net>
Date:   Mon Jun 16 10:59:20 2014 +1000

    eventscripts: Ensure $GANRECDIR points to configured subdirectory
    
    Check that the $GANRECDIR symlink points to the location specified
    by $CTDB_GANESHA_REC_SUBDIR and replace it if incorrect.  This
    handles reconfiguration and filesystem changes.
    
    While touching this code:
    
    * Create the $GANRECDIR link as a separate step if it doesn't
      exist.  This means there is only 1 place where the link is
      created.
    
    * Change some variable names to the style used for local function
      variables.
    
    * Remove some "ln failed" error messages.  ln failures will be
      logged anyway.
    
    * Add -v to various mkdir/rm/ln commands so that these actions are
      logged when they actually do something.
    
    Signed-off-by: Martin Schwenke <mar...@meltin.net>
    Reviewed-by: Amitay Isaacs <ami...@gmail.com>
    
    Autobuild-User(master): Amitay Isaacs <ami...@samba.org>
    Autobuild-Date(master): Fri Jun 20 05:40:16 CEST 2014 on sn-devel-104
    
    (Imported from commit aac607d7271eb50e776423329f2446a1e33a2641)

commit 4921fa00ac43d2766609e75f5cc9ac29d9c41a6b
Author: Martin Schwenke <mar...@meltin.net>
Date:   Wed Mar 5 16:21:45 2014 +1100

    daemon: Debugging for tickle updates
    
    This was useful for debugging the race fixed by commit
    4f79fa6c7c843502fcdaa2dead534ea3719b9f69.  It might be useful again.
    
    Also fix a nearby comment typo.
    Signed-off-by: Martin Schwenke <mar...@meltin.net>
    Reviewed-by: Amitay Isaacs <ami...@gmail.com>
    
    Autobuild-User(master): Amitay Isaacs <ami...@samba.org>
    Autobuild-Date(master): Fri Jun 20 02:07:48 CEST 2014 on sn-devel-104
    
    (Imported from commit 6f43896e1258c4cf43401cbfeba24a50de3c3140)

commit d4d60ede26b478e9ffd315b338c6ece005296a33
Author: Martin Schwenke <mar...@meltin.net>
Date:   Tue Jun 10 15:16:44 2014 +1000

    tests: Try harder to avoid failures due to repeated recoveries
    
    About a year ago a check was added to _cluster_is_healthy() to make
    sure that node 0 isn't in recovery.  This was to avoid unexpected
    recoveries causing tests to fail.  However, it was misguided because
    each test initially calls cluster_is_healthy() and will now fail if
    an unexpected recovery occurs.
    
    Instead, have cluster_is_healthy() warn if the cluster is in
    recovery.
    
    Also:
    
    * Rename wait_until_healthy() to wait_until_ready() because it
      waits until both healthy and out of recovery.
    
    * Change the post-recovery sleep in restart_ctdb() to 2 seconds and
      add a loop to wait (for 2 seconds at a time) if the cluster is
      back in recovery.  The logic here is that the re-recovery timeout
      has been set to 1 second, so sleeping for just 1 second might
      race against the next recovery.
    
    * Use reverse logic in node_has_status() so that it works for
      "all".
    
    * Tweak wait_until() so that it can handle timeouts with a
      recheck-interval specified.
    Signed-off-by: Martin Schwenke <mar...@meltin.net>
    Reviewed-by: Amitay Isaacs <ami...@gmail.com>
    
    (Imported from commit 6a552f1a12ebe43f946bbbee2a3846b5a640ae4f)

-----------------------------------------------------------------------

Summary of changes:
 config/events.d/60.ganesha             |   32 ++++++++++--------
 server/ctdb_takeover.c                 |   11 +++++-
 tests/complex/34_nfs_tickle_restart.sh |    2 +-
 tests/scripts/integration.bash         |   57 ++++++++++++++++++++++++--------
 4 files changed, 72 insertions(+), 30 deletions(-)


Changeset truncated at 500 lines:

diff --git a/config/events.d/60.ganesha b/config/events.d/60.ganesha
index d348a4f..5640b74 100755
--- a/config/events.d/60.ganesha
+++ b/config/events.d/60.ganesha
@@ -84,25 +84,29 @@ create_ganesha_recdirs ()
 {
     [ -n "$CTDB_GANESHA_REC_SUBDIR" ] || CTDB_GANESHA_REC_SUBDIR=".ganesha"
 
-    MOUNTS=$(mount -t $CTDB_CLUSTER_FILESYSTEM_TYPE)
-    if [ -z "$MOUNTS" ]; then
+    _mounts=$(mount -t $CTDB_CLUSTER_FILESYSTEM_TYPE)
+    if [ -z "$_mounts" ]; then
 	echo "startup $CTDB_CLUSTER_FILESYSTEM_TYPE not ready"
 	exit 0
     fi
-    MNTPT=$(echo "$MOUNTS" | sort | awk 'NR == 1 {print $3}')
-    mkdir -p $MNTPT/$CTDB_GANESHA_REC_SUBDIR
-    if [ -e $GANRECDIR ]; then
-	if [ ! -L $GANRECDIR ] ; then
-	    rm -rf $GANRECDIR
-	    if ! ln -s $MNTPT/$CTDB_GANESHA_REC_SUBDIR $GANRECDIR ; then
-		echo "ln failed"
-	    fi
-	fi
-    else
-	if ! ln -sf $MNTPT/$CTDB_GANESHA_REC_SUBDIR $GANRECDIR ; then
-	    echo "ln failed"
+    _mntpt=$(echo "$_mounts" | sort | awk 'NR == 1 {print $3}')
+    _link_dst="${_mntpt}/${CTDB_GANESHA_REC_SUBDIR}"
+    mkdir -vp "$_link_dst"
+    if [ -e "$GANRECDIR" ]; then
+	if [ ! -L "$GANRECDIR" ] ; then
+	    rm -vrf "$GANRECDIR"
+	else
+	    _t=$(readlink "$GANRECDIR")
+	    if [ "$_t" != "$_link_dst" ] ; then
+		rm -v "$GANRECDIR"
+	    fi
 	fi
     fi
+    # This is not an "else".  It also re-creates the link if it was
+    # removed above!
+    if [ ! -e "$GANRECDIR" ]; then
+	ln -sv "$_link_dst" "$GANRECDIR"
+    fi
 
     mkdir -p $GANRECDIR2
     mkdir -p $GANRECDIR3
diff --git a/server/ctdb_takeover.c b/server/ctdb_takeover.c
index f8a26f0..aaf243a 100644
--- a/server/ctdb_takeover.c
+++ b/server/ctdb_takeover.c
@@ -3234,7 +3234,7 @@ int32_t ctdb_control_tcp_remove(struct ctdb_context *ctdb, TDB_DATA indata)
 
 
 /*
-  Called when another daemon starts - caises all tickles for all
+  Called when another daemon starts - causes all tickles for all
   public addresses we are serving to be sent to the new node on
   the next check.  This actually causes the next scheduled call to
   tdb_update_tcp_tickles() to update all nodes.  This is simple and
@@ -3244,6 +3244,9 @@ int32_t ctdb_control_startup(struct ctdb_context *ctdb, uint32_t pnn)
 {
 	struct ctdb_vnn *vnn;
 
+	DEBUG(DEBUG_INFO, ("Received startup control from node %lu\n",
+			   (unsigned long) pnn));
+
 	for (vnn = ctdb->vnn; vnn != NULL; vnn = vnn->next) {
 		vnn->tcp_update_needed = true;
 	}
@@ -3912,6 +3915,9 @@ int32_t ctdb_control_set_tcp_tickle_list(struct ctdb_context *ctdb, TDB_DATA ind
 		return -1;
 	}
 
+	DEBUG(DEBUG_INFO, ("Received tickle update for public address %s\n",
+			   ctdb_addr_to_str(&list->addr)));
+
 	vnn = find_public_ip_vnn(ctdb, &list->addr);
 	if (vnn == NULL) {
 		DEBUG(DEBUG_INFO,(__location__ " Could not set tcp tickle list, '%s' is not a public address\n",
@@ -4060,6 +4066,9 @@ static void ctdb_update_tcp_tickles(struct event_context *ev,
 			DEBUG(DEBUG_ERR,("Failed to send the tickle update for public address %s\n",
 					 ctdb_addr_to_str(&vnn->public_address)));
 		} else {
+			DEBUG(DEBUG_INFO,
+			      ("Sent tickle update for public address %s\n",
+			       ctdb_addr_to_str(&vnn->public_address)));
 			vnn->tcp_update_needed = false;
 		}
 	}
diff --git a/tests/complex/34_nfs_tickle_restart.sh b/tests/complex/34_nfs_tickle_restart.sh
index 93587e2..b7eea4c 100755
--- a/tests/complex/34_nfs_tickle_restart.sh
+++ b/tests/complex/34_nfs_tickle_restart.sh
@@ -79,7 +79,7 @@ try_command_on_node $rn $CTDB_TEST_WRAPPER restart_ctdb_1
 
 echo "Setting NoIPTakeover on node ${rn}"
 try_command_on_node $rn $CTDB setvar NoIPTakeover 1
-wait_until_healthy
+wait_until_ready
 
 echo "Getting TickleUpdateInterval..."
 try_command_on_node $test_node $CTDB getvar TickleUpdateInterval
diff --git a/tests/scripts/integration.bash b/tests/scripts/integration.bash
index 4a1f091..60f72b6 100644
--- a/tests/scripts/integration.bash
+++ b/tests/scripts/integration.bash
@@ -258,11 +258,19 @@ select_test_node_and_ips ()
 
 #######################################
 
 # Wait until either timeout expires or command succeeds.  The command
-# will be tried once per second.
+# will be tried once per second, unless timeout has format T/I, where
+# I is the recheck interval.
 wait_until ()
 {
     local timeout="$1" ; shift # "$@" is the command...
 
+    local interval=1
+    case "$timeout" in
+	*/*)
+	    interval="${timeout#*/}"
+	    timeout="${timeout%/*}"
+    esac
+
     local negate=false
     if [ "$1" = "!" ] ; then
 	negate=true
@@ -280,9 +288,12 @@ wait_until ()
 	    echo "OK"
 	    return 0
 	fi
-	echo -n .
-	t=$(($t - 1))
-	sleep 1
+	local i
+	for i in $(seq 1 $interval) ; do
+	    echo -n .
+	done
+	t=$(($t - $interval))
+	sleep $interval
     done
 
     echo "*TIMEOUT*"
@@ -302,14 +313,26 @@ sleep_for ()
 
 _cluster_is_healthy ()
 {
-    $CTDB nodestatus all >/dev/null && \
-	node_has_status 0 recovered
+    $CTDB nodestatus all >/dev/null
+}
+
+_cluster_is_recovered ()
+{
+    node_has_status all recovered
+}
+
+_cluster_is_ready ()
+{
+    _cluster_is_healthy && _cluster_is_recovered
 }
 
 cluster_is_healthy ()
 {
     if onnode 0 $CTDB_TEST_WRAPPER _cluster_is_healthy ; then
 	echo "Cluster is HEALTHY"
+	if ! onnode 0 $CTDB_TEST_WRAPPER _cluster_is_recovered ; then
+	    echo "WARNING: cluster in recovery mode!"
+	fi
 	return 0
     else
 	echo "Cluster is UNHEALTHY"
@@ -325,13 +348,13 @@ cluster_is_healthy ()
     fi
 }
 
-wait_until_healthy ()
+wait_until_ready ()
 {
     local timeout="${1:-120}"
 
-    echo "Waiting for cluster to become healthy..."
+    echo "Waiting for cluster to become ready..."
 
-    wait_until $timeout onnode -q any $CTDB_TEST_WRAPPER _cluster_is_healthy
+    wait_until $timeout onnode -q any $CTDB_TEST_WRAPPER _cluster_is_ready
 }
 
 # This function is becoming nicely overloaded.  Soon it will collapse! :-)
@@ -356,7 +379,7 @@ node_has_status ()
 	(unfrozen)     fpat='^[[:space:]]+frozen[[:space:]]+0$' ;;
 	(monon)        mpat='^Monitoring mode:ACTIVE \(0\)$' ;;
 	(monoff)       mpat='^Monitoring mode:DISABLED \(1\)$' ;;
-	(recovered)    rpat='^Recovery mode:NORMAL \(0\)$' ;;
+	(recovered)    rpat='^Recovery mode:RECOVERY \(1\)$' ;;
 	*)
 	    echo "node_has_status: unknown status \"$status\""
 	    return 1
@@ -382,7 +405,7 @@ node_has_status ()
     elif [ -n "$mpat" ] ; then
 	$CTDB getmonmode -n "$pnn" | egrep -q "$mpat"
     elif [ -n "$rpat" ] ; then
-	$CTDB status -n "$pnn" | egrep -q "$rpat"
+	! $CTDB status -n "$pnn" | egrep -q "$rpat"
     else
 	echo 'node_has_status: unknown mode, neither $bits nor $fpat is set'
 	return 1
@@ -532,8 +555,8 @@ restart_ctdb ()
 	    continue
 	}
 
-	wait_until_healthy || {
-	    echo "Cluster didn't become healthy.  Restarting..."
+	wait_until_ready || {
+	    echo "Cluster didn't become ready.  Restarting..."
 	    continue
 	}
 
@@ -545,7 +568,13 @@ restart_ctdb ()
 	# help the cluster to stabilise before a subsequent test.
 	echo "Forcing a recovery..."
 	onnode -q 0 $CTDB recover
-	sleep_for 1
+	sleep_for 2
+
+	if ! onnode -q any $CTDB_TEST_WRAPPER _cluster_is_recovered ; then
+	    echo "Cluster has gone into recovery again, waiting..."
+	    wait_until 30/2 onnode -q any $CTDB_TEST_WRAPPER _cluster_is_recovered
+	fi
 
 	# Cluster is still healthy.  Good, we're done!
 	if ! onnode 0 $CTDB_TEST_WRAPPER _cluster_is_healthy ; then

-- 
CTDB repository