Hi,

We have found several problems with the pgsql RA in our testing. It 'fails to failover' in some scenarios. I'm proposing a patch to fix them.
Problem description:

1) The first 'monitor' may fail even if the postmaster was launched successfully. This is because 'start' can return before the postmaster is ready to answer the psql query issued by 'monitor', since 'start' only checks for the existence of the postmaster process. The postmaster can take a few minutes to become ready, particularly when it needs to recover the database after a crash. Even when no recovery is necessary, we observed this failure in some of our test cases.

2) The postmaster fails to start when a 'postmaster.pid' file is left over from a previous crash.

3) 'stop' does not perform the fast mode shutdown effectively, because it issues the immediate mode shutdown at the very next moment. The fast mode shutdown can take a few minutes to complete while it flushes the database log. This isn't a critical problem, but it may make the failover take longer (according to our database team). It is preferable to wait for the fast mode shutdown to complete for as long as possible.

Proposals to fix:

1) In 'start', wait until the postmaster is ready to answer, using the same check that 'monitor' does. The maximum time to wait for startup can be customized with an additional parameter, 'start_wait' (default 30).

2) Add cleanup code that removes 'postmaster.pid' on stop and before starting.

3) In 'stop', wait until the postmaster completes the fast mode shutdown. The maximum time to wait for shutdown can be customized with an additional parameter, 'stop_wait' (default 30).

The attached patch is against the latest -dev.

Regards,
Keisuke MORI
NTT DATA Intellilink Corporation
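For illustration, the waiting proposed in 1) and 3) comes down to polling a check about once per second until it succeeds or a limit is reached. The following is only a minimal sketch of that idea: the helper name wait_until and the marker file are hypothetical and do not appear in pgsql.in or in the patch, which inlines the loops instead.

```shell
#!/bin/sh
# wait_until MAX_TRIES COMMAND [ARGS...]
# Hypothetical helper (not part of pgsql.in): runs COMMAND, sleeping one
# second after each failure, until it succeeds or MAX_TRIES attempts were made.
# Returns 0 as soon as COMMAND succeeds, 1 if the limit is reached.
wait_until() {
    max=$1
    shift
    count=0
    while [ "$count" -lt "$max" ]
    do
        if "$@" >/dev/null 2>&1
        then
            return 0
        fi
        count=`expr $count + 1`
        sleep 1
    done
    return 1
}

# Demo: wait for a marker file to appear. The file stands in for the real
# readiness check (a psql query in the RA) and is an assumption of this sketch.
marker=${TMPDIR:-/tmp}/wait_until_demo.$$
( sleep 1; touch "$marker" ) &
if wait_until 5 test -f "$marker"
then
    echo "ready"
else
    echo "timed out"
fi
rm -f "$marker"
```

In 'start' the polled command would be the psql query, and in 'stop' it would be the negated pgsql_status check.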
diff -r 7dbd2d974acc resources/OCF/pgsql.in
--- a/resources/OCF/pgsql.in	Mon Feb 19 15:25:07 2007 +0100
+++ b/resources/OCF/pgsql.in	Tue Feb 20 21:25:52 2007 +0900
@@ -19,6 +19,8 @@
 # OCF_RESKEY_pgport - Port where PostgreSQL is listening
 # OCF_RESKEY_pgdb - database to monitor. Default is template1
 # OCF_RESKEY_logfile - Path to PostgreSQL log file. Default is /dev/null
+# OCF_RESKEY_start_wait - Start waiting time. Default is 30
+# OCF_RESKEY_stop_wait - Stop waiting time. Default is 30
 ###############################################################################
 
 # Initialization:
@@ -127,6 +129,20 @@ Path to PostgreSQL server log output fil
 </longdesc>
 <shortdesc lang="en">logfile</shortdesc>
 <content type="string" default="/dev/null" />
+</parameter>
+<parameter name="start_wait" unique="0" required="0">
+<longdesc lang="en">
+Start waiting time.
+</longdesc>
+<shortdesc lang="en">start_wait</shortdesc>
+<content type="string" default="30" />
+</parameter>
+<parameter name="stop_wait" unique="0" required="0">
+<longdesc lang="en">
+Stop waiting time.
+</longdesc>
+<shortdesc lang="en">stop_wait</shortdesc>
+<content type="string" default="30" />
 </parameter>
 </parameters>
 
@@ -178,6 +194,9 @@ pgsql_start() {
 
     if [ -x $PGCTL ]
     then
+        # Remove postmastre.pid if it exists
+        rm -f $PIDFILE
+
         # Check if we need to create a log file
         if ! check_log_file $LOGFILE
         then
@@ -196,15 +215,35 @@ pgsql_start() {
         ocf_log err "$PGCTL not found!"
         return $OCF_ERR_GENERIC
     fi
-
-    if ! pgsql_status
-    then
-        sleep 5
-        if ! pgsql_status
-        then
-            echo "ERROR: PostgreSQL is not running!"
-            return $OCF_ERR_GENERIC
-        fi
+
+    # start waiting
+    count=0
+    PRESULT=1
+    while [ $count -lt $START_WAIT ]
+    do
+        if pgsql_status
+        then
+            if [ -z "$PGHOST" ]
+            then
+                $PSQL -p $PGPORT -U $PGDBA $PGDB -c 'select now();' >/dev/null 2>&1
+            else
+                $PSQL -h $PGHOST -p $PGPORT -U $PGDBA $PGDB -c 'select now();' >/dev/null 2>&1
+            fi
+            PRESULT=$?
+
+            if [ $PRESULT -eq 0 ]
+            then
+                break;
+            fi
+        fi
+        count=`expr $count + 1`
+        sleep 1
+    done
+
+    if [ $PRESULT -ne 0 ]
+    then
+        ocf_log err "PostgreSQL is not running!"
+        return $OCF_ERR_GENERIC
     fi
 
     return $OCF_SUCCESS
@@ -221,11 +260,27 @@ pgsql_stop() {
     # Stop PostgreSQL do not wait for clients to disconnect
     runasowner "$PGCTL -D $PGDATA stop -m fast > /dev/null 2>&1"
 
+    # stop waiting
+    count=0
+    while [ $count -lt $STOP_WAIT ]
+    do
+        if ! pgsql_status
+        then
+            #PostgreSQL stopped
+            break;
+        fi
+        count=`expr $count + 1`
+        sleep 1
+    done
+
     if pgsql_status
     then
         #PostgreSQL is still up. Use another shutdown mode.
         runasowner "$PGCTL -D $PGDATA stop -m immediate > /dev/null 2>&1"
     fi
+
+    # Remove postmastre.pid if it exists
+    rm -f $PIDFILE
 
     return $OCF_SUCCESS
 }
@@ -348,6 +403,8 @@ PGDB=${OCF_RESKEY_pgdb:-template1}
 PGDB=${OCF_RESKEY_pgdb:-template1}
 LOGFILE=${OCF_RESKEY_logfile:-/dev/null}
 PIDFILE=${PGDATA}/postmaster.pid
+START_WAIT=${OCF_RESKEY_start_wait:-"30"}
+STOP_WAIT=${OCF_RESKEY_stop_wait:-"30"}
 
 case "$1" in
     methods)    pgsql_methods
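A note on the new parameters: both loops compare the counter against START_WAIT and STOP_WAIT with -lt, so a non-numeric value would make the test report an error and the loop would be skipped entirely. A possible guard is sketched below; is_uint and resolve_wait are hypothetical names that are not part of the patch, and falling back to 30 is just one reasonable choice.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if $1 is a non-empty string of digits.
is_uint() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *)           return 0 ;;
    esac
}

# Hypothetical helper: print $1 if it is a valid wait count,
# otherwise print the fallback given in $2.
resolve_wait() {
    if is_uint "$1"
    then
        echo "$1"
    else
        echo "$2"
    fi
}

# Same defaults as the patch (30), with a fallback for malformed values.
START_WAIT=`resolve_wait "${OCF_RESKEY_start_wait:-30}" 30`
STOP_WAIT=`resolve_wait "${OCF_RESKEY_stop_wait:-30}" 30`
echo "start_wait=$START_WAIT stop_wait=$STOP_WAIT"
```

An alternative would be to reject a malformed value outright instead of silently falling back to the default.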
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/