Hi,

We found several problems with the pgsql RA during our testing.
It fails to fail over in some scenarios.
I'm proposing a patch to fix them.

Problem description:

1) The first 'monitor' may fail even if the postmaster was
   successfully launched.

   This is because 'start' in pgsql may return before the
   postmaster is ready to answer the psql query issued by
   'monitor', since 'start' only checks for the existence of the
   postmaster process. The postmaster can take a few minutes to
   become ready, particularly when it needs to recover the database
   after a crash. Even when no recovery is necessary, we observed
   this failure in some of our test cases.

2) The postmaster fails to start when a 'postmaster.pid' file
   was left over from a previous crash.

3) 'stop' does not perform the fast mode shutdown effectively,
   because it issues the immediate mode shutdown right after
   requesting the fast one.  The fast mode shutdown can take a
   few minutes to complete while it flushes the database log.

   This isn't a critical problem, but it may make the failover
   take longer to complete (according to our database team),
   presumably because the database has to recover on the next
   start after an immediate shutdown. It is preferable to wait
   for the fast mode shutdown to complete for as long as possible.
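
As an illustration of problem 2 above, here is a minimal sketch of how
a leftover pid file could be told apart from one that belongs to a
running postmaster. It is not part of the attached patch, which simply
removes the file. The helper name pidfile_is_stale is hypothetical; the
sketch relies only on the fact that the first line of postmaster.pid
holds the postmaster's PID.

```sh
#!/bin/sh
# Hypothetical helper (not in the patch): succeed (return 0) only when
# the pid file exists and names a process that is no longer running.
pidfile_is_stale() {
    pidfile=$1

    # No pid file at all: nothing is left over.
    [ -f "$pidfile" ] || return 1

    # The first line of postmaster.pid holds the postmaster's PID.
    pid=$(head -n 1 "$pidfile")

    # Empty or non-numeric content: we cannot tell, so do not report
    # the file as stale.
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;
    esac

    # kill -0 sends no signal; it only tests whether the PID can be
    # signalled. Note that it also fails for a live process owned by
    # another user when the caller is not root.
    if kill -0 "$pid" 2>/dev/null
    then
        return 1
    fi
    return 0
}
```

A caller could then run something like
pidfile_is_stale "$PIDFILE" && rm -f "$PIDFILE"
before starting the postmaster. This check is more conservative than
an unconditional removal, at the cost of a PID-reuse corner case.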


Proposed fixes:

1) In 'start', wait until the postmaster is ready to answer,
   using the same check that 'monitor' performs.
   The maximum time to wait for startup to complete can be
   customized with an additional parameter, 'start_wait'.

2) Add cleanup code that removes 'postmaster.pid' after stopping
   and before starting.

3) In 'stop', wait until the postmaster completes the fast
   mode shutdown.
   The maximum time to wait for shutdown to complete can be
   customized with an additional parameter, 'stop_wait'.
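
Proposals 1 and 3 above both come down to the same bounded polling
loop. The following is only a sketch of that idea in isolation, not the
code in the patch: the helper name wait_for is hypothetical, and it
assumes the wait parameter is a whole number of one-second polls.

```sh
#!/bin/sh
# Hypothetical helper (not in the patch): run a command once per second
# until it succeeds or the given number of attempts is used up.
# Usage: wait_for MAX_ATTEMPTS COMMAND [ARGS...]
wait_for() {
    max=$1
    shift
    count=0
    while [ "$count" -lt "$max" ]
    do
        # Stop polling as soon as the check succeeds.
        if "$@"
        then
            return 0
        fi
        count=$((count + 1))
        sleep 1
    done
    # The check never succeeded within the allowed attempts.
    return 1
}
```

With such a helper, 'start' would poll the same query that 'monitor'
issues, and 'stop' would poll for the postmaster process to disappear
before falling back to the immediate mode shutdown. The real wait is
somewhat longer than the parameter value, because each check takes
time in addition to the one-second sleep.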


The attached patch is for the latest -dev.

Regards,

Keisuke MORI
NTT DATA Intellilink Corporation

diff -r 7dbd2d974acc resources/OCF/pgsql.in
--- a/resources/OCF/pgsql.in	Mon Feb 19 15:25:07 2007 +0100
+++ b/resources/OCF/pgsql.in	Tue Feb 20 21:25:52 2007 +0900
@@ -19,6 +19,8 @@
 #  OCF_RESKEY_pgport - Port where PostgreSQL is listening
 #  OCF_RESKEY_pgdb   - database to monitor. Default is template1
 #  OCF_RESKEY_logfile - Path to PostgreSQL log file. Default is /dev/null
+#  OCF_RESKEY_start_wait - Maximum time (in seconds) to wait for startup. Default is 30
+#  OCF_RESKEY_stop_wait - Maximum time (in seconds) to wait for fast shutdown. Default is 30
 ###############################################################################
 # Initialization:
 
@@ -127,6 +129,20 @@ Path to PostgreSQL server log output fil
 </longdesc>
 <shortdesc lang="en">logfile</shortdesc>
 <content type="string" default="/dev/null" />
+</parameter>
+<parameter name="start_wait" unique="0" required="0">
+<longdesc lang="en">
+Maximum time (in seconds) to wait for the postmaster to answer queries after start.
+</longdesc>
+<shortdesc lang="en">start_wait</shortdesc>
+<content type="string" default="30" />
+</parameter>
+<parameter name="stop_wait" unique="0" required="0">
+<longdesc lang="en">
+Maximum time (in seconds) to wait for the fast mode shutdown to complete.
+</longdesc>
+<shortdesc lang="en">stop_wait</shortdesc>
+<content type="string" default="30" />
 </parameter>
 </parameters>
 
@@ -178,6 +194,9 @@ pgsql_start() {
     
     if [ -x $PGCTL ]
     then
+	# Remove postmaster.pid if it exists
+	rm -f $PIDFILE
+
         # Check if we need to create a log file
         if ! check_log_file $LOGFILE
 	then
@@ -196,15 +215,35 @@ pgsql_start() {
 	ocf_log err "$PGCTL not found!"
 	return $OCF_ERR_GENERIC
     fi
-	
-    if ! pgsql_status
-    then
-	sleep 5
-	if ! pgsql_status
-	then	
-	    echo "ERROR: PostgreSQL is not running!"
-            return $OCF_ERR_GENERIC
-	fi
+
+    # start waiting
+    count=0
+    PRESULT=1
+    while [ $count -lt $START_WAIT ]
+    do
+        if pgsql_status
+        then
+            if [ -z "$PGHOST" ]
+            then
+               $PSQL -p $PGPORT -U $PGDBA $PGDB -c 'select now();' >/dev/null 2>&1
+            else
+               $PSQL -h $PGHOST -p $PGPORT -U $PGDBA $PGDB -c 'select now();' >/dev/null 2>&1
+            fi
+            PRESULT=$?
+
+            if [ $PRESULT -eq 0 ]
+	    then
+                break;
+            fi
+        fi
+        count=`expr $count + 1`
+        sleep 1
+    done
+
+    if [ $PRESULT -ne  0 ]
+    then
+	ocf_log err "PostgreSQL is not running!"
+	return $OCF_ERR_GENERIC
     fi
 
     return $OCF_SUCCESS
@@ -221,11 +260,27 @@ pgsql_stop() {
     # Stop PostgreSQL do not wait for clients to disconnect
     runasowner "$PGCTL -D $PGDATA stop -m fast > /dev/null 2>&1"
 
+    # stop waiting
+    count=0
+    while [ $count -lt $STOP_WAIT ]
+    do
+        if ! pgsql_status
+        then
+            #PostgreSQL stopped
+            break;
+        fi
+        count=`expr $count + 1`
+        sleep 1
+    done
+
     if pgsql_status
     then
        #PostgreSQL is still up. Use another shutdown mode.
        runasowner "$PGCTL -D $PGDATA stop -m immediate > /dev/null 2>&1"
     fi
+	
+    # Remove postmaster.pid if it exists
+    rm -f $PIDFILE
 
     return $OCF_SUCCESS
 }
@@ -348,6 +403,8 @@ PGDB=${OCF_RESKEY_pgdb:-template1}
 PGDB=${OCF_RESKEY_pgdb:-template1}
 LOGFILE=${OCF_RESKEY_logfile:-/dev/null}
 PIDFILE=${PGDATA}/postmaster.pid
+START_WAIT=${OCF_RESKEY_start_wait:-"30"}
+STOP_WAIT=${OCF_RESKEY_stop_wait:-"30"}
 
 case "$1" in
     methods)    pgsql_methods
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/