Package: zabbix-server-mysql
Version: 1:1.6.4-2
Priority: normal
Tags: patch

After testing the Zabbix server extensively in a lab environment (in a lab
practice I gave out to my students) I've found that under many circumstances
'/etc/init.d/zabbix-server restart' fails to start Zabbix properly.

In the lab environment we had configured the following in zabbix_server.conf:

StartPingers=10
StartDiscoverers=10

This led to over 40 processes being spawned by zabbix_server.
However, when the 'stop' action was called the child processes were not
killed in time.  In this case the 2-second timeframe defined for the 'restart'
action in the init.d script does not seem to be enough time for Zabbix to stop.

To reproduce this, just set StartPingers and StartDiscoverers to a very large
value (100), execute the restart action and then look at the process list.
Either zabbix_server is not there at all or only the (dying) child processes
are there.
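
The reproduction steps above can be sketched as a small shell helper. This is
only a sketch: count_by_name and reproduce are hypothetical names that are not
part of the package, the helper assumes a Linux /proc filesystem, and it
assumes StartPingers=100 and StartDiscoverers=100 have already been set in
zabbix_server.conf.

```shell
# count_by_name NAME: print how many processes have the command name NAME.
# Reads /proc/<pid>/comm directly (Linux only); prints 0 if none are found.
count_by_name() {
    n=0
    for f in /proc/[0-9]*/comm; do
        [ -r "$f" ] || continue
        name=
        # The process may vanish between the glob and the read; skip it then.
        read -r name 2>/dev/null < "$f" || continue
        if [ "$name" = "$1" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# reproduce: show the process count before and after a restart.
# Must be run as root on a machine with zabbix-server installed.
reproduce() {
    echo "before restart: $(count_by_name zabbix_server) zabbix_server processes"
    /etc/init.d/zabbix-server restart
    echo "after restart:  $(count_by_name zabbix_server) zabbix_server processes"
}
```

Calling 'reproduce' as root should show the problem when it triggers: the
count after the restart is either zero or only reflects children that are
still dying.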

Consequently, there is a race condition in which the 'start' action fails
because zabbix-server is still stopping all of its children and the
'stop' action has not yet finished.
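
One common way to avoid this kind of race is to poll until the old process is
really gone instead of sleeping for a fixed time. The following is a minimal
sketch of that idea, not the code from the attached patch; wait_for_exit is a
hypothetical helper name, and it assumes the caller is allowed to signal the
process (as root is in an init script).

```shell
# wait_for_exit PID TIMEOUT: poll once per second until PID no longer exists.
# Returns 0 if the process went away within TIMEOUT seconds, 1 otherwise.
# 'kill -0' sends no signal; it only checks whether the PID can be signalled.
wait_for_exit() {
    pid=$1
    remaining=$2
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$remaining" -le 0 ]; then
            return 1
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}
```

In a 'restart' action this could replace the fixed sleep: read the PID from
the pidfile before calling 'stop', then call wait_for_exit with that PID and
only fall back to a forced kill if it returns non-zero.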

I have patched the zabbix script in order to make the 'restart' action more
effective by:

- (A) introducing a 'running' function to determine if the zabbix-server is up
- (B) introducing a 'force_stop' function to try to kill the server 'the
  hard way'
- changing the '2' second value to a variable (DODTIME) defined in the script
  header. This value could (potentially) be set by sourcing an external
  configuration file (think /etc/default/zabbix-server)

- use function (A) to stop the server only if present (optimisation)
- use function (A) after the server is supposedly stopped to determine
  whether it's still running; if it is, then try to kill it (using (B))
[ Note: Additionally, there are several functions to detect children and to
kill them too ]

- introduce a 'status' action which makes use of (A)
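
The idea of overriding DODTIME from an external file could look roughly like
this. It is only a sketch and is not included in the attached patch:
load_defaults is a hypothetical helper, and /etc/default/zabbix-server is the
file suggested above, which the package does not necessarily ship.

```shell
# Built-in defaults, matching the values set in the patched script header.
DODTIME=2
MAX_DIETIME=5

# load_defaults FILE: source FILE if it is readable so that it can override
# the defaults above. A missing file is not an error.
load_defaults() {
    if [ -r "$1" ]; then
        . "$1"
    fi
}

load_defaults /etc/default/zabbix-server
```

With this in place, a line such as DODTIME=10 in /etc/default/zabbix-server
would give slow machines more time without editing the init script itself.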

I've attached a patch to the debian/zabbix-server-mysql.zabbix-server.init
file used in the package. Please review it and consider it for inclusion in
the package.

These changes could easily be moved into the zabbix-server-pgsql package.
By the way, why are there two init.d scripts in the sources (one for psql and
one for mysql) when they are exactly the same file? Wouldn't it be better to
use a single file here?


Regards

Javier

Note: This happened to me on 9 different servers, each of which had an
Intel Celeron 2GHz CPU and 1GB of RAM. With 10 discoverers/pingers each
server was running 40 zabbix_server processes.
--- zabbix-server-mysql.zabbix-server.init.orig	2009-05-26 01:47:11.000000000 +0200
+++ zabbix-server-mysql.zabbix-server.init	2009-05-26 02:51:19.000000000 +0200
@@ -13,8 +13,12 @@
 
 test -f $DAEMON || exit 0
 
+# time to wait for daemons death, in seconds
+# don't set it too low or you might not let it die gracefully
+DODTIME=2
+MAX_DIETIME=5
 DIR=/var/run/zabbix-server
-PID=$DIR/$NAME.pid
+PIDFILE=$DIR/$NAME.pid
 
 if test ! -d "$DIR"; then
         mkdir "$DIR"
@@ -23,28 +27,133 @@
 
 set -e
 
+# Check if a given process pid's cmdline matches a given name
+running()
+{
+    # No pidfile, probably no daemon present
+    [ ! -f "$PIDFILE" ] && return 1
+    pid=`cat $PIDFILE`
+
+    # No pid, probably no daemon present
+    [ -z "$pid" ] && return 1
+
+    [ ! -d /proc/$pid ] &&  return 1
+    cmd=`cat /proc/$pid/cmdline | tr "\000" "\n"|head -n 1 |cut -d : -f 1`
+    # Is this the expected child?
+    [ "$cmd" != "$DAEMON" ] &&  return 1
+
+    return 0
+}
+
+# Check if any of a given process's children are running
+running_child()
+{
+    [ -z "$NAME" ] && return 1
+    if ps -eo ppid,pid,comm |grep -q $NAME; then
+        return 0
+    fi
+    return 1
+}
+
+
+
+force_stop() {
+        [ ! -e "$PIDFILE" ] && return
+        if running ; then
+                pid=`cat $PIDFILE`
+                kill -15 $pid
+        # Is it really dead?
+                [ -n "$DODTIME" ] && sleep "$DODTIME"s
+                if running ; then
+                        kill -9 $pid
+                        [ -n "$DODTIME" ] && sleep "$DODTIME"s
+                        if running ; then
+                                echo "Cannot kill $DESC (pid=$pid)!"
+                                exit 1
+                        fi
+                fi
+        fi
+        rm -f $PIDFILE
+}
+
+# Maybe the process is not running, but its children are
+force_child_stop()
+{
+# Kill the children by name, it's safer not to use a variable here
+    killall -15 zabbix_server || true
+}
+
+# Checks if the process is properly dead
+check_death()
+{
+        [ -z "$DODTIME" ] && DODTIME=2
+        [ -z "$MAX_DIETIME" ] &&  MAX_DIETIME=15
+        sleep "$DODTIME"s
+        if running; then
+            echo "$DESC did not stop in $DODTIME seconds, forcing it to stop"
+            force_stop
+        fi
+        if running; then
+            echo "ERROR: $DESC did not die in the expected time, consider increasing DODTIME (currently $DODTIME)"
+            exit 1
+        fi
+# Wait for the children to stop
+        if running_child; then
+            echo -n "Waiting for child processes to die"
+            force_child_stop
+            for wait in `seq $MAX_DIETIME`; do
+                if ! running_child; then break ; fi
+                echo -n "."
+                force_child_stop
+                sleep $wait
+            done
+            echo
+        fi
+        if running_child; then
+            echo "ERROR: $DESC's children processes did not die in the expected time, consider increasing MAX_DIETIME (currently $MAX_DIETIME)"
+            exit 1
+        fi
+}
+
+
+
 export PATH="${PATH:+$PATH:}/usr/sbin:/sbin"
 
 case "$1" in
   start)
-    rm -f $PID
+    rm -f $PIDFILE
 	echo "Starting $DESC: $NAME"
-	start-stop-daemon --oknodo --start --pidfile $PID \
+	start-stop-daemon --oknodo --start --pidfile $PIDFILE \
 		--exec $DAEMON >/dev/null 2>&1
+        if ! running; then
+            echo "ERROR: $DESC did not start"
+            exit 1
+        fi
 	;;
   stop)
 	echo "Stopping $DESC: $NAME"
-	start-stop-daemon --oknodo --stop --pidfile $PID \
+        if running ; then
+            start-stop-daemon --oknodo --stop --pidfile $PIDFILE \
 		--exec $DAEMON
+        fi
 	;;
   restart|force-reload)
 	$0 stop
-	sleep 2
+        check_death
 	$0 start
 	;;
+  status)
+        echo -n "$DESC is "
+        if running ;  then
+            echo "running."
+        else
+            echo "not running."
+            exit 1
+        fi
+        ;;
   *)
 	N=/etc/init.d/$NAME
-	echo "Usage: $N {start|stop|restart|force-reload}" >&2
+	echo "Usage: $N {start|stop|restart|status|force-reload}" >&2
 	exit 1
 	;;
 esac
