Package: zabbix-server-mysql
Version: 1:1.6.4-2
Priority: normal
Tags: patch

After testing the Zabbix server extensively in a lab environment (in a lab practice I gave to my students), I've found that under many circumstances '/etc/init.d/zabbix-server restart' fails to start Zabbix properly.

In the lab environment we had configured the following in zabbix_server.conf:

  StartPingers=10
  StartDiscoverers=10

This led to over 40 processes being spawned by zabbix_server. However, when the 'stop' action was called, the child processes were not killed in time. In this case the 2 second timeframe defined for the 'restart' action in the init.d script does not seem to be enough time for Zabbix to stop.

To reproduce this, just set StartPingers and StartDiscoverers to a very large value (100), execute the restart action and then look at the process list. The zabbix_server process is either not there or only the (dying) child processes are there.

Consequently, there is a race condition in which the 'start' action fails because zabbix-server is still stopping all of its children and the 'stop' action has not yet finished.

I have patched the zabbix script in order to make the 'restart' action more effective by:

- (A) introducing a 'running' function to determine if the zabbix-server is up
- (B) introducing a 'force_stop' function to try to kill the server 'the hard way'
- changing the '2' second value to a variable (DODTIME) defined in the script header. This value could (potentially) be set by sourcing an external configuration file (think /etc/default/zabbix-server)
- using function (A) to stop the server only if present (optimisation)
- using function (A) after the server is supposedly stopped to determine if it's still running; if it is, then trying to kill it (using (B))
  [ Note: Additionally, there are several functions to detect children and to kill them too ]
- introducing a 'status' action which makes use of (A)

I've attached a patch to the debian/zabbix-server-mysql.zabbix-server.init file used in the package. Please review it and consider it for inclusion in the package.
These changes could easily be moved into the zabbix-server-pgsql package. By the way, why are there two init.d scripts in the sources (one for pgsql and one for mysql) when they are exactly the same file? Wouldn't it be better to use a single file here?

Regards

Javier

Note: This happened to me on 9 different servers, each of which had an Intel Celeron 2GHz CPU and 1GB of RAM. With 10 discoverers and 10 pingers, each server was running 40 zabbix_server processes.

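The idea of setting DODTIME from an external file could look roughly like the sketch below. It is only an illustration: the load_timeouts function name is hypothetical, the /etc/default/zabbix-server file is the suggestion made above and is not something the package is known to ship, and the fallback values are the ones from the patch header.

```shell
# load_timeouts FILE
# Set DODTIME and MAX_DIETIME, letting an optional defaults file override
# the built-in values. FILE is a hypothetical path such as
# /etc/default/zabbix-server; it is fine if it does not exist.
load_timeouts() {
    # Built-in fallbacks (same values as the patch header)
    DODTIME=2
    MAX_DIETIME=5

    # Source the administrator's overrides, if any
    if [ -r "$1" ]; then
        . "$1"
    fi

    # Fall back to the defaults on empty or non-numeric overrides
    case "$DODTIME" in
        ''|*[!0-9]*) DODTIME=2 ;;
    esac
    case "$MAX_DIETIME" in
        ''|*[!0-9]*) MAX_DIETIME=5 ;;
    esac
}

load_timeouts /etc/default/zabbix-server
```

With something like this, a site running many pingers and discoverers could raise the timeouts without editing the init script itself.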
--- zabbix-server-mysql.zabbix-server.init.orig 2009-05-26 01:47:11.000000000 +0200
+++ zabbix-server-mysql.zabbix-server.init 2009-05-26 02:51:19.000000000 +0200
@@ -13,8 +13,12 @@
 
 test -f $DAEMON || exit 0
 
+# time to wait for the daemon's death, in seconds
+# don't set it too low or you might not let it die gracefully
+DODTIME=2
+MAX_DIETIME=5
 DIR=/var/run/zabbix-server
-PID=$DIR/$NAME.pid
+PIDFILE=$DIR/$NAME.pid
 
 if test ! -d "$DIR"; then
   mkdir "$DIR"
@@ -23,28 +27,133 @@
 
 set -e
 
+# Check if a given process pid's cmdline matches a given name
+running()
+{
+    # No pidfile, probably no daemon present
+    [ ! -f "$PIDFILE" ] && return 1
+    pid=`cat $PIDFILE`
+
+    # No pid, probably no daemon present
+    [ -z "$pid" ] && return 1
+
+    [ ! -d /proc/$pid ] && return 1
+    cmd=`cat /proc/$pid/cmdline | tr "\000" "\n"|head -n 1 |cut -d : -f 1`
+    # Is this the expected child?
+    [ "$cmd" != "$DAEMON" ] && return 1
+
+    return 0
+}
+
+# Check if a given process' children are running
+running_child()
+{
+    [ -z "$NAME" ] && return 1
+    if ps -eo ppid,pid,comm |grep -q $NAME; then
+        return 0
+    fi
+    return 1
+}
+
+
+
+force_stop() {
+    [ ! -e "$PIDFILE" ] && return
+    if running ; then
+        pid=`cat $PIDFILE`
+        kill -15 $pid
+        # Is it really dead?
+        [ -n "$DODTIME" ] && sleep "$DODTIME"s
+        if running ; then
+            kill -9 $pid
+            [ -n "$DODTIME" ] && sleep "$DODTIME"s
+            if running ; then
+                echo "Cannot kill $DESC (pid=$pid)!"
+                exit 1
+            fi
+        fi
+    fi
+    rm -f $PIDFILE
+}
+
+# Maybe the process is not running, but its children are
+force_child_stop()
+{
+# Kill the children by name, it's safer not to use a variable here
+    killall -15 zabbix_server
+}
+
+# Checks if the process is properly dead
+check_death()
+{
+    [ -z "$DODTIME" ] && DODTIME=2
+    [ -z "$MAX_DIETIME" ] && MAX_DIETIME=15
+    sleep "$DODTIME"s
+    if running; then
+        echo "$DESC did not stop in $DODTIME seconds, forcing it to stop"
+        force_stop
+    fi
+    if running; then
+        echo "ERROR: $DESC did not die in the expected time, consider increasing DODTIME (currently $DODTIME)"
+        exit 1
+    fi
+# Wait for the children to stop
+    if running_child; then
+        echo -n "Waiting for child processes to die"
+        force_child_stop
+        for wait in `seq $MAX_DIETIME`; do
+            if ! running_child; then break ; fi
+            echo -n "."
+            force_child_stop
+            sleep $wait
+        done
+        echo
+    fi
+    if running_child; then
+        echo "ERROR: $DESC's children processes did not die in the expected time, consider increasing MAX_DIETIME (currently $MAX_DIETIME)"
+        exit 1
+    fi
+}
+
+
+
 export PATH="${PATH:+$PATH:}/usr/sbin:/sbin"
 
 case "$1" in
   start)
-    rm -f $PID
+    rm -f $PIDFILE
     echo "Starting $DESC: $NAME"
-    start-stop-daemon --oknodo --start --pidfile $PID \
+    start-stop-daemon --oknodo --start --pidfile $PIDFILE \
         --exec $DAEMON >/dev/null 2>&1
+    if ! running; then
+        echo "ERROR: $DESC did not start"
+        exit 1
+    fi
     ;;
   stop)
     echo "Stopping $DESC: $NAME"
-    start-stop-daemon --oknodo --stop --pidfile $PID \
+    if running ; then
+    start-stop-daemon --oknodo --stop --pidfile $PIDFILE \
         --exec $DAEMON
+    fi
     ;;
   restart|force-reload)
     $0 stop
-    sleep 2
+    check_death
     $0 start
     ;;
+  status)
+    echo -n "$DESC is "
+    if running ; then
+        echo "running."
+    else
+        echo "not running."
+        exit 1
+    fi
+    ;;
   *)
     N=/etc/init.d/$NAME
-    echo "Usage: $N {start|stop|restart|force-reload}" >&2
+    echo "Usage: $N {start|stop|restart|status|force-reload}" >&2
     exit 1
     ;;
 esac
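
The core of the fix is replacing a fixed sleep with a bounded poll. Stripped of the Zabbix specifics, that technique might be sketched as follows. This is not the code from the patch: the wait_for_exit name is hypothetical, and it probes the process with 'kill -0' instead of the pidfile and /proc checks used in running().

```shell
# wait_for_exit PID TIMEOUT
# Poll once per second until PID is gone or TIMEOUT seconds have passed.
# Returns 0 if the process has exited, 1 if it is still alive.
wait_for_exit() {
    pid=$1
    timeout=$2
    elapsed=0
    # kill -0 sends no signal; it only reports whether the pid can be signalled
    while kill -0 "$pid" 2>/dev/null; do
        # Give up once the time budget is used
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

A caller would read the pid before issuing 'stop', call wait_for_exit with that pid and a budget such as MAX_DIETIME, and escalate to a stronger signal only if it returns 1. Unlike a fixed 'sleep 2', this returns as soon as the daemon is gone and still gives slow shutdowns the full budget.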