Hello all.
corosync-1.4.1
pacemaker-1.1.5
pacemaker runs with "ver: 1"
I ran into some problems this week. I am not sure whether these should have
been 3 separate emails; sorry if so.
1)
I set a node to standby and then back to online. After that I got this:
2643 root     RT 0 11424 2052 1744 R 100.9 0.0 657502:53 /usr/lib/heartbeat/stonithd
2644 hacluste RT 0 12432 3440 2240 R 100.9 0.0 657502:43 /usr/lib/heartbeat/cib
2648 hacluste RT 0 11828 2860 2456 R 100.9 0.0 657502:45 /usr/lib/heartbeat/crmd
2646 hacluste RT 0 11764 2240 1904 R  99.9 0.0 657502:49 /usr/lib/heartbeat/attrd
I was in a hurry and it is a production server, so I killed these processes
and stopped pacemakerd and corosync. Then I started them again and all was OK.
I suppose pacemakerd and corosync were still running while this problem
occurred. I assume this because when I ran stop on their init scripts, it
took some time until they stopped.
Any hints?
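For a spin like the one above, it can help to record which processes are pegged before killing anything. A minimal sketch, assuming a procps-style `ps`; `busy_procs` is a hypothetical helper name, not part of pacemaker or corosync:

```shell
# busy_procs MIN_CPU
# Hypothetical helper: reads "PID %CPU COMMAND" rows (after one header line)
# on stdin and prints the rows whose %CPU is at least MIN_CPU.
busy_procs() {
    awk -v min="$1" 'NR > 1 && ($2 + 0) >= (min + 0) { print }'
}

# Example use (procps ps assumed; the log path is only an illustration):
#   ps -eo pid,pcpu,args | busy_procs 90 >>/tmp/busy-procs.log
```

Note that `ps` reports CPU usage averaged over the lifetime of the process, so this catches long-running spins like the ones shown here rather than short spikes.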
2)
This one is scary.
Twice I have run into a situation where pacemaker thinks a resource is started
but it is not. We use a slightly modified version of the "anything" agent for
our scripts, but they are aware of OCF return codes and other stuff.
I ran a monitor with our agent from the console:
# env -i ; OCF_ROOT=/usr/lib/ocf \
  OCF_RESKEY_binfile=/usr/local/mpop/bin/my/dialogues_notify.pl \
  /usr/lib/ocf/resource.d/mail.ru/generic monitor
# generic[14992]: DEBUG: default monitor : 7
So our agent said that it is not running, but pacemaker still thought it
was. It stayed like this for 2 days, until I forced a cleanup. After that it
found that the resource was not running within seconds.
This is a really scary situation. I can't reproduce it, but I have already had
it twice... maybe more that I did not see, who knows.
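For reading a manual monitor result like the "7" above, a small lookup of the standard OCF return codes can be handy. This is only a sketch; `ocf_rc_name` is a hypothetical helper name, and the numeric values are the ones defined by the OCF resource agent API:

```shell
# ocf_rc_name RC
# Hypothetical helper: prints the symbolic name of an OCF return code.
ocf_rc_name() {
    case "$1" in
        0) echo OCF_SUCCESS ;;
        1) echo OCF_ERR_GENERIC ;;
        2) echo OCF_ERR_ARGS ;;
        3) echo OCF_ERR_UNIMPLEMENTED ;;
        4) echo OCF_ERR_PERM ;;
        5) echo OCF_ERR_INSTALLED ;;
        6) echo OCF_ERR_CONFIGURED ;;
        7) echo OCF_NOT_RUNNING ;;
        8) echo OCF_RUNNING_MASTER ;;
        9) echo OCF_FAILED_MASTER ;;
        *) echo "unknown ($1)" ;;
    esac
}

# Example use after running the agent by hand:
#   /usr/lib/ocf/resource.d/mail.ru/generic monitor; ocf_rc_name $?
```

So the agent here answered OCF_NOT_RUNNING, which is the code a monitor is expected to return for a cleanly stopped resource.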
I attach our agent script, and this is how we run this script:
primitive dialogues_notify.pl ocf:mail.ru:generic \
op monitor interval="30" timeout="300" on-fail="restart" \
op start interval="0" timeout="300" \
op stop interval="0" timeout="300" \
params binfile="/usr/local/mpop/bin/my/dialogues_notify.pl" \
meta failure-timeout="120"
3)
This one is confusing and dangerous.
I use failure-timeout on most resources to wipe out temporary warning messages
from crm_verify -LV, which I use for monitoring the cluster. All works well,
but I found this:
1) A resource can't start on a node and migrates to the next one.
2) It can't start there either, nor on any of the other nodes.
3) It gives up and stops. There are many errors about all this in
crm_verify -LV, and that is good.
4) failure-timeout expires and... wipes out all the errors.
5) We have a stopped resource and all errors are wiped. And we don't know
whether it was stopped by the hand of an admin or because of errors.
I think failure-timeout should not apply to a stopped resource.
Any chance to avoid this?
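The timing in steps 3) to 5) can be sketched as simple arithmetic. This is a simplified model, assuming a failure counts as expired once at least failure-timeout seconds have passed since the last failure; `failure_expired` is a hypothetical helper, not a pacemaker command, and it does not model when the cluster actually re-evaluates:

```shell
# failure_expired LAST_FAILURE NOW TIMEOUT   (all values in seconds)
# Hypothetical helper: succeeds (exit 0) if the recorded failure is old
# enough to be forgotten under the simplified model described above.
failure_expired() {
    [ $(( $2 - $1 )) -ge "$3" ]
}

# Example use with failure-timeout="120":
#   failure_expired "$last_failure" "$(date +%s)" 120 && echo "errors wiped"
```

Under this model, with failure-timeout="120" a start failure recorded at second 1000 would be forgotten from second 1120 on, regardless of whether the resource ever started anywhere.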
--
Best regards,
Proskurin Kirill
#!/bin/sh
#######################################################################
# Initialization:
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs
if [ -n "$OCF_RESKEY_binfile" ]; then
    # Derive default pidfile/logfile names from the script name (without .pl).
    basename=`basename "${OCF_RESKEY_binfile}" .pl`
    OCF_RESKEY_pidfile_default=/var/run/${basename}.pid
    OCF_RESKEY_logfile_default=/var/log/${basename}.log
fi
OCF_RESKEY_external_pidfile_default=0
OCF_RESKEY_core_dump_default=0
: ${OCF_RESKEY_pidfile=$OCF_RESKEY_pidfile_default}
: ${OCF_RESKEY_logfile=$OCF_RESKEY_logfile_default}
: ${OCF_RESKEY_external_pidfile=$OCF_RESKEY_external_pidfile_default}
: ${OCF_RESKEY_core_dump=$OCF_RESKEY_core_dump_default}
#######################################################################
generic_usage() {
cat <<END
usage: $0 {start|stop|monitor|validate-all|meta-data}
Expects to have a fully populated OCF RA-compliant environment set.
END
}
generic_meta() {
cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="generic">
<version>1.0</version>
<longdesc lang="en">
Resource agent for any script
</longdesc>
<shortdesc lang="en">Resource agent for any script</shortdesc>
<parameters>
<parameter name="binfile" required="1">
<longdesc lang="en">
The full name of the binary to be executed.
</longdesc>
<shortdesc lang="en">Full path name of the binary to be executed</shortdesc>
<content type="string" />
</parameter>
<parameter name="options" required="0">
<longdesc lang="en">
Command line options to pass to the binary
</longdesc>
<shortdesc lang="en">Command line options</shortdesc>
<content type="string" />
</parameter>
<parameter name="pidfile">
<longdesc lang="en">
Path to pidfile. Default is: /var/run/\${basename}.pid
</longdesc>
<shortdesc lang="en">Path to pidfile</shortdesc>
<content type="string" default="${OCF_RESKEY_pidfile_default}"/>
</parameter>
<parameter name="logfile">
<longdesc lang="en">
Path to logfile. Default is: /var/log/\${basename}.log
</longdesc>
<shortdesc lang="en">Path to logfile</shortdesc>
<content type="string" default="${OCF_RESKEY_logfile_default}"/>
</parameter>
<parameter name="external_pidfile">
<longdesc lang="en">
Write pidfile by ocf-agent, not running script.
</longdesc>
<shortdesc lang="en">Who writes pidfile</shortdesc>
<content type="boolean" default="${OCF_RESKEY_external_pidfile_default}" />
</parameter>
<parameter name="core_dump">
<longdesc lang="en">
Set core file size limit to unlimited.
</longdesc>
<shortdesc lang="en">Write core dump or not</shortdesc>
<content type="boolean" default="${OCF_RESKEY_core_dump_default}" />
</parameter>
</parameters>
<actions>
<action name="start" timeout="20s" />
<action name="stop" timeout="20s" />
<action name="monitor" depth="0" timeout="20s" interval="10" />
<action name="meta-data" timeout="5s" />
<action name="validate-all" timeout="5s" />
</actions>
</resource-agent>
END
}
generic_start() {
    generic_validate || return $?
    # Already running: nothing to do.
    generic_monitor && return $OCF_SUCCESS
    if ocf_is_true "$OCF_RESKEY_core_dump"; then
        ulimit -c unlimited || return $OCF_ERR_GENERIC
    fi
    # $cmd is expanded unquoted on purpose so that the options are word-split.
    local cmd="$OCF_RESKEY_binfile $OCF_RESKEY_options"
    ocf_log debug "Running $cmd"
    $cmd >>"$OCF_RESKEY_logfile" 2>>"$OCF_RESKEY_logfile" </dev/null &
    local pid=$!
    if ocf_is_true "$OCF_RESKEY_external_pidfile"; then
        echo $pid >"$OCF_RESKEY_pidfile"
    fi
    # Wait until monitor succeeds; fail if the process dies first.
    while ! generic_monitor; do
        if kill -0 $pid 2>/dev/null; then
            sleep 1
        else
            return $OCF_ERR_GENERIC
        fi
    done
    return $OCF_SUCCESS
}
generic_stop() {
    generic_monitor
    # Use -eq: "==" inside [ ] is not portable under #!/bin/sh.
    if [ $? -eq $OCF_NOT_RUNNING ]; then
        return $OCF_SUCCESS
    fi
    local pid=`cat "$OCF_RESKEY_pidfile"`
    kill -TERM $pid
    # Wait for the process to exit; only the stop timeout bounds this loop.
    while kill -0 $pid 2>/dev/null; do
        sleep 1
    done
    generic_monitor >/dev/null
    if [ $? -eq $OCF_NOT_RUNNING ]; then
        return $OCF_SUCCESS
    else
        return $OCF_ERR_GENERIC
    fi
}
generic_monitor_light() {
    ocf_pidfile_status "$OCF_RESKEY_pidfile"
    case $? in
        0)  return $OCF_SUCCESS;;
        1)  if ! ocf_is_true "$OCF_RESKEY_external_pidfile"; then
                ocf_log warn "Process exited without cleaning up its pidfile, maybe it died"
            fi
            rm -f "$OCF_RESKEY_pidfile"
            return $OCF_NOT_RUNNING
            ;;
        2)  return $OCF_NOT_RUNNING;;
        *)  return $OCF_ERR_GENERIC;;
    esac
}
generic_monitor_medium() {
    generic_monitor_light || return $?
    local pid=`cat "$OCF_RESKEY_pidfile"`
    # Count how many of fd 1 (stdout) and fd 2 (stderr) point at the logfile.
    local lines=$(stat -c %N /proc/$pid/fd/[12] | grep "\`$OCF_RESKEY_logfile'" | wc -l)
    if [ $lines != 2 ]; then
        ocf_log warn "stdout or stderr file descriptor is not pointing to logfile, sending HUP"
        kill -HUP $pid
    fi
    # The redirect must stay on the same line as grep; on a line of its own
    # it becomes a separate command that always succeeds.
    if echo ping | nc localhost ${OCF_RESKEY_admin_console_port} | grep Ok >/dev/null; then
        return $OCF_SUCCESS
    else
        ocf_log err "admin_console_port ping failed"
        return $OCF_ERR_GENERIC
    fi
}
generic_monitor() {
    # Default to level 0 so the test does not break if OCF_CHECK_LEVEL is unset.
    if [ "${OCF_CHECK_LEVEL:-0}" -lt 10 ]; then
        generic_monitor_light
    else
        generic_monitor_medium
    fi
    return $?
}
generic_validate() {
    # check_binary exits with an error itself if the binary is missing.
    check_binary "$OCF_RESKEY_binfile"
    return $OCF_SUCCESS
}
case $__OCF_ACTION in
    meta-data)      generic_meta
                    exit $OCF_SUCCESS
                    ;;
    start)          generic_start;;
    stop)           generic_stop;;
    monitor)        generic_monitor;;
    validate-all)   generic_validate;;
    usage|help)     generic_usage
                    exit $OCF_SUCCESS
                    ;;
    *)              generic_usage
                    exit $OCF_ERR_UNIMPLEMENTED
                    ;;
esac
rc=$?
ocf_log debug "${OCF_RESOURCE_INSTANCE} $__OCF_ACTION : $rc"
exit $rc
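
A side note on generic_stop in the script above: its wait loop has no upper bound of its own, so a process that ignores TERM keeps the stop action running until the operation timeout. One possible variant is sketched below; `wait_for_exit` and `stop_with_escalation` are hypothetical names and the 5-second wait after KILL is an arbitrary choice, so treat this as an illustration rather than a drop-in replacement:

```shell
# wait_for_exit PID SECONDS
# Hypothetical helper: returns 0 once PID is gone, 1 if it is still alive
# after roughly SECONDS seconds.
wait_for_exit() {
    pid=$1
    remaining=$2
    while kill -0 "$pid" 2>/dev/null; do
        [ "$remaining" -gt 0 ] || return 1
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}

# stop_with_escalation PID GRACE
# Sends TERM, waits up to GRACE seconds, then falls back to KILL.
stop_with_escalation() {
    pid=$1
    grace=$2
    kill -TERM "$pid" 2>/dev/null
    wait_for_exit "$pid" "$grace" && return 0
    kill -KILL "$pid" 2>/dev/null
    wait_for_exit "$pid" 5
}
```

With something like this, a stop action could return an error on its own terms instead of relying only on the cluster's timeout="300".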
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker