Hello all.

corosync-1.4.1
pacemaker-1.1.5
pacemaker runs with "ver: 1"

I ran into some problems this week. I am not sure whether I should have sent 3 separate emails; sorry if so.

1)
I set a node to standby and then back to online. After that I got this:

2643 root     RT 0 11424 2052 1744 R 100.9 0.0 657502:53 /usr/lib/heartbeat/stonithd
2644 hacluste RT 0 12432 3440 2240 R 100.9 0.0 657502:43 /usr/lib/heartbeat/cib
2648 hacluste RT 0 11828 2860 2456 R 100.9 0.0 657502:45 /usr/lib/heartbeat/crmd
2646 hacluste RT 0 11764 2240 1904 R  99.9 0.0 657502:49 /usr/lib/heartbeat/attrd

I was in a hurry and it's a production server, so I killed these processes and stopped pacemakerd and corosync, then started them again, and everything was OK. I suppose pacemakerd and corosync were still running while this problem occurred. I assume this because when I ran stop on their init scripts, it took some time until they stopped.

Any hints?
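
For reference, a minimal sketch of how the spinning daemons could be recorded before restarting anything. The helper name busy_procs and the 90% threshold are assumptions for illustration, not part of Pacemaker. Note that ps reports %CPU averaged over the process lifetime, so it can differ from what top shows.

```shell
# Hypothetical helper: print processes at or above a CPU threshold.
# Reads "PID %CPU COMMAND" lines on stdin, for example from:
#   ps -eo pid=,pcpu=,args=
busy_procs() {
    threshold=${1:-90}
    # Compare the second column numerically against the threshold.
    awk -v t="$threshold" '$2 + 0 >= t + 0 { print }'
}

# Demo with two of the processes from the listing above; the last line
# is a made-up idle process that the filter should drop.
busy_procs 90 <<'EOF'
2643 100.9 /usr/lib/heartbeat/stonithd
2644 100.9 /usr/lib/heartbeat/cib
1 0.0 idle-example
EOF
```

On a live node the input would come from ps, e.g. `ps -eo pid=,pcpu=,args= | busy_procs 90 >>/tmp/busy.log` (the log path is an assumption).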

2)
This one is scary.
Twice I have run into a situation where pacemaker thinks a resource is started when it is not. We use a slightly modified version of the "anything" agent for our scripts, but it is aware of OCF return codes and the other requirements.

I ran the monitor action of our agent from the console:
# env -i ; OCF_ROOT=/usr/lib/ocf OCF_RESKEY_binfile=/usr/local/mpop/bin/my/dialogues_notify.pl /usr/lib/ocf/resource.d/mail.ru/generic monitor
# generic[14992]: DEBUG: default monitor : 7

So our agent says that it is not running, but pacemaker still thinks it is. It stayed like this for 2 days, until I forced a cleanup of the resource. After that, pacemaker found out within seconds that it was not running.
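
To make the manual check easier to read, here is a small sketch that translates the numeric exit code into its OCF name. The function name ocf_rc_name is an assumption for illustration; the code-to-name mapping follows the OCF resource agent API.

```shell
# Hypothetical helper: map an OCF exit code to its symbolic name.
ocf_rc_name() {
    case "$1" in
        0) echo OCF_SUCCESS ;;
        1) echo OCF_ERR_GENERIC ;;
        2) echo OCF_ERR_ARGS ;;
        3) echo OCF_ERR_UNIMPLEMENTED ;;
        4) echo OCF_ERR_PERM ;;
        5) echo OCF_ERR_INSTALLED ;;
        6) echo OCF_ERR_CONFIGURED ;;
        7) echo OCF_NOT_RUNNING ;;
        8) echo OCF_RUNNING_MASTER ;;
        9) echo OCF_FAILED_MASTER ;;
        *) echo "unknown ($1)" ;;
    esac
}

# The code reported by the agent in the run above:
ocf_rc_name 7
```

After running the agent by hand as shown above, `ocf_rc_name $?` gives the name of whatever the monitor action returned.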

This is a really scary situation. I can't reproduce it, but it has already happened twice... maybe more often without my noticing, who knows.

I attach our agent script; this is how we run it:

primitive dialogues_notify.pl ocf:mail.ru:generic \
        op monitor interval="30" timeout="300" on-fail="restart" \
        op start interval="0" timeout="300" \
        op stop interval="0" timeout="300" \
        params binfile="/usr/local/mpop/bin/my/dialogues_notify.pl" \
        meta failure-timeout="120"

3)
This one is confusing and dangerous.

I use failure-timeout on most resources to clear temporary warning messages from crm_verify -LV, which I use for monitoring the cluster. It all works well, but I found this:

1) A resource can't start on a node and migrates to the next one.
2) It can't start there either, nor on any of the other nodes.
3) It gives up and stops. There are many errors about all this in crm_verify -LV, which is good.
4) failure-timeout expires and... wipes out all the errors.
5) We are left with a stopped resource and no errors, and we don't know whether it was stopped by an admin by hand or because of errors.

I think failure-timeout should not take effect on a stopped resource.
Is there any way to avoid this?
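
As a possible stopgap for the sequence above, the warnings could be archived before failure-timeout clears them. This is only a sketch: the helper name snapshot_output and the log path are assumptions, and it does not change how pacemaker itself handles failure-timeout.

```shell
# Hypothetical helper: append the output of a command to a log file under
# a UTC timestamp, so the messages are still available after they have
# been cleared from the cluster status.
snapshot_output() {
    log=$1
    shift
    {
        printf '=== %s ===\n' "$(date -u '+%Y-%m-%d %H:%M:%S')"
        "$@" 2>&1
    } >>"$log"
}

# Intended use from cron on a cluster node (log path is an assumption):
#   snapshot_output /var/log/crm_verify.history crm_verify -LV
```

With such a history file, a stopped resource with no current errors can be checked against earlier snapshots to see whether failures preceded the stop.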

--
Best regards,
Proskurin Kirill
#!/bin/sh

#######################################################################
# Initialization:
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

if [ -n "$OCF_RESKEY_binfile" ]; then
    basename=`basename "${OCF_RESKEY_binfile}" .pl`
    OCF_RESKEY_pidfile_default=/var/run/${basename}.pid
    OCF_RESKEY_logfile_default=/var/log/${basename}.log
fi
OCF_RESKEY_external_pidfile_default=0
OCF_RESKEY_core_dump_default=0

: ${OCF_RESKEY_pidfile=$OCF_RESKEY_pidfile_default}
: ${OCF_RESKEY_logfile=$OCF_RESKEY_logfile_default}
: ${OCF_RESKEY_external_pidfile=$OCF_RESKEY_external_pidfile_default}
: ${OCF_RESKEY_core_dump=$OCF_RESKEY_core_dump_default}

#######################################################################

generic_usage() {
    cat <<END
usage: $0 {start|stop|monitor|validate-all|meta-data}

Expects to have a fully populated OCF RA-compliant environment set.
END
}

generic_meta() {
    cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="generic">
<version>1.0</version>
<longdesc lang="en">
Resource agent for any script
</longdesc>
<shortdesc lang="en">Resource agent for any script</shortdesc>

<parameters>
<parameter name="binfile" required="1">
<longdesc lang="en">
The full name of the binary to be executed.
</longdesc>
<shortdesc lang="en">Full path name of the binary to be executed</shortdesc>
<content type="string" />
</parameter>
<parameter name="options" required="0">
<longdesc lang="en">
Command line options to pass to the binary
</longdesc>
<shortdesc lang="en">Command line options</shortdesc>
<content type="string" />
</parameter>
<parameter name="pidfile">
<longdesc lang="en">
Path to pidfile. Default is: /var/run/\${basename}.pid
</longdesc>
<shortdesc lang="en">Path to pidfile</shortdesc>
<content type="string" default="${OCF_RESKEY_pidfile_default}"/>
</parameter>
<parameter name="logfile">
<longdesc lang="en">
Path to logfile. Default is: /var/log/\${basename}.log
</longdesc>
<shortdesc lang="en">Path to logfile</shortdesc>
<content type="string" default="${OCF_RESKEY_logfile_default}"/>
</parameter>
<parameter name="external_pidfile">
<longdesc lang="en">
Write pidfile by ocf-agent, not running script.
</longdesc>
<shortdesc lang="en">Who writes pidfile</shortdesc>
<content type="boolean" default="${OCF_RESKEY_external_pidfile_default}" />
</parameter>
<parameter name="core_dump">
<longdesc lang="en">
Set core file size limit to unlimited.
</longdesc>
<shortdesc lang="en">Write core dump or not</shortdesc>
<content type="boolean" default="${OCF_RESKEY_core_dump_default}" />
</parameter>
</parameters>

<actions>
<action name="start" timeout="20s" />
<action name="stop" timeout="20s" />
<action name="monitor" depth="0" timeout="20s" interval="10" />
<action name="meta-data" timeout="5s" />
<action name="validate-all" timeout="5s" />
</actions>
</resource-agent>
END
}

generic_start() {
    generic_validate || return $?
    generic_monitor && return $?
    if ocf_is_true $OCF_RESKEY_core_dump; then
        ulimit -c unlimited || return $OCF_ERR_GENERIC
    fi
    local cmd="$OCF_RESKEY_binfile $OCF_RESKEY_options"
    ocf_log debug "Running $cmd"
    $cmd >>$OCF_RESKEY_logfile 2>>$OCF_RESKEY_logfile </dev/null &
    local pid=$!
    if ocf_is_true $OCF_RESKEY_external_pidfile; then
        echo $pid >$OCF_RESKEY_pidfile
    fi
    while ! generic_monitor; do
        if kill -0 $pid 2>/dev/null; then
            sleep 1
        else
            return $OCF_ERR_GENERIC
        fi
    done
    return $OCF_SUCCESS
}

generic_stop() {
    generic_monitor
    if [ $? -eq $OCF_NOT_RUNNING ]; then
        return $OCF_SUCCESS
    fi
    local pid=`cat $OCF_RESKEY_pidfile`
    kill -TERM $pid
    while kill -0 $pid 2>/dev/null; do
        sleep 1
    done
    generic_monitor >/dev/null
    if [ $? -eq $OCF_NOT_RUNNING ]; then
        return $OCF_SUCCESS
    else
        return $OCF_ERR_GENERIC
    fi
}

generic_monitor_light() {
    ocf_pidfile_status $OCF_RESKEY_pidfile
    case $? in
        0)  return $OCF_SUCCESS;;
        1)  if ! ocf_is_true $OCF_RESKEY_external_pidfile; then
                ocf_log warn "Process exited without cleaning up its pidfile, maybe it died"
            fi
            rm -f $OCF_RESKEY_pidfile
            return $OCF_NOT_RUNNING
            ;;
        2)  return $OCF_NOT_RUNNING;;
        *)  return $OCF_ERR_GENERIC;;
    esac
}

generic_monitor_medium() {
    generic_monitor_light || return $?
    local pid=`cat $OCF_RESKEY_pidfile`
    local lines=$(stat -c %N /proc/$pid/fd/[12] | grep "\`$OCF_RESKEY_logfile'" | wc -l)
    if [ $lines != 2 ]; then
        ocf_log warn "stdout or stderr file descriptor is not pointing to logfile, sending HUP"
        kill -HUP $pid
    fi
    if echo ping | nc localhost ${OCF_RESKEY_admin_console_port} | grep Ok >/dev/null; then
        return $OCF_SUCCESS
    else
        ocf_log err "admin_console_port ping failed"
        return $OCF_ERR_GENERIC
    fi
}

generic_monitor() {
    if [ "${OCF_CHECK_LEVEL:-0}" -lt 10 ]; then
        generic_monitor_light
    else
        generic_monitor_medium
    fi
    return $?
}

generic_validate() {
    check_binary $OCF_RESKEY_binfile
    return $OCF_SUCCESS
}

case $__OCF_ACTION in
    meta-data)      generic_meta
                    exit $OCF_SUCCESS
                    ;;
    start)          generic_start;;
    stop)           generic_stop;;
    monitor)        generic_monitor;;
    validate-all)   generic_validate;;
    usage|help)     generic_usage
                    exit $OCF_SUCCESS
                    ;;
    *)              generic_usage
                    exit $OCF_ERR_UNIMPLEMENTED
                    ;;
esac
rc=$?
ocf_log debug "${OCF_RESOURCE_INSTANCE} $__OCF_ACTION : $rc"
exit $rc
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
