Package: cluster-agents
Version: 1:1.0.3-3.1
Severity: important
Tags: patch


Hello,

The VirtualDomain OCF script fails erroneously on stop
(triggering unjustified fencing of the node where it fails).

The reason for this is that the VirtualDomain_Status function
bails out too quickly (inconditionously) when receiving an empty status
back from 'virsh domstate' and when in the context of a 'Stop' action.
It should loop/wait a further few seconds and keep checking the domain
state a few more times before bailing out and triggering a cascade of
errors which eventually prompt the HA layer to fence the node.
(see attached Syslog output for the chain of events)

A patch to fix this behavior is attached.

Conceptually the same fix has actually been applied upstream.
(see https://github.com/ClusterLabs/resource-agents)

I hope this can help.

Regards,

Cedric Dufour

-- System Information:
Debian Release: 6.0.1
  APT prefers stable
  APT policy: (990, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.32-5-amd64 (SMP w/4 CPU cores)
Locale: LANG=en_GB.UTF-8, LC_CTYPE=en_GB.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages cluster-agents depends on:
ii  cluster-glue            1.0.6-1          The reusable cluster components fo
ii  libc6                   2.11.2-10        Embedded GNU C Library: Shared lib
ii  libcluster-glue         1.0.6-1          The reusable cluster components fo
ii  libglib2.0-0            2.24.2-1         The GLib library of C routines
ii  libnet1                 1.1.4-2          library for the construction and h
ii  python                  2.6.6-3+squeeze6 interactive high-level object-orie

cluster-agents recommends no packages.

cluster-agents suggests no packages.

-- no debconf information
Jun  8 11:36:55 n501 VirtualDomain[28606]: INFO: Issuing graceful shutdown 
request for domain VRNSIDU_intranet.
Jun  8 11:36:55 n501 lrmd: [3232]: info: RA output: 
(VRNSIDU_intranet-VirtualDomain:stop:stdout) Domain VRNSIDU_intranet is being 
shutdown
Jun  8 11:36:55 n501 sSMTP[28598]: SSL connection using DHE_RSA_AES_128_CBC_SHA1
Jun  8 11:36:55 n501 crmd: [3235]: info: match_graph_event: Action 
STONITH-mfsys:0_start_0 (8) confirmed on n502.havc (rc=0)
Jun  8 11:36:55 n501 crmd: [3235]: info: te_rsc_command: Initiating action 9: 
monitor STONITH-mfsys:0_monitor_60000 on n502.havc
Jun  8 11:36:55 n501 crmd: [3235]: info: te_pseudo_action: Pseudo action 13 
fired and confirmed
Jun  8 11:36:56 n501 crmd: [3235]: info: abort_transition_graph: 
te_update_diff:146 - Triggered transition abort (complete=0, 
tag=transient_attributes, id=n502.havc, magic=NA, cib=0.33.20) : Transient 
attribute: update
Jun  8 11:36:56 n501 crmd: [3235]: info: update_abort_priority: Abort priority 
upgraded from 0 to 1000000
Jun  8 11:36:56 n501 crmd: [3235]: info: update_abort_priority: Abort action 
done superceeded by restart
Jun  8 11:36:56 n501 sSMTP[28598]: Sent mail for r...@n501.havc.idiap.ch (221 
2.0.0 Bye) uid=0 username=root outbytes=617
Jun  8 11:36:56 n501 crmd: [3235]: info: process_lrm_event: LRM operation 
GW.LUSTRE_intranet-MailTo_start_0 (call=246, rc=0, cib-update=1626, 
confirmed=true) ok
Jun  8 11:36:56 n501 crmd: [3235]: info: match_graph_event: Action 
GW.LUSTRE_intranet-MailTo_start_0 (22) confirmed on n501.havc (rc=0)
Jun  8 11:36:56 n501 crmd: [3235]: info: match_graph_event: Action 
STONITH-mfsys:0_monitor_60000 (9) confirmed on n502.havc (rc=0)
Jun  8 11:37:16 n501 kernel: [1044136.919946] br0: port 3(vnet1) entering 
disabled state
Jun  8 11:37:16 n501 kernel: [1044136.950311] device vnet1 left promiscuous mode
Jun  8 11:37:16 n501 kernel: [1044136.950315] br0: port 3(vnet1) entering 
disabled state
Jun  8 11:37:16 n501 libvirtd: 11:37:16.887: error : 
qemuMonitorCommandWithHandler:255 : cannot send monitor command 'info balloon': 
Connection reset by peer
Jun  8 11:37:16 n501 libvirtd: 11:37:16.887: error : 
qemuMonitorTextGetBalloonInfo:555 : operation failed: could not query memory 
balloon allocation
Jun  8 11:37:16 n501 lrmd: [3232]: info: RA output: 
(VRNSIDU_intranet-VirtualDomain:stop:stderr) error: operation failed: could not 
query memory balloon allocation
[BUG] Jun  8 11:37:16 n501 VirtualDomain[28606]: ERROR: Virtual domain 
VRNSIDU_intranet has no state during stop operation, bailing out.
[BUG] Jun  8 11:37:16 n501 VirtualDomain[28606]: INFO: Issuing forced shutdown 
(destroy) request for domain VRNSIDU_intranet.
[BUG] Jun  8 11:37:16 n501 libvirtd: 11:37:16.899: error : 
qemudDomainDestroy:4767 : Requested operation is not valid: domain is not 
running
[BUG] Jun  8 11:37:16 n501 lrmd: [3232]: info: RA output: 
(VRNSIDU_intranet-VirtualDomain:stop:stderr) error: Failed to destroy domain 
VRNSIDU_intranet#012error: Requested operation is not valid: domain is not 
running
[BUG] Jun  8 11:37:16 n501 crmd: [3235]: info: process_lrm_event: LRM operation 
VRNSIDU_intranet-VirtualDomain_stop_0 (call=248, rc=1, cib-update=1627, 
confirmed=true) unknown error
[BUG] Jun  8 11:37:16 n501 crmd: [3235]: WARN: status_from_rc: Action 27 
(VRNSIDU_intranet-VirtualDomain_stop_0) on n501.havc failed (target: 0 vs. rc: 
1): Error
[BUG] Jun  8 11:37:16 n501 crmd: [3235]: WARN: update_failcount: Updating 
failcount for VRNSIDU_intranet-VirtualDomain on n501.havc after failed stop: 
rc=1 (update=INFINITY, time=1307525836)
Jun  8 11:37:16 n501 crmd: [3235]: info: abort_transition_graph: 
match_graph_event:272 - Triggered transition abort (complete=0, tag=lrm_rsc_op, 
id=VRNSIDU_intranet-VirtualDomain_stop_0, 
magic=0:1;27:1068:0:a09be5db-e85d-429a-a8f7-2e876b
Jun  8 11:37:16 n501 crmd: [3235]: info: match_graph_event: Action 
VRNSIDU_intranet-VirtualDomain_stop_0 (27) confirmed on n501.havc (rc=4)
Jun  8 11:37:16 n501 attrd: [3233]: info: attrd_trigger_update: Sending flush 
op to all hosts for: fail-count-VRNSIDU_intranet-VirtualDomain (INFINITY)
Jun  8 11:37:16 n501 crmd: [3235]: info: run_graph: 
====================================================
Jun  8 11:37:16 n501 crmd: [3235]: notice: run_graph: Transition 1068 
(Complete=18, Pending=0, Fired=0, Skipped=9, Incomplete=0, 
Source=/var/lib/pengine/pe-error-5.bz2): Stopped
Jun  8 11:37:16 n501 crmd: [3235]: info: te_graph_trigger: Transition 1068 is 
now complete
Jun  8 11:37:16 n501 crmd: [3235]: info: do_state_transition: State transition 
S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
origin=notify_crmd ]
Jun  8 11:37:16 n501 crmd: [3235]: info: do_state_transition: All 2 cluster 
nodes are eligible to run resources.
Jun  8 11:37:16 n501 crmd: [3235]: info: do_pe_invoke: Query 1628: Requesting 
the current CIB: S_POLICY_ENGINE
Jun  8 11:37:16 n501 attrd: [3233]: info: attrd_perform_update: Sent update 
513: fail-count-VRNSIDU_intranet-VirtualDomain=INFINITY
Jun  8 11:37:16 n501 attrd: [3233]: info: attrd_trigger_update: Sending flush 
op to all hosts for: last-failure-VRNSIDU_intranet-VirtualDomain (1307525836)
Jun  8 11:37:16 n501 attrd: [3233]: info: attrd_perform_update: Sent update 
515: last-failure-VRNSIDU_intranet-VirtualDomain=1307525836
Jun  8 11:37:16 n501 crmd: [3235]: info: do_pe_invoke_callback: Invoking the 
PE: query=1628, ref=pe_calc-dc-1307525836-1642, seq=40, quorate=1
Jun  8 11:37:16 n501 crmd: [3235]: info: abort_transition_graph: 
te_update_diff:146 - Triggered transition abort (complete=1, 
tag=transient_attributes, id=n501.havc, magic=NA, cib=0.33.24) : Transient 
attribute: update
Jun  8 11:37:16 n501 crmd: [3235]: info: abort_transition_graph: 
te_update_diff:146 - Triggered transition abort (complete=1, 
tag=transient_attributes, id=n501.havc, magic=NA, cib=0.33.25) : Transient 
attribute: update
Jun  8 11:37:16 n501 crmd: [3235]: info: do_pe_invoke: Query 1629: Requesting 
the current CIB: S_POLICY_ENGINE
Jun  8 11:37:16 n501 crmd: [3235]: info: do_pe_invoke: Query 1630: Requesting 
the current CIB: S_POLICY_ENGINE
Jun  8 11:37:16 n501 pengine: [3234]: notice: unpack_config: On loss of CCM 
Quorum: Ignore
Jun  8 11:37:16 n501 pengine: [3234]: info: unpack_config: Node scores: 'red' = 
-INFINITY, 'yellow' = 0, 'green' = 0
Jun  8 11:37:16 n501 pengine: [3234]: info: determine_online_status: Node 
n501.havc is online
[BUG] Jun  8 11:37:16 n501 pengine: [3234]: WARN: unpack_rsc_op: Processing 
failed op VRNSIDU_intranet-VirtualDomain_stop_0 on n501.havc: unknown error (1)
[BUG] Jun  8 11:37:16 n501 pengine: [3234]: WARN: pe_fence_node: Node n501.havc 
will be fenced to recover from resource failure(s)

--- VirtualDomain.orig  2011-06-08 13:53:21.000000000 +0200
+++ VirtualDomain       2011-06-08 14:21:27.000000000 +0200
@@ -147,9 +147,12 @@
 }
 
 VirtualDomain_Status() {
+    local loop
     rc=$OCF_ERR_GENERIC
     status="no state"
+    loop=0
     while [ "$status" = "no state" ]; do
+        loop=$(( $loop+1 ))
         status="`virsh $VIRSH_OPTIONS domstate $DOMAIN_NAME`"
         case "$status" in
            "shut off")
@@ -173,10 +176,11 @@
                # whenever virsh can't reliably obtain the domain
                # state.
                status="no state"
-               if [ "$__OCF_ACTION" = "stop" ]; then
+               if [ "$__OCF_ACTION" = "stop" -a $loop -gt 5 ]; then
                    # During the stop operation, we want to bail out
                    # quickly, so as to be able to force-stop (destroy)
-                   # the domain if necessary.
+                   # the domain if necessary (but not after making sure
+                    # we don't eventually get a valid state).
                    ocf_log error "Virtual domain $DOMAIN_NAME has no state 
during stop operation, bailing out."
                    return $OCF_ERR_GENERIC;
                else

Reply via email to