Package: cluster-agents Version: 1:1.0.3-3.1 Severity: important Tags: patch
Hello, The VirtualDomain OCF script fails erroneously on stop (triggering unjustified fencing of the node where it fails). The reason for this is that the VirtualDomain_Status function bails out too quickly (inconditionously) when receiving an empty status back from 'virsh domstate' and when in the context of a 'Stop' action. It should loop/wait a further few seconds and keep checking the domain state a few more times before bailing out and triggering a cascade of errors which eventually prompt the HA layer to fence the node. (see attached Syslog output for the chain of events) A patch to fix this behavior is attached. Conceptually the same fix has actually been applied upstream. (see https://github.com/ClusterLabs/resource-agents) I hope this can help. Regards, Cedric Dufour -- System Information: Debian Release: 6.0.1 APT prefers stable APT policy: (990, 'stable') Architecture: amd64 (x86_64) Kernel: Linux 2.6.32-5-amd64 (SMP w/4 CPU cores) Locale: LANG=en_GB.UTF-8, LC_CTYPE=en_GB.UTF-8 (charmap=UTF-8) Shell: /bin/sh linked to /bin/dash Versions of packages cluster-agents depends on: ii cluster-glue 1.0.6-1 The reusable cluster components fo ii libc6 2.11.2-10 Embedded GNU C Library: Shared lib ii libcluster-glue 1.0.6-1 The reusable cluster components fo ii libglib2.0-0 2.24.2-1 The GLib library of C routines ii libnet1 1.1.4-2 library for the construction and h ii python 2.6.6-3+squeeze6 interactive high-level object-orie cluster-agents recommends no packages. cluster-agents suggests no packages. -- no debconf information
Jun 8 11:36:55 n501 VirtualDomain[28606]: INFO: Issuing graceful shutdown request for domain VRNSIDU_intranet. Jun 8 11:36:55 n501 lrmd: [3232]: info: RA output: (VRNSIDU_intranet-VirtualDomain:stop:stdout) Domain VRNSIDU_intranet is being shutdown Jun 8 11:36:55 n501 sSMTP[28598]: SSL connection using DHE_RSA_AES_128_CBC_SHA1 Jun 8 11:36:55 n501 crmd: [3235]: info: match_graph_event: Action STONITH-mfsys:0_start_0 (8) confirmed on n502.havc (rc=0) Jun 8 11:36:55 n501 crmd: [3235]: info: te_rsc_command: Initiating action 9: monitor STONITH-mfsys:0_monitor_60000 on n502.havc Jun 8 11:36:55 n501 crmd: [3235]: info: te_pseudo_action: Pseudo action 13 fired and confirmed Jun 8 11:36:56 n501 crmd: [3235]: info: abort_transition_graph: te_update_diff:146 - Triggered transition abort (complete=0, tag=transient_attributes, id=n502.havc, magic=NA, cib=0.33.20) : Transient attribute: update Jun 8 11:36:56 n501 crmd: [3235]: info: update_abort_priority: Abort priority upgraded from 0 to 1000000 Jun 8 11:36:56 n501 crmd: [3235]: info: update_abort_priority: Abort action done superceeded by restart Jun 8 11:36:56 n501 sSMTP[28598]: Sent mail for r...@n501.havc.idiap.ch (221 2.0.0 Bye) uid=0 username=root outbytes=617 Jun 8 11:36:56 n501 crmd: [3235]: info: process_lrm_event: LRM operation GW.LUSTRE_intranet-MailTo_start_0 (call=246, rc=0, cib-update=1626, confirmed=true) ok Jun 8 11:36:56 n501 crmd: [3235]: info: match_graph_event: Action GW.LUSTRE_intranet-MailTo_start_0 (22) confirmed on n501.havc (rc=0) Jun 8 11:36:56 n501 crmd: [3235]: info: match_graph_event: Action STONITH-mfsys:0_monitor_60000 (9) confirmed on n502.havc (rc=0) Jun 8 11:37:16 n501 kernel: [1044136.919946] br0: port 3(vnet1) entering disabled state Jun 8 11:37:16 n501 kernel: [1044136.950311] device vnet1 left promiscuous mode Jun 8 11:37:16 n501 kernel: [1044136.950315] br0: port 3(vnet1) entering disabled state Jun 8 11:37:16 n501 libvirtd: 11:37:16.887: error : qemuMonitorCommandWithHandler:255 : cannot send monitor command 'info balloon': Connection reset by peer Jun 8 11:37:16 n501 libvirtd: 11:37:16.887: error : qemuMonitorTextGetBalloonInfo:555 : operation failed: could not query memory balloon allocation Jun 8 11:37:16 n501 lrmd: [3232]: info: RA output: (VRNSIDU_intranet-VirtualDomain:stop:stderr) error: operation failed: could not query memory balloon allocation [BUG] Jun 8 11:37:16 n501 VirtualDomain[28606]: ERROR: Virtual domain VRNSIDU_intranet has no state during stop operation, bailing out. [BUG] Jun 8 11:37:16 n501 VirtualDomain[28606]: INFO: Issuing forced shutdown (destroy) request for domain VRNSIDU_intranet. [BUG] Jun 8 11:37:16 n501 libvirtd: 11:37:16.899: error : qemudDomainDestroy:4767 : Requested operation is not valid: domain is not running [BUG] Jun 8 11:37:16 n501 lrmd: [3232]: info: RA output: (VRNSIDU_intranet-VirtualDomain:stop:stderr) error: Failed to destroy domain VRNSIDU_intranet#012error: Requested operation is not valid: domain is not running [BUG] Jun 8 11:37:16 n501 crmd: [3235]: info: process_lrm_event: LRM operation VRNSIDU_intranet-VirtualDomain_stop_0 (call=248, rc=1, cib-update=1627, confirmed=true) unknown error [BUG] Jun 8 11:37:16 n501 crmd: [3235]: WARN: status_from_rc: Action 27 (VRNSIDU_intranet-VirtualDomain_stop_0) on n501.havc failed (target: 0 vs. rc: 1): Error [BUG] Jun 8 11:37:16 n501 crmd: [3235]: WARN: update_failcount: Updating failcount for VRNSIDU_intranet-VirtualDomain on n501.havc after failed stop: rc=1 (update=INFINITY, time=1307525836) Jun 8 11:37:16 n501 crmd: [3235]: info: abort_transition_graph: match_graph_event:272 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=VRNSIDU_intranet-VirtualDomain_stop_0, magic=0:1;27:1068:0:a09be5db-e85d-429a-a8f7-2e876b Jun 8 11:37:16 n501 crmd: [3235]: info: match_graph_event: Action VRNSIDU_intranet-VirtualDomain_stop_0 (27) confirmed on n501.havc (rc=4) Jun 8 11:37:16 n501 attrd: [3233]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-VRNSIDU_intranet-VirtualDomain (INFINITY) Jun 8 11:37:16 n501 crmd: [3235]: info: run_graph: ==================================================== Jun 8 11:37:16 n501 crmd: [3235]: notice: run_graph: Transition 1068 (Complete=18, Pending=0, Fired=0, Skipped=9, Incomplete=0, Source=/var/lib/pengine/pe-error-5.bz2): Stopped Jun 8 11:37:16 n501 crmd: [3235]: info: te_graph_trigger: Transition 1068 is now complete Jun 8 11:37:16 n501 crmd: [3235]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ] Jun 8 11:37:16 n501 crmd: [3235]: info: do_state_transition: All 2 cluster nodes are eligible to run resources. Jun 8 11:37:16 n501 crmd: [3235]: info: do_pe_invoke: Query 1628: Requesting the current CIB: S_POLICY_ENGINE Jun 8 11:37:16 n501 attrd: [3233]: info: attrd_perform_update: Sent update 513: fail-count-VRNSIDU_intranet-VirtualDomain=INFINITY Jun 8 11:37:16 n501 attrd: [3233]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-VRNSIDU_intranet-VirtualDomain (1307525836) Jun 8 11:37:16 n501 attrd: [3233]: info: attrd_perform_update: Sent update 515: last-failure-VRNSIDU_intranet-VirtualDomain=1307525836 Jun 8 11:37:16 n501 crmd: [3235]: info: do_pe_invoke_callback: Invoking the PE: query=1628, ref=pe_calc-dc-1307525836-1642, seq=40, quorate=1 Jun 8 11:37:16 n501 crmd: [3235]: info: abort_transition_graph: te_update_diff:146 - Triggered transition abort (complete=1, tag=transient_attributes, id=n501.havc, magic=NA, cib=0.33.24) : Transient attribute: update Jun 8 11:37:16 n501 crmd: [3235]: info: abort_transition_graph: te_update_diff:146 - Triggered transition abort (complete=1, tag=transient_attributes, id=n501.havc, magic=NA, cib=0.33.25) : Transient attribute: update Jun 8 11:37:16 n501 crmd: [3235]: info: do_pe_invoke: Query 1629: Requesting the current CIB: S_POLICY_ENGINE Jun 8 11:37:16 n501 crmd: [3235]: info: do_pe_invoke: Query 1630: Requesting the current CIB: S_POLICY_ENGINE Jun 8 11:37:16 n501 pengine: [3234]: notice: unpack_config: On loss of CCM Quorum: Ignore Jun 8 11:37:16 n501 pengine: [3234]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0 Jun 8 11:37:16 n501 pengine: [3234]: info: determine_online_status: Node n501.havc is online [BUG] Jun 8 11:37:16 n501 pengine: [3234]: WARN: unpack_rsc_op: Processing failed op VRNSIDU_intranet-VirtualDomain_stop_0 on n501.havc: unknown error (1) [BUG] Jun 8 11:37:16 n501 pengine: [3234]: WARN: pe_fence_node: Node n501.havc will be fenced to recover from resource failure(s)
--- VirtualDomain.orig 2011-06-08 13:53:21.000000000 +0200 +++ VirtualDomain 2011-06-08 14:21:27.000000000 +0200 @@ -147,9 +147,12 @@ } VirtualDomain_Status() { + local loop rc=$OCF_ERR_GENERIC status="no state" + loop=0 while [ "$status" = "no state" ]; do + loop=$(( $loop+1 )) status="`virsh $VIRSH_OPTIONS domstate $DOMAIN_NAME`" case "$status" in "shut off") @@ -173,10 +176,11 @@ # whenever virsh can't reliably obtain the domain # state. status="no state" - if [ "$__OCF_ACTION" = "stop" ]; then + if [ "$__OCF_ACTION" = "stop" -a $loop -gt 5 ]; then # During the stop operation, we want to bail out # quickly, so as to be able to force-stop (destroy) - # the domain if necessary. + # the domain if necessary (but not after making sure + # we don't eventually get a valid state). ocf_log error "Virtual domain $DOMAIN_NAME has no state during stop operation, bailing out." return $OCF_ERR_GENERIC; else