Hi Andrew, all,

Just found a somewhat strange crmd behaviour - the DC timed out a stop operation (which led to node fencing) by itself, without even waiting for lrmd (on another node) to finish that operation.
Here is what it did:
===============
Aug 9 07:30:07 mgmt01 crmd: [15781]: WARN: action_timer_callback: Timer popped (timeout=600000, abort_level=1000000, complete=false)
Aug 9 07:30:07 mgmt01 crmd: [15781]: ERROR: print_elem: Aborting transition, action lost: [Action 228]: In-flight (id: vptest22.vds-ok.com-vm_stop_0, loc: v03-a, priority: 0)
Aug 9 07:30:07 mgmt01 crmd: [15781]: info: abort_transition_graph: action_timer_callback:486 - Triggered transition abort (complete=0) : Action lost
Aug 9 07:30:07 mgmt01 crmd: [15781]: WARN: cib_action_update: rsc_op 228: vptest22.vds-ok.com-vm_stop_0 on v03-a timed out
Aug 9 07:30:07 mgmt01 crmd: [15781]: info: create_operation_update: cib_action_update: Updating resouce vptest22.vds-ok.com-vm after Timed Out stop op (interval=0)
Aug 9 07:30:07 mgmt01 crmd: [15781]: notice: run_graph: Transition 1500 (Complete=64, Pending=0, Fired=0, Skipped=40, Incomplete=0, Source=/var/lib/pengine/pe-input-19.bz2): Stopped
Aug 9 07:30:07 mgmt01 crmd: [15781]: info: te_graph_trigger: Transition 1500 is now complete
Aug 9 07:30:07 mgmt01 crmd: [15781]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
Aug 9 07:30:07 mgmt01 crmd: [15781]: info: do_state_transition: All 3 cluster nodes are eligible to run resources.
Aug 9 07:30:07 mgmt01 crmd: [15781]: info: process_graph_event: Action vptest22.vds-ok.com-vm_stop_0 arrived after a completed transition
Aug 9 07:30:07 mgmt01 crmd: [15781]: info: abort_transition_graph: process_graph_event:482 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=vptest22.vds-ok.com-vm_last_failure_0, magic=2:1;228:1500:0:c9c55d75-3a1e-4fcc-af7f-81ede42fa918, cib=0.4519.88) : Inactive graph
Aug 9 07:30:07 mgmt01 crmd: [15781]: WARN: update_failcount: Updating failcount for vptest22.vds-ok.com-vm on v03-a after failed stop: rc=1 (update=INFINITY, time=1312875007)
Aug 9 07:30:08 mgmt01 crmd: [15781]: info: abort_transition_graph: te_update_diff:142 - Triggered transition abort (complete=1, tag=nvpair, id=status-v03-a-fail-count-vptest22.vds-ok.com-vm, name=fail-count-vptest22.vds-ok.com-vm, value=INFINITY, magic=NA, cib=0.4519.91) : Transient attribute: update
Aug 9 07:30:10 mgmt01 pengine: [15780]: WARN: pe_fence_node: Node v03-a will be fenced to recover from resource failure(s)
Aug 9 07:30:10 mgmt01 pengine: [15780]: WARN: unpack_rsc_op: Processing failed op vptest22.vds-ok.com-vm_last_failure_0 on v03-a: unknown error (1)
==============

That was a mass stop of 60 virtual machines on two 12-core nodes. I suspect the nodes were under very high load at the time, so it is possible that crmd and lrmd on the node being fenced could not get enough CPU cycles from the scheduler. I saw messages from lrmd there that some operations stayed in the op list for a very long time (~10 min). But the last such message was logged a minute before the DC decided to time out that stop operation. During that minute only some informative messages from resource agents were logged, nothing from lrmd or crmd.

What could cause such behaviour? High CPU load? Is it intended that the DC may do that (mark an operation as timed out)? What can be done to prevent such things in the future? A higher scheduling priority (up to real-time) for lrmd (or for the whole pacemaker subsystem)? Bigger timeouts?
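In case it helps the discussion, the two workarounds I have in mind would look roughly like this (a hedged sketch only - the timeout and priority values are illustrative, the resource name is taken from the logs above, and I am not sure either is the right fix):

```shell
# 1) Raise the stop timeout on the VM resource well above the ~10 min
#    backlog observed in the lrmd op list (1200s is just an example):
crm configure edit vptest22.vds-ok.com-vm
#    ... and in the editor set:  op stop interval="0" timeout="1200s"

# 2) Put lrmd into a real-time scheduling class (SCHED_RR) so it still
#    gets CPU time under heavy load (priority 10 is an assumption;
#    requires root):
chrt --rr -p 10 $(pidof lrmd)
```

I have not tried the chrt approach on a loaded cluster yet, so I do not know whether it has side effects on the resource agents lrmd spawns.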
Should I post an hb_report for that (if it is not intended, of course)?

Thank you,
Vladislav

_______________________________________________
Pacemaker mailing list: [email protected]
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
