(Moving this to the users@clusterlabs.org list, which is better suited for it)

This is expected behavior with this configuration. You have several options to change it:

* The simplest is to add pcmk_delay_max to the st-lxha parameters. This inserts a random delay, up to whatever value you choose, before fencing is executed. In a split, each side then waits a random amount of time before fencing, so it becomes unlikely that both fence at the same time. (A rough sketch follows below.)

* Another common approach is to use two devices (one for each host) instead of one. You can then put a fixed delay on one of them with pcmk_delay_base to ensure that they don't fence at the same time, effectively choosing one node to win any race. (Also sketched below.)

* Another option is to add a third node for quorum only. It could be a full cluster node that is not allowed to run any resources, or a lightweight qdevice node (but I think that requires a newer corosync than you have). This option ensures that a node will not attempt to fence the other node unless it has connectivity to the quorum node. (See the sketch below.)

FYI, external/ssh is not a reliable fence mechanism, because it will fail if the target node is unresponsive or unreachable. If these are physical machines, they likely have IPMI, which would be a better choice than ssh, though it still cannot handle the case where the target node has lost power. Physical machines also likely have hardware watchdogs, which would be a much better choice (via sbd); however, that would require either a third node for quorum or a shared storage device. An intelligent power switch is another excellent choice.
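To illustrate the first option: with the crm shell you could extend the existing device definition (e.g. via "crm configure edit st-lxha") to something like the following. This is only a rough sketch; the 15s maximum is an arbitrary example value:

    primitive st-lxha stonith:external/ssh \
        params hostlist="linx60147 linx60149" pcmk_delay_max=15s \
        meta target-role=Started is-managed=true

Each fencing action then waits a random 0-15s before executing, so in a split the node that draws the shorter delay usually kills its peer before the peer's own request goes through.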
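To illustrate the second option, here is a rough, untested sketch with made-up resource IDs. The device that fences linx60147 gets a fixed delay, so if both nodes request fencing at the same moment, linx60147's request against linx60149 executes first and linx60147 survives:

    primitive st-60147 stonith:external/ssh \
        params hostlist="linx60147" pcmk_delay_base=10s
    primitive st-60149 stonith:external/ssh \
        params hostlist="linx60149"
    location l-st-60147 st-60147 -inf: linx60147
    location l-st-60149 st-60149 -inf: linx60149

The -inf constraints simply keep each device off the node it is meant to kill (they would replace your rsc-loc3/rsc-loc4 constraints); the 10s value is again arbitrary.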
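If you try the qdevice variant of the third option, the quorum section of corosync.conf would look roughly like the sketch below. Caveats: this needs corosync 2.4+ plus the corosync-qdevice/corosync-qnetd packages (newer than your 2.3.4), the host name here is only a placeholder, and presumably you would also drop no-quorum-policy=ignore from your cluster properties so the extra vote actually changes behavior:

    quorum {
        provider: corosync_votequorum
        device {
            model: net
            votes: 1
            net {
                host: qnetd.example.com
                algorithm: ffsplit
            }
        }
    }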
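On the IPMI suggestion: the usual setup is one fence_ipmilan device per node, each pointed at that node's BMC, roughly like this untested sketch. The addresses and credentials are placeholders, and parameter names vary between fence-agents versions, so check the agent's metadata before copying anything:

    primitive st-ipmi-60147 stonith:fence_ipmilan \
        params ipaddr=10.0.0.147 login=admin passwd=secret lanplus=1 \
        pcmk_host_list="linx60147"
    primitive st-ipmi-60149 stonith:fence_ipmilan \
        params ipaddr=10.0.0.149 login=admin passwd=secret lanplus=1 \
        pcmk_host_list="linx60149"

The same pcmk_delay_base/pcmk_delay_max trick applies to these devices as well.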
On Tue, 2018-09-25 at 20:38 +0800, zhongbin wrote:
> Hi,
> I created Active/Passive Clusters on Debian 6.0.
> nodes: linx60147 linx60149
> corosync 2.3.4 + pacemaker 1.1.17
> 
> crm configure show:
> 
> node 3232244115: linx60147 \
>         attributes standby=off
> node 3232244117: linx60149 \
>         attributes standby=off
> primitive rsc-cpu ocf:pacemaker:HealthCPU \
>         params yellow_limit=60 red_limit=20 \
>         op monitor interval=30s timeout=3m \
>         op start interval=0 timeout=3m \
>         op stop interval=0 timeout=3m \
>         meta target-role=Started
> primitive rsc-vip-public IPaddr \
>         op monitor interval=30s timeout=3m start-delay=15 \
>         op start interval=0 timeout=3m \
>         op stop interval=0 timeout=3m \
>         params ip=192.168.22.224 cidr_netmask=255.255.255.0 \
>         meta migration-threshold=10
> primitive st-lxha stonith:external/ssh \
>         params hostlist="linx60147 linx60149" \
>         meta target-role=Started is-managed=true
> group rsc-group rsc-vip-public rsc-cpu \
>         meta target-role=Started
> location rsc-loc1 rsc-group 200: linx60147
> location rsc-loc2 rsc-group 100: linx60149
> location rsc-loc3 st-lxha 100: linx60147
> location rsc-loc4 st-lxha 200: linx60149
> property cib-bootstrap-options: \
>         have-watchdog=false \
>         dc-version=1.1.17-b36b869ca8 \
>         cluster-infrastructure=corosync \
>         expected-quorum-votes=2 \
>         start-failure-is-fatal=false \
>         stonith-enabled=true \
>         stonith-action=reboot \
>         no-quorum-policy=ignore \
>         last-lrm-refresh=1536225282
> 
> When I pull out all heartbeat cables, Active-node and Passive-node
> are both fenced (reboot) by each other at the same time.
> 
> linux60147 corosync.log:
> 
> Sep 25 19:34:08 [2198] linx60147 pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
> Sep 25 19:34:08 [2198] linx60147 pengine: warning: pe_fence_node: Cluster node linx60149 will be fenced: peer is no longer part of the cluster
> Sep 25 19:34:08 [2198] linx60147 pengine: warning: determine_online_status: Node linx60149 is unclean
> Sep 25 19:34:08 [2198] linx60147 pengine: info: determine_online_status_fencing: Node linx60147 is active
> Sep 25 19:34:08 [2198] linx60147 pengine: info: determine_online_status: Node linx60147 is online
> Sep 25 19:34:08 [2198] linx60147 pengine: info: unpack_node_loop: Node 3232244117 is already processed
> Sep 25 19:34:08 [2198] linx60147 pengine: info: unpack_node_loop: Node 3232244115 is already processed
> Sep 25 19:34:08 [2198] linx60147 pengine: info: unpack_node_loop: Node 3232244117 is already processed
> Sep 25 19:34:08 [2198] linx60147 pengine: info: unpack_node_loop: Node 3232244115 is already processed
> Sep 25 19:34:08 [2198] linx60147 pengine: info: group_print: Resource Group: rsc-group
> Sep 25 19:34:08 [2198] linx60147 pengine: info: common_print: rsc-vip-public (ocf::heartbeat:IPaddr): Started linx60147
> Sep 25 19:34:08 [2198] linx60147 pengine: info: common_print: rsc-cpu (ocf::pacemaker:HealthCPU): Started linx60147
> Sep 25 19:34:08 [2198] linx60147 pengine: info: common_print: st-lxha (stonith:external/ssh): Started linx60149 (UNCLEAN)
> Sep 25 19:34:08 [2198] linx60147 pengine: warning: custom_action: Action st-lxha_stop_0 on linx60149 is unrunnable (offline)
> Sep 25 19:34:08 [2198] linx60147 pengine: warning: stage6: Scheduling Node linx60149 for STONITH
> Sep 25 19:34:08 [2198] linx60147 pengine: info: native_stop_constraints: st-lxha_stop_0 is implicit after linx60149 is fenced
> Sep 25 19:34:08 [2198] linx60147 pengine: notice: LogNodeActions: * Fence linx60149
> Sep 25 19:34:08 [2198] linx60147 pengine: info: LogActions: Leave rsc-vip-public (Started linx60147)
> Sep 25 19:34:08 [2198] linx60147 pengine: info: LogActions: Leave rsc-cpu (Started linx60147)
> Sep 25 19:34:08 [2198] linx60147 pengine: notice: LogActions: Move st-lxha (Started linx60149 -> linx60147)
> Sep 25 19:34:08 [2198] linx60147 pengine: warning: process_pe_message: Calculated transition 2 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-64.bz2
> Sep 25 19:34:08 [2199] linx60147 crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE | input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response
> Sep 25 19:34:08 [2199] linx60147 crmd: info: do_te_invoke: Processing graph 2 (ref=pe_calc-dc-1537875248-29) derived from /var/lib/pacemaker/pengine/pe-warn-64.bz2
> Sep 25 19:34:08 [2199] linx60147 crmd: notice: te_fence_node: Requesting fencing (reboot) of node linx60149 | action=15 timeout=60000
> Sep 25 19:34:08 [2194] linx60147 stonith-ng: notice: handle_request: Client crmd.2199.76b55dfe wants to fence (reboot) 'linx60149' with device '(any)'
> Sep 25 19:34:08 [2194] linx60147 stonith-ng: notice: initiate_remote_stonith_op: Requesting peer fencing (reboot) of linx60149 | id=07b318da-0c28-476a-a9f3-d73d7a5142dc state=0
> Sep 25 19:34:08 [2199] linx60147 crmd: notice: te_rsc_command: Initiating start operation st-lxha_start_0 locally on linx60147 | action 13
> Sep 25 19:34:08 [2199] linx60147 crmd: info: do_lrm_rsc_op: Performing key=13:2:0:05c1e621-d48e-4854-a666-4c664da9e32d op=st-lxha_start_0
> Sep 25 19:34:08 [2195] linx60147 lrmd: info: log_execute: executing - rsc:st-lxha action:start call_id:18
> Sep 25 19:34:08 [2194] linx60147 stonith-ng: info: dynamic_list_search_cb: Refreshing port list for st-lxha
> Sep 25 19:34:08 [2194] linx60147 stonith-ng: info: process_remote_stonith_query: Query result 1 of 1 from linx60147 for linx60149/reboot (1 devices) 07b318da-0c28-476a-a9f3-d73d7a5142dc
> Sep 25 19:34:08 [2194] linx60147 stonith-ng: info: process_remote_stonith_query: All query replies have arrived, continuing (1 expected/1 received)
> Sep 25 19:34:08 [2194] linx60147 stonith-ng: info: call_remote_stonith: Total timeout set to 60 for peer's fencing of linx60149 for crmd.2199|id=07b318da-0c28-476a-a9f3-d73d7a5142dc
> Sep 25 19:34:08 [2194] linx60147 stonith-ng: info: call_remote_stonith: Requesting that 'linx60147' perform op 'linx60149 reboot' for crmd.2199 (72s, 0s)
> Sep 25 19:34:08 [2194] linx60147 stonith-ng: notice: can_fence_host_with_device: st-lxha can fence (reboot) linx60149: dynamic-list
> Sep 25 19:34:08 [2194] linx60147 stonith-ng: info: stonith_fence_get_devices_cb: Found 1 matching devices for 'linx60149'
> Sep 25 19:34:09 [2195] linx60147 lrmd: info: log_finished: finished - rsc:st-lxha action:start call_id:18 exit-code:0 exec-time:1024ms queue-time:0ms
> Sep 25 19:34:09 [2199] linx60147 crmd: notice: process_lrm_event: Result of start operation for st-lxha on linx60147: 0 (ok) | call=18 key=st-lxha_start_0 confirmed=true cib-update=51
> Sep 25 19:34:09 [2193] linx60147 cib: info: cib_process_request: Forwarding cib_modify operation for section status to all (origin=local/crmd/51)
> Sep 25 19:34:09 [2193] linx60147 cib: info: cib_perform_op: Diff: --- 0.102.21 2
> Sep 25 19:34:09 [2193] linx60147 cib: info: cib_perform_op: Diff: +++ 0.102.22 (null)
> Sep 25 19:34:09 [2193] linx60147 cib: info: cib_perform_op: + /cib: @num_updates=22
> Sep 25 19:34:09 [2193] linx60147 cib: info: cib_perform_op: + /cib/status/node_state[@id='3232244115']: @crm-debug-origin=do_update_resource
> Sep 25 19:34:09 [2193] linx60147 cib: info: cib_perform_op: + /cib/status/node_state[@id='3232244115']/lrm[@id='3232244115']/lrm_resources/lrm_resource[@id='st-lxha']/lrm_rsc_op[@id='st-lxha_last_0']: @operation_key=st-lxha_start_0, @operation=start, @transition-key=13:2:0:05c1e621-d48e-4854-a666-4c664da9e32d, @transition-magic=0:0;13:2:0:05c1e621-d48e-4854-a666-4c664da9e32d, @call-id=18, @rc-code=0, @last-run=1537875248, @last-rc-change=1537875248, @exec-time=1024
> Sep 25 19:34:09 [2193] linx60147 cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=linx60147/crmd/51, version=0.102.22)
> Sep 25 19:34:09 [2199] linx60147 crmd: info: match_graph_event: Action st-lxha_start_0 (13) confirmed on linx60147 (rc=0)
> 
> linux60149 corosync.log:
> 
> Sep 25 19:34:07 [2144] linx60149 pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
> Sep 25 19:34:07 [2144] linx60149 pengine: info: determine_online_status_fencing: Node linx60149 is active
> Sep 25 19:34:07 [2144] linx60149 pengine: info: determine_online_status: Node linx60149 is online
> Sep 25 19:34:07 [2144] linx60149 pengine: warning: pe_fence_node: Cluster node linx60147 will be fenced: peer is no longer part of the cluster
> Sep 25 19:34:07 [2144] linx60149 pengine: warning: determine_online_status: Node linx60147 is unclean
> Sep 25 19:34:07 [2144] linx60149 pengine: info: unpack_node_loop: Node 3232244117 is already processed
> Sep 25 19:34:07 [2144] linx60149 pengine: info: unpack_node_loop: Node 3232244115 is already processed
> Sep 25 19:34:07 [2144] linx60149 pengine: info: unpack_node_loop: Node 3232244117 is already processed
> Sep 25 19:34:07 [2144] linx60149 pengine: info: unpack_node_loop: Node 3232244115 is already processed
> Sep 25 19:34:07 [2144] linx60149 pengine: info: group_print: Resource Group: rsc-group
> Sep 25 19:34:07 [2144] linx60149 pengine: info: common_print: rsc-vip-public (ocf::heartbeat:IPaddr): Started linx60147 (UNCLEAN)
> Sep 25 19:34:07 [2144] linx60149 pengine: info: common_print: rsc-cpu (ocf::pacemaker:HealthCPU): Started linx60147 (UNCLEAN)
> Sep 25 19:34:07 [2144] linx60149 pengine: info: common_print: st-lxha (stonith:external/ssh): Started linx60149
> Sep 25 19:34:07 [2144] linx60149 pengine: warning: custom_action: Action rsc-vip-public_stop_0 on linx60147 is unrunnable (offline)
> Sep 25 19:34:07 [2144] linx60149 pengine: info: RecurringOp: Start recurring monitor (30s) for rsc-vip-public on linx60149
> Sep 25 19:34:07 [2144] linx60149 pengine: warning: custom_action: Action rsc-cpu_stop_0 on linx60147 is unrunnable (offline)
> Sep 25 19:34:07 [2144] linx60149 pengine: info: RecurringOp: Start recurring monitor (30s) for rsc-cpu on linx60149
> Sep 25 19:34:07 [2144] linx60149 pengine: warning: stage6: Scheduling Node linx60147 for STONITH
> Sep 25 19:34:07 [2144] linx60149 pengine: info: native_stop_constraints: rsc-vip-public_stop_0 is implicit after linx60147 is fenced
> Sep 25 19:34:07 [2144] linx60149 pengine: info: native_stop_constraints: rsc-cpu_stop_0 is implicit after linx60147 is fenced
> Sep 25 19:34:07 [2144] linx60149 pengine: notice: LogNodeActions: * Fence linx60147
> Sep 25 19:34:07 [2144] linx60149 pengine: notice: LogActions: Move rsc-vip-public (Started linx60147 -> linx60149)
> Sep 25 19:34:07 [2144] linx60149 pengine: notice: LogActions: Move rsc-cpu (Started linx60147 -> linx60149)
> Sep 25 19:34:07 [2144] linx60149 pengine: info: LogActions: Leave st-lxha (Started linx60149)
> Sep 25 19:34:07 [2144] linx60149 pengine: warning: process_pe_message: Calculated transition 0 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-52.bz2
> Sep 25 19:34:07 [2145] linx60149 crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE | input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response
> Sep 25 19:34:07 [2145] linx60149 crmd: info: do_te_invoke: Processing graph 0 (ref=pe_calc-dc-1537875247-15) derived from /var/lib/pacemaker/pengine/pe-warn-52.bz2
> Sep 25 19:34:07 [2145] linx60149 crmd: notice: te_fence_node: Requesting fencing (reboot) of node linx60147 | action=15 timeout=60000
> Sep 25 19:34:07 [2141] linx60149 stonith-ng: notice: handle_request: Client crmd.2145.321125df wants to fence (reboot) 'linx60147' with device '(any)'
> Sep 25 19:34:07 [2141] linx60149 stonith-ng: notice: initiate_remote_stonith_op: Requesting peer fencing (reboot) of linx60147 | id=05d67c3b-8ff2-4e8d-b56f-abb305d3133c state=0
> Sep 25 19:34:07 [2141] linx60149 stonith-ng: info: dynamic_list_search_cb: Refreshing port list for st-lxha
> Sep 25 19:34:07 [2141] linx60149 stonith-ng: info: process_remote_stonith_query: Query result 1 of 1 from linx60149 for linx60147/reboot (1 devices) 05d67c3b-8ff2-4e8d-b56f-abb305d3133c
> Sep 25 19:34:07 [2141] linx60149 stonith-ng: info: call_remote_stonith: Total timeout set to 60 for peer's fencing of linx60147 for crmd.2145|id=05d67c3b-8ff2-4e8d-b56f-abb305d3133c
> Sep 25 19:34:07 [2141] linx60149 stonith-ng: info: call_remote_stonith: Requesting that 'linx60149' perform op 'linx60147 reboot' for crmd.2145 (72s, 0s)
> Sep 25 19:34:07 [2141] linx60149 stonith-ng: notice: can_fence_host_with_device: st-lxha can fence (reboot) linx60147: dynamic-list
> Sep 25 19:34:07 [2141] linx60149 stonith-ng: info: stonith_fence_get_devices_cb: Found 1 matching devices for 'linx60147'
> 
> Is this behavior of cluster normal? Or is it configured with errors?
> How can I avoid it?
> 
> Thanks,
> 
> zhongbin
> 
> 
> _______________________________________________
> Developers mailing list
> develop...@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/developers
-- 
Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org