Dear Community,
Thank you Ken for your reply last time.
I attached the log messages as requested from the last thread.
I have a Pacemaker cluster with two cluster nodes, each with two network interfaces, plus two remote nodes and a prototype fencing agent (GPFS-Fence) that cuts a host's access to the clustered filesystem.
I noticed that a remote node gets fenced when the quorum node it is connected to gets fenced or experiences a network failure.
For example, I disconnected srv-2 from the rest of the cluster by running the following iptables commands on srv-2:
iptables -A INPUT -s [IP of srv-1] -j DROP ; iptables -A OUTPUT -s [IP of srv-1] -j DROP
iptables -A INPUT -s [IP of srv-3] -j DROP ; iptables -A OUTPUT -s [IP of srv-3] -j DROP
iptables -A INPUT -s [IP of srv-4] -j DROP ; iptables -A OUTPUT -s [IP of srv-4] -j DROP
I expected that remote node jangcluster-srv-4 would fail over to srv-1 given my location constraints,
but the remote node's monitor 'jangcluster-srv-4_monitor' failed and srv-4 was fenced before any failover was attempted.
What would be the proper way to simulate a network failure?
How can I configure the cluster so that remote node srv-4 fails over instead of getting fenced?
Thank you
Janghyuk Boo.
(root@jangcluster-srv-2) /root
iptables -A INPUT -s [IP of srv-1] -j DROP ; iptables -A OUTPUT -s [IP of srv-1] -j DROP
iptables -A INPUT -s [IP of srv-3] -j DROP ; iptables -A OUTPUT -s [IP of srv-3] -j DROP
iptables -A INPUT -s [IP of srv-4] -j DROP ; iptables -A OUTPUT -s [IP of srv-4] -j DROP
$ date
Fri Oct 22 12:20:31 EDT 2021
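For comparison, here is a minimal sketch of how isolation rules for both directions could be generated. It assumes the intent is to drop traffic to and from each peer. Because `-s` matches the packet's source address, an OUTPUT rule generally needs `-d` to match packets going to the peer. The helper name `isolate_rules` and the example addresses are hypothetical placeholders, and the function only prints the commands rather than running them.

```shell
#!/bin/sh
# Print (do not execute) iptables rules that drop traffic to and from each peer.
# isolate_rules is a hypothetical helper; pass the peer IPs as arguments.
isolate_rules() {
    for peer in "$@"; do
        # Inbound: packets whose source address is the peer.
        printf 'iptables -A INPUT -s %s -j DROP\n' "$peer"
        # Outbound: packets whose destination address is the peer (-d, not -s).
        printf 'iptables -A OUTPUT -d %s -j DROP\n' "$peer"
    done
}

# Example with placeholder addresses from the documentation range (RFC 5737).
isolate_rules 192.0.2.1 192.0.2.3 192.0.2.4
```

The printed commands can be reviewed first and then run as root on the node to be isolated. Each rule can be removed again by replacing `-A` with `-D`.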
nodelist {
node {
ring0_addr: xxx
ring1_addr: xxx
name: jangcluster-srv-1
nodeid: 1
}
node {
ring0_addr: xxx
ring1_addr: xxx
name: jangcluster-srv-2
nodeid: 2
}
}
Every 2.0s: crm status jangcluster-srv-1: Fri Oct 22 12:21:09 2021
Cluster Summary:
* Stack: corosync
* Current DC: jangcluster-srv-2 (version 2.0.4-1.db2pcmk.el8-2deceaa3ae) - partition with quorum
* Last updated: Fri Oct 22 12:21:10 2021
* Last change: Fri Oct 22 12:16:34 2021 by root via cibadmin on jangcluster-srv-1
* 4 nodes configured
* 3 resource instances configured
Node List:
* Online: [ jangcluster-srv-1 jangcluster-srv-2 ]
* RemoteOnline: [ jangcluster-srv-3 ]
* RemoteOFFLINE: [ jangcluster-srv-4 ]
Full List of Resources:
* GPFS-Fence (stonith:fence_gpfs): Started jangcluster-srv-1
* jangcluster-srv-3 (ocf::pacemaker:remote): Started jangcluster-srv-1
* jangcluster-srv-4 (ocf::pacemaker:remote): FAILED
Failed Resource Actions:
* jangcluster-srv-4_monitor_30000 on jangcluster-srv-2 'error' (1): call=60, status='Timed Out', exitreason='', last-rc-change='2021-10-22 12:21:09 -04:00', queued=0ms, exec=0ms
location prefer-node-jangcluster-srv-3 jangcluster-srv-3 100: jangcluster-srv-1
location prefer-node-jangcluster-srv-4 jangcluster-srv-4 100: jangcluster-srv-2
location prefer-node-jangcluster-srv-3-2 jangcluster-srv-3 50: jangcluster-srv-2
location prefer-node-jangcluster-srv-4-2 jangcluster-srv-4 50: jangcluster-srv-1
Log
Oct 22 12:21:09.363 jangcluster-srv-2 pacemaker-controld [776554] (monitor_timeout_cb) info: Timed out waiting for remote poke response from jangcluster-srv-4
Oct 22 12:21:09.363 jangcluster-srv-2 pacemaker-controld [776554] (process_lrm_event) error: Result of monitor operation for jangcluster-srv-4 on jangcluster-srv-2: Timed Out | call=60 key=jangcluster-srv-4_monitor_30000 timeout=20000ms
Oct 22 12:21:09.363 jangcluster-srv-2 pacemaker-controld [776554] (lrmd_api_disconnect) info: Disconnecting TLS jangcluster-srv-4 executor connection
Oct 22 12:21:09.363 jangcluster-srv-2 pacemaker-controld [776554] (lrmd_tls_connection_destroy) info: TLS connection destroyed
Oct 22 12:21:09.363 jangcluster-srv-2 pacemaker-controld [776554] (remote_lrm_op_callback) error: Lost connection to Pacemaker Remote node jangcluster-srv-4
Oct 22 12:21:09.363 jangcluster-srv-2 pacemaker-controld [776554] (lrmd_api_disconnect) info: Disconnecting TLS jangcluster-srv-4 executor connection
Oct 22 12:21:09.363 jangcluster-srv-2 pacemaker-based [776548] (cib_process_request) info: Forwarding cib_modify operation for section status to all (origin=local/crmd/2451)
Oct 22 12:21:09.365 jangcluster-srv-2 pacemaker-based [776548] (cib_perform_op) info: Diff: --- 34.799.16 2
Oct 22 12:21:09.365 jangcluster-srv-2 pacemaker-based [776548] (cib_perform_op) info: Diff: +++ 34.799.17 (null)
Oct 22 12:21:09.365 jangcluster-srv-2 pacemaker-based [776548] (cib_perform_op) info: + /cib: @num_updates=17
Oct 22 12:21:09.365 jangcluster-srv-2 pacemaker-based [776548] (cib_perform_op) info: ++ /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='jangcluster-srv-4']: <lrm_rsc_op id="jangcluster-srv-4_last_failure_0" operation_key="jangcluster-srv-4_monitor_30000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.3.0" transition-key="9:9:0:344b74b3-a99d-43dd-a133-02bc4aba954f" transition-magic="2:1;9:9:0:344b74b3-a99d-43dd-a133-02bc4aba954f" exit-reason="" _on_node_="jangcluster-srv-2" call-id="60" rc-code="1" op-status="2" interval="30000" last-rc-change="1634919669" exec-time="0" queue-time="0" op-digest="b6d907f5ad12b5bb2549788ab4cbc314"/>
Oct 22 12:21:09.366 jangcluster-srv-2 pacemaker-based [776548] (cib_process_request) info: Completed cib_modify operation for section status: OK (rc=0, origin=jangcluster-srv-2/crmd/2451, version=34.799.17)
Oct 22 12:21:09.366 jangcluster-srv-2 pacemaker-controld [776554] (abort_transition_graph) info: Transition 12 aborted by operation jangcluster-srv-4_monitor_30000 'create' on jangcluster-srv-2: Change in recurring result | magic=2:1;9:9:0:344b74b3-a99d-43dd-a133-02bc4aba954f cib=34.799.17 source=process_graph_event:406 complete=true
Oct 22 12:21:09.366 jangcluster-srv-2 pacemaker-controld [776554] (update_failcount) info: Updating failcount for jangcluster-srv-4 on jangcluster-srv-2 after failed monitor: rc=1 (update=value++, time=1634919669)
Oct 22 12:21:09.367 jangcluster-srv-2 pacemaker-controld [776554] (process_graph_event) notice: Transition 9 action 9 (jangcluster-srv-4_monitor_30000 on jangcluster-srv-2): expected 'ok' but got 'error' | target-rc=0 rc=1 call-id=60 event='arrived after initial scheduling'
Oct 22 12:21:09.367 jangcluster-srv-2 pacemaker-attrd [776551] (attrd_client_update) info: Expanded fail-count-jangcluster-srv-4#monitor_30000=value++ to 1
Oct 22 12:21:09.367 jangcluster-srv-2 pacemaker-controld [776554] (do_state_transition) notice: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
Oct 22 12:21:09.367 jangcluster-srv-2 pacemaker-attrd [776551] (attrd_peer_update) notice: Setting fail-count-jangcluster-srv-4#monitor_30000[jangcluster-srv-2]: (unset) -> 1 | from jangcluster-srv-2
Oct 22 12:21:09.367 jangcluster-srv-2 pacemaker-attrd [776551] (write_attribute) info: Sent CIB request 39 with 1 change for fail-count-jangcluster-srv-4#monitor_30000 (id n/a, set n/a)
Oct 22 12:21:09.367 jangcluster-srv-2 pacemaker-attrd [776551] (attrd_peer_update) notice: Setting last-failure-jangcluster-srv-4#monitor_30000[jangcluster-srv-2]: (unset) -> 1634919669 | from jangcluster-srv-2
Oct 22 12:21:09.367 jangcluster-srv-2 pacemaker-attrd [776551] (write_attribute) info: Sent CIB request 40 with 1 change for last-failure-jangcluster-srv-4#monitor_30000 (id n/a, set n/a)
Oct 22 12:21:09.369 jangcluster-srv-2 pacemaker-based [776548] (cib_process_request) info: Forwarding cib_modify operation for section status to all (origin=local/attrd/39)
Oct 22 12:21:09.369 jangcluster-srv-2 pacemaker-based [776548] (cib_process_request) info: Forwarding cib_modify operation for section status to all (origin=local/attrd/40)
Oct 22 12:21:09.369 jangcluster-srv-2 pacemaker-based [776548] (cib_perform_op) info: Diff: --- 34.799.17 2
Oct 22 12:21:09.369 jangcluster-srv-2 pacemaker-based [776548] (cib_perform_op) info: Diff: +++ 34.799.18 (null)
Oct 22 12:21:09.369 jangcluster-srv-2 pacemaker-based [776548] (cib_perform_op) info: + /cib: @num_updates=18
Oct 22 12:21:09.369 jangcluster-srv-2 pacemaker-based [776548] (cib_perform_op) info: ++ /cib/status/node_state[@id='2']/transient_attributes[@id='2']/instance_attributes[@id='status-2']: <nvpair id="status-2-fail-count-jangcluster-srv-4.monitor_30000" name="fail-count-jangcluster-srv-4#monitor_30000" value="1"/>
Oct 22 12:21:09.370 jangcluster-srv-2 pacemaker-based [776548] (cib_process_request) info: Completed cib_modify operation for section status: OK (rc=0, origin=jangcluster-srv-2/attrd/39, version=34.799.18)
Oct 22 12:21:09.370 jangcluster-srv-2 pacemaker-attrd [776551] (attrd_cib_callback) info: CIB update 39 result for fail-count-jangcluster-srv-4#monitor_30000: OK | rc=0
Oct 22 12:21:09.370 jangcluster-srv-2 pacemaker-attrd [776551] (attrd_cib_callback) info: * fail-count-jangcluster-srv-4#monitor_30000[jangcluster-srv-2]=1
Oct 22 12:21:09.370 jangcluster-srv-2 pacemaker-based [776548] (cib_perform_op) info: Diff: --- 34.799.18 2
Oct 22 12:21:09.371 jangcluster-srv-2 pacemaker-based [776548] (cib_perform_op) info: Diff: +++ 34.799.19 (null)
Oct 22 12:21:09.377 jangcluster-srv-2 pacemaker-schedulerd[776553] (unpack_rsc_op_failure) warning: Unexpected result (error) was recorded for monitor of jangcluster-srv-4 on jangcluster-srv-2 at Oct 22 12:21:09 2021 | rc=1 id=jangcluster-srv-4_last_failure_0
Oct 22 12:21:09.377 jangcluster-srv-2 pacemaker-schedulerd[776553] (unpack_rsc_op_failure) notice: jangcluster-srv-4 will not be started under current conditions
Oct 22 12:21:09.377 jangcluster-srv-2 pacemaker-schedulerd[776553] (pe_fence_node) warning: Remote node jangcluster-srv-4 will be fenced: remote connection is unrecoverable
Oct 22 12:21:09.378 jangcluster-srv-2 pacemaker-schedulerd[776553] (log_list_item) info: GPFS-Fence (stonith:fence_gpfs): Started jangcluster-srv-1
Oct 22 12:21:09.378 jangcluster-srv-2 pacemaker-schedulerd[776553] (log_list_item) info: jangcluster-srv-3 (ocf::pacemaker:remote): Started jangcluster-srv-1
Oct 22 12:21:09.378 jangcluster-srv-2 pacemaker-schedulerd[776553] (log_list_item) info: jangcluster-srv-4 (ocf::pacemaker:remote): FAILED jangcluster-srv-2
Oct 22 12:21:09.379 jangcluster-srv-2 pacemaker-schedulerd[776553] (pcmk__native_allocate) info: Resource jangcluster-srv-4 cannot run anywhere
Oct 22 12:21:09.379 jangcluster-srv-2 pacemaker-schedulerd[776553] (native_choose_node) info: Chose node jangcluster-srv-1 for GPFS-Fence from 2 nodes with score 100
Oct 22 12:21:09.379 jangcluster-srv-2 pacemaker-schedulerd[776553] (stage6) warning: Scheduling Node jangcluster-srv-4 for STONITH
Oct 22 12:21:09.379 jangcluster-srv-2 pacemaker-schedulerd[776553] (LogNodeActions) notice: * Fence (off) jangcluster-srv-4 'remote connection is unrecoverable'
Oct 22 12:21:09.379 jangcluster-srv-2 pacemaker-schedulerd[776553] (LogActions) info: Leave GPFS-Fence (Started jangcluster-srv-1)
Oct 22 12:21:09.379 jangcluster-srv-2 pacemaker-schedulerd[776553] (LogActions) info: Leave jangcluster-srv-3 (Started jangcluster-srv-1)
Oct 22 12:21:09.379 jangcluster-srv-2 pacemaker-schedulerd[776553] (LogAction) notice: * Stop jangcluster-srv-4 ( jangcluster-srv-2 ) due to node availability
Oct 22 12:21:09.379 jangcluster-srv-2 pacemaker-schedulerd[776553] (pcmk__log_transition_summary) warning: Calculated transition 13 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-801.bz2
Oct 22 12:21:09.388 jangcluster-srv-2 pacemaker-controld [776554] (handle_response) info: pe_calc calculation pe_calc-dc-1634919669-90 is obsolete
Oct 22 12:21:09.388 jangcluster-srv-2 pacemaker-schedulerd[776553] (determine_online_status_fencing) info: Node jangcluster-srv-2 is active
Oct 22 12:21:09.388 jangcluster-srv-2 pacemaker-schedulerd[776553] (determine_online_status) info: Node jangcluster-srv-2 is online
Oct 22 12:21:09.388 jangcluster-srv-2 pacemaker-schedulerd[776553] (determine_online_status_fencing) info: Node jangcluster-srv-1 is active
Oct 22 12:21:09.388 jangcluster-srv-2 pacemaker-schedulerd[776553] (determine_online_status) info: Node jangcluster-srv-1 is online
Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553] (pe_get_failcount) info: jangcluster-srv-4 has failed 1 times on jangcluster-srv-2
Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553] (pe_get_failcount) info: jangcluster-srv-4 has failed 1 times on jangcluster-srv-2
Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553] (pe_get_failcount) info: jangcluster-srv-4 has failed 1 times on jangcluster-srv-2
Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553] (pe_get_failcount) info: jangcluster-srv-4 has failed 1 times on jangcluster-srv-2
Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553] (pe_get_failcount) info: jangcluster-srv-4 has failed 1 times on jangcluster-srv-2
Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553] (unpack_rsc_op_failure) warning: Unexpected result (error) was recorded for monitor of jangcluster-srv-4 on jangcluster-srv-2 at Oct 22 12:21:09 2021 | rc=1 id=jangcluster-srv-4_last_failure_0
Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553] (unpack_rsc_op_failure) notice: jangcluster-srv-4 will not be started under current conditions
Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553] (pe_fence_node) warning: Remote node jangcluster-srv-4 will be fenced: remote connection is unrecoverable
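The scheduler saved its inputs for transition 13 in /var/lib/pacemaker/pengine/pe-warn-801.bz2 (see the pcmk__log_transition_summary line above). As a rough sketch, that file could be replayed offline with `crm_simulate` to see the allocation scores and planned actions behind the fencing decision. The wrapper function `replay_transition` is a hypothetical helper added only so that the command is skipped cleanly when the file is not readable.

```shell
#!/bin/sh
# Path taken from the pcmk__log_transition_summary log line.
pe_file=/var/lib/pacemaker/pengine/pe-warn-801.bz2

# replay_transition is a hypothetical wrapper around crm_simulate.
replay_transition() {
    # $1: saved scheduler input file
    if [ ! -r "$1" ]; then
        echo "cannot read $1" >&2
        return 1
    fi
    # Re-run the scheduler on the saved input and show allocation scores.
    crm_simulate --simulate --show-scores --xml-file "$1"
}

replay_transition "$pe_file" || echo "replay skipped"
```

This would have to be run on the node that wrote the file (here jangcluster-srv-2, the DC at the time), or on a copy of the file.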