Dear Community,

 

Thank you, Ken, for your reply last time.

 

I have attached the log messages requested in the last thread.

 

I have a Pacemaker cluster with two cluster nodes, each with two network interfaces, plus two remote nodes and a prototype fencing agent (GPFS-Fence) that cuts a host's access to the clustered filesystem.

 

I noticed that a remote node gets fenced when the quorum node it is connected to gets fenced or experiences a network failure.

For example, I disconnected srv-2 from the rest of the cluster by running the following iptables commands on srv-2:

 

iptables -A INPUT -s [IP of srv-1] -j DROP ; iptables -A OUTPUT -s [IP of srv-1]  -j DROP

iptables -A INPUT -s [IP of srv-3]  -j DROP ; iptables -A OUTPUT -s [IP of srv-3]  -j DROP

iptables -A INPUT -s [IP of srv-4]  -j DROP ; iptables -A OUTPUT -s [IP of srv-4]  -j DROP

 

I expected the remote node jangcluster-srv-4 to fail over to srv-1, given my location constraints,

but the remote node's monitor 'jangcluster-srv-4_monitor' failed and srv-4 was fenced before any failover was attempted.

What would be the proper way to simulate the network failover?

How can I configure the cluster so that remote node srv-4 fails over instead of getting fenced?

 

 

Thank you

 

Janghyuk Boo.

(root@jangcluster-srv-2) /root

iptables -A INPUT -s [IP of srv-1] -j DROP ; iptables -A OUTPUT -s [IP of srv-1]  -j DROP

iptables -A INPUT -s [IP of srv-3]  -j DROP ; iptables -A OUTPUT -s [IP of srv-3]  -j DROP

iptables -A INPUT -s [IP of srv-4]  -j DROP ; iptables -A OUTPUT -s [IP of srv-4]  -j DROP

$ date

Fri Oct 22 12:20:31 EDT 2021
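In case it helps anyone reproduce the test, the isolation above could be scripted roughly as follows. This is only a sketch under my own assumptions: the function name isolation_rules and the 192.0.2.x addresses are placeholders, not part of my setup. It prints the rules instead of running them, and unlike the commands I ran it matches outgoing packets by destination (-d) rather than by source (-s).

```shell
#!/bin/sh
# Sketch: print (not run) the iptables rules that would isolate this host
# from the given peers. Peer addresses are placeholders supplied by the caller.
# Assumption: outgoing packets are matched by destination (-d), which differs
# from the commands shown above, where the OUTPUT rules used -s.

isolation_rules() {
    # $1 is the iptables action: -A appends the rules, -D deletes them again.
    action=$1
    shift
    for peer in "$@"; do
        echo "iptables $action INPUT -s $peer -j DROP"
        echo "iptables $action OUTPUT -d $peer -j DROP"
    done
}

# Example (dry run): review the printed rules, then pipe them to sh as root.
# isolation_rules -A 192.0.2.1 192.0.2.3 192.0.2.4
# isolation_rules -D 192.0.2.1 192.0.2.3 192.0.2.4
```

Printing the rules first makes it easy to check them, and calling the function again with -D removes the same rules once the test is over.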

 

 

 

 nodelist {

   node {

       ring0_addr: xxx

       ring1_addr: xxx

       name: jangcluster-srv-1

       nodeid: 1

   }

   node {

       ring0_addr: xxx

       ring1_addr: xxx

       name: jangcluster-srv-2

       nodeid: 2

   }

 }

 

 

Every 2.0s: crm status                                                       jangcluster-srv-1: Fri Oct 22 12:21:09 2021

 

Cluster Summary:

  * Stack: corosync

  * Current DC: jangcluster-srv-2 (version 2.0.4-1.db2pcmk.el8-2deceaa3ae) - partition with quorum

  * Last updated: Fri Oct 22 12:21:10 2021

  * Last change:  Fri Oct 22 12:16:34 2021 by root via cibadmin on jangcluster-srv-1

  * 4 nodes configured

  * 3 resource instances configured

 

Node List:

  * Online: [ jangcluster-srv-1 jangcluster-srv-2 ]

  * RemoteOnline: [ jangcluster-srv-3 ]

  * RemoteOFFLINE: [ jangcluster-srv-4 ]

 

Full List of Resources:

  * GPFS-Fence  (stonith:fence_gpfs):    Started jangcluster-srv-1

  * jangcluster-srv-3   (ocf::pacemaker:remote):         Started jangcluster-srv-1

  * jangcluster-srv-4   (ocf::pacemaker:remote):         FAILED

 

Failed Resource Actions:

  * jangcluster-srv-4_monitor_30000 on jangcluster-srv-2 'error' (1): call=60, status='Timed Out', exitreason='', last-rc-change='2021-10-22 12:21:09 -04:00', queued=0ms, exec=0ms

 

location prefer-node-jangcluster-srv-3 jangcluster-srv-3 100: jangcluster-srv-1

location prefer-node-jangcluster-srv-4 jangcluster-srv-4 100: jangcluster-srv-2

location prefer-node-jangcluster-srv-3-2 jangcluster-srv-3 50: jangcluster-srv-2

location prefer-node-jangcluster-srv-4-2 jangcluster-srv-4 50: jangcluster-srv-1

 

 


 

Log

Oct 22 12:21:09.363 jangcluster-srv-2 pacemaker-controld  [776554] (monitor_timeout_cb)         info: Timed out waiting for remote poke response from jangcluster-srv-4

Oct 22 12:21:09.363 jangcluster-srv-2 pacemaker-controld  [776554] (process_lrm_event)  error: Result of monitor operation for jangcluster-srv-4 on jangcluster-srv-2: Timed Out | call=60 key=jangcluster-srv-4_monitor_30000 timeout=20000ms

Oct 22 12:21:09.363 jangcluster-srv-2 pacemaker-controld  [776554] (lrmd_api_disconnect)        info: Disconnecting TLS jangcluster-srv-4 executor connection

Oct 22 12:21:09.363 jangcluster-srv-2 pacemaker-controld  [776554] (lrmd_tls_connection_destroy)        info: TLS connection destroyed

Oct 22 12:21:09.363 jangcluster-srv-2 pacemaker-controld  [776554] (remote_lrm_op_callback)     error: Lost connection to Pacemaker Remote node jangcluster-srv-4

Oct 22 12:21:09.363 jangcluster-srv-2 pacemaker-controld  [776554] (lrmd_api_disconnect)        info: Disconnecting TLS jangcluster-srv-4 executor connection

Oct 22 12:21:09.363 jangcluster-srv-2 pacemaker-based     [776548] (cib_process_request)        info: Forwarding cib_modify operation for section status to all (origin=local/crmd/2451)

Oct 22 12:21:09.365 jangcluster-srv-2 pacemaker-based     [776548] (cib_perform_op)     info: Diff: --- 34.799.16 2

Oct 22 12:21:09.365 jangcluster-srv-2 pacemaker-based     [776548] (cib_perform_op)     info: Diff: +++ 34.799.17 (null)

Oct 22 12:21:09.365 jangcluster-srv-2 pacemaker-based     [776548] (cib_perform_op)     info: +  /cib:  @num_updates=17

Oct 22 12:21:09.365 jangcluster-srv-2 pacemaker-based     [776548] (cib_perform_op)     info: ++ /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='jangcluster-srv-4']:  <lrm_rsc_op id="jangcluster-srv-4_last_failure_0" operation_key="jangcluster-srv-4_monitor_30000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.3.0" transition-key="9:9:0:344b74b3-a99d-43dd-a133-02bc4aba954f" transition-magic="2:1;9:9:0:344b74b3-a99d-43dd-a133-02bc4aba954f" exit-reason="" _on_node_="jangcluster-srv-2" call-id="60" rc-code="1" op-status="2" interval="30000" last-rc-change="1634919669" exec-time="0" queue-time="0" op-digest="b6d907f5ad12b5bb2549788ab4cbc314"/>

Oct 22 12:21:09.366 jangcluster-srv-2 pacemaker-based     [776548] (cib_process_request)        info: Completed cib_modify operation for section status: OK (rc=0, origin=jangcluster-srv-2/crmd/2451, version=34.799.17)

Oct 22 12:21:09.366 jangcluster-srv-2 pacemaker-controld  [776554] (abort_transition_graph)     info: Transition 12 aborted by operation jangcluster-srv-4_monitor_30000 'create' on jangcluster-srv-2: Change in recurring result | magic=2:1;9:9:0:344b74b3-a99d-43dd-a133-02bc4aba954f cib=34.799.17 source=process_graph_event:406 complete=true

Oct 22 12:21:09.366 jangcluster-srv-2 pacemaker-controld  [776554] (update_failcount)   info: Updating failcount for jangcluster-srv-4 on jangcluster-srv-2 after failed monitor: rc=1 (update=value++, time=1634919669)

Oct 22 12:21:09.367 jangcluster-srv-2 pacemaker-controld  [776554] (process_graph_event)        notice: Transition 9 action 9 (jangcluster-srv-4_monitor_30000 on jangcluster-srv-2): expected 'ok' but got 'error' | target-rc=0 rc=1 call-id=60 event='arrived after initial scheduling'

Oct 22 12:21:09.367 jangcluster-srv-2 pacemaker-attrd     [776551] (attrd_client_update)        info: Expanded fail-count-jangcluster-srv-4#monitor_30000=value++ to 1

Oct 22 12:21:09.367 jangcluster-srv-2 pacemaker-controld  [776554] (do_state_transition)        notice: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph

Oct 22 12:21:09.367 jangcluster-srv-2 pacemaker-attrd     [776551] (attrd_peer_update)  notice: Setting fail-count-jangcluster-srv-4#monitor_30000[jangcluster-srv-2]: (unset) -> 1 | from jangcluster-srv-2

Oct 22 12:21:09.367 jangcluster-srv-2 pacemaker-attrd     [776551] (write_attribute)    info: Sent CIB request 39 with 1 change for fail-count-jangcluster-srv-4#monitor_30000 (id n/a, set n/a)

Oct 22 12:21:09.367 jangcluster-srv-2 pacemaker-attrd     [776551] (attrd_peer_update)  notice: Setting last-failure-jangcluster-srv-4#monitor_30000[jangcluster-srv-2]: (unset) -> 1634919669 | from jangcluster-srv-2

Oct 22 12:21:09.367 jangcluster-srv-2 pacemaker-attrd     [776551] (write_attribute)    info: Sent CIB request 40 with 1 change for last-failure-jangcluster-srv-4#monitor_30000 (id n/a, set n/a)

Oct 22 12:21:09.369 jangcluster-srv-2 pacemaker-based     [776548] (cib_process_request)        info: Forwarding cib_modify operation for section status to all (origin=local/attrd/39)

Oct 22 12:21:09.369 jangcluster-srv-2 pacemaker-based     [776548] (cib_process_request)        info: Forwarding cib_modify operation for section status to all (origin=local/attrd/40)

Oct 22 12:21:09.369 jangcluster-srv-2 pacemaker-based     [776548] (cib_perform_op)     info: Diff: --- 34.799.17 2

Oct 22 12:21:09.369 jangcluster-srv-2 pacemaker-based     [776548] (cib_perform_op)     info: Diff: +++ 34.799.18 (null)

Oct 22 12:21:09.369 jangcluster-srv-2 pacemaker-based     [776548] (cib_perform_op)     info: +  /cib:  @num_updates=18

Oct 22 12:21:09.369 jangcluster-srv-2 pacemaker-based     [776548] (cib_perform_op)     info: ++ /cib/status/node_state[@id='2']/transient_attributes[@id='2']/instance_attributes[@id='status-2']:  <nvpair id="status-2-fail-count-jangcluster-srv-4.monitor_30000" name="fail-count-jangcluster-srv-4#monitor_30000" value="1"/>

Oct 22 12:21:09.370 jangcluster-srv-2 pacemaker-based     [776548] (cib_process_request)        info: Completed cib_modify operation for section status: OK (rc=0, origin=jangcluster-srv-2/attrd/39, version=34.799.18)

Oct 22 12:21:09.370 jangcluster-srv-2 pacemaker-attrd     [776551] (attrd_cib_callback)         info: CIB update 39 result for fail-count-jangcluster-srv-4#monitor_30000: OK | rc=0

Oct 22 12:21:09.370 jangcluster-srv-2 pacemaker-attrd     [776551] (attrd_cib_callback)         info: * fail-count-jangcluster-srv-4#monitor_30000[jangcluster-srv-2]=1

Oct 22 12:21:09.370 jangcluster-srv-2 pacemaker-based     [776548] (cib_perform_op)     info: Diff: --- 34.799.18 2

Oct 22 12:21:09.371 jangcluster-srv-2 pacemaker-based     [776548] (cib_perform_op)     info: Diff: +++ 34.799.19 (null)

Oct 22 12:21:09.377 jangcluster-srv-2 pacemaker-schedulerd[776553] (unpack_rsc_op_failure)      warning: Unexpected result (error) was recorded for monitor of jangcluster-srv-4 on jangcluster-srv-2 at Oct 22 12:21:09 2021 | rc=1 id=jangcluster-srv-4_last_failure_0

Oct 22 12:21:09.377 jangcluster-srv-2 pacemaker-schedulerd[776553] (unpack_rsc_op_failure)      notice: jangcluster-srv-4 will not be started under current conditions

Oct 22 12:21:09.377 jangcluster-srv-2 pacemaker-schedulerd[776553] (pe_fence_node)      warning: Remote node jangcluster-srv-4 will be fenced: remote connection is unrecoverable

Oct 22 12:21:09.378 jangcluster-srv-2 pacemaker-schedulerd[776553] (log_list_item)      info: GPFS-Fence        (stonith:fence_gpfs):    Started jangcluster-srv-1

Oct 22 12:21:09.378 jangcluster-srv-2 pacemaker-schedulerd[776553] (log_list_item)      info: jangcluster-srv-3 (ocf::pacemaker:remote):         Started jangcluster-srv-1

Oct 22 12:21:09.378 jangcluster-srv-2 pacemaker-schedulerd[776553] (log_list_item)      info: jangcluster-srv-4 (ocf::pacemaker:remote):         FAILED jangcluster-srv-2

Oct 22 12:21:09.379 jangcluster-srv-2 pacemaker-schedulerd[776553] (pcmk__native_allocate)      info: Resource jangcluster-srv-4 cannot run anywhere

Oct 22 12:21:09.379 jangcluster-srv-2 pacemaker-schedulerd[776553] (native_choose_node)         info: Chose node jangcluster-srv-1 for GPFS-Fence from 2 nodes with score 100

Oct 22 12:21:09.379 jangcluster-srv-2 pacemaker-schedulerd[776553] (stage6)     warning: Scheduling Node jangcluster-srv-4 for STONITH

Oct 22 12:21:09.379 jangcluster-srv-2 pacemaker-schedulerd[776553] (LogNodeActions)     notice:  * Fence (off) jangcluster-srv-4 'remote connection is unrecoverable'

Oct 22 12:21:09.379 jangcluster-srv-2 pacemaker-schedulerd[776553] (LogActions)         info: Leave   GPFS-Fence        (Started jangcluster-srv-1)

Oct 22 12:21:09.379 jangcluster-srv-2 pacemaker-schedulerd[776553] (LogActions)         info: Leave   jangcluster-srv-3 (Started jangcluster-srv-1)

Oct 22 12:21:09.379 jangcluster-srv-2 pacemaker-schedulerd[776553] (LogAction)  notice:  * Stop       jangcluster-srv-4     ( jangcluster-srv-2 )   due to node availability

Oct 22 12:21:09.379 jangcluster-srv-2 pacemaker-schedulerd[776553] (pcmk__log_transition_summary)       warning: Calculated transition 13 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-801.bz2

Oct 22 12:21:09.388 jangcluster-srv-2 pacemaker-controld  [776554] (handle_response)    info: pe_calc calculation pe_calc-dc-1634919669-90 is obsolete

Oct 22 12:21:09.388 jangcluster-srv-2 pacemaker-schedulerd[776553] (determine_online_status_fencing)    info: Node jangcluster-srv-2 is active

Oct 22 12:21:09.388 jangcluster-srv-2 pacemaker-schedulerd[776553] (determine_online_status)    info: Node jangcluster-srv-2 is online

Oct 22 12:21:09.388 jangcluster-srv-2 pacemaker-schedulerd[776553] (determine_online_status_fencing)    info: Node jangcluster-srv-1 is active

Oct 22 12:21:09.388 jangcluster-srv-2 pacemaker-schedulerd[776553] (determine_online_status)    info: Node jangcluster-srv-1 is online

Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553] (pe_get_failcount)   info: jangcluster-srv-4 has failed 1 times on jangcluster-srv-2

Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553] (pe_get_failcount)   info: jangcluster-srv-4 has failed 1 times on jangcluster-srv-2

Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553] (pe_get_failcount)   info: jangcluster-srv-4 has failed 1 times on jangcluster-srv-2

Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553] (pe_get_failcount)   info: jangcluster-srv-4 has failed 1 times on jangcluster-srv-2

Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553] (pe_get_failcount)   info: jangcluster-srv-4 has failed 1 times on jangcluster-srv-2

Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553] (unpack_rsc_op_failure)      warning: Unexpected result (error) was recorded for monitor of jangcluster-srv-4 on jangcluster-srv-2 at Oct 22 12:21:09 2021 | rc=1 id=jangcluster-srv-4_last_failure_0

Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553] (unpack_rsc_op_failure)      notice: jangcluster-srv-4 will not be started under current conditions

Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553] (pe_fence_node)      warning: Remote node jangcluster-srv-4 will be fenced: remote connection is unrecoverable
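
To pick the fencing decision out of a longer log, a filter along these lines may help. It is a rough sketch: fencing_decisions is just a name I made up, it matches only the scheduler function names visible in the excerpt above, and the log path in the usage comment is an assumption that depends on the installation.

```shell
#!/bin/sh
# Sketch: show only the scheduler's fencing decisions from a Pacemaker log.
# The function names matched here are the ones that appear in the excerpt
# above; other Pacemaker versions may log under different names.

fencing_decisions() {
    # Reads the files given as arguments, or standard input if there are none.
    grep -E '\((pe_fence_node|stage6|LogNodeActions)\)' "$@"
}

# Usage (path is an assumption, adjust for your system):
# fencing_decisions /var/log/pacemaker/pacemaker.log
```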

 

 

 

 



