On 19.11.2021 19:26, john tillman wrote:
...
>>>
>>> If pacemaker tries to stop resources due to out of quorum condition, you
>>> could set suitable failure-timeout; this will be equivalent to using
>>> "pcs resource refresh". Keep in mind that pacemaker only checks for
>>> failure-timeout expiration every cluster-recheck-interval (15 minutes by
>>> default). This still is not directly related to network availability,
>>> but if network outage resulted in node going out of quorum, when network is
>>> back and node joined cluster again it will allow resources to be started
>>> on node.
>>>
>>
>> When quorum is lost I want all the resources to stop. The cluster is
>> performing this step correctly for me.
>>
>> That cluster-recheck-interval would explain the intermittence I saw this
>> morning. If I set that to 1 minute would that cause any gross negative
>> issues?
>>
>
>
> I tried setting cluster-recheck-interval to 1 minute and I saw no change
> to the resources after reconnecting the network. They were still listed
> as However, "pcs resource refresh" started it, as usual in this scenario.
>
> Anyone have any other ideas for a configuration setting that will
> effectively do whatever 'pcs resource refresh' is doing when quorum is
> restored?
>

I already told you above, and it most certainly works here.

Without failure-timeout the resource is stuck in the blocked state:

Cluster Summary:
  * Stack: corosync
  * Current DC: ha1 (version 2.1.0+20210816.c6a4f6e6c-1.1-2.1.0+20210816.c6a4f6e6c) - partition with quorum
  * Last updated: Sat Nov 20 10:48:48 2021
  * Last change: Sat Nov 20 10:46:55 2021 by root via cibadmin on ha1
  * 3 nodes configured
  * 3 resource instances configured (1 BLOCKED from further action due to failure)

Node List:
  * Online: [ ha1 ha2 qnetd ]

Full List of Resources:
  * Clone Set: cln_Test [rsc_Test]:
    * rsc_Test (ocf::_local:Dummy): FAILED ha1 (blocked)
    * Started: [ ha2 ]
    * Stopped: [ qnetd ]

Operations:
  * Node: ha2:
    * rsc_Test: migration-threshold=1000000:
      * (10) start
      * (11) monitor: interval="10000ms"
  * Node: ha1:
    * rsc_Test: migration-threshold=1000000 fail-count=1000000 last-failure='Sat Nov 20 10:47:14 2021':
      * (18) start
      * (30) stop

Failed Resource Actions:
  * rsc_Test_stop_0 on ha1 'error' (1): call=30, status='complete', exitreason='forced to fail stop operation', last-rc-change='2021-11-20 10:47:14 +03:00', queued=0ms, exec=27ms

With failure-timeout the resource is restarted after the timeout expires:

Cluster Summary:
  * Stack: corosync
  * Current DC: ha1 (version 2.1.0+20210816.c6a4f6e6c-1.1-2.1.0+20210816.c6a4f6e6c) - partition with quorum
  * Last updated: Sat Nov 20 10:53:51 2021
  * Last change: Sat Nov 20 10:50:37 2021 by root via cibadmin on ha2
  * 3 nodes configured
  * 3 resource instances configured

Node List:
  * Online: [ ha1 ha2 qnetd ]

Full List of Resources:
  * Clone Set: cln_Test [rsc_Test]:
    * Started: [ ha1 ha2 ]
    * Stopped: [ qnetd ]

Operations:
  * Node: ha2:
    * rsc_Test: migration-threshold=1000000:
      * (18) probe
      * (18) probe
      * (19) monitor: interval="10000ms"
  * Node: ha1:
    * rsc_Test: migration-threshold=1000000:
      * (40) probe
      * (40) probe
      * (41) monitor: interval="10000ms"

Configuration:

node 1: ha1 \
        attributes pingd=1 \
        utilization cpu=20
node 2: ha2 \
        attributes pingd=1 \
        utilization cpu=20
node 3: qnetd
primitive rsc_Test ocf:_local:Dummy \
        meta failure-timeout=30s \
        op monitor interval=10s
clone cln_Test rsc_Test
location not_on_qnetd cln_Test -inf: qnetd
property cib-bootstrap-options: \
        cluster-infrastructure=corosync \
        cluster-name=ha \
        dc-version="2.1.0+20210816.c6a4f6e6c-1.1-2.1.0+20210816.c6a4f6e6c" \
        last-lrm-refresh=1637394576 \
        stonith-enabled=false \
        have-watchdog=true \
        stonith-watchdog-timeout=0 \
        placement-strategy=balanced
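
The configuration above is in crmsh syntax. For a cluster managed with pcs, a rough equivalent of the failure-timeout meta attribute, plus the shorter cluster-recheck-interval discussed earlier in the thread, might look like the sketch below. It rests on several assumptions: rsc_Test is the resource name from the test cluster above, 30s and 1min are illustrative values, and the run helper and the RSC/APPLY variables are hypothetical names used only for this example. By default it only prints the commands; set APPLY=1 to actually run them on a cluster node.

```shell
#!/bin/sh
# Sketch only: prints the pcs commands unless APPLY=1 is set.

# Resource name taken from the test cluster above; substitute your own.
RSC="${RSC:-rsc_Test}"

# Hypothetical helper: echo the command by default, execute it when APPLY=1.
run() {
    if [ "${APPLY:-0}" = 1 ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Let the recorded failure expire after 30s so the resource can be retried.
run pcs resource meta "$RSC" failure-timeout=30s

# Re-evaluate time-based conditions more often than the 15 minute default.
run pcs property set cluster-recheck-interval=1min
```

Setting failure-timeout on the primitive, as in the crmsh configuration above, means clone instances inherit it; the values should be tuned to how quickly a retry is acceptable in a given cluster.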