[ClusterLabs] Antw: Re: Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8
>>> Howard wrote on 18.06.2020 at 19:16:
> Thanks for the replies! I will look at the failure-timeout resource
> attribute and at adjusting the timeout from 20 to 30 seconds. It is funny
> that the 100 tries message is symbolic.
>
> It turns out that the VMware host was down temporarily at the time of the
> alerts. I don't know when it came back up, but pcs had already given up
> trying to reestablish the connection.

Out of curiosity: does that mean your cluster node VM did not run on the VM
host the cluster thought it would? Or was the cluster VM dead as well?

> On Thu, Jun 18, 2020 at 8:25 AM Ken Gaillot wrote:
>
>> Note that a failed start of a stonith device will not prevent the
>> cluster from using that device for fencing. It just prevents the
>> cluster from monitoring the device.
>>
>> On Thu, 2020-06-18 at 08:20 +, Strahil Nikolov wrote:
>> > What about a second fencing mechanism?
>> > You can add a shared (independent) vmdk as an sbd device. The
>> > reconfiguration will require cluster downtime, but this is only
>> > necessary once.
>> > Once 2 fencing mechanisms are available, you can configure the order
>> > easily.
>> > Best Regards,
>> > Strahil Nikolov
>> >
>> > On Thursday, 18 June 2020, 10:29:22 GMT+3, Ulrich Windl <
>> > ulrich.wi...@rz.uni-regensburg.de> wrote:
>> >
>> > Hi!
>> >
>> > I can't give much detailed advice, but I think any network service
>> > should have a timeout of at least 30 seconds (you have
>> > timeout=2ms).
>> >
>> > And "after 100 failures" is symbolic, not literal: it means it
>> > failed too often, so I won't retry.
>> >
>> > Regards,
>> > Ulrich
>> >
>> > >>> Howard wrote on 17.06.2020 at 21:05 in message
>> > <2817_1592420740_5EEA6983_2817_3_1_CAO51vj6oXjfvhGQz7oOu=Pi+D_cKh5M1gfDL_2tAbKmwmq...@mail.gmail.com>:
>> > > Hello, recently I received some really great advice from this
>> > > community regarding changing the token timeout value in corosync.
>> > > Thank you! Since then the cluster has been working perfectly with
>> > > no errors in the log for more than a week.
>> > >
>> > > This morning I logged in to find a stopped stonith device. If I'm
>> > > reading the log right, it looks like it failed 1 million times in
>> > > ~20 seconds, then gave up. If you wouldn't mind looking at the logs
>> > > below, is there some way that I can make this more robust so that
>> > > it can recover? I'll be investigating the reason for the timeout
>> > > but would like to help the system recover on its own.
>> > >
>> > > Servers: RHEL 8.2
>> > >
>> > > Cluster name: cluster_pgperf2
>> > > Stack: corosync
>> > > Current DC: srv1 (version 2.0.2-3.el8_1.2-744a30d655) - partition with quorum
>> > > Last updated: Wed Jun 17 11:47:42 2020
>> > > Last change: Tue Jun 16 22:00:29 2020 by root via crm_attribute on srv1
>> > >
>> > > 2 nodes configured
>> > > 4 resources configured
>> > >
>> > > Online: [ srv1 srv2 ]
>> > >
>> > > Full list of resources:
>> > >
>> > > Clone Set: pgsqld-clone [pgsqld] (promotable)
>> > >     Masters: [ srv1 ]
>> > >     Slaves: [ srv2 ]
>> > > pgsql-master-ip (ocf::heartbeat:IPaddr2): Started srv1
>> > > vmfence (stonith:fence_vmware_soap): Stopped
>> > >
>> > > Failed Resource Actions:
>> > > * vmfence_start_0 on srv2 'OCF_TIMEOUT' (198): call=19, status=Timed Out,
>> > >   exitreason='',
>> > >   last-rc-change='Wed Jun 17 08:34:16 2020', queued=7ms, exec=20184ms
>> > > * vmfence_start_0 on srv1 'OCF_TIMEOUT' (198): call=44, status=Timed Out,
>> > >   exitreason='',
>> > >   last-rc-change='Wed Jun 17 08:33:55 2020', queued=0ms, exec=20008ms
>> > >
>> > > Daemon Status:
>> > >   corosync: active/disabled
>> > >   pacemaker: active/disabled
>> > >   pcsd: active/enabled
>> > >
>> > > pcs resource config
>> > > Clone: pgsqld-clone
>> > >   Meta Attrs: notify=true promotable=true
>> > >   Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
>> > >     Attributes: bindir=/usr/bin pgdata=/var/lib/pgsql/data
>> > >     Operations: demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
>> > >                 methods interval=0s timeout=5 (pgsqld-methods-interval-0s)
>> > >                 monitor interval=15s role=Master timeout=60s (pgsqld-monitor-interval-15s)
>> > >                 monitor interval=16s role=Slave timeout=60s (pgsqld-monitor-interval-16s)
>> > >                 notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
>> > >                 promote interval=0s timeout=30s (pgsqld-promote-interval-0s)
>> > >                 reload interval=0s timeout=20 (pgsqld-reload-interval-0s)
>> > >                 start interval=0s timeout=60s (pgsqld-start-interval-0s)
>> > >                 stop interval=0s
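Ulrich's suggestion to raise the timeout from 20 to 30 seconds could be applied with something like the following. This is a hedged sketch, not a command sequence from the thread: it assumes pcs 0.10 syntax as shipped with RHEL 8, the resource name `vmfence` from the status output above, and fence_vmware_soap parameter names taken from the agent's metadata.

```shell
# The start attempts above died at exec=~20000ms, i.e. right at the 20s
# default; give the agent more headroom on start and monitor.
pcs stonith update vmfence op start timeout=30s op monitor timeout=30s

# The SOAP login to vCenter can itself be slow; the agent has its own
# timeout parameters (names per fence_vmware_soap metadata - verify with
# "pcs stonith describe fence_vmware_soap" before setting them).
pcs stonith update vmfence shell_timeout=30 login_timeout=10
```

Note that raising an operation timeout only helps if the device eventually answers; it does nothing for the case in this thread where the VMware host itself was down.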
[ClusterLabs] Antw: Re: Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8
>>> Andrei Borzenkov wrote on 18.06.2020 at 20:33:
> 18.06.2020 20:16, Howard wrote:
>> Thanks for the replies! I will look at the failure-timeout resource
>> attribute and at adjusting the timeout from 20 to 30 seconds. It is funny
>> that the 100 tries message is symbolic.
>>
> It is not symbolic, it is INFINITY. From the pacemaker documentation:

That's why it is symbolic: the resource obviously did NOT fail 100 times.

> If the cluster property start-failure-is-fatal is set to true (which is
> the default), start failures cause the failcount to be set to INFINITY
> and thus always cause the resource to move immediately.

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
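The failcount behaviour being discussed can be inspected and changed from the command line. A minimal sketch, assuming the resource name `vmfence` from the thread (exact output format varies by pcs version):

```shell
# Show the per-node fail count for the fence device; after a start
# failure with the default start-failure-is-fatal=true it reads INFINITY.
pcs resource failcount show vmfence

# Optionally have start failures count like ordinary failures (against
# migration-threshold) instead of being immediately fatal. This is a
# cluster-wide property, not a per-resource setting.
pcs property set start-failure-is-fatal=false

# Clear recorded failures and let the cluster retry the start.
pcs resource cleanup vmfence
```

With `start-failure-is-fatal=false`, a flaky fence agent gets `migration-threshold` attempts per node instead of being banned on the first failed start.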
[ClusterLabs] Antw: Re: Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8
>>> Ken Gaillot wrote on 18.06.2020 at 21:29 in message
<9b2cb2273f5e6d54e66e8b432f24c7df73addaa2.ca...@redhat.com>:
> On Thu, 2020-06-18 at 21:32 +0300, Andrei Borzenkov wrote:
>> 18.06.2020 18:24, Ken Gaillot wrote:
>>> Note that a failed start of a stonith device will not prevent the
>>> cluster from using that device for fencing. It just prevents the
>>> cluster from monitoring the device.
>>>
>> My understanding is that if a stonith resource cannot run anywhere, it
>> also won't be used for stonith. When the failcount exceeds the
>> threshold, the resource is banned from the node. If that happens on all
>> nodes, the resource cannot run anywhere and so won't be used for
>> stonith. A start failure automatically sets the failcount to INFINITY.
>>
>> Or do I misunderstand something?
>
> I had to test to confirm, but a stonith resource stopped due to
> failures can indeed be used. Only stonith resources stopped via
> location constraints (bans) or target-role=Stopped are prevented from
> being used.

Yes, that's what I knew: stonith can be used on a node where the stonith
"resource" isn't running. Before, I had wondered why the stonith resource
isn't cloned for each node...

> --
> Ken Gaillot
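Ken's observation can be verified directly on a running cluster: even while the vmfence resource shows as Stopped, the fencer may still have the device registered. A hedged sketch (`stonith_admin` flags as in pacemaker 2.x; node names `srv1`/`srv2` from the thread; the last command really reboots a node, so only run it in a test window):

```shell
# List fence devices registered with the fencer on this node.
stonith_admin --list-registered

# Ask which devices the fencer would use to fence srv2.
stonith_admin --list srv2

# End-to-end test: have the cluster actually fence srv2.
pcs stonith fence srv2
```

If `--list srv2` still reports the vmfence device while the resource is Stopped due to failures, that matches Ken's test result above.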
[ClusterLabs] Antw: Re: Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8
>>> Howard wrote on 19.06.2020 at 00:13:
> Thanks for all the help so far. With your assistance, I'm very close to
> stable.
>
> I made the following changes to the vmfence stonith resource:
>
>   Meta Attrs: failure-timeout=30m migration-threshold=10
>   Operations: monitor interval=60s (vmfence-monitor-interval-60s)
>
> If I understand this correctly, it will check whether the fencing device
> is online every 60 seconds. It will try 10 times and then mark the node
> ineligible. After 30 minutes it will start trying again.

Did you add "meta failure-timeout=30m" to the stonith resource?

Maybe you could also set the stonith timeout to a higher value, the
threshold to a lower value (like 3), and the failure-timeout to a higher
value (like several hours or days). (The idea is that if you have, say,
one failure every second day, you don't want the resource to be disabled
after a week or two because the failure count has accumulated.)

Of course, while testing you may use lower values for the impatient ;-)

Regards,
Ulrich

> On Thu, Jun 18, 2020 at 12:29 PM Ken Gaillot wrote:
>
>> On Thu, 2020-06-18 at 21:32 +0300, Andrei Borzenkov wrote:
>>> 18.06.2020 18:24, Ken Gaillot wrote:
>>>> Note that a failed start of a stonith device will not prevent the
>>>> cluster from using that device for fencing. It just prevents the
>>>> cluster from monitoring the device.
>>>>
>>> My understanding is that if a stonith resource cannot run anywhere, it
>>> also won't be used for stonith. When the failcount exceeds the
>>> threshold, the resource is banned from the node. If that happens on
>>> all nodes, the resource cannot run anywhere and so won't be used for
>>> stonith. A start failure automatically sets the failcount to INFINITY.
>>>
>>> Or do I misunderstand something?
>>
>> I had to test to confirm, but a stonith resource stopped due to
>> failures can indeed be used. Only stonith resources stopped via
>> location constraints (bans) or target-role=Stopped are prevented from
>> being used.
>> --
>> Ken Gaillot
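Howard's meta-attribute change, adjusted along Ulrich's lines (lower threshold, longer failure expiry), would look roughly like this with pcs. A hedged sketch, assuming the resource name `vmfence` from the thread and pcs 0.10 syntax; the 3/6h values are Ulrich's suggestion and an illustrative expiry, not tested tuning:

```shell
# Probe the fence device every 60 seconds.
pcs resource update vmfence op monitor interval=60s

# Ban the device from a node only after 3 failures there, and forget
# old failures after 6 hours so occasional blips don't accumulate into
# a permanent ban.
pcs resource meta vmfence migration-threshold=3 failure-timeout=6h

# Verify the resulting configuration.
pcs stonith config vmfence
```

Note that `failure-timeout` only expires the failcount; it does not by itself retry a resource that was stopped by `start-failure-is-fatal=true`, which is why the two settings are usually tuned together.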