[ClusterLabs] Antw: Re: Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8

2020-06-18 Thread Ulrich Windl
>>> Howard  wrote on 18.06.2020 at 19:16 in message
:
> Thanks for the replies! I will look at the failure-timeout resource
> attribute and at adjusting the timeout from 20 to 30 seconds. It is funny
> that the 1000000 tries message is symbolic.
> 
> It turns out that the VMware host was down temporarily at the time of the
> alerts. I don't know when It came back up but pcs had already given up
> trying to reestablish the connection.

Out of curiosity: Does that mean your cluster node VM did not run on the VM
host the cluster thought it would? Or was the cluster VM dead as well?


> 
> On Thu, Jun 18, 2020 at 8:25 AM Ken Gaillot  wrote:
> 
>> Note that a failed start of a stonith device will not prevent the
>> cluster from using that device for fencing. It just prevents the
>> cluster from monitoring the device.
>>
>> On Thu, 2020-06-18 at 08:20 +, Strahil Nikolov wrote:
>> > What about second fencing mechanism ?
>> > You can add a shared (independent) vmdk as an sbd device. The
>> > reconfiguration will require cluster downtime, but this is only
>> > necessary once.
>> > Once 2 fencing mechanisms are available - you can configure the order
>> > easily.
>> > Best Regards,
>> > Strahil Nikolov
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Thursday, 18 June 2020 at 10:29:22 GMT+3, Ulrich Windl <
>> > ulrich.wi...@rz.uni-regensburg.de> wrote:
>> >
>> >
>> >
>> >
>> >
>> > Hi!
>> >
>> > I can't give much detailed advice, but I think any network service
>> > should have a timeout of at least 30 seconds (you have
>> > timeout=20s).
>> >
>> > And "after 100 failures" is symbolic, not literal: It means it
>> > failed too often, so I won't retry.
>> >
>> > Regards,
>> > Ulrich
>> >
>> > > > > Howard  wrote on 17.06.2020 at 21:05 in
>> > > > > message
>> >
>> > <2817_1592420740_5EEA6983_2817_3_1_CAO51vj6oXjfvhGQz7oOu=Pi+D_cKh5M1g
>> > fDL_2tAbKmw
>> > mq...@mail.gmail.com>:
>> > > Hello, recently I received some really great advice from this
>> > > community
>> > > regarding changing the token timeout value in corosync. Thank you!
>> > > Since
>> > > then the cluster has been working perfectly with no errors in the
>> > > log for
>> > > more than a week.
>> > >
>> > > This morning I logged in to find a stopped stonith device.  If I'm
>> > > reading
>> > > the log right, it looks like it failed 1 million times in ~20
>> > > seconds then
>> > > gave up. If you wouldn't mind looking at the logs below, is there
>> > > some way
>> > > that I can make this more robust so that it can recover?  I'll be
>> > > investigating the reason for the timeout but would like to help the
>> > > system
>> > > recover on its own.
>> > >
>> > > Servers: RHEL 8.2
>> > >
>> > > Cluster name: cluster_pgperf2
>> > > Stack: corosync
>> > > Current DC: srv1 (version 2.0.2-3.el8_1.2-744a30d655) - partition
>> > > with
>> > > quorum
>> > > Last updated: Wed Jun 17 11:47:42 2020
>> > > Last change: Tue Jun 16 22:00:29 2020 by root via crm_attribute on
>> > > srv1
>> > >
>> > > 2 nodes configured
>> > > 4 resources configured
>> > >
>> > > Online: [ srv1 srv2 ]
>> > >
>> > > Full list of resources:
>> > >
>> > >   Clone Set: pgsqld-clone [pgsqld] (promotable)
>> > >   Masters: [ srv1 ]
>> > >   Slaves: [ srv2 ]
>> > >   pgsql-master-ip(ocf::heartbeat:IPaddr2):  Started
>> > > srv1
>> > >   vmfence(stonith:fence_vmware_soap):Stopped
>> > >
>> > > Failed Resource Actions:
>> > > * vmfence_start_0 on srv2 'OCF_TIMEOUT' (198): call=19,
>> > > status=Timed Out,
>> > > exitreason='',
>> > > last-rc-change='Wed Jun 17 08:34:16 2020', queued=7ms,
>> > > exec=20184ms
>> > > * vmfence_start_0 on srv1 'OCF_TIMEOUT' (198): call=44,
>> > > status=Timed Out,
>> > > exitreason='',
>> > > last-rc-change='Wed Jun 17 08:33:55 2020', queued=0ms,
>> > > exec=20008ms
>> > >
>> > > Daemon Status:
>> > >   corosync: active/disabled
>> > >   pacemaker: active/disabled
>> > >   pcsd: active/enabled
>> > >
>> > >   pcs resource config
>> > >   Clone: pgsqld-clone
>> > >   Meta Attrs: notify=true promotable=true
>> > >   Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
>> > > Attributes: bindir=/usr/bin pgdata=/var/lib/pgsql/data
>> > > Operations: demote interval=0s timeout=120s (pgsqld-demote-
>> > > interval-0s)
>> > > methods interval=0s timeout=5 (pgsqld-methods-
>> > > interval-0s)
>> > > monitor interval=15s role=Master timeout=60s
>> > > (pgsqld-monitor-interval-15s)
>> > > monitor interval=16s role=Slave timeout=60s
>> > > (pgsqld-monitor-interval-16s)
>> > > notify interval=0s timeout=60s (pgsqld-notify-
>> > > interval-0s)
>> > > promote interval=0s timeout=30s (pgsqld-promote-
>> > > interval-0s)
>> > > reload interval=0s timeout=20 (pgsqld-reload-
>> > > interval-0s)
>> > > start interval=0s timeout=60s (pgsqld-start-
>> > > interval-0s)
>> > > stop interval=0s
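
A minimal sketch of the second fencing mechanism Strahil suggests above: an sbd
poison-pill device on a shared, independent vmdk plus a fencing topology. The device
path and the "sbdfence" resource name are hypothetical, and the pcs/sbd invocations
are assumed for RHEL 8, so verify them before use:

  # initialize the shared disk once, while the cluster is down (hypothetical path)
  sbd -d /dev/disk/by-id/scsi-SHARED_VMDK create
  # point sbd at the device on both nodes and enable the daemon
  echo 'SBD_DEVICE="/dev/disk/by-id/scsi-SHARED_VMDK"' >> /etc/sysconfig/sbd
  systemctl enable sbd
  # once the cluster is back up, register the device and order the fencing levels
  pcs stonith create sbdfence fence_sbd devices=/dev/disk/by-id/scsi-SHARED_VMDK
  pcs stonith level add 1 srv1 vmfence
  pcs stonith level add 2 srv1 sbdfence
  pcs stonith level add 1 srv2 vmfence
  pcs stonith level add 2 srv2 sbdfence

With two levels per node, vmfence is tried first and sbd only if vmfence fails.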

[ClusterLabs] Antw: Re: Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8

2020-06-18 Thread Ulrich Windl
>>> Andrei Borzenkov  wrote on 18.06.2020 at 20:33 in
message :
> 18.06.2020 20:16, Howard wrote:
>> Thanks for the replies! I will look at the failure-timeout resource
>> attribute and at adjusting the timeout from 20 to 30 seconds. It is funny
>> that the 1000000 tries message is symbolic.
>> 
> 
> It is not symbolic, it is INFINITY. From pacemaker documentation

That's why it is symbolic: The resource obviously did NOT fail 1000000 times.

> 
> If the cluster property start-failure-is-fatal is set to true (which is
> the default), start failures cause the failcount to be set to INFINITY
> and thus always cause the resource to move immediately.
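
As a reference for the INFINITY behaviour quoted above: the fail count can be
inspected and cleared by hand. A small sketch using the resource and node names from
this thread (standard pcs and crm_failcount invocations on RHEL 8):

  # show the accumulated failures for the fence device
  pcs resource failcount show vmfence
  # or query one node directly
  crm_failcount --query --resource vmfence --node srv2
  # clear the failures so the cluster schedules the resource again
  pcs resource cleanup vmfence
  # optionally, let start failures count against migration-threshold
  # instead of jumping straight to INFINITY
  pcs property set start-failure-is-fatal=false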


[ClusterLabs] Antw: Re: Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8

2020-06-18 Thread Ulrich Windl
>>> Ken Gaillot  wrote on 18.06.2020 at 21:29 in
message
<9b2cb2273f5e6d54e66e8b432f24c7df73addaa2.ca...@redhat.com>:
> On Thu, 2020-06-18 at 21:32 +0300, Andrei Borzenkov wrote:
>> 18.06.2020 18:24, Ken Gaillot wrote:
>> > Note that a failed start of a stonith device will not prevent the
>> > cluster from using that device for fencing. It just prevents the
>> > cluster from monitoring the device.
>> > 
>> 
>> My understanding is that if stonith resource cannot run anywhere, it
>> also won't be used for stonith. When failcount exceeds threshold,
>> resource is banned from node. If it happens on all nodes, resource
>> cannot run anywhere and so won't be used for stonith. Start failure
>> automatically sets failcount to INFINITY.
>> 
>> Or do I misunderstand something?
> 
> I had to test to confirm, but a stonith resource stopped due to
> failures can indeed be used. Only stonith resources stopped via
> location constraints (bans) or target-role=Stopped are prevented from
> being used.

Yes, that's what I knew: stonith can be used on a node where the stonith
"resource" isn't running. I had wondered before why the stonith resource isn't
cloned for each node...
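
For illustration, the two cases Ken describes that do take a fence device out of
use would be produced, in pcs terms, by something like the following (unlike a mere
start failure):

  # an explicit ban keeps vmfence off srv1 and out of use there
  pcs constraint location vmfence avoids srv1
  # target-role=Stopped takes vmfence out of use entirely
  pcs resource disable vmfence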

> -- 
> Ken Gaillot 
> 


[ClusterLabs] Antw: Re: Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8

2020-06-18 Thread Ulrich Windl
>>> Howard  wrote on 19.06.2020 at 00:13 in message
:
> Thanks for all the help so far.  With your assistance, I'm very close to
> stable.
> 
> Made the following changes to the vmfence stonith resource:
> 
> Meta Attrs: failure-timeout=30m migration-threshold=10
>   Operations: monitor interval=60s (vmfence-monitor-interval-60s)
> 
> If I understand this correctly, it will check if the fencing device is
> online every 60 seconds. It will try 10 times and then mark the node
> ineligible.  After 30 minutes it will start trying again.

Did you add "meta failure-timeout=30m" to the stonith resource?

Maybe you could also set the stonith timeout to a higher value, the threshold
to a lower value (like 3), and also the failure-timeout to a higher value (like
several hours or days).

(The idea is that if you have, say, one failure every other day, you don't want
the resource to be disabled after a week or two because the failure count has
accumulated.)

Of course while testing you may use lower values for the impatient ;-)

Regards,
Ulrich
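
A sketch of how that advice might be applied to the vmfence device from this thread,
using pcs syntax; the concrete numbers are only examples:

  # longer start/monitor timeouts, fewer retries before the node is marked
  # ineligible, and a long failure-timeout so old failures eventually expire
  pcs resource update vmfence \
      op start timeout=60s \
      op monitor interval=60s timeout=60s \
      meta migration-threshold=3 failure-timeout=24h

With migration-threshold=3 and a failure-timeout of a day, three timeouts in a row
still move the device away, but one failure every other day no longer accumulates
into a permanent ban.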

> 
> On Thu, Jun 18, 2020 at 12:29 PM Ken Gaillot  wrote:
> 
>> On Thu, 2020-06-18 at 21:32 +0300, Andrei Borzenkov wrote:
>> > 18.06.2020 18:24, Ken Gaillot wrote:
>> > > Note that a failed start of a stonith device will not prevent the
>> > > cluster from using that device for fencing. It just prevents the
>> > > cluster from monitoring the device.
>> > >
>> >
>> > My understanding is that if stonith resource cannot run anywhere, it
>> > also won't be used for stonith. When failcount exceeds threshold,
>> > resource is banned from node. If it happens on all nodes, resource
>> > cannot run anywhere and so won't be used for stonith. Start failure
>> > automatically sets failcount to INFINITY.
>> >
>> > Or do I misunderstand something?
>>
>> I had to test to confirm, but a stonith resource stopped due to
>> failures can indeed be used. Only stonith resources stopped via
>> location constraints (bans) or target-role=Stopped are prevented from
>> being used.
>> --
>> Ken Gaillot 
>>



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/