[ClusterLabs] Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8

2020-06-18 Thread Ulrich Windl
Hi!

I can't give much detailed advice, but I think any network service should have 
a timeout of at least 30 seconds (you have timeout=20000ms).

And "after 100 failures" is symbolic, not literal: It means it failed too 
often, so I won't retry.

Regards,
Ulrich
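
For reference, raising those timeouts with pcs might look roughly like the following (a sketch only - exact syntax varies between pcs versions, and the 60s values are illustrative rather than a recommendation from this thread):

    # give the stonith monitor and start operations a longer timeout
    pcs resource update vmfence op monitor interval=60s timeout=60s
    pcs resource update vmfence op start timeout=60s
    # alternatively, the stonith-specific monitor timeout can be set on the device itself
    pcs stonith update vmfence pcmk_monitor_timeout=60s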

>>> Howard  wrote on 17.06.2020 at 21:05 in message
<2817_1592420740_5EEA6983_2817_3_1_CAO51vj6oXjfvhGQz7oOu=Pi+D_cKh5M1gfDL_2tAbKmw
mq...@mail.gmail.com>:
> Hello, recently I received some really great advice from this community
> regarding changing the token timeout value in corosync. Thank you! Since
> then the cluster has been working perfectly with no errors in the log for
> more than a week.
> 
> This morning I logged in to find a stopped stonith device.  If I'm reading
> the log right, it looks like it failed 1 million times in ~20 seconds then
> gave up. If you wouldn't mind looking at the logs below, is there some way
> that I can make this more robust so that it can recover?  I'll be
> investigating the reason for the timeout but would like to help the system
> recover on its own.
> 
> Servers: RHEL 8.2
> 
> Cluster name: cluster_pgperf2
> Stack: corosync
> Current DC: srv1 (version 2.0.2-3.el8_1.2-744a30d655) - partition with
> quorum
> Last updated: Wed Jun 17 11:47:42 2020
> Last change: Tue Jun 16 22:00:29 2020 by root via crm_attribute on srv1
> 
> 2 nodes configured
> 4 resources configured
> 
> Online: [ srv1 srv2 ]
> 
> Full list of resources:
> 
>  Clone Set: pgsqld-clone [pgsqld] (promotable)
>  Masters: [ srv1 ]
>  Slaves: [ srv2 ]
>  pgsql-master-ip        (ocf::heartbeat:IPaddr2):       Started srv1
>  vmfence                (stonith:fence_vmware_soap):    Stopped
> 
> Failed Resource Actions:
> * vmfence_start_0 on srv2 'OCF_TIMEOUT' (198): call=19, status=Timed Out,
> exitreason='',
> last-rc-change='Wed Jun 17 08:34:16 2020', queued=7ms, exec=20184ms
> * vmfence_start_0 on srv1 'OCF_TIMEOUT' (198): call=44, status=Timed Out,
> exitreason='',
> last-rc-change='Wed Jun 17 08:33:55 2020', queued=0ms, exec=20008ms
> 
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled
> 
>  pcs resource config
>  Clone: pgsqld-clone
>   Meta Attrs: notify=true promotable=true
>   Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
>Attributes: bindir=/usr/bin pgdata=/var/lib/pgsql/data
>Operations: demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
>methods interval=0s timeout=5 (pgsqld-methods-interval-0s)
>monitor interval=15s role=Master timeout=60s
> (pgsqld-monitor-interval-15s)
>monitor interval=16s role=Slave timeout=60s
> (pgsqld-monitor-interval-16s)
>notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
>promote interval=0s timeout=30s (pgsqld-promote-interval-0s)
>reload interval=0s timeout=20 (pgsqld-reload-interval-0s)
>start interval=0s timeout=60s (pgsqld-start-interval-0s)
>stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
>monitor interval=60s timeout=60s
> (pgsqld-monitor-interval-60s)
>  Resource: pgsql-master-ip (class=ocf provider=heartbeat type=IPaddr2)
>   Attributes: cidr_netmask=24 ip=xxx.xxx.xxx.xxx
>   Operations: monitor interval=10s (pgsql-master-ip-monitor-interval-10s)
>   start interval=0s timeout=20s
> (pgsql-master-ip-start-interval-0s)
>   stop interval=0s timeout=20s
> (pgsql-master-ip-stop-interval-0s)
> 
> pcs stonith config
>  Resource: vmfence (class=stonith type=fence_vmware_soap)
>   Attributes: ipaddr=xxx.xxx.xxx.xxx login=\
> passwd_script= pcmk_host_map=srv1:x;srv2:y ssl=1
> ssl_insecure=1
>   Operations: monitor interval=60s (vmfence-monitor-interval-60s)
> 
> pcs resource failcount show
> Failcounts for resource 'vmfence'
>   srv1: INFINITY
>   srv2: INFINITY
> 
> Here are the versions installed:
> [postgres@srv1 cluster]$ rpm -qa|grep
> "pacemaker\|pcs\|corosync\|fence-agents-vmware-soap\|paf"
> corosync-3.0.2-3.el8_1.1.x86_64
> corosync-qdevice-3.0.0-2.el8.x86_64
> corosync-qnetd-3.0.0-2.el8.x86_64
> corosynclib-3.0.2-3.el8_1.1.x86_64
> fence-agents-vmware-soap-4.2.1-41.el8.noarch
> pacemaker-2.0.2-3.el8_1.2.x86_64
> pacemaker-cli-2.0.2-3.el8_1.2.x86_64
> pacemaker-cluster-libs-2.0.2-3.el8_1.2.x86_64
> pacemaker-libs-2.0.2-3.el8_1.2.x86_64
> pacemaker-schemas-2.0.2-3.el8_1.2.noarch
> pcs-0.10.2-4.el8.x86_64
> resource-agents-paf-2.3.0-1.noarch
> 
> Here are the errors and warnings from the pacemaker.log from the first
> warning until it gave up.
> 
> /var/log/pacemaker/pacemaker.log:Jun 17 08:33:55 srv1 pacemaker-fenced
>  [26722] (child_timeout_callback) warning:
> fence_vmware_soap_monitor_1 process (PID 43095) timed out
> /var/log/pacemaker/pacemaker.log:Jun 17 08:33:55 srv1 pacemaker-fenced
>  [26722] (operation_finished) warning:
> fence_vmware_soap_monitor_1:43095 - timed out

Re: [ClusterLabs] Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8

2020-06-18 Thread Strahil Nikolov
What about second fencing mechanism ?
You can add a shared (independent) vmdk as an sbd device. The reconfiguration 
will require cluster downtime, but this is only necessary once.
Once 2 fencing mechanisms are available - you can configure the order easily.
Best Regards,
Strahil Nikolov
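
A rough sketch of what Strahil describes, assuming a shared vmdk is already attached to both nodes; the device path and resource name below are placeholders, and the sbd service must also be configured on both nodes to watch the same disk (see the sbd documentation):

    # initialize the shared disk for sbd (run once; destroys existing sbd data on it)
    sbd -d /dev/disk/by-id/SHARED_VMDK create
    # create a poison-pill fencing device backed by that disk
    pcs stonith create sbdfence fence_sbd devices=/dev/disk/by-id/SHARED_VMDK
    # fencing order per node: try vmfence first, fall back to sbd
    pcs stonith level add 1 srv1 vmfence
    pcs stonith level add 2 srv1 sbdfence
    pcs stonith level add 1 srv2 vmfence
    pcs stonith level add 2 srv2 sbdfence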







Re: [ClusterLabs] Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8

2020-06-18 Thread Ken Gaillot
Note that a failed start of a stonith device will not prevent the
cluster from using that device for fencing. It just prevents the
cluster from monitoring the device.

> On Thu, 2020-06-18 at 08:20 +0000, Strahil Nikolov wrote:
> What about second fencing mechanism ?
> You can add a shared (independent) vmdk as an sbd device. The
> reconfiguration will require cluster downtime, but this is only
> necessary once.
> Once 2 fencing mechanisms are available - you can configure the order
> easily.
> Best Regards,
> Strahil Nikolov
Re: [ClusterLabs] Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8

2020-06-18 Thread Howard
Thanks for the replies! I will look at the failure-timeout resource
attribute and at adjusting the timeout from 20 to 30 seconds. It is funny
that the 1000000 tries message is symbolic.

It turns out that the VMware host was down temporarily at the time of the
alerts. I don't know when it came back up, but pcs had already given up
trying to reestablish the connection.
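
Once the underlying cause (here, the VMware host outage) is resolved, the stopped device can be told to retry by clearing its failcounts, e.g. (a sketch):

    # clear the recorded failures and reprobe, so pacemaker tries to start vmfence again
    pcs resource cleanup vmfence
    # verify the failcounts are gone
    pcs resource failcount show vmfence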

On Thu, Jun 18, 2020 at 8:25 AM Ken Gaillot  wrote:

> Note that a failed start of a stonith device will not prevent the
> cluster from using that device for fencing. It just prevents the
> cluster from monitoring the device.
>
> On Thu, 2020-06-18 at 08:20 +0000, Strahil Nikolov wrote:
> > What about second fencing mechanism ?
> > You can add a shared (independent) vmdk as an sbd device. The
> > reconfiguration will require cluster downtime, but this is only
> > necessary once.
> > Once 2 fencing mechanisms are available - you can configure the order
> > easily.
> > Best Regards,
> > Strahil Nikolov
Re: [ClusterLabs] Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8

2020-06-18 Thread Andrei Borzenkov
18.06.2020 18:24, Ken Gaillot wrote:
> Note that a failed start of a stonith device will not prevent the
> cluster from using that device for fencing. It just prevents the
> cluster from monitoring the device.
> 

My understanding is that if a stonith resource cannot run anywhere, it
also won't be used for stonith. When the failcount exceeds the threshold,
the resource is banned from the node. If that happens on all nodes, the
resource cannot run anywhere and so won't be used for stonith. A start
failure automatically sets the failcount to INFINITY.

Or do I misunderstand something?


Re: [ClusterLabs] Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8

2020-06-18 Thread Andrei Borzenkov
18.06.2020 20:16, Howard wrote:
> Thanks for the replies! I will look at the failure-timeout resource
> attribute and at adjusting the timeout from 20 to 30 seconds. It is funny
> that the 1000000 tries message is symbolic.
> 

It is not symbolic, it is INFINITY. From the Pacemaker documentation:

If the cluster property start-failure-is-fatal is set to true (which is
the default), start failures cause the failcount to be set to INFINITY
and thus always cause the resource to move immediately.
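
So with the default, a single failed start pins the failcount on that node at INFINITY until it is cleaned up or failure-timeout expires. If retrying starts after transient failures is preferred over moving away immediately, the property can be changed, for example (a sketch, not a recommendation from this thread):

    # let migration-threshold, rather than a single failed start, decide when to give up on a node
    pcs property set start-failure-is-fatal=false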


Re: [ClusterLabs] Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8

2020-06-18 Thread Ken Gaillot
On Thu, 2020-06-18 at 21:32 +0300, Andrei Borzenkov wrote:
> 18.06.2020 18:24, Ken Gaillot wrote:
> > Note that a failed start of a stonith device will not prevent the
> > cluster from using that device for fencing. It just prevents the
> > cluster from monitoring the device.
> > 
> 
> My understanding is that if stonith resource cannot run anywhere, it
> also won't be used for stonith. When failcount exceeds threshold,
> resource is banned from node. If it happens on all nodes, resource
> cannot run anywhere and so won't be used for stonith. Start failure
> automatically sets failcount to INFINITY.
> 
> Or do I misunderstand something?

I had to test to confirm, but a stonith resource stopped due to
failures can indeed be used. Only stonith resources stopped via
location constraints (bans) or target-role=Stopped are prevented from
being used.
-- 
Ken Gaillot 
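
To make the distinction above concrete, these are the kinds of stops that do take a stonith device out of consideration. A hypothetical sketch against this cluster's vmfence resource (newer pcs versions also offer pcs stonith disable):

    pcs resource disable vmfence      # sets target-role=Stopped
    pcs resource ban vmfence srv1     # adds a -INFINITY location constraint on srv1
    pcs resource clear vmfence srv1   # removes the ban again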



Re: [ClusterLabs] Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8

2020-06-18 Thread Strahil Nikolov
Nice to know.
Yet, if the monitoring of that fencing device failed - most probably the
vCenter was not responding/unreachable - that's why I offered sbd.

Best Regards,
Strahil Nikolov

On 18 June 2020 at 18:24:48 GMT+03:00, Ken Gaillot  wrote:
>Note that a failed start of a stonith device will not prevent the
>cluster from using that device for fencing. It just prevents the
>cluster from monitoring the device.

Re: [ClusterLabs] Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8

2020-06-18 Thread Howard
Thanks for all the help so far.  With your assistance, I'm very close to
stable.

Made the following changes to the vmfence stonith resource:

Meta Attrs: failure-timeout=30m migration-threshold=10
  Operations: monitor interval=60s (vmfence-monitor-interval-60s)

If I understand this correctly, it will check if the fencing device is
online every 60 seconds. It will try 10 times and then mark the node
ineligible.  After 30 minutes it will start trying again.
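
Assuming pcs was used, applying those meta attributes would look roughly like this (a sketch; the values are the ones quoted above):

    pcs resource meta vmfence failure-timeout=30m migration-threshold=10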

On Thu, Jun 18, 2020 at 12:29 PM Ken Gaillot  wrote:

> On Thu, 2020-06-18 at 21:32 +0300, Andrei Borzenkov wrote:
> > 18.06.2020 18:24, Ken Gaillot wrote:
> > > Note that a failed start of a stonith device will not prevent the
> > > cluster from using that device for fencing. It just prevents the
> > > cluster from monitoring the device.
> > >
> >
> > My understanding is that if stonith resource cannot run anywhere, it
> > also won't be used for stonith. When failcount exceeds threshold,
> > resource is banned from node. If it happens on all nodes, resource
> > cannot run anywhere and so won't be used for stonith. Start failure
> > automatically sets failcount to INFINITY.
> >
> > Or do I misunderstand something?
>
> I had to test to confirm, but a stonith resource stopped due to
> failures can indeed be used. Only stonith resources stopped via
> location constraints (bans) or target-role=Stopped are prevented from
> being used.
> --
> Ken Gaillot 


Re: [ClusterLabs] Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8

2020-06-19 Thread Klaus Wenninger
On 6/19/20 12:13 AM, Howard wrote:
> Thanks for all the help so far.  With your assistance, I'm very close
> to stable.
>
> Made the following changes to the vmfence stonith resource:
>   
> Meta Attrs: failure-timeout=30m migration-threshold=10
>   Operations: monitor interval=60s (vmfence-monitor-interval-60s)
>
> If I understand this correctly, it will check if the fencing device is
> online every 60 seconds. It will try 10 times and then mark the node
> ineligible.  After 30 minutes it will start trying again.
>
> On Thu, Jun 18, 2020 at 12:29 PM Ken Gaillot wrote:
>
> On Thu, 2020-06-18 at 21:32 +0300, Andrei Borzenkov wrote:
> > 18.06.2020 18:24, Ken Gaillot wrote:
> > > Note that a failed start of a stonith device will not prevent the
> > > cluster from using that device for fencing. It just prevents the
> > > cluster from monitoring the device.
> > >
> >
> > My understanding is that if stonith resource cannot run anywhere, it
> > also won't be used for stonith. When failcount exceeds threshold,
> > resource is banned from node. If it happens on all nodes, resource
> > cannot run anywhere and so won't be used for stonith. Start failure
> > automatically sets failcount to INFINITY.
> >
> > Or do I misunderstand something?
>
> I had to test to confirm, but a stonith resource stopped due to
> failures can indeed be used. Only stonith resources stopped via
> location constraints (bans) or target-role=Stopped are prevented from
> being used.
>
Unfortunately this could be a bit tricky to test as fenced updates
the device-list on configuration changes but scores as well influence
if a device is taken into that list.
So there is as well a possible dependency on when the device-list has been
updated most recently.
Don't know if it is relevant for this config but unfortunately something
to have in the back of one's mind in case of more complex fencing
setups.
An ugliness that is known for a long time but there is no easy way
to solve the issue without losing part of the independence and with
that the robustness of the fencing subsystem.

Klaus


Re: [ClusterLabs] Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8

2020-06-19 Thread Andrei Borzenkov
19.06.2020 13:23, Klaus Wenninger wrote:
> Unfortunately this could be a bit tricky to test as fenced updates
> the device-list on configuration changes but scores as well influence
> if a device is taken into that list.

Can you elaborate? I understand it as "if the score is -INFINITY, the device
is ignored" - is that correct? That would be consistent; explicit constraints
are just one possible way to set a location score.

> So there is as well a possible dependency on when the device-list has been
> updated most recently.

My understanding was that pacemaker recomputes scores on every
transition. That is the whole idea - any event triggers re-evaluation
of current resource placement. A node-loss event that results in stonith
recomputes scores as the very first thing.

Of course it is possible that after a node loss some other event happens that
would have made the resource available, but it is no longer taken into account
because pacemaker already decided no fencing resource was available to
perform stonith. Is that what you mean?

> Don't know if it is relevant for this config but unfortunately something
> to have in the back of one's mind in case of more complex fencing
> setups.
> An ugliness that is known for a long time but there is no easy way
> to solve the issue without losing part of the independence and with
> that robustness of the fencing subsystem.
> 


Re: [ClusterLabs] Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8

2020-06-19 Thread Andrei Borzenkov
19.06.2020 01:13, Howard wrote:
> Thanks for all the help so far.  With your assistance, I'm very close to
> stable.
> 
> Made the following changes to the vmfence stonith resource:
> 
> Meta Attrs: failure-timeout=30m migration-threshold=10
>   Operations: monitor interval=60s (vmfence-monitor-interval-60s)
> 
> If I understand this correctly, it will check if the fencing device is
> online every 60 seconds. It will try 10 times and then mark the node
> ineligible.

No. That's the main problem - stonith resource failure on a node does
not affect whether this node can be selected to perform stonith. Node
becomes ineligible for *monitoring* operation, that's all.

Resource could be marked as failed on all nodes and still fencing will
be attempted.

That is very counter-intuitive, OTOH this allows fencing to work even in
case of transient issues.

I wonder if pacemaker will cycle through available nodes though.
Consider a three-node cluster nodeA, nodeB, nodeC. nodeA is lost, nodeB is
selected but cannot perform stonith for whatever reason. Will
pacemaker retry on nodeC? Under which conditions (number of retries on
nodeB, whatever)? If nodeC fails too, will pacemaker restart the cycle from
the beginning?

Also does stonith resource failure on a node affect selecting this node
to perform stonith? Is there any sort of priority list? If yes, how is
it ordered?

>  After 30 minutes it will start trying again.
>

... resume monitoring. Nothing more.
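
One way to see what the fencer itself can actually use, independent of what pcs status shows for the resource, is to ask it directly (a sketch; stonith_admin ships with the pacemaker CLI tools):

    stonith_admin --list-registered    # devices currently registered with the fencer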


Re: [ClusterLabs] Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8

2020-06-22 Thread Ken Gaillot
On Sat, 2020-06-20 at 08:47 +0300, Andrei Borzenkov wrote:
> I wonder if pacemaker will cycle through available nodes though.
> Consider three node cluster nodeA, nodeB, nodeC. nodeA is lost, nodeB
> is
> selected to but cannot perform stonith for whatever reasons. Will
> pacemaker retry on nodeC? Under which conditions (number of retries
> on
> nodeB, whatever)? If nodeC fails too, will pacemaker restart cycle
> from
> the beginning?

When selecting a node to execute fencing, pacemaker prefers (1) a node
that runs a recurring monitor on the device; (2) any other node besides
the target; or (3) the target, if no other node is available.

By default pacemaker will attempt to execute a fencing action twice.
This is customizable via the pcmk_reboot_retries / pcmk_off_retries /
etc. stonith device meta-attributes. IIRC, the second attempt will be
tried on a different node if one is available.

However each attempt eats into the overall timeout. If the first
attempt hangs and uses up all the timeout, then no further attempts
will be made.

A fencing topology can be configured if multiple devices can be used to
fence a node, to specify which should be attempted first.

If all devices/attempts fail, pacemaker marks the fencing as failed.
From there it depends on how the fencing was initiated. If pacemaker
itself initiated it (vs. external software like DLM, or a sysadmin
running stonith_admin), the controller will resubmit the fencing
operation up to 10 times by default (the stonith-max-attempts cluster
property) then give up. However the controller will reset the counter
to zero and try again at the next transition if the node still needs to
be fenced.
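
For reference, the tunables mentioned above map to configuration roughly like this (a sketch with illustrative values; some_other_device is a hypothetical second fence device):

    # per-device retry count for a reboot action (default 2)
    pcs stonith update vmfence pcmk_reboot_retries=3
    # how many times the controller resubmits a failed fencing operation (default 10)
    pcs property set stonith-max-attempts=10
    # fencing topology: try vmfence first, then a second device, per target node
    pcs stonith level add 1 srv1 vmfence
    pcs stonith level add 2 srv1 some_other_device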

> Also does stonith resource failure on a node affect selecting this
> node
> to perform stonith? Is there any sort of priority list? If yes, how
> is
> it ordered?

Currently, device monitor failure does not affect the selection of a
node to execute the device, but that is planned.

The priority list for selecting a node to execute a device is described
above. For selecting between multiple fence devices when there is no
topology, there is a priority meta-attribute for stonith devices, but
it is not currently implemented (another to-do item).

> 
> >  After 30 minutes it will start trying again.
> > 
> 
> ... resume monitoring. Nothing more.
-- 
Ken Gaillot 
