Hi! I can't give much detailed advice, but I think any network service should have a timeout of at least 30 seconds (you have timeout=20000ms).
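Something like this should raise the fencing operation timeouts above the 20s default (untested sketch; pcs 0.10 syntax as shipped with RHEL 8, and 60s is just an illustrative value):

```shell
# Raise the monitor/start timeouts on the fencing resource:
pcs resource update vmfence op monitor interval=60s timeout=60s \
    op start timeout=60s

# Alternatively, give the fence agent itself more time per attempt
# via the pcmk_monitor_timeout stonith attribute:
pcs stonith update vmfence pcmk_monitor_timeout=60s
```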
And "after 1000000 failures" is symbolic, not literal: it means the resource failed too often, so Pacemaker won't retry.

Regards,
Ulrich

>>> Howard <hmon...@gmail.com> wrote on 17.06.2020 at 21:05 in message <2817_1592420740_5EEA6983_2817_3_1_CAO51vj6oXjfvhGQz7oOu=Pi+D_cKh5M1gfDL_2tAbKmw mq...@mail.gmail.com>:
> Hello, recently I received some really great advice from this community
> regarding changing the token timeout value in corosync. Thank you! Since
> then the cluster has been working perfectly with no errors in the log for
> more than a week.
>
> This morning I logged in to find a stopped stonith device. If I'm reading
> the log right, it looks like it failed 1 million times in ~20 seconds then
> gave up. If you wouldn't mind looking at the logs below, is there some way
> that I can make this more robust so that it can recover? I'll be
> investigating the reason for the timeout but would like to help the system
> recover on its own.
>
> Servers: RHEL 8.2
>
> Cluster name: cluster_pgperf2
> Stack: corosync
> Current DC: srv1 (version 2.0.2-3.el8_1.2-744a30d655) - partition with quorum
> Last updated: Wed Jun 17 11:47:42 2020
> Last change: Tue Jun 16 22:00:29 2020 by root via crm_attribute on srv1
>
> 2 nodes configured
> 4 resources configured
>
> Online: [ srv1 srv2 ]
>
> Full list of resources:
>
>  Clone Set: pgsqld-clone [pgsqld] (promotable)
>      Masters: [ srv1 ]
>      Slaves: [ srv2 ]
>  pgsql-master-ip  (ocf::heartbeat:IPaddr2):  Started srv1
>  vmfence  (stonith:fence_vmware_soap):  Stopped
>
> Failed Resource Actions:
> * vmfence_start_0 on srv2 'OCF_TIMEOUT' (198): call=19, status=Timed Out,
>   exitreason='',
>   last-rc-change='Wed Jun 17 08:34:16 2020', queued=7ms, exec=20184ms
> * vmfence_start_0 on srv1 'OCF_TIMEOUT' (198): call=44, status=Timed Out,
>   exitreason='',
>   last-rc-change='Wed Jun 17 08:33:55 2020', queued=0ms, exec=20008ms
>
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled
>
> pcs resource config
> Clone: pgsqld-clone
>  Meta Attrs: notify=true promotable=true
>  Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
>   Attributes: bindir=/usr/bin pgdata=/var/lib/pgsql/data
>   Operations: demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
>               methods interval=0s timeout=5 (pgsqld-methods-interval-0s)
>               monitor interval=15s role=Master timeout=60s (pgsqld-monitor-interval-15s)
>               monitor interval=16s role=Slave timeout=60s (pgsqld-monitor-interval-16s)
>               notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
>               promote interval=0s timeout=30s (pgsqld-promote-interval-0s)
>               reload interval=0s timeout=20 (pgsqld-reload-interval-0s)
>               start interval=0s timeout=60s (pgsqld-start-interval-0s)
>               stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
>               monitor interval=60s timeout=60s (pgsqld-monitor-interval-60s)
> Resource: pgsql-master-ip (class=ocf provider=heartbeat type=IPaddr2)
>  Attributes: cidr_netmask=24 ip=xxx.xxx.xxx.xxx
>  Operations: monitor interval=10s (pgsql-master-ip-monitor-interval-10s)
>              start interval=0s timeout=20s (pgsql-master-ip-start-interval-0s)
>              stop interval=0s timeout=20s (pgsql-master-ip-stop-interval-0s)
>
> pcs stonith config
> Resource: vmfence (class=stonith type=fence_vmware_soap)
>  Attributes: ipaddr=xxx.xxx.xxx.xxx login=xxxx\xxxxxxxx passwd_script=xxxxxxxx pcmk_host_map=srv1:xxxxxxxxx;srv2:yyyyyyyyy ssl=1 ssl_insecure=1
>  Operations: monitor interval=60s (vmfence-monitor-interval-60s)
>
> pcs resource failcount show
> Failcounts for resource 'vmfence'
>   srv1: INFINITY
>   srv2: INFINITY
>
> Here are the versions installed:
> [postgres@srv1 cluster]$ rpm -qa|grep "pacemaker\|pcs\|corosync\|fence-agents-vmware-soap\|paf"
> corosync-3.0.2-3.el8_1.1.x86_64
> corosync-qdevice-3.0.0-2.el8.x86_64
> corosync-qnetd-3.0.0-2.el8.x86_64
> corosynclib-3.0.2-3.el8_1.1.x86_64
> fence-agents-vmware-soap-4.2.1-41.el8.noarch
> pacemaker-2.0.2-3.el8_1.2.x86_64
> pacemaker-cli-2.0.2-3.el8_1.2.x86_64
> pacemaker-cluster-libs-2.0.2-3.el8_1.2.x86_64
> pacemaker-libs-2.0.2-3.el8_1.2.x86_64
> pacemaker-schemas-2.0.2-3.el8_1.2.noarch
> pcs-0.10.2-4.el8.x86_64
> resource-agents-paf-2.3.0-1.noarch
>
> Here are the errors and warnings from the pacemaker.log from the first
> warning until it gave up.
>
> /var/log/pacemaker/pacemaker.log:Jun 17 08:33:55 srv1 pacemaker-fenced [26722] (child_timeout_callback) warning: fence_vmware_soap_monitor_1 process (PID 43095) timed out
> /var/log/pacemaker/pacemaker.log:Jun 17 08:33:55 srv1 pacemaker-fenced [26722] (operation_finished) warning: fence_vmware_soap_monitor_1:43095 - timed out after 20000ms
> /var/log/pacemaker/pacemaker.log:Jun 17 08:33:55 srv1 pacemaker-controld [26726] (process_lrm_event) error: Result of monitor operation for vmfence on srv1: Timed Out | call=39 key=vmfence_monitor_60000 timeout=20000ms
> /var/log/pacemaker/pacemaker.log:Jun 17 08:33:55 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed monitor of vmfence on srv1: OCF_TIMEOUT | rc=198
> /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-fenced [26722] (child_timeout_callback) warning: fence_vmware_soap_monitor_1 process (PID 43215) timed out
> /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-fenced [26722] (operation_finished) warning: fence_vmware_soap_monitor_1:43215 - timed out after 20000ms
> /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-controld [26726] (process_lrm_event) error: Result of start operation for vmfence on srv1: Timed Out | call=44 key=vmfence_start_0 timeout=20000ms
> /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-controld [26726] (status_from_rc) warning: Action 39 (vmfence_start_0) on srv1 failed (target: 0 vs. rc: 198): Error
> /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
> /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
> /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
> /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
> /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-schedulerd[26725] (check_migration_threshold) warning: Forcing vmfence away from srv1 after 1000000 failures (max=5)
> /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
> /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
> /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-schedulerd[26725] (check_migration_threshold) warning: Forcing vmfence away from srv1 after 1000000 failures (max=5)
> /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1 pacemaker-controld [26726] (status_from_rc) warning: Action 38 (vmfence_start_0) on srv2 failed (target: 0 vs. rc: 198): Error
> /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv2: OCF_TIMEOUT | rc=198
> /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv2: OCF_TIMEOUT | rc=198
> /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
> /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (check_migration_threshold) warning: Forcing vmfence away from srv1 after 1000000 failures (max=5)
> /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv2: OCF_TIMEOUT | rc=198
> /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv2: OCF_TIMEOUT | rc=198
> /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
> /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (check_migration_threshold) warning: Forcing vmfence away from srv1 after 1000000 failures (max=5)
> /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (check_migration_threshold) warning: Forcing vmfence away from srv2 after 1000000 failures (max=5)

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
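P.S.: To help the cluster recover on its own after transient fencing failures, a failure-timeout can be set on the stonith resource so the failcount expires instead of staying at INFINITY (untested sketch; pcs 0.10 syntax, 300s is just an illustrative value):

```shell
# Let the failcount on vmfence expire after 5 minutes, so Pacemaker
# will retry starting the fencing resource on its own:
pcs resource update vmfence meta failure-timeout=300s

# Until then, the failcount can also be cleared manually:
pcs resource cleanup vmfence
```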