Hello, recently I received some really great advice from this community regarding changing the token timeout value in corosync. Thank you! Since then the cluster has been working perfectly with no errors in the log for more than a week.
This morning I logged in to find a stopped stonith device. If I'm reading the log right, it looks like it failed 1 million times in ~20 seconds and then gave up. If you wouldn't mind looking at the logs below: is there some way I can make this more robust so that it recovers on its own? I'll be investigating the reason for the timeout, but I'd like to help the system recover by itself.

Servers: RHEL 8.2

Cluster name: cluster_pgperf2
Stack: corosync
Current DC: srv1 (version 2.0.2-3.el8_1.2-744a30d655) - partition with quorum
Last updated: Wed Jun 17 11:47:42 2020
Last change: Tue Jun 16 22:00:29 2020 by root via crm_attribute on srv1

2 nodes configured
4 resources configured

Online: [ srv1 srv2 ]

Full list of resources:

 Clone Set: pgsqld-clone [pgsqld] (promotable)
     Masters: [ srv1 ]
     Slaves: [ srv2 ]
 pgsql-master-ip   (ocf::heartbeat:IPaddr2):   Started srv1
 vmfence   (stonith:fence_vmware_soap):   Stopped

Failed Resource Actions:
* vmfence_start_0 on srv2 'OCF_TIMEOUT' (198): call=19, status=Timed Out, exitreason='', last-rc-change='Wed Jun 17 08:34:16 2020', queued=7ms, exec=20184ms
* vmfence_start_0 on srv1 'OCF_TIMEOUT' (198): call=44, status=Timed Out, exitreason='', last-rc-change='Wed Jun 17 08:33:55 2020', queued=0ms, exec=20008ms

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

pcs resource config:

 Clone: pgsqld-clone
  Meta Attrs: notify=true promotable=true
  Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
   Attributes: bindir=/usr/bin pgdata=/var/lib/pgsql/data
   Operations: demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
               methods interval=0s timeout=5 (pgsqld-methods-interval-0s)
               monitor interval=15s role=Master timeout=60s (pgsqld-monitor-interval-15s)
               monitor interval=16s role=Slave timeout=60s (pgsqld-monitor-interval-16s)
               notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
               promote interval=0s timeout=30s (pgsqld-promote-interval-0s)
               reload interval=0s timeout=20 (pgsqld-reload-interval-0s)
               start interval=0s timeout=60s (pgsqld-start-interval-0s)
               stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
               monitor interval=60s timeout=60s (pgsqld-monitor-interval-60s)
 Resource: pgsql-master-ip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: cidr_netmask=24 ip=xxx.xxx.xxx.xxx
  Operations: monitor interval=10s (pgsql-master-ip-monitor-interval-10s)
              start interval=0s timeout=20s (pgsql-master-ip-start-interval-0s)
              stop interval=0s timeout=20s (pgsql-master-ip-stop-interval-0s)

pcs stonith config:

 Resource: vmfence (class=stonith type=fence_vmware_soap)
  Attributes: ipaddr=xxx.xxx.xxx.xxx login=xxxx\xxxxxxxx passwd_script=xxxxxxxx pcmk_host_map=srv1:xxxxxxxxx;srv2:yyyyyyyyy ssl=1 ssl_insecure=1
  Operations: monitor interval=60s (vmfence-monitor-interval-60s)

pcs resource failcount show:

Failcounts for resource 'vmfence'
  srv1: INFINITY
  srv2: INFINITY

Here are the versions installed:

[postgres@srv1 cluster]$ rpm -qa|grep "pacemaker\|pcs\|corosync\|fence-agents-vmware-soap\|paf"
corosync-3.0.2-3.el8_1.1.x86_64
corosync-qdevice-3.0.0-2.el8.x86_64
corosync-qnetd-3.0.0-2.el8.x86_64
corosynclib-3.0.2-3.el8_1.1.x86_64
fence-agents-vmware-soap-4.2.1-41.el8.noarch
pacemaker-2.0.2-3.el8_1.2.x86_64
pacemaker-cli-2.0.2-3.el8_1.2.x86_64
pacemaker-cluster-libs-2.0.2-3.el8_1.2.x86_64
pacemaker-libs-2.0.2-3.el8_1.2.x86_64
pacemaker-schemas-2.0.2-3.el8_1.2.noarch
pcs-0.10.2-4.el8.x86_64
resource-agents-paf-2.3.0-1.noarch

Here are the errors and warnings from the pacemaker.log, from the first warning until it gave up:
Jun 17 08:33:55 srv1 pacemaker-fenced [26722] (child_timeout_callback) warning: fence_vmware_soap_monitor_1 process (PID 43095) timed out
Jun 17 08:33:55 srv1 pacemaker-fenced [26722] (operation_finished) warning: fence_vmware_soap_monitor_1:43095 - timed out after 20000ms
Jun 17 08:33:55 srv1 pacemaker-controld [26726] (process_lrm_event) error: Result of monitor operation for vmfence on srv1: Timed Out | call=39 key=vmfence_monitor_60000 timeout=20000ms
Jun 17 08:33:55 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed monitor of vmfence on srv1: OCF_TIMEOUT | rc=198
Jun 17 08:34:15 srv1 pacemaker-fenced [26722] (child_timeout_callback) warning: fence_vmware_soap_monitor_1 process (PID 43215) timed out
Jun 17 08:34:15 srv1 pacemaker-fenced [26722] (operation_finished) warning: fence_vmware_soap_monitor_1:43215 - timed out after 20000ms
Jun 17 08:34:15 srv1 pacemaker-controld [26726] (process_lrm_event) error: Result of start operation for vmfence on srv1: Timed Out | call=44 key=vmfence_start_0 timeout=20000ms
Jun 17 08:34:15 srv1 pacemaker-controld [26726] (status_from_rc) warning: Action 39 (vmfence_start_0) on srv1 failed (target: 0 vs. rc: 198): Error
Jun 17 08:34:15 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
Jun 17 08:34:15 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
Jun 17 08:34:15 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
Jun 17 08:34:15 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
Jun 17 08:34:15 srv1 pacemaker-schedulerd[26725] (check_migration_threshold) warning: Forcing vmfence away from srv1 after 1000000 failures (max=5)
Jun 17 08:34:15 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
Jun 17 08:34:15 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
Jun 17 08:34:15 srv1 pacemaker-schedulerd[26725] (check_migration_threshold) warning: Forcing vmfence away from srv1 after 1000000 failures (max=5)
Jun 17 08:34:36 srv1 pacemaker-controld [26726] (status_from_rc) warning: Action 38 (vmfence_start_0) on srv2 failed (target: 0 vs. rc: 198): Error
Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv2: OCF_TIMEOUT | rc=198
Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv2: OCF_TIMEOUT | rc=198
Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (check_migration_threshold) warning: Forcing vmfence away from srv1 after 1000000 failures (max=5)
Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv2: OCF_TIMEOUT | rc=198
Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv2: OCF_TIMEOUT | rc=198
Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (check_migration_threshold) warning: Forcing vmfence away from srv1 after 1000000 failures (max=5)
Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (check_migration_threshold) warning: Forcing vmfence away from srv2 after 1000000 failures (max=5)