Re: [ClusterLabs] New user needs some help stabilizing the cluster
Don't forget to increase the consensus!

Best Regards,
Strahil Nikolov

On 11 June 2020 at 22:11:09 GMT+03:00, Howard wrote:
>This is interesting. So it seems that 13,000 ms, or 13 seconds, is how long
>the VM was frozen during the snapshot backup and 0.8 seconds is the
>threshold. We will be disabling the snapshot backups and may increase the
>token timeout a bit since these systems are not so critical.
>
>Thanks Honza for helping me understand.
>
>Howard
>
>On Thu, Jun 11, 2020 at 12:36 AM Jan Friesse wrote:
>
>> Howard,
>>
>> ...
>>
>> The most important info is the following line:
>>
>> > Jun 10 00:06:41 [10558] srv2 corosync warning [MAIN ] Corosync main
>> > process was not scheduled for 13006.0615 ms (threshold is 800. ms).
>> > Consider token timeout increase.
>>
>> There are more of these, so you can either make sure the VM is not paused
>> for such a long time or increase the token timeout so corosync is able to
>> handle such a pause.
>>
>> Regards,
>>    Honza
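For reference, both values live in the totem section of /etc/corosync/corosync.conf. A minimal sketch with example values only (consensus defaults to 1.2 * token, so it only needs to be set explicitly if you want a different ratio; after editing the file on both nodes, the running cluster can typically be told to reload it with corosync-cfgtool -R):

    totem {
        # example values only -- size the token to comfortably exceed the longest
        # pause your VMs can realistically hit (e.g. the ~13 s freeze seen in the logs)
        token: 15000         # token timeout in ms (corosync default is 1000)
        consensus: 18000     # must be larger than token; defaults to 1.2 * token
    }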
Re: [ClusterLabs] New user needs some help stabilizing the cluster
And I forgot to ask ... Are you using a memory-based snapshot? It shouldn't take so long.

Best Regards,
Strahil Nikolov

On 12 June 2020 at 7:10:38 GMT+03:00, Strahil Nikolov wrote:
>Don't forget to increase the consensus!
>
>Best Regards,
>Strahil Nikolov
>
>On 11 June 2020 at 22:11:09 GMT+03:00, Howard wrote:
>>This is interesting. So it seems that 13,000 ms, or 13 seconds, is how long
>>the VM was frozen during the snapshot backup and 0.8 seconds is the
>>threshold. We will be disabling the snapshot backups and may increase the
>>token timeout a bit since these systems are not so critical.
>>
>>Thanks Honza for helping me understand.
>>
>>Howard
>>
>>On Thu, Jun 11, 2020 at 12:36 AM Jan Friesse wrote:
>>
>>> Howard,
>>>
>>> ...
>>>
>>> The most important info is the following line:
>>>
>>> > Jun 10 00:06:41 [10558] srv2 corosync warning [MAIN ] Corosync main
>>> > process was not scheduled for 13006.0615 ms (threshold is 800. ms).
>>> > Consider token timeout increase.
>>>
>>> There are more of these, so you can either make sure the VM is not paused
>>> for such a long time or increase the token timeout so corosync is able to
>>> handle such a pause.
>>>
>>> Regards,
>>>    Honza
Re: [ClusterLabs] New user needs some help stabilizing the cluster
This is interesting. So it seems that 13,000 ms, or 13 seconds, is how long
the VM was frozen during the snapshot backup and 0.8 seconds is the
threshold. We will be disabling the snapshot backups and may increase the
token timeout a bit since these systems are not so critical.

Thanks Honza for helping me understand.

Howard

On Thu, Jun 11, 2020 at 12:36 AM Jan Friesse wrote:

> Howard,
>
> ...
>
> The most important info is the following line:
>
> > Jun 10 00:06:41 [10558] srv2 corosync warning [MAIN ] Corosync main
> > process was not scheduled for 13006.0615 ms (threshold is 800. ms).
> > Consider token timeout increase.
>
> There are more of these, so you can either make sure the VM is not paused
> for such a long time or increase the token timeout so corosync is able to
> handle such a pause.
>
> Regards,
>    Honza
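If it helps when tuning this, the values corosync is actually running with can be checked at runtime on either node, for example (key names may differ slightly between corosync versions):

    # show the token and consensus timeouts currently in effect
    corosync-cmapctl | grep -i totem.token
    corosync-cmapctl | grep -i totem.consensus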
Re: [ClusterLabs] New user needs some help stabilizing the cluster
Howard,

> Good morning. Thanks for reading. We have a requirement to provide high
> availability for PostgreSQL 10. I have built a two-node cluster with a
> quorum device as the third vote, all running on RHEL 8.
>
> Here are the versions installed:
> [postgres@srv2 cluster]$ rpm -qa|grep "pacemaker\|pcs\|corosync\|fence-agents-vmware-soap\|paf"
> corosync-3.0.2-3.el8_1.1.x86_64
> corosync-qdevice-3.0.0-2.el8.x86_64
> corosync-qnetd-3.0.0-2.el8.x86_64
> corosynclib-3.0.2-3.el8_1.1.x86_64
> fence-agents-vmware-soap-4.2.1-41.el8.noarch
> pacemaker-2.0.2-3.el8_1.2.x86_64
> pacemaker-cli-2.0.2-3.el8_1.2.x86_64
> pacemaker-cluster-libs-2.0.2-3.el8_1.2.x86_64
> pacemaker-libs-2.0.2-3.el8_1.2.x86_64
> pacemaker-schemas-2.0.2-3.el8_1.2.noarch
> pcs-0.10.2-4.el8.x86_64
> resource-agents-paf-2.3.0-1.noarch
>
> These are VMware VMs, so I configured the cluster to use the ESX host as
> the fencing device using fence_vmware_soap.
>
> Throughout each day things generally work very well. The cluster remains
> online and healthy. Unfortunately, when I check pcs status in the mornings,
> I see that all kinds of things went wrong overnight. It is hard to pinpoint
> what the issue is as there is so much information being written to the
> pacemaker.log, and I end up scrolling through pages and pages of
> informational log entries trying to find the lines that pertain to the
> issue. Is there a way to separate the logs out to make it easier to scroll
> through? Or maybe a list of keywords to grep for?

The most important info is the following line:

> Jun 10 00:06:41 [10558] srv2 corosync warning [MAIN ] Corosync main
> process was not scheduled for 13006.0615 ms (threshold is 800. ms).
> Consider token timeout increase.

There are more of these, so you can either make sure the VM is not paused
for such a long time or increase the token timeout so corosync is able to
handle such a pause.

Regards,
   Honza

> It is clearly indicating that the server lost contact with the other node
> and also the quorum device. Is there a way to make this configuration more
> robust or able to recover from a connectivity blip?
> Here are the pacemaker and corosync logs for this morning's failures:
>
> pacemaker.log
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:42 srv2 pacemakerd [10573] (pcmk_quorum_notification) warning: Quorum lost | membership=952 members=1
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:42 srv2 pacemaker-controld [10579] (pcmk_quorum_notification) warning: Quorum lost | membership=952 members=1
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (pe_fence_node) warning: Cluster node srv1 will be fenced: peer is no longer part of the cluster
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (determine_online_status) warning: Node srv1 is unclean
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_demote_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_stop_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_demote_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_stop_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_demote_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_stop_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_demote_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_stop_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsql-master-ip_stop_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (stage6) warning: Scheduling Node srv1 for STONITH
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (pcmk__log_transition_summary) warning: Calculated transition 2 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-34.bz2
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld [10579] (crmd_ha_msg_filter) warning: Another DC detected: srv1 (op=join_offer)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld [10579] (destroy_action) warning: Cancelling timer for action 3 (src=307)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:45
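On the question above about keywords to grep for: there is no official list, but a rough sketch of patterns that tend to surface the interesting lines (adjust to taste) might be:

    # pull the usual suspects out of a busy pacemaker.log
    grep -E 'error|warning|fenc|unclean|quorum|not scheduled' /var/log/pacemaker/pacemaker.log

    # corosync logs to syslog/journal on RHEL 8, so a similar view there is roughly:
    journalctl -u corosync -u pacemaker --since yesterday | grep -iE 'token|membership|not scheduled|quorum'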
Re: [ClusterLabs] New user needs some help stabilizing the cluster
What are your corosync.conf timeouts (especially token & consensus)? Last time I did a live migration of a RHEL 7 node with the default values, the cluster fenced it - so I set the token to 10s and also raised the consensus (check 'man corosync.conf') above the default.

Also, start your investigation from the virtualization layer, as a lot of backups run overnight. Last week I had a cluster node fenced because it failed to respond for 40s. Thankfully that was just a QA cluster, so it wasn't a big deal.

The most common reasons for a VM to fail to respond are:
- CPU starvation due to high CPU utilisation on the host
- I/O issues causing the VM to pause
- Lots of backups eating the bandwidth on any of the hypervisors or on a switch between them (if you have a single heartbeat network)

With RHEL 8, corosync allows using more than 2 heartbeat rings, as well as new features such as SCTP.

P.S.: You can use a second fencing mechanism like 'sbd', a.k.a. "poison pill"; just make the vmdk shared & independent. This way your cluster can operate even when the vCenter is unreachable for any reason.

Best Regards,
Strahil Nikolov

On 10 June 2020 at 20:06:28 GMT+03:00, Howard wrote:
>Good morning. Thanks for reading. We have a requirement to provide high
>availability for PostgreSQL 10. I have built a two-node cluster with a
>quorum device as the third vote, all running on RHEL 8.
>
>Here are the versions installed:
>[postgres@srv2 cluster]$ rpm -qa|grep "pacemaker\|pcs\|corosync\|fence-agents-vmware-soap\|paf"
>corosync-3.0.2-3.el8_1.1.x86_64
>corosync-qdevice-3.0.0-2.el8.x86_64
>corosync-qnetd-3.0.0-2.el8.x86_64
>corosynclib-3.0.2-3.el8_1.1.x86_64
>fence-agents-vmware-soap-4.2.1-41.el8.noarch
>pacemaker-2.0.2-3.el8_1.2.x86_64
>pacemaker-cli-2.0.2-3.el8_1.2.x86_64
>pacemaker-cluster-libs-2.0.2-3.el8_1.2.x86_64
>pacemaker-libs-2.0.2-3.el8_1.2.x86_64
>pacemaker-schemas-2.0.2-3.el8_1.2.noarch
>pcs-0.10.2-4.el8.x86_64
>resource-agents-paf-2.3.0-1.noarch
>
>These are VMware VMs, so I configured the cluster to use the ESX host as
>the fencing device using fence_vmware_soap.
>
>Throughout each day things generally work very well. The cluster remains
>online and healthy. Unfortunately, when I check pcs status in the mornings,
>I see that all kinds of things went wrong overnight. It is hard to pinpoint
>what the issue is as there is so much information being written to the
>pacemaker.log, and I end up scrolling through pages and pages of
>informational log entries trying to find the lines that pertain to the
>issue. Is there a way to separate the logs out to make it easier to scroll
>through? Or maybe a list of keywords to grep for?
>
>It is clearly indicating that the server lost contact with the other node
>and also the quorum device. Is there a way to make this configuration more
>robust or able to recover from a connectivity blip?
>
>Here are the pacemaker and corosync logs for this morning's failures:
>
>pacemaker.log
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:42 srv2 pacemakerd [10573] (pcmk_quorum_notification) warning: Quorum lost | membership=952 members=1
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:42 srv2 pacemaker-controld [10579] (pcmk_quorum_notification) warning: Quorum lost | membership=952 members=1
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (pe_fence_node) warning: Cluster node srv1 will be fenced: peer is no longer part of the cluster
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (determine_online_status) warning: Node srv1 is unclean
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_demote_0 on srv1 is unrunnable (offline)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_stop_0 on srv1 is unrunnable (offline)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_demote_0 on srv1 is unrunnable (offline)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_stop_0 on srv1 is unrunnable (offline)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_demote_0 on srv1 is unrunnable (offline)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_stop_0 on srv1 is unrunnable (offline)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_demote_0 on srv1 is unrunnable (offline)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_stop_0 on srv1 is unrunnable (offline)
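On the extra heartbeat ring suggestion: with corosync 3 and the knet transport, an additional link is just one more ringX_addr per node in the nodelist of corosync.conf. A minimal sketch (node names taken from this thread, addresses made up for illustration):

    nodelist {
        node {
            name: srv1
            nodeid: 1
            ring0_addr: 192.168.10.11   # existing heartbeat network (example address)
            ring1_addr: 192.168.20.11   # second, independent network (example address)
        }
        node {
            name: srv2
            nodeid: 2
            ring0_addr: 192.168.10.12
            ring1_addr: 192.168.20.12
        }
    }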
Re: [ClusterLabs] New user needs some help stabilizing the cluster
Hi everyone. As a follow-up, I found that the VMs were having a snapshot backup taken at the time of the disconnects, which I think freezes I/O. We'll be addressing that. Is there anything else in the log that can be improved?

Thanks,
Howard

On Wed, Jun 10, 2020 at 10:06 AM Howard wrote:

> Good morning. Thanks for reading. We have a requirement to provide high
> availability for PostgreSQL 10. I have built a two-node cluster with a
> quorum device as the third vote, all running on RHEL 8.
>
> Here are the versions installed:
> [postgres@srv2 cluster]$ rpm -qa|grep "pacemaker\|pcs\|corosync\|fence-agents-vmware-soap\|paf"
> corosync-3.0.2-3.el8_1.1.x86_64
> corosync-qdevice-3.0.0-2.el8.x86_64
> corosync-qnetd-3.0.0-2.el8.x86_64
> corosynclib-3.0.2-3.el8_1.1.x86_64
> fence-agents-vmware-soap-4.2.1-41.el8.noarch
> pacemaker-2.0.2-3.el8_1.2.x86_64
> pacemaker-cli-2.0.2-3.el8_1.2.x86_64
> pacemaker-cluster-libs-2.0.2-3.el8_1.2.x86_64
> pacemaker-libs-2.0.2-3.el8_1.2.x86_64
> pacemaker-schemas-2.0.2-3.el8_1.2.noarch
> pcs-0.10.2-4.el8.x86_64
> resource-agents-paf-2.3.0-1.noarch
>
> These are VMware VMs, so I configured the cluster to use the ESX host as
> the fencing device using fence_vmware_soap.
>
> Throughout each day things generally work very well. The cluster remains
> online and healthy. Unfortunately, when I check pcs status in the mornings,
> I see that all kinds of things went wrong overnight. It is hard to pinpoint
> what the issue is as there is so much information being written to the
> pacemaker.log, and I end up scrolling through pages and pages of
> informational log entries trying to find the lines that pertain to the
> issue. Is there a way to separate the logs out to make it easier to scroll
> through? Or maybe a list of keywords to grep for?
>
> It is clearly indicating that the server lost contact with the other node
> and also the quorum device. Is there a way to make this configuration more
> robust or able to recover from a connectivity blip?
>
> Here are the pacemaker and corosync logs for this morning's failures:
>
> pacemaker.log
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:42 srv2 pacemakerd [10573] (pcmk_quorum_notification) warning: Quorum lost | membership=952 members=1
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:42 srv2 pacemaker-controld [10579] (pcmk_quorum_notification) warning: Quorum lost | membership=952 members=1
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (pe_fence_node) warning: Cluster node srv1 will be fenced: peer is no longer part of the cluster
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (determine_online_status) warning: Node srv1 is unclean
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_demote_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_stop_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_demote_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_stop_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_demote_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_stop_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_demote_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_stop_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsql-master-ip_stop_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (stage6) warning: Scheduling Node srv1 for STONITH
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (pcmk__log_transition_summary) warning: Calculated transition 2 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-34.bz2
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld [10579] (crmd_ha_msg_filter) warning: Another DC detected: srv1 (op=join_offer)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld [10579] (destroy_action) warning: Cancelling timer for action 3 (src=307)
>