Re: [ClusterLabs] New user needs some help stabilizing the cluster

2020-06-12 Thread Strahil Nikolov
Don't forget to increase the consensus!

Best Regards,
Strahil Nikolov

On 11 June 2020 at 22:11:09 GMT+03:00, Howard  wrote:
>This is interesting. So it seems that 13,000 ms or 13 seconds is how long
>the VM was frozen during the snapshot backup and 0.8 seconds is the
>threshold. We will be disabling the snapshot backups and may increase the
>token timeout a bit since these systems are not so critical.
>
>Thanks Honza for helping me understand.
>
>Howard
>
>On Thu, Jun 11, 2020 at 12:36 AM Jan Friesse 
>wrote:
>
>> Howard,
>>
>>
>> ...
>>
>> The most important info is the following line:
>>
>>  > Jun 10 00:06:41 [10558] srv2 corosync warning [MAIN  ] Corosync main
>>  > process was not scheduled for 13006.0615 ms (threshold is 800. ms).
>>  > Consider token timeout increase.
>>
>> There are more of these, so you can either make sure the VM is not paused
>> for such a long time or increase the token timeout so corosync is able to
>> handle such a pause.
>>
>> Regards,
>>Honza
>>
>>
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] New user needs some help stabilizing the cluster

2020-06-12 Thread Strahil Nikolov
And I forgot to ask ... Are you using a memory-based snapshot?
It shouldn't take so long.

Best Regards,
Strahil Nikolov

On 12 June 2020 at 7:10:38 GMT+03:00, Strahil Nikolov  wrote:
>Don't forget to increase the consensus!
>
>Best Regards,
>Strahil Nikolov
>
>On 11 June 2020 at 22:11:09 GMT+03:00, Howard  wrote:
>>This is interesting. So it seems that 13,000 ms or 13 seconds is how long
>>the VM was frozen during the snapshot backup and 0.8 seconds is the
>>threshold. We will be disabling the snapshot backups and may increase the
>>token timeout a bit since these systems are not so critical.
>>
>>Thanks Honza for helping me understand.
>>
>>Howard
>>
>>On Thu, Jun 11, 2020 at 12:36 AM Jan Friesse 
>>wrote:
>>
>>> Howard,
>>>
>>>
>>> ...
>>>
>>> The most important info is the following line:
>>>
>>>  > Jun 10 00:06:41 [10558] srv2 corosync warning [MAIN  ] Corosync main
>>>  > process was not scheduled for 13006.0615 ms (threshold is 800. ms).
>>>  > Consider token timeout increase.
>>>
>>> There are more of these, so you can either make sure the VM is not paused
>>> for such a long time or increase the token timeout so corosync is able to
>>> handle such a pause.
>>>
>>> Regards,
>>>Honza
>>>
>>>


Re: [ClusterLabs] New user needs some help stabilizing the cluster

2020-06-11 Thread Howard
This is interesting. So it seems that 13,000 ms or 13 seconds is how long
the VM was frozen during the snapshot backup and 0.8 seconds is the
threshold. We will be disabling the snapshot backups and may increase the
token timeout a bit since these systems are not so critical.

Thanks Honza for helping me understand.

Howard

On Thu, Jun 11, 2020 at 12:36 AM Jan Friesse  wrote:

> Howard,
>
>
> ...
>
> The most important info is the following line:
>
>  > Jun 10 00:06:41 [10558] srv2 corosync warning [MAIN  ] Corosync main
>  > process was not scheduled for 13006.0615 ms (threshold is 800. ms).
>  > Consider token timeout increase.
>
> There are more of these, so you can either make sure the VM is not paused
> for such a long time or increase the token timeout so corosync is able to
> handle such a pause.
>
> Regards,
>Honza
>
>


Re: [ClusterLabs] New user needs some help stabilizing the cluster

2020-06-11 Thread Jan Friesse

Howard,



Good morning.  Thanks for reading.  We have a requirement to provide high
availability for PostgreSQL 10.  I have built a two node cluster with a
quorum device as the third vote, all running on RHEL 8.

Here are the versions installed:
[postgres@srv2 cluster]$ rpm -qa|grep
"pacemaker\|pcs\|corosync\|fence-agents-vmware-soap\|paf"
corosync-3.0.2-3.el8_1.1.x86_64
corosync-qdevice-3.0.0-2.el8.x86_64
corosync-qnetd-3.0.0-2.el8.x86_64
corosynclib-3.0.2-3.el8_1.1.x86_64
fence-agents-vmware-soap-4.2.1-41.el8.noarch
pacemaker-2.0.2-3.el8_1.2.x86_64
pacemaker-cli-2.0.2-3.el8_1.2.x86_64
pacemaker-cluster-libs-2.0.2-3.el8_1.2.x86_64
pacemaker-libs-2.0.2-3.el8_1.2.x86_64
pacemaker-schemas-2.0.2-3.el8_1.2.noarch
pcs-0.10.2-4.el8.x86_64
resource-agents-paf-2.3.0-1.noarch

These are VMware VMs, so I configured the cluster to use the ESX host as the
fencing device using fence_vmware_soap.
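
For reference, a stonith resource of that kind is typically created along
these lines; every address, credential and VM name below is a placeholder
rather than the real configuration from this cluster:

    # sketch only - map each cluster node name to its VM name on the vCenter/ESX side
    pcs stonith create vmfence fence_vmware_soap \
        ipaddr=vcenter.example.com ssl=1 ssl_insecure=1 \
        login=fence-user passwd=fence-pass \
        pcmk_host_map="srv1:srv1-vm;srv2:srv2-vm" \
        op monitor interval=60s

The pcmk_host_map entries tell pacemaker which virtual machine to power-cycle
for each cluster node name.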

Throughout each day things generally work very well.  The cluster remains
online and healthy. Unfortunately, when I check pcs status in the mornings,
I see that all kinds of things went wrong overnight.  It is hard to
pinpoint what the issue is because there is so much information being written
to the pacemaker.log; I end up scrolling through pages and pages of
informational log entries trying to find the lines that pertain to the issue.
Is there a way to separate the logs out to make them easier to scroll
through, or maybe a list of keywords to grep for?


The most important info is the following line:

> Jun 10 00:06:41 [10558] srv2 corosync warning [MAIN  ] Corosync main
> process was not scheduled for 13006.0615 ms (threshold is 800. ms).
> Consider token timeout increase.

There are more of these, so you can either make sure the VM is not paused 
for such a long time or increase the token timeout so corosync is able to 
handle such a pause.
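
A minimal sketch of what that change can look like, assuming the stock RHEL 8
corosync.conf layout; 15000 ms is only an illustrative value picked to cover
the ~13 s pause seen above (the 800 ms threshold in the warning corresponds
to 80% of the default 1000 ms token):

    # /etc/corosync/corosync.conf - keep the file identical on both nodes
    totem {
        version: 2
        cluster_name: hacluster    # placeholder, keep your existing name
        token: 15000               # total token timeout in milliseconds
        # consensus defaults to 1.2 * token when it is not set explicitly
        # (keep the rest of your existing totem settings unchanged)
    }

After editing, something like 'pcs cluster sync' followed by
'pcs cluster reload corosync' (or 'corosync-cfgtool -R') should distribute
and activate the new values without restarting the cluster.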


Regards,
  Honza



It is clearly indicating that the server lost contact with the other node
and also the quorum device. Is there a way to make this configuration more
robust or able to recover from a connectivity blip?

Here are the pacemaker and corosync logs for this morning's failures:
pacemaker.log
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:42 srv2 pacemakerd
  [10573] (pcmk_quorum_notification)   warning: Quorum lost |
membership=952 members=1
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:42 srv2 pacemaker-controld
  [10579] (pcmk_quorum_notification)   warning: Quorum lost |
membership=952 members=1
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (pe_fence_node)  warning: Cluster node srv1
will be fenced: peer is no longer part of the cluster
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (determine_online_status)warning: Node
srv1 is unclean
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (custom_action)  warning: Action
pgsqld:1_demote_0 on srv1 is unrunnable (offline)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (custom_action)  warning: Action
pgsqld:1_stop_0 on srv1 is unrunnable (offline)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (custom_action)  warning: Action
pgsqld:1_demote_0 on srv1 is unrunnable (offline)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (custom_action)  warning: Action
pgsqld:1_stop_0 on srv1 is unrunnable (offline)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (custom_action)  warning: Action
pgsqld:1_demote_0 on srv1 is unrunnable (offline)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (custom_action)  warning: Action
pgsqld:1_stop_0 on srv1 is unrunnable (offline)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (custom_action)  warning: Action
pgsqld:1_demote_0 on srv1 is unrunnable (offline)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (custom_action)  warning: Action
pgsqld:1_stop_0 on srv1 is unrunnable (offline)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (custom_action)  warning: Action
pgsql-master-ip_stop_0 on srv1 is unrunnable (offline)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (stage6) warning: Scheduling Node srv1
for STONITH
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (pcmk__log_transition_summary)   warning:
Calculated transition 2 (with warnings), saving inputs in
/var/lib/pacemaker/pengine/pe-warn-34.bz2
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld
  [10579] (crmd_ha_msg_filter) warning: Another DC detected: srv1
(op=join_offer)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld
  [10579] (destroy_action) warning: Cancelling timer for action 3
(src=307)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 

Re: [ClusterLabs] New user needs some help stabilizing the cluster

2020-06-10 Thread Strahil Nikolov
What are your corosync.conf timeouts (especially token & consensus)?
Last time I did a live migration of a RHEL 7 node with the default values, the 
cluster fenced it - thus I set the token to 10s and also raised the 
consensus (check 'man corosync.conf') above the default.
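
As a quick sanity check, the values corosync is actually running with can be
read back from the runtime cmap keys; a sketch, assuming corosync 3 key names:

    # values are in milliseconds; with token=10000 and no explicit consensus,
    # consensus comes out as 1.2 * token = 12000
    corosync-cmapctl | grep -E 'runtime\.config\.totem\.(token|consensus)'
    #   runtime.config.totem.token (u32) = 10000
    #   runtime.config.totem.consensus (u32) = 12000

Raising consensus explicitly, as suggested above, only matters if you want it
higher than that 1.2 * token default.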

Also, start your investigation from the virtualization layer, as during the 
nights a lot of backups are going on. Last week I got a cluster node fenced 
because it failed to respond for 40s. Thankfully that was just a QA cluster, 
so it wasn't a big deal.

The most common reasons for a VM to fail to respond are:
- CPU starvation due to high CPU utilisation on the host
- I/O issues causing the VM to pause
- Lots of backups eating the bandwidth on any of the hypervisors or on a 
switch between them (if you have a single heartbeat network)

With RHEL 8, corosync allows using more than 2 heartbeat rings and brand new 
features like SCTP.
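
For illustration, a second heartbeat link with corosync 3 / knet is just an
extra ringX_addr per node, and SCTP can be selected per link; the addresses
below are placeholders:

    nodelist {
        node {
            name: srv1
            nodeid: 1
            ring0_addr: 10.0.0.1        # existing heartbeat network
            ring1_addr: 192.168.50.1    # second, independent network
        }
        node {
            name: srv2
            nodeid: 2
            ring0_addr: 10.0.0.2
            ring1_addr: 192.168.50.2
        }
    }
    totem {
        interface {
            linknumber: 1
            knet_transport: sctp        # optional per-link transport (default udp)
        }
    }

Newer pcs releases can also manage extra links directly (see 'pcs cluster
link' in the pcs help); otherwise edit corosync.conf by hand as sketched.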

P.S.: You can use a second fencing mechanism like 'sbd' a.k.a. "poison pill", 
just make the vmdk shared & independent. This way your cluster can operate 
even when the vCenter is unreachable for any reason.
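
A rough sketch of the disk-based ("poison pill") part, assuming the shared
vmdk shows up on both nodes as /dev/disk/by-id/XYZ (placeholder); the watchdog
and service wiring are left out because they depend on the platform:

    # initialise the sbd slots on the shared disk (this wipes that disk)
    sbd -d /dev/disk/by-id/XYZ create
    # verify the header and the per-node message slots
    sbd -d /dev/disk/by-id/XYZ dump
    sbd -d /dev/disk/by-id/XYZ list
    # then reference the device via SBD_DEVICE= in /etc/sysconfig/sbd on both
    # nodes and enable sbd in the cluster (e.g. 'pcs stonith sbd enable' on RHEL 8)

Because the nodes talk to the disk directly, this path keeps working even
when vCenter (and therefore fence_vmware_soap) is unreachable.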

Best Regards,
Strahil Nikolov


On 10 June 2020 at 20:06:28 GMT+03:00, Howard  wrote:
>Good morning.  Thanks for reading.  We have a requirement to provide high
>availability for PostgreSQL 10.  I have built a two node cluster with a
>quorum device as the third vote, all running on RHEL 8.
>Here are the versions installed:
>[postgres@srv2 cluster]$ rpm -qa|grep
>"pacemaker\|pcs\|corosync\|fence-agents-vmware-soap\|paf"
>corosync-3.0.2-3.el8_1.1.x86_64
>corosync-qdevice-3.0.0-2.el8.x86_64
>corosync-qnetd-3.0.0-2.el8.x86_64
>corosynclib-3.0.2-3.el8_1.1.x86_64
>fence-agents-vmware-soap-4.2.1-41.el8.noarch
>pacemaker-2.0.2-3.el8_1.2.x86_64
>pacemaker-cli-2.0.2-3.el8_1.2.x86_64
>pacemaker-cluster-libs-2.0.2-3.el8_1.2.x86_64
>pacemaker-libs-2.0.2-3.el8_1.2.x86_64
>pacemaker-schemas-2.0.2-3.el8_1.2.noarch
>pcs-0.10.2-4.el8.x86_64
>resource-agents-paf-2.3.0-1.noarch
>
>These are VMware VMs, so I configured the cluster to use the ESX host as
>the fencing device using fence_vmware_soap.
>
>Throughout each day things generally work very well.  The cluster remains
>online and healthy. Unfortunately, when I check pcs status in the mornings,
>I see that all kinds of things went wrong overnight.  It is hard to
>pinpoint what the issue is because there is so much information being
>written to the pacemaker.log; I end up scrolling through pages and pages
>of informational log entries trying to find the lines that pertain to the
>issue.  Is there a way to separate the logs out to make them easier to
>scroll through, or maybe a list of keywords to grep for?
>
>It is clearly indicating that the server lost contact with the other node
>and also the quorum device. Is there a way to make this configuration more
>robust or able to recover from a connectivity blip?
>
>Here are the pacemaker and corosync logs for this morning's failures:
>pacemaker.log
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:42 srv2 pacemakerd
> [10573] (pcmk_quorum_notification)   warning: Quorum lost |
>membership=952 members=1
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:42 srv2
>pacemaker-controld
> [10579] (pcmk_quorum_notification)   warning: Quorum lost |
>membership=952 members=1
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
>pacemaker-schedulerd[10578] (pe_fence_node)  warning: Cluster node srv1
>will be fenced: peer is no longer part of the cluster
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
>pacemaker-schedulerd[10578] (determine_online_status)warning:
>Node
>srv1 is unclean
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
>pacemaker-schedulerd[10578] (custom_action)  warning: Action
>pgsqld:1_demote_0 on srv1 is unrunnable (offline)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
>pacemaker-schedulerd[10578] (custom_action)  warning: Action
>pgsqld:1_stop_0 on srv1 is unrunnable (offline)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
>pacemaker-schedulerd[10578] (custom_action)  warning: Action
>pgsqld:1_demote_0 on srv1 is unrunnable (offline)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
>pacemaker-schedulerd[10578] (custom_action)  warning: Action
>pgsqld:1_stop_0 on srv1 is unrunnable (offline)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
>pacemaker-schedulerd[10578] (custom_action)  warning: Action
>pgsqld:1_demote_0 on srv1 is unrunnable (offline)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
>pacemaker-schedulerd[10578] (custom_action)  warning: Action
>pgsqld:1_stop_0 on srv1 is unrunnable (offline)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
>pacemaker-schedulerd[10578] (custom_action)  warning: Action
>pgsqld:1_demote_0 on srv1 is unrunnable (offline)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
>pacemaker-schedulerd[10578] (custom_action)  warning: Action
>pgsqld:1_stop_0 on srv1 is unrunnable (offline)

Re: [ClusterLabs] New user needs some help stabilizing the cluster

2020-06-10 Thread Howard
Hi everyone.  As a followup, I found that the vms were having snapshot
backup at the time of the disconnects which I think freezes IO. We'll be
addressing that.  Is there anything else in the log that can be improved.

Thanks,
Howard

On Wed, Jun 10, 2020 at 10:06 AM Howard  wrote:

> Good morning.  Thanks for reading.  We have a requirement to provide high
> availability for PostgreSQL 10.  I have built a two node cluster with a
> quorum device as the third vote, all running on RHEL 8.
>
> Here are the versions installed:
> [postgres@srv2 cluster]$ rpm -qa|grep
> "pacemaker\|pcs\|corosync\|fence-agents-vmware-soap\|paf"
> corosync-3.0.2-3.el8_1.1.x86_64
> corosync-qdevice-3.0.0-2.el8.x86_64
> corosync-qnetd-3.0.0-2.el8.x86_64
> corosynclib-3.0.2-3.el8_1.1.x86_64
> fence-agents-vmware-soap-4.2.1-41.el8.noarch
> pacemaker-2.0.2-3.el8_1.2.x86_64
> pacemaker-cli-2.0.2-3.el8_1.2.x86_64
> pacemaker-cluster-libs-2.0.2-3.el8_1.2.x86_64
> pacemaker-libs-2.0.2-3.el8_1.2.x86_64
> pacemaker-schemas-2.0.2-3.el8_1.2.noarch
> pcs-0.10.2-4.el8.x86_64
> resource-agents-paf-2.3.0-1.noarch
>
> These are VMware VMs, so I configured the cluster to use the ESX host as the
> fencing device using fence_vmware_soap.
>
> Throughout each day things generally work very well.  The cluster remains
> online and healthy. Unfortunately, when I check pcs status in the mornings,
> I see that all kinds of things went wrong overnight.  It is hard to
> pinpoint what the issue is because there is so much information being written
> to the pacemaker.log; I end up scrolling through pages and pages of
> informational log entries trying to find the lines that pertain to the issue.
> Is there a way to separate the logs out to make them easier to scroll
> through, or maybe a list of keywords to grep for?
>
> It is clearly indicating that the server lost contact with the other node
> and also the quorum device. Is there a way to make this configuration more
> robust or able to recover from a connectivity blip?
>
> Here are the pacemaker and corosync logs for this morning's failures:
> pacemaker.log
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:42 srv2 pacemakerd
>  [10573] (pcmk_quorum_notification)   warning: Quorum lost |
> membership=952 members=1
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:42 srv2 pacemaker-controld
>  [10579] (pcmk_quorum_notification)   warning: Quorum lost |
> membership=952 members=1
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
> pacemaker-schedulerd[10578] (pe_fence_node)  warning: Cluster node srv1
> will be fenced: peer is no longer part of the cluster
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
> pacemaker-schedulerd[10578] (determine_online_status)warning: Node
> srv1 is unclean
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
> pacemaker-schedulerd[10578] (custom_action)  warning: Action
> pgsqld:1_demote_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
> pacemaker-schedulerd[10578] (custom_action)  warning: Action
> pgsqld:1_stop_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
> pacemaker-schedulerd[10578] (custom_action)  warning: Action
> pgsqld:1_demote_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
> pacemaker-schedulerd[10578] (custom_action)  warning: Action
> pgsqld:1_stop_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
> pacemaker-schedulerd[10578] (custom_action)  warning: Action
> pgsqld:1_demote_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
> pacemaker-schedulerd[10578] (custom_action)  warning: Action
> pgsqld:1_stop_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
> pacemaker-schedulerd[10578] (custom_action)  warning: Action
> pgsqld:1_demote_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
> pacemaker-schedulerd[10578] (custom_action)  warning: Action
> pgsqld:1_stop_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
> pacemaker-schedulerd[10578] (custom_action)  warning: Action
> pgsql-master-ip_stop_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
> pacemaker-schedulerd[10578] (stage6) warning: Scheduling Node srv1
> for STONITH
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
> pacemaker-schedulerd[10578] (pcmk__log_transition_summary)   warning:
> Calculated transition 2 (with warnings), saving inputs in
> /var/lib/pacemaker/pengine/pe-warn-34.bz2
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld
>  [10579] (crmd_ha_msg_filter) warning: Another DC detected: srv1
> (op=join_offer)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld
>  [10579] (destroy_action) warning: Cancelling timer for action 3
> (src=307)
>