Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

2020-04-12 Thread Eric Robinson
> -----Original Message-----
> From: Strahil Nikolov 
> Sent: Sunday, April 12, 2020 1:32 AM
> To: Eric Robinson ; Cluster Labs - All topics
> related to open-source clustering welcomed ;
> Andrei Borzenkov 
> Subject: RE: [ClusterLabs] Why Do Nodes Leave the Cluster?
>
> On April 11, 2020 5:01:37 PM GMT+03:00, Eric Robinson
>  wrote:
> >
> >Hi Strahil --
> >
> >I hope you won't mind if I revive this old question. In your comments
> >below, you suggested using a 1 s token with a 1.2 s consensus. I
> >currently have 2-node clusters (will soon install a qdevice). I was
> >reading in the corosync.conf man page where it says...
> >
> >"For  two  node  clusters,  a  consensus larger than the join timeout
> >but less than token is safe.  For three node or larger clusters,
> >consensus should be larger than token."
> >
> >Do you still think the consensus should be 1.2 * token in a 2-node
> >cluster? Why is a smaller consensus considered safe for 2-node
> >clusters? Should I use a larger consensus anyway?
> >
> >--Eric
> >
> >
> >> -----Original Message-----
> >> From: Strahil Nikolov 
> >> Sent: Thursday, February 6, 2020 1:07 PM
> >> To: Eric Robinson ; Cluster Labs - All
> >topics
> >> related to open-source clustering welcomed ;
> >> Andrei Borzenkov 
> >> Subject: RE: [ClusterLabs] Why Do Nodes Leave the Cluster?
> >>
> >> On February 6, 2020 7:35:53 PM GMT+02:00, Eric Robinson
> >>  wrote:
> >> >Hi Nikolov --
> >> >
> >> >> Defaults are 1 s token, 1.2 s consensus, which is too small.
> >> >> In SUSE, token is 10 s, while consensus is 1.2 * token -> 12 s.
> >> >> With these settings, the cluster will not react for 22 s.
> >> >>
> >> >> I think it's a good start for your cluster.
> >> >> Don't forget to put the cluster in maintenance (pcs property set
> >> >> maintenance-mode=true) before restarting the stack, or even better -
> >> >> get some downtime.
> >> >>
> >> >> You can use the following article to run a simulation before
> >> >> removing the maintenance:
> >> >> https://www.suse.com/support/kb/doc/?id=7022764
> >> >>
> >> >
> >> >
> >> >Thanks for the suggestions. Any thoughts on timeouts for DRBD?
> >> >
> >> >--Eric
> >> >
> >>
> >> Hi Eric,
> >>
> >> The timeouts can be treated as 'how much time to wait before taking
> >> any action'. The workload is not very important (HANA is something
> >> different).
> >>
> >> You can try with 10 s (token), 12 s (consensus), and adjust if needed.
> >>
> >> Warning: Use a 3-node cluster, or at least 2 DRBD nodes + a qdisk. A
> >> 2-node cluster is vulnerable to split brain, especially when one of
> >> the nodes is syncing (for example, after patching) and the source is
> >> fenced/lost/disconnected. It's very hard to extract data from a
> >> semi-synced DRBD.
> >>
> >> Also, if you need guidance for SELinux, I can point you to my guide
> >> in the CentOS forum.
> >>
> >> Best Regards,
> >> Strahil Nikolov

Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

2020-04-11 Thread Eric Robinson


Hi Strahil --

I hope you won't mind if I revive this old question. In your comments below,
you suggested using a 1 s token with a 1.2 s consensus. I currently have
2-node clusters (will soon install a qdevice). I was reading in the
corosync.conf man page where it says...

"For  two  node  clusters,  a  consensus larger than the join timeout but less 
than token is safe.  For three node or larger clusters, consensus should be 
larger than token."
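For illustration only, a two-node totem section following that rule might
look like the sketch below (numbers invented for the example; join defaults
to 50 ms per the same man page):

totem {
    token: 10000        # ms
    join: 50            # ms (the default)
    consensus: 9000     # ms; larger than join, smaller than token
}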

Do you still think the consensus should be 1.2 * token in a 2-node cluster? Why 
is a smaller consensus considered safe for 2-node clusters? Should I use a 
larger consensus anyway?

--Eric


> -----Original Message-----
> From: Strahil Nikolov 
> Sent: Thursday, February 6, 2020 1:07 PM
> To: Eric Robinson ; Cluster Labs - All topics
> related to open-source clustering welcomed ;
> Andrei Borzenkov 
> Subject: RE: [ClusterLabs] Why Do Nodes Leave the Cluster?
>
> On February 6, 2020 7:35:53 PM GMT+02:00, Eric Robinson
>  wrote:
> >Hi Nikolov --
> >
> >> Defaults are 1 s token, 1.2 s consensus, which is too small.
> >> In SUSE, token is 10 s, while consensus is 1.2 * token -> 12 s.
> >> With these settings, the cluster will not react for 22 s.
> >>
> >> I think it's a good start for your cluster.
> >> Don't forget to put the cluster in maintenance (pcs property set
> >> maintenance-mode=true) before restarting the stack, or even better -
> >> get some downtime.
> >>
> >> You can use the following article to run a simulation before removing
> >> the maintenance:
> >> https://www.suse.com/support/kb/doc/?id=7022764
> >>
> >
> >
> >Thanks for the suggestions. Any thoughts on timeouts for DRBD?
> >
> >--Eric
> >
>
> Hi Eric,
>
> The timeouts can be treated as 'how much time to wait before taking any
> action'. The workload is not very important (HANA is something different).
>
> You can try with 10 s (token), 12 s (consensus), and adjust if needed.
>
> Warning: Use a 3-node cluster, or at least 2 DRBD nodes + a qdisk. A 2-node
> cluster is vulnerable to split brain, especially when one of the nodes is
> syncing (for example, after patching) and the source is
> fenced/lost/disconnected. It's very hard to extract data from a semi-synced
> DRBD.
>
> Also, if you need guidance for SELinux, I can point you to my guide in the
> CentOS forum.
>
> Best Regards,
> Strahil Nikolov


Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

2020-02-06 Thread Strahil Nikolov
On February 6, 2020 7:35:53 PM GMT+02:00, Eric Robinson 
 wrote:
>Hi Nikolov --
>
>> Defaults are 1 s token, 1.2 s consensus, which is too small.
>> In SUSE, token is 10 s, while consensus is 1.2 * token -> 12 s.
>> With these settings, the cluster will not react for 22 s.
>>
>> I think it's a good start for your cluster.
>> Don't forget to put the cluster in maintenance (pcs property set
>> maintenance-mode=true) before restarting the stack, or even better -
>> get some downtime.
>>
>> You can use the following article to run a simulation before removing
>> the maintenance:
>> https://www.suse.com/support/kb/doc/?id=7022764
>>
>
>
>Thanks for the suggestions. Any thoughts on timeouts for DRBD?
>
>--Eric
>

Hi Eric,

The timeouts can be treated as 'how much time to wait before taking any
action'. The workload is not very important (HANA is something different).

You can try with 10 s (token), 12 s (consensus), and adjust if needed.
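A rough sketch of what that tuning might look like in practice (values from
this thread; the totem lines go into /etc/corosync/corosync.conf on every
node, and the surrounding steps follow the maintenance advice quoted above -
command names assume pcs, as used elsewhere in this thread):

totem {
    ...
    token: 10000        # ms; how long without the token before a node is suspected
    consensus: 12000    # ms; 1.2 * token, time allowed to agree on a new membership
}

pcs property set maintenance-mode=true    # freeze resource management first
# ... restart corosync/pacemaker on each node, or take downtime ...
crm_simulate -sL                          # dry-run: what would the cluster do now?
pcs property set maintenance-mode=false   # only after the simulation looks sane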

Warning: Use a 3-node cluster, or at least 2 DRBD nodes + a qdisk. A 2-node
cluster is vulnerable to split brain, especially when one of the nodes is
syncing (for example, after patching) and the source is
fenced/lost/disconnected. It's very hard to extract data from a semi-synced
DRBD.
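If a full third node is not an option, a corosync-qdevice arbiter can supply
the tie-breaking vote; a minimal sketch of the quorum section (the hostname
is a placeholder, and note that two_node: 1 must be removed when a quorum
device is configured):

quorum {
    provider: corosync_votequorum
    device {
        model: net
        net {
            host: qnetd.example.com    # third machine running corosync-qnetd
            algorithm: ffsplit         # tie-breaker intended for 50:50 splits
        }
    }
}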

Also, if you need guidance for SELinux, I can point you to my guide in the
CentOS forum.

Best Regards,
Strahil Nikolov


Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

2020-02-06 Thread Eric Robinson
Hi Nikolov --

> Defaults are 1 s token, 1.2 s consensus, which is too small.
> In SUSE, token is 10 s, while consensus is 1.2 * token -> 12 s.
> With these settings, the cluster will not react for 22 s.
>
> I think it's a good start for your cluster.
> Don't forget to put the cluster in maintenance (pcs property set
> maintenance-mode=true) before restarting the stack, or even better - get
> some downtime.
>
> You can use the following article to run a simulation before removing the
> maintenance:
> https://www.suse.com/support/kb/doc/?id=7022764
>


Thanks for the suggestions. Any thoughts on timeouts for DRBD?
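For reference, DRBD's connection-loss behavior is tuned in the resource's net
section; a sketch showing DRBD 8.4's documented defaults (the resource name is
a placeholder, and the values are for orientation, not a recommendation):

resource r0 {
    net {
        timeout      60;    # units of 0.1 s -> 6 s without an ack and the peer is declared dead
        ping-int     10;    # seconds between keep-alive pings on an idle connection
        ping-timeout  5;    # units of 0.1 s -> 0.5 s allowed for a ping answer
        connect-int  10;    # seconds between reconnection attempts
    }
}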

--Eric



Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

2020-02-05 Thread Strahil Nikolov
On February 6, 2020 4:18:15 AM GMT+02:00, Eric Robinson 
 wrote:
>Hi Strahil –
>
>I think you may be right about the token timeouts being too short. I’ve
>also noticed that periods of high load can cause DRBD to disconnect.
>What would you recommend for changes to the timeouts?
>
>I’m running Red Hat’s Corosync Cluster Engine, version 2.4.3. The
>config is relatively simple.
>
>Corosync config looks like this…
>
>totem {
>version: 2
>cluster_name: 001db01ab
>secauth: off
>transport: udpu
>}
>
>nodelist {
>node {
>ring0_addr: 001db01a
>nodeid: 1
>}
>
>node {
>ring0_addr: 001db01b
>nodeid: 2
>}
>}
>
>quorum {
>provider: corosync_votequorum
>two_node: 1
>}
>
>logging {
>to_logfile: yes
>logfile: /var/log/cluster/corosync.log
>to_syslog: yes
>}
>
>
>From: Users  On Behalf Of Strahil
>Nikolov
>Sent: Wednesday, February 5, 2020 6:39 PM
>To: Cluster Labs - All topics related to open-source clustering
>welcomed ; Andrei Borzenkov
>
>Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
>
>Hi Andrei,
>
>don't trust Azure so much :D. I've seen stuff that was way more
>unbelievable.
>Can you check whether other systems in the same subnet reported any issues?
>Yet, pcs most probably won't report any short-term issues. I have
>noticed that the RHEL7 defaults for token and consensus are quite small, and
>any short-term disruption could cause an issue.
>Actually, when I tested live migration on oVirt, the other hosts fenced
>the node that was being migrated.
>What is your corosync config and OS version?
>
>Best Regards,
>Strahil Nikolov
>
>On Thursday, February 6, 2020 at 01:44:55 GMT+2, Eric Robinson
><eric.robin...@psmnv.com> wrote:
>
>
>
>Hi Strahil –
>
>
>
>I can’t prove there was no network loss, but:
>
>
>
>  1.  There were no dmesg indications of ethernet link loss.
>  2.  Other than corosync, there are no other log messages about
>      connectivity issues.
>  3.  Wouldn’t pcsd say something about connectivity loss?
>  4.  Both servers are in Azure.
>  5.  There are many other servers in the same Azure subscription,
>      including other corosync clusters, none of which had issues.
>
>
>
>So I guess it’s possible, but it seems unlikely.
>
>
>
>--Eric
>
>
>
>From: Users <users-boun...@clusterlabs.org>
>On Behalf Of Strahil Nikolov
>Sent: Wednesday, February 5, 2020 3:13 PM
>To: Cluster Labs - All topics related to open-source clustering
>welcomed <users@clusterlabs.org>; Andrei
>Borzenkov <arvidj...@gmail.com>
>Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
>
>
>
>Hi Eric,
>
>
>
>what has led you to think that there was no network loss?
>
>
>
>Best Regards,
>
>Strahil Nikolov
>
>
>
>On Wednesday, February 5, 2020 at 22:59:56 GMT+2, Eric Robinson
><eric.robin...@psmnv.com> wrote:
>
>
>
>
>
>> -----Original Message-----
>> From: Users <users-boun...@clusterlabs.org>
>> On Behalf Of Strahil Nikolov
>> Sent: Wednesday, February 5, 2020 1:59 PM
>> To: Andrei Borzenkov <arvidj...@gmail.com>;
>> users@clusterlabs.org
>> Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
>>
>> On February 5, 2020 8:14:06 PM GMT+02:00, Andrei Borzenkov
>> <arvidj...@gmail.com> wrote:
>> >On 05.02.2020 20:55, Eric Robinson wrote:
>> >> The two servers 001db01a and 001db01b were up and responsive. Neither
>> >> had been rebooted and neither was under heavy load. There's no
>> >> indication in the logs of loss of network connectivity. Any ideas on
>> >> why both nodes seem to think the other one is at fault?
>> >
>> >The very fact that nodes lost connection to each other *is* indication
>> >of network problems. Your logs start too late, after any problem
>> >already happened.
>> >
>> >>
>> >> (Yes, it's a 2-node cluster without quorum. A 3-node cluster is not
>> >> an option at this time.)
>> >>
>> >> Log from 001db01a:
>> >>
> >> Feb  5 08:01:02 001db01a corosync[1306]: [TOTEM ] A processor failed,
> >> forming new configuration.
> >> Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] A new membership
> >> (10.51.14.33:960) was formed. Members left: 2
> >> Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] Failed to receive
> >> the leave message.

Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

2020-02-05 Thread Eric Robinson
Hi Strahil –

I think you may be right about the token timeouts being too short. I’ve also 
noticed that periods of high load can cause DRBD to disconnect. What would you
recommend for changes to the timeouts?

I’m running Red Hat’s Corosync Cluster Engine, version 2.4.3. The config is 
relatively simple.

Corosync config looks like this…

totem {
version: 2
cluster_name: 001db01ab
secauth: off
transport: udpu
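# note: token and consensus are not set, so the corosync defaults
# apply (token 1000 ms, consensus 1200 ms - see the replies below)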
}

nodelist {
node {
ring0_addr: 001db01a
nodeid: 1
}

node {
ring0_addr: 001db01b
nodeid: 2
}
}

quorum {
provider: corosync_votequorum
two_node: 1
}

logging {
to_logfile: yes
logfile: /var/log/cluster/corosync.log
to_syslog: yes
}


From: Users  On Behalf Of Strahil Nikolov
Sent: Wednesday, February 5, 2020 6:39 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 
; Andrei Borzenkov 
Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

Hi Andrei,

don't trust Azure so much :D. I've seen stuff that was way more unbelievable.
Can you check whether other systems in the same subnet reported any issues? Yet,
pcs most probably won't report any short-term issues. I have noticed that the
RHEL7 defaults for token and consensus are quite small, and any short-term
disruption could cause an issue.
Actually, when I tested live migration on oVirt, the other hosts fenced the
node that was being migrated.
What is your corosync config and OS version?
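(A quick way to confirm the effective values on a running node - key names per
corosync 2.x cmap, output illustrative:)

corosync-cmapctl | grep runtime.config.totem
runtime.config.totem.consensus (u32) = 1200
runtime.config.totem.token (u32) = 1000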

Best Regards,
Strahil Nikolov

On Thursday, February 6, 2020 at 01:44:55 GMT+2, Eric Robinson
<eric.robin...@psmnv.com> wrote:



Hi Strahil –



I can’t prove there was no network loss, but:



  1.  There were no dmesg indications of ethernet link loss.
  2.  Other than corosync, there are no other log messages about connectivity
      issues.
  3.  Wouldn’t pcsd say something about connectivity loss?
  4.  Both servers are in Azure.
  5.  There are many other servers in the same Azure subscription, including
      other corosync clusters, none of which had issues.



So I guess it’s possible, but it seems unlikely.



--Eric



From: Users <users-boun...@clusterlabs.org> On Behalf
Of Strahil Nikolov
Sent: Wednesday, February 5, 2020 3:13 PM
To: Cluster Labs - All topics related to open-source clustering welcomed
<users@clusterlabs.org>; Andrei Borzenkov
<arvidj...@gmail.com>
Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?



Hi Eric,



what has led you to think that there was no network loss?



Best Regards,

Strahil Nikolov



On Wednesday, February 5, 2020 at 22:59:56 GMT+2, Eric Robinson
<eric.robin...@psmnv.com> wrote:





> -----Original Message-----
> From: Users <users-boun...@clusterlabs.org> On
> Behalf Of Strahil Nikolov
> Sent: Wednesday, February 5, 2020 1:59 PM
> To: Andrei Borzenkov <arvidj...@gmail.com>;
> users@clusterlabs.org
> Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
>
> On February 5, 2020 8:14:06 PM GMT+02:00, Andrei Borzenkov
> <arvidj...@gmail.com> wrote:
> >On 05.02.2020 20:55, Eric Robinson wrote:
> >> The two servers 001db01a and 001db01b were up and responsive. Neither
> >> had been rebooted and neither was under heavy load. There's no
> >> indication in the logs of loss of network connectivity. Any ideas on
> >> why both nodes seem to think the other one is at fault?
> >
> >The very fact that nodes lost connection to each other *is* indication
> >of network problems. Your logs start too late, after any problem
> >already happened.
> >
> >>
> >> (Yes, it's a 2-node cluster without quorum. A 3-node cluster is not
> >> an option at this time.)
> >>
> >> Log from 001db01a:
> >>
> >> Feb  5 08:01:02 001db01a corosync[1306]: [TOTEM ] A processor failed,
> >> forming new configuration.
> >> Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] A new membership
> >> (10.51.14.33:960) was formed. Members left: 2
> >> Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] Failed to receive
> >> the leave message. failed: 2
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Node 001db01b state is
> >> now lost
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Removing all 001db01b
> >> attributes for peer loss
> >> Feb  5 08:01:03 001db01a cib[1522]:  notice: Node 001db01b state is
> >> now lost
> >> Feb  5 08:01:03 001db01a cib[1522]:  notice: Purged 1 peer with id=2
> >> and/or uname=001db01b from the membership cache
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Purged 1 peer with
> >> id=2 and/or uname=001db01b from the membership cache
> >> Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expe

Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

2020-02-05 Thread Eric Robinson
Hi Strahil –

I can’t prove there was no network loss, but:


  1.  There were no dmesg indications of ethernet link loss.
  2.  Other than corosync, there are no other log messages about connectivity
      issues.
  3.  Wouldn’t pcsd say something about connectivity loss?
  4.  Both servers are in Azure.
  5.  There are many other servers in the same Azure subscription, including
      other corosync clusters, none of which had issues.

So I guess it’s possible, but it seems unlikely.

--Eric

From: Users  On Behalf Of Strahil Nikolov
Sent: Wednesday, February 5, 2020 3:13 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 
; Andrei Borzenkov 
Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

Hi Eric,

what has led you to think that there was no network loss?

Best Regards,
Strahil Nikolov

On Wednesday, February 5, 2020 at 22:59:56 GMT+2, Eric Robinson
<eric.robin...@psmnv.com> wrote:



> -----Original Message-----
> From: Users <users-boun...@clusterlabs.org> On
> Behalf Of Strahil Nikolov
> Sent: Wednesday, February 5, 2020 1:59 PM
> To: Andrei Borzenkov <arvidj...@gmail.com>;
> users@clusterlabs.org
> Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
>
> On February 5, 2020 8:14:06 PM GMT+02:00, Andrei Borzenkov
> <arvidj...@gmail.com> wrote:
> >On 05.02.2020 20:55, Eric Robinson wrote:
> >> The two servers 001db01a and 001db01b were up and responsive. Neither
> >> had been rebooted and neither was under heavy load. There's no
> >> indication in the logs of loss of network connectivity. Any ideas on
> >> why both nodes seem to think the other one is at fault?
> >
> >The very fact that nodes lost connection to each other *is* indication
> >of network problems. Your logs start too late, after any problem
> >already happened.
> >
> >>
> >> (Yes, it's a 2-node cluster without quorum. A 3-node cluster is not
> >> an option at this time.)
> >>
> >> Log from 001db01a:
> >>
> >> Feb  5 08:01:02 001db01a corosync[1306]: [TOTEM ] A processor failed,
> >> forming new configuration.
> >> Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] A new membership
> >> (10.51.14.33:960) was formed. Members left: 2
> >> Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] Failed to receive
> >> the leave message. failed: 2
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Node 001db01b state is
> >> now lost
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Removing all 001db01b
> >> attributes for peer loss
> >> Feb  5 08:01:03 001db01a cib[1522]:  notice: Node 001db01b state is
> >> now lost
> >> Feb  5 08:01:03 001db01a cib[1522]:  notice: Purged 1 peer with id=2
> >> and/or uname=001db01b from the membership cache
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Purged 1 peer with
> >> id=2 and/or uname=001db01b from the membership cache
> >> Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expect
> >> node 2 to be down
> >> Feb  5 08:01:03 001db01a stonith-ng[1523]:  notice: Node 001db01b
> >> state is now lost
> >> Feb  5 08:01:03 001db01a crmd[1527]:  notice: Stonith/shutdown of
> >> 001db01b not matched
> >> Feb  5 08:01:03 001db01a corosync[1306]: [QUORUM] Members[1]: 1
> >> Feb  5 08:01:03 001db01a corosync[1306]: [MAIN  ] Completed service
> >> synchronization, ready to provide service.
> >> Feb  5 08:01:03 001db01a stonith-ng[1523]:  notice: Purged 1 peer
> >> with id=2 and/or uname=001db01b from the membership cache
> >> Feb  5 08:01:03 001db01a pacemakerd[1491]:  notice: Node 001db01b
> >> state is now lost
> >> Feb  5 08:01:03 001db01a crmd[1527]:  notice: State transition S_IDLE
> >> -> S_POLICY_ENGINE
> >> Feb  5 08:01:03 001db01a crmd[1527]:  notice: Node 001db01b state is
> >> now lost
> >> Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expect
> >> node 2 to be down
> >> Feb  5 08:01:03 001db01a crmd[1527]:  notice: Stonith/shutdown of
> >> 001db01b not matched
> >> Feb  5 08:01:03 001db01a pengine[1526]:  notice: On loss of CCM
> >> Quorum: Ignore
> >>
> >> From 001db01b:
> >>
> >> Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] A new membership
> >> (10.51.14.34:960) was formed. Members left: 1
> >> Feb  5 08:01:03 001db01b crmd[1693]:  notice: Our peer on the DC
> >> (001db01a) is dead
> >> Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: Node 001db01a
> >> state is now lost
> >> Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] Failed to receive
> >> the leave message. fai

Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

2020-02-05 Thread Eric Robinson

> -----Original Message-----
> From: Users  On Behalf Of Strahil Nikolov
> Sent: Wednesday, February 5, 2020 1:59 PM
> To: Andrei Borzenkov ; users@clusterlabs.org
> Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
>
> On February 5, 2020 8:14:06 PM GMT+02:00, Andrei Borzenkov
>  wrote:
> >On 05.02.2020 20:55, Eric Robinson wrote:
> >> The two servers 001db01a and 001db01b were up and responsive. Neither
> >> had been rebooted and neither was under heavy load. There's no
> >> indication in the logs of loss of network connectivity. Any ideas on
> >> why both nodes seem to think the other one is at fault?
> >
> >The very fact that nodes lost connection to each other *is* indication
> >of network problems. Your logs start too late, after any problem
> >already happened.
> >
> >>
> >> (Yes, it's a 2-node cluster without quorum. A 3-node cluster is not
> >> an option at this time.)
> >>
> >> Log from 001db01a:
> >>
> >> Feb  5 08:01:02 001db01a corosync[1306]: [TOTEM ] A processor failed,
> >> forming new configuration.
> >> Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] A new membership
> >> (10.51.14.33:960) was formed. Members left: 2
> >> Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] Failed to receive
> >> the leave message. failed: 2
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Node 001db01b state is
> >> now lost
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Removing all 001db01b
> >> attributes for peer loss
> >> Feb  5 08:01:03 001db01a cib[1522]:  notice: Node 001db01b state is
> >> now lost
> >> Feb  5 08:01:03 001db01a cib[1522]:  notice: Purged 1 peer with id=2
> >> and/or uname=001db01b from the membership cache
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Purged 1 peer with
> >> id=2 and/or uname=001db01b from the membership cache
> >> Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expect
> >> node 2 to be down
> >> Feb  5 08:01:03 001db01a stonith-ng[1523]:  notice: Node 001db01b
> >> state is now lost
> >> Feb  5 08:01:03 001db01a crmd[1527]:  notice: Stonith/shutdown of
> >> 001db01b not matched
> >> Feb  5 08:01:03 001db01a corosync[1306]: [QUORUM] Members[1]: 1
> >> Feb  5 08:01:03 001db01a corosync[1306]: [MAIN  ] Completed service
> >> synchronization, ready to provide service.
> >> Feb  5 08:01:03 001db01a stonith-ng[1523]:  notice: Purged 1 peer
> >> with id=2 and/or uname=001db01b from the membership cache
> >> Feb  5 08:01:03 001db01a pacemakerd[1491]:  notice: Node 001db01b
> >> state is now lost
> >> Feb  5 08:01:03 001db01a crmd[1527]:  notice: State transition S_IDLE
> >> -> S_POLICY_ENGINE
> >> Feb  5 08:01:03 001db01a crmd[1527]:  notice: Node 001db01b state is
> >> now lost
> >> Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expect
> >> node 2 to be down
> >> Feb  5 08:01:03 001db01a crmd[1527]:  notice: Stonith/shutdown of
> >> 001db01b not matched
> >> Feb  5 08:01:03 001db01a pengine[1526]:  notice: On loss of CCM
> >> Quorum: Ignore
> >>
> >> From 001db01b:
> >>
> >> Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] A new membership
> >> (10.51.14.34:960) was formed. Members left: 1
> >> Feb  5 08:01:03 001db01b crmd[1693]:  notice: Our peer on the DC
> >> (001db01a) is dead
> >> Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: Node 001db01a
> >> state is now lost
> >> Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] Failed to receive
> >> the leave message. failed: 1
> >> Feb  5 08:01:03 001db01b corosync[1455]: [QUORUM] Members[1]: 2
> >> Feb  5 08:01:03 001db01b corosync[1455]: [MAIN  ] Completed service
> >> synchronization, ready to provide service.
> >> Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: Purged 1 peer
> >> with id=1 and/or uname=001db01a from the membership cache
> >> Feb  5 08:01:03 001db01b pacemakerd[1678]:  notice: Node 001db01a
> >> state is now lost
> >> Feb  5 08:01:03 001db01b crmd[1693]:  notice: State transition
> >> S_NOT_DC -> S_ELECTION
> >> Feb  5 08:01:03 001db01b crmd[1693]:  notice: Node 001db01a state is
> >> now lost
> >> Feb  5 08:01:03 001db01b attrd[1691]:  notice: Node 001db01a state is
> >> now lost
> >> Feb  5 08:01:03 001db01b attrd[1691]:  notice: Removing all 001db01a
> >> attributes for peer loss
> >> Feb  5 08:01:03 001db01b attrd[1691]:  notice: Lost attribute writer
> >> 001db01a
> >> Feb  5 08:01:03 001db01b attrd[1691]:  notice: Purged 1 peer with
> >> id=1 and/or uname=00

Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

2020-02-05 Thread Eric Robinson




> -----Original Message-----
> From: Users  On Behalf Of Andrei
> Borzenkov
> Sent: Wednesday, February 5, 2020 12:14 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
>
> On 05.02.2020 20:55, Eric Robinson wrote:
> > The two servers 001db01a and 001db01b were up and responsive. Neither
> > had been rebooted and neither was under heavy load. There's no indication
> > in the logs of loss of network connectivity. Any ideas on why both nodes
> > seem to think the other one is at fault?
>
> The very fact that nodes lost connection to each other *is* indication of
> network problems. Your logs start too late, after any problem already
> happened.
>

All the log messages before those are just normal repetitive stuff that always 
gets logged, even during normal production. The snippet I provided shows the 
first indication of anything unusual. Also, there is no other indication of 
network connectivity loss, and both servers are in Azure.

> >
> > (Yes, it's a 2-node cluster without quorum. A 3-node cluster is not an
> > option at this time.)
> >
> > Log from 001db01a:
> >
> > Feb  5 08:01:02 001db01a corosync[1306]: [TOTEM ] A processor failed, forming new configuration.
> > Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] A new membership (10.51.14.33:960) was formed. Members left: 2
> > Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] Failed to receive the leave message. failed: 2
> > Feb  5 08:01:03 001db01a attrd[1525]:  notice: Node 001db01b state is now lost
> > Feb  5 08:01:03 001db01a attrd[1525]:  notice: Removing all 001db01b attributes for peer loss
> > Feb  5 08:01:03 001db01a cib[1522]:  notice: Node 001db01b state is now lost
> > Feb  5 08:01:03 001db01a cib[1522]:  notice: Purged 1 peer with id=2 and/or uname=001db01b from the membership cache
> > Feb  5 08:01:03 001db01a attrd[1525]:  notice: Purged 1 peer with id=2 and/or uname=001db01b from the membership cache
> > Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expect node 2 to be down
> > Feb  5 08:01:03 001db01a stonith-ng[1523]:  notice: Node 001db01b state is now lost
> > Feb  5 08:01:03 001db01a crmd[1527]:  notice: Stonith/shutdown of 001db01b not matched
> > Feb  5 08:01:03 001db01a corosync[1306]: [QUORUM] Members[1]: 1
> > Feb  5 08:01:03 001db01a corosync[1306]: [MAIN  ] Completed service synchronization, ready to provide service.
> > Feb  5 08:01:03 001db01a stonith-ng[1523]:  notice: Purged 1 peer with id=2 and/or uname=001db01b from the membership cache
> > Feb  5 08:01:03 001db01a pacemakerd[1491]:  notice: Node 001db01b state is now lost
> > Feb  5 08:01:03 001db01a crmd[1527]:  notice: State transition S_IDLE -> S_POLICY_ENGINE
> > Feb  5 08:01:03 001db01a crmd[1527]:  notice: Node 001db01b state is now lost
> > Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expect node 2 to be down
> > Feb  5 08:01:03 001db01a crmd[1527]:  notice: Stonith/shutdown of 001db01b not matched
> > Feb  5 08:01:03 001db01a pengine[1526]:  notice: On loss of CCM Quorum: Ignore
> >
> > From 001db01b:
> >
> > Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] A new membership (10.51.14.34:960) was formed. Members left: 1
> > Feb  5 08:01:03 001db01b crmd[1693]:  notice: Our peer on the DC (001db01a) is dead
> > Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: Node 001db01a state is now lost
> > Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] Failed to receive the leave message. failed: 1
> > Feb  5 08:01:03 001db01b corosync[1455]: [QUORUM] Members[1]: 2
> > Feb  5 08:01:03 001db01b corosync[1455]: [MAIN  ] Completed service synchronization, ready to provide service.
> > Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: Purged 1 peer with id=1 and/or uname=001db01a from the membership cache
> > Feb  5 08:01:03 001db01b pacemakerd[1678]:  notice: Node 001db01a state is now lost
> > Feb  5 08:01:03 001db01b crmd[1693]:  notice: State transition S_NOT_DC -> S_ELECTION
> > Feb  5 08:01:03 001db01b crmd[1693]:  notice: Node 001db01a state is now lost
> > Feb  5 08:01:03 001db01b attrd[1691]:  notice: Node 001db01a state is now lost
> > Feb  5 08:01:03 001db01b attrd[1691]:  notice: Removing all 001db01a attributes for peer loss
> > Feb  5 08:01:03 001db01b attrd[1691]:  notice: Lost attribute writer 001db01a
> > Feb  5 08:01:03 001db01b attrd[1691]:  notice: Purged 1 peer with id=1 and/or uname=001db01a from the membership cache
> > Feb  5 08:01:03 001db01b crmd[1693]:  notice: State transition S_ELECTION -> S_INTEGRATION
> > Feb  5 08:01:03 001db01b cib[1688

Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

2020-02-05 Thread Andrei Borzenkov
On 05.02.2020 20:55, Eric Robinson wrote:
> The two servers 001db01a and 001db01b were up and responsive. Neither had
> been rebooted and neither was under heavy load. There's no indication in the
> logs of loss of network connectivity. Any ideas on why both nodes seem to
> think the other one is at fault?

The very fact that nodes lost connection to each other *is* indication
of network problems. Your logs start too late, after any problem already
happened.

> 
> (Yes, it's a 2-node cluster without quorum. A 3-node cluster is not an option 
> at this time.)
> 
> Log from 001db01a:
> 
> Feb  5 08:01:02 001db01a corosync[1306]: [TOTEM ] A processor failed, forming 
> new configuration.
> Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] A new membership 
> (10.51.14.33:960) was formed. Members left: 2
> Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] Failed to receive the leave 
> message. failed: 2
> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Node 001db01b state is now lost
> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Removing all 001db01b 
> attributes for peer loss
> Feb  5 08:01:03 001db01a cib[1522]:  notice: Node 001db01b state is now lost
> Feb  5 08:01:03 001db01a cib[1522]:  notice: Purged 1 peer with id=2 and/or 
> uname=001db01b from the membership cache
> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Purged 1 peer with id=2 and/or 
> uname=001db01b from the membership cache
> Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expect node 2 to 
> be down
> Feb  5 08:01:03 001db01a stonith-ng[1523]:  notice: Node 001db01b state is 
> now lost
> Feb  5 08:01:03 001db01a crmd[1527]:  notice: Stonith/shutdown of 001db01b 
> not matched
> Feb  5 08:01:03 001db01a corosync[1306]: [QUORUM] Members[1]: 1
> Feb  5 08:01:03 001db01a corosync[1306]: [MAIN  ] Completed service 
> synchronization, ready to provide service.
> Feb  5 08:01:03 001db01a stonith-ng[1523]:  notice: Purged 1 peer with id=2 
> and/or uname=001db01b from the membership cache
> Feb  5 08:01:03 001db01a pacemakerd[1491]:  notice: Node 001db01b state is 
> now lost
> Feb  5 08:01:03 001db01a crmd[1527]:  notice: State transition S_IDLE -> 
> S_POLICY_ENGINE
> Feb  5 08:01:03 001db01a crmd[1527]:  notice: Node 001db01b state is now lost
> Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expect node 2 to 
> be down
> Feb  5 08:01:03 001db01a crmd[1527]:  notice: Stonith/shutdown of 001db01b 
> not matched
> Feb  5 08:01:03 001db01a pengine[1526]:  notice: On loss of CCM Quorum: Ignore
> 
> From 001db01b:
> 
> Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] A new membership 
> (10.51.14.34:960) was formed. Members left: 1
> Feb  5 08:01:03 001db01b crmd[1693]:  notice: Our peer on the DC (001db01a) 
> is dead
> Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: Node 001db01a state is 
> now lost
> Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] Failed to receive the leave 
> message. failed: 1
> Feb  5 08:01:03 001db01b corosync[1455]: [QUORUM] Members[1]: 2
> Feb  5 08:01:03 001db01b corosync[1455]: [MAIN  ] Completed service 
> synchronization, ready to provide service.
> Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: Purged 1 peer with id=1 
> and/or uname=001db01a from the membership cache
> Feb  5 08:01:03 001db01b pacemakerd[1678]:  notice: Node 001db01a state is 
> now lost
> Feb  5 08:01:03 001db01b crmd[1693]:  notice: State transition S_NOT_DC -> 
> S_ELECTION
> Feb  5 08:01:03 001db01b crmd[1693]:  notice: Node 001db01a state is now lost
> Feb  5 08:01:03 001db01b attrd[1691]:  notice: Node 001db01a state is now lost
> Feb  5 08:01:03 001db01b attrd[1691]:  notice: Removing all 001db01a 
> attributes for peer loss
> Feb  5 08:01:03 001db01b attrd[1691]:  notice: Lost attribute writer 001db01a
> Feb  5 08:01:03 001db01b attrd[1691]:  notice: Purged 1 peer with id=1 and/or 
> uname=001db01a from the membership cache
> Feb  5 08:01:03 001db01b crmd[1693]:  notice: State transition S_ELECTION -> 
> S_INTEGRATION
> Feb  5 08:01:03 001db01b cib[1688]:  notice: Node 001db01a state is now lost
> Feb  5 08:01:03 001db01b cib[1688]:  notice: Purged 1 peer with id=1 and/or 
> uname=001db01a from the membership cache
> Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: [cib_diff_notify] Patch 
> aborted: Application of an update diff failed (-206)
> Feb  5 08:01:03 001db01b crmd[1693]: warning: Input I_ELECTION_DC received in 
> state S_INTEGRATION from do_election_check
> Feb  5 08:01:03 001db01b pengine[1692]:  notice: On loss of CCM Quorum: Ignore
> 
> 
> -Eric