[ClusterLabs] Antw: [EXT] Re: Why Do Nodes Leave the Cluster?

2020-02-05 Thread Ulrich Windl
>>> Eric Robinson  wrote on 05.02.2020 at 21:59 in message
<4849_1580936395_5E3B2CCA_4849_709_1_MN2PR03MB4845D4B66D794C4AF58DF2E3FA020@MN2P
03MB4845.namprd03.prod.outlook.com>:

[...]
> 
> I've done that with all my other clusters, but these two servers are in 
> Azure, so the network is out of our control.

Is a normal cluster even supported when running corosync over the Internet? I'm 
not sure, because of the delays and possible packet losses.

> 
>> If not, you can easily set them up without downtime.
>>
>> Also, are you using  multicast or unicast ?
>>
> 
> Unicast, as Azure does not support multicast.

The Internet should support multicast, but you would have to use registered 
addresses and ports, I'm afraid.

[...]

Ulrich


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

2020-02-05 Thread Strahil Nikolov
On February 6, 2020 4:18:15 AM GMT+02:00, Eric Robinson 
 wrote:
>Hi Strahil –
>
>I think you may be right about the token timeouts being too short. I’ve
>also noticed that periods of high load can cause drbd to disconnect.
>What would you recommend for changes to the timeouts?
>
>I’m running Red Hat’s Corosync Cluster Engine, version 2.4.3. The
>config is relatively simple.
>
>Corosync config looks like this…
>
>totem {
>version: 2
>cluster_name: 001db01ab
>secauth: off
>transport: udpu
>}
>
>nodelist {
>node {
>ring0_addr: 001db01a
>nodeid: 1
>}
>
>node {
>ring0_addr: 001db01b
>nodeid: 2
>}
>}
>
>quorum {
>provider: corosync_votequorum
>two_node: 1
>}
>
>logging {
>to_logfile: yes
>logfile: /var/log/cluster/corosync.log
>to_syslog: yes
>}
>
>
>From: Users  On Behalf Of Strahil
>Nikolov
>Sent: Wednesday, February 5, 2020 6:39 PM
>To: Cluster Labs - All topics related to open-source clustering
>welcomed ; Andrei Borzenkov
>
>Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
>
>Hi Andrei,
>
>don't trust Azure so much :D . I've seen stuff that was way more
>unbelievable.
>Can you check whether other systems in the same subnet reported any
>issues? That said, pcs most probably won't report any short-term
>issues. I have noticed that the RHEL7 defaults for token and consensus
>are quite small, and any short-term disruption could cause an issue.
>In fact, when I tested live migration on oVirt, the other hosts fenced
>the node that was being migrated.
>What are your corosync config and OS version?
>
>Best Regards,
>Strahil Nikolov
>
>On Thursday, February 6, 2020 at 01:44:55 GMT+2, Eric Robinson
><eric.robin...@psmnv.com> wrote:
>
>
>
>Hi Strahil –
>
>
>
>I can’t prove there was no network loss, but:
>
>
>
>  1.  There were no dmesg indications of ethernet link loss.
>  2.  Other than corosync, there are no other log messages about
>connectivity issues.
>  3.  Wouldn’t pcsd say something about connectivity loss?
>  4.  Both servers are in Azure.
>  5.  There are many other servers in the same Azure subscription,
>including other corosync clusters, none of which had issues.
>
>
>
>So I guess it’s possible, but it seems unlikely.
>
>
>
>--Eric
>
>
>
>From: Users
>mailto:users-boun...@clusterlabs.org>>
>On Behalf Of Strahil Nikolov
>Sent: Wednesday, February 5, 2020 3:13 PM
>To: Cluster Labs - All topics related to open-source clustering
>welcomed mailto:users@clusterlabs.org>>; Andrei
>Borzenkov mailto:arvidj...@gmail.com>>
>Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
>
>
>
>Hi Eric,
>
>
>
>what has led you to think that there was no network loss ?
>
>
>
>Best Regards,
>
>Strahil Nikolov
>
>
>
>On Wednesday, February 5, 2020 at 22:59:56 GMT+2, Eric Robinson
><eric.robin...@psmnv.com> wrote:
>
>
>
>
>
>> -Original Message-
>> From: Users
>mailto:users-boun...@clusterlabs.org>>
>On Behalf Of Strahil Nikolov
>> Sent: Wednesday, February 5, 2020 1:59 PM
>> To: Andrei Borzenkov
>mailto:arvidj...@gmail.com>>;
>users@clusterlabs.org
>> Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
>>
>> On February 5, 2020 8:14:06 PM GMT+02:00, Andrei Borzenkov
>> mailto:arvidj...@gmail.com>> wrote:
>> >05.02.2020 20:55, Eric Robinson wrote:
>> >> The two servers 001db01a and 001db01b were up and responsive.
>Neither
>> >had been rebooted and neither were under heavy load. There's no
>> >indication in the logs of loss of network connectivity. Any ideas on
>> >why both nodes seem to think the other one is at fault?
>> >
>> >The very fact that nodes lost connection to each other *is*
>indication
>> >of network problems. Your logs start too late, after any problem
>> >already happened.
>> >
>> >>
>> >> (Yes, it's a 2-node cluster without quorum. A 3-node cluster is
>not
>> >an option at this time.)
>> >>
>> >> Log from 001db01a:
>> >>
>> >> Feb  5 08:01:02 001db01a corosync[1306]: [TOTEM ] A processor
>failed,
>> >forming new configuration.
>> >> Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] A new membership
>> >(10.51.14.33:960) was formed. Members left: 2
>> >> Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] Failed to
>receive
>> >the leave message. failed: 2
>> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Node 001db01b state
>is
>> >now lost
>> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Removing all
>001db01b
>> >attributes for peer loss
>> >> Feb  5 08:01:03 001db01a cib[1522]:  notice: Node 001db01b state
>is
>> >now lost
>> >> Feb  5 08:01:03 001db01a cib[1522]:  notice: Purged 1 peer with
>id=2
>> >and/or uname=001db01b from the membership cache
>> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Purged 1 peer with
>> >id=2 and/or uname=001db01b from the membership cache
>> >> Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expect
>> >node 2 to be down
>> >> Feb  5 08:01:03 001db01a stonith-ng[1523]:  notice: Node 001db01b
>> >state is now lost
>> >> 

Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

2020-02-05 Thread Eric Robinson
Hi Strahil –

I think you may be right about the token timeouts being too short. I’ve also 
noticed that periods of high load can cause drbd to disconnect. What would you 
recommend for changes to the timeouts?

I’m running Red Hat’s Corosync Cluster Engine, version 2.4.3. The config is 
relatively simple.

Corosync config looks like this…

totem {
version: 2
cluster_name: 001db01ab
secauth: off
transport: udpu
}

nodelist {
node {
ring0_addr: 001db01a
nodeid: 1
}

node {
ring0_addr: 001db01b
nodeid: 2
}
}

quorum {
provider: corosync_votequorum
two_node: 1
}

logging {
to_logfile: yes
logfile: /var/log/cluster/corosync.log
to_syslog: yes
}
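
If the fix is just raising the timeouts, is it a matter of adding explicit
token/consensus values to the totem section? Something like the sketch below
is what I had in mind -- the numbers are only my guesses, not values anyone
here has recommended:

totem {
        version: 2
        cluster_name: 001db01ab
        secauth: off
        transport: udpu
        # if I recall correctly, the corosync 2.x default token is 1000 ms;
        # raising it makes membership more tolerant of short network or CPU
        # stalls, at the cost of slower failure detection
        token: 10000
        # assumption: consensus defaults to 1.2 * token when not set
        consensus: 12000
}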


From: Users  On Behalf Of Strahil Nikolov
Sent: Wednesday, February 5, 2020 6:39 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 
; Andrei Borzenkov 
Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

Hi Andrei,

don't trust Azure so much :D . I've seen stuff that was way more unbelievable.
Can you check whether other systems in the same subnet reported any issues? 
That said, pcs most probably won't report any short-term issues. I have noticed 
that the RHEL7 defaults for token and consensus are quite small, and any 
short-term disruption could cause an issue.
In fact, when I tested live migration on oVirt, the other hosts fenced the node 
that was being migrated.
What are your corosync config and OS version?
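
For reference -- if I remember right, on corosync 2.x the effective runtime
values can be queried with corosync-cmapctl; the key names below are from
memory, so double-check them on your version:

# token and consensus timeouts corosync is actually using, in ms
corosync-cmapctl -g runtime.config.totem.token
corosync-cmapctl -g runtime.config.totem.consensus
# or dump everything under the runtime totem config
corosync-cmapctl | grep runtime.config.totem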

Best Regards,
Strahil Nikolov

On Thursday, February 6, 2020 at 01:44:55 GMT+2, Eric Robinson 
<eric.robin...@psmnv.com> wrote:



Hi Strahil –



I can’t prove there was no network loss, but:



  1.  There were no dmesg indications of ethernet link loss.
  2.  Other than corosync, there are no other log messages about connectivity 
issues.
  3.  Wouldn’t pcsd say something about connectivity loss?
  4.  Both servers are in Azure.
  5.  There are many other servers in the same Azure subscription, including 
other corosync clusters, none of which had issues.



So I guess it’s possible, but it seems unlikely.



--Eric



From: Users <users-boun...@clusterlabs.org> On Behalf Of Strahil Nikolov
Sent: Wednesday, February 5, 2020 3:13 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 
<users@clusterlabs.org>; Andrei Borzenkov <arvidj...@gmail.com>
Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?



Hi Eric,



what has led you to think that there was no network loss ?



Best Regards,

Strahil Nikolov



On Wednesday, February 5, 2020 at 22:59:56 GMT+2, Eric Robinson 
<eric.robin...@psmnv.com> wrote:





> -Original Message-
> From: Users 
> mailto:users-boun...@clusterlabs.org>> On 
> Behalf Of Strahil Nikolov
> Sent: Wednesday, February 5, 2020 1:59 PM
> To: Andrei Borzenkov mailto:arvidj...@gmail.com>>; 
> users@clusterlabs.org
> Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
>
> On February 5, 2020 8:14:06 PM GMT+02:00, Andrei Borzenkov
> mailto:arvidj...@gmail.com>> wrote:
> >05.02.2020 20:55, Eric Robinson wrote:
> >> The two servers 001db01a and 001db01b were up and responsive. Neither
> >had been rebooted and neither were under heavy load. There's no
> >indication in the logs of loss of network connectivity. Any ideas on
> >why both nodes seem to think the other one is at fault?
> >
> >The very fact that nodes lost connection to each other *is* indication
> >of network problems. Your logs start too late, after any problem
> >already happened.
> >
> >>
> >> (Yes, it's a 2-node cluster without quorum. A 3-node cluster is not
> >an option at this time.)
> >>
> >> Log from 001db01a:
> >>
> >> Feb  5 08:01:02 001db01a corosync[1306]: [TOTEM ] A processor failed,
> >forming new configuration.
> >> Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] A new membership
> >(10.51.14.33:960) was formed. Members left: 2
> >> Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] Failed to receive
> >the leave message. failed: 2
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Node 001db01b state is
> >now lost
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Removing all 001db01b
> >attributes for peer loss
> >> Feb  5 08:01:03 001db01a cib[1522]:  notice: Node 001db01b state is
> >now lost
> >> Feb  5 08:01:03 001db01a cib[1522]:  notice: Purged 1 peer with id=2
> >and/or uname=001db01b from the membership cache
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Purged 1 peer with
> >id=2 and/or uname=001db01b from the membership cache
> >> Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expect
> >node 2 to be down
> >> Feb  5 08:01:03 001db01a stonith-ng[1523]:  notice: Node 001db01b
> >state is now lost
> >> Feb  5 08:01:03 001db01a crmd[1527]:  notice: Stonith/shutdown of
> >001db01b not matched
> >> Feb  5 08:01:03 001db01a corosync[1306]: [QUORUM] Members[1]: 1 Feb
> >> 5 08:01:03 001db01a corosync[1306]: [MAIN  ] 

Re: [ClusterLabs] SBD on shared disk

2020-02-05 Thread Gang He
Hello Strahil,

This kind of configuration is not recommended.
Why? The SBD partition needs to be accessed by the cluster nodes reliably and 
frequently, but the other partition (for the XFS file system) will probably be 
under extreme I/O pressure at times; in that case, the SBD partition's I/O 
requests can starve and time out.
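
If you still want to try it, at least make sure the SBD on-disk timeouts are
generous enough to survive I/O stalls. A rough sketch, with illustrative
values and a hypothetical device path (not a tested recommendation):

# initialize the SBD metadata with explicit timeouts:
#   -1 <N>  watchdog timeout in seconds
#   -4 <N>  msgwait timeout in seconds, usually about twice the watchdog timeout
sbd -d /dev/disk/by-id/SHARED-LUN-part1 -1 30 -4 60 create

# verify what is actually stored on the device
sbd -d /dev/disk/by-id/SHARED-LUN-part1 dump

Pacemaker's stonith-timeout should then be set larger than msgwait, as far as
I know.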

Thanks
Gang 


From: Users  on behalf of Strahil Nikolov 

Sent: Thursday, February 6, 2020 5:15 AM
To: users@clusterlabs.org
Subject: [ClusterLabs] SBD on shared disk

Hello Community,

I'm preparing for my EX436 and I was wondering if there are any drawbacks if a 
shared LUN is split into 2 partitions and the first partition is used for SBD, 
while the second one is used for a shared file system (either XFS for 
active/passive, or GFS2 for active/active).

Do you see any drawbacks in such an implementation?
Thanks in advance.

Best Regards,
Strahil Nikolov
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] A note for digimer re: qdevice documentation

2020-02-05 Thread Digimer
On 2020-02-05 5:07 a.m., Steven Levine wrote:
> I'm having some trouble re-registering for the Clusterlabs IRC channel
> but this might get to you.
> 
> Red Hat's overview documentation of qdevice (quorum device when spelled
> out in the doc) is here:
> 
> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html-single/configuring_and_managing_high_availability_clusters/index#assembly_configuring-quorum-devices-configuring-cluster-quorum
> 
> 
> Steven

Thanks!

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

2020-02-05 Thread Eric Robinson
Hi Strahil –

I can’t prove there was no network loss, but:


  1.  There were no dmesg indications of ethernet link loss.
  2.  Other than corosync, there are no other log messages about connectivity 
issues.
  3.  Wouldn’t pcsd say something about connectivity loss?
  4.  Both servers are in Azure.
  5.  There are many other servers in the same Azure subscription, including 
other corosync clusters, none of which had issues.

So I guess it’s possible, but it seems unlikely.

--Eric

From: Users  On Behalf Of Strahil Nikolov
Sent: Wednesday, February 5, 2020 3:13 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 
; Andrei Borzenkov 
Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

Hi Eric,

what has led you to think that there was no network loss ?

Best Regards,
Strahil Nikolov

On Wednesday, February 5, 2020 at 22:59:56 GMT+2, Eric Robinson 
<eric.robin...@psmnv.com> wrote:



> -Original Message-
> From: Users 
> mailto:users-boun...@clusterlabs.org>> On 
> Behalf Of Strahil Nikolov
> Sent: Wednesday, February 5, 2020 1:59 PM
> To: Andrei Borzenkov mailto:arvidj...@gmail.com>>; 
> users@clusterlabs.org
> Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
>
> On February 5, 2020 8:14:06 PM GMT+02:00, Andrei Borzenkov
> mailto:arvidj...@gmail.com>> wrote:
> >05.02.2020 20:55, Eric Robinson wrote:
> >> The two servers 001db01a and 001db01b were up and responsive. Neither
> >had been rebooted and neither were under heavy load. There's no
> >indication in the logs of loss of network connectivity. Any ideas on
> >why both nodes seem to think the other one is at fault?
> >
> >The very fact that nodes lost connection to each other *is* indication
> >of network problems. Your logs start too late, after any problem
> >already happened.
> >
> >>
> >> (Yes, it's a 2-node cluster without quorum. A 3-node cluster is not
> >an option at this time.)
> >>
> >> Log from 001db01a:
> >>
> >> Feb  5 08:01:02 001db01a corosync[1306]: [TOTEM ] A processor failed,
> >forming new configuration.
> >> Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] A new membership
> >(10.51.14.33:960) was formed. Members left: 2
> >> Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] Failed to receive
> >the leave message. failed: 2
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Node 001db01b state is
> >now lost
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Removing all 001db01b
> >attributes for peer loss
> >> Feb  5 08:01:03 001db01a cib[1522]:  notice: Node 001db01b state is
> >now lost
> >> Feb  5 08:01:03 001db01a cib[1522]:  notice: Purged 1 peer with id=2
> >and/or uname=001db01b from the membership cache
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Purged 1 peer with
> >id=2 and/or uname=001db01b from the membership cache
> >> Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expect
> >node 2 to be down
> >> Feb  5 08:01:03 001db01a stonith-ng[1523]:  notice: Node 001db01b
> >state is now lost
> >> Feb  5 08:01:03 001db01a crmd[1527]:  notice: Stonith/shutdown of
> >001db01b not matched
> >> Feb  5 08:01:03 001db01a corosync[1306]: [QUORUM] Members[1]: 1 Feb
> >> 5 08:01:03 001db01a corosync[1306]: [MAIN  ] Completed service
> >synchronization, ready to provide service.
> >> Feb  5 08:01:03 001db01a stonith-ng[1523]:  notice: Purged 1 peer
> >with id=2 and/or uname=001db01b from the membership cache
> >> Feb  5 08:01:03 001db01a pacemakerd[1491]:  notice: Node 001db01b
> >state is now lost
> >> Feb  5 08:01:03 001db01a crmd[1527]:  notice: State transition S_IDLE
> >-> S_POLICY_ENGINE
> >> Feb  5 08:01:03 001db01a crmd[1527]:  notice: Node 001db01b state is
> >now lost
> >> Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expect
> >node 2 to be down
> >> Feb  5 08:01:03 001db01a crmd[1527]:  notice: Stonith/shutdown of
> >001db01b not matched
> >> Feb  5 08:01:03 001db01a pengine[1526]:  notice: On loss of CCM
> >Quorum: Ignore
> >>
> >> From 001db01b:
> >>
> >> Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] A new membership
> >(10.51.14.34:960) was formed. Members left: 1
> >> Feb  5 08:01:03 001db01b crmd[1693]:  notice: Our peer on the DC
> >(001db01a) is dead
> >> Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: Node 001db01a
> >state is now lost
> >> Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] Failed to receive
> >the leave message. failed: 1
> >> Feb  5 08:01:03 001db01b corosync[1455]: [QUORUM] Members[1]: 2 Feb
> >> 5 08:01:03 001db01b corosync[1455]: [MAIN  ] Completed service
> >synchronization, ready to provide service.
> >> Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: Purged 1 peer
> >with id=1 and/or uname=001db01a from the membership cache
> >> Feb  5 08:01:03 001db01b pacemakerd[1678]:  notice: Node 001db01a
> >state is now lost
> >> Feb  5 08:01:03 001db01b crmd[1693]:  notice: State transition
> >S_NOT_DC -> S_ELECTION
> >> Feb  5 08:01:03 001db01b crmd[1693]:  notice: 

[ClusterLabs] SBD on shared disk

2020-02-05 Thread Strahil Nikolov
Hello Community,
I'm preparing for my EX436 and I was wondering if there are any drawbacks if a 
shared LUN is split into 2 partitions and the first partition is used for SBD, 
while the second one is used for a shared file system (either XFS for 
active/passive, or GFS2 for active/active).
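
Concretely, I mean a layout like this (hypothetical device name, sizes only
for illustration):

# one shared LUN, split into a small SBD partition and a data partition
parted -s /dev/disk/by-id/SHARED-LUN mklabel gpt
parted -s /dev/disk/by-id/SHARED-LUN mkpart sbd 1MiB 11MiB
parted -s /dev/disk/by-id/SHARED-LUN mkpart data 11MiB 100%
sbd -d /dev/disk/by-id/SHARED-LUN-part1 create    # SBD on the first partition
mkfs.xfs /dev/disk/by-id/SHARED-LUN-part2         # XFS (or GFS2) on the second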
Do you see any drawbacks in such an implementation? Thanks in advance.

Best Regards,
Strahil Nikolov
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

2020-02-05 Thread Eric Robinson

> -Original Message-
> From: Users  On Behalf Of Strahil Nikolov
> Sent: Wednesday, February 5, 2020 1:59 PM
> To: Andrei Borzenkov ; users@clusterlabs.org
> Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
>
> On February 5, 2020 8:14:06 PM GMT+02:00, Andrei Borzenkov
>  wrote:
> >05.02.2020 20:55, Eric Robinson wrote:
> >> The two servers 001db01a and 001db01b were up and responsive. Neither
> >had been rebooted and neither were under heavy load. There's no
> >indication in the logs of loss of network connectivity. Any ideas on
> >why both nodes seem to think the other one is at fault?
> >
> >The very fact that nodes lost connection to each other *is* indication
> >of network problems. Your logs start too late, after any problem
> >already happened.
> >
> >>
> >> (Yes, it's a 2-node cluster without quorum. A 3-node cluster is not
> >an option at this time.)
> >>
> >> Log from 001db01a:
> >>
> >> Feb  5 08:01:02 001db01a corosync[1306]: [TOTEM ] A processor failed,
> >forming new configuration.
> >> Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] A new membership
> >(10.51.14.33:960) was formed. Members left: 2
> >> Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] Failed to receive
> >the leave message. failed: 2
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Node 001db01b state is
> >now lost
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Removing all 001db01b
> >attributes for peer loss
> >> Feb  5 08:01:03 001db01a cib[1522]:  notice: Node 001db01b state is
> >now lost
> >> Feb  5 08:01:03 001db01a cib[1522]:  notice: Purged 1 peer with id=2
> >and/or uname=001db01b from the membership cache
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Purged 1 peer with
> >id=2 and/or uname=001db01b from the membership cache
> >> Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expect
> >node 2 to be down
> >> Feb  5 08:01:03 001db01a stonith-ng[1523]:  notice: Node 001db01b
> >state is now lost
> >> Feb  5 08:01:03 001db01a crmd[1527]:  notice: Stonith/shutdown of
> >001db01b not matched
> >> Feb  5 08:01:03 001db01a corosync[1306]: [QUORUM] Members[1]: 1 Feb
> >> 5 08:01:03 001db01a corosync[1306]: [MAIN  ] Completed service
> >synchronization, ready to provide service.
> >> Feb  5 08:01:03 001db01a stonith-ng[1523]:  notice: Purged 1 peer
> >with id=2 and/or uname=001db01b from the membership cache
> >> Feb  5 08:01:03 001db01a pacemakerd[1491]:  notice: Node 001db01b
> >state is now lost
> >> Feb  5 08:01:03 001db01a crmd[1527]:  notice: State transition S_IDLE
> >-> S_POLICY_ENGINE
> >> Feb  5 08:01:03 001db01a crmd[1527]:  notice: Node 001db01b state is
> >now lost
> >> Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expect
> >node 2 to be down
> >> Feb  5 08:01:03 001db01a crmd[1527]:  notice: Stonith/shutdown of
> >001db01b not matched
> >> Feb  5 08:01:03 001db01a pengine[1526]:  notice: On loss of CCM
> >Quorum: Ignore
> >>
> >> From 001db01b:
> >>
> >> Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] A new membership
> >(10.51.14.34:960) was formed. Members left: 1
> >> Feb  5 08:01:03 001db01b crmd[1693]:  notice: Our peer on the DC
> >(001db01a) is dead
> >> Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: Node 001db01a
> >state is now lost
> >> Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] Failed to receive
> >the leave message. failed: 1
> >> Feb  5 08:01:03 001db01b corosync[1455]: [QUORUM] Members[1]: 2 Feb
> >> 5 08:01:03 001db01b corosync[1455]: [MAIN  ] Completed service
> >synchronization, ready to provide service.
> >> Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: Purged 1 peer
> >with id=1 and/or uname=001db01a from the membership cache
> >> Feb  5 08:01:03 001db01b pacemakerd[1678]:  notice: Node 001db01a
> >state is now lost
> >> Feb  5 08:01:03 001db01b crmd[1693]:  notice: State transition
> >S_NOT_DC -> S_ELECTION
> >> Feb  5 08:01:03 001db01b crmd[1693]:  notice: Node 001db01a state is
> >now lost
> >> Feb  5 08:01:03 001db01b attrd[1691]:  notice: Node 001db01a state is
> >now lost
> >> Feb  5 08:01:03 001db01b attrd[1691]:  notice: Removing all 001db01a
> >attributes for peer loss
> >> Feb  5 08:01:03 001db01b attrd[1691]:  notice: Lost attribute writer
> >001db01a
> >> Feb  5 08:01:03 001db01b attrd[1691]:  notice: Purged 1 peer with
> >id=1 and/or uname=001db01a from the membership cache
> >> Feb  5 08:01:03 001db01b crmd[1693]:  notice: State transition
> >S_ELECTION -> S_INTEGRATION
> >> Feb  5 08:01:03 001db01b cib[1688]:  notice: Node 001db01a state is
> >now lost
> >> Feb  5 08:01:03 001db01b cib[1688]:  notice: Purged 1 peer with id=1
> >and/or uname=001db01a from the membership cache
> >> Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: [cib_diff_notify]
> >Patch aborted: Application of an update diff failed (-206)
> >> Feb  5 08:01:03 001db01b crmd[1693]: warning: Input I_ELECTION_DC
> >received in state S_INTEGRATION from do_election_check
> >> Feb  5 08:01:03 001db01b pengine[1692]:  notice: On loss 

Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

2020-02-05 Thread Eric Robinson




> -Original Message-
> From: Users  On Behalf Of Andrei
> Borzenkov
> Sent: Wednesday, February 5, 2020 12:14 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
>
> 05.02.2020 20:55, Eric Robinson wrote:
> > The two servers 001db01a and 001db01b were up and responsive. Neither
> had been rebooted and neither were under heavy load. There's no indication
> in the logs of loss of network connectivity. Any ideas on why both nodes
> seem to think the other one is at fault?
>
> The very fact that nodes lost connection to each other *is* indication of
> network problems. Your logs start too late, after any problem already
> happened.
>

All the log messages before those are just normal repetitive stuff that always 
gets logged, even during normal production. The snippet I provided shows the 
first indication of anything unusual. Also, there is no other indication of 
network connectivity loss, and both servers are in Azure.

> >
> > (Yes, it's a 2-node cluster without quorum. A 3-node cluster is not an
> > option at this time.)
> >
> > Log from 001db01a:
> >
> > Feb  5 08:01:02 001db01a corosync[1306]: [TOTEM ] A processor failed,
> forming new configuration.
> > Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] A new membership
> > (10.51.14.33:960) was formed. Members left: 2 Feb  5 08:01:03 001db01a
> > corosync[1306]: [TOTEM ] Failed to receive the leave message. failed:
> > 2 Feb  5 08:01:03 001db01a attrd[1525]:  notice: Node 001db01b state
> > is now lost Feb  5 08:01:03 001db01a attrd[1525]:  notice: Removing
> > all 001db01b attributes for peer loss Feb  5 08:01:03 001db01a
> > cib[1522]:  notice: Node 001db01b state is now lost Feb  5 08:01:03
> > 001db01a cib[1522]:  notice: Purged 1 peer with id=2 and/or
> > uname=001db01b from the membership cache Feb  5 08:01:03 001db01a
> > attrd[1525]:  notice: Purged 1 peer with id=2 and/or uname=001db01b
> > from the membership cache Feb  5 08:01:03 001db01a crmd[1527]:
> > warning: No reason to expect node 2 to be down Feb  5 08:01:03 001db01a
> stonith-ng[1523]:  notice: Node 001db01b state is now lost Feb  5 08:01:03
> 001db01a crmd[1527]:  notice: Stonith/shutdown of 001db01b not matched
> Feb  5 08:01:03 001db01a corosync[1306]: [QUORUM] Members[1]: 1 Feb  5
> 08:01:03 001db01a corosync[1306]: [MAIN  ] Completed service
> synchronization, ready to provide service.
> > Feb  5 08:01:03 001db01a stonith-ng[1523]:  notice: Purged 1 peer with
> > id=2 and/or uname=001db01b from the membership cache Feb  5 08:01:03
> > 001db01a pacemakerd[1491]:  notice: Node 001db01b state is now lost
> > Feb  5 08:01:03 001db01a crmd[1527]:  notice: State transition S_IDLE
> > -> S_POLICY_ENGINE Feb  5 08:01:03 001db01a crmd[1527]:  notice: Node
> > 001db01b state is now lost Feb  5 08:01:03 001db01a crmd[1527]:
> > warning: No reason to expect node 2 to be down Feb  5 08:01:03
> > 001db01a crmd[1527]:  notice: Stonith/shutdown of 001db01b not matched
> > Feb  5 08:01:03 001db01a pengine[1526]:  notice: On loss of CCM
> > Quorum: Ignore
> >
> > From 001db01b:
> >
> > Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] A new membership
> > (10.51.14.34:960) was formed. Members left: 1 Feb  5 08:01:03 001db01b
> > crmd[1693]:  notice: Our peer on the DC (001db01a) is dead Feb  5
> > 08:01:03 001db01b stonith-ng[1689]:  notice: Node 001db01a state is
> > now lost Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] Failed to
> > receive the leave message. failed: 1 Feb  5 08:01:03 001db01b
> corosync[1455]: [QUORUM] Members[1]: 2 Feb  5 08:01:03 001db01b
> corosync[1455]: [MAIN  ] Completed service synchronization, ready to
> provide service.
> > Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: Purged 1 peer with
> > id=1 and/or uname=001db01a from the membership cache Feb  5 08:01:03
> > 001db01b pacemakerd[1678]:  notice: Node 001db01a state is now lost
> > Feb  5 08:01:03 001db01b crmd[1693]:  notice: State transition
> > S_NOT_DC -> S_ELECTION Feb  5 08:01:03 001db01b crmd[1693]:  notice:
> > Node 001db01a state is now lost Feb  5 08:01:03 001db01b attrd[1691]:
> > notice: Node 001db01a state is now lost Feb  5 08:01:03 001db01b
> > attrd[1691]:  notice: Removing all 001db01a attributes for peer loss
> > Feb  5 08:01:03 001db01b attrd[1691]:  notice: Lost attribute writer
> > 001db01a Feb  5 08:01:03 001db01b attrd[1691]:  notice: Purged 1 peer
> > with id=1 and/or uname=001db01a from the membership cache Feb  5
> > 08:01:03 001db01b crmd[1693]:  notice: State transition S_ELECTION ->
> > S_INTEGRATION Feb  5 08:01:03 001db01b cib[1688]:  notice: Node
> > 001db01a state is now lost Feb  5 08:01:03 001db01b cib[1688]:
> > notice: Purged 1 peer with id=1 and/or uname=001db01a from the
> > membership cache Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice:
> > [cib_diff_notify] Patch aborted: Application of an update diff failed
> > (-206) Feb  5 08:01:03 001db01b crmd[1693]: warning: Input
> > I_ELECTION_DC received in state 

Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

2020-02-05 Thread Andrei Borzenkov
05.02.2020 20:55, Eric Robinson wrote:
> The two servers 001db01a and 001db01b were up and responsive. Neither had 
> been rebooted and neither were under heavy load. There's no indication in the 
> logs of loss of network connectivity. Any ideas on why both nodes seem to 
> think the other one is at fault?

The very fact that nodes lost connection to each other *is* indication
of network problems. Your logs start too late, after any problem already
happened.

> 
> (Yes, it's a 2-node cluster without quorum. A 3-node cluster is not an option 
> at this time.)
> 
> Log from 001db01a:
> 
> Feb  5 08:01:02 001db01a corosync[1306]: [TOTEM ] A processor failed, forming 
> new configuration.
> Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] A new membership 
> (10.51.14.33:960) was formed. Members left: 2
> Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] Failed to receive the leave 
> message. failed: 2
> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Node 001db01b state is now lost
> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Removing all 001db01b 
> attributes for peer loss
> Feb  5 08:01:03 001db01a cib[1522]:  notice: Node 001db01b state is now lost
> Feb  5 08:01:03 001db01a cib[1522]:  notice: Purged 1 peer with id=2 and/or 
> uname=001db01b from the membership cache
> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Purged 1 peer with id=2 and/or 
> uname=001db01b from the membership cache
> Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expect node 2 to 
> be down
> Feb  5 08:01:03 001db01a stonith-ng[1523]:  notice: Node 001db01b state is 
> now lost
> Feb  5 08:01:03 001db01a crmd[1527]:  notice: Stonith/shutdown of 001db01b 
> not matched
> Feb  5 08:01:03 001db01a corosync[1306]: [QUORUM] Members[1]: 1
> Feb  5 08:01:03 001db01a corosync[1306]: [MAIN  ] Completed service 
> synchronization, ready to provide service.
> Feb  5 08:01:03 001db01a stonith-ng[1523]:  notice: Purged 1 peer with id=2 
> and/or uname=001db01b from the membership cache
> Feb  5 08:01:03 001db01a pacemakerd[1491]:  notice: Node 001db01b state is 
> now lost
> Feb  5 08:01:03 001db01a crmd[1527]:  notice: State transition S_IDLE -> 
> S_POLICY_ENGINE
> Feb  5 08:01:03 001db01a crmd[1527]:  notice: Node 001db01b state is now lost
> Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expect node 2 to 
> be down
> Feb  5 08:01:03 001db01a crmd[1527]:  notice: Stonith/shutdown of 001db01b 
> not matched
> Feb  5 08:01:03 001db01a pengine[1526]:  notice: On loss of CCM Quorum: Ignore
> 
> From 001db01b:
> 
> Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] A new membership 
> (10.51.14.34:960) was formed. Members left: 1
> Feb  5 08:01:03 001db01b crmd[1693]:  notice: Our peer on the DC (001db01a) 
> is dead
> Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: Node 001db01a state is 
> now lost
> Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] Failed to receive the leave 
> message. failed: 1
> Feb  5 08:01:03 001db01b corosync[1455]: [QUORUM] Members[1]: 2
> Feb  5 08:01:03 001db01b corosync[1455]: [MAIN  ] Completed service 
> synchronization, ready to provide service.
> Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: Purged 1 peer with id=1 
> and/or uname=001db01a from the membership cache
> Feb  5 08:01:03 001db01b pacemakerd[1678]:  notice: Node 001db01a state is 
> now lost
> Feb  5 08:01:03 001db01b crmd[1693]:  notice: State transition S_NOT_DC -> 
> S_ELECTION
> Feb  5 08:01:03 001db01b crmd[1693]:  notice: Node 001db01a state is now lost
> Feb  5 08:01:03 001db01b attrd[1691]:  notice: Node 001db01a state is now lost
> Feb  5 08:01:03 001db01b attrd[1691]:  notice: Removing all 001db01a 
> attributes for peer loss
> Feb  5 08:01:03 001db01b attrd[1691]:  notice: Lost attribute writer 001db01a
> Feb  5 08:01:03 001db01b attrd[1691]:  notice: Purged 1 peer with id=1 and/or 
> uname=001db01a from the membership cache
> Feb  5 08:01:03 001db01b crmd[1693]:  notice: State transition S_ELECTION -> 
> S_INTEGRATION
> Feb  5 08:01:03 001db01b cib[1688]:  notice: Node 001db01a state is now lost
> Feb  5 08:01:03 001db01b cib[1688]:  notice: Purged 1 peer with id=1 and/or 
> uname=001db01a from the membership cache
> Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: [cib_diff_notify] Patch 
> aborted: Application of an update diff failed (-206)
> Feb  5 08:01:03 001db01b crmd[1693]: warning: Input I_ELECTION_DC received in 
> state S_INTEGRATION from do_election_check
> Feb  5 08:01:03 001db01b pengine[1692]:  notice: On loss of CCM Quorum: Ignore
> 
> 
> -Eric
> 
> 
> 

Re: [ClusterLabs] multi-site clusters vs disaster recovery clusters

2020-02-05 Thread Andrei Borzenkov
05.02.2020 18:16, Олег Самойлов wrote:
> Hi all.
> 
> I am reading the documentation about the new (for me) pacemaker, which came 
> with RedHat 8.
> 
> And I see two different chapters which both try to solve exactly the same 
> problem.
> 
> One is CONFIGURING DISASTER RECOVERY CLUSTERS (pcs dr):
> 
> This is about infrastructure to create two different clusters on different 
> sites with manual switching between them.
> 
> And CONFIGURING MULTI-SITE CLUSTERS WITH PACEMAKER (pcs booth):
> 
> This is also about the same thing: creating two different clusters on different 
> sites with automatic switching, but lacking some of the features from dr.
> 
> IMHO, because both features address the same problem, it would be worth uniting 
> them in the documentation as a single feature. Or maybe there is a point in 
> keeping them different? Maybe I don't understand something?

A (multi-site) high-availability cluster and disaster recovery solve
entirely different problems.

An HA cluster's task is to automatically restore access to a service with
minimal downtime. The fundamental prerequisite is that every node that can
take over a service (or resource) has access to up-to-date data for that
resource; otherwise resource fail-over would result in silent data loss
or other inconsistencies. This severely limits the maximal distance between
sites - once you go beyond several dozen kilometers, the latency to
synchronously replicate data becomes too high. There may be special
workloads that tolerate it, but in general it is more or less a metro area.

Disaster recovery's goal is to protect against catastrophic loss of a
whole area, including the data and the infrastructure to access it. To
minimize disaster impact, the secondary site is located at a far greater
distance, which inevitably means it cannot have fully up-to-date data. So
the decision to accept data loss and continue operation is always manual.
It may be more acceptable to wait until the primary site returns to
operation (or to try to rescue the latest data from it).

It is true that often the "disaster recovery site" is located in the same
metropolitan area, so a stretched cluster can cover it.
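
To illustrate the booth (multi-site) side of this: each site is an independent
cluster, and a ticket, granted by booth with a third-site arbitrator breaking
ties, decides which site may run the resources. A minimal sketch with
hypothetical addresses and ticket name:

# /etc/booth/booth.conf -- the same file on both sites and the arbitrator
transport = UDP
port = 9929
arbitrator = 203.0.113.10    # third location, runs only the booth arbitrator
site = 192.0.2.10            # booth address of cluster/site A
site = 198.51.100.10         # booth address of cluster/site B
ticket = "ticket-db"
    expire = 600             # ticket must be renewed within this many seconds

Each cluster then ties its resources to the ticket with a ticket constraint
(e.g. "pcs constraint ticket add ticket-db <resource>"), so only the site
currently holding the ticket starts them.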
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

[ClusterLabs] Why Do Nodes Leave the Cluster?

2020-02-05 Thread Eric Robinson
The two servers 001db01a and 001db01b were up and responsive. Neither had been 
rebooted and neither was under heavy load. There's no indication in the logs 
of loss of network connectivity. Any ideas on why both nodes seem to think the 
other one is at fault?

(Yes, it's a 2-node cluster without quorum. A 3-node cluster is not an option 
at this time.)

Log from 001db01a:

Feb  5 08:01:02 001db01a corosync[1306]: [TOTEM ] A processor failed, forming 
new configuration.
Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] A new membership 
(10.51.14.33:960) was formed. Members left: 2
Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] Failed to receive the leave 
message. failed: 2
Feb  5 08:01:03 001db01a attrd[1525]:  notice: Node 001db01b state is now lost
Feb  5 08:01:03 001db01a attrd[1525]:  notice: Removing all 001db01b attributes 
for peer loss
Feb  5 08:01:03 001db01a cib[1522]:  notice: Node 001db01b state is now lost
Feb  5 08:01:03 001db01a cib[1522]:  notice: Purged 1 peer with id=2 and/or 
uname=001db01b from the membership cache
Feb  5 08:01:03 001db01a attrd[1525]:  notice: Purged 1 peer with id=2 and/or 
uname=001db01b from the membership cache
Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expect node 2 to be 
down
Feb  5 08:01:03 001db01a stonith-ng[1523]:  notice: Node 001db01b state is now 
lost
Feb  5 08:01:03 001db01a crmd[1527]:  notice: Stonith/shutdown of 001db01b not 
matched
Feb  5 08:01:03 001db01a corosync[1306]: [QUORUM] Members[1]: 1
Feb  5 08:01:03 001db01a corosync[1306]: [MAIN  ] Completed service 
synchronization, ready to provide service.
Feb  5 08:01:03 001db01a stonith-ng[1523]:  notice: Purged 1 peer with id=2 
and/or uname=001db01b from the membership cache
Feb  5 08:01:03 001db01a pacemakerd[1491]:  notice: Node 001db01b state is now 
lost
Feb  5 08:01:03 001db01a crmd[1527]:  notice: State transition S_IDLE -> 
S_POLICY_ENGINE
Feb  5 08:01:03 001db01a crmd[1527]:  notice: Node 001db01b state is now lost
Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expect node 2 to be 
down
Feb  5 08:01:03 001db01a crmd[1527]:  notice: Stonith/shutdown of 001db01b not 
matched
Feb  5 08:01:03 001db01a pengine[1526]:  notice: On loss of CCM Quorum: Ignore

From 001db01b:

Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] A new membership 
(10.51.14.34:960) was formed. Members left: 1
Feb  5 08:01:03 001db01b crmd[1693]:  notice: Our peer on the DC (001db01a) is 
dead
Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: Node 001db01a state is now 
lost
Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] Failed to receive the leave 
message. failed: 1
Feb  5 08:01:03 001db01b corosync[1455]: [QUORUM] Members[1]: 2
Feb  5 08:01:03 001db01b corosync[1455]: [MAIN  ] Completed service 
synchronization, ready to provide service.
Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: Purged 1 peer with id=1 
and/or uname=001db01a from the membership cache
Feb  5 08:01:03 001db01b pacemakerd[1678]:  notice: Node 001db01a state is now 
lost
Feb  5 08:01:03 001db01b crmd[1693]:  notice: State transition S_NOT_DC -> 
S_ELECTION
Feb  5 08:01:03 001db01b crmd[1693]:  notice: Node 001db01a state is now lost
Feb  5 08:01:03 001db01b attrd[1691]:  notice: Node 001db01a state is now lost
Feb  5 08:01:03 001db01b attrd[1691]:  notice: Removing all 001db01a attributes 
for peer loss
Feb  5 08:01:03 001db01b attrd[1691]:  notice: Lost attribute writer 001db01a
Feb  5 08:01:03 001db01b attrd[1691]:  notice: Purged 1 peer with id=1 and/or 
uname=001db01a from the membership cache
Feb  5 08:01:03 001db01b crmd[1693]:  notice: State transition S_ELECTION -> 
S_INTEGRATION
Feb  5 08:01:03 001db01b cib[1688]:  notice: Node 001db01a state is now lost
Feb  5 08:01:03 001db01b cib[1688]:  notice: Purged 1 peer with id=1 and/or 
uname=001db01a from the membership cache
Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: [cib_diff_notify] Patch 
aborted: Application of an update diff failed (-206)
Feb  5 08:01:03 001db01b crmd[1693]: warning: Input I_ELECTION_DC received in 
state S_INTEGRATION from do_election_check
Feb  5 08:01:03 001db01b pengine[1692]:  notice: On loss of CCM Quorum: Ignore


-Eric



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

[ClusterLabs] multi-site clusters vs disaster recovery clusters

2020-02-05 Thread Олег Самойлов
Hi all.

I am reading the documentation about the new (for me) pacemaker, which came 
with RedHat 8.

And I see two different chapters which both try to solve exactly the same 
problem.

One is CONFIGURING DISASTER RECOVERY CLUSTERS (pcs dr):

This is about infrastructure to create two different clusters on different 
sites with manual switching between them.

And CONFIGURING MULTI-SITE CLUSTERS WITH PACEMAKER (pcs booth):

This is also about the same thing: creating two different clusters on different 
sites with automatic switching, but lacking some of the features from dr.

IMHO, because both features address the same problem, it would be worth uniting 
them in the documentation as a single feature. Or maybe there is a point in 
keeping them different? Maybe I don't understand something?
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] A note for digimer re: qdevice documentation

2020-02-05 Thread Steven Levine
I'm having some trouble re-registering for the Clusterlabs IRC channel 
but this might get to you.


Red Hat's overview documentation of qdevice (quorum device when spelled 
out in the doc) is here:


https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html-single/configuring_and_managing_high_availability_clusters/index#assembly_configuring-quorum-devices-configuring-cluster-quorum

Steven

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/