Re: [ClusterLabs] Automatic recover from split brain ?

2020-08-11 Thread Andrei Borzenkov
11.08.2020 10:34, Adam Cécile wrote:
> On 8/11/20 8:48 AM, Andrei Borzenkov wrote:
>> 08.08.2020 13:10, Adam Cécile wrote:
>>> Hello,
>>>
>>>
>>> I'm experiencing an issue with corosync/pacemaker running on Debian Buster.
>>> The cluster has three nodes running in VMware virtual machines, and the
>>> cluster fails when VEEAM backs up the virtual machines (I know it's doing
>>> bad things, like completely freezing the VM for a few minutes to take a
>>> disk snapshot).
>>>
>>> My biggest issue is that once the backup has completed, the cluster
>>> stays in a split-brain state, and I'd like it to heal itself. Here is the
>>> current status:
>>>
>>>
>>> One node is isolated:
>>>
>>> Stack: corosync
>>> Current DC: host2.domain.com (version 2.0.1-9e909a5bdd) - partition
>>> WITHOUT quorum
>>> Last updated: Sat Aug  8 11:59:46 2020
>>> Last change: Fri Jul 24 07:18:12 2020 by root via cibadmin on
>>> host1.domain.com
>>>
>>> 3 nodes configured
>>> 6 resources configured
>>>
>>> Online: [ host2.domain.com ]
>>> OFFLINE: [ host3.domain.com host1.domain.com ]
>>>
>>>
>>> The two other nodes see each other:
>>>
>>> Stack: corosync
>>> Current DC: host3.domain.com (version 2.0.1-9e909a5bdd) - partition with
>>> quorum
>>> Last updated: Sat Aug  8 12:07:56 2020
>>> Last change: Fri Jul 24 07:18:12 2020 by root via cibadmin on
>>> host1.domain.com
>>>
>>> 3 nodes configured
>>> 6 resources configured
>>>
>>> Online: [ host3.domain.com host1.domain.com ]
>>> OFFLINE: [ host2.domain.com ]
>>>
>> Show your full configuration including defined STONITH resources and
>> cluster options (most importantly, no-quorum-policy and stonith-enabled).
> 
> Hello,
> 
> Stonith is disabled and I tried various settings for no-quorum-policy.
> 
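For reference, these settings can be inspected and changed with pcs; the
commands below are generic examples rather than values recommended for this
particular cluster:

  pcs config                               # dump the full cluster configuration
  pcs property list                        # show cluster properties that have been set
  pcs property set no-quorum-policy=stop   # example of setting a property; the value is illustrative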
>>> The problem is that one of the resources is a floating IP address which
>>> is currently assigned to two different hosts...
>>>
>> Of course - each partition assumes the other partition is dead and so it
>> is free to take over the remaining resources.
> I understand that, but I still don't get why, once all nodes are back
> online, the cluster does not heal from resources running on multiple
> hosts.

In my limited testing it does - after the nodes see each other, pacemaker
notices resources active on multiple nodes and tries to fix it. This is
with pacemaker 2.0.3. Check the logs on all nodes for what happens around
the time the node becomes alive again.
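
For example, something along these lines on each node (assuming the usual
corosync/pacemaker systemd units on Debian Buster; the time window is only
illustrative, taken from the status output above):

  journalctl -u corosync -u pacemaker --since "2020-08-08 11:50" --until "2020-08-08 12:15"

Look for corosync membership changes and for pacemaker messages about
resources detected active on more than one node.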

>>
>>> Can you help me configure the cluster correctly so this cannot
>>> occur?
>>>
>> Define "correctly".
>>
>> The most straightforward textbook answer - you need STONITH
>> resources that will eliminate the "lost" node. But your lost node is in
>> the middle of performing a backup. Eliminating it may invalidate the
>> backup being created.
> Yeah but well, no. Killing the node is worse; the sensitive services are
> already running in clustered mode at the application level, so they do not
> rely on corosync. Basically, corosync provides a floating IP for some
> external, non-critical access and starts systemd timers that are
> pointless to run on multiple hosts. Nothing critical here.
>>
>> So another answer would be - put the cluster in maintenance mode, perform
>> the backup, then resume normal operation. Backup software usually allows
>> hooks to be executed before and after the backup. That may work too.
> This is indeed something I might look at, but again, for my trivial
> needs it sounds a bit like overkill to me.
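
As a minimal sketch of that approach, assuming the backup job can run
pre-freeze and post-thaw scripts inside the guest (script placement and
names depend on how VEEAM guest processing is configured):

  # pre-backup hook: stop pacemaker from managing resources cluster-wide
  pcs property set maintenance-mode=true

  # post-backup hook: resume normal management once the VM is thawed
  pcs property set maintenance-mode=false

While maintenance-mode is set, the surviving partition will not try to take
over resources from the frozen node, so the floating IP stays put.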
>> Or find a way to not freeze the VM during backup ... e.g. by using a
>> different backup method?
> 
> Or tweak some network settings so corosync does not consider the node
> dead too soon? The backup won't last more than 2 minutes and the
> freeze is usually well below that. I can definitely live with the cluster
> state being unknown for a couple of minutes. Is that possible?
> 

You can probably increase the corosync token timeout, but that also directly
affects how fast pacemaker starts and how fast it reacts to node failures.
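
For reference, that is the token parameter in the totem section of
corosync.conf; a rough sketch, with a value sized to outlast a roughly
2 minute freeze (illustrative only, and it carries exactly the trade-off
described above):

totem {
    # total token timeout in milliseconds; failure detection slows down accordingly
    token: 150000
}

With corosync 3.x, as shipped in Debian Buster, the edited corosync.conf can
be re-read on all nodes with 'corosync-cfgtool -R'.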

> Removing VEEAM is indeed my last option and the one I have used so far, but
> this time I was hoping someone else had experienced the same issue
> and could help me fix it in a clean way.
> 
> 
> Thanks
> 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Automatic recover from split brain ?

2020-08-11 Thread Adam Cecile

Hello,


I'm experiencing an issue with corosync/pacemaker running on Debian Buster.
The cluster has three nodes running in VMware virtual machines, and the
cluster fails when VEEAM backs up the virtual machines (I know it's doing
bad things, like completely freezing the VM for a few minutes to take a
disk snapshot).


My biggest issue is that once the backup has completed, the cluster
stays in a split-brain state, and I'd like it to heal itself. Here is the
current status:



One node is isolated:

Stack: corosync
Current DC: host2.domain.com (version 2.0.1-9e909a5bdd) - partition 
WITHOUT quorum

Last updated: Sat Aug  8 11:59:46 2020
Last change: Fri Jul 24 07:18:12 2020 by root via cibadmin on 
host1.domain.com


3 nodes configured
6 resources configured

Online: [ host2.domain.com ]
OFFLINE: [ host3.domain.com host1.domain.com ]


The two other nodes see each other:

Stack: corosync
Current DC: host3.domain.com (version 2.0.1-9e909a5bdd) - partition with 
quorum

Last updated: Sat Aug  8 12:07:56 2020
Last change: Fri Jul 24 07:18:12 2020 by root via cibadmin on 
host1.domain.com


3 nodes configured
6 resources configured

Online: [ host3.domain.com host1.domain.com ]
OFFLINE: [ host2.domain.com ]


The problem is that one of the resources is a floating IP address which 
is currently assigned to two different hosts...



Can you help me configure the cluster correctly so this cannot occur?


Thanks in advance,

Adam.



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Automatic recover from split brain ?

2020-08-11 Thread Adam Cécile

On 8/11/20 8:48 AM, Andrei Borzenkov wrote:

> 08.08.2020 13:10, Adam Cécile wrote:
>
>> Hello,
>>
>>
>> I'm experiencing an issue with corosync/pacemaker running on Debian Buster.
>> The cluster has three nodes running in VMware virtual machines, and the
>> cluster fails when VEEAM backs up the virtual machines (I know it's doing
>> bad things, like completely freezing the VM for a few minutes to take a
>> disk snapshot).
>>
>> My biggest issue is that once the backup has completed, the cluster
>> stays in a split-brain state, and I'd like it to heal itself. Here is the
>> current status:
>>
>>
>> One node is isolated:
>>
>> Stack: corosync
>> Current DC: host2.domain.com (version 2.0.1-9e909a5bdd) - partition
>> WITHOUT quorum
>> Last updated: Sat Aug  8 11:59:46 2020
>> Last change: Fri Jul 24 07:18:12 2020 by root via cibadmin on
>> host1.domain.com
>>
>> 3 nodes configured
>> 6 resources configured
>>
>> Online: [ host2.domain.com ]
>> OFFLINE: [ host3.domain.com host1.domain.com ]
>>
>>
>> The two other nodes see each other:
>>
>> Stack: corosync
>> Current DC: host3.domain.com (version 2.0.1-9e909a5bdd) - partition with
>> quorum
>> Last updated: Sat Aug  8 12:07:56 2020
>> Last change: Fri Jul 24 07:18:12 2020 by root via cibadmin on
>> host1.domain.com
>>
>> 3 nodes configured
>> 6 resources configured
>>
>> Online: [ host3.domain.com host1.domain.com ]
>> OFFLINE: [ host2.domain.com ]
>
> Show your full configuration including defined STONITH resources and
> cluster options (most importantly, no-quorum-policy and stonith-enabled).


Hello,

Stonith is disabled and I tried various settings for no-quorum-policy.

>> The problem is that one of the resources is a floating IP address which
>> is currently assigned to two different hosts...
>
> Of course - each partition assumes the other partition is dead and so it
> is free to take over the remaining resources.

I understand that, but I still don't get why, once all nodes are back
online, the cluster does not heal from resources running on multiple hosts.



>> Can you help me configure the cluster correctly so this cannot occur?
>
> Define "correctly".
>
> The most straightforward textbook answer - you need STONITH
> resources that will eliminate the "lost" node. But your lost node is in the
> middle of performing a backup. Eliminating it may invalidate the backup
> being created.

Yeah but well, no. Killing the node is worse; the sensitive services are
already running in clustered mode at the application level, so they do not
rely on corosync. Basically, corosync provides a floating IP for some
external, non-critical access and starts systemd timers that are
pointless to run on multiple hosts. Nothing critical here.

> So another answer would be - put the cluster in maintenance mode, perform
> the backup, then resume normal operation. Backup software usually allows
> hooks to be executed before and after the backup. That may work too.

This is indeed something I might look at, but again, for my trivial
needs it sounds a bit like overkill to me.

> Or find a way to not freeze the VM during backup ... e.g. by using a
> different backup method?

Or tweak some network settings so corosync does not consider the node
dead too soon? The backup won't last more than 2 minutes and the
freeze is usually well below that. I can definitely live with the cluster
state being unknown for a couple of minutes. Is that possible?


Removing VEEAM is indeed my last option and the one I have used so far, but
this time I was hoping someone else had experienced the same issue
and could help me fix it in a clean way.



Thanks



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] How to specify which IP pcs should use?

2020-08-11 Thread Tomas Jelinek

Hi Mariusz,

You haven't mentioned which pcs version you are running. Based on you
mentioning Pacemaker 2, I suppose you are running pcs 0.10.x. The text
below applies to pcs 0.10.x.


Pcs doesn't depend on or use corosync.conf when connecting to other
nodes. The reason is that pcs must be able to connect to nodes not specified
in corosync.conf, e.g. when no cluster has been created yet.


Instead, pcs has its own config file mapping node names to addresses.
The easiest way to set it is to specify an address for each node in the
'pcs host auth' command like this:

pcs host auth <node1 name> addr=<node1 address> <node2 name> addr=<node2 address> ...

Specifying addresses is not mandatory. If the addresses are omitted, pcs 
uses node names as addresses. See man pcs for more details.
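
A concrete example, with hypothetical node names and cluster-network
addresses:

pcs host auth node1.example.com addr=10.0.20.1 node2.example.com addr=10.0.20.2

This makes pcs, and the pcsd connections on port 2224, use the given
addresses instead of whatever the node names resolve to.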


To fix your issues, run 'pcs host auth' and specify all nodes and their 
addresses. Running the command on one node of your cluster should be enough.



Regards,
Tomas


On 10. 08. 20 at 14:21, Mariusz Gronczewski wrote:

Hi,

Pacemaker 2; the current setup is:

* management network, with each host's hostname resolving to its
  management IP
* cluster network for Pacemaker/Corosync communication
* corosync set up with the node name and IP of the cluster network

pcs status shows both nodes online, added config syncs to the other
node, etc., but pcs cluster status shows one node as offline.

After a look at the firewall logs, it appears all of the communication is
going just fine on the cluster network, but pcs tries to talk to pcsd
on port 2224 via the *management* network instead of using the IP set as
ring0_addr in corosync.

Is "just use the host's hostname regardless of config" normal?
Is there a separate setting to tell pcs which IP it should use?

Regards



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/