Quorum doesn't prevent split-brains; stonith (fencing) does.

https://www.alteeve.com/w/The_2-Node_Myth

There is no way to use quorum alone to avoid a potential split-brain. You might be able to make it less likely with enough effort, but you can never prevent it.
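To make the point concrete, here is a toy model of a two-node partition. It is only an illustration under stated assumptions, not how corosync or pacemaker are implemented: `Node`, `has_quorum` and `nodes_that_would_promote` are hypothetical names, the "two-node mode" rule simply mimics the idea that a single vote is enough in a two-node cluster, and fencing is reduced to "one node wins the fence race". It covers only the network-partition case, not a hung node that wakes up later.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    peers_visible: int  # peers this node can still reach after the partition


def has_quorum(node: Node, total_votes: int, two_node: bool) -> bool:
    """Toy quorum rule: strict majority, or one vote is enough in two-node mode."""
    votes_seen = 1 + node.peers_visible
    if two_node and total_votes == 2:
        return votes_seen >= 1
    return votes_seen > total_votes // 2


def nodes_that_would_promote(nodes, total_votes, two_node, fence_winner=None):
    """Return the names of nodes that would run the service after a partition.

    With fence_winner=None there is no fencing, so every quorate node proceeds.
    With fencing, only the node that won the fence race proceeds, because the
    loser is assumed to be powered off before recovery starts.
    """
    quorate = [n.name for n in nodes if has_quorum(n, total_votes, two_node)]
    if fence_winner is None:
        return quorate
    return [name for name in quorate if name == fence_winner]


# Both nodes are alive but cannot see each other.
partitioned = [Node("pancakeFence1", 0), Node("pancakeFence2", 0)]

no_fencing = nodes_that_would_promote(partitioned, 2, two_node=True)
with_fencing = nodes_that_would_promote(
    partitioned, 2, two_node=True, fence_winner="pancakeFence1"
)
print(no_fencing)
print(with_fencing)
```

In this sketch, both nodes stay quorate without fencing and both would run the service, which is the split-brain. With fencing, only the node that won the fence race is left to proceed.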

digimer

On 2017-11-14 10:45 PM, Garima wrote:

Hello All,

 

A split-brain situation occurs when quorum is lost and status information is no longer exchanged between the two nodes of the cluster.

This can be avoided if quorum is communicated between both nodes.

I have checked the code. In my opinion, these files (quorum.py/stonith.py) need to be updated to avoid the split-brain situation and maintain the Active-Passive configuration.

 

Regards,

Garima

 

From: Derek Wuelfrath [mailto:dwuelfr...@inverse.ca]
Sent: 13 November 2017 20:55
To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
Subject: Re: [ClusterLabs] Pacemaker responsible of DRBD and a systemd resource

 

Hello Ken!

 

Make sure that the systemd service is not enabled. If pacemaker is
managing a service, systemd can't also be trying to start and stop it.

 

It is not. I made sure of this in the first place :)

 

Beyond that, the question is what log messages are there from around
the time of the issue (on both nodes).

 

Well, that’s the thing. There are not many log messages telling what is actually happening. The ‘systemd’ resource is not even trying to start (nothing in either log for that resource). Here are the logs from my last attempt:

Scenario:

- Services were running on ‘pancakeFence2’. DRBD was synced and connected.

- I rebooted ‘pancakeFence2’. Services failed over to ‘pancakeFence1’.

- After ‘pancakeFence2’ comes back, services are running just fine on ‘pancakeFence1’, but DRBD is in StandAlone due to a split-brain.

 

Logs for pancakeFence1: https://pastebin.com/dVSGPP78

Logs for pancakeFence2: https://pastebin.com/at8qPkHE

 

It really looks like the status check mechanism of corosync/pacemaker for a systemd resource forces the resource to “start” and therefore starts the ones above that resource in the group (DRBD, for instance).

This does not happen for a regular OCF resource (IPaddr2, for example).


Cheers!

-dw

 

--

Derek Wuelfrath

dwuelfr...@inverse.ca :: +1.514.447.4918 (x110) :: +1.866.353.6153 (x110)

Inverse inc. :: Leaders behind SOGo (www.sogo.nu), PacketFence (www.packetfence.org) and Fingerbank (www.fingerbank.org)



On Nov 10, 2017, at 11:39, Ken Gaillot <kgail...@redhat.com> wrote:

 

On Thu, 2017-11-09 at 20:27 -0500, Derek Wuelfrath wrote:

Hello there,

First post here, but I have been following for a while!


Welcome!



Here’s my issue:
we have been setting up and running this type of cluster for a
while and never really encountered this kind of problem.

I recently set up a Corosync / Pacemaker / PCS cluster to manage DRBD
along with various other resources. Some of these resources are
systemd resources… this is the part where things are “breaking”.

A two-server cluster running only DRBD, or DRBD with an OCF
IPaddr2 resource (cluster IP, for instance), works just fine. I can
easily move from one node to the other without any issue.
As soon as I add a systemd resource to the resource group, things
break. Moving from one node to the other using standby mode works
just fine, but as soon as a Corosync / Pacemaker restart involves
polling of a systemd resource, it seems to try to start
the whole resource group and therefore creates a split-brain of the
DRBD resource.


My first two suggestions would be:

Make sure that the systemd service is not enabled. If pacemaker is
managing a service, systemd can't also be trying to start and stop it.

Fencing is the only way pacemaker can resolve split-brains and certain
other situations, so that will help in the recovery.

Beyond that, the question is what log messages are there from around
the time of the issue (on both nodes).
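The first suggestion above can be checked with a short script. This is a minimal sketch, assuming a systemd host: it relies on `systemctl is-enabled`, which prints the unit's enablement state, and it treats only `enabled` and `enabled-runtime` as a conflict with cluster management. The function names and the unit name `example.service` are hypothetical; the real unit name is the one configured in the cluster resource.

```python
import subprocess

# States printed by `systemctl is-enabled` that mean systemd itself will
# start the unit at boot. Assumption: every other state (disabled, static,
# masked, ...) is treated as "not conflicting" in this sketch.
BOOT_ENABLED_STATES = {"enabled", "enabled-runtime"}


def conflicts_with_cluster(state: str) -> bool:
    """Return True if the reported state means systemd would start the unit."""
    return state.strip() in BOOT_ENABLED_STATES


def unit_state(unit: str) -> str:
    """Ask systemd for the enablement state of a unit (requires systemctl)."""
    result = subprocess.run(
        ["systemctl", "is-enabled", unit],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit is expected for disabled units
    )
    return result.stdout.strip()


def report(unit: str) -> str:
    """Build a one-line, human-readable verdict for a unit."""
    state = unit_state(unit)
    if conflicts_with_cluster(state):
        return f"{unit}: {state} (disable it so only pacemaker starts it)"
    return f"{unit}: {state or 'unknown'}"
```

Calling `report("example.service")` on each node gives a quick verdict; the check has to pass on both nodes, since either one can end up running the resource.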




This is the best explanation / description of the situation that I can
give. If it needs any clarification, examples, … I am more than open
to sharing them.

Any guidance would be appreciated :)

Here’s the output of a ‘pcs config’

https://pastebin.com/1TUvZ4X9

Cheers!
-dw


-- 
Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

 





-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of Einstein’s brain than in the near certainty that people of equal talent have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
