I agree. The thing is, we have had this kind of setup widely deployed for quite a while and never ran into any issue. Not sure if something changed in the Corosync/Pacemaker code or in the way it deals with systemd resources.

As said, without a systemd resource, everything just works as it should, 100% of the time. As soon as a systemd resource comes in, it breaks.
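For concreteness, the kind of stack being described can be reproduced along these lines (a rough sketch only -- the resource names, the IP, and the packetfence-haproxy unit are placeholders, and the syntax assumes the pcs 0.9 series current at the time):

    # hypothetical names throughout -- adjust to your own DRBD resource and unit
    pcs resource create data-drbd ocf:linbit:drbd drbd_resource=data op monitor interval=30s
    pcs resource master data-drbd-ms data-drbd \
        master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
    pcs resource create cluster-ip ocf:heartbeat:IPaddr2 ip=192.0.2.10 cidr_netmask=24
    pcs resource create haproxy systemd:packetfence-haproxy
    pcs resource group add pf-group cluster-ip haproxy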
-derek

--
Derek Wuelfrath
dwuelfr...@inverse.ca :: +1.514.447.4918 (x110) :: +1.866.353.6153 (x110)
Inverse inc. :: Leaders behind SOGo (www.sogo.nu), PacketFence (www.packetfence.org) and Fingerbank (www.fingerbank.org)

> On Nov 14, 2017, at 23:03, Digimer <li...@alteeve.ca> wrote:
>
> Quorum doesn't prevent split-brains, stonith (fencing) does.
>
> https://www.alteeve.com/w/The_2-Node_Myth
>
> There is no way to use quorum-only to avoid a potential split-brain. You
> might be able to make it less likely with enough effort, but never prevent it.
>
> digimer
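Concretely, fencing on a two-node cluster like this one would look roughly like the sketch below (fence_ipmilan is only one common agent; the addresses and credentials are placeholders):

    # one stonith device per node (hypothetical IPMI details)
    pcs stonith create fence-pf1 fence_ipmilan pcmk_host_list=pancakeFence1 \
        ipaddr=192.0.2.101 login=admin passwd=secret lanplus=1
    pcs stonith create fence-pf2 fence_ipmilan pcmk_host_list=pancakeFence2 \
        ipaddr=192.0.2.102 login=admin passwd=secret lanplus=1
    # fencing only counts once it is actually enabled cluster-wide
    pcs property set stonith-enabled=true

With stonith in place, an unresponsive node is powered off before its resources are recovered elsewhere, which is what actually rules out two primaries.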
> On 2017-11-14 10:45 PM, Garima wrote:
>> Hello All,
>>
>> A split-brain situation occurs when there is a drop in quorum, so that
>> status information is no longer exchanged between the two nodes of the
>> cluster. This can be avoided if quorum communication is maintained
>> between the nodes.
>> I have checked the code. In my opinion, these files need to be updated
>> (quorum.py/stonith.py) to avoid the split-brain situation and maintain
>> the Active-Passive configuration.
>>
>> Regards,
>> Garima
>>
>> From: Derek Wuelfrath [mailto:dwuelfr...@inverse.ca]
>> Sent: 13 November 2017 20:55
>> To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
>> Subject: Re: [ClusterLabs] Pacemaker responsible of DRBD and a systemd resource
>>
>> Hello Ken!
>>
>>> Make sure that the systemd service is not enabled. If pacemaker is
>>> managing a service, systemd can't also be trying to start and stop it.
>>
>> It is not. I made sure of this in the first place :)
>>
>>> Beyond that, the question is what log messages are there from around
>>> the time of the issue (on both nodes).
>>
>> Well, that's the thing. There are not many log messages telling what is
>> actually happening. The 'systemd' resource is not even trying to start
>> (there is nothing in either log for that resource). Here are the logs
>> from my last attempt.
>>
>> Scenario:
>> - Services were running on 'pancakeFence2'. DRBD was synced and connected.
>> - I rebooted 'pancakeFence2'. Services failed over to 'pancakeFence1'.
>> - After 'pancakeFence2' came back, services were running just fine on
>>   'pancakeFence1', but DRBD was in StandAlone due to a split-brain.
>>
>> Logs for pancakeFence1: https://pastebin.com/dVSGPP78
>> Logs for pancakeFence2: https://pastebin.com/at8qPkHE
>>
>> It really looks like the status-check mechanism of Corosync/Pacemaker for
>> a systemd resource forces the resource to "start" and therefore starts the
>> ones above that resource in the group (DRBD in this instance).
>> This does not happen for a regular OCF resource (IPaddr2, for example).
>>
>> Cheers!
>> -dw
>>
>> --
>> Derek Wuelfrath
>> dwuelfr...@inverse.ca :: +1.514.447.4918 (x110) :: +1.866.353.6153 (x110)
>> Inverse inc. :: Leaders behind SOGo (www.sogo.nu), PacketFence (www.packetfence.org) and Fingerbank (www.fingerbank.org)
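For what it is worth, the StandAlone-after-reboot symptom described above is the classic signature of DRBD running without fencing integration into the cluster. A sketch of the usual fix, assuming DRBD 8.4 with drbd-utils installed and a hypothetical resource named 'data':

    # /etc/drbd.d/data.res (fragment) -- let DRBD fence through Pacemaker
    resource data {
        disk {
            fencing resource-and-stonith;
        }
        handlers {
            # on loss of the peer, add a constraint banning it from Primary
            fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
            # lift that constraint once the peer has resynced
            after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
    }

With this in place, the rebooted node cannot promote a stale copy on its own, so the split-brain never forms.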
>> On Nov 10, 2017, at 11:39, Ken Gaillot <kgail...@redhat.com> wrote:
>>
>> On Thu, 2017-11-09 at 20:27 -0500, Derek Wuelfrath wrote:
>>> Hello there,
>>>
>>> First post here, but I have been following for a while!
>>
>> Welcome!
>>
>>> Here's my issue:
>>> we have been putting in place and running this type of cluster for a
>>> while and never really encountered this kind of problem.
>>>
>>> I recently set up a Corosync / Pacemaker / PCS cluster to manage DRBD
>>> along with various other resources. Some of these resources are
>>> systemd resources… this is the part where things are "breaking".
>>>
>>> Having a two-server cluster running only DRBD, or DRBD with an OCF
>>> IPaddr2 resource (the cluster IP in this instance), works just fine. I
>>> can easily move from one node to the other without any issue.
>>> As soon as I add a systemd resource to the resource group, things
>>> break. Moving from one node to the other using standby mode works
>>> just fine, but as soon as a Corosync / Pacemaker restart involves
>>> polling of a systemd resource, it seems to try to start the whole
>>> resource group and therefore creates a split-brain of the DRBD resource.
>>
>> My first two suggestions would be:
>>
>> Make sure that the systemd service is not enabled. If pacemaker is
>> managing a service, systemd can't also be trying to start and stop it.
>>
>> Fencing is the only way pacemaker can resolve split-brains and certain
>> other situations, so that will help in the recovery.
>>
>> Beyond that, the question is what log messages are there from around
>> the time of the issue (on both nodes).
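Ken's first suggestion is quick to verify; a sketch (the unit name is a placeholder):

    # the unit must be left entirely to Pacemaker: disabled at boot...
    systemctl is-enabled packetfence-haproxy    # should report "disabled"
    systemctl disable packetfence-haproxy
    # ...and driven only through its cluster resource from then on
    pcs resource show haproxy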
>>> It is the best explanation / description of the situation that I can
>>> give. If it needs any clarification, examples, … I am more than open
>>> to sharing them.
>>>
>>> Any guidance would be appreciated :)
>>>
>>> Here's the output of 'pcs config': https://pastebin.com/1TUvZ4X9
>>>
>>> Cheers!
>>> -dw
>>>
>>> --
>>> Derek Wuelfrath
>>> dwuelfr...@inverse.ca :: +1.514.447.4918 (x110) :: +1.866.353.6153 (x110)
>>> Inverse inc. :: Leaders behind SOGo (www.sogo.nu), PacketFence (www.packetfence.org) and Fingerbank (www.fingerbank.org)
>>
>> --
>> Ken Gaillot <kgail...@redhat.com>

> --
> Digimer
> Papers and Projects: https://alteeve.com/w/
> "I am, somehow, less interested in the weight and convolutions of Einstein's
> brain than in the near certainty that people of equal talent have lived and
> died in cotton fields and sweatshops." - Stephen Jay Gould

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org