I've driven for 22 years and never needed my seatbelt before, yet I still make sure I use it every time I am in a car. ;)
Why it happened now is perhaps an interesting question, but it is one I
would try to answer after fixing the core problem.

cheers,
digimer

On 2017-11-15 03:37 PM, Derek Wuelfrath wrote:
> And just to make sure, I’m not the kind of person who sticks to the
> “we always did it that way…” ;)
> Just trying to figure out why it suddenly breaks.
>
> -derek
>
> --
> Derek Wuelfrath
> dwuelfr...@inverse.ca :: +1.514.447.4918 (x110) :: +1.866.353.6153 (x110)
> Inverse inc. :: Leaders behind SOGo (www.sogo.nu), PacketFence
> (www.packetfence.org) and Fingerbank (www.fingerbank.org)
>
>> On Nov 15, 2017, at 15:30, Derek Wuelfrath <dwuelfr...@inverse.ca> wrote:
>>
>> I agree. The thing is, we have had this kind of setup widely deployed
>> for a while now and never ran into any issue. I am not sure whether
>> something changed in the Corosync/Pacemaker code or in the way systemd
>> resources are handled.
>>
>> As said, without a systemd resource, everything just works as it
>> should… 100% of the time. As soon as a systemd resource comes in, it
>> breaks.
>>
>> -derek
>>
>>> On Nov 14, 2017, at 23:03, Digimer <li...@alteeve.ca> wrote:
>>>
>>> Quorum doesn't prevent split-brains; stonith (fencing) does.
>>>
>>> https://www.alteeve.com/w/The_2-Node_Myth
>>>
>>> There is no way to use quorum alone to avoid a potential split-brain.
>>> You might be able to make it less likely with enough effort, but
>>> never prevent it.
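[Editor's note: a minimal sketch of what "fencing, not quorum" means in
practice for a two-node pcs cluster. The fence agent, node names, IPs and
credentials below are placeholders for illustration, not taken from this
thread; substitute whatever fence hardware the nodes actually have.]

```shell
# Check whether fencing is currently enabled at all:
pcs property show stonith-enabled

# Create one fence device per node. fence_ipmilan is only an example
# agent; ipaddr/login/passwd are hypothetical IPMI credentials.
pcs stonith create fence_pancake1 fence_ipmilan \
    pcmk_host_list="pancakeFence1" ipaddr="10.0.0.1" \
    login="admin" passwd="secret" \
    op monitor interval=60s

pcs stonith create fence_pancake2 fence_ipmilan \
    pcmk_host_list="pancakeFence2" ipaddr="10.0.0.2" \
    login="admin" passwd="secret" \
    op monitor interval=60s

# Make sure Pacemaker actually uses the fence devices:
pcs property set stonith-enabled=true
```

With fencing enabled, a node that stops responding is powered off before
the survivor takes over the resources, which is what actually prevents
both nodes writing to DRBD at once.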
>>>
>>> digimer
>>>
>>> On 2017-11-14 10:45 PM, Garima wrote:
>>>> Hello All,
>>>>
>>>> A split-brain situation occurs when there is a drop in quorum, so
>>>> that status information is no longer exchanged between the two nodes
>>>> of the cluster. This can be avoided if quorum is communicated
>>>> between both nodes. I have checked the code; in my opinion these
>>>> files (quorum.py/stonith.py) need to be updated to avoid the
>>>> split-brain situation and maintain the active-passive configuration.
>>>>
>>>> Regards,
>>>> Garima
>>>>
>>>> *From:* Derek Wuelfrath [mailto:dwuelfr...@inverse.ca]
>>>> *Sent:* 13 November 2017 20:55
>>>> *To:* Cluster Labs - All topics related to open-source clustering
>>>> welcomed <users@clusterlabs.org>
>>>> *Subject:* Re: [ClusterLabs] Pacemaker responsible of DRBD and a
>>>> systemd resource
>>>>
>>>> Hello Ken!
>>>>
>>>>     Make sure that the systemd service is not enabled. If pacemaker
>>>>     is managing a service, systemd can't also be trying to start and
>>>>     stop it.
>>>>
>>>> It is not. I made sure of this in the first place :)
>>>>
>>>>     Beyond that, the question is what log messages are there from
>>>>     around the time of the issue (on both nodes).
>>>>
>>>> Well, that’s the thing. There are not many log messages telling what
>>>> is actually happening. The ‘systemd’ resource is not even trying to
>>>> start (nothing in either log for that resource). Here are the logs
>>>> from my last attempt:
>>>>
>>>> Scenario:
>>>> - Services were running on ‘pancakeFence2’. DRBD was synced and
>>>>   connected
>>>> - I rebooted ‘pancakeFence2’.
>>>>   Services failed over to ‘pancakeFence1’
>>>> - After ‘pancakeFence2’ came back, services were running just fine
>>>>   on ‘pancakeFence1’ but DRBD was in StandAlone due to split-brain
>>>>
>>>> Logs for pancakeFence1: https://pastebin.com/dVSGPP78
>>>> Logs for pancakeFence2: https://pastebin.com/at8qPkHE
>>>>
>>>> It really looks like the status-check mechanism of
>>>> corosync/pacemaker for a systemd resource forces the resource to
>>>> “start” and therefore starts the ones above that resource in the
>>>> group (DRBD in this instance). This does not happen for a regular
>>>> OCF resource (IPaddr2, for example).
>>>>
>>>> Cheers!
>>>> -dw
>>>>
>>>> On Nov 10, 2017, at 11:39, Ken Gaillot <kgail...@redhat.com> wrote:
>>>>
>>>>     On Thu, 2017-11-09 at 20:27 -0500, Derek Wuelfrath wrote:
>>>>
>>>>         Hello there,
>>>>
>>>>         First post here, but I have been following for a while!
>>>>
>>>>     Welcome!
>>>>
>>>>         Here’s my issue: we have been putting in place and running
>>>>         this type of cluster for a while and never really
>>>>         encountered this kind of problem.
>>>>
>>>>         I recently set up a Corosync / Pacemaker / PCS cluster to
>>>>         manage DRBD along with various other resources. Some of
>>>>         these resources are systemd resources… this is the part
>>>>         where things are “breaking”.
>>>>
>>>>         Having a two-server cluster running only DRBD, or DRBD with
>>>>         an OCF IPaddr2 resource (cluster IP in this instance), works
>>>>         just fine. I can easily move from one node to the other
>>>>         without any issue.
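[Editor's note: for reference, the StandAlone state described above is
not repaired automatically; the standard DRBD 8.x manual split-brain
recovery sequence looks like the following, where "r0" stands in for the
actual resource name, which is not given in the thread.]

```shell
# On the node whose changes are to be DISCARDED (the split-brain victim):
drbdadm disconnect r0
drbdadm secondary r0
drbdadm connect --discard-my-data r0

# On the node whose data survives, reconnect it if it is also StandAlone:
drbdadm connect r0

# Verify that both nodes report Connected again:
drbdadm cstate r0
```

The victim then resynchronizes from the survivor; any writes it accepted
while split-brained are lost, which is why fencing to prevent the
split-brain in the first place is preferable to recovering from it.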
>>>>         As soon as I add a systemd resource to the resource group,
>>>>         things break. Moving from one node to the other using
>>>>         standby mode works just fine, but as soon as a Corosync /
>>>>         Pacemaker restart involves polling of a systemd resource,
>>>>         it seems to try to start the whole resource group and
>>>>         therefore creates a split-brain of the DRBD resource.
>>>>
>>>>     My first two suggestions would be:
>>>>
>>>>     Make sure that the systemd service is not enabled. If pacemaker
>>>>     is managing a service, systemd can't also be trying to start and
>>>>     stop it.
>>>>
>>>>     Fencing is the only way pacemaker can resolve split-brains and
>>>>     certain other situations, so that will help in the recovery.
>>>>
>>>>     Beyond that, the question is what log messages are there from
>>>>     around the time of the issue (on both nodes).
>>>>
>>>>         That is the best explanation / description of the situation
>>>>         that I can give. If it needs any clarification, examples, …
>>>>         I am more than open to sharing them.
>>>>
>>>>         Any guidance would be appreciated :)
>>>>
>>>>         Here’s the output of ‘pcs config’:
>>>>         https://pastebin.com/1TUvZ4X9
>>>>
>>>>         Cheers!
>>>>         -dw
>>>>
>>>>     --
>>>>     Ken Gaillot <kgail...@redhat.com>
>>>>
>>>>     _______________________________________________
>>>>     Users mailing list: Users@clusterlabs.org
>>>>     http://lists.clusterlabs.org/mailman/listinfo/users
>>>>
>>>>     Project Home: http://www.clusterlabs.org
>>>>     Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>     Bugs: http://bugs.clusterlabs.org
>>>
>>> --
>>> Digimer
>>> Papers and Projects: https://alteeve.com/w/
>>> "I am, somehow, less interested in the weight and convolutions of
>>> Einstein’s brain than in the near certainty that people of equal talent
>>> have lived and died in cotton fields and sweatshops."
>>> - Stephen Jay Gould