I agree. The thing is, we have had this kind of setup widely deployed for quite a while and never ran into any issue. Not sure if something changed in the Corosync/Pacemaker code or in the way it deals with systemd resources.

As said, without a systemd resource, everything just works as it should, 100% of the time. As soon as a systemd resource comes in, it breaks.
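For concreteness, the kind of stack being described can be reproduced along these lines (a rough sketch only -- the resource names, the IP, and the packetfence-haproxy unit are placeholders, and the syntax assumes the pcs 0.9 series current at the time):

    # hypothetical names throughout -- adjust to your own DRBD resource and unit
    pcs resource create data-drbd ocf:linbit:drbd drbd_resource=data op monitor interval=30s
    pcs resource master data-drbd-ms data-drbd \
        master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
    pcs resource create cluster-ip ocf:heartbeat:IPaddr2 ip=192.0.2.10 cidr_netmask=24
    pcs resource create haproxy systemd:packetfence-haproxy
    pcs resource group add pf-group cluster-ip haproxy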
-derek

--
Derek Wuelfrath
dwuelfr...@inverse.ca :: +1.514.447.4918 (x110) :: +1.866.353.6153 (x110)
Inverse inc. :: Leaders behind SOGo (www.sogo.nu), PacketFence (www.packetfence.org) and Fingerbank (www.fingerbank.org)

> On Nov 14, 2017, at 23:03, Digimer <li...@alteeve.ca> wrote:
>
> Quorum doesn't prevent split-brains, stonith (fencing) does.
>
> https://www.alteeve.com/w/The_2-Node_Myth
>
> There is no way to use quorum-only to avoid a potential split-brain. You
> might be able to make it less likely with enough effort, but never prevent it.
>
> digimer
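Concretely, fencing on a two-node cluster like this one would look roughly like the sketch below (fence_ipmilan is only one common agent; the addresses and credentials are placeholders):

    # one stonith device per node (hypothetical IPMI details)
    pcs stonith create fence-pf1 fence_ipmilan pcmk_host_list=pancakeFence1 \
        ipaddr=192.0.2.101 login=admin passwd=secret lanplus=1
    pcs stonith create fence-pf2 fence_ipmilan pcmk_host_list=pancakeFence2 \
        ipaddr=192.0.2.102 login=admin passwd=secret lanplus=1
    # fencing only counts once it is actually enabled cluster-wide
    pcs property set stonith-enabled=true

With stonith in place, an unresponsive node is powered off before its resources are recovered elsewhere, which is what actually rules out two primaries.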
> On 2017-11-14 10:45 PM, Garima wrote:
>> Hello All,
>>
>> A split-brain situation occurs when there is a drop in quorum, so that
>> status information is no longer exchanged between the two nodes of the
>> cluster. This can be avoided if quorum communication is maintained
>> between the nodes.
>> I have checked the code. In my opinion, these files need to be updated
>> (quorum.py/stonith.py) to avoid the split-brain situation and maintain
>> the Active-Passive configuration.
>>
>> Regards,
>> Garima
>>
>> From: Derek Wuelfrath [mailto:dwuelfr...@inverse.ca]
>> Sent: 13 November 2017 20:55
>> To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
>> Subject: Re: [ClusterLabs] Pacemaker responsible of DRBD and a systemd resource
>>
>> Hello Ken!
>>
>>> Make sure that the systemd service is not enabled. If pacemaker is
>>> managing a service, systemd can't also be trying to start and stop it.
>>
>> It is not. I made sure of this in the first place :)
>>
>>> Beyond that, the question is what log messages are there from around
>>> the time of the issue (on both nodes).
>>
>> Well, that's the thing. There are not many log messages telling what is
>> actually happening. The 'systemd' resource is not even trying to start
>> (there is nothing in either log for that resource). Here are the logs
>> from my last attempt.
>>
>> Scenario:
>> - Services were running on 'pancakeFence2'. DRBD was synced and connected.
>> - I rebooted 'pancakeFence2'. Services failed over to 'pancakeFence1'.
>> - After 'pancakeFence2' came back, services were running just fine on
>>   'pancakeFence1', but DRBD was in StandAlone due to a split-brain.
>>
>> Logs for pancakeFence1: https://pastebin.com/dVSGPP78
>> Logs for pancakeFence2: https://pastebin.com/at8qPkHE
>>
>> It really looks like the status-check mechanism of Corosync/Pacemaker for
>> a systemd resource forces the resource to "start" and therefore starts the
>> ones above that resource in the group (DRBD in this instance).
>> This does not happen for a regular OCF resource (IPaddr2, for example).
>>
>> Cheers!
>> -dw
>>
>> --
>> Derek Wuelfrath
>> dwuelfr...@inverse.ca :: +1.514.447.4918 (x110) :: +1.866.353.6153 (x110)
>> Inverse inc. :: Leaders behind SOGo (www.sogo.nu), PacketFence (www.packetfence.org) and Fingerbank (www.fingerbank.org)
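For what it is worth, the StandAlone-after-reboot symptom described above is the classic signature of DRBD running without fencing integration into the cluster. A sketch of the usual fix, assuming DRBD 8.4 with drbd-utils installed and a hypothetical resource named 'data':

    # /etc/drbd.d/data.res (fragment) -- let DRBD fence through Pacemaker
    resource data {
        disk {
            fencing resource-and-stonith;
        }
        handlers {
            # on loss of the peer, add a constraint banning it from Primary
            fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
            # lift that constraint once the peer has resynced
            after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
    }

With this in place, the rebooted node cannot promote a stale copy on its own, so the split-brain never forms.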
>> On Nov 10, 2017, at 11:39, Ken Gaillot <kgail...@redhat.com> wrote:
>>
>> On Thu, 2017-11-09 at 20:27 -0500, Derek Wuelfrath wrote:
>>> Hello there,
>>>
>>> First post here, but I have been following for a while!
>>
>> Welcome!
>>
>>> Here's my issue:
>>> we have been putting in place and running this type of cluster for a
>>> while and never really encountered this kind of problem.
>>>
>>> I recently set up a Corosync / Pacemaker / PCS cluster to manage DRBD
>>> along with various other resources. Some of these resources are
>>> systemd resources… this is the part where things are "breaking".
>>>
>>> Having a two-server cluster running only DRBD, or DRBD with an OCF
>>> IPaddr2 resource (the cluster IP in this instance), works just fine. I
>>> can easily move from one node to the other without any issue.
>>> As soon as I add a systemd resource to the resource group, things
>>> break. Moving from one node to the other using standby mode works
>>> just fine, but as soon as a Corosync / Pacemaker restart involves
>>> polling of a systemd resource, it seems to try to start the whole
>>> resource group and therefore creates a split-brain of the DRBD resource.
>>
>> My first two suggestions would be:
>>
>> Make sure that the systemd service is not enabled. If pacemaker is
>> managing a service, systemd can't also be trying to start and stop it.
>>
>> Fencing is the only way pacemaker can resolve split-brains and certain
>> other situations, so that will help in the recovery.
>>
>> Beyond that, the question is what log messages are there from around
>> the time of the issue (on both nodes).
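Ken's first suggestion is quick to verify; a sketch (the unit name is a placeholder):

    # the unit must be left entirely to Pacemaker: disabled at boot...
    systemctl is-enabled packetfence-haproxy    # should report "disabled"
    systemctl disable packetfence-haproxy
    # ...and driven only through its cluster resource from then on
    pcs resource show haproxy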
>>> It is the best explanation / description of the situation that I can
>>> give. If it needs any clarification, examples, … I am more than open
>>> to sharing them.
>>>
>>> Any guidance would be appreciated :)
>>>
>>> Here's the output of 'pcs config': https://pastebin.com/1TUvZ4X9
>>>
>>> Cheers!
>>> -dw
>>>
>>> --
>>> Derek Wuelfrath
>>> dwuelfr...@inverse.ca :: +1.514.447.4918 (x110) :: +1.866.353.6153 (x110)
>>> Inverse inc. :: Leaders behind SOGo (www.sogo.nu), PacketFence (www.packetfence.org) and Fingerbank (www.fingerbank.org)
>>
>> --
>> Ken Gaillot <kgail...@redhat.com>

> --
> Digimer
> Papers and Projects: https://alteeve.com/w/
> "I am, somehow, less interested in the weight and convolutions of Einstein's
> brain than in the near certainty that people of equal talent have lived and
> died in cotton fields and sweatshops." - Stephen Jay Gould

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org