Message: 1
Date: Wed, 7 Sep 2016 19:23:04 +0900
From: Digimer <li...@alteeve.ca>
To: Cluster Labs - All topics related to open-source clustering
        welcomed        <users@clusterlabs.org>
Subject: Re: [ClusterLabs] DRBD failover in Pacemaker
Message-ID: <b1e95242-1b0d-ed28-2ba8-d6b58d152...@alteeve.ca>
Content-Type: text/plain; charset=windows-1252

> no-quorum-policy: ignore
> stonith-enabled: false

You must have fencing configured.

CentOS 6 uses pacemaker with the cman plugin. So set up cman
(cluster.conf) to use the fence_pcmk passthrough agent, then set up proper
stonith in pacemaker (and test that it works). Finally, tell DRBD to use
'fencing resource-and-stonith;' and configure the 'crm-{un,}fence-peer.sh'
{un,}fence handlers.
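
A rough sketch of the two pieces, assuming DRBD 8.x and a resource called
'r0' (the node and cluster names are taken from your status output,
everything else is an example to adapt):

cluster.conf, redirecting cman's fencing to pacemaker:

  <cluster config_version="1" name="webcluster">
    <clusternodes>
      <clusternode name="node1" nodeid="1">
        <fence>
          <method name="pcmk-redirect">
            <device name="pcmk" port="node1"/>
          </method>
        </fence>
      </clusternode>
      <clusternode name="node2" nodeid="2">
        <fence>
          <method name="pcmk-redirect">
            <device name="pcmk" port="node2"/>
          </method>
        </fence>
      </clusternode>
    </clusternodes>
    <fencedevices>
      <fencedevice name="pcmk" agent="fence_pcmk"/>
    </fencedevices>
    <cman two_node="1" expected_votes="1"/>
  </cluster>

and the DRBD resource (the handler scripts normally ship in /usr/lib/drbd/,
check the paths on your install):

  resource r0 {
    disk {
      fencing resource-and-stonith;
    }
    handlers {
      fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
      after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
    # your existing device/disk/address sections stay as they are
  }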

See if that gets things working.

On 07/09/16 04:04 AM, Devin Ortner wrote:
> I have a 2-node cluster running CentOS 6.8 and Pacemaker with DRBD. I have 
> been using the "Clusters from Scratch" documentation to create my cluster and 
> I am running into a problem where DRBD is not failing over to the other node 
> when one goes down. Here is my "pcs status" prior to when it is supposed to 
> fail over:
> 
> ----------------------------------------------------------------------
> ------------------------------------------------
> 
> [root@node1 ~]# pcs status
> Cluster name: webcluster
> Last updated: Tue Sep  6 14:50:21 2016                Last change: Tue Sep  6 
> 14:50:17 2016 by root via crm_attribute on node1
> Stack: cman
> Current DC: node2 (version 1.1.14-8.el6_8.1-70404b0) - partition with 
> quorum
> 2 nodes and 5 resources configured
> 
> Online: [ node1 node2 ]
> 
> Full list of resources:
> 
>  Cluster_VIP  (ocf::heartbeat:IPaddr2):       Started node1
>  Master/Slave Set: ClusterDBclone [ClusterDB]
>      Masters: [ node1 ]
>      Slaves: [ node2 ]
>  ClusterFS    (ocf::heartbeat:Filesystem):    Started node1
>  WebSite      (ocf::heartbeat:apache):        Started node1
> 
> Failed Actions:
> * ClusterFS_start_0 on node2 'unknown error' (1): call=61, status=complete, 
> exitreason='none',
>     last-rc-change='Tue Sep  6 13:15:00 2016', queued=0ms, exec=40ms
> 
> 
> PCSD Status:
>   node1: Online
>   node2: Online
> 
> [root@node1 ~]#
> 
> When I put node1 in standby everything fails over except DRBD:
> ----------------------------------------------------------------------
> ----------------
> 
> [root@node1 ~]# pcs cluster standby node1
> [root@node1 ~]# pcs status
> Cluster name: webcluster
> Last updated: Tue Sep  6 14:53:45 2016                Last change: Tue Sep  6 
> 14:53:37 2016 by root via cibadmin on node2
> Stack: cman
> Current DC: node2 (version 1.1.14-8.el6_8.1-70404b0) - partition with 
> quorum
> 2 nodes and 5 resources configured
> 
> Node node1: standby
> Online: [ node2 ]
> 
> Full list of resources:
> 
>  Cluster_VIP  (ocf::heartbeat:IPaddr2):       Started node2
>  Master/Slave Set: ClusterDBclone [ClusterDB]
>      Slaves: [ node2 ]
>      Stopped: [ node1 ]
>  ClusterFS    (ocf::heartbeat:Filesystem):    Stopped
>  WebSite      (ocf::heartbeat:apache):        Started node2
> 
> Failed Actions:
> * ClusterFS_start_0 on node2 'unknown error' (1): call=61, status=complete, 
> exitreason='none',
>     last-rc-change='Tue Sep  6 13:15:00 2016', queued=0ms, exec=40ms
> 
> 
> PCSD Status:
>   node1: Online
>   node2: Online
> 
> [root@node1 ~]#
> 
> I have pasted the contents of "/var/log/messages" here:
> http://pastebin.com/0i0FMzGZ
> Here is my configuration: http://pastebin.com/HqqBV90p
> 
> When I unstandby node1, it comes back as the master for the DRBD and
> everything else stays running on node2 (which is fine because I haven't set up
> colocation constraints for that). Here is what I have after node1 is back:
> -----------------------------------------------------
> 
> [root@node1 ~]# pcs cluster unstandby node1
> [root@node1 ~]# pcs status
> Cluster name: webcluster
> Last updated: Tue Sep  6 14:57:46 2016                Last change: Tue Sep  6 
> 14:57:42 2016 by root via cibadmin on node1
> Stack: cman
> Current DC: node2 (version 1.1.14-8.el6_8.1-70404b0) - partition with 
> quorum
> 2 nodes and 5 resources configured
> 
> Online: [ node1 node2 ]
> 
> Full list of resources:
> 
>  Cluster_VIP  (ocf::heartbeat:IPaddr2):       Started node2
>  Master/Slave Set: ClusterDBclone [ClusterDB]
>      Masters: [ node1 ]
>      Slaves: [ node2 ]
>  ClusterFS    (ocf::heartbeat:Filesystem):    Started node1
>  WebSite      (ocf::heartbeat:apache):        Started node2
> 
> Failed Actions:
> * ClusterFS_start_0 on node2 'unknown error' (1): call=61, status=complete, 
> exitreason='none',
>     last-rc-change='Tue Sep  6 13:15:00 2016', queued=0ms, exec=40ms
> 
> 
> PCSD Status:
>   node1: Online
>   node2: Online
> 
> [root@node1 ~]#
> 
> Any help would be appreciated; I think there is something dumb that I'm
> missing.
> 
> Thank you.
> 
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

------------------------------

Message: 2
Date: Wed, 7 Sep 2016 08:51:36 -0600
From: Greg Woods <wo...@ucar.edu>
To: Cluster Labs - All topics related to open-source clustering
        welcomed        <users@clusterlabs.org>
Subject: Re: [ClusterLabs] DRBD failover in Pacemaker
Message-ID:
        <cakhxxfzp7nwunp0plfgd9j1yreramx4b4boyn_xyu3nrxpd...@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

On Tue, Sep 6, 2016 at 1:04 PM, Devin Ortner <devin.ort...@gtshq.onmicrosoft.com> wrote:

> Master/Slave Set: ClusterDBclone [ClusterDB]
>      Masters: [ node1 ]
>      Slaves: [ node2 ]
>  ClusterFS      (ocf::heartbeat:Filesystem):    Started node1
>

As Digimer said, you really need fencing when you are using DRBD. Otherwise
it's only a matter of time before your shared filesystem gets corrupted.

You also need an order constraint to be sure that the ClusterFS Filesystem
does not start until after the Master DRBD resource, and a colocation
constraint to ensure these are on the same node.
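
With the resource names from your status output, something along these
lines should do it (pcs 0.9 syntax as shipped on CentOS 6; double-check
against your version):

  pcs constraint colocation add ClusterFS with ClusterDBclone INFINITY with-rsc-role=Master
  pcs constraint order promote ClusterDBclone then start ClusterFS

That keeps ClusterFS on whichever node is DRBD master, and only mounts it
after the promotion has happened.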

--Greg

------------------------------

Message: 3
Date: Wed, 7 Sep 2016 10:02:45 -0500
From: Dmitri Maziuk <dmitri.maz...@gmail.com>
To: users@clusterlabs.org
Subject: Re: [ClusterLabs] DRBD failover in Pacemaker
Message-ID: <8c6f6527-e691-55ed-f2cb-602a6dcec...@gmail.com>
Content-Type: text/plain; charset=windows-1252; format=flowed

On 2016-09-06 14:04, Devin Ortner wrote:
> I have a 2-node cluster running CentOS 6.8 and Pacemaker with DRBD.
> I have been using the "Clusters from Scratch" documentation to create my
> cluster and I am running into a problem where DRBD is not failing over
> to the other node when one goes down.

I forget if Clusters From Scratch spells this out: you have to create the 
DRBD volume and let it finish the initial sync before you let pacemaker 
near it. Was 'cat /proc/drbd' showing UpToDate/UpToDate 
Primary/Secondary when you tried the failover?
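
For reference, a healthy, fully synced pair looks roughly like this before
you hand it to pacemaker (the version line and I/O counters will differ):

  # cat /proc/drbd
  version: 8.4.x (api:1/proto:86-101)
   0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----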

Ignore the "stonith is optional; you *must* use stonith" mantra du jour.

Dima
------------------
Thank you for the responses. I followed Digimer's instructions along with some 
information I had read on the DRBD site and configured fencing on the DRBD 
resource. I also configured STONITH using IPMI in Pacemaker. I set up Pacemaker 
first and verified that it kills the other node.

After configuring DRBD fencing, though, I ran into a problem where failover 
stopped working. If I disable fencing in DRBD, then when one node is taken 
offline Pacemaker kills it and everything fails over to the other node as I 
would expect; but with fencing enabled, the second node doesn't become master 
in DRBD until the first node has completely finished rebooting. This makes for 
a lot of downtime, and if one of the nodes had a hardware failure it would 
never fail over. I think it's something to do with the fencing scripts.

I am looking for complete redundancy, including in the event of hardware 
failure. Is there a way I can prevent split-brain while still allowing DRBD 
to fail over to the other node? Right now I have only STONITH configured in 
Pacemaker and fencing turned OFF in DRBD. So far it works as I want it to, but 
sometimes when communication is lost between the two nodes the wrong one ends 
up getting killed, and when that happens it results in split-brain on recovery. 
I hope I described the situation well enough for someone to offer a little 
help. I'm currently experimenting with the delays before STONITH to see if I 
can figure something out.
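
To illustrate the delay experiment: the idea, as I understand it, is to add a
delay to the stonith device that targets the node I want to survive, so that
in a split that node gets a chance to fence the other one first. Something
along these lines, where the IPMI addresses and credentials are placeholders:

  # fencing node1 is delayed 15s, so node1 should win a fence race
  pcs stonith create fence_node1 fence_ipmilan pcmk_host_list="node1" \
      ipaddr="192.168.1.101" login="admin" passwd="secret" lanplus="1" \
      delay="15" op monitor interval=60s

  # fencing node2 happens immediately
  pcs stonith create fence_node2 fence_ipmilan pcmk_host_list="node2" \
      ipaddr="192.168.1.102" login="admin" passwd="secret" lanplus="1" \
      op monitor interval=60s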

Thank you,
Devin

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
