Re: [ClusterLabs] DRBD failover in Pacemaker

2016-09-08 Thread Dimitri Maziuk
On 09/08/2016 06:33 PM, Digimer wrote:

> With 'fencing resource-and-stonith;' and a {un,}fence-handler set, DRBD
> will block when the peer is lost until the fence handler script returns
> indicating the peer was fenced/stonithed. In this way, the secondary
> WON'T promote to Primary while the peer is still Primary. It will only
> promote AFTER confirmation that the old Primary is gone. Thus, no
> split-brain.

In 7 or 8 years of running several DRBD pairs I had split brain about 5
times, and at least 2 of those were because I tugged on the crosslink
cable while mucking around in the back of the rack. Maybe if you run a
zillion stacked active-active resources on a 100-node cluster DRBD
split brain becomes a real problem; from where I'm sitting, stonith'ing
DRBD nodes is a solution in search of a problem.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: [ClusterLabs] Pacemaker quorum behavior

2016-09-08 Thread Jan Pokorný
On 08/09/16 10:20 -0400, Scott Greenlese wrote:
> Correction...
> 
> When I stopped pacemaker/corosync on the four (powered on / active)
> cluster node hosts,  I was having an issue with the gentle method of
> stopping the cluster (pcs cluster stop --all),

Can you elaborate on what went wrong with this gentle method, please?

If it seems to have gotten stuck, you can perhaps run some diagnostics like:

  pstree -p | grep -A5 $(pidof -x pcs)

across the nodes next time, to see what process(es) pcs is waiting on.

> so I ended up doing individual (pcs cluster kill ) on
> each of the four cluster nodes.   I then had to stop the virtual
> domains manually via 'virsh destroy ' on each host.
> Perhaps there was some residual node status affecting my quorum?

Hardly if corosync processes were indeed dead.

-- 
Jan (Poki)




Re: [ClusterLabs] Pacemaker quorum behavior

2016-09-08 Thread Scott Greenlese

Hi Klaus, thanks for your prompt and thoughtful feedback...

Please see my answers nested below (sections entitled, "Scott's Reply").
Thanks!

- Scott


Scott Greenlese ... IBM Solutions Test,  Poughkeepsie, N.Y.
  INTERNET:  swgre...@us.ibm.com
  PHONE:  8/293-7301 (845-433-7301)M/S:  POK 42HA/P966




From:   Klaus Wenninger 
To: users@clusterlabs.org
Date:   09/08/2016 10:59 AM
Subject:Re: [ClusterLabs] Pacemaker quorum behavior



On 09/08/2016 03:55 PM, Scott Greenlese wrote:
>
> Hi all...
>
> I have a few very basic questions for the group.
>
> I have a 5 node (Linux on Z LPARs) pacemaker cluster with 100
> VirtualDomain pacemaker-remote nodes
> plus 100 "opaque" VirtualDomain resources. The cluster is configured
> to be 'symmetric' and I have no
> location constraints on the 200 VirtualDomain resources (other than to
> prevent the opaque guests
> from running on the pacemaker remote node resources). My quorum is set
> as:
>
> quorum {
> provider: corosync_votequorum
> }
>
> As an experiment, I powered down one LPAR in the cluster, leaving 4
> powered up with the pcsd service up on the 4 survivors
> but corosync/pacemaker down (pcs cluster stop --all) on the 4
> survivors. I then started pacemaker/corosync on a single cluster
>

"pcs cluster stop" shuts down pacemaker & corosync on my test-cluster but
did you check the status of the individual services?

Scott's reply:

No, I only assumed that pacemaker was down because I got this back on my
pcs status
command from each cluster node:

[root@zs95kj VD]# date;for host in zs93KLpcs1 zs95KLpcs1 zs95kjpcs1
zs93kjpcs1 ; do ssh $host pcs status; done
Wed Sep  7 15:49:27 EDT 2016
Error: cluster is not currently running on this node
Error: cluster is not currently running on this node
Error: cluster is not currently running on this node
Error: cluster is not currently running on this node


What else should I check?  The pcsd service was still up, since I didn't
stop that anywhere. Should I have run  ps -ef |grep -e pacemaker -e corosync
to check the state before assuming it was really down?
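
(For reference, a minimal sketch of how to double-check that the daemons are
really gone on every node -- assuming systemd-based hosts like the ones above,
and reusing the host list from the loop earlier in this mail:)

  for host in zs93KLpcs1 zs95KLpcs1 zs95kjpcs1 zs93kjpcs1 ; do
      # "inactive"/"unknown" here means the unit really is down
      ssh $host "systemctl is-active corosync pacemaker"
      # belt and braces: look for any leftover daemon processes
      ssh $host "pidof corosync pacemakerd crmd || echo 'no cluster daemons'"
  done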




> node (pcs cluster start), and this resulted in the 200 VirtualDomain
> resources activating on the single node.
> This was not what I was expecting. I assumed that no resources would
> activate / start on any cluster nodes
> until 3 out of the 5 total cluster nodes had pacemaker/corosync running.
>
> After starting pacemaker/corosync on the single host (zs95kjpcs1),
> this is what I see :
>
> [root@zs95kj VD]# date;pcs status |less
> Wed Sep 7 15:51:17 EDT 2016
> Cluster name: test_cluster_2
> Last updated: Wed Sep 7 15:51:18 2016 Last change: Wed Sep 7 15:30:12
> 2016 by hacluster via crmd on zs93kjpcs1
> Stack: corosync
> Current DC: zs95kjpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) -
> partition with quorum
> 106 nodes and 304 resources configured
>
> Node zs93KLpcs1: pending
> Node zs93kjpcs1: pending
> Node zs95KLpcs1: pending
> Online: [ zs95kjpcs1 ]
> OFFLINE: [ zs90kppcs1 ]
>
> .
> .
> .
> PCSD Status:
> zs93kjpcs1: Online
> zs95kjpcs1: Online
> zs95KLpcs1: Online
> zs90kppcs1: Offline
> zs93KLpcs1: Online
>
> So, what exactly constitutes an "Online" vs. "Offline" cluster node
> w.r.t. quorum calculation? Seems like in my case, it's "pending" on 3
> nodes,
> so where does that fall? Any why "pending"? What does that mean?
>
> Also, what exactly is the cluster's expected reaction to quorum loss?
> Cluster resources will be stopped or something else?
>
Depends on how you configure it using cluster property no-quorum-policy
(default: stop).
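
(As an illustration -- a sketch, not from the original mail -- the property can
be inspected and changed with pcs; the value names are the ones Pacemaker 1.1
accepts:)

  pcs property show no-quorum-policy       # current setting
  pcs property set no-quorum-policy=stop   # the default
  # accepted values:
  #   ignore  - pretend quorum is held and keep managing resources
  #   stop    - stop all resources in the partition that lost quorum
  #   freeze  - keep already-running resources, but start nothing new
  #   suicide - fence every node in the partition that lost quorum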

Scott's reply:

This is how the policy is configured:

[root@zs95kj VD]# date;pcs config |grep quorum
Thu Sep  8 13:18:33 EDT 2016
 no-quorum-policy: stop

What should I expect with the 'stop' setting?


>
>
> Where can I find this documentation?
>
http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/

Scott's reply:

OK, I'll keep looking thru this doc, but I don't easily find the
no-quorum-policy explained.

Thanks..


>
>
> Thanks!
>
> Scott Greenlese - IBM Solution Test Team.
>
>
>
> Scott Greenlese ... IBM Solutions Test, Poughkeepsie, N.Y.
> INTERNET: swgre...@us.ibm.com
> PHONE: 8/293-7301 (845-433-7301) M/S: POK 42HA/P966
>
>
>


[ClusterLabs] Pacemaker migration - how to?

2016-09-08 Thread Nurit Vilosny
Hi everyone,
I have a very basic question that I couldn't find an answer for.
I am using Pacemaker to control a 3-node cluster, with a private
application that works in an active - standby - standby mode.
My nodes have priorities for which one is preferred to migrate to; I implemented
this via location constraint scores.
I want to give my user the ability to migrate / fail over from the active node
to one of the standbys, or to a specific standby.
What is the correct way to do it?
Currently I am changing the location constraint scores to make Pacemaker move my
resources, but I think this method is wrong.

Thanks,
Nurit
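
(Not from the original mail, but as a sketch of the usual approach: "pcs
resource move" creates a temporary cli- location constraint for you, which you
clear again once the move is done so the permanent scores apply once more.
Resource and node names below are placeholders:)

  # push the resource to a chosen standby node
  pcs resource move my_app node2

  # afterwards, remove the temporary constraint created by the move
  pcs resource clear my_app
  # (on older pcs versions: find it with "pcs constraint --full" and
  #  remove it with "pcs constraint remove <constraint-id>")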


Re: [ClusterLabs] DRBD failover in Pacemaker

2016-09-08 Thread Dmitri Maziuk

On 2016-09-08 02:03, Digimer wrote:


You need to solve the problem with fencing in DRBD. Leaving it off WILL
result in a split-brain eventually, full stop. With working fencing, you
will NOT get a split-brain, full stop.


"Split brain is a situation where, due to temporary failure of all 
network links between cluster nodes, and possibly due to intervention by 
a cluster management software or human error, both nodes switched to the 
primary role while disconnected."

 -- DRBD Users Guide 8.4 # 2.9 Split brain notification.

About the only practical problem with *DRBD* split brain under pacemaker 
is that pacemaker won't let you run "drbdadm secondary && drbdadm 
connect --discard-my-data" as easily as the busted ancient code did.
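
(For completeness, the manual recovery the DRBD guide describes looks roughly
like the sketch below; "r0" is a placeholder resource name, and the commands
run on the node whose changes are to be thrown away:)

  # on the split-brain "victim":
  drbdadm disconnect r0
  drbdadm secondary r0
  drbdadm connect --discard-my-data r0

  # on the surviving node, only if it sits in StandAlone state:
  drbdadm connect r0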


Dima




Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-09-08 Thread Digimer
On 08/09/16 06:51 PM, Shermal Fernando wrote:
> Hi Jehan-Guillaume,
> 
> Sorry for disturbing you. This is really important for us to pass this test 
> on the pacemaker resiliency and robustness. 
> To my understanding, it's the pacemakerd who feeds the watchdog. If only the 
> crmd is hung, fencing will not work. Am I correct here?
> 
> Regards,
> Shermal Fernando

Watchdog fencing is not ideal. If you're running a critical enough
environment, consider using IPMI or other "real" fencing methods.
Personally, we use (and recommend) IPMI as a primary fence method with a
pair of switched PDUs as a backup fence method. This provides full
coverage and is generally a lot faster.
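
(A sketch of what an IPMI-based primary fence device can look like in pcs --
the device name, address and credentials below are placeholders:)

  pcs stonith create fence_node1_ipmi fence_ipmilan \
      pcmk_host_list="node1" ipaddr="10.20.30.1" \
      login="admin" passwd="secret" lanplus="1" \
      op monitor interval=60s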

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



[ClusterLabs] Pacemaker quorum behavior

2016-09-08 Thread Scott Greenlese

Hi all...

I have a few very basic questions for the group.

I have a 5 node (Linux on Z LPARs) pacemaker cluster with 100 VirtualDomain
pacemaker-remote nodes
plus 100 "opaque" VirtualDomain resources. The cluster is configured to be
'symmetric' and I have no
location constraints on the 200 VirtualDomain resources (other than to
prevent the opaque guests
from running on the pacemaker remote node resources).  My quorum is set as:

quorum {
provider: corosync_votequorum
}
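
(Side note, not part of the original post: corosync 2.x votequorum has options
that influence when a freshly restarted cluster considers itself quorate; a
sketch of a quorum section using one of them, in case it is relevant here:)

quorum {
provider: corosync_votequorum
# after a full cluster shutdown, do not become quorate until all
# nodes have been seen at least once
wait_for_all: 1
}

(The live vote count can be checked at any time with "corosync-quorumtool -s".)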

As an experiment, I powered down one LPAR in the cluster, leaving 4 powered
up with the pcsd service up on the 4 survivors
but corosync/pacemaker down (pcs cluster stop --all) on the 4 survivors.
I then started pacemaker/corosync on a single cluster
node (pcs cluster start), and this resulted in the 200 VirtualDomain
resources activating on the single node.
This was not what I was expecting.  I assumed that no resources would
activate / start on any cluster nodes
until 3 out of the 5 total cluster nodes had pacemaker/corosync running.

After starting pacemaker/corosync on the single host (zs95kjpcs1), this is
what I see :

[root@zs95kj VD]# date;pcs status |less
Wed Sep  7 15:51:17 EDT 2016
Cluster name: test_cluster_2
Last updated: Wed Sep  7 15:51:18 2016  Last change: Wed Sep  7
15:30:12 2016 by hacluster via crmd on zs93kjpcs1
Stack: corosync
Current DC: zs95kjpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) - partition
with quorum
106 nodes and 304 resources configured

Node zs93KLpcs1: pending
Node zs93kjpcs1: pending
Node zs95KLpcs1: pending
Online: [ zs95kjpcs1 ]
OFFLINE: [ zs90kppcs1 ]

.
.
.
PCSD Status:
  zs93kjpcs1: Online
  zs95kjpcs1: Online
  zs95KLpcs1: Online
  zs90kppcs1: Offline
  zs93KLpcs1: Online

So, what exactly constitutes an "Online" vs. "Offline" cluster node w.r.t.
quorum calculation?   Seems like in my case, it's "pending" on 3 nodes,
so where does that fall?   And why "pending"?  What does that mean?

Also, what exactly is the cluster's expected reaction to quorum loss?
Cluster resources will be stopped or something else?

Where can I find this documentation?

Thanks!

Scott Greenlese -  IBM Solution Test Team.



Scott Greenlese ... IBM Solutions Test,  Poughkeepsie, N.Y.
  INTERNET:  swgre...@us.ibm.com
  PHONE:  8/293-7301 (845-433-7301)M/S:  POK 42HA/P966


Re: [ClusterLabs] Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-09-08 Thread Klaus Wenninger
On 09/08/2016 02:28 PM, Ulrich Windl wrote:
 Klaus Wenninger  schrieb am 08.09.2016 um 09:13 in
> Nachricht <4c828344-44da-1d93-b43f-a305cfaa5...@redhat.com>:
>> On 09/08/2016 08:55 AM, Digimer wrote:
>>> On 08/09/16 03:47 PM, Ulrich Windl wrote:
>>> Shermal Fernando  schrieb am 08.09.2016 um 
>>> 06:41 
>> in
 Nachricht
 <8ce6e8d87f896546b9c65ed80d30a4336578c...@lg-spmb-mbx02.lseg.stockex.local>:
> The whole cluster will fail if the DC (crm daemon) is frozen due to CPU 
> starvation or hanging while trying to perform a IO operation.  
> Please share some thoughts on this issue.
 What is "the whole cluster will fail"? If the DC times out, some recovery 
>> will take place.
>>> Yup. The starved node should be declared lost by corosync, the remaining
>>> nodes reform and if they're still quorate, the hung node should be
>>> fenced. Recovery occur and life goes on.
>> Didn't happen in my test (SIGSTOP to crmd).
>> Might be a configuration mistake though...
>> Even had sbd with a watchdog active (amongst
>> other - real - fencing devices).
>> Thinking if it might make sense so tickle the
>> crmd-API from sbd-pacemaker-watcher ...
> OK, so we mix "DC" and crmd. crmd is just a part of the DC. I guess if 
> corosync is up and happy, but crmd is silent, the cluster just thinks that 
> the DC has nothing to say.
> But I still wonder what will happen if crmd is goinf to send some reply to a 
> command.

The terminology just got a little loose during the discussion. We did stop crmd on the DC.

>
>>> Unless you don't have fencing, then may $deity of mercy. ;)
>>>
>>


[ClusterLabs] Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-09-08 Thread Ulrich Windl
>>> Klaus Wenninger  schrieb am 08.09.2016 um 09:13 in
Nachricht <4c828344-44da-1d93-b43f-a305cfaa5...@redhat.com>:
> On 09/08/2016 08:55 AM, Digimer wrote:
>> On 08/09/16 03:47 PM, Ulrich Windl wrote:
>> Shermal Fernando  schrieb am 08.09.2016 um 
>> 06:41 
> in
>>> Nachricht
>>> <8ce6e8d87f896546b9c65ed80d30a4336578c...@lg-spmb-mbx02.lseg.stockex.local>:
 The whole cluster will fail if the DC (crm daemon) is frozen due to CPU 
 starvation or hanging while trying to perform a IO operation.  
 Please share some thoughts on this issue.
>>> What is "the whole cluster will fail"? If the DC times out, some recovery 
> will take place.
>> Yup. The starved node should be declared lost by corosync, the remaining
>> nodes reform and if they're still quorate, the hung node should be
>> fenced. Recovery occur and life goes on.
> Didn't happen in my test (SIGSTOP to crmd).
> Might be a configuration mistake though...
> Even had sbd with a watchdog active (amongst
> other - real - fencing devices).
> Thinking if it might make sense so tickle the
> crmd-API from sbd-pacemaker-watcher ...

OK, so we mix up "DC" and crmd. crmd is just a part of the DC. I guess if corosync 
is up and happy, but crmd is silent, the cluster just thinks that the DC has 
nothing to say.
But I still wonder what will happen if crmd is going to send some reply to a 
command.

>>
>> Unless you don't have fencing, then may $deity of mercy. ;)
>>
> 
> 


Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-09-08 Thread Jehan-Guillaume de Rorthais
On Thu, 8 Sep 2016 09:51:27 +
Shermal Fernando  wrote:

> Hi Jehan-Guillaume,
> 
> Sorry for disturbing you. This is really important for us to pass this test
> on the pacemaker resiliency and robustness. To my understanding, it's the
> pacemakerd who feeds the watchdog. If only the crmd is hung, fencing will not
> work. Am I correct here?

I guess yes.

I am talking about a scenario where the server is under high load (fork bomb,
swap storm, ...), not just crmd being hung for some reason.


> -Original Message-
> From: Jehan-Guillaume de Rorthais [mailto:j...@dalibo.com] 
> Sent: Thursday, September 08, 2016 3:12 PM
> To: Shermal Fernando
> Cc: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster
> decisions are delayed infinitely
> 
> On Thu, 8 Sep 2016 08:58:15 +
> Shermal Fernando  wrote:
> 
> > Hi Jehan-Guillaume,
> > 
> > Does this means watchdog will serf-terminate the machine when the crm 
> > daemon is frozen?
> 
> This means that if the machine is under such a load that PAcemaker is not
> able to feed the watchdog, the watchdog will fence the machine itself.
> 
> > -Original Message-
> > From: Jehan-Guillaume de Rorthais [mailto:j...@dalibo.com]
> > Sent: Thursday, September 08, 2016 12:52 PM
> > To: Digimer
> > Cc: Cluster Labs - All topics related to open-source clustering 
> > welcomed
> > Subject: Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, 
> > cluster decisions are delayed infinitely
> > 
> > On Thu, 8 Sep 2016 15:55:50 +0900
> > Digimer  wrote:
> > 
> > > On 08/09/16 03:47 PM, Ulrich Windl wrote:
> > >  Shermal Fernando  schrieb am
> > >  08.09.2016 um
> > >  06:41 in
> > > > Nachricht
> > > > <8ce6e8d87f896546b9c65ed80d30a4336578c...@lg-spmb-mbx02.lseg.stockex.local>:
> > > >> The whole cluster will fail if the DC (crm daemon) is frozen due 
> > > >> to CPU starvation or hanging while trying to perform a IO operation.
> > > >> Please share some thoughts on this issue.
> > > > 
> > > > What is "the whole cluster will fail"? If the DC times out, some 
> > > > recovery will take place.
> > > 
> > > Yup. The starved node should be declared lost by corosync, the 
> > > remaining nodes reform and if they're still quorate, the hung node 
> > > should be fenced. Recovery occur and life goes on.
> > 
> > +1
> > 
> > And fencing might either come from outside, or just from the server 
> > itself using watchdog.



Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-09-08 Thread Shermal Fernando
Hi Jehan-Guillaume,

Sorry for disturbing you. It is really important for us to pass this test of 
pacemaker's resiliency and robustness. 
To my understanding, it's pacemakerd that feeds the watchdog. If only the 
crmd is hung, fencing will not work. Am I correct here?

Regards,
Shermal Fernando







-Original Message-
From: Jehan-Guillaume de Rorthais [mailto:j...@dalibo.com] 
Sent: Thursday, September 08, 2016 3:12 PM
To: Shermal Fernando
Cc: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster 
decisions are delayed infinitely

On Thu, 8 Sep 2016 08:58:15 +
Shermal Fernando  wrote:

> Hi Jehan-Guillaume,
> 
> Does this means watchdog will serf-terminate the machine when the crm 
> daemon is frozen?

This means that if the machine is under such a load that Pacemaker is not able 
to feed the watchdog, the watchdog will fence the machine itself.

> -Original Message-
> From: Jehan-Guillaume de Rorthais [mailto:j...@dalibo.com]
> Sent: Thursday, September 08, 2016 12:52 PM
> To: Digimer
> Cc: Cluster Labs - All topics related to open-source clustering 
> welcomed
> Subject: Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, 
> cluster decisions are delayed infinitely
> 
> On Thu, 8 Sep 2016 15:55:50 +0900
> Digimer  wrote:
> 
> > On 08/09/16 03:47 PM, Ulrich Windl wrote:
> >  Shermal Fernando  schrieb am
> >  08.09.2016 um
> >  06:41 in
> > > Nachricht
> > > <8ce6e8d87f896546b9c65ed80d30a4336578c...@lg-spmb-mbx02.lseg.stockex.local>:
> > >> The whole cluster will fail if the DC (crm daemon) is frozen due 
> > >> to CPU starvation or hanging while trying to perform a IO operation.
> > >> Please share some thoughts on this issue.
> > > 
> > > What is "the whole cluster will fail"? If the DC times out, some 
> > > recovery will take place.
> > 
> > Yup. The starved node should be declared lost by corosync, the 
> > remaining nodes reform and if they're still quorate, the hung node 
> > should be fenced. Recovery occur and life goes on.
> 
> +1
> 
> And fencing might either come from outside, or just from the server 
> itself using watchdog.




Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-09-08 Thread Klaus Wenninger
On 09/08/2016 10:58 AM, Shermal Fernando wrote:
> Hi Jehan-Guillaume,
>
> Does this means watchdog will serf-terminate the machine when the crm daemon 
> is frozen?

Would be desirable but doesn't seem to happen - at least not so far - will
see what I can do on that front.
 
>
> Regards,
> Shermal Fernando
>
>
>
>
>
>
>
>
> -Original Message-
> From: Jehan-Guillaume de Rorthais [mailto:j...@dalibo.com] 
> Sent: Thursday, September 08, 2016 12:52 PM
> To: Digimer
> Cc: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster 
> decisions are delayed infinitely
>
> On Thu, 8 Sep 2016 15:55:50 +0900
> Digimer  wrote:
>
>> On 08/09/16 03:47 PM, Ulrich Windl wrote:
>> Shermal Fernando  schrieb am 
>> 08.09.2016 um
>> 06:41 in
>>> Nachricht
>>> <8ce6e8d87f896546b9c65ed80d30a4336578c...@lg-spmb-mbx02.lseg.stockex.local>:
 The whole cluster will fail if the DC (crm daemon) is frozen due to 
 CPU starvation or hanging while trying to perform a IO operation.
 Please share some thoughts on this issue.
>>> What is "the whole cluster will fail"? If the DC times out, some 
>>> recovery will take place.
>> Yup. The starved node should be declared lost by corosync, the 
>> remaining nodes reform and if they're still quorate, the hung node 
>> should be fenced. Recovery occur and life goes on.
> +1
>
> And fencing might either come from outside, or just from the server itself 
> using watchdog.
>
> --
> Jehan-Guillaume (ioguix) de Rorthais
> Dalibo
>


Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-09-08 Thread Shermal Fernando
Hi Jehan-Guillaume,

Does this mean the watchdog will self-terminate the machine when the crm daemon is 
frozen?

Regards,
Shermal Fernando








-Original Message-
From: Jehan-Guillaume de Rorthais [mailto:j...@dalibo.com] 
Sent: Thursday, September 08, 2016 12:52 PM
To: Digimer
Cc: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster 
decisions are delayed infinitely

On Thu, 8 Sep 2016 15:55:50 +0900
Digimer  wrote:

> On 08/09/16 03:47 PM, Ulrich Windl wrote:
>  Shermal Fernando  schrieb am 
>  08.09.2016 um
>  06:41 in
> > Nachricht
> > <8ce6e8d87f896546b9c65ed80d30a4336578c...@lg-spmb-mbx02.lseg.stockex.local>:
> >> The whole cluster will fail if the DC (crm daemon) is frozen due to 
> >> CPU starvation or hanging while trying to perform a IO operation.
> >> Please share some thoughts on this issue.
> > 
> > What is "the whole cluster will fail"? If the DC times out, some 
> > recovery will take place.
> 
> Yup. The starved node should be declared lost by corosync, the 
> remaining nodes reform and if they're still quorate, the hung node 
> should be fenced. Recovery occur and life goes on.

+1

And fencing might either come from outside, or just from the server itself 
using watchdog.

--
Jehan-Guillaume (ioguix) de Rorthais
Dalibo



Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-09-08 Thread Klaus Wenninger
On 09/08/2016 08:55 AM, Digimer wrote:
> On 08/09/16 03:47 PM, Ulrich Windl wrote:
> Shermal Fernando  schrieb am 08.09.2016 um 
> 06:41 in
>> Nachricht
>> <8ce6e8d87f896546b9c65ed80d30a4336578c...@lg-spmb-mbx02.lseg.stockex.local>:
>>> The whole cluster will fail if the DC (crm daemon) is frozen due to CPU 
>>> starvation or hanging while trying to perform a IO operation.  
>>> Please share some thoughts on this issue.
>> What is "the whole cluster will fail"? If the DC times out, some recovery 
>> will take place.
> Yup. The starved node should be declared lost by corosync, the remaining
> nodes reform and if they're still quorate, the hung node should be
> fenced. Recovery occur and life goes on.
Didn't happen in my test (SIGSTOP to crmd).
Might be a configuration mistake though...
Even had sbd with a watchdog active (amongst
other - real - fencing devices).
Thinking if it might make sense to tickle the
crmd-API from sbd-pacemaker-watcher ...
>
> Unless you don't have fencing, then may $deity of mercy. ;)
>
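
(For anyone who wants to reproduce the scenario -- freezing only crmd on the
current DC -- a rough sketch:)

  # on the DC, freeze the crm daemon without killing it
  kill -STOP $(pidof crmd)

  # watch from another node whether a new DC gets elected,
  # e.g. with:  crm_mon -1 | grep "Current DC"

  # thaw it again afterwards
  kill -CONT $(pidof crmd)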




Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-09-08 Thread Shermal Fernando
If the DC's crm daemon is frozen while corosync keeps running without problems, the 
DC will not time out. The frozen DC will stay there forever.

Regards,
Shermal Fernando








-Original Message-
From: Ulrich Windl [mailto:ulrich.wi...@rz.uni-regensburg.de] 
Sent: Thursday, September 08, 2016 12:18 PM
To: users@clusterlabs.org
Subject: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster decisions 
are delayed infinitely

>>> Shermal Fernando  schrieb am 08.09.2016 
>>> um 06:41 in
Nachricht
<8ce6e8d87f896546b9c65ed80d30a4336578c...@lg-spmb-mbx02.lseg.stockex.local>:
> The whole cluster will fail if the DC (crm daemon) is frozen due to 
> CPU starvation or hanging while trying to perform a IO operation.
> Please share some thoughts on this issue.

What is "the whole cluster will fail"? If the DC times out, some recovery will 
take place.

> 
> Regards,
> Shermal Fernando
> 
> 
> 
> 
> 
> 
> 
> -Original Message-
> From: Klaus Wenninger [mailto:kwenn...@redhat.com]
> Sent: Monday, September 05, 2016 6:42 PM
> To: users@clusterlabs.org; develop...@clusterlabs.org
> Subject: Re: [ClusterLabs] When the DC crmd is frozen, cluster 
> decisions are delayed infinitely
> 
> On 09/03/2016 08:42 PM, Shermal Fernando wrote:
>>
>> Hi,
>>
>>  
>>
>> Currently our system have 99.96% uptime. But our goal is to increase 
>> it beyond 99.999%. Now we are studying the 
>> reliability/performance/features of pacemaker to replace the existing 
>> clustering solution.
>>
>>  
>>
>> While testing pacemaker, I have encountered a problem. If the DC (crm
>> daemon) is frozen by sending the SIGSTOP signal, crmds in other 
>> machines never start election to elect a new DC. Therefore 
>> fail-overs, resource restartings and other cluster decisions will be 
>> delayed until the DC is unfrozen.
>>
>> Is this the default behavior of pacemaker or is it due to a 
>> misconfiguration? Is there any way to avoid this single point of failure?
>>
>>  
>>
>> For the testing, we use Pacemaker 1.1.12 with Corosync 2.3.3 in SLES
>> 12 SP1 operation system.
>>
> 
> Guess I can reproduce that with pacemaker 1.1.15 & corosync 2.3.6.
> I'm having sbd with pacemaker-watcher running as well on the nodes.
> As the node-health is not updated and the cib can be read sbd is happy 
> - as to be expected.
> Maybe we could at least add something into sbd-pacemaker-watcher to 
> detect the issue ... thinking ...
> 
> Regards,
> Klaus
> 
>>  
>>
>>  
>>
>> Regards,
>>
>> Shermal Fernando
>>
>>  
>>
>>  
>>
>>  
>>
>>  
>>
>>  
>>
>>  
>>
>>  
>>

Re: [ClusterLabs] DRBD failover in Pacemaker

2016-09-08 Thread Digimer
> Thank you for the responses, I followed Digimer's instructions along with 
> some information I had read on the DRBD site and configured fencing on the 
> DRBD resource. I also configured STONITH using IPMI in Pacemaker. I setup 
> Pacemaker first and verified that it kills the other node. 
> 
> After configuring DRBD fencing though I ran into a problem where failover 
> stopped working. If I disable fencing in DRBD when one node is taken offline 
> pacemaker kills it and everything fails over to the other as I would expect, 
> but with fencing enabled the second node doesn't become master in DRBD until 
> the first node completely finishes rebooting. This makes for a lot of 
> downtime, and if one of the nodes has a hardware failure it would never fail 
> over. I think its something to do with the fencing scripts. 
> 
> I am looking for complete redundancy including in the event of hardware 
> failure. Is there a way I can prevent Split-Brain while still allowing for 
> DRBD to failover to the other node? Right now I have only STONITH configured 
> in pacemaker and fencing turned OFF in DRBD. So far it works as I want it to 
> but sometimes when communication is lost between the two nodes the wrong one 
> ends up getting killed, and when that happens it results in Split-Brain on 
> recovery. I hope I described the situation well enough for someone to offer a 
> little help. I'm currently experimenting with the delays before STONITH to 
> see if I can figure something out.
> 
> Thank you,
> Devin

You need to solve the problem with fencing in DRBD. Leaving it off WILL
result in a split-brain eventually, full stop. With working fencing, you
will NOT get a split-brain, full stop.

With working fencing, nodes will block if fencing fails. So as an
example, if IPMI fencing fails because the IPMI BMC died with the
host, then the surviving node(s) will hang. The logic is that it is
better to hang than to risk split-brain/corruption.

If fencing via IPMI works, then pacemaker should be told as much by
fence_ipmilan and recover as soon as the fence agent exits. If it
doesn't recover until the node returns, fencing is NOT configured
properly (or otherwise not working).

If you want to make sure that the cluster will recover no matter what,
then you will need a backup fence method. We do this by using IPMI as
the primary fence method and a pair of switched PDUs as a backup. So
with this setup, if a node fails, first pacemaker will try to shoot the
peer using IPMI. If IPMI fails (say because the host lost all power),
pacemaker gives up and moves on to PDU fencing. In this case, both PDUs
are called to open the circuits feeding the lost node, thus ensuring it
is off.

If for some reason both methods fail, pacemaker goes back to IPMI and
tries that again, then on to PDUs, ... and will loop until one of the
methods succeeds, leaving the cluster (intentionally) hung in the mean time.
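
(To illustrate the escalation order described above with pcs fencing levels --
the stonith device names are hypothetical and assumed to already exist:)

  # level 1: try IPMI first
  pcs stonith level add 1 node1 fence_node1_ipmi
  # level 2: if IPMI fails, cut power via both switched PDUs
  pcs stonith level add 2 node1 fence_node1_pdu1,fence_node1_pdu2
  # repeat both levels for the other node(s)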

digimer

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-09-08 Thread Digimer
On 08/09/16 03:47 PM, Ulrich Windl wrote:
 Shermal Fernando  schrieb am 08.09.2016 um 
 06:41 in
> Nachricht
> <8ce6e8d87f896546b9c65ed80d30a4336578c...@lg-spmb-mbx02.lseg.stockex.local>:
>> The whole cluster will fail if the DC (crm daemon) is frozen due to CPU 
>> starvation or hanging while trying to perform a IO operation.  
>> Please share some thoughts on this issue.
> 
> What is "the whole cluster will fail"? If the DC times out, some recovery 
> will take place.

Yup. The starved node should be declared lost by corosync, the remaining
nodes reform, and if they're still quorate, the hung node should be
fenced. Recovery occurs and life goes on.

Unless you don't have fencing, then may $deity of mercy. ;)

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



[ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-09-08 Thread Ulrich Windl
>>> Shermal Fernando  schrieb am 08.09.2016 um 
>>> 06:41 in
Nachricht
<8ce6e8d87f896546b9c65ed80d30a4336578c...@lg-spmb-mbx02.lseg.stockex.local>:
> The whole cluster will fail if the DC (crm daemon) is frozen due to CPU 
> starvation or hanging while trying to perform a IO operation.  
> Please share some thoughts on this issue.

What is "the whole cluster will fail"? If the DC times out, some recovery will 
take place.

> 
> Regards,
> Shermal Fernando
> 
> 
> 
> 
> 
> 
> 
> -Original Message-
> From: Klaus Wenninger [mailto:kwenn...@redhat.com] 
> Sent: Monday, September 05, 2016 6:42 PM
> To: users@clusterlabs.org; develop...@clusterlabs.org 
> Subject: Re: [ClusterLabs] When the DC crmd is frozen, cluster decisions are 
> delayed infinitely
> 
> On 09/03/2016 08:42 PM, Shermal Fernando wrote:
>>
>> Hi,
>>
>>  
>>
>> Currently our system have 99.96% uptime. But our goal is to increase 
>> it beyond 99.999%. Now we are studying the 
>> reliability/performance/features of pacemaker to replace the existing 
>> clustering solution.
>>
>>  
>>
>> While testing pacemaker, I have encountered a problem. If the DC (crm
>> daemon) is frozen by sending the SIGSTOP signal, crmds in other 
>> machines never start election to elect a new DC. Therefore fail-overs, 
>> resource restartings and other cluster decisions will be delayed until 
>> the DC is unfrozen.
>>
>> Is this the default behavior of pacemaker or is it due to a 
>> misconfiguration? Is there any way to avoid this single point of failure?
>>
>>  
>>
>> For the testing, we use Pacemaker 1.1.12 with Corosync 2.3.3 in SLES
>> 12 SP1 operation system.
>>
> 
> Guess I can reproduce that with pacemaker 1.1.15 & corosync 2.3.6.
> I'm having sbd with pacemaker-watcher running as well on the nodes.
> As the node-health is not updated and the cib can be read sbd is happy - as 
> to 
> be expected.
> Maybe we could at least add something into sbd-pacemaker-watcher to detect 
> the 
> issue ... thinking ...
> 
> Regards,
> Klaus
> 
>>  
>>
>>  
>>
>> Regards,
>>
>> Shermal Fernando
>>
>>  
>>
>>  
>>
>>  
>>
>>  
>>
>>  
>>
>>  
>>
>>  
>>


Re: [ClusterLabs] DRBD failover in Pacemaker

2016-09-08 Thread Devin Ortner

Message: 1
Date: Wed, 7 Sep 2016 19:23:04 +0900
From: Digimer 
To: Cluster Labs - All topics related to open-source clustering
welcomed
Subject: Re: [ClusterLabs] DRBD failover in Pacemaker
Message-ID: 
Content-Type: text/plain; charset=windows-1252

> no-quorum-policy: ignore
> stonith-enabled: false

You must have fencing configured.

CentOS 6 uses pacemaker with the cman plugin. So set up cman
(cluster.conf) to use the fence_pcmk passthrough agent, then set up proper 
stonith in pacemaker (and test that it works). Finally, tell DRBD to use 
'fencing resource-and-stonith;' and configure the 'crm-{un,}fence-peer.sh' 
{un,}fence handlers.

See if that gets things working.
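
(A sketch of the DRBD side of that; the handler paths are the ones usually
shipped with drbd-utils, and "r0" is a placeholder resource name. In DRBD 8.4
the 'fencing' keyword sits in the disk section, in 8.3 it sits in net:)

resource r0 {
  disk {
    fencing resource-and-stonith;
  }
  handlers {
    fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
  # ... plus the usual device/disk/address statements
}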

On 07/09/16 04:04 AM, Devin Ortner wrote:
> I have a 2-node cluster running CentOS 6.8 and Pacemaker with DRBD. I have 
> been using the "Clusters from Scratch" documentation to create my cluster and 
> I am running into a problem where DRBD is not failing over to the other node 
> when one goes down. Here is my "pcs status" prior to when it is supposed to 
> fail over:
> 
> --
> 
> 
> [root@node1 ~]# pcs status
> Cluster name: webcluster
> Last updated: Tue Sep  6 14:50:21 2016Last change: Tue Sep  6 
> 14:50:17 2016 by root via crm_attribute on node1
> Stack: cman
> Current DC: node2 (version 1.1.14-8.el6_8.1-70404b0) - partition with 
> quorum
> 2 nodes and 5 resources configured
> 
> Online: [ node1 node2 ]
> 
> Full list of resources:
> 
>  Cluster_VIP  (ocf::heartbeat:IPaddr2):   Started node1
>  Master/Slave Set: ClusterDBclone [ClusterDB]
>  Masters: [ node1 ]
>  Slaves: [ node2 ]
>  ClusterFS(ocf::heartbeat:Filesystem):Started node1
>  WebSite  (ocf::heartbeat:apache):Started node1
> 
> Failed Actions:
> * ClusterFS_start_0 on node2 'unknown error' (1): call=61, status=complete, 
> exitreason='none',
> last-rc-change='Tue Sep  6 13:15:00 2016', queued=0ms, exec=40ms
> 
> 
> PCSD Status:
>   node1: Online
>   node2: Online
> 
> [root@node1 ~]#
> 
> When I put node1 in standby everything fails over except DRBD:
> --
> 
> 
> [root@node1 ~]# pcs cluster standby node1
> [root@node1 ~]# pcs status
> Cluster name: webcluster
> Last updated: Tue Sep  6 14:53:45 2016Last change: Tue Sep  6 
> 14:53:37 2016 by root via cibadmin on node2
> Stack: cman
> Current DC: node2 (version 1.1.14-8.el6_8.1-70404b0) - partition with 
> quorum
> 2 nodes and 5 resources configured
> 
> Node node1: standby
> Online: [ node2 ]
> 
> Full list of resources:
> 
>  Cluster_VIP  (ocf::heartbeat:IPaddr2):   Started node2
>  Master/Slave Set: ClusterDBclone [ClusterDB]
>  Slaves: [ node2 ]
>  Stopped: [ node1 ]
>  ClusterFS(ocf::heartbeat:Filesystem):Stopped
>  WebSite  (ocf::heartbeat:apache):Started node2
> 
> Failed Actions:
> * ClusterFS_start_0 on node2 'unknown error' (1): call=61, status=complete, 
> exitreason='none',
> last-rc-change='Tue Sep  6 13:15:00 2016', queued=0ms, exec=40ms
> 
> 
> PCSD Status:
>   node1: Online
>   node2: Online
> 
> [root@node1 ~]#
> 
> I have pasted the contents of "/var/log/messages" here: 
> http://pastebin.com/0i0FMzGZ Here is my Configuration: 
> http://pastebin.com/HqqBV90p
> 
> When I unstandby node1, it comes back as the master for the DRBD and 
> everything else stays running on node2 (Which is fine because I haven't setup 
> colocation constraints for that) Here is what I have after node1 is back:
> -
> 
> [root@node1 ~]# pcs cluster unstandby node1
> [root@node1 ~]# pcs status
> Cluster name: webcluster
> Last updated: Tue Sep  6 14:57:46 2016Last change: Tue Sep  6 
> 14:57:42 2016 by root via cibadmin on node1
> Stack: cman
> Current DC: node2 (version 1.1.14-8.el6_8.1-70404b0) - partition with 
> quorum
> 2 nodes and 5 resources configured
> 
> Online: [ node1 node2 ]
> 
> Full list of resources:
> 
>  Cluster_VIP  (ocf::heartbeat:IPaddr2):   Started node2
>  Master/Slave Set: ClusterDBclone [ClusterDB]
>  Masters: [ node1 ]
>  Slaves: [ node2 ]
>  ClusterFS(ocf::heartbeat:Filesystem):Started node1
>  WebSite  (ocf::heartbeat:apache):Started node2
> 
> Failed Actions:
> * ClusterFS_start_0 on node2 'unknown error' (1): call=61, status=complete, 
> exitreason='none',
> last-rc-change='Tue Sep  6 13:15:00 2016', queued=0ms, exec=40ms
> 
> 
> PCSD Status:
>   node1: Online
>   node2: Online
> 
> [root@node1 ~]#
> 
> Any help would be appreciated, I think there is something dumb that I'm 
> missing.
> 
> Thank you.
> 
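
(Not an answer from the thread, but for the record: with the resource names in
the status output above, the "Clusters from Scratch"-style glue between the
DRBD master and the filesystem would look roughly like this sketch:)

  # run the filesystem only where the DRBD clone is Master,
  # and mount it only after the promotion has happened
  pcs constraint colocation add ClusterFS with master ClusterDBclone INFINITY
  pcs constraint order promote ClusterDBclone then start ClusterFS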