Re: [ClusterLabs] How can I prevent multiple start of IPaddr 2 in an environment using fence_mpath?

2018-04-17 Thread 飯田 雄介
Hi, Ken

Thanks for your comment.
I agree that network fencing is also a valid approach. However, it depends
heavily on the available hardware: since we do not have an SNMP-capable
network switch in our environment, we cannot try it right away.
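
For anyone later looking into the network-fencing route: with an SNMP-capable
switch it would typically be done with fence_ifmib. The sketch below is only
illustrative (the switch address, community string, and port mapping are made
up, and parameter names vary between fence-agents versions, so check
"pcs stonith describe fence_ifmib" first):

pcs stonith create fence-switch-e fence_ifmib \
    ipaddr=switch.example.com community=private \
    pcmk_host_map="x3650e:Gi1/0/1" \
    op monitor interval=60s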

Thanks, Yusuke
> -Original Message-
> From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Ken Gaillot
> Sent: Friday, April 06, 2018 11:12 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] How can I prevent multiple start of IPaddr 2 in an
> environment using fence_mpath?
> 
> On Fri, 2018-04-06 at 04:30 +, 飯田 雄介 wrote:
> > Hi, all
> > I am testing the environment using fence_mpath with the following
> > settings.
> >
> > ===
> >   Stack: corosync
> >   Current DC: x3650f (version 1.1.17-1.el7-b36b869) - partition with quorum
> >   Last updated: Fri Apr  6 13:16:20 2018
> >   Last change: Thu Mar  1 18:38:02 2018 by root via cibadmin on x3650e
> >
> >   2 nodes configured
> >   13 resources configured
> >
> >   Online: [ x3650e x3650f ]
> >
> >   Full list of resources:
> >
> >    fenceMpath-x3650e(stonith:fence_mpath):  Started x3650e
> >    fenceMpath-x3650f(stonith:fence_mpath):  Started x3650f
> >    Resource Group: grpPostgreSQLDB
> >    prmFsPostgreSQLDB1   (ocf::heartbeat:Filesystem):  Started x3650e
> >    prmFsPostgreSQLDB2   (ocf::heartbeat:Filesystem):  Started x3650e
> >    prmFsPostgreSQLDB3   (ocf::heartbeat:Filesystem):  Started x3650e
> >    prmApPostgreSQLDB(ocf::heartbeat:pgsql): Started x3650e
> >    Resource Group: grpPostgreSQLIP
> >    prmIpPostgreSQLDB(ocf::heartbeat:IPaddr2):   Started x3650e
> >    Clone Set: clnDiskd1 [prmDiskd1]
> >    Started: [ x3650e x3650f ]
> >    Clone Set: clnDiskd2 [prmDiskd2]
> >    Started: [ x3650e x3650f ]
> >    Clone Set: clnPing [prmPing]
> >    Started: [ x3650e x3650f ]
> > ===
> >
> > When split-brain occurs in this environment, x3650f executes fence and
> > the resource is started with x3650f.
> >
> > === view of x3650e 
> >   Stack: corosync
> >   Current DC: x3650e (version 1.1.17-1.el7-b36b869) - partition WITHOUT quorum
> >   Last updated: Fri Apr  6 13:16:36 2018
> >   Last change: Thu Mar  1 18:38:02 2018 by root via cibadmin on x3650e
> >
> >   2 nodes configured
> >   13 resources configured
> >
> >   Node x3650f: UNCLEAN (offline)
> >   Online: [ x3650e ]
> >
> >   Full list of resources:
> >
> >    fenceMpath-x3650e(stonith:fence_mpath):  Started x3650e
> >    fenceMpath-x3650f(stonith:fence_mpath):  Started [ x3650e x3650f ]
> >    Resource Group: grpPostgreSQLDB
> >    prmFsPostgreSQLDB1   (ocf::heartbeat:Filesystem):  Started x3650e
> >    prmFsPostgreSQLDB2   (ocf::heartbeat:Filesystem):  Started x3650e
> >    prmFsPostgreSQLDB3   (ocf::heartbeat:Filesystem):  Started x3650e
> >    prmApPostgreSQLDB(ocf::heartbeat:pgsql): Started x3650e
> >    Resource Group: grpPostgreSQLIP
> >    prmIpPostgreSQLDB(ocf::heartbeat:IPaddr2):   Started x3650e
> >    Clone Set: clnDiskd1 [prmDiskd1]
> >    prmDiskd1(ocf::pacemaker:diskd): Started x3650f (UNCLEAN)
> >    Started: [ x3650e ]
> >    Clone Set: clnDiskd2 [prmDiskd2]
> >    prmDiskd2(ocf::pacemaker:diskd): Started x3650f (UNCLEAN)
> >    Started: [ x3650e ]
> >    Clone Set: clnPing [prmPing]
> >    prmPing  (ocf::pacemaker:ping):  Started x3650f (UNCLEAN)
> >    Started: [ x3650e ]
> >
> > === view of x3650f 
> >   Stack: corosync
> >   Current DC: x3650f (version 1.1.17-1.el7-b36b869) - partition WITHOUT quorum
> >   Last updated: Fri Apr  6 13:16:36 2018
> >   Last change: Thu Mar  1 18:38:02 2018 by root via cibadmin on x3650e
> >
> >   2 nodes configured
> >   13 resources configured
> >
> >   Online: [ x3650f ]
> >   OFFLINE: [ x3650e ]
> >
> >   Full list of resources:
> >
> >    fenceMpath-x3650e(stonith:fence_mpath):  Started x3650f
> >    fenceMpath-x3650f(stonith:fence_mpath):  Started x3650f
> >    Resource Group: grpPostgreSQLDB
> >    prmFsPostgreSQLDB1   (ocf::heartbeat:Filesystem):  Started x3650f
> >    prmFsPostgreSQLDB2   (ocf::heartbeat:Filesystem):  Started x3650f
> >    prmFsPostgreSQLDB3   (ocf::heartbeat:Filesystem):  Started x3650f
> >    prmApPostgreSQLDB(ocf::heartbeat:pgsql): Started x3650f
> >    Resource Group: grpPostgreSQLIP
> >    prmIpPostgreSQLDB(ocf::heartbeat:IPaddr2):   Started x3650f
> >    Clone Set: clnDiskd1 [prmDiskd1]
> >    Started: [ x3650f ]
> >    Stopped: [ x3650e ]
> >    Clone Set: clnDiskd2 [prmDiskd2]
> >    Started: [ x3650f ]
> >    Stopped: [ x3650e ]
> >    Clone Set: clnPing [prmPing]
> >    Started: [ x3650f ]
> >    Stopped: [ x3650e ]
> 

[ClusterLabs] Re: No slave is promoted to be master

2018-04-17 Thread 范国腾
Thank you very much, Rorthais,



I see now. I have two more questions.



1. If I change the "cluster-recheck-interval" parameter from the default 15
minutes to 10 seconds, is there any negative impact? Could this be a
workaround? (See the command sketch after question 2.)



2. This issue happens only in the following configuration.

[embedded configuration diagram not included in the archive]

But it does not happen in the following configuration. Why is the behavior
different?

[embedded configuration diagram not included in the archive]
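
Regarding question 1: assuming a pcs-managed cluster, the change being asked
about would presumably be made like this (whether a 10-second interval is
advisable is exactly the open question):

pcs property set cluster-recheck-interval=10s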



-----Original Message-----
From: Jehan-Guillaume de Rorthais [mailto:j...@dalibo.com]
Sent: April 17, 2018 17:47
To: 范国腾
Cc: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] No slave is promoted to be master



On Tue, 17 Apr 2018 04:16:38 +

范国腾 wrote:

> I checked the status again. It is not that it is never promoted; rather, it
> is promoted about 15 minutes after the cluster starts.
>
> I tried this in three labs and the results are the same: the promotion
> happens 15 minutes after the cluster starts.
>
> Why is there a delay of about 15 minutes every time?


This was a bug in Pacemaker up to 1.1.17. I did a report about this last August
and Ken Gaillot fixed it a few days later in 1.1.18. See:



https://lists.clusterlabs.org/pipermail/developers/2017-August/001110.html

https://lists.clusterlabs.org/pipermail/developers/2017-September/001113.html



I wonder if disabling the pgsql resource before shutting down the cluster might
be a simpler and safer workaround. E.g.:

pcs resource disable pgsql-ha --wait
pcs cluster stop --all

and

pcs cluster start --all
pcs resource enable pgsql-ha

Another fix would be to force a master score on one node **if needed** using:

  crm_master -N <node> -r <pgsql-resource> -l forever -v 1


___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Booth fail-over conditions

2018-04-17 Thread Dejan Muhamedagic
Hi,

On Mon, Apr 16, 2018 at 01:22:08PM +0200, Kristoffer Grönlund wrote:
> Zach Anderson  writes:
> 
> >  Hey all,
> >
> > new user to pacemaker/booth and I'm fumbling my way through my first proof
> > of concept. I have a 2 site configuration setup with local pacemaker
> > clusters at each site (running rabbitmq) and a booth arbitrator. I've
> > successfully validated the base failover when the "granted" site has
> > failed. My question is if there are any other ways to configure failover,
> > i.e. using resource health checks or the like?

You can take a look at "before-acquire-handler" (quite a mouthful, that).
The main motivation was to add the ability to verify that some other
conditions at _the site_ are good, perhaps using environment sensors, say
to measure temperature or to check whether the air conditioning works.

Nothing stops you from doing a resource health check there, but it could
probably be considered something on a rather different "level".

> 
> Hi Zach,
> 
> Do you mean that a resource health check should trigger site failover?
> That's actually something I'm not sure comes built-in..

There's nothing really specific about a resource, because booth
knows nothing about resources. The tickets are the only way it
can describe the world ;-)

Cheers,

Dejan

> though making a
> resource agent which revokes a ticket on failure should be fairly
> straightforward. You could then group your resource with the ticket
> resource to enable this functionality.
> 
> The logic in the ticket resource ought to be something like "if monitor
> fails and the current site is granted, then revoke the ticket, else do
> nothing". You would probably want to handle probe monitor invocations
> differently. There is an ocf_is_probe function provided to help with
> this. (A rough sketch of such a monitor follows at the end of this message.)
> 
> Cheers,
> Kristoffer
> 
> > Thanks!
> 
> -- 
> // Kristoffer Grönlund
> // kgronl...@suse.com
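
A rough, hedged sketch of the ticket-revoking monitor logic described above
(not a complete OCF agent; the health check, the site-is-granted test, and the
ticket name are placeholders):

  #!/bin/sh
  # Sketch only: rabbitmq_health_check and site_is_granted are placeholders
  # for whatever checks fit your setup (e.g. parsing `booth list` output).
  : ${OCF_ROOT:=/usr/lib/ocf}
  : ${OCF_FUNCTIONS_DIR:=${OCF_ROOT}/lib/heartbeat}
  . ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

  ticket_monitor() {
      if rabbitmq_health_check; then
          return $OCF_SUCCESS
      fi
      # Do not revoke on probe monitors, and only revoke if this site
      # currently holds the ticket.
      if ! ocf_is_probe && site_is_granted ticket-rabbitmq; then
          booth revoke ticket-rabbitmq
      fi
      return $OCF_ERR_GENERIC
  }
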
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] How can I prevent multiple start of IPaddr 2 in an environment using fence_mpath?

2018-04-17 Thread 飯田 雄介
Hi, Andrei

Thanks for your comment.

We are not assuming node-level fencing in our current environment.

I tried the power_timeout setting that you suggested.
However, fence_mpath immediately reports the status as off once the off action
is executed:
https://github.com/ClusterLabs/fence-agents/blob/v4.0.25/fence/agents/lib/fencing.py.py#L744
Therefore, we could not use this option to wait for IPaddr2 to stop.

I read the code and learned about the power_wait option.
With this option we can delay the completion of the STONITH action by a
specified amount of time, so it seems to meet our requirements.
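
For reference, assuming a pcs-managed cluster, setting it on the existing
fence devices would presumably look like the following (the 20-second value is
only an example):

pcs stonith update fenceMpath-x3650e power_wait=20
pcs stonith update fenceMpath-x3650f power_wait=20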

Thanks, Yusuke
> -Original Message-
> From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Andrei
> Borzenkov
> Sent: Friday, April 06, 2018 2:04 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] How can I prevent multiple start of IPaddr 2 in an
> environment using fence_mpath?
> 
> 06.04.2018 07:30, 飯田 雄介 wrote:
> > Hi, all
> > I am testing the environment using fence_mpath with the following settings.
> >
> > ===
> >   Stack: corosync
> >   Current DC: x3650f (version 1.1.17-1.el7-b36b869) - partition with quorum
> >   Last updated: Fri Apr  6 13:16:20 2018
> >   Last change: Thu Mar  1 18:38:02 2018 by root via cibadmin on x3650e
> >
> >   2 nodes configured
> >   13 resources configured
> >
> >   Online: [ x3650e x3650f ]
> >
> >   Full list of resources:
> >
> >fenceMpath-x3650e(stonith:fence_mpath):  Started x3650e
> >fenceMpath-x3650f(stonith:fence_mpath):  Started x3650f
> >Resource Group: grpPostgreSQLDB
> >prmFsPostgreSQLDB1   (ocf::heartbeat:Filesystem):  Started x3650e
> >prmFsPostgreSQLDB2   (ocf::heartbeat:Filesystem):  Started x3650e
> >prmFsPostgreSQLDB3   (ocf::heartbeat:Filesystem):  Started x3650e
> >prmApPostgreSQLDB(ocf::heartbeat:pgsql): Started x3650e
> >Resource Group: grpPostgreSQLIP
> >prmIpPostgreSQLDB(ocf::heartbeat:IPaddr2):   Started x3650e
> >Clone Set: clnDiskd1 [prmDiskd1]
> >Started: [ x3650e x3650f ]
> >Clone Set: clnDiskd2 [prmDiskd2]
> >Started: [ x3650e x3650f ]
> >Clone Set: clnPing [prmPing]
> >Started: [ x3650e x3650f ]
> > ===
> >
> > When split-brain occurs in this environment, x3650f executes fence and the
> resource is started with x3650f.
> >
> > === view of x3650e 
> >   Stack: corosync
> >   Current DC: x3650e (version 1.1.17-1.el7-b36b869) - partition WITHOUT quorum
> >   Last updated: Fri Apr  6 13:16:36 2018
> >   Last change: Thu Mar  1 18:38:02 2018 by root via cibadmin on x3650e
> >
> >   2 nodes configured
> >   13 resources configured
> >
> >   Node x3650f: UNCLEAN (offline)
> >   Online: [ x3650e ]
> >
> >   Full list of resources:
> >
> >fenceMpath-x3650e(stonith:fence_mpath):  Started x3650e
> >fenceMpath-x3650f(stonith:fence_mpath):  Started [ x3650e x3650f ]
> >Resource Group: grpPostgreSQLDB
> >prmFsPostgreSQLDB1   (ocf::heartbeat:Filesystem):  Started x3650e
> >prmFsPostgreSQLDB2   (ocf::heartbeat:Filesystem):  Started x3650e
> >prmFsPostgreSQLDB3   (ocf::heartbeat:Filesystem):  Started x3650e
> >prmApPostgreSQLDB(ocf::heartbeat:pgsql): Started x3650e
> >Resource Group: grpPostgreSQLIP
> >prmIpPostgreSQLDB(ocf::heartbeat:IPaddr2):   Started x3650e
> >Clone Set: clnDiskd1 [prmDiskd1]
> >prmDiskd1(ocf::pacemaker:diskd): Started x3650f (UNCLEAN)
> >Started: [ x3650e ]
> >Clone Set: clnDiskd2 [prmDiskd2]
> >prmDiskd2(ocf::pacemaker:diskd): Started x3650f (UNCLEAN)
> >Started: [ x3650e ]
> >Clone Set: clnPing [prmPing]
> >prmPing  (ocf::pacemaker:ping):  Started x3650f (UNCLEAN)
> >Started: [ x3650e ]
> >
> > === view of x3650f 
> >   Stack: corosync
> >   Current DC: x3650f (version 1.1.17-1.el7-b36b869) - partition WITHOUT quorum
> >   Last updated: Fri Apr  6 13:16:36 2018
> >   Last change: Thu Mar  1 18:38:02 2018 by root via cibadmin on x3650e
> >
> >   2 nodes configured
> >   13 resources configured
> >
> >   Online: [ x3650f ]
> >   OFFLINE: [ x3650e ]
> >
> >   Full list of resources:
> >
> >fenceMpath-x3650e(stonith:fence_mpath):  Started x3650f
> >fenceMpath-x3650f(stonith:fence_mpath):  Started x3650f
> >Resource Group: grpPostgreSQLDB
> >prmFsPostgreSQLDB1   (ocf::heartbeat:Filesystem):  Started x3650f
> >prmFsPostgreSQLDB2   (ocf::heartbeat:Filesystem):  Started x3650f
> >prmFsPostgreSQLDB3   (ocf::heartbeat:Filesystem):  Started x3650f
> >prmApPostgreSQLDB(ocf::heartbeat:pgsql): Started x3650f
> >Resource Group: grpPostgreSQLIP
> >prmIpPostgreSQLDB(ocf::heartbeat:IPaddr2):   Started x3650f
> >Clone Set: clnDiskd1 [prmDiskd1]
> >Started: [ x3650f ]
> >

Re: [ClusterLabs] No slave is promoted to be master

2018-04-17 Thread Jehan-Guillaume de Rorthais
On Tue, 17 Apr 2018 04:16:38 +
范国腾  wrote:

> I checked the status again. It is not that it is never promoted; rather, it
> is promoted about 15 minutes after the cluster starts.
> 
> I tried this in three labs and the results are the same: the promotion
> happens 15 minutes after the cluster starts.
> 
> Why is there a delay of about 15 minutes every time?

This was a bug in Pacemaker up to 1.1.17. I did a report about this last August
and Ken Gaillot fixed it a few days later in 1.1.18. See:

https://lists.clusterlabs.org/pipermail/developers/2017-August/001110.html
https://lists.clusterlabs.org/pipermail/developers/2017-September/001113.html

I wonder if disabling the pgsql resource before shutting down the cluster might
be a simpler and safer workaround. E.g.:

 pcs resource disable pgsql-ha --wait
 pcs cluster stop --all

and 

 pcs cluster start --all
 pcs resource enable pgsql-ha

Another fix would be to force a master score on one node **if needed** using:

  crm_master -N <node> -r <pgsql-resource> -l forever -v 1
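
To check whether such a score has been set, or to remove the forced score
later, something like the following should work (same illustrative
placeholders as above):

  crm_master -G -N <node> -r <pgsql-resource>
  crm_master -D -N <node> -r <pgsql-resource>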

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org