Re: [ClusterLabs] service flap as nodes join and leave

2016-04-18 Thread Christopher Harvey
On Thu, Apr 14, 2016, at 11:12 AM, Ken Gaillot wrote:
> On 04/14/2016 09:33 AM, Christopher Harvey wrote:
> > MsgBB-Active is a dummy resource that simply returns OCF_SUCCESS on
> > every operation and logs to a file.
> 
> That's a common mistake, and will confuse the cluster. The cluster
> checks the status of resources both where they're supposed to be running
> and where they're not. If status always returns success, the cluster
> won't try to start it where it should, and will continuously try to
> stop it elsewhere, because it thinks it's already running everywhere.
> 
> It's essential that an RA distinguish between running
> (OCF_SUCCESS/OCF_RUNNING_MASTER), cleanly not running (OCF_NOT_RUNNING),
> and unknown/failed (OCF_ERR_*/OCF_FAILED_MASTER).

Solved. Thanks!

> See pacemaker's Dummy agent as an example/template:
> 
> https://github.com/ClusterLabs/pacemaker/blob/master/extra/resources/Dummy
> 
> It touches a temporary file to know whether it is "running" or not.
> 
> ocf-shellfuncs has a ha_pseudo_resource() function that does the same
> thing. See the ocf:heartbeat:Delay agent for example usage.


Re: [ClusterLabs] service flap as nodes join and leave

2016-04-14 Thread Adam Spiers
Ken Gaillot wrote:
> On 04/14/2016 09:33 AM, Christopher Harvey wrote:
> > MsgBB-Active is a dummy resource that simply returns OCF_SUCCESS on
> > every operation and logs to a file.
> 
> That's a common mistake, and will confuse the cluster. The cluster
> checks the status of resources both where they're supposed to be running
> and where they're not. If status always returns success, the cluster
> won't try to start it where it should, and will continuously try to
> stop it elsewhere, because it thinks it's already running everywhere.
> 
> It's essential that an RA distinguish between running
> (OCF_SUCCESS/OCF_RUNNING_MASTER), cleanly not running (OCF_NOT_RUNNING),
> and unknown/failed (OCF_ERR_*/OCF_FAILED_MASTER).
> 
> See pacemaker's Dummy agent as an example/template:
> 
> https://github.com/ClusterLabs/pacemaker/blob/master/extra/resources/Dummy
> 
> It touches a temporary file to know whether it is "running" or not.

Yes, I very recently discovered we had made a similar mistake which
was confusing Pacemaker into thinking a pseudo-resource was running
everywhere, whereas we actually only wanted it running active/passive.
This was the fix:

  https://review.openstack.org/#/c/291286/

> ocf-shellfuncs has a ha_pseudo_resource() function that does the same
> thing. See the ocf:heartbeat:Delay agent for example usage.

Interesting, thanks, I didn't know that.


Re: [ClusterLabs] service flap as nodes join and leave

2016-04-14 Thread Ken Gaillot
On 04/14/2016 09:33 AM, Christopher Harvey wrote:
> MsgBB-Active is a dummy resource that simply returns OCF_SUCCESS on
> every operation and logs to a file.

That's a common mistake, and will confuse the cluster. The cluster
checks the status of resources both where they're supposed to be running
and where they're not. If status always returns success, the cluster
won't try to start it where it should, and will continuously try to
stop it elsewhere, because it thinks it's already running everywhere.

It's essential that an RA distinguish between running
(OCF_SUCCESS/OCF_RUNNING_MASTER), cleanly not running (OCF_NOT_RUNNING),
and unknown/failed (OCF_ERR_*/OCF_FAILED_MASTER).

See pacemaker's Dummy agent as an example/template:

https://github.com/ClusterLabs/pacemaker/blob/master/extra/resources/Dummy

It touches a temporary file to know whether it is "running" or not.

ocf-shellfuncs has a ha_pseudo_resource() function that does the same
thing. See the ocf:heartbeat:Delay agent for example usage.
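
A stripped-down sketch of that pattern (the state file path and the
details below are only illustrative; see the real Dummy agent for the
full version, including metadata and validation):

  #!/bin/sh
  # Minimal OCF-style agent: track "running" state in a per-instance
  # state file so monitor can tell on which node the resource runs.
  : ${OCF_ROOT=/usr/lib/ocf}
  . ${OCF_ROOT}/lib/heartbeat/ocf-shellfuncs

  STATEFILE="${HA_RSCTMP}/${OCF_RESOURCE_INSTANCE}.state"

  case "$1" in
    start)
      touch "$STATEFILE" || exit $OCF_ERR_GENERIC
      exit $OCF_SUCCESS ;;
    stop)
      rm -f "$STATEFILE"
      exit $OCF_SUCCESS ;;
    monitor)
      # "running" only on the node where start created the state file
      [ -f "$STATEFILE" ] && exit $OCF_SUCCESS
      exit $OCF_NOT_RUNNING ;;
    meta-data)
      # a real agent prints its OCF metadata XML here
      exit $OCF_SUCCESS ;;
    *)
      exit $OCF_ERR_UNIMPLEMENTED ;;
  esac

With ha_pseudo_resource() the start/stop/monitor cases collapse to
roughly one call each, along the lines of
ha_pseudo_resource ${OCF_RESOURCE_INSTANCE} monitor (see the Delay
agent for the exact usage).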


Re: [ClusterLabs] service flap as nodes join and leave

2016-04-14 Thread Christopher Harvey
Actually, toggling vmr-132-5 in the following simpler setup produces the
same service flap as before.

Cluster Name:
Corosync Nodes:
 192.168.132.5 192.168.132.4 192.168.132.3
Pacemaker Nodes:
 vmr-132-3 vmr-132-4 vmr-132-5

Resources:
 Resource: MsgBB-Active (class=ocf provider=solace type=MsgBB-Active)
  Meta Attrs: migration-threshold=2 failure-timeout=1s
  Operations: start interval=0s timeout=2 (MsgBB-Active-start-interval-0s)
              stop interval=0s timeout=2 (MsgBB-Active-stop-interval-0s)
              monitor interval=1s (MsgBB-Active-monitor-interval-1s)

Stonith Devices:
Fencing Levels:

Location Constraints:
  Resource: MsgBB-Active
Enabled on: vmr-132-3 (score:100) (id:AUTO-REVERT)
Ordering Constraints:
Colocation Constraints:

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-recheck-interval: 1s
 dc-version: 1.1.13-10.el7_2.2-44eb2dd
 have-watchdog: false
 start-failure-is-fatal: false
 stonith-enabled: false


MsgBB-Active is a dummy resource that simply returns OCF_SUCCESS on
every operation and logs to a file.


Re: [ClusterLabs] service flap as nodes join and leave

2016-04-13 Thread Ken Gaillot
On 04/13/2016 11:23 AM, Christopher Harvey wrote:
> I have a 3-node cluster (see the bottom of this email for 'pcs config'
> output). The MsgBB-Active and AD-Active services both flap whenever a
> node joins or leaves the cluster. I trigger the leave and join by
> starting and stopping the pacemaker service on any node.

That's the default behavior of clones used in ordering constraints. If
you set interleave=true on your clones, each dependent clone instance
will only care about the depended-on instances on its own node, rather
than all nodes.

See
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_clone_options

While the interleave=true behavior is much more commonly used,
interleave=false is the default because it's safer: the cluster
doesn't know anything about the cloned service, so it can't assume the
service can tolerate interleaving. Since you know what your service
does, you can set interleave=true for services that can handle it.
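
For example, something along these lines should do it with pcs (a
sketch; Router-clone is taken from your config below, and pcs syntax
can vary a bit between versions):

  # set interleave=true as a meta attribute on the existing clone
  pcs resource meta Router-clone interleave=true

  # confirm the clone's meta attributes
  pcs resource show Router-clone

With that in place, a node joining or leaving should no longer cause
the dependent resources to restart on nodes where nothing else changed.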

> Here is the happy steady state setup:
> 
> 3 nodes and 4 resources configured
> 
> Online: [ vmr-132-3 vmr-132-4 vmr-132-5 ]
> 
>  Clone Set: Router-clone [Router]
>  Started: [ vmr-132-3 vmr-132-4 ]
> MsgBB-Active(ocf::solace:MsgBB-Active): Started vmr-132-3
> AD-Active   (ocf::solace:AD-Active):Started vmr-132-3
> 
> [root@vmr-132-4 ~]# supervisorctl stop pacemaker
> no change, except vmr-132-4 goes offline
> [root@vmr-132-4 ~]# supervisorctl start pacemaker
> vmr-132-4 comes back online
> MsgBB-Active and AD-Active flap very quickly (<1s)
> Steady state is resumed.
> 
> Why should the fact that vmr-132-4 coming and going affect the service
> on any other node?
> 
> Thanks,
> Chris
> 
> Cluster Name:
> Corosync Nodes:
>  192.168.132.5 192.168.132.4 192.168.132.3
> Pacemaker Nodes:
>  vmr-132-3 vmr-132-4 vmr-132-5
> 
> Resources:
>  Clone: Router-clone
>   Meta Attrs: clone-max=2 clone-node-max=1
>   Resource: Router (class=ocf provider=solace type=Router)
>Meta Attrs: migration-threshold=1 failure-timeout=1s
>Operations: start interval=0s timeout=2 (Router-start-timeout-2)
>stop interval=0s timeout=2 (Router-stop-timeout-2)
>monitor interval=1s (Router-monitor-interval-1s)
>  Resource: MsgBB-Active (class=ocf provider=solace type=MsgBB-Active)
>   Meta Attrs: migration-threshold=2 failure-timeout=1s
>   Operations: start interval=0s timeout=2 (MsgBB-Active-start-timeout-2)
>   stop interval=0s timeout=2 (MsgBB-Active-stop-timeout-2)
>   monitor interval=1s (MsgBB-Active-monitor-interval-1s)
>  Resource: AD-Active (class=ocf provider=solace type=AD-Active)
>   Meta Attrs: migration-threshold=2 failure-timeout=1s
>   Operations: start interval=0s timeout=2 (AD-Active-start-timeout-2)
>   stop interval=0s timeout=2 (AD-Active-stop-timeout-2)
>   monitor interval=1s (AD-Active-monitor-interval-1s)
> 
> Stonith Devices:
> Fencing Levels:
> 
> Location Constraints:
>   Resource: AD-Active
> Disabled on: vmr-132-5 (score:-INFINITY) (id:ADNotOnMonitor)
>   Resource: MsgBB-Active
> Enabled on: vmr-132-4 (score:100) (id:vmr-132-4Priority)
> Enabled on: vmr-132-3 (score:250) (id:vmr-132-3Priority)
> Disabled on: vmr-132-5 (score:-INFINITY) (id:MsgBBNotOnMonitor)
>   Resource: Router-clone
> Disabled on: vmr-132-5 (score:-INFINITY) (id:RouterNotOnMonitor)
> Ordering Constraints:
>   Resource Sets:
> set Router-clone MsgBB-Active sequential=true
> (id:pcs_rsc_set_Router-clone_MsgBB-Active) setoptions kind=Mandatory
> (id:pcs_rsc_order_Router-clone_MsgBB-Active)
> set MsgBB-Active AD-Active sequential=true
> (id:pcs_rsc_set_MsgBB-Active_AD-Active) setoptions kind=Mandatory
> (id:pcs_rsc_order_MsgBB-Active_AD-Active)
> Colocation Constraints:
>   MsgBB-Active with Router-clone (score:INFINITY)
>   (id:colocation-MsgBB-Active-Router-clone-INFINITY)
>   AD-Active with MsgBB-Active (score:1000)
>   (id:colocation-AD-Active-MsgBB-Active-1000)
> 
> Resources Defaults:
>  No defaults set
> Operations Defaults:
>  No defaults set
> 
> Cluster Properties:
>  cluster-infrastructure: corosync
>  cluster-recheck-interval: 1s
>  dc-version: 1.1.13-10.el7_2.2-44eb2dd
>  have-watchdog: false
>  maintenance-mode: false
>  start-failure-is-fatal: false
>  stonith-enabled: false


[ClusterLabs] service flap as nodes join and leave

2016-04-13 Thread Christopher Harvey
I have a 3-node cluster (see the bottom of this email for 'pcs config'
output). The MsgBB-Active and AD-Active services both flap whenever a
node joins or leaves the cluster. I trigger the leave and join by
starting and stopping the pacemaker service on any node.

Here is the happy steady state setup:

3 nodes and 4 resources configured

Online: [ vmr-132-3 vmr-132-4 vmr-132-5 ]

 Clone Set: Router-clone [Router]
 Started: [ vmr-132-3 vmr-132-4 ]
MsgBB-Active(ocf::solace:MsgBB-Active): Started vmr-132-3
AD-Active   (ocf::solace:AD-Active):Started vmr-132-3

[root@vmr-132-4 ~]# supervisorctl stop pacemaker
no change, except vmr-132-4 goes offline
[root@vmr-132-4 ~]# supervisorctl start pacemaker
vmr-132-4 comes back online
MsgBB-Active and AD-Active flap very quickly (<1s)
Steady state is resumed.

Why should the fact that vmr-132-4 coming and going affect the service
on any other node?

Thanks,
Chris

Cluster Name:
Corosync Nodes:
 192.168.132.5 192.168.132.4 192.168.132.3
Pacemaker Nodes:
 vmr-132-3 vmr-132-4 vmr-132-5

Resources:
 Clone: Router-clone
  Meta Attrs: clone-max=2 clone-node-max=1
  Resource: Router (class=ocf provider=solace type=Router)
   Meta Attrs: migration-threshold=1 failure-timeout=1s
   Operations: start interval=0s timeout=2 (Router-start-timeout-2)
   stop interval=0s timeout=2 (Router-stop-timeout-2)
   monitor interval=1s (Router-monitor-interval-1s)
 Resource: MsgBB-Active (class=ocf provider=solace type=MsgBB-Active)
  Meta Attrs: migration-threshold=2 failure-timeout=1s
  Operations: start interval=0s timeout=2 (MsgBB-Active-start-timeout-2)
  stop interval=0s timeout=2 (MsgBB-Active-stop-timeout-2)
  monitor interval=1s (MsgBB-Active-monitor-interval-1s)
 Resource: AD-Active (class=ocf provider=solace type=AD-Active)
  Meta Attrs: migration-threshold=2 failure-timeout=1s
  Operations: start interval=0s timeout=2 (AD-Active-start-timeout-2)
  stop interval=0s timeout=2 (AD-Active-stop-timeout-2)
  monitor interval=1s (AD-Active-monitor-interval-1s)

Stonith Devices:
Fencing Levels:

Location Constraints:
  Resource: AD-Active
Disabled on: vmr-132-5 (score:-INFINITY) (id:ADNotOnMonitor)
  Resource: MsgBB-Active
Enabled on: vmr-132-4 (score:100) (id:vmr-132-4Priority)
Enabled on: vmr-132-3 (score:250) (id:vmr-132-3Priority)
Disabled on: vmr-132-5 (score:-INFINITY) (id:MsgBBNotOnMonitor)
  Resource: Router-clone
Disabled on: vmr-132-5 (score:-INFINITY) (id:RouterNotOnMonitor)
Ordering Constraints:
  Resource Sets:
set Router-clone MsgBB-Active sequential=true
(id:pcs_rsc_set_Router-clone_MsgBB-Active) setoptions kind=Mandatory
(id:pcs_rsc_order_Router-clone_MsgBB-Active)
set MsgBB-Active AD-Active sequential=true
(id:pcs_rsc_set_MsgBB-Active_AD-Active) setoptions kind=Mandatory
(id:pcs_rsc_order_MsgBB-Active_AD-Active)
Colocation Constraints:
  MsgBB-Active with Router-clone (score:INFINITY)
  (id:colocation-MsgBB-Active-Router-clone-INFINITY)
  AD-Active with MsgBB-Active (score:1000)
  (id:colocation-AD-Active-MsgBB-Active-1000)

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-recheck-interval: 1s
 dc-version: 1.1.13-10.el7_2.2-44eb2dd
 have-watchdog: false
 maintenance-mode: false
 start-failure-is-fatal: false
 stonith-enabled: false
