Re: [ClusterLabs] service flap as nodes join and leave
On Thu, Apr 14, 2016, at 11:12 AM, Ken Gaillot wrote:
> On 04/14/2016 09:33 AM, Christopher Harvey wrote:
> > MsgBB-Active is a dummy resource that simply returns OCF_SUCCESS on
> > every operation and logs to a file.
>
> That's a common mistake, and will confuse the cluster. The cluster
> checks the status of resources both where they're supposed to be
> running and where they're not. If status always returns success, the
> cluster won't try to start it where it should, and will continuously
> try to stop it elsewhere, because it thinks it's already running
> everywhere.
>
> It's essential that an RA distinguish between running
> (OCF_SUCCESS/OCF_RUNNING_MASTER), cleanly not running
> (OCF_NOT_RUNNING), and unknown/failed (OCF_ERR_*/OCF_FAILED_MASTER).

Solved. Thanks!

> See pacemaker's Dummy agent as an example/template:
>
> https://github.com/ClusterLabs/pacemaker/blob/master/extra/resources/Dummy
>
> It touches a temporary file to know whether it is "running" or not.
>
> ocf-shellfuncs has a ha_pseudo_resource() function that does the same
> thing. See the ocf:heartbeat:Delay agent for example usage.

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] service flap as nodes join and leave
Ken Gaillot wrote:
> On 04/14/2016 09:33 AM, Christopher Harvey wrote:
> > MsgBB-Active is a dummy resource that simply returns OCF_SUCCESS on
> > every operation and logs to a file.
>
> That's a common mistake, and will confuse the cluster. The cluster
> checks the status of resources both where they're supposed to be
> running and where they're not. If status always returns success, the
> cluster won't try to start it where it should, and will continuously
> try to stop it elsewhere, because it thinks it's already running
> everywhere.
>
> It's essential that an RA distinguish between running
> (OCF_SUCCESS/OCF_RUNNING_MASTER), cleanly not running
> (OCF_NOT_RUNNING), and unknown/failed (OCF_ERR_*/OCF_FAILED_MASTER).
>
> See pacemaker's Dummy agent as an example/template:
>
> https://github.com/ClusterLabs/pacemaker/blob/master/extra/resources/Dummy
>
> It touches a temporary file to know whether it is "running" or not.

Yes, I very recently discovered we had made a similar mistake, which was
confusing Pacemaker into thinking a pseudo-resource was running
everywhere, whereas we actually only wanted it running active/passive.
This was the fix:

https://review.openstack.org/#/c/291286/

> ocf-shellfuncs has a ha_pseudo_resource() function that does the same
> thing. See the ocf:heartbeat:Delay agent for example usage.

Interesting, thanks, I didn't know that.
Re: [ClusterLabs] service flap as nodes join and leave
On 04/14/2016 09:33 AM, Christopher Harvey wrote:
> MsgBB-Active is a dummy resource that simply returns OCF_SUCCESS on
> every operation and logs to a file.

That's a common mistake, and will confuse the cluster. The cluster
checks the status of resources both where they're supposed to be running
and where they're not. If status always returns success, the cluster
won't try to start it where it should, and will continuously try to
stop it elsewhere, because it thinks it's already running everywhere.

It's essential that an RA distinguish between running
(OCF_SUCCESS/OCF_RUNNING_MASTER), cleanly not running (OCF_NOT_RUNNING),
and unknown/failed (OCF_ERR_*/OCF_FAILED_MASTER).

See pacemaker's Dummy agent as an example/template:

https://github.com/ClusterLabs/pacemaker/blob/master/extra/resources/Dummy

It touches a temporary file to know whether it is "running" or not.

ocf-shellfuncs has a ha_pseudo_resource() function that does the same
thing. See the ocf:heartbeat:Delay agent for example usage.
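For illustration, the state-file pattern the Dummy agent uses can be
sketched in a few lines of shell. This is a minimal sketch, not the
Dummy agent itself; the state-file path and function names here are
illustrative assumptions:

```shell
#!/bin/sh
# Minimal sketch of the Dummy-style state-file pattern.
# STATE_FILE path and function names are illustrative, not from the agent.
STATE_FILE="${HA_RSCTMP:-/tmp}/MyDummy-example.state"

dummy_start() {
    touch "$STATE_FILE"     # record that the resource is "running" here
    return 0                # OCF_SUCCESS
}

dummy_stop() {
    rm -f "$STATE_FILE"     # record a clean stop
    return 0                # OCF_SUCCESS
}

dummy_monitor() {
    if [ -f "$STATE_FILE" ]; then
        return 0            # OCF_SUCCESS: running on this node
    else
        return 7            # OCF_NOT_RUNNING: cleanly not running here
    fi
}
```

With this, probes on nodes where the resource was never started return
OCF_NOT_RUNNING (7) rather than OCF_SUCCESS, so the cluster no longer
believes the resource is active everywhere.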
Re: [ClusterLabs] service flap as nodes join and leave
actually, toggling vmr-132-5 in the following simpler setup produces the
same service flap as before.

Cluster Name:
Corosync Nodes:
 192.168.132.5 192.168.132.4 192.168.132.3
Pacemaker Nodes:
 vmr-132-3 vmr-132-4 vmr-132-5

Resources:
 Resource: MsgBB-Active (class=ocf provider=solace type=MsgBB-Active)
  Meta Attrs: migration-threshold=2 failure-timeout=1s
  Operations: start interval=0s timeout=2 (MsgBB-Active-start-interval-0s)
              stop interval=0s timeout=2 (MsgBB-Active-stop-interval-0s)
              monitor interval=1s (MsgBB-Active-monitor-interval-1s)

Stonith Devices:
Fencing Levels:

Location Constraints:
  Resource: MsgBB-Active
    Enabled on: vmr-132-3 (score:100) (id:AUTO-REVERT)
Ordering Constraints:
Colocation Constraints:

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-recheck-interval: 1s
 dc-version: 1.1.13-10.el7_2.2-44eb2dd
 have-watchdog: false
 start-failure-is-fatal: false
 stonith-enabled: false

MsgBB-Active is a dummy resource that simply returns OCF_SUCCESS on
every operation and logs to a file.
Re: [ClusterLabs] service flap as nodes join and leave
On 04/13/2016 11:23 AM, Christopher Harvey wrote:
> I have a 3 node cluster (see the bottom of this email for 'pcs config'
> output). The MsgBB-Active and AD-Active services both flap whenever a
> node joins or leaves the cluster. I trigger the leave and join with a
> pacemaker service start and stop on any node.

That's the default behavior of clones used in ordering constraints. If
you set interleave=true on your clones, each dependent clone instance
will only care about the depended-on instances on its own node, rather
than all nodes. See:

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_clone_options

While the interleave=true behavior is much more commonly used,
interleave=false is the default because it's safer -- the cluster
doesn't know anything about the cloned service, so it can't assume the
service is OK with it. Since you know what your service does, you can
set interleave=true for services that can handle it.

> Here is the happy steady state setup:
>
> 3 nodes and 4 resources configured
>
> Online: [ vmr-132-3 vmr-132-4 vmr-132-5 ]
>
> Clone Set: Router-clone [Router]
>     Started: [ vmr-132-3 vmr-132-4 ]
> MsgBB-Active (ocf::solace:MsgBB-Active): Started vmr-132-3
> AD-Active (ocf::solace:AD-Active): Started vmr-132-3
>
> [root@vmr-132-4 ~]# supervisorctl stop pacemaker
> no change, except vmr-132-4 goes offline
> [root@vmr-132-4 ~]# supervisorctl start pacemaker
> vmr-132-4 comes back online
> MsgBB-Active and AD-Active flap very quickly (<1s)
> Steady state is resumed.
>
> Why should vmr-132-4 coming and going affect the service on any other
> node?
>
> Thanks,
> Chris
>
> Cluster Name:
> Corosync Nodes:
>  192.168.132.5 192.168.132.4 192.168.132.3
> Pacemaker Nodes:
>  vmr-132-3 vmr-132-4 vmr-132-5
>
> Resources:
>  Clone: Router-clone
>   Meta Attrs: clone-max=2 clone-node-max=1
>   Resource: Router (class=ocf provider=solace type=Router)
>    Meta Attrs: migration-threshold=1 failure-timeout=1s
>    Operations: start interval=0s timeout=2 (Router-start-timeout-2)
>                stop interval=0s timeout=2 (Router-stop-timeout-2)
>                monitor interval=1s (Router-monitor-interval-1s)
>  Resource: MsgBB-Active (class=ocf provider=solace type=MsgBB-Active)
>   Meta Attrs: migration-threshold=2 failure-timeout=1s
>   Operations: start interval=0s timeout=2 (MsgBB-Active-start-timeout-2)
>               stop interval=0s timeout=2 (MsgBB-Active-stop-timeout-2)
>               monitor interval=1s (MsgBB-Active-monitor-interval-1s)
>  Resource: AD-Active (class=ocf provider=solace type=AD-Active)
>   Meta Attrs: migration-threshold=2 failure-timeout=1s
>   Operations: start interval=0s timeout=2 (AD-Active-start-timeout-2)
>               stop interval=0s timeout=2 (AD-Active-stop-timeout-2)
>               monitor interval=1s (AD-Active-monitor-interval-1s)
>
> Stonith Devices:
> Fencing Levels:
>
> Location Constraints:
>   Resource: AD-Active
>     Disabled on: vmr-132-5 (score:-INFINITY) (id:ADNotOnMonitor)
>   Resource: MsgBB-Active
>     Enabled on: vmr-132-4 (score:100) (id:vmr-132-4Priority)
>     Enabled on: vmr-132-3 (score:250) (id:vmr-132-3Priority)
>     Disabled on: vmr-132-5 (score:-INFINITY) (id:MsgBBNotOnMonitor)
>   Resource: Router-clone
>     Disabled on: vmr-132-5 (score:-INFINITY) (id:RouterNotOnMonitor)
> Ordering Constraints:
>   Resource Sets:
>     set Router-clone MsgBB-Active sequential=true
>         (id:pcs_rsc_set_Router-clone_MsgBB-Active) setoptions kind=Mandatory
>         (id:pcs_rsc_order_Router-clone_MsgBB-Active)
>     set MsgBB-Active AD-Active sequential=true
>         (id:pcs_rsc_set_MsgBB-Active_AD-Active) setoptions kind=Mandatory
>         (id:pcs_rsc_order_MsgBB-Active_AD-Active)
> Colocation Constraints:
>   MsgBB-Active with Router-clone (score:INFINITY)
>       (id:colocation-MsgBB-Active-Router-clone-INFINITY)
>   AD-Active with MsgBB-Active (score:1000)
>       (id:colocation-AD-Active-MsgBB-Active-1000)
>
> Resources Defaults:
>  No defaults set
> Operations Defaults:
>  No defaults set
>
> Cluster Properties:
>  cluster-infrastructure: corosync
>  cluster-recheck-interval: 1s
>  dc-version: 1.1.13-10.el7_2.2-44eb2dd
>  have-watchdog: false
>  maintenance-mode: false
>  start-failure-is-fatal: false
>  stonith-enabled: false
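For reference, the interleave suggestion above can be applied to an
existing clone with pcs. This is a sketch only: the resource name is
taken from this thread, and the exact pcs subcommand syntax may differ
between pcs versions, so verify against your own `man pcs`:

```shell
# Sketch (assumption: pcs 0.9.x-era syntax; check your pcs version).
# Set interleave=true on the clone so each dependent clone instance
# only waits for the Router instance on its own node.
pcs resource meta Router-clone interleave=true

# Inspect the clone to confirm the meta attribute took effect.
pcs resource show Router-clone
```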
[ClusterLabs] service flap as nodes join and leave
I have a 3 node cluster (see the bottom of this email for 'pcs config'
output). The MsgBB-Active and AD-Active services both flap whenever a
node joins or leaves the cluster. I trigger the leave and join with a
pacemaker service start and stop on any node.

Here is the happy steady state setup:

3 nodes and 4 resources configured

Online: [ vmr-132-3 vmr-132-4 vmr-132-5 ]

Clone Set: Router-clone [Router]
    Started: [ vmr-132-3 vmr-132-4 ]
MsgBB-Active (ocf::solace:MsgBB-Active): Started vmr-132-3
AD-Active (ocf::solace:AD-Active): Started vmr-132-3

[root@vmr-132-4 ~]# supervisorctl stop pacemaker
no change, except vmr-132-4 goes offline
[root@vmr-132-4 ~]# supervisorctl start pacemaker
vmr-132-4 comes back online
MsgBB-Active and AD-Active flap very quickly (<1s)
Steady state is resumed.

Why should vmr-132-4 coming and going affect the service on any other
node?

Thanks,
Chris

Cluster Name:
Corosync Nodes:
 192.168.132.5 192.168.132.4 192.168.132.3
Pacemaker Nodes:
 vmr-132-3 vmr-132-4 vmr-132-5

Resources:
 Clone: Router-clone
  Meta Attrs: clone-max=2 clone-node-max=1
  Resource: Router (class=ocf provider=solace type=Router)
   Meta Attrs: migration-threshold=1 failure-timeout=1s
   Operations: start interval=0s timeout=2 (Router-start-timeout-2)
               stop interval=0s timeout=2 (Router-stop-timeout-2)
               monitor interval=1s (Router-monitor-interval-1s)
 Resource: MsgBB-Active (class=ocf provider=solace type=MsgBB-Active)
  Meta Attrs: migration-threshold=2 failure-timeout=1s
  Operations: start interval=0s timeout=2 (MsgBB-Active-start-timeout-2)
              stop interval=0s timeout=2 (MsgBB-Active-stop-timeout-2)
              monitor interval=1s (MsgBB-Active-monitor-interval-1s)
 Resource: AD-Active (class=ocf provider=solace type=AD-Active)
  Meta Attrs: migration-threshold=2 failure-timeout=1s
  Operations: start interval=0s timeout=2 (AD-Active-start-timeout-2)
              stop interval=0s timeout=2 (AD-Active-stop-timeout-2)
              monitor interval=1s (AD-Active-monitor-interval-1s)

Stonith Devices:
Fencing Levels:

Location Constraints:
  Resource: AD-Active
    Disabled on: vmr-132-5 (score:-INFINITY) (id:ADNotOnMonitor)
  Resource: MsgBB-Active
    Enabled on: vmr-132-4 (score:100) (id:vmr-132-4Priority)
    Enabled on: vmr-132-3 (score:250) (id:vmr-132-3Priority)
    Disabled on: vmr-132-5 (score:-INFINITY) (id:MsgBBNotOnMonitor)
  Resource: Router-clone
    Disabled on: vmr-132-5 (score:-INFINITY) (id:RouterNotOnMonitor)
Ordering Constraints:
  Resource Sets:
    set Router-clone MsgBB-Active sequential=true
        (id:pcs_rsc_set_Router-clone_MsgBB-Active) setoptions kind=Mandatory
        (id:pcs_rsc_order_Router-clone_MsgBB-Active)
    set MsgBB-Active AD-Active sequential=true
        (id:pcs_rsc_set_MsgBB-Active_AD-Active) setoptions kind=Mandatory
        (id:pcs_rsc_order_MsgBB-Active_AD-Active)
Colocation Constraints:
  MsgBB-Active with Router-clone (score:INFINITY)
      (id:colocation-MsgBB-Active-Router-clone-INFINITY)
  AD-Active with MsgBB-Active (score:1000)
      (id:colocation-AD-Active-MsgBB-Active-1000)

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-recheck-interval: 1s
 dc-version: 1.1.13-10.el7_2.2-44eb2dd
 have-watchdog: false
 maintenance-mode: false
 start-failure-is-fatal: false
 stonith-enabled: false