Re: [ClusterLabs] Cloned resource is restarted on all nodes if one node fails

2021-08-10 Thread Andreas Janning
Hi All,

I have just tried assigning equal location scores to both nodes and it does
indeed fix my problem.
I think Andrei Borzenkov's explanation is spot on. That is what is happening
in the cluster.

I think when initially setting up the cluster, I used different scores to
define a "main" and "secondary" node so that in normal operations it would
always be the "main" node serving the requests.
Looking at it now, it doesn't really make sense to apply those location
scores to the clone resource. The preference should be on the VIP resource, if it is
really necessary to define a "main" node at all.
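A sketch of that change with pcs (untested; the constraint ids are placeholder names, and vip is the VIP resource id from the original setup):

  # equal scores for the clone on both nodes, keeping resource-discovery=exclusive
  pcs constraint location add loc-apache-clone-node1 apache-clone pacemaker-test-1 100 resource-discovery=exclusive
  pcs constraint location add loc-apache-clone-node2 apache-clone pacemaker-test-2 100 resource-discovery=exclusive
  # put the "main node" preference on the VIP instead
  pcs constraint location vip prefers pacemaker-test-1=100

The old unequal constraints on apache-clone would have to be removed first (pcs constraint remove <constraint-id>).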

Thank you all for your help!

Regards,

Andreas

On Mon, Aug 9, 2021 at 22:23, Andrei Borzenkov <arvidj...@gmail.com> wrote:

> On 09.08.2021 22:57, Reid Wahl wrote:
> > On Mon, Aug 9, 2021 at 6:19 AM Andrei Borzenkov 
> wrote:
> >
> >> On 09.08.2021 16:00, Andreas Janning wrote:
> >>> Hi,
> >>>
> >>> yes, by "service" I meant the apache-clone resource.
> >>>
> >>> Maybe I can give a more stripped down and detailed example:
> >>>
> >>> *Given the following configuration:*
> >>> [root@pacemaker-test-1 cluster]# pcs cluster cib --config
> >>> [cluster CIB configuration quoted in full; reproduced in Andreas Janning's detailed example below]
> >>>
> >>>
> >>> *With the cluster in a running state:*
> >>>
> >>> [root@pacemaker-test-1 cluster]# pcs status
> >>> Cluster name: pacemaker-test
> >>> Stack: corosync
> >>> Current DC: pacemaker-test-2 (version 1.1.23-1.el7_9.1-9acf116022) -
> >>> partition with quorum
> >>> Last updated: Mon Aug  9 14:45:38 2021
> >>> Last change: Mon Aug  9 14:43:14 2021 by hacluster via crmd on
> >>> pacemaker-test-1
> >>>
> >>> 2 nodes configured
> >>> 2 resource instances configured
> >>>
> >>> Online: [ pacemaker-test-1 pacemaker-test-2 ]
> >>>
> >>> Full list of resources:
> >>>
> >>>  Clone Set: apache-clone [apache]
> >>>  Started: [ pacemaker-test-1 pacemaker-test-2 ]
> >>>
> >>> Daemon Status:
> >>>   corosync: active/disabled
> >>>   pacemaker: active/disabled
> >>>   pcsd: active/enabled
> >>>
> >>> *When simulating an error by killing the apache-resource on
> >>> pacemaker-test-1:*
> >>>
> >>> [root@pacemaker-test-1 ~]# killall httpd
> >>>
> >>> *After a few seconds, the cluster notices that the apache-resource is
> >> down
> >>> on pacemaker-test-1 and restarts it on pacemaker-test-1 (this is
> fine):*
> >>>
> >>> [root@pacemaker-test-1 cluster]# cat corosync.log | grep crmd:
> >>
> >> Never ever filter logs that you show unless you know what you are doing.
> >>
> >> You skipped the most interesting part that is the intended actions.
> >> Which are
> >>
> >> Aug 09 15:59:37.889 ha1 pacemaker-schedulerd[3783] (LogAction)  notice:
> >>  * Recoverapache:0 ( ha1 -> ha2 )
> >> Aug 09 15:59:37.889 ha1 pacemaker-schedulerd[3783] (LogAction)  notice:
> >>  * Move   apache:1 ( ha2 -> ha1 )
> >>
> >> So pacemaker decides to "swap" nodes where current instances are
> running.
> >>
> >
> > Correct. I've only skimmed this thread but it looks like:
> >
> > https://github.com/ClusterLabs/pacemaker/pull/2313
> > https://bugzilla.redhat.com/show_bug.cgi?id=1931023
> >
>
> It is far over my head, but from problem summary it likely is it. In
> this case the problem is actually caused by
>
> a) allocating still active clone instances first:
>
> /* allocation order:
>  *  - active instances
>  *  - instances running on nodes with the least copies
>  *  - active instances on nodes that can't support them or are to be
> fenced
>  *  - failed instances
>  *  - inactive instances
>  */
>
> b) assigning unequal scores to different nodes
>
> So when clone instance on ha1 fails, pacemaker 

Re: [ClusterLabs] Cloned resource is restarted on all nodes if one node fails

2021-08-09 Thread Andrei Borzenkov
On 09.08.2021 22:57, Reid Wahl wrote:
> On Mon, Aug 9, 2021 at 6:19 AM Andrei Borzenkov  wrote:
> 
>> On 09.08.2021 16:00, Andreas Janning wrote:
>>> Hi,
>>>
>>> yes, by "service" I meant the apache-clone resource.
>>>
>>> Maybe I can give a more stripped down and detailed example:
>>>
>>> *Given the following configuration:*
>>> [root@pacemaker-test-1 cluster]# pcs cluster cib --config
>>> [cluster CIB configuration quoted in full; reproduced in Andreas Janning's detailed example below]
>>>
>>>
>>> *With the cluster in a running state:*
>>>
>>> [root@pacemaker-test-1 cluster]# pcs status
>>> Cluster name: pacemaker-test
>>> Stack: corosync
>>> Current DC: pacemaker-test-2 (version 1.1.23-1.el7_9.1-9acf116022) -
>>> partition with quorum
>>> Last updated: Mon Aug  9 14:45:38 2021
>>> Last change: Mon Aug  9 14:43:14 2021 by hacluster via crmd on
>>> pacemaker-test-1
>>>
>>> 2 nodes configured
>>> 2 resource instances configured
>>>
>>> Online: [ pacemaker-test-1 pacemaker-test-2 ]
>>>
>>> Full list of resources:
>>>
>>>  Clone Set: apache-clone [apache]
>>>  Started: [ pacemaker-test-1 pacemaker-test-2 ]
>>>
>>> Daemon Status:
>>>   corosync: active/disabled
>>>   pacemaker: active/disabled
>>>   pcsd: active/enabled
>>>
>>> *When simulating an error by killing the apache-resource on
>>> pacemaker-test-1:*
>>>
>>> [root@pacemaker-test-1 ~]# killall httpd
>>>
>>> *After a few seconds, the cluster notices that the apache-resource is
>> down
>>> on pacemaker-test-1 and restarts it on pacemaker-test-1 (this is fine):*
>>>
>>> [root@pacemaker-test-1 cluster]# cat corosync.log | grep crmd:
>>
>> Never ever filter logs that you show unless you know what you are doing.
>>
>> You skipped the most interesting part that is the intended actions.
>> Which are
>>
>> Aug 09 15:59:37.889 ha1 pacemaker-schedulerd[3783] (LogAction)  notice:
>>  * Recoverapache:0 ( ha1 -> ha2 )
>> Aug 09 15:59:37.889 ha1 pacemaker-schedulerd[3783] (LogAction)  notice:
>>  * Move   apache:1 ( ha2 -> ha1 )
>>
>> So pacemaker decides to "swap" nodes where current instances are running.
>>
> 
> Correct. I've only skimmed this thread but it looks like:
> 
> https://github.com/ClusterLabs/pacemaker/pull/2313
> https://bugzilla.redhat.com/show_bug.cgi?id=1931023
> 

It is far over my head, but from the problem summary it likely is the same
issue. In this case the problem is actually caused by

a) allocating still active clone instances first:

/* allocation order:
 *  - active instances
 *  - instances running on nodes with the least copies
 *  - active instances on nodes that can't support them or are to be
fenced
 *  - failed instances
 *  - inactive instances
 */

b) assigning unequal scores to different nodes

So when the clone instance on ha1 fails, pacemaker starts by allocating the
instance that is still running on ha2. Because the configuration makes ha1
preferred, that instance gets allocated to ha1, and the instance that was
running on ha1 gets whatever is left.
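The effect is easy to watch with crm_simulate; a minimal sketch run against the live cluster:

  crm_simulate -sL

-s prints the allocation scores and -L uses the current CIB, so the per-instance scores and the resulting placement are visible without digging through the scheduler logs.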

> I've had some personal things get in the way of following up on the PR for
> a while. In my experience, configuring resource-stickiness has worked
> around the issue.
> 

In this case using equal weights for all nodes does it as well (plus the
implicit stickiness for clone instances). I am not sure what was intended
with these strange location constraints; that is up to the OP to answer.

> 
>> Looking at scores
>>
>> Using the original execution date of: 2021-08-09 12:59:37Z
>>
>> Current cluster status:
>> Online: [ ha1 ha2 ]
>>
>>  vip(ocf::pacemaker:Dummy):  Started ha1
>>  Clone Set: apache-clone [apache]
>>  apache (ocf::pacemaker:Dummy):  FAILED ha1
>>  Started: [ ha2 ]
>>
>> Allocation scores:
>> pcmk__clone_allocate: apache-clone allocation score on ha1: 200
>> pcmk__clone_allocate: apache-clone allocation score on ha2: 0
>> 

Re: [ClusterLabs] Cloned resource is restarted on all nodes if one node fails

2021-08-09 Thread Reid Wahl
On Mon, Aug 9, 2021 at 6:19 AM Andrei Borzenkov  wrote:

> On 09.08.2021 16:00, Andreas Janning wrote:
> > Hi,
> >
> > yes, by "service" I meant the apache-clone resource.
> >
> > Maybe I can give a more stripped down and detailed example:
> >
> > *Given the following configuration:*
> > [root@pacemaker-test-1 cluster]# pcs cluster cib --config
> > [cluster CIB configuration quoted in full; reproduced in Andreas Janning's detailed example below]
> >
> >
> > *With the cluster in a running state:*
> >
> > [root@pacemaker-test-1 cluster]# pcs status
> > Cluster name: pacemaker-test
> > Stack: corosync
> > Current DC: pacemaker-test-2 (version 1.1.23-1.el7_9.1-9acf116022) -
> > partition with quorum
> > Last updated: Mon Aug  9 14:45:38 2021
> > Last change: Mon Aug  9 14:43:14 2021 by hacluster via crmd on
> > pacemaker-test-1
> >
> > 2 nodes configured
> > 2 resource instances configured
> >
> > Online: [ pacemaker-test-1 pacemaker-test-2 ]
> >
> > Full list of resources:
> >
> >  Clone Set: apache-clone [apache]
> >  Started: [ pacemaker-test-1 pacemaker-test-2 ]
> >
> > Daemon Status:
> >   corosync: active/disabled
> >   pacemaker: active/disabled
> >   pcsd: active/enabled
> >
> > *When simulating an error by killing the apache-resource on
> > pacemaker-test-1:*
> >
> > [root@pacemaker-test-1 ~]# killall httpd
> >
> > *After a few seconds, the cluster notices that the apache-resource is
> down
> > on pacemaker-test-1 and restarts it on pacemaker-test-1 (this is fine):*
> >
> > [root@pacemaker-test-1 cluster]# cat corosync.log | grep crmd:
>
> Never ever filter logs that you show unless you know what you are doing.
>
> You skipped the most interesting part that is the intended actions.
> Which are
>
> Aug 09 15:59:37.889 ha1 pacemaker-schedulerd[3783] (LogAction)  notice:
>  * Recoverapache:0 ( ha1 -> ha2 )
> Aug 09 15:59:37.889 ha1 pacemaker-schedulerd[3783] (LogAction)  notice:
>  * Move   apache:1 ( ha2 -> ha1 )
>
> So pacemaker decides to "swap" nodes where current instances are running.
>

Correct. I've only skimmed this thread but it looks like:

https://github.com/ClusterLabs/pacemaker/pull/2313
https://bugzilla.redhat.com/show_bug.cgi?id=1931023

I've had some personal things get in the way of following up on the PR for
a while. In my experience, configuring resource-stickiness has worked
around the issue.
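A sketch of that workaround with pcs (untested; the value is arbitrary and could also be set as a resource default):

  pcs resource meta apache-clone resource-stickiness=200

For it to actually prevent the swap, the stickiness has to be larger than the gap between the two location scores (100 vs 0 in the posted configuration); the rsc_defaults value of 50 is not enough.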


> Looking at scores
>
> Using the original execution date of: 2021-08-09 12:59:37Z
>
> Current cluster status:
> Online: [ ha1 ha2 ]
>
>  vip(ocf::pacemaker:Dummy):  Started ha1
>  Clone Set: apache-clone [apache]
>  apache (ocf::pacemaker:Dummy):  FAILED ha1
>  Started: [ ha2 ]
>
> Allocation scores:
> pcmk__clone_allocate: apache-clone allocation score on ha1: 200
> pcmk__clone_allocate: apache-clone allocation score on ha2: 0
> pcmk__clone_allocate: apache:0 allocation score on ha1: 101
> pcmk__clone_allocate: apache:0 allocation score on ha2: 0
> pcmk__clone_allocate: apache:1 allocation score on ha1: 100
> pcmk__clone_allocate: apache:1 allocation score on ha2: 1
> pcmk__native_allocate: apache:1 allocation score on ha1: 100
> pcmk__native_allocate: apache:1 allocation score on ha2: 1
> pcmk__native_allocate: apache:1 allocation score on ha1: 100
> pcmk__native_allocate: apache:1 allocation score on ha2: 1
> pcmk__native_allocate: apache:0 allocation score on ha1: -INFINITY
> ^^
> pcmk__native_allocate: apache:0 allocation score on ha2: 0
> pcmk__native_allocate: vip allocation score on ha1: 100
> pcmk__native_allocate: vip allocation score on ha2: 0
>
> Transition Summary:
>  * Recoverapache:0 ( ha1 -> ha2 )
>  * Move   apache:1 ( ha2 -> ha1 )
>
>
> No, I do not have explanation why pacemaker decides that apache:0 cannot
> run on ha1 in this case and 

Re: [ClusterLabs] Cloned resource is restarted on all nodes if one node fails

2021-08-09 Thread Strahil Nikolov via Users
> <nvpair name="statusurl" value="http://localhost/server-status"/>

Can you show the apache config for the status page? It must be accessible only
from localhost (127.0.0.1) and should not be reachable from the other nodes.
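For reference, the usual server-status snippet that the ocf:heartbeat:apache statusurl check relies on looks roughly like this (Apache 2.4 syntax; a sketch, not the poster's actual config):

  <Location /server-status>
      SetHandler server-status
      Require local
  </Location>

Require local restricts the handler to requests from the local host, which matches the restriction described above.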


Best Regards,
Strahil Nikolov


Re: [ClusterLabs] Cloned resource is restarted on all nodes if one node fails

2021-08-09 Thread Andrei Borzenkov
On 09.08.2021 16:00, Andreas Janning wrote:
> Hi,
> 
> yes, by "service" I meant the apache-clone resource.
> 
> Maybe I can give a more stripped down and detailed example:
> 
> *Given the following configuration:*
> [root@pacemaker-test-1 cluster]# pcs cluster cib --config
> [cluster CIB configuration quoted in full; reproduced in Andreas Janning's detailed example below]
> 
> 
> *With the cluster in a running state:*
> 
> [root@pacemaker-test-1 cluster]# pcs status
> Cluster name: pacemaker-test
> Stack: corosync
> Current DC: pacemaker-test-2 (version 1.1.23-1.el7_9.1-9acf116022) -
> partition with quorum
> Last updated: Mon Aug  9 14:45:38 2021
> Last change: Mon Aug  9 14:43:14 2021 by hacluster via crmd on
> pacemaker-test-1
> 
> 2 nodes configured
> 2 resource instances configured
> 
> Online: [ pacemaker-test-1 pacemaker-test-2 ]
> 
> Full list of resources:
> 
>  Clone Set: apache-clone [apache]
>  Started: [ pacemaker-test-1 pacemaker-test-2 ]
> 
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled
> 
> *When simulating an error by killing the apache-resource on
> pacemaker-test-1:*
> 
> [root@pacemaker-test-1 ~]# killall httpd
> 
> *After a few seconds, the cluster notices that the apache-resource is down
> on pacemaker-test-1 and restarts it on pacemaker-test-1 (this is fine):*
> 
> [root@pacemaker-test-1 cluster]# cat corosync.log | grep crmd:

Never ever filter logs that you show unless you know what you are doing.

You skipped the most interesting part that is the intended actions.
Which are

Aug 09 15:59:37.889 ha1 pacemaker-schedulerd[3783] (LogAction)  notice:
 * Recoverapache:0 ( ha1 -> ha2 )
Aug 09 15:59:37.889 ha1 pacemaker-schedulerd[3783] (LogAction)  notice:
 * Move   apache:1 ( ha2 -> ha1 )

So pacemaker decides to "swap" nodes where current instances are running.

Looking at scores

Using the original execution date of: 2021-08-09 12:59:37Z

Current cluster status:
Online: [ ha1 ha2 ]

 vip(ocf::pacemaker:Dummy):  Started ha1
 Clone Set: apache-clone [apache]
 apache (ocf::pacemaker:Dummy):  FAILED ha1
 Started: [ ha2 ]

Allocation scores:
pcmk__clone_allocate: apache-clone allocation score on ha1: 200
pcmk__clone_allocate: apache-clone allocation score on ha2: 0
pcmk__clone_allocate: apache:0 allocation score on ha1: 101
pcmk__clone_allocate: apache:0 allocation score on ha2: 0
pcmk__clone_allocate: apache:1 allocation score on ha1: 100
pcmk__clone_allocate: apache:1 allocation score on ha2: 1
pcmk__native_allocate: apache:1 allocation score on ha1: 100
pcmk__native_allocate: apache:1 allocation score on ha2: 1
pcmk__native_allocate: apache:1 allocation score on ha1: 100
pcmk__native_allocate: apache:1 allocation score on ha2: 1
pcmk__native_allocate: apache:0 allocation score on ha1: -INFINITY
^^
pcmk__native_allocate: apache:0 allocation score on ha2: 0
pcmk__native_allocate: vip allocation score on ha1: 100
pcmk__native_allocate: vip allocation score on ha2: 0

Transition Summary:
 * Recoverapache:0 ( ha1 -> ha2 )
 * Move   apache:1 ( ha2 -> ha1 )


No, I do not have an explanation for why pacemaker decides that apache:0 cannot
run on ha1 in this case and so decides to move it to another node. It most
certainly has something to do with the asymmetric cluster and the location
scores. If you set the same location scores for apache-clone on both nodes,
pacemaker will recover the failed instance and won't attempt to move it. Like

location location-apache-clone-ha1-100 apache-clone
resource-discovery=exclusive 100: ha1
location location-apache-clone-ha2-100 apache-clone
resource-discovery=exclusive 100: ha2
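A possible pcs equivalent of those two constraints (untested sketch, reusing the same constraint ids):

  pcs constraint location add location-apache-clone-ha1-100 apache-clone ha1 100 resource-discovery=exclusive
  pcs constraint location add location-apache-clone-ha2-100 apache-clone ha2 100 resource-discovery=exclusive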



Re: [ClusterLabs] Cloned resource is restarted on all nodes if one node fails

2021-08-09 Thread Andreas Janning
Hi,

yes, by "service" I meant the apache-clone resource.

Maybe I can give a more stripped down and detailed example:

*Given the following configuration:*
[root@pacemaker-test-1 cluster]# pcs cluster cib --config
<configuration>
  <crm_config>
    <cluster_property_set>
      <nvpair name="have-watchdog" value="false"/>
      <nvpair name="dc-version" value="1.1.23-1.el7_9.1-9acf116022"/>
      <nvpair name="cluster-infrastructure" value="corosync"/>
      <nvpair name="cluster-name" value="pacemaker-test"/>
      <nvpair name="stonith-enabled" value="false"/>
      <nvpair name="symmetric-cluster" value="false"/>
      <nvpair name="last-lrm-refresh" value="1628511747"/>
    </cluster_property_set>
  </crm_config>
  <nodes>
    <node id="1" uname="pacemaker-test-1"/>
    <node id="2" uname="pacemaker-test-2"/>
  </nodes>
  <resources>
    <clone id="apache-clone">
      <primitive id="apache" class="ocf" provider="heartbeat" type="apache">
        <instance_attributes>
          <nvpair name="port" value="80"/>
          <nvpair name="statusurl" value="http://localhost/server-status"/>
        </instance_attributes>
        <operations>
          <op name="monitor" timeout="20s"/>
          <op name="start" timeout="40s"/>
          <op name="stop" timeout="60s"/>
        </operations>
      </primitive>
      <meta_attributes>
        <nvpair name="clone-max" value="2"/>
        <nvpair name="clone-node-max" value="1"/>
        <nvpair name="interleave" value="true"/>
      </meta_attributes>
    </clone>
  </resources>
  <constraints>
    <rsc_location node="pacemaker-test-1" rsc="apache-clone" score="100" resource-discovery="exclusive"/>
    <rsc_location node="pacemaker-test-2" rsc="apache-clone" score="0" resource-discovery="exclusive"/>
  </constraints>
  <rsc_defaults>
    <meta_attributes>
      <nvpair name="resource-stickiness" value="50"/>
    </meta_attributes>
  </rsc_defaults>
</configuration>

*With the cluster in a running state:*

[root@pacemaker-test-1 cluster]# pcs status
Cluster name: pacemaker-test
Stack: corosync
Current DC: pacemaker-test-2 (version 1.1.23-1.el7_9.1-9acf116022) -
partition with quorum
Last updated: Mon Aug  9 14:45:38 2021
Last change: Mon Aug  9 14:43:14 2021 by hacluster via crmd on
pacemaker-test-1

2 nodes configured
2 resource instances configured

Online: [ pacemaker-test-1 pacemaker-test-2 ]

Full list of resources:

 Clone Set: apache-clone [apache]
 Started: [ pacemaker-test-1 pacemaker-test-2 ]

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

*When simulating an error by killing the apache-resource on
pacemaker-test-1:*

[root@pacemaker-test-1 ~]# killall httpd

*After a few seconds, the cluster notices that the apache-resource is down
on pacemaker-test-1 and restarts it on pacemaker-test-1 (this is fine):*

[root@pacemaker-test-1 cluster]# cat corosync.log | grep crmd:
Aug 09 14:49:30 [10336] pacemaker-test-1   crmd: info:
process_lrm_event: Result of monitor operation for apache on
pacemaker-test-1: 7 (not running) | call=12 key=apache_monitor_1
confirmed=false cib-update=22
Aug 09 14:49:30 [10336] pacemaker-test-1   crmd: info:
do_lrm_rsc_op: Performing key=3:4:0:0fe9a8dd-1a73-4770-a36e-b14a6bb37d68
op=apache_stop_0
Aug 09 14:49:30 [10336] pacemaker-test-1   crmd: info:
process_lrm_event: Result of monitor operation for apache on
pacemaker-test-1: Cancelled | call=12 key=apache_monitor_1
confirmed=true
Aug 09 14:49:30 [10336] pacemaker-test-1   crmd:   notice:
process_lrm_event: Result of stop operation for apache on pacemaker-test-1:
0 (ok) | call=14 key=apache_stop_0 confirmed=true cib-update=24
Aug 09 14:49:32 [10336] pacemaker-test-1   crmd: info:
do_lrm_rsc_op: Performing key=5:4:0:0fe9a8dd-1a73-4770-a36e-b14a6bb37d68
op=apache_start_0
Aug 09 14:49:33 [10336] pacemaker-test-1   crmd:   notice:
process_lrm_event: Result of start operation for apache on
pacemaker-test-1: 0 (ok) | call=15 key=apache_start_0 confirmed=true
cib-update=26
Aug 09 14:49:33 [10336] pacemaker-test-1   crmd: info:
do_lrm_rsc_op: Performing key=6:4:0:0fe9a8dd-1a73-4770-a36e-b14a6bb37d68
op=apache_monitor_1
Aug 09 14:49:34 [10336] pacemaker-test-1   crmd: info:
process_lrm_event: Result of monitor operation for apache on
pacemaker-test-1: 0 (ok) | call=16 key=apache_monitor_1 confirmed=false
cib-update=28

*BUT the cluster also restarts the apache-resource on pacemaker-test-2, which
it should not do because the apache-resource on pacemaker-test-2 did not
crash:*

[root@pacemaker-test-2 cluster]# cat corosync.log | grep crmd:
Aug 09 14:49:30 [18553] pacemaker-test-2   crmd: info:
update_failcount: Updating failcount for apache on pacemaker-test-1 after
failed monitor: rc=7 (update=value++, time=1628513370)
Aug 09 14:49:30 [18553] pacemaker-test-2   crmd: info:
process_graph_event: Detected action (2.6) apache_monitor_1.12=not
running: failed
Aug 09 14:49:30 [18553] pacemaker-test-2   crmd:   notice:
do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE |
input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
Aug 09 14:49:30 [18553] pacemaker-test-2   crmd: info:
do_state_transition: State transition S_POLICY_ENGINE ->
S_TRANSITION_ENGINE | input=I_PE_SUCCESS cause=C_IPC_MESSAGE
origin=handle_response
Aug 09 14:49:30 [18553] pacemaker-test-2   crmd: info:
do_te_invoke: Processing graph 3 (ref=pe_calc-dc-1628513370-25) derived
from /var/lib/pacemaker/pengine/pe-input-51.bz2
Aug 09 14:49:30 [18553] pacemaker-test-2   crmd:   notice:
abort_transition_graph: Transition aborted by
status-1-fail-count-apache.monitor_1 doing modify
fail-count-apache#monitor_1=2: Transient attribute change | cib=0.33.33
source=abort_unless_down:356
path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-fail-count-apache.monitor_1']
complete=false
Aug 09 14:49:30 [18553] pacemaker-test-2   crmd: info:
abort_transition_graph: Transition aborted by
status-1-last-failure-apache.monitor_1 doing modify
last-failure-apache#monitor_1=1628513370: Transient attribute change |
cib=0.33.34 source=abort_unless_down:356
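The scheduler's decision for this event can be replayed from the pe-input file named in the DC's log above; a sketch (run on pacemaker-test-2, where the file lives):

  crm_simulate -sx /var/lib/pacemaker/pengine/pe-input-51.bz2

That prints the allocation scores and planned actions for exactly this transition, which is usually the quickest way to see why the healthy instance on pacemaker-test-2 was scheduled for a restart.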

Re: [ClusterLabs] Cloned resource is restarted on all nodes if one node fails

2021-08-09 Thread Andrei Borzenkov
On Mon, Aug 9, 2021 at 3:07 PM Andreas Janning
 wrote:
>
> Hi,
>
> I have just tried your suggestion by adding
> <nvpair name="interleave" value="true"/>
> to the clone configuration.
> Unfortunately, the behavior stays the same. The service is still restarted on 
> the passive node when crashing it on the active node.
>

What is "service"? Is it the resource with id=apache-clone in your
configuration?

Logs from the DC around the time of the crash would certainly be useful here.
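A convenient way to gather those is crm_report; a sketch (the time window and output path are placeholders):

  crm_report -f "2021-08-09 13:30:00" -t "2021-08-09 14:00:00" /tmp/apache-clone-failure

It collects the logs, CIB and pe-input files from all reachable nodes, including the DC.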

> Regards
>
> Andreas
>
> On Mon, Aug 9, 2021 at 13:45, Vladislav Bogdanov wrote:
>>
>> Hi.
>> I'd suggest to set your clone meta attribute 'interleave' to 'true'
>>
>> Best,
>> Vladislav
>>
>> On August 9, 2021 1:43:16 PM Andreas Janning  
>> wrote:
>>>
>>> Hi all,
>>>
>>> we recently experienced an outage in our pacemaker cluster and I would like 
>>> to understand how we can configure the cluster to avoid this problem in the 
>>> future.
>>>
>>> First our basic setup:
>>> - CentOS7
>>> - Pacemaker 1.1.23
>>> - Corosync 2.4.5
>>> - Resource-Agents 4.1.1
>>>
>>> Our cluster is composed of multiple active/passive nodes. Each software 
>>> component runs on two nodes simultaneously and all traffic is routed to the 
>>> active node via Virtual IP.
>>> If the active node fails, the passive node grabs the Virtual IP and 
>>> immediately takes over all work of the failed node. Since the software is 
>>> already up and running on the passive node, there should be virtually no 
>>> downtime.
>>> We have tried achieved this in pacemaker by configuring clone-sets for each 
>>> software component.
>>>
>>> Now the problem:
>>> When a software component fails on the active node, the Virtual-IP is 
>>> correctly grabbed by the passive node. BUT the software component is also 
>>> immediately restarted on the passive Node.
>>> That unfortunately defeats the purpose of the whole setup, since we now 
>>> have a downtime until the software component is restarted on the passive 
>>> node and the restart might even fail and lead to a complete outage.
>>> After some investigating I now understand that the cloned resource is 
>>> restarted on all nodes after a monitoring failure because the default 
>>> "on-fail" of "monitor" is restart. But that is not what I want.
>>>
>>> I have created a minimal setup that reproduces the problem:
>>>
>>> [cluster CIB configuration quoted in full; reproduced in the original posting at the end of this page]
>>>
>>>
>>>
>>> When this configuration is started, httpd will be running on active-node 
>>> and passive-node. The VIP runs only on active-node.
>>> When crashing the httpd on active-node (with killall httpd), passive-node 
>>> immediately grabs the VIP and restarts its own httpd.
>>>
>>> How can I change this configuration so that when the resource fails on 
>>> active-node:
>>> - passive-node immediately grabs the VIP (as it does now).
>>> - active-node tries to restart the failed resource, giving up after x 
>>> attempts.
>>> - passive-node does NOT restart the resource.
>>>
>>> Regards
>>>
>>> Andreas Janning
>>>
>>>
>>>
>>> --
>>> 
>>>
>>> Beste Arbeitgeber ITK 2021 - 1. Platz für QAware
>>> ausgezeichnet von Great Place to Work
>>>
>>> 
>>>
>>> Andreas Janning
>>> Expert Software Engineer
>>>
>>> QAware GmbH
>>> Aschauer Straße 32
>>> 81549 München, Germany
>>> Mobil +49 160 1492426
>>> andreas.jann...@qaware.de
>>> www.qaware.de
>>>
>>> 
>>>
>>> Geschäftsführer: Christian Kamm, Johannes Weigend, Dr. Josef Adersberger
>>> Registergericht: München
>>> Handelsregisternummer: HRB 163761
>>>
>>>
>>
>
>
> --
> 

Re: [ClusterLabs] Cloned resource is restarted on all nodes if one node fails

2021-08-09 Thread Andreas Janning
Hi,

I have just tried your suggestion by adding
<nvpair name="interleave" value="true"/>
to the clone configuration.
Unfortunately, the behavior stays the same. The service is still restarted
on the passive node when crashing it on the active node.
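A quick way to confirm the attribute actually landed on the clone (a sketch; pcs 0.9 syntax as shipped with CentOS 7, newer pcs uses "pcs resource config"):

  pcs resource show apache-clone

interleave=true should show up among the clone's meta attributes.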

Regards

Andreas

On Mon, Aug 9, 2021 at 13:45, Vladislav Bogdanov <bub...@hoster-ok.com> wrote:

> Hi.
> I'd suggest to set your clone meta attribute 'interleave' to 'true'
>
> Best,
> Vladislav
>
> On August 9, 2021 1:43:16 PM Andreas Janning 
> wrote:
>
>> Hi all,
>>
>> we recently experienced an outage in our pacemaker cluster and I would
>> like to understand how we can configure the cluster to avoid this problem
>> in the future.
>>
>> First our basic setup:
>> - CentOS7
>> - Pacemaker 1.1.23
>> - Corosync 2.4.5
>> - Resource-Agents 4.1.1
>>
>> Our cluster is composed of multiple active/passive nodes. Each software
>> component runs on two nodes simultaneously and all traffic is routed to the
>> active node via Virtual IP.
>> If the active node fails, the passive node grabs the Virtual IP and
>> immediately takes over all work of the failed node. Since the software is
>> already up and running on the passive node, there should be virtually no
>> downtime.
>> We have tried achieved this in pacemaker by configuring clone-sets for
>> each software component.
>>
>> Now the problem:
>> When a software component fails on the active node, the Virtual-IP is
>> correctly grabbed by the passive node. BUT the software component is also
>> immediately restarted on the passive Node.
>> That unfortunately defeats the purpose of the whole setup, since we now
>> have a downtime until the software component is restarted on the passive
>> node and the restart might even fail and lead to a complete outage.
>> After some investigating I now understand that the cloned resource is
>> restarted on all nodes after a monitoring failure because the default
>> "on-fail" of "monitor" is restart. But that is not what I want.
>>
>> I have created a minimal setup that reproduces the problem:
>>
>>> [cluster CIB configuration quoted in full; reproduced in the original posting at the end of this page]
>>>
>>
>>
>> When this configuration is started, httpd will be running on active-node
>> and passive-node. The VIP runs only on active-node.
>> When crashing the httpd on active-node (with killall httpd), passive-node
>> immediately grabs the VIP and restarts its own httpd.
>>
>> How can I change this configuration so that when the resource fails on
>> active-node:
>> - passive-node immediately grabs the VIP (as it does now).
>> - active-node tries to restart the failed resource, giving up after x
>> attempts.
>> - passive-node does NOT restart the resource.
>>
>> Regards
>>
>> Andreas Janning
>>
>>
>>
>> --
>> --
>>
>> *Beste Arbeitgeber ITK 2021 - 1. Platz für QAware*
>> ausgezeichnet von Great Place to Work
>> 
>> --
>>
>> Andreas Janning
>> Expert Software Engineer
>>
>> QAware GmbH
>> Aschauer Straße 32
>> 81549 München, Germany
>> Mobil +49 160 1492426
>> andreas.jann...@qaware.de
>> www.qaware.de
>> --
>>
>> Geschäftsführer: Christian Kamm, Johannes Weigend, Dr. Josef Adersberger
>> Registergericht: München
>> Handelsregisternummer: HRB 163761
>>
>>
>

-- 
--

*Beste Arbeitgeber ITK 2021 - 1. Platz für QAware*
ausgezeichnet von Great Place to Work

--

Andreas Janning
Expert Software Engineer

QAware GmbH
Aschauer Straße 32
81549 München, 

Re: [ClusterLabs] Cloned resource is restarted on all nodes if one node fails

2021-08-09 Thread Vladislav Bogdanov

Hi.
I'd suggest to set your clone meta attribute 'interleave' to 'true'
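With pcs that would be something along these lines (assuming the clone id is apache-clone, as in the posted configuration):

  pcs resource meta apache-clone interleave=true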

Best,
Vladislav

On August 9, 2021 1:43:16 PM Andreas Janning  wrote:

Hi all,

we recently experienced an outage in our pacemaker cluster and I would like 
to understand how we can configure the cluster to avoid this problem in the 
future.


First our basic setup:
- CentOS7
- Pacemaker 1.1.23
- Corosync 2.4.5
- Resource-Agents 4.1.1

Our cluster is composed of multiple active/passive nodes. Each software 
component runs on two nodes simultaneously and all traffic is routed to the 
active node via Virtual IP.
If the active node fails, the passive node grabs the Virtual IP and 
immediately takes over all work of the failed node. Since the software is 
already up and running on the passive node, there should be virtually no 
downtime.
We have tried achieved this in pacemaker by configuring clone-sets for each 
software component.


Now the problem:
When a software component fails on the active node, the Virtual-IP is 
correctly grabbed by the passive node. BUT the software component is also 
immediately restarted on the passive Node.
That unfortunately defeats the purpose of the whole setup, since we now 
have a downtime until the software component is restarted on the passive 
node and the restart might even fail and lead to a complete outage.
After some investigating I now understand that the cloned resource is 
restarted on all nodes after a monitoring failure because the default 
"on-fail" of "monitor" is restart. But that is not what I want.


I have created a minimal setup that reproduces the problem:

[cluster CIB configuration quoted in full; reproduced in the original posting at the end of this page]

When this configuration is started, httpd will be running on active-node 
and passive-node. The VIP runs only on active-node. When crashing the httpd on 
When crashing the httpd on active-node (with killall httpd), passive-node 
immediately grabs the VIP and restarts its own httpd.


How can I change this configuration so that when the resource fails on 
active-node:

- passive-node immediately grabs the VIP (as it does now).
- active-node tries to restart the failed resource, giving up after x attempts.
- passive-node does NOT restart the resource.

Regards

Andreas Janning



--

Beste Arbeitgeber ITK 2021 - 1. Platz für QAware
ausgezeichnet von Great Place to Work
Andreas Janning
Expert Software Engineer
QAware GmbH
Aschauer Straße 32
81549 München, Germany
Mobil +49 160 1492426
andreas.jann...@qaware.de
www.qaware.de
Geschäftsführer: Christian Kamm, Johannes Weigend, Dr. Josef Adersberger
Registergericht: München
Handelsregisternummer: HRB 163761


Re: [ClusterLabs] Cloned resource is restarted on all nodes if one node fails

2021-08-09 Thread Strahil Nikolov via Users
I've set up something similar with a VIP that is active everywhere using
globally-unique=true (where the cluster controls which node is passive and which
is active). This allows the VIP to be present on every node while only one node
answers the requests, and the web server was running everywhere with its config
and data on a shared FS.
Sadly, I can't find my notes right now.
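A minimal sketch of that kind of setup (not Strahil's actual config; the address and netmask are placeholders, and IPaddr2 with clusterip_hash is assumed):

  pcs resource create vip ocf:heartbeat:IPaddr2 ip=192.0.2.10 cidr_netmask=24 \
      clusterip_hash=sourceip op monitor interval=10s
  pcs resource clone vip clone-max=2 clone-node-max=2 globally-unique=true

With globally-unique=true and clone-node-max=2, both VIP instances may run on one node if the other fails, so the address keeps being answered.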
Best Regards,
Strahil Nikolov
 
 
On Mon, Aug 9, 2021 at 13:43, Andreas Janning wrote:

Hi all,
we recently experienced an outage in our pacemaker cluster and I would like to
understand how we can configure the cluster to avoid this problem in the future.

First our basic setup:
- CentOS7
- Pacemaker 1.1.23
- Corosync 2.4.5
- Resource-Agents 4.1.1

Our cluster is composed of multiple active/passive nodes. Each software
component runs on two nodes simultaneously and all traffic is routed to the
active node via Virtual IP. If the active node fails, the passive node grabs the
Virtual IP and immediately takes over all work of the failed node. Since the
software is already up and running on the passive node, there should be
virtually no downtime. We have tried to achieve this in pacemaker by configuring
clone-sets for each software component.

Now the problem:
When a software component fails on the active node, the Virtual-IP is correctly
grabbed by the passive node. BUT the software component is also immediately
restarted on the passive node. That unfortunately defeats the purpose of the
whole setup, since we now have a downtime until the software component is
restarted on the passive node and the restart might even fail and lead to a
complete outage. After some investigating I now understand that the cloned
resource is restarted on all nodes after a monitoring failure because the
default "on-fail" of "monitor" is restart. But that is not what I want.
I have created a minimal setup that reproduces the problem:

[cluster CIB configuration quoted in full; reproduced in the original posting at the end of this page]



When this configuration is started, httpd will be running on active-node and
passive-node. The VIP runs only on active-node. When crashing the httpd on
active-node (with killall httpd), passive-node immediately grabs the VIP and
restarts its own httpd.

How can I change this configuration so that when the resource fails on
active-node:
- passive-node immediately grabs the VIP (as it does now).
- active-node tries to restart the failed resource, giving up after x attempts.
- passive-node does NOT restart the resource.
Regards
Andreas Janning



-- 
   
 Beste Arbeitgeber ITK 2021 - 1. Platz für QAware
 ausgezeichnet von Great Place to Work 
  
 Andreas Janning
 Expert Software Engineer
 
 
 QAware GmbH
 Aschauer Straße 32
 81549 München, Germany
 Mobil +49 160 1492426
 andreas.jann...@qaware.de
 www.qaware.de
 

 Geschäftsführer: Christian Kamm, Johannes Weigend, Dr. Josef Adersberger
 Registergericht: München
 Handelsregisternummer: HRB 163761
 


[ClusterLabs] Cloned resource is restarted on all nodes if one node fails

2021-08-09 Thread Andreas Janning
Hi all,

we recently experienced an outage in our pacemaker cluster and I would like
to understand how we can configure the cluster to avoid this problem in the
future.

First our basic setup:
- CentOS7
- Pacemaker 1.1.23
- Corosync 2.4.5
- Resource-Agents 4.1.1

Our cluster is composed of multiple active/passive nodes. Each software
component runs on two nodes simultaneously and all traffic is routed to the
active node via Virtual IP.
If the active node fails, the passive node grabs the Virtual IP and
immediately takes over all work of the failed node. Since the software is
already up and running on the passive node, there should be virtually no
downtime.
We have tried to achieve this in pacemaker by configuring clone-sets for each
software component.

Now the problem:
When a software component fails on the active node, the Virtual-IP is
correctly grabbed by the passive node. BUT the software component is also
immediately restarted on the passive node.
That unfortunately defeats the purpose of the whole setup, since we now
have a downtime until the software component is restarted on the passive
node and the restart might even fail and lead to a complete outage.
After some investigating I now understand that the cloned resource is
restarted on all nodes after a monitoring failure because the default
"on-fail" of "monitor" is restart. But that is not what I want.

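For reference, on-fail is a per-operation attribute, so in the CIB it would sit on the monitor op itself; an illustrative snippet (not taken from the configuration below):

  <op name="monitor" interval="10s" timeout="20s" on-fail="restart"/>

It only controls how the failed instance itself is handled; it does not by itself explain the restart of the healthy instance on the other node.
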
I have created a minimal setup that reproduces the problem:

<configuration>
  <crm_config>
    <cluster_property_set>
      <nvpair name="have-watchdog" value="false"/>
      <nvpair name="dc-version" value="1.1.23-1.el7_9.1-9acf116022"/>
      <nvpair name="cluster-infrastructure" value="corosync"/>
      <nvpair name="cluster-name" value="pacemaker-test"/>
      <nvpair name="stonith-enabled" value="false"/>
      <nvpair name="symmetric-cluster" value="false"/>
    </cluster_property_set>
  </crm_config>
  <nodes>
    <node uname="active-node"/>
    <node uname="passive-node"/>
  </nodes>
  <resources>
    <primitive id="vip" class="ocf" provider="heartbeat" type="IPaddr2">
      <instance_attributes>
        <nvpair name="ip" value="{{infrastructure.virtual_ip}}"/>
      </instance_attributes>
      <operations>
        <op name="monitor" timeout="20s"/>
        <op name="start" timeout="20s"/>
        <op name="stop" timeout="20s"/>
      </operations>
    </primitive>
    <clone id="apache-clone">
      <primitive id="apache" class="ocf" provider="heartbeat" type="apache">
        <instance_attributes>
          <nvpair name="statusurl" value="http://localhost/server-status"/>
        </instance_attributes>
        <operations>
          <op name="monitor" timeout="20s"/>
          <op name="start" timeout="40s"/>
          <op name="stop" timeout="60s"/>
        </operations>
      </primitive>
      <meta_attributes>
        <nvpair name="clone-max" value="2"/>
        <nvpair name="clone-node-max" value="1"/>
      </meta_attributes>
    </clone>
  </resources>
  <constraints>
    <rsc_location node="active-node" rsc="apache-clone" score="100" resource-discovery="exclusive"/>
    <rsc_location node="passive-node" rsc="apache-clone" score="0" resource-discovery="exclusive"/>
    <rsc_location node="active-node" rsc="vip" score="100" resource-discovery="exclusive"/>
    <rsc_location node="passive-node" rsc="vip" score="0" resource-discovery="exclusive"/>
    <rsc_colocation rsc="vip" score="INFINITY" with-rsc="apache-clone"/>
  </constraints>
  <rsc_defaults>
    <meta_attributes>
      <nvpair name="resource-stickiness" value="50"/>
    </meta_attributes>
  </rsc_defaults>
</configuration>


When this configuration is started, httpd will be running on active-node
and passive-node. The VIP runs only on active-node.
When crashing the httpd on active-node (with killall httpd), passive-node
immediately grabs the VIP and restarts its own httpd.

How can I change this configuration so that when the resource fails on
active-node:
- passive-node immediately grabs the VIP (as it does now).
- active-node tries to restart the failed resource, giving up after x
attempts.
- passive-node does NOT restart the resource.
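For the second point, the usual knobs are migration-threshold and failure-timeout; a sketch with pcs (values are arbitrary):

  pcs resource meta apache migration-threshold=3 failure-timeout=60s

After 3 failed monitors the instance is then banned from the failing node until the failcount expires or is cleared with pcs resource cleanup apache.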

Regards

Andreas Janning



-- 
--

*Beste Arbeitgeber ITK 2021 - 1. Platz für QAware*
ausgezeichnet von Great Place to Work

--

Andreas Janning
Expert Software Engineer

QAware GmbH
Aschauer Straße 32
81549 München, Germany
Mobil +49 160 1492426
andreas.jann...@qaware.de
www.qaware.de
--

Geschäftsführer: Christian Kamm, Johannes Weigend, Dr. Josef Adersberger
Registergericht: München
Handelsregisternummer: HRB 163761