On Mon, Aug 9, 2021 at 3:07 PM Andreas Janning <andreas.jann...@qaware.de> wrote:
>
> Hi,
>
> I have just tried your suggestion by adding
>   <nvpair id="apache-clone-meta_attributes-interleave" name="interleave" value="true"/>
> to the clone configuration.
> Unfortunately, the behavior stays the same. The service is still restarted
> on the passive node when crashing it on the active node.
>
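For anyone following along, that nvpair belongs inside the clone's
meta_attributes element. A minimal sketch of the resulting clone definition,
reusing the ids from the configuration quoted further down (the apache
primitive is omitted here for brevity):

  <clone id="apache-clone">
    <!-- apache primitive unchanged -->
    <meta_attributes id="apache-meta_attributes">
      <nvpair id="apache-clone-meta_attributes-clone-max" name="clone-max" value="2"/>
      <nvpair id="apache-clone-meta_attributes-clone-node-max" name="clone-node-max" value="1"/>
      <!-- interleave changes how ordering/colocation constraints between two
           clones are applied per node; it does not by itself change how a
           failed instance is recovered, which is consistent with the
           behavior reported above -->
      <nvpair id="apache-clone-meta_attributes-interleave" name="interleave" value="true"/>
    </meta_attributes>
  </clone>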
What is "service"? Is it the resource with id=apache-clone in your configuration? Logs from DC around time of crash would certainly be useful here. > Regards > > Andreas > > Am Mo., 9. Aug. 2021 um 13:45 Uhr schrieb Vladislav Bogdanov > <bub...@hoster-ok.com>: >> >> Hi. >> I'd suggest to set your clone meta attribute 'interleaved' to 'true' >> >> Best, >> Vladislav >> >> On August 9, 2021 1:43:16 PM Andreas Janning <andreas.jann...@qaware.de> >> wrote: >>> >>> Hi all, >>> >>> we recently experienced an outage in our pacemaker cluster and I would like >>> to understand how we can configure the cluster to avoid this problem in the >>> future. >>> >>> First our basic setup: >>> - CentOS7 >>> - Pacemaker 1.1.23 >>> - Corosync 2.4.5 >>> - Resource-Agents 4.1.1 >>> >>> Our cluster is composed of multiple active/passive nodes. Each software >>> component runs on two nodes simultaneously and all traffic is routed to the >>> active node via Virtual IP. >>> If the active node fails, the passive node grabs the Virtual IP and >>> immediately takes over all work of the failed node. Since the software is >>> already up and running on the passive node, there should be virtually no >>> downtime. >>> We have tried achieved this in pacemaker by configuring clone-sets for each >>> software component. >>> >>> Now the problem: >>> When a software component fails on the active node, the Virtual-IP is >>> correctly grabbed by the passive node. BUT the software component is also >>> immediately restarted on the passive Node. >>> That unfortunately defeats the purpose of the whole setup, since we now >>> have a downtime until the software component is restarted on the passive >>> node and the restart might even fail and lead to a complete outage. >>> After some investigating I now understand that the cloned resource is >>> restarted on all nodes after a monitoring failure because the default >>> "on-fail" of "monitor" is restart. But that is not what I want. 
>>>
>>> I have created a minimal setup that reproduces the problem:
>>>
>>>> <configuration>
>>>>   <crm_config>
>>>>     <cluster_property_set id="cib-bootstrap-options">
>>>>       <nvpair id="cib-bootstrap-options-have-watchdog" name="have-watchdog" value="false"/>
>>>>       <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.23-1.el7_9.1-9acf116022"/>
>>>>       <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
>>>>       <nvpair id="cib-bootstrap-options-cluster-name" name="cluster-name" value="pacemaker-test"/>
>>>>       <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
>>>>       <nvpair id="cib-bootstrap-options-symmetric-cluster" name="symmetric-cluster" value="false"/>
>>>>     </cluster_property_set>
>>>>   </crm_config>
>>>>   <nodes>
>>>>     <node id="1" uname="active-node"/>
>>>>     <node id="2" uname="passive-node"/>
>>>>   </nodes>
>>>>   <resources>
>>>>     <primitive class="ocf" id="vip" provider="heartbeat" type="IPaddr2">
>>>>       <instance_attributes id="vip-instance_attributes">
>>>>         <nvpair id="vip-instance_attributes-ip" name="ip" value="{{infrastructure.virtual_ip}}"/>
>>>>       </instance_attributes>
>>>>       <operations>
>>>>         <op id="psa-vip-monitor-interval-10s" interval="10s" name="monitor" timeout="20s"/>
>>>>         <op id="psa-vip-start-interval-0s" interval="0s" name="start" timeout="20s"/>
>>>>         <op id="psa-vip-stop-interval-0s" interval="0s" name="stop" timeout="20s"/>
>>>>       </operations>
>>>>     </primitive>
>>>>     <clone id="apache-clone">
>>>>       <primitive class="ocf" id="apache" provider="heartbeat" type="apache">
>>>>         <instance_attributes id="apache-instance_attributes">
>>>>           <nvpair id="apache-instance_attributes-port" name="port" value="80"/>
>>>>           <nvpair id="apache-instance_attributes-statusurl" name="statusurl" value="http://localhost/server-status"/>
>>>>         </instance_attributes>
>>>>         <operations>
>>>>           <op id="apache-monitor-interval-10s" interval="10s" name="monitor" timeout="20s"/>
>>>>           <op id="apache-start-interval-0s" interval="0s" name="start" timeout="40s"/>
>>>>           <op id="apache-stop-interval-0s" interval="0s" name="stop" timeout="60s"/>
>>>>         </operations>
>>>>       </primitive>
>>>>       <meta_attributes id="apache-meta_attributes">
>>>>         <nvpair id="apache-clone-meta_attributes-clone-max" name="clone-max" value="2"/>
>>>>         <nvpair id="apache-clone-meta_attributes-clone-node-max" name="clone-node-max" value="1"/>
>>>>       </meta_attributes>
>>>>     </clone>
>>>>   </resources>
>>>>   <constraints>
>>>>     <rsc_location id="location-apache-clone-active-node-100" node="active-node" rsc="apache-clone" score="100" resource-discovery="exclusive"/>
>>>>     <rsc_location id="location-apache-clone-passive-node-0" node="passive-node" rsc="apache-clone" score="0" resource-discovery="exclusive"/>
>>>>     <rsc_location id="location-vip-clone-active-node-100" node="active-node" rsc="vip" score="100" resource-discovery="exclusive"/>
>>>>     <rsc_location id="location-vip-clone-passive-node-0" node="passive-node" rsc="vip" score="0" resource-discovery="exclusive"/>
>>>>     <rsc_colocation id="colocation-vip-apache-clone-INFINITY" rsc="vip" score="INFINITY" with-rsc="apache-clone"/>
>>>>   </constraints>
>>>>   <rsc_defaults>
>>>>     <meta_attributes id="rsc_defaults-options">
>>>>       <nvpair id="rsc_defaults-options-resource-stickiness" name="resource-stickiness" value="50"/>
>>>>     </meta_attributes>
>>>>   </rsc_defaults>
>>>> </configuration>
>>>
>>> When this configuration is started, httpd will be running on active-node
>>> and passive-node. The VIP runs only on active-node.
>>> When httpd is crashed on active-node (with killall httpd), passive-node
>>> immediately grabs the VIP and restarts its own httpd.
>>>
>>> How can I change this configuration so that when the resource fails on
>>> active-node:
>>> - passive-node immediately grabs the VIP (as it does now).
>>> - active-node tries to restart the failed resource, giving up after x
>>>   attempts (one way to express such a limit is sketched at the end of
>>>   this thread).
>>> - passive-node does NOT restart the resource.
>>>
>>> Regards
>>>
>>> Andreas Janning
>>>
>>> --
>>> Andreas Janning
>>> Expert Software Engineer
>>>
>>> QAware GmbH
>>> Aschauer Straße 32
>>> 81549 München, Germany
>>> Mobile +49 160 1492426
>>> andreas.jann...@qaware.de
>>> www.qaware.de
>>>
>>> Managing directors: Christian Kamm, Johannes Weigend, Dr. Josef Adersberger
>>> Register court: München
>>> Commercial register number: HRB 163761

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
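On the "giving up after x attempts" part of the question: a per-node retry
limit is normally expressed with the migration-threshold resource meta
attribute, optionally combined with failure-timeout to let the fail count
expire. A minimal sketch, extending the apache-clone meta_attributes from the
configuration quoted above (the values 3 and 600s are placeholders, and this
only caps how often a node retries; on its own it does not stop the passive
node's instance from being restarted):

  <meta_attributes id="apache-meta_attributes">
    <nvpair id="apache-clone-meta_attributes-clone-max" name="clone-max" value="2"/>
    <nvpair id="apache-clone-meta_attributes-clone-node-max" name="clone-node-max" value="1"/>
    <!-- after 3 failures, the node is no longer allowed to run this instance -->
    <nvpair id="apache-clone-meta_attributes-migration-threshold" name="migration-threshold" value="3"/>
    <!-- clear the fail count after 10 minutes so the node becomes eligible again -->
    <nvpair id="apache-clone-meta_attributes-failure-timeout" name="failure-timeout" value="600s"/>
  </meta_attributes>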