On Mon, Aug 9, 2021 at 3:07 PM Andreas Janning <andreas.jann...@qaware.de> wrote:
>
> Hi,
>
> I have just tried your suggestion by adding
>   <nvpair id="apache-clone-meta_attributes-interleave" name="interleave" value="true"/>
> to the clone configuration.
> Unfortunately, the behavior stays the same. The service is still restarted
> on the passive node when crashing it on the active node.
>
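For anyone following along, that nvpair belongs inside the clone's
meta_attributes element. A minimal sketch of the resulting clone definition,
reusing the ids from the configuration quoted further down (the apache
primitive is omitted here for brevity):

  <clone id="apache-clone">
    <!-- apache primitive unchanged -->
    <meta_attributes id="apache-meta_attributes">
      <nvpair id="apache-clone-meta_attributes-clone-max" name="clone-max" value="2"/>
      <nvpair id="apache-clone-meta_attributes-clone-node-max" name="clone-node-max" value="1"/>
      <!-- interleave changes how ordering/colocation constraints between two
           clones are applied per node; it does not by itself change how a
           failed instance is recovered, which is consistent with the
           behavior reported above -->
      <nvpair id="apache-clone-meta_attributes-interleave" name="interleave" value="true"/>
    </meta_attributes>
  </clone>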
What is "service"? Is it the resource with id=apache-clone in your configuration? Logs from DC around time of crash would certainly be useful here. > Regards > > Andreas > > Am Mo., 9. Aug. 2021 um 13:45 Uhr schrieb Vladislav Bogdanov > <bub...@hoster-ok.com>: >> >> Hi. >> I'd suggest to set your clone meta attribute 'interleaved' to 'true' >> >> Best, >> Vladislav >> >> On August 9, 2021 1:43:16 PM Andreas Janning <andreas.jann...@qaware.de> >> wrote: >>> >>> Hi all, >>> >>> we recently experienced an outage in our pacemaker cluster and I would like >>> to understand how we can configure the cluster to avoid this problem in the >>> future. >>> >>> First our basic setup: >>> - CentOS7 >>> - Pacemaker 1.1.23 >>> - Corosync 2.4.5 >>> - Resource-Agents 4.1.1 >>> >>> Our cluster is composed of multiple active/passive nodes. Each software >>> component runs on two nodes simultaneously and all traffic is routed to the >>> active node via Virtual IP. >>> If the active node fails, the passive node grabs the Virtual IP and >>> immediately takes over all work of the failed node. Since the software is >>> already up and running on the passive node, there should be virtually no >>> downtime. >>> We have tried achieved this in pacemaker by configuring clone-sets for each >>> software component. >>> >>> Now the problem: >>> When a software component fails on the active node, the Virtual-IP is >>> correctly grabbed by the passive node. BUT the software component is also >>> immediately restarted on the passive Node. >>> That unfortunately defeats the purpose of the whole setup, since we now >>> have a downtime until the software component is restarted on the passive >>> node and the restart might even fail and lead to a complete outage. >>> After some investigating I now understand that the cloned resource is >>> restarted on all nodes after a monitoring failure because the default >>> "on-fail" of "monitor" is restart. But that is not what I want. 
>>>
>>> I have created a minimal setup that reproduces the problem:
>>>
>>>> <configuration>
>>>>   <crm_config>
>>>>     <cluster_property_set id="cib-bootstrap-options">
>>>>       <nvpair id="cib-bootstrap-options-have-watchdog" name="have-watchdog" value="false"/>
>>>>       <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.23-1.el7_9.1-9acf116022"/>
>>>>       <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
>>>>       <nvpair id="cib-bootstrap-options-cluster-name" name="cluster-name" value="pacemaker-test"/>
>>>>       <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
>>>>       <nvpair id="cib-bootstrap-options-symmetric-cluster" name="symmetric-cluster" value="false"/>
>>>>     </cluster_property_set>
>>>>   </crm_config>
>>>>   <nodes>
>>>>     <node id="1" uname="active-node"/>
>>>>     <node id="2" uname="passive-node"/>
>>>>   </nodes>
>>>>   <resources>
>>>>     <primitive class="ocf" id="vip" provider="heartbeat" type="IPaddr2">
>>>>       <instance_attributes id="vip-instance_attributes">
>>>>         <nvpair id="vip-instance_attributes-ip" name="ip" value="{{infrastructure.virtual_ip}}"/>
>>>>       </instance_attributes>
>>>>       <operations>
>>>>         <op id="psa-vip-monitor-interval-10s" interval="10s" name="monitor" timeout="20s"/>
>>>>         <op id="psa-vip-start-interval-0s" interval="0s" name="start" timeout="20s"/>
>>>>         <op id="psa-vip-stop-interval-0s" interval="0s" name="stop" timeout="20s"/>
>>>>       </operations>
>>>>     </primitive>
>>>>     <clone id="apache-clone">
>>>>       <primitive class="ocf" id="apache" provider="heartbeat" type="apache">
>>>>         <instance_attributes id="apache-instance_attributes">
>>>>           <nvpair id="apache-instance_attributes-port" name="port" value="80"/>
>>>>           <nvpair id="apache-instance_attributes-statusurl" name="statusurl" value="http://localhost/server-status"/>
>>>>         </instance_attributes>
>>>>         <operations>
>>>>           <op id="apache-monitor-interval-10s" interval="10s" name="monitor" timeout="20s"/>
>>>>           <op id="apache-start-interval-0s" interval="0s" name="start" timeout="40s"/>
>>>>           <op id="apache-stop-interval-0s" interval="0s" name="stop" timeout="60s"/>
>>>>         </operations>
>>>>       </primitive>
>>>>       <meta_attributes id="apache-meta_attributes">
>>>>         <nvpair id="apache-clone-meta_attributes-clone-max" name="clone-max" value="2"/>
>>>>         <nvpair id="apache-clone-meta_attributes-clone-node-max" name="clone-node-max" value="1"/>
>>>>       </meta_attributes>
>>>>     </clone>
>>>>   </resources>
>>>>   <constraints>
>>>>     <rsc_location id="location-apache-clone-active-node-100" node="active-node" rsc="apache-clone" score="100" resource-discovery="exclusive"/>
>>>>     <rsc_location id="location-apache-clone-passive-node-0" node="passive-node" rsc="apache-clone" score="0" resource-discovery="exclusive"/>
>>>>     <rsc_location id="location-vip-clone-active-node-100" node="active-node" rsc="vip" score="100" resource-discovery="exclusive"/>
>>>>     <rsc_location id="location-vip-clone-passive-node-0" node="passive-node" rsc="vip" score="0" resource-discovery="exclusive"/>
>>>>     <rsc_colocation id="colocation-vip-apache-clone-INFINITY" rsc="vip" score="INFINITY" with-rsc="apache-clone"/>
>>>>   </constraints>
>>>>   <rsc_defaults>
>>>>     <meta_attributes id="rsc_defaults-options">
>>>>       <nvpair id="rsc_defaults-options-resource-stickiness" name="resource-stickiness" value="50"/>
>>>>     </meta_attributes>
>>>>   </rsc_defaults>
>>>> </configuration>
>>>
>>> When this configuration is started, httpd will be running on active-node
>>> and passive-node. The VIP runs only on active-node.
>>> When httpd is crashed on active-node (with killall httpd), passive-node
>>> immediately grabs the VIP and restarts its own httpd.
>>>
>>> How can I change this configuration so that when the resource fails on
>>> active-node:
>>> - passive-node immediately grabs the VIP (as it does now).
>>> - active-node tries to restart the failed resource, giving up after x
>>>   attempts (one way to express such a limit is sketched at the end of
>>>   this thread).
>>> - passive-node does NOT restart the resource.
>>>
>>> Regards
>>>
>>> Andreas Janning
>>>
>>> --
>>> Andreas Janning
>>> Expert Software Engineer
>>>
>>> QAware GmbH
>>> Aschauer Straße 32
>>> 81549 München, Germany
>>> Mobile +49 160 1492426
>>> andreas.jann...@qaware.de
>>> www.qaware.de
>>>
>>> Managing directors: Christian Kamm, Johannes Weigend, Dr. Josef Adersberger
>>> Register court: München
>>> Commercial register number: HRB 163761

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
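On the "giving up after x attempts" part of the question: a per-node retry
limit is normally expressed with the migration-threshold resource meta
attribute, optionally combined with failure-timeout to let the fail count
expire. A minimal sketch, extending the apache-clone meta_attributes from the
configuration quoted above (the values 3 and 600s are placeholders, and this
only caps how often a node retries; on its own it does not stop the passive
node's instance from being restarted):

  <meta_attributes id="apache-meta_attributes">
    <nvpair id="apache-clone-meta_attributes-clone-max" name="clone-max" value="2"/>
    <nvpair id="apache-clone-meta_attributes-clone-node-max" name="clone-node-max" value="1"/>
    <!-- after 3 failures, the node is no longer allowed to run this instance -->
    <nvpair id="apache-clone-meta_attributes-migration-threshold" name="migration-threshold" value="3"/>
    <!-- clear the fail count after 10 minutes so the node becomes eligible again -->
    <nvpair id="apache-clone-meta_attributes-failure-timeout" name="failure-timeout" value="600s"/>
  </meta_attributes>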