[ClusterLabs] host in standby causes havoc
Hello, We had a strange issue here: a 7-node cluster, with one node put into standby mode to test a new iscsi setting on it. While the machine was being configured it was rebooted, and after the reboot the iscsi didn't come up. That led to confusing messages from the cluster (atlas5 is the node in standby):

Jun 15 10:10:13 atlas0 pacemaker-schedulerd[7153]: warning: Unexpected result (error) was recorded for probe of ocsi on atlas5 at Jun 15 10:09:32 2023
Jun 15 10:10:13 atlas0 pacemaker-schedulerd[7153]: notice: If it is not possible for ocsi to run on atlas5, see the resource-discovery option for location constraints
Jun 15 10:10:13 atlas0 pacemaker-schedulerd[7153]: error: Resource ocsi is active on 2 nodes (attempting recovery)

The resource was definitely not active on 2 nodes, yet this caused a storm of killing all virtual machine resources. How could one prevent such cases from coming up? Best regards, Jozsef -- E-mail : kadlecsik.joz...@wigner.hu PGP key: https://wigner.hu/~kadlec/pgp_public_key.txt Address: Wigner Research Centre for Physics H-1525 Budapest 114, POB. 49, Hungary ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
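The scheduler's second log line above already hints at one possible guard: probes (resource discovery) can be switched off for nodes that should never run a resource, so a failed probe on a broken standby node cannot be misread as the resource being active there. A hedged crmsh sketch, with names taken from this thread and the exact syntax to be checked against your crmsh version:

```
# Never probe (or run) ocsi on atlas5; a broken iscsi stack on atlas5
# then cannot produce the "active on 2 nodes" misdiagnosis.
location ocsi-avoid-atlas5 ocsi \
    resource-discovery=never -inf: atlas5
```

(`resource-discovery=never` is the location-constraint option the scheduler message refers to.)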
Re: [ClusterLabs] node utilization attributes are lost during upgrade
Hi, On Mon, 17 Aug 2020, Ken Gaillot wrote: > On Mon, 2020-08-17 at 12:12 +0200, Kadlecsik József wrote: > > > > At upgrading a corosync/pacemaker/libvirt/KVM cluster from Debian > > stretch to buster, all the node utilization attributes were erased > > from the configuration. However, the same attributes were kept at the > > VirtualDomain resources. This resulted in all resources with > > utilization attributes being stopped. > > Ouch :( > > There are two types of node attributes, transient and permanent. > Transient attributes last only until pacemaker is next stopped on the > node, while permanent attributes persist between reboots/restarts. > > If you configured the utilization attributes with crm_attribute -z/ > --utilization, it will default to permanent, but it's possible to > override that with -l/--lifetime reboot (or equivalently, -t/--type > status). The attributes were defined by "crm configure edit", simply stating:

node 1084762113: atlas0 \
    utilization hv_memory=192 cpu=32 \
    attributes standby=off
...
node 1084762119: atlas6 \
    utilization hv_memory=192 cpu=32 \

But I believe now that corosync caused the problem, because the nodes had been renumbered:

node 3232245761: atlas0
...
node 3232245767: atlas6

The upgrade process was, for each node:

- set the "hold" mark on the corosync package
- put the node in standby
- wait for the resources to be migrated off
- upgrade from stretch to buster
- reboot
- put the node online
- wait for the resources to be migrated (back)

Up to this point all resources were running fine. In order to upgrade corosync, we followed the next steps:

- enable maintenance mode
- stop pacemaker and corosync on all nodes
- on each node: delete the hold mark and upgrade corosync, install the new config file (nodeid not specified), then restart corosync and start pacemaker

We could see that all resources were running unmanaged. When disabling the maintenance mode, those were stopped.
So I think corosync renumbered the nodes and I suspect the reason for that was that "clear_node_high_bit: yes" was not specified in the new config file. It means it was an admin error then. Best regards, Jozsef
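The renumbering is consistent with that: the old and new nodeids differ exactly in the top bit. Corosync can auto-derive a 32-bit nodeid from a node's ring0 IPv4 address, and "clear_node_high_bit: yes" clears bit 31 of that value. A quick check against the ids in this thread (the IPv4 address below is inferred from the id, not confirmed anywhere in the thread):

```python
import ipaddress

new_id = 3232245761  # atlas0 after the upgrade (nodeid auto-derived)
old_id = 1084762113  # atlas0 before the upgrade

# the new nodeid equals a 32-bit IPv4 address (inferred: 192.168.40.1)
assert new_id == int(ipaddress.ip_address("192.168.40.1"))
# clearing bit 31 of it reproduces the old numbering exactly
assert old_id == new_id & 0x7FFFFFFF
```

So pinning explicit nodeid values in the nodelist (or keeping clear_node_high_bit) would have kept the numbering stable.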
[ClusterLabs] node utilization attributes are lost during upgrade
Hello, At upgrading a corosync/pacemaker/libvirt/KVM cluster from Debian stretch to buster, all the node utilization attributes were erased from the configuration. However, the same attributes were kept at the VirtualDomain resources. This resulted in all resources with utilization attributes being stopped. The documentation says: "You can name utilization attributes according to your preferences and define as many name/value pairs as your configuration needs.", so one assumes utilization attributes are kept during upgrades, for nodes and resources alike. The corosync incompatibility made the upgrade stressful enough anyway, and the stopping of the resources came out of the blue. The resources could not be started, of course - and there were no warning/error log messages saying that the resources were not started because the utilization constraints could not be satisfied. Pacemaker logs a lot (from an admin's point of view, too much), but in this case there was no indication why the resources could not be started (or we were unable to find it in the logs?). So we wasted a lot of time debugging the VirtualDomain agent. Currently we run the cluster with the placement-strategy set to default. In my opinion node attributes should be kept and preserved during an upgrade. Also, it should be logged when a resource must be stopped/cannot be started because the utilization constraints cannot be satisfied. Best regards, Jozsef
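For reference, a hedged sketch of re-adding the lost node utilization attributes (node name and values taken from this thread; verify the commands against your crmsh/pacemaker versions):

```
# crm shell: utilization lives in the node section
node atlas0 \
    utilization hv_memory=192 cpu=32

# low-level equivalent; -z/--utilization attributes default to the
# permanent lifetime, so they should survive reboots
crm_attribute --node atlas0 -z --name hv_memory --update 192

# utilization is only enforced once placement-strategy is non-default:
property placement-strategy=utilization
```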
Re: [ClusterLabs] Howto stonith in the case of any interface failure?
On Wed, 9 Oct 2019, Digimer wrote: > > One of the nodes has got a failure ("watchdog: BUG: soft lockup - > > CPU#7 stuck for 23s"), which resulted that the node could process > > traffic on the backend interface but not on the frontend one. Thus the > > services became unavailable but the cluster thought the node is all > > right and did not stonith it. > > > > How could we protect the cluster against such failures? > > > We use mode=1 (active-passive) bonded network interfaces for each > network connection (we also have a back-end, front-end and a storage > network). Each bond has a link going to one switch and the other link to > a second switch. For fence devices, we use IPMI fencing connected via > switch 1 and PDU fencing as the backup method connected on switch 2. > > With this setup, no matter what might fail, one of the fence methods > will still be available. It's saved us in the field a few times now. A bonded interface helps, but I suspect that in this case it could not have saved the situation. It was not an interface failure but a strange kind of system lockup: some of the already running processes were fine (corosync), but sshd, for example, could not accept new connections from the direction of the seemingly fine backbone interface either. In the backend direction we have bonded (LACP) interfaces - the frontend uses single interfaces only. Best regards, Jozsef
Re: [ClusterLabs] Howto stonith in the case of any interface failure?
On Wed, 9 Oct 2019, Ken Gaillot wrote: > > One of the nodes has got a failure ("watchdog: BUG: soft lockup - > > CPU#7 stuck for 23s"), which resulted that the node could process > > traffic on the backend interface but not on the frontend one. Thus the > > services became unavailable but the cluster thought the node is all > > right and did not stonith it. > > > > How could we protect the cluster against such failures? > > See the ocf:heartbeat:ethmonitor agent (to monitor the interface itself) > and/or the ocf:pacemaker:ping agent (to monitor reachability of some IP > such as a gateway) This looks really promising, thank you! Does the cluster regard it as a failure when an ocf:heartbeat:ethmonitor agent clone on a node does not run? :-) Best regards, Jozsef
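For what it's worth, a hedged crmsh sketch of the ping-based variant Ken mentions (all resource names and the address are invented; the parameters and the pingd node attribute are the documented ones of ocf:pacemaker:ping):

```
primitive fe-ping ocf:pacemaker:ping \
    params host_list="192.0.2.1" multiplier=1000 dampen=5s \
    op monitor interval=15s timeout=60s
clone fe-ping-clone fe-ping
# keep a guest off any node that cannot reach the frontend gateway
location vm-needs-frontend some-vm \
    rule -inf: not_defined pingd or pingd lte 0
```

With this, a node whose frontend path dies gets its guests moved away even though corosync membership on the backend is still fine.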
Re: [ClusterLabs] Howto stonith in the case of any interface failure?
Hi, On Wed, 9 Oct 2019, Jan Pokorný wrote: > On 09/10/19 09:58 +0200, Kadlecsik József wrote: > > The nodes in our cluster have got backend and frontend interfaces: the > > former ones are for the storage and cluster (corosync) traffic and the > > latter ones are for the public services of KVM guests only. > > > > One of the nodes has got a failure ("watchdog: BUG: soft lockup - CPU#7 > > stuck for 23s"), which resulted that the node could process traffic on the > > backend interface but not on the frontend one. Thus the services became > > unavailable but the cluster thought the node is all right and did not > > stonith it. > > > Which is the best way to solve the problem? > > Looks like heuristics of corosync-qdevice that would ping/attest your > frontend interface could be a way to go. You'd need an additional > host in your setup, though. As far as I see, corosync-qdevice can add/increase the votes for a node but cannot decrease them. I hope I'm wrong, I wouldn't mind adding an additional host :-) Best regards, Jozsef
[ClusterLabs] Howto stonith in the case of any interface failure?
Hello, The nodes in our cluster have got backend and frontend interfaces: the former ones are for the storage and cluster (corosync) traffic and the latter ones are for the public services of KVM guests only. One of the nodes has got a failure ("watchdog: BUG: soft lockup - CPU#7 stuck for 23s"), which resulted in the node being able to process traffic on the backend interface but not on the frontend one. Thus the services became unavailable, but the cluster thought the node was all right and did not stonith it. How could we protect the cluster against such failures? We could configure a second corosync ring, but that would be a redundancy ring only. We could set up a second, independent corosync configuration for a second pacemaker instance with stonith agents only. But how would the pairing work: is it enough to specify the cluster name in the corosync config, and how can we tell this pacemaker instance to connect to this corosync instance? Which is the best way to solve the problem? Best regards, Jozsef
Re: [ClusterLabs] Antw: Re: Antw: Re: Constant stop/start of resource in spite of interval=0
Hi, On Tue, 21 May 2019, Ulrich Windl wrote: > So maybe the original defective RA would be valuable for debugging the > issue. I guess the RA was invalid in some way that wasn't detected or > handled properly... With the attached skeleton RA and the setting primitive testid-testid0 ocf:local:testid \ params name=testid0 id=0 foo=foo0 \ op monitor timeout=30s interval=30s \ meta target-role=Started I can reproduce it easily. Maybe it's required that the RA and the instance be reloadable. Best regards, Jozsef > >>> Andrei Borzenkov schrieb am 21.05.2019 um 09:13 in > Nachricht : > > 21.05.2019 0:46, Ken Gaillot пишет: > >>> > From what's described here, the op-restart-digest is changing every > time, which means something is going wrong in the hash comparison > (since the definition is not really changing). > > The log that stands out to me is: > > trace May 18 23:02:49 calculate_xml_digest_v1(83):0: > digest:source > > The id is the resource name, which isn't "0". That leads me to: > > trace May 18 23:02:49 svc_read_output(87):0: Got 499 chars: > > > which is the likely source of the problem. "id" is a pacemaker > property, > not an OCF resource parameter. It shouldn't be in the resource > agent > meta-data. Remove that, and I bet it will be OK. > >>> > >>> I renamed the parameter to "tunnel_id", redefined the resources and > >>> started them again. > >>> > BTW the "every 15 minutes" would be the cluster-recheck-interval > cluster property. > >>> > >>> I have waited more than half an hour and there are no more > >>> stopping/starting of the resources. :-) I haven't thought that "id" > >>> is > >>> reserved as parameter name. > >> > >> It isn't, by the OCF standard. :) This could be considered a pacemaker > >> bug; pacemaker should be able to distinguish its own "id" from an OCF > >> parameter "id", but it currently can't. > >> > > > > > > I'm really baffled by this explanation. 
I tried to create resource with > > "id" unique instance property and I do not observe this problem. No > > restarts. > > > > As none of traces provided captures of the moment of restart-digest > > mismatch I also am not sure where to look. I do not see "id" being > > treated anyway specially in the code. > > > > Somewhat interesting is that restart digest source in two traces is > > different: > > > > bor@bor-Latitude-E5450:~$ grep -w 'restart digest' /tmp/trace.log* > > /tmp/trace.log:trace May 18 23:02:49 append_restart_list(694):0: > > restart digest source > > /tmp/trace.log:trace May 18 23:02:50 append_restart_list(694):0: > > restart digest source > > /tmp/trace.log.2:trace May 20 13:56:16 append_restart_list(694):0: > > restart digest source > > /tmp/trace.log.2:trace May 20 13:56:17 append_restart_list(694):0: > > restart digest source > > /tmp/trace.log.2:trace May 20 13:56:18 append_restart_list(694):0: > > restart digest sourceid="1"/> > > bor@bor-Latitude-E5450:~$ > > > > In one case it does not include "name" parameter. Whether configuration > > was changed in between is unknown, we never have seen full RA metadata > > in each case nor full resource definition so ... > > > > My hunch is that "id" is red herring and something else has changed when > > resource definition was edited. If I'm wrong I appreciate pointer to > > code where "id" is mishandled.

#!/bin/sh
#
# Tunnel OCF RA. Enables and disables testids, with iptables rules
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of version 2 of the GNU General Public License as
# published by the Free Software Foundation.
#
# This program is distributed in the hope that it would be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# Further, this software is distributed without any warranty that it is
# free of the rightful claim of any third person regarding infringement
# or the like. Any license provided herein, whether implied or
# otherwise, applies only to this software file. Patent licenses, if
# any, provided herein do not apply to combinations of this program with
# other software, or any other product whatsoever.
#
# You should have received a
Re: [ClusterLabs] Antw: Re: Constant stop/start of resource in spite of interval=0
Hi, On Mon, 20 May 2019, Ken Gaillot wrote: > On Mon, 2019-05-20 at 15:29 +0200, Ulrich Windl wrote: > > What worries me is "Rejecting name for unique". > > Trace messages are often not user-friendly. The rejecting/accepting is > nothing to be concerned about; it just refers to which parameters are > being used to calculate that particular hash. > > Pacemaker calculates up to three hashes. > > The first is a hash of all the resource parameters, to detect if > anything changed; this is stored as "op-digest" in the CIB status > entries. > > If the resource is reloadable, another hash is calculated with just the > parameters marked as unique=1 (which means they can't be reloaded). Any > change in these parameters requires a full restart. This one is "op- > restart-digest". > > Finally, if the resource has sensitive parameters like passwords, a > hash of everything but those parameters is stored as "op-secure- > digest". This one is only used when simulating CIBs grabbed from > cluster reports, which have sensitive info scrubbed. Thanks for the explanation! It seemed very cryptic in the trace messages that different hashes were calculated with different parameter lists. > From what's described here, the op-restart-digest is changing every > time, which means something is going wrong in the hash comparison > (since the definition is not really changing). > > The log that stands out to me is: > > trace May 18 23:02:49 calculate_xml_digest_v1(83):0: digest:source > > > The id is the resource name, which isn't "0". That leads me to: > > trace May 18 23:02:49 svc_read_output(87):0: Got 499 chars: name="id" unique="1" required="1"> > > which is the likely source of the problem. "id" is a pacemaker property, > not an OCF resource parameter. It shouldn't be in the resource agent > meta-data. Remove that, and I bet it will be OK. I renamed the parameter to "tunnel_id", redefined the resources and started them again. 
> BTW the "every 15 minutes" would be the cluster-recheck-interval > cluster property. I have waited more than half an hour and there are no more stopping/starting of the resources. :-) I haven't thought that "id" is reserved as parameter name. Thank you! Best regards, Jozsef -- E-mail : kadlecsik.joz...@wigner.mta.hu PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt Address: Wigner Research Centre for Physics, Hungarian Academy of Sciences H-1525 Budapest 114, POB. 49, Hungary ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] Constant stop/start of resource in spite of interval=0
On Sat, 18 May 2019, Kadlecsik József wrote: > On Sat, 18 May 2019, Andrei Borzenkov wrote: > > > 18.05.2019 18:34, Kadlecsik József пишет: > > > > We have a resource agent which creates IP tunnels. In spite of the > > > configuration setting > > > > > > primitive tunnel-eduroam ocf:local:tunnel \ > > > params > > > op start timeout=120s interval=0 \ > > > op stop timeout=300s interval=0 \ > > > op monitor timeout=30s interval=30s depth=0 \ > > > meta target-role=Started > > > order bifur-eduroam-ipv4-before-tunnel-eduroam \ > > > Mandatory: bifur-eduroam-ipv4 tunnel-eduroam > > > colocation tunnel-eduroam-on-bifur-eduroam-ipv4 inf: tunnel-eduroam \ > > > bifur-eduroam-ipv4:Started > > > > > > the resource is restarted again and again. According to the debug logs: > > > > > > May 16 14:20:35 [3052] bifur1 lrmd:debug: > > > recurring_action_timer: > > > Scheduling another invocation of tunnel-eduroam_monitor_3 > > > May 16 14:20:35 [3052] bifur1 lrmd:debug: operation_finished: > > > tunnel-eduroam_monitor_3:62066 - exited with rc=0 > > > May 16 14:20:35 [3052] bifur1 lrmd:debug: operation_finished: > > > tunnel-eduroam_monitor_3:62066:stderr [ -- empty -- ] > > > May 16 14:20:35 [3052] bifur1 lrmd:debug: operation_finished: > > > tunnel-eduroam_monitor_3:62066:stdout [ -- empty -- ] > > > May 16 14:20:35 [3052] bifur1 lrmd:debug: log_finished: > > > finished - rsc:tunnel-eduroam action:monitor call_id:1045 pid:62066 > > > exit-code:0 exec-time:0ms queue-time:0ms > > > May 16 14:21:04 [3054] bifur1pengine: info: native_print: > > > tunnel-eduroam (ocf::local:tunnel):Started bifur1 > > > May 16 14:21:04 [3054] bifur1pengine: info: > > > check_action_definition: > > > Parameters to tunnel-eduroam_start_0 on bifur1 changed: was > > > 94afff0ff7cfc62f7cb1d5bf5b4d83aa vs. 
now f2317cad3d54cec5d7d7aa7d0bf35cf8 > > > (restart:3.0.11) 0:0;48:3:0:73562fd6-1fe2-4930-8c6e-5953b82ebb32 > > > > This means that instance attributes changed in this case pacemaker > > restarts resource to apply new values. Turning on trace level hopefully > > will show what exactly is being changed. You can also dump CIB before > > and after restart to compare current information. > > The strange thing is that the new value seems never be stored. Just the > "was-now" part from the log lines: > > was 94afff0ff7cfc62f7cb1d5bf5b4d83aa vs. now f2317cad3d54cec5d7d7aa7d0bf35cf8 > was 94afff0ff7cfc62f7cb1d5bf5b4d83aa vs. now f2317cad3d54cec5d7d7aa7d0bf35cf8 > was 94afff0ff7cfc62f7cb1d5bf5b4d83aa vs. now f2317cad3d54cec5d7d7aa7d0bf35cf8 > ... > > However, after issuing "cibadmin --query --local", the whole flipping > stopped! :-) Thanks! No, I was wrong - it still repeats every ~15mins. The diff between two cib xml dumps doesn't say much to me - I'm going to enable tracing. Best regards, Jozsef
Re: [ClusterLabs] Constant stop/start of resource in spite of interval=0
On Sat, 18 May 2019, Andrei Borzenkov wrote: > 18.05.2019 18:34, Kadlecsik József пишет: > > We have a resource agent which creates IP tunnels. In spite of the > > configuration setting > > > > primitive tunnel-eduroam ocf:local:tunnel \ > > params > > op start timeout=120s interval=0 \ > > op stop timeout=300s interval=0 \ > > op monitor timeout=30s interval=30s depth=0 \ > > meta target-role=Started > > order bifur-eduroam-ipv4-before-tunnel-eduroam \ > > Mandatory: bifur-eduroam-ipv4 tunnel-eduroam > > colocation tunnel-eduroam-on-bifur-eduroam-ipv4 inf: tunnel-eduroam \ > > bifur-eduroam-ipv4:Started > > > > the resource is restarted again and again. According to the debug logs: > > > > May 16 14:20:35 [3052] bifur1 lrmd:debug: recurring_action_timer: > > Scheduling another invocation of tunnel-eduroam_monitor_3 > > May 16 14:20:35 [3052] bifur1 lrmd:debug: operation_finished: > > tunnel-eduroam_monitor_3:62066 - exited with rc=0 > > May 16 14:20:35 [3052] bifur1 lrmd:debug: operation_finished: > > tunnel-eduroam_monitor_3:62066:stderr [ -- empty -- ] > > May 16 14:20:35 [3052] bifur1 lrmd:debug: operation_finished: > > tunnel-eduroam_monitor_3:62066:stdout [ -- empty -- ] > > May 16 14:20:35 [3052] bifur1 lrmd:debug: log_finished: > > finished - rsc:tunnel-eduroam action:monitor call_id:1045 pid:62066 > > exit-code:0 exec-time:0ms queue-time:0ms > > May 16 14:21:04 [3054] bifur1pengine: info: native_print: > > tunnel-eduroam (ocf::local:tunnel):Started bifur1 > > May 16 14:21:04 [3054] bifur1pengine: info: > > check_action_definition: > > Parameters to tunnel-eduroam_start_0 on bifur1 changed: was > > 94afff0ff7cfc62f7cb1d5bf5b4d83aa vs. now f2317cad3d54cec5d7d7aa7d0bf35cf8 > > (restart:3.0.11) 0:0;48:3:0:73562fd6-1fe2-4930-8c6e-5953b82ebb32 > > This means that instance attributes changed in this case pacemaker > restarts resource to apply new values. Turning on trace level hopefully > will show what exactly is being changed. 
You can also dump CIB before > and after restart to compare current information. The strange thing is that the new value seems never to be stored. Just the "was-now" part from the log lines:

was 94afff0ff7cfc62f7cb1d5bf5b4d83aa vs. now f2317cad3d54cec5d7d7aa7d0bf35cf8
was 94afff0ff7cfc62f7cb1d5bf5b4d83aa vs. now f2317cad3d54cec5d7d7aa7d0bf35cf8
was 94afff0ff7cfc62f7cb1d5bf5b4d83aa vs. now f2317cad3d54cec5d7d7aa7d0bf35cf8
...

However, after issuing "cibadmin --query --local", the whole flipping stopped! :-) Thanks! Best regards, Jozsef
[ClusterLabs] Constant stop/start of resource in spite of interval=0
Hello, We have a resource agent which creates IP tunnels. In spite of the configuration setting

primitive tunnel-eduroam ocf:local:tunnel \
    params
    op start timeout=120s interval=0 \
    op stop timeout=300s interval=0 \
    op monitor timeout=30s interval=30s depth=0 \
    meta target-role=Started
order bifur-eduroam-ipv4-before-tunnel-eduroam \
    Mandatory: bifur-eduroam-ipv4 tunnel-eduroam
colocation tunnel-eduroam-on-bifur-eduroam-ipv4 inf: tunnel-eduroam \
    bifur-eduroam-ipv4:Started

the resource is restarted again and again. According to the debug logs:

May 16 14:20:35 [3052] bifur1 lrmd: debug: recurring_action_timer: Scheduling another invocation of tunnel-eduroam_monitor_3
May 16 14:20:35 [3052] bifur1 lrmd: debug: operation_finished: tunnel-eduroam_monitor_3:62066 - exited with rc=0
May 16 14:20:35 [3052] bifur1 lrmd: debug: operation_finished: tunnel-eduroam_monitor_3:62066:stderr [ -- empty -- ]
May 16 14:20:35 [3052] bifur1 lrmd: debug: operation_finished: tunnel-eduroam_monitor_3:62066:stdout [ -- empty -- ]
May 16 14:20:35 [3052] bifur1 lrmd: debug: log_finished: finished - rsc:tunnel-eduroam action:monitor call_id:1045 pid:62066 exit-code:0 exec-time:0ms queue-time:0ms
May 16 14:21:04 [3054] bifur1 pengine: info: native_print: tunnel-eduroam (ocf::local:tunnel): Started bifur1
May 16 14:21:04 [3054] bifur1 pengine: info: check_action_definition: Parameters to tunnel-eduroam_start_0 on bifur1 changed: was 94afff0ff7cfc62f7cb1d5bf5b4d83aa vs. now f2317cad3d54cec5d7d7aa7d0bf35cf8 (restart:3.0.11) 0:0;48:3:0:73562fd6-1fe2-4930-8c6e-5953b82ebb32
May 16 14:21:04 [3054] bifur1 pengine: debug: native_assign_node: Assigning bifur1 to tunnel-eduroam
May 16 14:21:04 [3054] bifur1 pengine: info: RecurringOp: Start recurring monitor (30s) for tunnel-eduroam on bifur1
May 16 14:21:04 [3054] bifur1 pengine: notice: LogActions: Restart tunnel-eduroam (Started bifur1)
May 16 14:21:04 [3055] bifur1 crmd: notice: te_rsc_command: Initiating stop operation tunnel-eduroam_stop_0 locally on bifur1 | action 50
May 16 14:21:04 [3055] bifur1 crmd: debug: stop_recurring_action_by_rsc: Cancelling op 1045 for tunnel-eduroam (tunnel-eduroam:1045)
May 16 14:21:04 [3055] bifur1 crmd: debug: cancel_op: Cancelling op 1045 for tunnel-eduroam (tunnel-eduroam:1045)
May 16 14:21:04 [3052] bifur1 lrmd: info: cancel_recurring_action: Cancelling ocf operation tunnel-eduroam_monitor_3
May 16 14:21:04 [3052] bifur1 lrmd: debug: log_finished: finished - rsc:tunnel-eduroam action:monitor call_id:1045 exit-code:0 exec-time:0ms queue-time:0ms
May 16 14:21:04 [3055] bifur1 crmd: debug: cancel_op: Op 1045 for tunnel-eduroam (tunnel-eduroam:1045): cancelled
May 16 14:21:04 [3055] bifur1 crmd: info: do_lrm_rsc_op: Performing key=50:4:0:73562fd6-1fe2-4930-8c6e-5953b82ebb32 op=tunnel-eduroam_stop_0
May 16 14:21:04 [3052] bifur1 lrmd: info: log_execute: executing - rsc:tunnel-eduroam action:stop call_id:1047
May 16 14:21:04 [3055] bifur1 crmd: info: process_lrm_event: Result of monitor operation for tunnel-eduroam on bifur1: Cancelled | call=1045 key=tunnel-eduroam_monitor_3 confirmed=true
...

From where does the restart operation come? Why does it happen? The IP address is at the same node where the tunnel resource is already running. Best regards, Jozsef -- E-mail : kadlecsik.joz...@wigner.mta.hu PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt Address: Wigner Research Centre for Physics, Hungarian Academy of Sciences H-1525 Budapest 114, POB.
49, Hungary
Re: [ClusterLabs] Antw: Rebooting a standby node triggers lots of transitions
On Wed, 5 Sep 2018, Kadlecsik József wrote: > On Wed, 5 Sep 2018, Ken Gaillot wrote: > > > > > For testing purposes one of our nodes was put in standby node and > > > > then rebooted several times. When the standby node started up, it > > > > joined the cluster as a new member and it resulted in transitions > > > > between the online nodes. However, when the standby node was > > > > rebooted in mid‑transitions, it triggered another transitions > > > > again. As a result, live migrations was aborted and guests > > > > stopped/started. > > > > > > > > How can one make sure that join/leave operations of standby nodes > > > > do not > > > > affect the location of the running resources? > > > > > > > > It's pacemaker 1.1.16‑1 with corosync 2.4.2‑3+deb9u1 on debian > > > > stretch > > > > nodes. > > > > Node joins/leaves do and should trigger new transitions, but that should > > not result in any actions if the node is in standby. > > > > The cluster will wait for any actions in progress (such as a live > > migration) to complete before beginning a new transition, so there is > > likely something else going on that is affecting the migration. > > > > > Logs and more details, please! > > > > Particularly the detail log on the DC should be helpful. It will have > > "pengine:" messages with "saving inputs" at each transition. > > I attached the log file. > > There are log lines like this > > Sep 5 12:22:30 atlas4 crmd[32776]: notice: Transition aborted by > w2-utilization-cpu doing modify cpu=1: Configuration change > > which I don't understand: in the configuration the cpu utilization is > explicitly set to cpu=2 for w2. > > Nothing changed, just the node atlas0 (in standby mode) was halted/started > several times. Still, resources were migrated, like in this case: > > Sep 5 12:22:31 atlas4 VirtualDomain(mail0)[61781]: INFO: mail0: Starting > live migration to atlas3 (using: virsh --connect=qemu:///system --quiet > migrate --live mail0 qemu+tls://atlas3/system ). 
> > And besides the successful migrations, sometimes the guest was > stopped/started instead of migration: > > Sep 5 12:25:22 atlas4 crmd[32776]: notice: Result of stop operation for > mail0 on atlas4: 0 (ok) > Sep 5 12:25:22 atlas4 crmd[32776]: notice: Initiating start operation > mail0_start_0 on atlas3 Just guessing: maybe utilization is taken into account even when a node is offline and that causes transitions? I can provide pe-input files which were recorded during the events. Best regards, Jozsef ___ Users mailing list: Users@clusterlabs.org https://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
[ClusterLabs] Rebooting a standby node triggers lots of transitions
Hi, For testing purposes one of our nodes was put into standby mode and then rebooted several times. When the standby node started up, it joined the cluster as a new member, and this resulted in transitions between the online nodes. However, when the standby node was rebooted in mid-transition, it triggered yet another transition. As a result, live migrations were aborted and guests were stopped/started. How can one make sure that join/leave operations of standby nodes do not affect the location of the running resources? It's pacemaker 1.1.16-1 with corosync 2.4.2-3+deb9u1 on debian stretch nodes. Best regards, Jozsef