[ClusterLabs] host in standby causes havoc

2023-06-15 Thread Kadlecsik József
Hello,

We had a strange issue here: in a 7-node cluster, one node was put into 
standby mode to test a new iscsi setting on it. While the machine was being 
configured it was rebooted, and after the reboot the iscsi did not come up. 
That led to confusing probe results in the cluster (atlas5 is the node in 
standby):

Jun 15 10:10:13 atlas0 pacemaker-schedulerd[7153]:  warning: Unexpected 
result (error) was recorded for probe of ocsi on atlas5 at Jun 15 10:09:32 2023
Jun 15 10:10:13 atlas0 pacemaker-schedulerd[7153]:  notice: If it is not 
possible for ocsi to run on atlas5, see the resource-discovery option for 
location constraints
Jun 15 10:10:13 atlas0 pacemaker-schedulerd[7153]:  error: Resource ocsi 
is active on 2 nodes (attempting recovery)

The resource was definitely not active on 2 nodes. Still, this triggered a 
storm of recovery actions in which all the virtual machine resources were 
killed.

How could one prevent such cases from coming up?
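
(I wonder whether disabling probing of the resource on that node would have 
helped; if I read the documentation right, that would be a location 
constraint along the lines of

location ocsi-no-probe-atlas5 ocsi resource-discovery=never -inf: atlas5

where the constraint id is of course made up.)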

Best regards,
Jozsef
--
E-mail : kadlecsik.joz...@wigner.hu
PGP key: https://wigner.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics
 H-1525 Budapest 114, POB. 49, Hungary


Re: [ClusterLabs] node utilization attributes are lost during upgrade

2020-08-18 Thread Kadlecsik József
Hi,

On Mon, 17 Aug 2020, Ken Gaillot wrote:

> On Mon, 2020-08-17 at 12:12 +0200, Kadlecsik József wrote:
> > 
> > While upgrading a corosync/pacemaker/libvirt/KVM cluster from Debian 
> > stretch to buster, all the node utilization attributes were erased 
> > from the configuration. However, the same attributes were kept on the 
> > VirtualDomain resources. This resulted in all resources with 
> > utilization attributes being stopped.
> 
> Ouch :(
> 
> There are two types of node attributes, transient and permanent. 
> Transient attributes last only until pacemaker is next stopped on the 
> node, while permanent attributes persist between reboots/restarts.
> 
> If you configured the utilization attributes with crm_attribute -z/ 
> --utilization, it will default to permanent, but it's possible to 
> override that with -l/--lifetime reboot (or equivalently, -t/--type 
> status).

The attributes were defined by "crm configure edit", simply stating:

node 1084762113: atlas0 \
utilization hv_memory=192 cpu=32 \
attributes standby=off
...
node 1084762119: atlas6 \
utilization hv_memory=192 cpu=32 \
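
(If I read the crm_attribute man page correctly, the equivalent invocations 
would have been something like

crm_attribute --node atlas0 --utilization --name hv_memory --update 192
crm_attribute --node atlas0 --utilization --name cpu --update 32

which, as you say, default to the permanent lifetime.)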

But I believe now that corosync caused the problem, because the nodes had 
been renumbered:

node 3232245761: atlas0
...
node 3232245767: atlas6

The upgrade process was:

for each node do
    set the "hold" mark on the corosync package
    put the node in standby
    wait for the resources to be migrated off
    upgrade from stretch to buster
    reboot
    put the node online
    wait for the resources to be migrated back
done
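
In shell terms, each iteration was roughly the following (atlas0 as an 
example; the sources.list edit shown is just the usual way to switch 
releases):

    apt-mark hold corosync
    crm node standby atlas0
    # wait until crm_mon -1 shows no resources on the node
    sed -i 's/stretch/buster/g' /etc/apt/sources.list
    apt update && apt full-upgrade
    reboot
    # after the reboot:
    crm node online atlas0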

Up to this point all resources were running fine.

In order to upgrade corosync, we followed these steps:

enable maintenance mode
stop pacemaker and corosync on all nodes
for each node do
    delete the hold mark and upgrade corosync
    install the new config file (nodeid not specified)
    restart corosync, start pacemaker
done
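
A rough command-level equivalent of these steps (the config file path is 
the standard Debian one, and the exact apt invocation may have differed):

    crm configure property maintenance-mode=true    # once, from any node
    systemctl stop pacemaker; systemctl stop corosync   # on every node
    apt-mark unhold corosync && apt install corosync
    cp new-corosync.conf /etc/corosync/corosync.conf
    systemctl start corosync; systemctl start pacemaker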

We could see that all resources were running unmanaged. When maintenance 
mode was disabled, they were stopped.

So I think corosync renumbered the nodes, and I suspect the reason was that 
"clear_node_high_bit: yes" was not specified in the new config file: indeed, 
3232245761 is exactly 1084762113 with the high bit set, so the old nodeids 
were the auto-generated ones with the high bit cleared. In other words, it 
was an admin error.
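
For the record, the option lives in the totem section of corosync.conf; a 
minimal sketch (cluster name assumed, the rest of the section omitted):

totem {
    version: 2
    cluster_name: atlas
    clear_node_high_bit: yes
    ...
}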

Best regards,
Jozsef
--
E-mail : kadlecsik.joz...@wigner.hu
PGP key: https://wigner.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics
 H-1525 Budapest 114, POB. 49, Hungary


[ClusterLabs] node utilization attributes are lost during upgrade

2020-08-17 Thread Kadlecsik József
Hello,

While upgrading a corosync/pacemaker/libvirt/KVM cluster from Debian stretch 
to buster, all the node utilization attributes were erased from the 
configuration. However, the same attributes were kept on the VirtualDomain 
resources. This resulted in all resources with utilization attributes being 
stopped.

The documentation says: "You can name utilization attributes according to 
your preferences and define as many name/value pairs as your configuration 
needs.", so one assumes that utilization attributes are kept during 
upgrades, for nodes and resources alike.

The corosync incompatibility made the upgrade stressful enough already, and 
the stopping of the resources came out of the blue. The resources could not 
be started, of course - and there were no warning/error messages in the logs 
saying that the resources were not started because the utilization 
constraints could not be satisfied. Pacemaker logs a lot (from an admin's 
point of view, too much), but in this case there was no indication why the 
resources could not be started (or we were unable to find it in the logs?). 
So we wasted a lot of time debugging the VirtualDomain agent.

Currently we run the cluster with the placement-strategy set to default.
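
(For reference, that is the cluster property which would be switched with 
something like "crm configure property placement-strategy=utilization" if 
one wanted utilization-based placement.)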

In my opinion, node attributes should be kept and preserved during an 
upgrade. Also, it should be logged when a resource must be stopped or cannot 
be started because the utilization constraints cannot be satisfied.

Best regards,
Jozsef
--
E-mail : kadlecsik.joz...@wigner.hu
PGP key: https://wigner.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics
 H-1525 Budapest 114, POB. 49, Hungary


Re: [ClusterLabs] Howto stonith in the case of any interface failure?

2019-10-10 Thread Kadlecsik József
On Wed, 9 Oct 2019, Digimer wrote:

> > One of the nodes had a failure ("watchdog: BUG: soft lockup - 
> > CPU#7 stuck for 23s"), as a result of which the node could process 
> > traffic on the backend interface but not on the frontend one. Thus the 
> > services became unavailable, but the cluster thought the node was all 
> > right and did not stonith it.
> > 
> > How could we protect the cluster against such failures?
> > 
> We use mode=1 (active-passive) bonded network interfaces for each 
> network connection (we also have a back-end, front-end and a storage 
> network). Each bond has a link going to one switch and the other link to 
> a second switch. For fence devices, we use IPMI fencing connected via 
> switch 1 and PDU fencing as the backup method connected on switch 2.
> 
> With this setup, no matter what might fail, one of the fence methods
> will still be available. It's saved us in the field a few times now.

A bonded interface helps, but I suspect that in this case it could not have 
saved the situation. It was not an interface failure but a strange kind of 
system lockup: some of the already running processes were fine (corosync), 
but sshd, for example, could not accept new connections even via the 
seemingly fine backbone interface.

On the backend we have bonded (LACP) interfaces; the frontend uses single 
interfaces only.

Best regards,
Jozsef
--
E-mail : kadlecsik.joz...@wigner.mta.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics
 H-1525 Budapest 114, POB. 49, Hungary


Re: [ClusterLabs] Howto stonith in the case of any interface failure?

2019-10-09 Thread Kadlecsik József
On Wed, 9 Oct 2019, Ken Gaillot wrote:

> > One of the nodes had a failure ("watchdog: BUG: soft lockup - 
> > CPU#7 stuck for 23s"), as a result of which the node could process 
> > traffic on the backend interface but not on the frontend one. Thus the 
> > services became unavailable, but the cluster thought the node was all 
> > right and did not stonith it.
> > 
> > How could we protect the cluster against such failures?
> 
> See the ocf:heartbeat:ethmonitor agent (to monitor the interface itself) 
> and/or the ocf:pacemaker:ping agent (to monitor reachability of some IP 
> such as a gateway)

This looks really promising, thank you! Does the cluster regard it as a 
failure when an ocf:heartbeat:ethmonitor clone instance does not run on a 
node? :-)
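
(Just to make sure I understand how it would be used: a rough sketch, 
assuming the agent maintains a node attribute named after the interface, 
would be something like

primitive p-ethmon-front ocf:heartbeat:ethmonitor \
    params interface=eth0 \
    op monitor interval=10s timeout=60s
clone cl-ethmon-front p-ethmon-front
location vm-needs-frontend some-vm-resource \
    rule -inf: not_defined ethmonitor-eth0 or ethmonitor-eth0 eq 0

where the interface and resource names are of course made up.)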

Best regards,
Jozsef
--
E-mail : kadlecsik.joz...@wigner.mta.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics
 H-1525 Budapest 114, POB. 49, Hungary


Re: [ClusterLabs] Howto stonith in the case of any interface failure?

2019-10-09 Thread Kadlecsik József
Hi,

On Wed, 9 Oct 2019, Jan Pokorný wrote:

> On 09/10/19 09:58 +0200, Kadlecsik József wrote:
> > The nodes in our cluster have got backend and frontend interfaces: the 
> > former ones are for the storage and cluster (corosync) traffic and the 
> > latter ones are for the public services of KVM guests only.
> > 
> > One of the nodes had a failure ("watchdog: BUG: soft lockup - CPU#7 
> > stuck for 23s"), as a result of which the node could process traffic on 
> > the backend interface but not on the frontend one. Thus the services 
> > became unavailable, but the cluster thought the node was all right and 
> > did not stonith it.
> 
> > Which is the best way to solve the problem? 
> 
> Looks like heuristics of corosync-qdevice that would ping/attest your
> frontend interface could be a way to go.  You'd need an additional
> host in your setup, though.

As far as I see, corosync-qdevice can only add/increase the votes of a node, 
it cannot decrease them. I hope I'm wrong - I wouldn't mind adding an 
additional host :-)
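
For reference, my understanding is that the heuristics would go into the 
quorum section of corosync.conf along these lines (the qnetd host and the 
pinged address are made up):

quorum {
    provider: corosync_votequorum
    device {
        model: net
        net {
            host: qnetd.example.org
            algorithm: ffsplit
        }
        heuristics {
            mode: on
            exec_ping: /bin/ping -q -c 1 192.0.2.1
        }
    }
}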

Best regards,
Jozsef
--
E-mail : kadlecsik.joz...@wigner.mta.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics
 H-1525 Budapest 114, POB. 49, Hungary

[ClusterLabs] Howto stonith in the case of any interface failure?

2019-10-09 Thread Kadlecsik József
Hello,

The nodes in our cluster have got backend and frontend interfaces: the 
former ones are for the storage and cluster (corosync) traffic and the 
latter ones are for the public services of KVM guests only.

One of the nodes had a failure ("watchdog: BUG: soft lockup - CPU#7 stuck 
for 23s"), as a result of which the node could process traffic on the 
backend interface but not on the frontend one. Thus the services became 
unavailable, but the cluster thought the node was all right and did not 
stonith it.

How could we protect the cluster against such failures?

We could configure a second corosync ring, but that would be a redundancy 
ring only.

We could set up a second, independent corosync instance and run a second 
pacemaker with stonith agents only on top of it. But how is pacemaker paired 
with a particular corosync instance - is it enough to specify the cluster 
name in the corosync config, or how else can we tell pacemaker which 
corosync instance to connect to?

Which is the best way to solve the problem? 

Best regards,
Jozsef
--
E-mail : kadlecsik.joz...@wigner.mta.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics
 H-1525 Budapest 114, POB. 49, Hungary


Re: [ClusterLabs] Antw: Re: Antw: Re: Constant stop/start of resource in spite of interval=0

2019-05-21 Thread Kadlecsik József
Hi,

On Tue, 21 May 2019, Ulrich Windl wrote:

> So maybe the original defective RA would be valuable for debugging the 
> issue. I guess the RA was invalid in some way that wasn't detected or 
> handled properly...

With the attached skeleton RA and the setting

primitive testid-testid0 ocf:local:testid \
params name=testid0 id=0 foo=foo0 \
op monitor timeout=30s interval=30s \
meta target-role=Started

I can reproduce it easily. Maybe it's required that the RA and the 
instance be reloadable.
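
(That is, the skeleton advertises a reload action in its meta-data, 
something like <action name="reload" timeout="20s" />, and has parameters 
not marked unique="1", so pacemaker computes the restart digest for it.)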

Best regards,
Jozsef

> >>> Andrei Borzenkov  wrote on 21.05.2019 at 09:13 in
> message :
> > On 21.05.2019 0:46, Ken Gaillot wrote:
> >>>  
>  From what's described here, the op-restart-digest is changing every
>  time, which means something is going wrong in the hash comparison
>  (since the definition is not really changing).
> 
>  The log that stands out to me is:
> 
>  trace   May 18 23:02:49 calculate_xml_digest_v1(83):0:
>  digest:source   
> 
>  The id is the resource name, which isn't "0". That leads me to:
> 
>  trace   May 18 23:02:49 svc_read_output(87):0: Got 499 chars:
>  
> 
>  which is the likely source of the problem. "id" is a pacemaker
>  property, 
>  not an OCF resource parameter. It shouldn't be in the resource
>  agent 
>  meta-data. Remove that, and I bet it will be OK.
> >>>
> >>> I renamed the parameter to "tunnel_id", redefined the resources and 
> >>> started them again.
> >>>  
>  BTW the "every 15 minutes" would be the cluster-recheck-interval
>  cluster property.
> >>>
> >>> I have waited more than half an hour and there is no more 
> >>> stopping/starting of the resources. :-) I hadn't thought that "id" 
> >>> is reserved as a parameter name.
> >> 
> >> It isn't, by the OCF standard. :) This could be considered a pacemaker
> >> bug; pacemaker should be able to distinguish its own "id" from an OCF
> >> parameter "id", but it currently can't.
> >> 
> > 
> > 
> > I'm really baffled by this explanation. I tried to create a resource
> > with an "id" unique instance property and I do not observe this
> > problem. No restarts.
> > 
> > As none of the provided traces captures the moment of the
> > restart-digest mismatch, I am also not sure where to look. I do not see
> > "id" being treated in any special way in the code.
> > 
> > Somewhat interesting is that restart digest source in two traces is
> > different:
> > 
> > bor@bor-Latitude-E5450:~$ grep -w 'restart digest' /tmp/trace.log*
> > /tmp/trace.log:trace   May 18 23:02:49 append_restart_list(694):0:
> > restart digest source   
> > /tmp/trace.log:trace   May 18 23:02:50 append_restart_list(694):0:
> > restart digest source   
> > /tmp/trace.log.2:trace   May 20 13:56:16 append_restart_list(694):0:
> > restart digest source   
> > /tmp/trace.log.2:trace   May 20 13:56:17 append_restart_list(694):0:
> > restart digest source   
> > /tmp/trace.log.2:trace   May 20 13:56:18 append_restart_list(694):0:
> > restart digest sourceid="1"/>
> > bor@bor-Latitude-E5450:~$
> > 
> > In one case it does not include the "name" parameter. Whether the
> > configuration was changed in between is unknown; we have never seen the
> > full RA metadata nor the full resource definition in each case, so ...
> > 
> > My hunch is that "id" is a red herring and something else changed when
> > the resource definition was edited. If I'm wrong, I would appreciate a
> > pointer to the code where "id" is mishandled.

--
E-mail : kadlecsik.joz...@wigner.mta.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics, Hungarian Academy of Sciences
 H-1525 Budapest 114, POB. 49, Hungary

#!/bin/sh
#
#   Tunnel OCF RA. Enables and disables testids, with iptables rules
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of version 2 of the GNU General Public License as
# published by the Free Software Foundation.
#
# This program is distributed in the hope that it would be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# Further, this software is distributed without any warranty that it is
# free of the rightful claim of any third person regarding infringement
# or the like.  Any license provided herein, whether implied or
# otherwise, applies only to this software file.  Patent licenses, if
# any, provided herein do not apply to combinations of this program with
# other software, or any other product whatsoever.
#
# You should have received a

Re: [ClusterLabs] Antw: Re: Constant stop/start of resource in spite of interval=0

2019-05-20 Thread Kadlecsik József
Hi,

On Mon, 20 May 2019, Ken Gaillot wrote:

> On Mon, 2019-05-20 at 15:29 +0200, Ulrich Windl wrote:
> > What worries me is "Rejecting name for unique".
> 
> Trace messages are often not user-friendly. The rejecting/accepting is 
> nothing to be concerned about; it just refers to which parameters are 
> being used to calculate that particular hash.
>
> Pacemaker calculates up to three hashes.
> 
> The first is a hash of all the resource parameters, to detect if
> anything changed; this is stored as "op-digest" in the CIB status
> entries.
> 
> If the resource is reloadable, another hash is calculated with just the
> parameters marked as unique=1 (which means they can't be reloaded). Any
> change in these parameters requires a full restart. This one is "op-
> restart-digest".
> 
> Finally, if the resource has sensitive parameters like passwords, a
> hash of everything but those parameters is stored as "op-secure-
> digest". This one is only used when simulating CIBs grabbed from
> cluster reports, which have sensitive info scrubbed.

Thanks for the explanation! It seemed very cryptic in the trace messages 
that different hashes were calculated with different parameter lists.
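
For anyone else digging into this: the stored digests can be inspected in 
the status section of the CIB, e.g. with something like

cibadmin --query | grep -E 'op-(restart-|secure-)?digest'

on one of the nodes.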
 
> From what's described here, the op-restart-digest is changing every
> time, which means something is going wrong in the hash comparison
> (since the definition is not really changing).
> 
> The log that stands out to me is:
> 
> trace   May 18 23:02:49 calculate_xml_digest_v1(83):0: digest:source   
> 
> 
> The id is the resource name, which isn't "0". That leads me to:
> 
> trace   May 18 23:02:49 svc_read_output(87):0: Got 499 chars:  name="id" unique="1" required="1">
> 
> which is the likely source of the problem. "id" is a pacemaker property, 
> not an OCF resource parameter. It shouldn't be in the resource agent 
> meta-data. Remove that, and I bet it will be OK.

I renamed the parameter to "tunnel_id", redefined the resources and 
started them again.
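
That is, the parameter block in the agent's meta-data now reads roughly

<parameter name="tunnel_id" unique="1" required="1">
<shortdesc lang="en">Tunnel id</shortdesc>
<content type="string"/>
</parameter>

(descriptions shortened here).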
 
> BTW the "every 15 minutes" would be the cluster-recheck-interval
> cluster property.

I have waited more than half an hour and there is no more 
stopping/starting of the resources. :-) I hadn't thought that "id" was 
reserved as a parameter name.

Thank you!

Best regards,
Jozsef
--
E-mail : kadlecsik.joz...@wigner.mta.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics, Hungarian Academy of Sciences
 H-1525 Budapest 114, POB. 49, Hungary


Re: [ClusterLabs] Constant stop/start of resource in spite of interval=0

2019-05-18 Thread Kadlecsik József
On Sat, 18 May 2019, Kadlecsik József wrote:

> On Sat, 18 May 2019, Andrei Borzenkov wrote:
> 
> > On 18.05.2019 18:34, Kadlecsik József wrote:
> 
> > > We have a resource agent which creates IP tunnels. In spite of the 
> > > configuration setting
> > > 
> > > primitive tunnel-eduroam ocf:local:tunnel \
> > > params 
> > > op start timeout=120s interval=0 \
> > > op stop timeout=300s interval=0 \
> > > op monitor timeout=30s interval=30s depth=0 \
> > > meta target-role=Started
> > > order bifur-eduroam-ipv4-before-tunnel-eduroam \
> > >   Mandatory: bifur-eduroam-ipv4 tunnel-eduroam
> > > colocation tunnel-eduroam-on-bifur-eduroam-ipv4 inf: tunnel-eduroam \
> > >   bifur-eduroam-ipv4:Started
> > > 
> > > the resource is restarted again and again. According to the debug logs:
> > > 
> > > May 16 14:20:35 [3052] bifur1   lrmd:debug: 
> > > recurring_action_timer:
> > >  Scheduling another invocation of tunnel-eduroam_monitor_3
> > > May 16 14:20:35 [3052] bifur1   lrmd:debug: operation_finished: 
> > > tunnel-eduroam_monitor_3:62066 - exited with rc=0
> > > May 16 14:20:35 [3052] bifur1   lrmd:debug: operation_finished: 
> > > tunnel-eduroam_monitor_3:62066:stderr [ -- empty -- ]
> > > May 16 14:20:35 [3052] bifur1   lrmd:debug: operation_finished: 
> > > tunnel-eduroam_monitor_3:62066:stdout [ -- empty -- ]
> > > May 16 14:20:35 [3052] bifur1   lrmd:debug: log_finished:   
> > > finished - rsc:tunnel-eduroam action:monitor call_id:1045 pid:62066 
> > > exit-code:0 exec-time:0ms queue-time:0ms
> > > May 16 14:21:04 [3054] bifur1pengine: info: native_print:   
> > > tunnel-eduroam  (ocf::local:tunnel):Started bifur1
> > > May 16 14:21:04 [3054] bifur1pengine: info: 
> > > check_action_definition:
> > > Parameters to tunnel-eduroam_start_0 on bifur1 changed: was 
> > > 94afff0ff7cfc62f7cb1d5bf5b4d83aa vs. now f2317cad3d54cec5d7d7aa7d0bf35cf8 
> > > (restart:3.0.11) 0:0;48:3:0:73562fd6-1fe2-4930-8c6e-5953b82ebb32
> > 
> > This means that instance attributes changed; in this case pacemaker
> > restarts the resource to apply the new values. Turning on trace level
> > will hopefully show what exactly is being changed. You can also dump
> > the CIB before and after the restart to compare the current information.
> 
> The strange thing is that the new value never seems to be stored. These 
> are just the "was-now" parts from the log lines:
> 
> was 94afff0ff7cfc62f7cb1d5bf5b4d83aa vs. now f2317cad3d54cec5d7d7aa7d0bf35cf8
> was 94afff0ff7cfc62f7cb1d5bf5b4d83aa vs. now f2317cad3d54cec5d7d7aa7d0bf35cf8
> was 94afff0ff7cfc62f7cb1d5bf5b4d83aa vs. now f2317cad3d54cec5d7d7aa7d0bf35cf8
> ...
> 
> However, after issuing "cibadmin --query --local", the whole flipping 
> stopped! :-) Thanks!

No, I was wrong - it still repeats every ~15 minutes. The diff between two 
CIB XML dumps doesn't tell me much; I'm going to enable tracing.

Best regards,
Jozsef
--
E-mail : kadlecsik.joz...@wigner.mta.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics, Hungarian Academy of Sciences
 H-1525 Budapest 114, POB. 49, Hungary

Re: [ClusterLabs] Constant stop/start of resource in spite of interval=0

2019-05-18 Thread Kadlecsik József
On Sat, 18 May 2019, Andrei Borzenkov wrote:

> 18.05.2019 18:34, Kadlecsik József пишет:

> > We have a resource agent which creates IP tunnels. In spite of the 
> > configuration setting
> > 
> > primitive tunnel-eduroam ocf:local:tunnel \
> > params 
> > op start timeout=120s interval=0 \
> > op stop timeout=300s interval=0 \
> > op monitor timeout=30s interval=30s depth=0 \
> > meta target-role=Started
> > order bifur-eduroam-ipv4-before-tunnel-eduroam \
> > Mandatory: bifur-eduroam-ipv4 tunnel-eduroam
> > colocation tunnel-eduroam-on-bifur-eduroam-ipv4 inf: tunnel-eduroam \
> > bifur-eduroam-ipv4:Started
> > 
> > the resource is restarted again and again. According to the debug logs:
> > 
> > May 16 14:20:35 [3052] bifur1   lrmd:debug: recurring_action_timer:
> >  Scheduling another invocation of tunnel-eduroam_monitor_3
> > May 16 14:20:35 [3052] bifur1   lrmd:debug: operation_finished: 
> > tunnel-eduroam_monitor_3:62066 - exited with rc=0
> > May 16 14:20:35 [3052] bifur1   lrmd:debug: operation_finished: 
> > tunnel-eduroam_monitor_3:62066:stderr [ -- empty -- ]
> > May 16 14:20:35 [3052] bifur1   lrmd:debug: operation_finished: 
> > tunnel-eduroam_monitor_3:62066:stdout [ -- empty -- ]
> > May 16 14:20:35 [3052] bifur1   lrmd:debug: log_finished:   
> > finished - rsc:tunnel-eduroam action:monitor call_id:1045 pid:62066 
> > exit-code:0 exec-time:0ms queue-time:0ms
> > May 16 14:21:04 [3054] bifur1pengine: info: native_print:   
> > tunnel-eduroam  (ocf::local:tunnel):Started bifur1
> > May 16 14:21:04 [3054] bifur1pengine: info: 
> > check_action_definition:
> > Parameters to tunnel-eduroam_start_0 on bifur1 changed: was 
> > 94afff0ff7cfc62f7cb1d5bf5b4d83aa vs. now f2317cad3d54cec5d7d7aa7d0bf35cf8 
> > (restart:3.0.11) 0:0;48:3:0:73562fd6-1fe2-4930-8c6e-5953b82ebb32
> 
> This means that instance attributes changed; in this case pacemaker
> restarts the resource to apply the new values. Turning on trace level
> will hopefully show what exactly is being changed. You can also dump the
> CIB before and after the restart to compare the current information.

The strange thing is that the new value never seems to be stored. These are 
just the "was-now" parts from the log lines:

was 94afff0ff7cfc62f7cb1d5bf5b4d83aa vs. now f2317cad3d54cec5d7d7aa7d0bf35cf8
was 94afff0ff7cfc62f7cb1d5bf5b4d83aa vs. now f2317cad3d54cec5d7d7aa7d0bf35cf8
was 94afff0ff7cfc62f7cb1d5bf5b4d83aa vs. now f2317cad3d54cec5d7d7aa7d0bf35cf8
...

However, after issuing "cibadmin --query --local", the whole flipping 
stopped! :-) Thanks!

Best regards,
Jozsef
--
E-mail : kadlecsik.joz...@wigner.mta.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics, Hungarian Academy of Sciences
 H-1525 Budapest 114, POB. 49, Hungary

[ClusterLabs] Constant stop/start of resource in spite of interval=0

2019-05-18 Thread Kadlecsik József
Hello,

We have a resource agent which creates IP tunnels. In spite of the 
configuration setting

primitive tunnel-eduroam ocf:local:tunnel \
params 
op start timeout=120s interval=0 \
op stop timeout=300s interval=0 \
op monitor timeout=30s interval=30s depth=0 \
meta target-role=Started
order bifur-eduroam-ipv4-before-tunnel-eduroam \
Mandatory: bifur-eduroam-ipv4 tunnel-eduroam
colocation tunnel-eduroam-on-bifur-eduroam-ipv4 inf: tunnel-eduroam \
bifur-eduroam-ipv4:Started

the resource is restarted again and again. According to the debug logs:

May 16 14:20:35 [3052] bifur1   lrmd:debug: recurring_action_timer:
 Scheduling another invocation of tunnel-eduroam_monitor_3
May 16 14:20:35 [3052] bifur1   lrmd:debug: operation_finished: 
tunnel-eduroam_monitor_3:62066 - exited with rc=0
May 16 14:20:35 [3052] bifur1   lrmd:debug: operation_finished: 
tunnel-eduroam_monitor_3:62066:stderr [ -- empty -- ]
May 16 14:20:35 [3052] bifur1   lrmd:debug: operation_finished: 
tunnel-eduroam_monitor_3:62066:stdout [ -- empty -- ]
May 16 14:20:35 [3052] bifur1   lrmd:debug: log_finished:   
finished - rsc:tunnel-eduroam action:monitor call_id:1045 pid:62066 
exit-code:0 exec-time:0ms queue-time:0ms
May 16 14:21:04 [3054] bifur1pengine: info: native_print:   
tunnel-eduroam  (ocf::local:tunnel):Started bifur1
May 16 14:21:04 [3054] bifur1pengine: info: 
check_action_definition:
Parameters to tunnel-eduroam_start_0 on bifur1 changed: was 
94afff0ff7cfc62f7cb1d5bf5b4d83aa vs. now f2317cad3d54cec5d7d7aa7d0bf35cf8 
(restart:3.0.11) 0:0;48:3:0:73562fd6-1fe2-4930-8c6e-5953b82ebb32
May 16 14:21:04 [3054] bifur1pengine:debug: native_assign_node: 
Assigning bifur1 to tunnel-eduroam
May 16 14:21:04 [3054] bifur1pengine: info: RecurringOp: 
Start recurring monitor (30s) for tunnel-eduroam on bifur1
May 16 14:21:04 [3054] bifur1pengine:   notice: LogActions: Restart 
tunnel-eduroam  (Started bifur1)
May 16 14:21:04 [3055] bifur1   crmd:   notice: te_rsc_command: 
Initiating stop operation tunnel-eduroam_stop_0 locally on bifur1 | action 
50
May 16 14:21:04 [3055] bifur1   crmd:debug: 
stop_recurring_action_by_rsc:   Cancelling op 1045 for tunnel-eduroam 
(tunnel-eduroam:1045)
May 16 14:21:04 [3055] bifur1   crmd:debug: cancel_op:  Cancelling 
op 1045 for tunnel-eduroam (tunnel-eduroam:1045)
May 16 14:21:04 [3052] bifur1   lrmd: info: 
cancel_recurring_action:Cancelling ocf operation 
tunnel-eduroam_monitor_3
May 16 14:21:04 [3052] bifur1   lrmd:debug: log_finished:   
finished - rsc:tunnel-eduroam action:monitor call_id:1045  exit-code:0 
exec-time:0ms queue-time:0ms
May 16 14:21:04 [3055] bifur1   crmd:debug: cancel_op:  Op 1045 
for tunnel-eduroam (tunnel-eduroam:1045): cancelled
May 16 14:21:04 [3055] bifur1   crmd: info: do_lrm_rsc_op:  
Performing key=50:4:0:73562fd6-1fe2-4930-8c6e-5953b82ebb32 
op=tunnel-eduroam_stop_0
May 16 14:21:04 [3052] bifur1   lrmd: info: log_execute:
executing - rsc:tunnel-eduroam action:stop call_id:1047
May 16 14:21:04 [3055] bifur1   crmd: info: process_lrm_event:  
Result of monitor operation for tunnel-eduroam on bifur1: Cancelled | 
call=1045 key=tunnel-eduroam_monitor_3 confirmed=true
...

From where does the restart operation come? Why does it happen? The IP 
address is on the same node where the tunnel resource is already running.

Best regards,
Jozsef
--
E-mail : kadlecsik.joz...@wigner.mta.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics, Hungarian Academy of Sciences
 H-1525 Budapest 114, POB. 49, Hungary


Re: [ClusterLabs] Antw: Rebooting a standby node triggers lots of transitions

2018-09-07 Thread Kadlecsik József
On Wed, 5 Sep 2018, Kadlecsik József wrote:

> On Wed, 5 Sep 2018, Ken Gaillot wrote:
> 
> > > > For testing purposes one of our nodes was put into standby mode and 
> > > > then rebooted several times. When the standby node started up, it 
> > > > joined the cluster as a new member, which resulted in transitions 
> > > > between the online nodes. However, when the standby node was 
> > > > rebooted mid-transition, it triggered yet another transition. As a 
> > > > result, live migrations were aborted and guests were stopped/started.
> > > > 
> > > > How can one make sure that join/leave operations of standby nodes
> > > > do not 
> > > > affect the location of the running resources?
> > > > 
> > > > It's pacemaker 1.1.16-1 with corosync 2.4.2-3+deb9u1 on Debian 
> > > > stretch nodes.
> > 
> > Node joins/leaves do and should trigger new transitions, but that should 
> > not result in any actions if the node is in standby.
> >
> > The cluster will wait for any actions in progress (such as a live 
> > migration) to complete before beginning a new transition, so there is 
> > likely something else going on that is affecting the migration.
> > 
> > > Logs and more details, please!
> > 
> > Particularly the detail log on the DC should be helpful. It will have
> > "pengine:" messages with "saving inputs" at each transition.
> 
> I attached the log file.
> 
> There are log lines like this
> 
> Sep  5 12:22:30 atlas4 crmd[32776]:   notice: Transition aborted by 
> w2-utilization-cpu doing modify cpu=1: Configuration change 
> 
> which I don't understand: in the configuration the cpu utilization is 
> explicitly set to cpu=2 for w2.
> 
> Nothing changed, just the node atlas0 (in standby mode) was halted/started 
> several times. Still, resources were migrated, like in this case:
> 
> Sep  5 12:22:31 atlas4 VirtualDomain(mail0)[61781]: INFO: mail0: Starting 
> live migration to atlas3 (using: virsh --connect=qemu:///system --quiet 
> migrate --live  mail0 qemu+tls://atlas3/system ).
> 
> And besides the successful migrations, sometimes the guest was 
> stopped/started instead of being migrated:
> 
> Sep  5 12:25:22 atlas4 crmd[32776]:   notice: Result of stop operation for 
> mail0 on atlas4: 0 (ok) 
> Sep  5 12:25:22 atlas4 crmd[32776]:   notice: Initiating start operation 
> mail0_start_0 on atlas3 

Just guessing: maybe utilization is taken into account even when a node is 
offline, and that causes the transitions?

I can provide the pe-input files which were recorded during the events.

Best regards,
Jozsef
--
E-mail : kadlecsik.joz...@wigner.mta.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics, Hungarian Academy of Sciences
 H-1525 Budapest 114, POB. 49, Hungary


[ClusterLabs] Rebooting a standby node triggers lots of transitions

2018-09-05 Thread Kadlecsik József
Hi,

For testing purposes one of our nodes was put into standby mode and then 
rebooted several times. When the standby node started up, it joined the 
cluster as a new member, which resulted in transitions between the online 
nodes. However, when the standby node was rebooted mid-transition, it 
triggered yet another transition. As a result, live migrations were aborted 
and guests were stopped/started.

How can one make sure that join/leave operations of standby nodes do not 
affect the location of the running resources?

It's pacemaker 1.1.16-1 with corosync 2.4.2-3+deb9u1 on debian stretch 
nodes.

Best regards,
Jozsef
--
E-mail : kadlecsik.joz...@wigner.mta.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics, Hungarian Academy of Sciences
 H-1525 Budapest 114, POB. 49, Hungary