Re: [ClusterLabs] 2-Node cluster - both nodes unclean - can't start cluster

2023-03-13 Thread Lentes, Bernd

> -Original Message-
> From: Reid Wahl 
> Sent: Friday, March 10, 2023 10:30 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> ; Lentes, Bernd  muenchen.de>
> Subject: Re: [ClusterLabs] 2-Node cluster - both nodes unclean - can't 
> start
> cluster
>
> On Fri, Mar 10, 2023 at 10:49 AM Lentes, Bernd  muenchen.de> wrote:
> > (192.168.100.10:2340) was formed. Members joined: 1084777482
>
> Is this really the corosync node ID of one of your nodes? If not, what's 
> your
> corosync version? Is the number the same every time the issue happens?
> The number is so large and seemingly random that I wonder if there's some
> kind of memory corruption.
>
Yes it's correct.

ha-idg-1:~ # crm node show
ha-idg-1(1084777482): member
maintenance=off standby=off
ha-idg-2(1084777492): member(offline)
maintenance=off standby=off
ha-idg-1:~ #
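As an aside, the number is less random than it looks: as far as I know corosync derives the nodeid from the ring0 IPv4 address when no nodeid is set in corosync.conf, and 1084777482 is exactly 192.168.100.10 with the most significant bit cleared (the derivation rule is my assumption, but the arithmetic checks out):

  # 192.168.100.10 = 0xC0A8640A; clearing the top bit gives 0x40A8640A
  printf '%d\n' 0x40A8640A     # prints 1084777482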

> > Cluster node ha-idg-1 is now in unknown state  ⇐= is that the
> > problem ?
>
> Probably a normal part of the startup process but I haven't tested it yet.
>
> > Mar 10 19:36:34 [31046] ha-idg-1 stonith-ng:   notice: handle_request:
> > Received manual confirmation that ha-idg-1 is fenced
>
> Yes
>
> > tengine_stonith_notify:  We were allegedly just fenced by a human for
> > ha-idg-1!  <=  what does that mean ? I didn't
> fence
> > it
>
> It means you ran `stonith_admin -C`
>
> https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-
> 1.1.24/fencing/remote.c#L945-L961
>
> > Mar 10 19:36:34 [31050] ha-idg-1   crmd: info: crm_xml_cleanup:
> > Cleaning up memory from libxml2
> > Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:  warning:
> pcmk_child_exit:
> > Shutting cluster down because crmd[31050] had fatal failure
> > <===  ???
>
> Pacemaker is shutting down on the local node because it just received
> confirmation that it was fenced (because you ran `stonith_admin -C`).
> This is expected behavior.

OK. If it is expected then it's fine.

>
> Can you help me understand the issue here? You started the cluster on this
> node at 19:36:24. 10 seconds later, you ran `stonith_admin -C`, and the
> local node shut down Pacemaker, as expected. It doesn't look like
> Pacemaker stopped until you ran that command.

I didn't know that this is expected.

Bernd



[ClusterLabs] 2-Node cluster - both nodes unclean - can't start cluster

2023-03-10 Thread Lentes, Bernd
Hi,

I can't get my cluster running. I had problems with an OCFS2 volume, and both nodes were fenced.
When I now do a "systemctl start pacemaker.service", crm_mon shows both nodes as UNCLEAN for a few seconds, then pacemaker stops.
I tried to confirm the fencing with "stonith_admin -C", but it doesn't work.
Maybe the time window is too short, since pacemaker only runs for a few seconds.
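For the record, a minimal sketch of the manual confirmation (assuming the usual stonith_admin syntax; only confirm a node that really is powered off):

  # tell the cluster that the *other* node is already safely down
  stonith_admin --confirm ha-idg-2     # short form: stonith_admin -C ha-idg-2
  # confirming the local node instead makes Pacemaker believe it was itself
  # fenced and it will shut down - which is what the reply above points out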

Here is the log:

Mar 10 19:36:24 [31037] ha-idg-1 corosync notice  [MAIN  ] Corosync Cluster 
Engine ('2.3.6'): started and ready to provide service.
Mar 10 19:36:24 [31037] ha-idg-1 corosync info[MAIN  ] Corosync built-in 
features: debug testagents augeas systemd pie relro bindnow
Mar 10 19:36:24 [31037] ha-idg-1 corosync notice  [TOTEM ] Initializing 
transport (UDP/IP Multicast).
Mar 10 19:36:24 [31037] ha-idg-1 corosync notice  [TOTEM ] Initializing 
transmit/receive security (NSS) crypto: aes256 hash: sha1
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [TOTEM ] The network 
interface [192.168.100.10] is now up.
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [SERV  ] Service engine 
loaded: corosync configuration map access [0]
Mar 10 19:36:25 [31037] ha-idg-1 corosync info[QB] server name: cmap
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [SERV  ] Service engine 
loaded: corosync configuration service [1]
Mar 10 19:36:25 [31037] ha-idg-1 corosync info[QB] server name: cfg
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [SERV  ] Service engine 
loaded: corosync cluster closed process group service v1.01 [2]
Mar 10 19:36:25 [31037] ha-idg-1 corosync info[QB] server name: cpg
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [SERV  ] Service engine 
loaded: corosync profile loading service [4]
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [QUORUM] Using quorum 
provider corosync_votequorum
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [QUORUM] This node is 
within the primary component and will provide service.
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [QUORUM] Members[0]:
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [SERV  ] Service engine 
loaded: corosync vote quorum service v1.0 [5]
Mar 10 19:36:25 [31037] ha-idg-1 corosync info[QB] server name: 
votequorum
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [SERV  ] Service engine 
loaded: corosync cluster quorum service v0.1 [3]
Mar 10 19:36:25 [31037] ha-idg-1 corosync info[QB] server name: 
quorum
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [TOTEM ] A new membership 
(192.168.100.10:2340) was formed. Members joined: 1084777482
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [QUORUM] Members[1]: 
1084777482
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [MAIN  ] Completed service 
synchronization, ready to provide service.
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:   notice: main:Starting 
Pacemaker 1.1.24+20210811.f5abda0ee-3.27.1 | build=1.1.24+20210811.f5abda0ee 
features: generated-manpages agent-manp
ages ncurses libqb-logging libqb-ipc lha-fencing systemd nagios 
corosync-native atomic-attrd snmp libesmtp acls cibsecrets
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd: info: main:Maximum core 
file size is: 18446744073709551615
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd: info: qb_ipcs_us_publish: 
server name: pacemakerd
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd: info: 
pcmk__ipc_is_authentic_process_active:   Could not connect to lrmd IPC: 
Connection refused
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd: info: 
pcmk__ipc_is_authentic_process_active:   Could not connect to cib_ro IPC: 
Connection refused
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd: info: 
pcmk__ipc_is_authentic_process_active:   Could not connect to crmd IPC: 
Connection refused
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd: info: 
pcmk__ipc_is_authentic_process_active:   Could not connect to attrd IPC: 
Connection refused
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd: info: 
pcmk__ipc_is_authentic_process_active:   Could not connect to pengine IPC: 
Connection refused
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd: info: 
pcmk__ipc_is_authentic_process_active:   Could not connect to stonith-ng 
IPC: Connection refused
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd: info: corosync_node_name: 
Unable to get node name for nodeid 1084777482
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:   notice: get_node_name: 
Could not obtain a node name for corosync nodeid 1084777482
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd: info: crm_get_peer: 
Created entry 3c2499de-58a8-44f7-bf1e-03ff1fbec774/0x1456550 for node 
(null)/1084777482 (1 total)
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd: info: crm_get_peer:Node 
1084777482 has uuid 1084777482
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd: info: crm_update_peer_proc: 
cluster_connect_cpg: Node (null)[1084777482] - corosync-cpg is now online
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:   notice: 
cluster_connect_quorum:  Quorum 

[ClusterLabs] VirtualDomain did not stop although "crm resource stop"

2022-11-02 Thread Lentes, Bernd
Hi,

I think I found the reason, but I want to be sure.
I wanted to stop a VirtualDomain and did a "crm resource stop ...".
But it didn't shut down. After waiting several minutes I stopped it with libvirt, circumventing the cluster software.
First I wondered "why didn't it shut down?", but then I realized that a live migration of another VirtualDomain was also running and had been started before the stop of the resource.
Am I correct that it didn't shut down because of the running live migration, and that the shutdown would have started once the live migration finished?
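A hedged way to check whether the stop is merely still queued in the current transition rather than lost (the --pending flag needs the record-pending operation default, if I remember correctly):

  crm_mon -1 --pending     # one-shot status including in-flight/pending operations
  # the DC's log (pengine/crmd) also lists the actions planned for the transition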

Bernd

-- 
Bernd Lentes 
System Administrator 
Institute for Metabolism and Cell Death (MCD) 
Building 25 - office 122 
HelmholtzZentrum München 
bernd.len...@helmholtz-muenchen.de 
phone: +49 89 3187 1241
   +49 89 3187 49123 
fax:   +49 89 3187 2294 
https://www.helmholtz-munich.de/en/mcd





Re: [ClusterLabs] crm resource trace

2022-10-24 Thread Lentes, Bernd

- On 24 Oct, 2022, at 10:08, Klaus Wenninger kwenn...@redhat.com wrote:

> On Mon, Oct 24, 2022 at 9:50 AM Xin Liang via Users < [
> mailto:users@clusterlabs.org | users@clusterlabs.org ] > wrote:



> Did you try a cleanup in between?

When i do a cleanup before trace/untrace the resource is not restarted.
When i don't do a cleanup it is restarted.

Bernd



Re: [ClusterLabs] crm resource trace

2022-10-21 Thread Lentes, Bernd

- On 17 Oct, 2022, at 21:41, Ken Gaillot kgail...@redhat.com wrote:

> This turned out to be interesting.
> 
> In the first case, the resource history contains a start action and a
> recurring monitor. The parameters to both change, so the resource
> requires a restart.
> 
> In the second case, the resource's history was apparently cleaned at
> some point, so the cluster re-probed it and found it running. That
> means its history contained only the probe and the recurring monitor.
> Neither probe nor recurring monitor changes require a restart, so
> nothing is done.
> 
> It would probably make sense to distinguish between probes that found
> the resource running and probes that found it not running. Parameter
> changes in the former should probably be treated like start.
> 

Is that now a bug or by design ?
And what is the conclusion of it all ?
Do a "crm resource cleanup" before each "crm resource [un]trace" ?
And test everything with ptest before commit ?

Bernd



Re: [ClusterLabs] crm resource trace

2022-10-18 Thread Lentes, Bernd

- On 17 Oct, 2022, at 21:41, Ken Gaillot kgail...@redhat.com wrote:

> This turned out to be interesting.
> 
> In the first case, the resource history contains a start action and a
> recurring monitor. The parameters to both change, so the resource
> requires a restart.
> 
> In the second case, the resource's history was apparently cleaned at
> some point, so the cluster re-probed it and found it running. That
> means its history contained only the probe and the recurring monitor.
> Neither probe nor recurring monitor changes require a restart, so
> nothing is done.

"vm-genetrap_monitor_0". Is that a probe ?

Bernd



Re: [ClusterLabs] Antw: [EXT] trace of resource ‑ sometimes restart, sometimes not

2022-10-18 Thread Lentes, Bernd

- On 18 Oct, 2022, at 14:35, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de 
wrote:
> 
> # crm configure
> edit ...
> verify
> ptest nograph #!!!
> commit

That's very helpful. I didn't know that, Thanks.
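Spelled out with one of the resources from this thread as an example (just a sketch of what Ulrich describes):

  crm configure
    edit vm-genetrap     # make the change in the editor
    verify               # syntax/semantic check
    ptest nograph        # preview what the cluster WOULD do - a forced restart shows up here
    commit               # apply only if the preview looks right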

> --
> If you used that, you would have seen the restart.
> Despite of that I wonder why enabling tracing to start/stop must induce a
> resource restart.
> 
> Bernd, are you sure that was the only thing changed? Do you have a record of
> commands issued?

I'm pretty sure it was the only thing.

Bernd



Re: [ClusterLabs] crm resource trace

2022-10-17 Thread Lentes, Bernd
Hi,

I am trying to find out why the resource is sometimes restarted and sometimes not.
Unpredictable behaviour is something I expect from Windows, not from Linux.
Below you see two runs of "crm resource trace <resource>".
In the first case the resource is restarted, in the second it is not.
The command I used is identical in both cases.

ha-idg-2:~/trace-untrace # date; crm resource trace vm-genetrap
Fri Oct 14 19:05:51 CEST 2022
INFO: Trace for vm-genetrap is written to /var/lib/heartbeat/trace_ra/
INFO: Trace set, restart vm-genetrap to trace non-monitor operations

==

1st try:
Oct 14 19:05:52 [25996] ha-idg-1cib: info: cib_perform_op:  Diff: 
--- 7.28974.3 2
Oct 14 19:05:52 [25996] ha-idg-1cib: info: cib_perform_op:  Diff: 
+++ 7.28975.0 299af44e1c8a3867f9e7a4b25f2c3d6a
Oct 14 19:05:52 [25996] ha-idg-1cib: info: cib_perform_op:  +  
/cib:  @epoch=28975, @num_updates=0
Oct 14 19:05:52 [25996] ha-idg-1cib: info: cib_perform_op:  ++ 
/cib/configuration/resources/primitive[@id='vm-genetrap']/operations/op[@id='vm-genetrap-monitor-30']:
  
Oct 14 19:05:52 [25996] ha-idg-1cib: info: cib_perform_op:  ++  

 
Oct 14 19:05:52 [25996] ha-idg-1cib: info: cib_perform_op:  ++  

   
Oct 14 19:05:52 [25996] ha-idg-1cib: info: cib_perform_op:  ++ 
/cib/configuration/resources/primitive[@id='vm-genetrap']/operations/op[@id='vm-genetrap-stop-0']:
  
Oct 14 19:05:52 [25996] ha-idg-1cib: info: cib_perform_op:  ++  

 
Oct 14 19:05:52 [25996] ha-idg-1cib: info: cib_perform_op:  ++  

   
Oct 14 19:05:52 [25996] ha-idg-1cib: info: cib_perform_op:  ++ 
/cib/configuration/resources/primitive[@id='vm-genetrap']/operations/op[@id='vm-genetrap-start-0']:
  
Oct 14 19:05:52 [25996] ha-idg-1cib: info: cib_perform_op:  ++  

  
Oct 14 19:05:52 [25996] ha-idg-1cib: info: cib_perform_op:  ++  


Oct 14 19:05:52 [25996] ha-idg-1cib: info: cib_perform_op:  ++ 
/cib/configuration/resources/primitive[@id='vm-genetrap']/operations/op[@id='vm-genetrap-migrate_from-0']:
  
Oct 14 19:05:52 [25996] ha-idg-1cib: info: cib_perform_op:  ++  

 
Oct 14 19:05:52 [25996] ha-idg-1cib: info: cib_perform_op:  ++  

   
Oct 14 19:05:52 [25996] ha-idg-1cib: info: cib_perform_op:  ++ 
/cib/configuration/resources/primitive[@id='vm-genetrap']/operations/op[@id='vm-genetrap-migrate_to-0']:
  
Oct 14 19:05:52 [25996] ha-idg-1cib: info: cib_perform_op:  ++  

   
Oct 14 19:05:52 [25996] ha-idg-1cib: info: cib_perform_op:  ++  

 
Oct 14 19:05:52 [26001] ha-idg-1   crmd: info: abort_transition_graph:  
Transition 791 aborted by 
instance_attributes.vm-genetrap-monitor-30-instance_attributes 'create': 
Configuration change | cib=7.28975.0 source=te_update_diff_v2:483 
path=/cib/configuration/resources/primitive[@id='vm-genetrap']/operations/op[@id='vm-genetrap-monitor-30']
 complete=true
Oct 14 19:05:52 [26001] ha-idg-1   crmd:   notice: do_state_transition: 
State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC 
cause=C_FSA_INTERNAL origin=abort_transition_graph
Oct 14 19:05:52 [25996] ha-idg-1cib: info: cib_process_request: 
Completed cib_apply_diff operation for section 'all': OK (rc=0, 
origin=ha-idg-2/cibadmin/2, version=7.28975.0)
Oct 14 19:05:52 [25997] ha-idg-1 stonith-ng: info: 
update_cib_stonith_devices_v2:   Updating device list from the cib: create 
op[@id='vm-genetrap-monitor-30']
Oct 14 19:05:52 [25997] ha-idg-1 stonith-ng: info: cib_devices_update:  
Updating devices to version 7.28975.0
Oct 14 19:05:52 [25997] ha-idg-1 stonith-ng:   notice: unpack_config:   On loss 
of CCM Quorum: Ignore
Oct 14 19:05:52 [25996] ha-idg-1cib: info: cib_file_backup: 
Archived previous version as 

Re: [ClusterLabs] crm resource trace (Was: Re: trace of resource - sometimes restart, sometimes not)

2022-10-10 Thread Lentes, Bernd
- On 7 Oct, 2022, at 21:37, Reid Wahl nw...@redhat.com wrote:

> On Fri, Oct 7, 2022 at 6:02 AM Lentes, Bernd
>  wrote:
>> - On 7 Oct, 2022, at 01:18, Reid Wahl nw...@redhat.com wrote:
>>
>> > How did you set a trace just for monitor?
>>
>> crm resource trace dlm monitor.
> 
> crm resource trace   adds "trace_ra=1" to the end of the
> monitor operation:
> https://github.com/ClusterLabs/crmsh/blob/8cf6a9d13af6496fdd384c18c54680ceb354b72d/crmsh/ui_resource.py#L638-L646
> 
> That's a schema violation and pcs doesn't even allow it. I installed
> `crmsh` and tried to reproduce... `trace_ra=1` shows up in the
> configuration for the monitor operation but it gets ignored. I don't
> get *any* trace logs. That makes sense -- ocf-shellfuncs.in enables
> tracing only if OCF_RESKEY_trace_ra is true. Pacemaker doesn't add
> operation attribute to the OCF_RESKEY_* environment variables... at
> least in the current upstream main.
> 
> Apparently (since you got logs) this works in some way, or worked at
> some point in the past. Out of curiosity, what version are you on?
> 

SLES 12 SP5:
ha-idg-1:/usr/lib/ocf/resource.d/heartbeat # rpm -qa|grep -iE 
'pacemaker|corosync'

libpacemaker3-1.1.24+20210811.f5abda0ee-3.21.9.x86_64
corosync-2.3.6-9.22.1.x86_64
pacemaker-debugsource-1.1.23+20200622.28dd98fad-3.9.2.20591.0.PTF.1177212.x86_64
libcorosync4-2.3.6-9.22.1.x86_64
pacemaker-cli-1.1.24+20210811.f5abda0ee-3.21.9.x86_64
pacemaker-cts-1.1.24+20210811.f5abda0ee-3.21.9.x86_64
pacemaker-1.1.24+20210811.f5abda0ee-3.21.9.x86_64


Bernd



Re: [ClusterLabs] trace of resource - sometimes restart, sometimes not

2022-10-10 Thread Lentes, Bernd
- On 7 Oct, 2022, at 01:08, Ken Gaillot kgail...@redhat.com wrote:

> 
> Yes, trace_ra is an agent-defined resource parameter, not a Pacemaker-
> defined meta-attribute. Resources are restarted anytime a parameter
> changes (unless the parameter is set up for reloads).
> 
> trace_ra is unusual in that it's supported automatically by the OCF
> shell functions, rather than by the agents directly. That means it's
> not advertised in metadata. Otherwise agents could mark it as
> reloadable, and reload would be a quick no-op.
> 

OK. But why no restart if i just set "crm resource trace dlm monitor" ?

Bernd



[ClusterLabs] expected_votes in cluster conf

2022-10-09 Thread Lentes, Bernd
Dear all,

while checking my cluster with "crm status xml" i stumbled across:
ha-idg-1:/usr/lib/ocf/resource.d/heartbeat # crm status xml
[the XML output was stripped by the list archive; the element marked here contained expected_votes="unknown"  <=]
expected_votes="unknown" ?
I didn't find expected_votes in the pacemaker doc 
(https://clusterlabs.org/pacemaker/doc/deprecated/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html).
It's a setting for corosync.conf and in mine it is:
expected_votes: 2. I have a two-node cluster.
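For reference, the corresponding corosync.conf section typically looks like this (a sketch; two_node is optional but common in two-node clusters):

  quorum {
      provider: corosync_votequorum
      expected_votes: 2
      two_node: 1
  }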

I don't know where "expected_votes="unknown"" comes from in my case, maybe a typo.
Can you confirm that it isn't an option for the pacemaker configuration ? Or maybe an undocumented one ?

Thanks.

Bernd


-- 
Bernd Lentes 
System Administrator 
Institute for Metabolism and Cell Death (MCD) 
Building 25 - office 122 
HelmholtzZentrum München 
bernd.len...@helmholtz-muenchen.de 
phone: +49 89 3187 1241
   +49 89 3187 49123 
fax:   +49 89 3187 2294 
https://www.helmholtz-munich.de/en/mcd





Re: [ClusterLabs] trace of resource - sometimes restart, sometimes not

2022-10-07 Thread Lentes, Bernd

- On 7 Oct, 2022, at 01:08, Ken Gaillot kgail...@redhat.com wrote:

> 
> Yes, trace_ra is an agent-defined resource parameter, not a Pacemaker-
> defined meta-attribute. Resources are restarted anytime a parameter
> changes (unless the parameter is set up for reloads).
> 
> trace_ra is unusual in that it's supported automatically by the OCF
> shell functions, rather than by the agents directly. That means it's
> not advertised in metadata. Otherwise agents could mark it as
> reloadable, and reload would be a quick no-op.
> 

OK. But why no restart if i just set "crm resource trace dlm monitor" ?

Bernd



Re: [ClusterLabs] trace of resource - sometimes restart, sometimes not

2022-10-07 Thread Lentes, Bernd


- On 7 Oct, 2022, at 01:18, Reid Wahl nw...@redhat.com wrote:

> How did you set a trace just for monitor?

crm resource trace dlm monitor.

> Wish I could help with that -- it's mostly a mystery to me too ;)

:-))



[ClusterLabs] trace of resource - sometimes restart, sometimes not

2022-10-06 Thread Lentes, Bernd
Hi,

I have some problems with our DLM, so I wanted to trace it. Yesterday I just set a trace for "monitor". There was no restart of DLM afterwards; it went fine, as expected.
I got logs in /var/lib/heartbeat/trace_ra. After a few monitor runs I stopped tracing.

Today i set a trace for all operations.
Now resource DLM restarted:
* Restartdlm:0   ( ha-idg-1 )   due to 
resource definition change
I didn't expect that so i had some trouble.
Is the difference in this behaviour intentional ? If yes, why ? Is there a rule 
?

Furthermore, I'd like to ask where I can find more information about DLM, because it is a mystery to me.
Sometimes the DLM does not respond to the "monitor" operation, so it has to be restarted, and with it all dependent resources (which is a lot).
This happens under some load (although the node is not completely overwhelmed).
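In case it helps, dlm_controld ships a small CLI for inspecting lockspaces (a hedged sketch; the exact output differs between versions):

  dlm_tool ls         # list lockspaces and their current state
  dlm_tool status     # daemon / fencing status
  dlm_tool dump       # daemon debug buffer, useful when a monitor hangs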

Thanks.

Bernd

-- 
Bernd Lentes 
System Administrator 
Institute for Metabolism and Cell Death (MCD) 
Building 25 - office 122 
HelmholtzZentrum München 
bernd.len...@helmholtz-muenchen.de 
phone: +49 89 3187 1241
   +49 89 3187 49123 
fax:   +49 89 3187 2294 
https://www.helmholtz-munich.de/en/mcd





Re: [ClusterLabs] Cluster does not start resources

2022-08-24 Thread Lentes, Bernd

- On 24 Aug, 2022, at 16:26, kwenning kwenn...@redhat.com wrote:

>>
>> if I get Ulrich right - and my fading memory of when I really used crmsh the
>> last time is telling me the same thing ...
>>

I get the impression many people prefer pcs to crm. Is there any reason for 
that ?
And can i use pcs on Suse ? If yes, how ?

Bernd



Re: [ClusterLabs] Cluster does not start resources

2022-08-24 Thread Lentes, Bernd

- On 24 Aug, 2022, at 16:26, kwenning kwenn...@redhat.com wrote:



> 
> Guess the resources running now are those you tried to enable before
> while they were globally stopped 
> 

No. First i set stop-all-resources to false. Then SOME resources started.
Then i tried several times to start some VirtualDomains using "crm resource 
start"
which didn't succeed. Some time later i tried it again and it succeeded ...

Bernd



Re: [ClusterLabs] Antw: [EXT] Re: Cluster does not start resources

2022-08-24 Thread Lentes, Bernd


- On 24 Aug, 2022, at 16:01, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de 
wrote:


>> Now with "crm resource start" all resources started. I didn't change
>> anything !?!
> 
> I guess that command set the roles of all resources to "started", so you 
> changed
> something ;-)

I did it before and nothing happened ...

Bernd



Re: [ClusterLabs] Cluster does not start resources

2022-08-24 Thread Lentes, Bernd
Hi,


Now with "crm resource start" all resources started. I didn't change anything 
!?!

Bernd



Re: [ClusterLabs] Cluster does not start resources

2022-08-24 Thread Lentes, Bernd

- On 24 Aug, 2022, at 07:21, Reid Wahl nw...@redhat.com wrote:


> As a result, your command might start the virtual machines, but
> Pacemaker will still show that the resources are "Stopped (disabled)".
> To fix that, you'll need to enable the resources.

How do i achieve that ?

Bernd
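(For what it's worth, a hedged sketch of what "enabling" usually means here - the resource name is just an example from this thread:)

  crm resource start vm-geneious       # crmsh: sets target-role=Started again
  # the pcs equivalent would be: pcs resource enable vm-geneious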



Re: [ClusterLabs] Antw: [EXT] Re: Cluster does not start resources

2022-08-24 Thread Lentes, Bernd


- On 24 Aug, 2022, at 08:17, Reid Wahl nw...@redhat.com wrote:
> I'm not sure off the top of my head what (if anything) gets sent to
> the logs. Do note that Bernd is using pacemaker v1, which hasn't been
> receiving new features for quite a while.

Is an update recommended ?

Bernd



Re: [ClusterLabs] Antw: [EXT] Re: Cluster does not start resources

2022-08-24 Thread Lentes, Bernd


- On 24 Aug, 2022, at 08:10, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de 
wrote:


> 
> Bernd,
> 
> that command would simply set the role to "started", but I guess it already 
> is.
> Obviously, to be effective, stop-all-resources must have precedence. You see?
> 
> Regards,
> Ulrich

Yes, but i set stop-all-resources to false.
Some resources started afterwards, some didn't.
And the target-role for all "disabled" VirtualDomains is still stopped.
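A hedged way to inspect and clear that stored attribute per resource (crmsh syntax; the resource name is an example):

  crm resource meta vm-geneious show target-role      # prints e.g. Stopped
  crm resource meta vm-geneious delete target-role    # or: crm resource start vm-geneious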

Bernd



Re: [ClusterLabs] Cluster does not start resources

2022-08-24 Thread Lentes, Bernd


> 
> There is no resource with the name "virtual_domain" in your list. All
> non-active resources in your list are either disabled or unmanaged.
> Without actual commands that list resource state before "crm resource
> start", "crm resource start" itself and once more resource state after
> this command any answer will be just a wild guess.

"crm resource start virtual_domain" is just an example, virtual_domain is a 
placeholder for the name of the VM.

Bernd



Re: [ClusterLabs] Cluster does not start resources

2022-08-24 Thread Lentes, Bernd


- On 24 Aug, 2022, at 07:22, Reid Wahl nw...@redhat.com wrote:


> Are the VMs running after your start command?

No.

Bernd



Re: [ClusterLabs] Cluster does not start resources

2022-08-23 Thread Lentes, Bernd


- On 24 Aug, 2022, at 07:03, arvidjaar arvidj...@gmail.com wrote:

> On 24.08.2022 07:34, Lentes, Bernd wrote:
>> 
>> 
>> - On 24 Aug, 2022, at 05:33, Reid Wahl nw...@redhat.com wrote:
>> 
>> 
>>> The stop-all-resources cluster property is set to true. Is that intentional?
>> OMG. Thanks Reid !
>> 
>> But unfortunately not all virtual domains are running:
>> 
> 
> what exactly is not clear in this output? All these resources are
> explicitly disabled (target-role=stopped) and so will not be started.
> 
That's clear. But a manual "crm resource start virtual_domain" should start them, yet it doesn't.

Bernd



Re: [ClusterLabs] Cluster does not start resources

2022-08-23 Thread Lentes, Bernd


- On 24 Aug, 2022, at 05:33, Reid Wahl nw...@redhat.com wrote:


> The stop-all-resources cluster property is set to true. Is that intentional?
OMG. Thanks Reid ! 

But unfortunately not all virtual domains are running:

Stack: corosync
Current DC: ha-idg-2 (version 
1.1.24+20210811.f5abda0ee-3.21.9-1.1.24+20210811.f5abda0ee) - partition with 
quorum
Last updated: Wed Aug 24 06:14:37 2022
Last change: Wed Aug 24 06:04:24 2022 by root via cibadmin on ha-idg-1

2 nodes configured
40 resource instances configured (21 DISABLED)

Node ha-idg-1: online
fence_ilo_ha-idg-2  (stonith:fence_ilo2):   Started fenct ha-idg-2 
mit ILO
dlm (ocf::pacemaker:controld):  Started
clvmd   (ocf::heartbeat:clvm):  Started
vm-mausdb   (ocf::lentes:VirtualDomain):Started
fs_ocfs2(ocf::lentes:Filesystem.new):   Started
vm-nc-mcd   (ocf::lentes:VirtualDomain):Started
fs_test_ocfs2   (ocf::lentes:Filesystem.new):   Started
gfs2_snap   (ocf::heartbeat:Filesystem):Started
gfs2_share  (ocf::heartbeat:Filesystem):Started
Node ha-idg-2: online
fence_ilo_ha-idg-1  (stonith:fence_ilo4):   Started fenct ha-idg-1 
mit ILO
clvmd   (ocf::heartbeat:clvm):  Started
dlm (ocf::pacemaker:controld):  Started
vm-sim  (ocf::lentes:VirtualDomain):Started
gfs2_snap   (ocf::heartbeat:Filesystem):Started
fs_ocfs2(ocf::lentes:Filesystem.new):   Started
gfs2_share  (ocf::heartbeat:Filesystem):Started
vm-seneca   (ocf::lentes:VirtualDomain):Started
vm-ssh  (ocf::lentes:VirtualDomain):Started

Inactive resources:

 Clone Set: ClusterMon-clone [ClusterMon-SMTP]
 Stopped (disabled): [ ha-idg-1 ha-idg-2 ]
vm-geneious (ocf::lentes:VirtualDomain):Stopped (disabled)
vm-idcc-devel   (ocf::lentes:VirtualDomain):Stopped (disabled)
vm-genetrap (ocf::lentes:VirtualDomain):Stopped (disabled)
vm-mouseidgenes (ocf::lentes:VirtualDomain):Stopped (disabled)
vm-greensql (ocf::lentes:VirtualDomain):Stopped (disabled)
vm-severin  (ocf::lentes:VirtualDomain):Stopped (disabled)
ping_19216810010(ocf::pacemaker:ping):  Stopped (disabled)
ping_19216810020(ocf::pacemaker:ping):  Stopped (disabled)
vm_crispor  (ocf::heartbeat:VirtualDomain): Stopped (unmanaged)
vm-dietrich (ocf::lentes:VirtualDomain):Stopped (disabled)
vm-pathway  (ocf::lentes:VirtualDomain):Stopped (disabled)
vm-crispor-server   (ocf::lentes:VirtualDomain):Stopped (disabled)
vm-geneious-license (ocf::lentes:VirtualDomain):Stopped (disabled)
vm-amok (ocf::lentes:VirtualDomain):Stopped (disabled)
vm-geneious-license-mcd (ocf::lentes:VirtualDomain):Stopped (disabled)
vm-documents-oo (ocf::lentes:VirtualDomain):Stopped (disabled)
vm_snipanalysis (ocf::lentes:VirtualDomain):Stopped (disabled, unmanaged)
vm-photoshop(ocf::lentes:VirtualDomain):Stopped (disabled)
vm-check-mk (ocf::lentes:VirtualDomain):Stopped (disabled)
vm-encore   (ocf::lentes:VirtualDomain):Stopped (disabled)

Migration Summary:
* Node ha-idg-1:
* Node ha-idg-2:

A manual "crm resource start" wasn't successful either.

Bernd



Re: [ClusterLabs] Cluster does not start resources

2022-08-23 Thread Lentes, Bernd



- On 24 Aug, 2022, at 04:04, Reid Wahl nw...@redhat.com wrote:

> Can you share your CIB? Not sure off hand what everything means (resource not
> found, IPC error, crmd failure and respawn), and pacemaker v1 logs aren't the
> easiest to interpret. But perhaps something in the CIB will show itself as an
> issue.

Attached


> --
> Regards,

> Reid Wahl (He/Him)
> Senior Software Engineer, Red Hat
> RHEL High Availability - Pacemaker


cib.xml
Description: XML document




[ClusterLabs] Cluster does not start resources

2022-08-23 Thread Lentes, Bernd
Hi,

currently i can't start resources on our 2-node-cluster.
Cluster seems to be ok:

Stack: corosync
Current DC: ha-idg-1 (version 
1.1.24+20210811.f5abda0ee-3.21.9-1.1.24+20210811.f5abda0ee) - partition with 
quorum
Last updated: Wed Aug 24 02:56:46 2022
Last change: Wed Aug 24 02:56:41 2022 by hacluster via crmd on ha-idg-1

2 nodes configured
40 resource instances configured (26 DISABLED)

Node ha-idg-1: online
Node ha-idg-2: online

Inactive resources:

fence_ilo_ha-idg-2  (stonith:fence_ilo2):   Stopped
fence_ilo_ha-idg-1  (stonith:fence_ilo4):   Stopped
 Clone Set: cl_share [gr_share]
 Stopped: [ ha-idg-1 ha-idg-2 ]
 Clone Set: ClusterMon-clone [ClusterMon-SMTP]
 Stopped (disabled): [ ha-idg-1 ha-idg-2 ]
vm-mausdb   (ocf::lentes:VirtualDomain):Stopped (disabled)
vm-sim  (ocf::lentes:VirtualDomain):Stopped (disabled)
vm-geneious (ocf::lentes:VirtualDomain):Stopped (disabled)
vm-idcc-devel   (ocf::lentes:VirtualDomain):Stopped (disabled)
vm-genetrap (ocf::lentes:VirtualDomain):Stopped (disabled)
vm-mouseidgenes (ocf::lentes:VirtualDomain):Stopped (disabled)
vm-greensql (ocf::lentes:VirtualDomain):Stopped (disabled)
vm-severin  (ocf::lentes:VirtualDomain):Stopped (disabled)
ping_19216810010(ocf::pacemaker:ping):  Stopped (disabled)
ping_19216810020(ocf::pacemaker:ping):  Stopped (disabled)
vm_crispor  (ocf::heartbeat:VirtualDomain): Stopped (unmanaged)
vm-dietrich (ocf::lentes:VirtualDomain):Stopped (disabled)
vm-pathway  (ocf::lentes:VirtualDomain):Stopped (disabled)
vm-crispor-server   (ocf::lentes:VirtualDomain):Stopped (disabled)
vm-geneious-license (ocf::lentes:VirtualDomain):Stopped (disabled)
vm-nc-mcd   (ocf::lentes:VirtualDomain):Stopped (disabled, unmanaged)
vm-amok (ocf::lentes:VirtualDomain):Stopped (disabled)
vm-geneious-license-mcd (ocf::lentes:VirtualDomain):Stopped (disabled)
vm-documents-oo (ocf::lentes:VirtualDomain):Stopped (disabled)
fs_test_ocfs2   (ocf::lentes:Filesystem.new):   Stopped
vm-ssh  (ocf::lentes:VirtualDomain):Stopped (disabled)
vm_snipanalysis (ocf::lentes:VirtualDomain):Stopped (disabled, unmanaged)
vm-seneca   (ocf::lentes:VirtualDomain):Stopped (disabled)
vm-photoshop(ocf::lentes:VirtualDomain):Stopped (disabled)
vm-check-mk (ocf::lentes:VirtualDomain):Stopped (disabled)
vm-encore   (ocf::lentes:VirtualDomain):Stopped (disabled)

Migration Summary:
* Node ha-idg-1:
* Node ha-idg-2:

Fencing History:
* Off of ha-idg-2 successful: delegate=ha-idg-1, client=crmd.27356, 
origin=ha-idg-1,
last-successful='Wed Aug 24 01:53:49 2022'

Trying to start e.g. cl_share, which is a prerequisite for the virtual domains 
... nothing happens.
I did a "crm resource cleanup" (although crm_mon shows no error) hoping this 
will help ... it didn't.
my command history:
 1471  2022-08-24 03:11:27 crm resource cleanup
 1472  2022-08-24 03:11:52 crm resource cleanup cl_share
 1473  2022-08-24 03:12:45 crm resource start cl_share
(to correlate with the log)

I found some weird entries in the log after the "crm resource cleanup":

Aug 24 03:11:28 [27351] ha-idg-1cib:  warning: do_local_notify: A-Sync 
reply to crmd failed: No message of desired type
Aug 24 03:11:33 [27351] ha-idg-1cib: info: cib_process_ping:
Reporting our current digest to ha-idg-1: ed5bb7d32532ebf1ce3c45d0067c55b3 for 
7.28627.70 (0x15073e0 0)
Aug 24 03:11:52 [27353] ha-idg-1   lrmd: info: 
process_lrmd_get_rsc_info:   Resource 'dlm:0' not found (0 active resources)
Aug 24 03:11:52 [27356] ha-idg-1   crmd:   notice: do_lrm_invoke:   Not 
registering resource 'dlm:0' for a delete event | get-rc=-19 (No such device) 
transition-key=(null)

What does "Resource not found" mean here ?

 ...
Aug 24 03:11:57 [27351] ha-idg-1cib: info: cib_process_ping:
Reporting our current digest to ha-idg-1: 0b3e9ad9ad8103ce2da3b6b8d41e6716 for 
7.28628.0 (0x1352bf0 0)
Aug 24 03:11:58 [27356] ha-idg-1   crmd:error: do_pe_invoke_callback:   
Could not retrieve the Cluster Information Base: Timer expired | rc=-62 call=222
Aug 24 03:11:58 [27356] ha-idg-1   crmd: info: register_fsa_error_adv:  
Resetting the current action list
Aug 24 03:11:58 [27356] ha-idg-1   crmd:error: do_log:  Input I_ERROR 
received in state S_POLICY_ENGINE from do_pe_invoke_callback
Aug 24 03:11:58 [27356] ha-idg-1   crmd:  warning: do_state_transition: 
State transition S_POLICY_ENGINE -> S_RECOVERY | input=I_ERROR 
cause=C_FSA_INTERNAL origin=do_pe_invoke_callback
Aug 24 03:11:58 [27356] ha-idg-1   crmd:  warning: do_recover:  
Fast-tracking shutdown in response to errors
Aug 24 03:11:58 [27356] ha-idg-1   crmd:  warning: do_election_vote:
Not voting in election, we're in state S_RECOVERY
Aug 24 03:11:58 [27356] ha-idg-1   crmd: info: do_dc_release:   DC role 

Re: [ClusterLabs] 2-Node Cluster - fencing with just one node running ?

2022-08-04 Thread Lentes, Bernd
- On 4 Aug, 2022, at 19:46, Reid Wahl nw...@redhat.com wrote:

>> It shuts down ha-idg-2:
>> 2022-08-03T01:19:51.866200+02:00 ha-idg-2 systemd-logind[1535]: Power key
>> pressed.
>> 2022-08-03T01:19:52.048335+02:00 ha-idg-2 systemd-logind[1535]: System is
>> powering down.
>> 2022-08-03T01:19:52.051815+02:00 ha-idg-2 systemd[1]: Stopped target
>> resource-agents dependencies.
>>  ...
> 
> Yes, but it thought it was shutting down ha-idg-1.
> 
>>
>> Then it stops cluster software on ha-idg-1:
>> Aug 03 01:19:58 [19361] ha-idg-1 pacemakerd:  warning: pcmk_child_exit: 
>> Shutting
>> cluster down because crmd[19368] had fatal failure
>> Aug 03 01:19:58 [19361] ha-idg-1 pacemakerd:   notice: pcmk_shutdown_worker:
>> Shutting down Pacemaker
>> Aug 03 01:19:58 [19361] ha-idg-1 pacemakerd:   notice: stop_child:  
>> Stopping
>> pengine | sent signal 15 to process 19367
>>  ...
> 
> Node ha-idg-1 received a notification from the fencer that said "hey,
> we just fenced ha-idg-1!" Then it said "oh no, that's me! I'll shut
> myself down now."
> 
> That can be helpful if we're using fabric fencing. That's not supposed
> to happen with power fencing. The shutdown on ha-idg-1 didn't hurt
> anything, but it should have gotten powered off (instead of powering
> off ha-idg-2.
> 
What is "fabric" fencing and "power" fencing ?
Fabric something like ILO or IPMI ? And power fencing is cutting off power by 
controllable
UPS or power switches ?

Bernd



Re: [ClusterLabs] 2-Node Cluster - fencing with just one node running ?

2022-08-04 Thread Lentes, Bernd

- On 4 Aug, 2022, at 15:14, arvidjaar arvidj...@gmail.com wrote:

> On 04.08.2022 16:06, Lentes, Bernd wrote:
>> 
>> - On 4 Aug, 2022, at 00:27, Reid Wahl nw...@redhat.com wrote:
>> 
>> What do you mean by "banned" ? "crm resource ban ..." ?
>> Is that something different from a location constraint ?
> "crm resource ban" creates location constraint, but not every location
> constraint is created by "crm resource ban".

OK.
 
It seems that the cluster realizes that something went wrong.
It wants to shutdown ha-idg-1:
Aug 03 01:19:12 [19367] ha-idg-1pengine:  warning: pe_fence_node:   Cluster 
node ha-idg-1 will be fenced: vm-mausdb failed there
Aug 03 01:19:12 [19367] ha-idg-1pengine: info: native_stop_constraints: 
fence_ilo_ha-idg-2_stop_0 is implicit after ha-idg-1 is fenced
Aug 03 01:19:12 [19367] ha-idg-1pengine:   notice: LogNodeActions:   * 
Fence (Off) ha-idg-1 'vm-mausdb failed there'
Aug 03 01:19:14 [19367] ha-idg-1pengine:  warning: pe_fence_node:   Cluster 
node ha-idg-1 will be fenced: vm-mausdb failed there
Aug 03 01:19:15 [19368] ha-idg-1   crmd:   notice: te_fence_node:   
Requesting fencing (Off) of node ha-idg-1 | action=8 timeout=6
 ...

It shuts down ha-idg-2:
2022-08-03T01:19:51.866200+02:00 ha-idg-2 systemd-logind[1535]: Power key 
pressed.
2022-08-03T01:19:52.048335+02:00 ha-idg-2 systemd-logind[1535]: System is 
powering down.
2022-08-03T01:19:52.051815+02:00 ha-idg-2 systemd[1]: Stopped target 
resource-agents dependencies.
 ...

Then it stops cluster software on ha-idg-1:
Aug 03 01:19:58 [19361] ha-idg-1 pacemakerd:  warning: pcmk_child_exit: 
Shutting cluster down because crmd[19368] had fatal failure
Aug 03 01:19:58 [19361] ha-idg-1 pacemakerd:   notice: pcmk_shutdown_worker:
Shutting down Pacemaker
Aug 03 01:19:58 [19361] ha-idg-1 pacemakerd:   notice: stop_child:  
Stopping pengine | sent signal 15 to process 19367
 ...

Bernd




Re: [ClusterLabs] 2-Node Cluster - fencing with just one node running ?

2022-08-04 Thread Lentes, Bernd

- On 4 Aug, 2022, at 00:27, Reid Wahl nw...@redhat.com wrote:

> 
> Such constraints are unnecessary.
> 
> Let's say we have two stonith devices called "fence_dev1" and
> "fence_dev2" that fence nodes 1 and 2, respectively. If node 2 needs
> to be fenced, and fence_dev2 is running on node 2, node 1 will still
> use fence_dev2 to fence node 2. The current location of the stonith
> device only tells us which node is running the recurring monitor
> operation for that stonith device. The device is available to ALL
> nodes, unless it's disabled or it's banned from a given node. So these
> constraints serve no purpose in most cases.

What do you mean by "banned" ? "crm resource ban ..." ?
Is that something different from a location constraint ?

> If you ban fence_dev2 from node 1, then node 1 won't be able to use
> fence_dev2 to fence node 2. Likewise, if you ban fence_dev1 from node
> 1, then node 1 won't be able to use fence_dev1 to fence itself.
> Usually that's unnecessary anyway, but it may be preferable to power
> ourselves off if we're the last remaining node and a stop operation
> fails.
So banning a fencing device from a node means that this node can't use the 
fencing device ?
 
> If ha-idg-2 is in standby, it can still fence ha-idg-1. Since it
> sounds like you've banned fence_ilo_ha-idg-1 from ha-idg-1, so that it
> can't run anywhere when ha-idg-2 is in standby, I'm not sure off the
> top of my head whether fence_ilo_ha-idg-1 is available in this
> situation. It may not be.

ha-idg-2 was not only in standby, i also stopped pacemaker on that node.
Then ha-idg-2 can't fence ha-idg-1 i assume.

> 
> A solution would be to stop banning the stonith devices from their
> respective nodes. Surely if fence_ilo_ha-idg-1 had been running on
> ha-idg-1, ha-idg-2 would have been able to use it to fence ha-idg-1.
> (Again, I'm not sure if that's still true if ha-idg-2 is in standby
> **and** fence_ilo_ha-idg-1 is banned from ha-idg-1.)
> 
>> Aug 03 01:19:58 [19364] ha-idg-1 stonith-ng:   notice: log_operation:
>> Operation 'Off' [20705] (call 2 from crmd.19368) for host 'ha-idg-1' with
>> device 'fence_ilo_ha-idg-2' returned: 0 (OK)
>> So the cluster starts the resource running on ha-idg-1 and cut off ha-idg-2,
>> which isn't necessary.
> 
> Here, it sounds like the pcmk_host_list setting is either missing or
> misconfigured for fence_ilo_ha-idg-2. fence_ilo_ha-idg-2 should NOT be
> usable for fencing ha-idg-1.
> 
> fence_ilo_ha-idg-1 should be configured with pcmk_host_list=ha-idg-1,
> and fence_ilo_ha-idg-2 should be configured with
> pcmk_host_list=ha-idg-2.

I will check that.
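For reference, a hedged crmsh sketch of what Reid describes - the fence agent parameters (address, credentials) are placeholders, pcmk_host_list is the relevant part:

  primitive fence_ilo_ha-idg-1 stonith:fence_ilo4 \
      params ipaddr=<ilo-of-ha-idg-1> login=<user> passwd=<secret> \
             pcmk_host_list=ha-idg-1 \
      op monitor interval=3600
  primitive fence_ilo_ha-idg-2 stonith:fence_ilo2 \
      params ipaddr=<ilo-of-ha-idg-2> login=<user> passwd=<secret> \
             pcmk_host_list=ha-idg-2 \
      op monitor interval=3600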

> What happened is that ha-idg-1 used fence_ilo_ha-idg-2 to fence
> itself. Of course, this only rebooted ha-idg-2. But based on the
> stonith device configuration, pacemaker on ha-idg-1 believed that
> ha-idg-1 had been fenced. Hence the "allegedly just fenced" message.
> 
>>
>> Finally the cluster seems to realize that something went wrong:
>> Aug 03 01:19:58 [19368] ha-idg-1   crmd: crit: 
>> tengine_stonith_notify:
>> We were allegedly just fenced by ha-idg-1 for ha-idg-1!

Bernd




[ClusterLabs] 2-Node Cluster - fencing with just one node running ?

2022-08-03 Thread Lentes, Bernd
Hi,

i have the following situation:
2-node Cluster, just one node running (ha-idg-1).
The second node (ha-idg-2) is in standby. DLM monitor on ha-idg-1 times out.
Cluster tries to restart all services depending on DLM:
Aug 03 01:07:11 [19367] ha-idg-1pengine:   notice: LogAction:* 
Recoverdlm:0   ( ha-idg-1 )
Aug 03 01:07:11 [19367] ha-idg-1pengine:   notice: LogAction:* 
Restartclvmd:0 ( ha-idg-1 )   due to required dlm:0 
start
Aug 03 01:07:11 [19367] ha-idg-1pengine:   notice: LogAction:* 
Restartgfs2_share:0( ha-idg-1 )   due to required clvmd:0 
start
Aug 03 01:07:11 [19367] ha-idg-1pengine:   notice: LogAction:* 
Restartgfs2_snap:0 ( ha-idg-1 )   due to required 
gfs2_share:0 start
Aug 03 01:07:11 [19367] ha-idg-1pengine:   notice: LogAction:* 
Restartfs_ocfs2:0  ( ha-idg-1 )   due to required 
gfs2_snap:0 start
Aug 03 01:07:11 [19367] ha-idg-1pengine: info: LogActions:  Leave   
dlm:1   (Stopped)
Aug 03 01:07:11 [19367] ha-idg-1pengine: info: LogActions:  Leave   
clvmd:1 (Stopped)
Aug 03 01:07:11 [19367] ha-idg-1pengine: info: LogActions:  Leave   
gfs2_share:1(Stopped)
Aug 03 01:07:11 [19367] ha-idg-1pengine: info: LogActions:  Leave   
gfs2_snap:1 (Stopped)
Aug 03 01:07:11 [19367] ha-idg-1pengine: info: LogActions:  Leave   
fs_ocfs2:1  (Stopped)
Aug 03 01:07:11 [19367] ha-idg-1pengine: info: LogActions:  Leave   
ClusterMon-SMTP:0   (Stopped)
Aug 03 01:07:11 [19367] ha-idg-1pengine: info: LogActions:  Leave   
ClusterMon-SMTP:1   (Stopped)
Aug 03 01:07:11 [19367] ha-idg-1pengine:   notice: LogAction:* 
Restartvm-mausdb   ( ha-idg-1 )   due to required cl_share 
running
Aug 03 01:07:11 [19367] ha-idg-1pengine:   notice: LogAction:* 
Restartvm-sim  ( ha-idg-1 )   due to required cl_share 
running
Aug 03 01:07:11 [19367] ha-idg-1pengine:   notice: LogAction:* 
Restartvm-geneious ( ha-idg-1 )   due to required cl_share 
running
Aug 03 01:07:11 [19367] ha-idg-1pengine:   notice: LogAction:* 
Restartvm-idcc-devel   ( ha-idg-1 )   due to required cl_share 
running
 ...

restart of vm-mausdb failed, stop timed out:
VirtualDomain(vm-mausdb)[32415]:2022/08/03_01:19:06 INFO: Issuing 
forced shutdown (destroy) request for domain vm-mausdb.
Aug 03 01:19:11 [19365] ha-idg-1   lrmd:  warning: child_timeout_callback:  
vm-mausdb_stop_0 process (PID 32415) timed out
Aug 03 01:19:11 [19365] ha-idg-1   lrmd:  warning: operation_finished:  
vm-mausdb_stop_0:32415 - timed out after 72ms
 ...
Aug 03 01:19:14 [19367] ha-idg-1pengine:  warning: pe_fence_node:   Cluster 
node ha-idg-1 will be fenced: vm-mausdb failed there
Aug 03 01:19:15 [19368] ha-idg-1   crmd:   notice: te_fence_node:   
Requesting fencing (Off) of node ha-idg-1 | action=8 timeout=6

I have two fencing resources defined. One for ha-idg-1, one for ha-idg-2. Both 
are HP ILO network adapters.
I have two location constraints: both take care that the resource for fencing 
node ha-idg-1 is running on ha-idg-2 and vice versa.
I never thought it would be necessary for a node to be able to fence itself.
So now that ha-idg-2 is in standby, there is no fence device left to stonith ha-idg-1.
Aug 03 01:19:58 [19364] ha-idg-1 stonith-ng:   notice: log_operation:   
Operation 'Off' [20705] (call 2 from crmd.19368) for host 'ha-idg-1' with 
device 'fence_ilo_ha-idg-2' returned: 0 (OK)
So the cluster used the fencing resource running on ha-idg-1 and cut off ha-idg-2, which wasn't necessary.

Finally the cluster seems to realize that something went wrong:
Aug 03 01:19:58 [19368] ha-idg-1   crmd: crit: tengine_stonith_notify:  
We were allegedly just fenced by ha-idg-1 for ha-idg-1!

So my question now: is it necessary to have a fencing device with which a node can commit suicide ?

Bernd

-- 
Bernd Lentes 
System Administrator 
Institute for Metabolism and Cell Death (MCD) 
Building 25 - office 122 
HelmholtzZentrum München 
bernd.len...@helmholtz-muenchen.de 
phone: +49 89 3187 1241
   +49 89 3187 49123 
fax:   +49 89 3187 2294 
http://www.helmholtz-muenchen.de/mcd 


[ClusterLabs] cluster log not unambiguous about state of VirtualDomains

2022-08-03 Thread Lentes, Bernd
Hi,

I found some strange behaviour in the cluster log (/var/log/cluster/corosync.log).
I KNOW that I put one node (ha-idg-2) into standby mode and stopped the pacemaker service on that node:
The history of the shell says:
993  2022-08-02 18:28:25 crm node standby ha-idg-2
994  2022-08-02 18:28:58 systemctl stop pacemaker.service

Later on i had some trouble with high load.
I found contradictory entries in the log on the DC (ha-idg-1):
Aug 03 00:14:04 [19367] ha-idg-1pengine: info: determine_op_status: 
Operation monitor found resource vm-documents-oo active on ha-idg-1
Aug 03 00:14:04 [19367] ha-idg-1pengine: info: determine_op_status: 
Operation monitor found resource vm-documents-oo active on ha-idg-1
Aug 03 00:14:04 [19367] ha-idg-1pengine: info: determine_op_status: 
Operation monitor found resource vm-mausdb active on ha-idg-1
Aug 03 00:14:04 [19367] ha-idg-1pengine: info: determine_op_status: 
Operation monitor found resource vm-mausdb active on ha-idg-1
Aug 03 00:14:04 [19367] ha-idg-1pengine: info: determine_op_status: 
Operation monitor found resource vm-photoshop active on ha-idg-1
Aug 03 00:14:04 [19367] ha-idg-1pengine: info: determine_op_status: 
Operation monitor found resource vm-photoshop active on ha-idg-1
Aug 03 00:14:04 [19367] ha-idg-1pengine: info: determine_op_status: 
Operation monitor found resource vm-encore active on ha-idg-1
Aug 03 00:14:04 [19367] ha-idg-1pengine: info: determine_op_status: 
Operation monitor found resource vm-encore active on ha-idg-1
Aug 03 00:14:04 [19367] ha-idg-1pengine: info: determine_op_status: 
Operation monitor found resource dlm:1 active on ha-idg-2
Aug 03 00:14:04 [19367] ha-idg-1pengine: info: determine_op_status: 
Operation monitor found resource vm-seneca active on ha-idg-2<===
Aug 03 00:14:04 [19367] ha-idg-1pengine: info: determine_op_status: 
Operation monitor found resource vm-pathway active on ha-idg-2   <===
Aug 03 00:14:04 [19367] ha-idg-1pengine: info: determine_op_status: 
Operation monitor found resource vm-dietrich active on ha-idg-2  <===
Aug 03 00:14:04 [19367] ha-idg-1pengine: info: determine_op_status: 
Operation monitor found resource vm-sim active on ha-idg-2   <===
Aug 03 00:14:04 [19367] ha-idg-1pengine: info: determine_op_status: 
Operation monitor found resource vm-ssh active on ha-idg-2   <===
Aug 03 00:14:04 [19367] ha-idg-1pengine: info: determine_op_status: 
Operation monitor found resource vm-nextcloud active on ha-idg-2 <===
Aug 03 00:14:04 [19367] ha-idg-1pengine: info: determine_op_status: 
Operation monitor found resource fs_ocfs2:1 active on ha-idg-2
Aug 03 00:14:04 [19367] ha-idg-1pengine: info: determine_op_status: 
Operation monitor found resource gfs2_share:1 active on ha-idg-2
Aug 03 00:14:04 [19367] ha-idg-1pengine: info: determine_op_status: 
Operation monitor found resource vm-geneious active on ha-idg-2  <===
Aug 03 00:14:04 [19367] ha-idg-1pengine: info: determine_op_status: 
Operation monitor found resource gfs2_snap:1 active on ha-idg-2  <===
Aug 03 00:14:04 [19367] ha-idg-1pengine: info: determine_op_status: 
Operation monitor found resource vm-geneious-license-mcd active on ha-idg-2 <===
Aug 03 00:14:04 [19367] ha-idg-1pengine: info: determine_op_status: 
Operation monitor found resource clvmd:1 active on ha-idg-2

The log says some VirtualDomains are running on ha-idg-2 !?!

But just a few lines later the log says all VirtualDomains are running on 
ha-idg-1:
Aug 03 00:14:04 [19367] ha-idg-1pengine: info: common_print:
vm-mausdb   (ocf::lentes:VirtualDomain):Started ha-idg-1
Aug 03 00:14:04 [19367] ha-idg-1pengine: info: common_print:vm-sim  
(ocf::lentes:VirtualDomain):Started ha-idg-1  <===
Aug 03 00:14:04 [19367] ha-idg-1pengine: info: common_print:
vm-geneious (ocf::lentes:VirtualDomain):Started ha-idg-1  <===
Aug 03 00:14:04 [19367] ha-idg-1pengine: info: common_print:
vm-idcc-devel   (ocf::lentes:VirtualDomain):Started ha-idg-1
Aug 03 00:14:04 [19367] ha-idg-1pengine: info: common_print:
vm-genetrap (ocf::lentes:VirtualDomain):Started ha-idg-1
Aug 03 00:14:04 [19367] ha-idg-1pengine: info: common_print:
vm-mouseidgenes (ocf::lentes:VirtualDomain):Started ha-idg-1
Aug 03 00:14:04 [19367] ha-idg-1pengine: info: common_print:
vm-greensql (ocf::lentes:VirtualDomain):Started ha-idg-1
Aug 03 00:14:04 [19367] ha-idg-1pengine: info: common_print:
vm-severin  (ocf::lentes:VirtualDomain):Started ha-idg-1
Aug 03 00:14:04 [19367] ha-idg-1pengine: info: common_print:
ping_19216810010(ocf::pacemaker:ping):  Stopped (disabled)
Aug 03 00:14:04 [19367] ha-idg-1pengine: 

Re: [ClusterLabs] [EXT] Problem with DLM

2022-07-26 Thread Lentes, Bernd


- On 26 Jul, 2022, at 20:06, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de 
wrote:

> Hi Bernd!
> 
> I think the answer may be some time before the timeout was reported; maybe a
> network issue? Or a very high load. It's hard to say from the logs...

Yes, i had a high load before:
Jul 20 00:17:42 [32512] ha-idg-1   crmd:   notice: 
throttle_check_thresholds:   High CPU load detected: 90.080002
Jul 20 00:18:12 [32512] ha-idg-1   crmd:   notice: 
throttle_check_thresholds:   High CPU load detected: 76.169998
Jul 20 00:18:42 [32512] ha-idg-1   crmd:   notice: 
throttle_check_thresholds:   High CPU load detected: 85.629997
Jul 20 00:19:12 [32512] ha-idg-1   crmd:   notice: 
throttle_check_thresholds:   High CPU load detected: 70.660004
Jul 20 00:19:42 [32512] ha-idg-1   crmd:   notice: 
throttle_check_thresholds:   High CPU load detected: 58.34
Jul 20 00:20:12 [32512] ha-idg-1   crmd: info: 
throttle_check_thresholds:   Moderate CPU load detected: 48.740002
Jul 20 00:20:12 [32512] ha-idg-1   crmd: info: throttle_send_command:   
New throttle mode: 0010 (was 0100)
Jul 20 00:20:42 [32512] ha-idg-1   crmd: info: 
throttle_check_thresholds:   Moderate CPU load detected: 41.88
Jul 20 00:21:12 [32512] ha-idg-1   crmd: info: throttle_send_command:   
New throttle mode: 0001 (was 0010)
Jul 20 00:21:56 [12204] ha-idg-1   lrmd:  warning: child_timeout_callback:  
dlm_monitor_3 process (PID 11816) timed out
Jul 20 00:21:56 [12204] ha-idg-1   lrmd:  warning: operation_finished:  
dlm_monitor_3:11816 - timed out after 2ms
Jul 20 00:21:56 [32512] ha-idg-1   crmd:error: process_lrm_event:   
Result of monitor operation for dlm on ha-idg-1: Timed Out | call=1255 
key=dlm_monitor_3 timeout=2ms
Jul 20 00:21:56 [32512] ha-idg-1   crmd: info: exec_alert_list: Sending 
resource alert via smtp_alert to informatic@helmholtz-muenchen.de
Jul 20 00:21:56 [12204] ha-idg-1   lrmd: info: process_lrmd_alert_exec: 
Executing alert smtp_alert for 8f934e90-12f5-4bad-b4f4-55ac933f01c6

Can that interfere with DLM ?

Bernd



[ClusterLabs] Problem with DLM

2022-07-26 Thread Lentes, Bernd
Hi,

it seems my DLM went crazy:

/var/log/cluster/corosync.log:
Jul 20 00:21:56 [12204] ha-idg-1   lrmd:  warning: child_timeout_callback:  
dlm_monitor_3 process (PID 11816) timed out
Jul 20 00:21:56 [12204] ha-idg-1   lrmd:  warning: operation_finished:  
dlm_monitor_3:11816 - timed out after 2ms
Jul 20 00:21:56 [32512] ha-idg-1   crmd:error: process_lrm_event:   
Result of monitor operation for dlm on ha-idg-1: Timed Out | call=1255 
key=dlm_monitor_3 timeout=2ms
Jul 20 00:21:56 [32512] ha-idg-1   crmd: info: exec_alert_list: Sending 
resource alert via smtp_alert to informatic@helmholtz-muenchen.de

/var/log/messages:
2022-07-20T00:21:56.644677+02:00 ha-idg-1 Cluster: alert_smtp.sh
2022-07-20T00:22:16.076936+02:00 ha-idg-1 kernel: [2366794.757496] dlm: 
FD5D3C7CE9104CF5916A84DA0DBED302: leaving the lockspace group...
2022-07-20T00:22:16.364971+02:00 ha-idg-1 kernel: [2366795.045657] dlm: 
FD5D3C7CE9104CF5916A84DA0DBED302: group event done 0 0
2022-07-20T00:22:16.364982+02:00 ha-idg-1 kernel: [2366795.045777] dlm: 
FD5D3C7CE9104CF5916A84DA0DBED302: release_lockspace final free
2022-07-20T00:22:15.533571+02:00 ha-idg-1 Cluster: message repeated 22 times: [ 
alert_smtp.sh]
2022-07-20T00:22:17.164442+02:00 ha-idg-1 ocfs2_hb_ctl[19106]: ocfs2_hb_ctl 
/sbin/ocfs2_hb_ctl -K -u FD5D3C7CE9104CF5916A84DA0DBED302
2022-07-20T00:22:18.904936+02:00 ha-idg-1 kernel: [2366797.586278] ocfs2: 
Unmounting device (254,24) on (node 1084777482)
2022-07-20T00:22:19.116701+02:00 ha-idg-1 Cluster: alert_smtp.sh

What do these kernel messages mean ? Why stopped DLM ? I think this is the 
second time this happened. It is really a show stopper because node is fenced 
some minutes later:
00:34:40.709002 ha-idg: Fencing Operation Off of ha-idg-1 by ha-idg-2 for 
crmd.28253@ha-idg-2: OK (ref=9710f0e2-a9a9-42c3-a294-ed0bd78bba1a)

What can i do ? Is there an alternative DLM ? 
System is SLES 12 SP5. Update to SLES 15 SP3 ?

Bernd



-- 
Bernd Lentes 
System Administrator 
Institute for Metabolism and Cell Death (MCD) 
Building 25 - office 122 
HelmholtzZentrum München 
bernd.len...@helmholtz-muenchen.de 
phone: +49 89 3187 1241
   +49 89 3187 49123 
fax:   +49 89 3187 2294 
http://www.helmholtz-muenchen.de/mcd 



[ClusterLabs] is there a way to cancel a running live migration or a "resource stop" ?

2022-07-07 Thread Lentes, Bernd
Hi,

is there a way to cancel a running live migration or a "resource stop" ?

Bernd

-- 
Bernd Lentes 
System Administrator 
Institute for Metabolism and Cell Death (MCD) 
Building 25 - office 122 
HelmholtzZentrum München 
bernd.len...@helmholtz-muenchen.de 
phone: +49 89 3187 1241
   +49 89 3187 49123 
fax:   +49 89 3187 2294 
http://www.helmholtz-muenchen.de/mcd 



Re: [ClusterLabs] modified RA can't be used

2022-06-27 Thread Lentes, Bernd


- On Jun 27, 2022, at 3:54 PM, kgaillot kgail...@redhat.com wrote:

> As an aside, the preferred naming for custom agents is to change the
> provider (ocf:PROVIDER:AGENT), putting them in
> /usr/lib/ocf/resource.d/PROVIDER/AGENT.
> 
> For example, ocf:local:VirtualDomain or ocf:mcd:VirtualDomain
> 
> The main advantage is having your own namespace and not having to worry
> about a current or future resource-agents package having any side
> effects on your agent.
> 

Hi Ken,

i did it this way.

Bernd



Re: [ClusterLabs] modified RA can't be used

2022-06-27 Thread Lentes, Bernd


- On Jun 27, 2022, at 2:57 PM, Oyvind Albrigtsen oalbr...@redhat.com wrote:

> You need to update the agent name in the metadata section to be the
> same as the filename.
> 
> 
> Oyvind
> 

OMG. Thank you !!!

Bernd



[ClusterLabs] modified RA can't be used

2022-06-27 Thread Lentes, Bernd
Hi,

i adapted the RA ocf/heartbeat/VirtualDomain to my needs and renamed it to 
VirtualDomain.ssh
When i try to use it now, i get an error message.
I run e.g. "crm configure edit vm-idcc-devel" to modify an existing 
VirtualDomain so that it uses the new RA,
and when I want to save it I get the following error:
ERROR: ocf:heartbeat:VirtualDomain.ssh: got no meta-data, does this RA exist?
ERROR: ocf:heartbeat:VirtualDomain.ssh: got no meta-data, does this RA exist?
ERROR: ocf:heartbeat:VirtualDomain.ssh: no such resource agent

The RA exists in the filesystem and has the same permissions as the original:
ha-idg-1:~ # ll /usr/lib/ocf/resource.d/heartbeat/Virt*
-rwxr-xr-x 1 root root 35607 Feb 15 07:21 
/usr/lib/ocf/resource.d/heartbeat/VirtualDomain
-rwxr-xr-x 1 root root 35747 Jun 27 14:22 
/usr/lib/ocf/resource.d/heartbeat/VirtualDomain.ssh

The difference is only in one line i added:
ha-idg-1:~ # diff /usr/lib/ocf/resource.d/heartbeat/VirtualDomain 
/usr/lib/ocf/resource.d/heartbeat/VirtualDomain.ssh
732a733,734
>  ssh -i /root/ssh/id_rsa.mcd.shutdown mcd.shutdown@${DOMAIN_NAME} shutdown.bat
>  ## new by bernd.len...@helmholtz-muenchen.de 26062022

I also copied the new RA to another folder ... same problem.
When i try to get info about the new RA i get the same error:
ha-idg-1:~ # crm ra info ocf:heartbeat:VirtualDomain.ssh
ERROR: ocf:heartbeat:VirtualDomain.ssh: got no meta-data, does this RA exist?

The VirtualDomain is shutdown. It's a two-node cluster with SLES 12 SP5,
RA exists on both nodes and is identical:

ha-idg-1:~ # sha1sum /usr/lib/ocf/resource.d/heartbeat/VirtualDomain.ssh
8d075cb0745c674525802f94d4d7d2b88af8156c  
/usr/lib/ocf/resource.d/heartbeat/VirtualDomain.ssh

ha-idg-2:~ # sha1sum /usr/lib/ocf/resource.d/heartbeat/VirtualDomain.ssh
8d075cb0745c674525802f94d4d7d2b88af8156c  
/usr/lib/ocf/resource.d/heartbeat/VirtualDomain.ssh

Any ideas ?
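
(For what it's worth, one quick check along the lines of the answer elsewhere in this
thread, assuming the paths above: the metadata the agent prints about itself carries a
name attribute that has to match the new filename:

  grep -m1 'resource-agent name=' /usr/lib/ocf/resource.d/heartbeat/VirtualDomain.ssh
  # if this still shows name="VirtualDomain", the metadata name does not match the renamed file
)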

Bernd


-- 

Bernd Lentes 
System Administrator 
Institute for Metabolism and Cell Death (MCD) 
Building 25 - office 122 
HelmholtzZentrum München 
bernd.len...@helmholtz-muenchen.de 
phone: +49 89 3187 1241 
fax: +49 89 3187 2294 
http://www.helmholtz-muenchen.de/mcd 




[ClusterLabs] how does the VirtualDomain RA know with which options it's called ?

2022-05-12 Thread Lentes, Bernd
Hi,

from my understanding the resource agents in /usr/lib/ocf/resource.d/heartbeat 
are quite similar
to the old scripts in /etc/init.d started by init.
Init starts these scripts with "script [start|stop|reload|restart|status]".
Inside the script there is a case construct which checks the options the script 
is started with, and calls the appropriate function.

Similar to the init scripts the cluster calls the RA with "script 
[start|stop|monitor ...]"
But i'm missing this construct in the VirtualDomain RA. From where does it know 
how it is invoked ?
I don't see any logic which checks the options the script is called with.
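
For comparison, this is the kind of dispatcher I was expecting to find. The following is
only a generic sketch, not VirtualDomain's actual code; my_start/my_stop/my_monitor are
placeholders:

  #!/bin/sh
  # minimal OCF-style agent sketch: the action is passed as the first argument
  : ${OCF_FUNCTIONS_DIR=${OCF_ROOT:-/usr/lib/ocf}/lib/heartbeat}
  . ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

  my_start()   { ocf_log info "starting";  return $OCF_SUCCESS; }
  my_stop()    { ocf_log info "stopping";  return $OCF_SUCCESS; }
  my_monitor() { return $OCF_NOT_RUNNING; }

  case "$1" in
      meta-data)      echo "<resource-agent/>"; exit $OCF_SUCCESS ;;
      start)          my_start ;;
      stop)           my_stop ;;
      monitor|status) my_monitor ;;
      *)              exit $OCF_ERR_UNIMPLEMENTED ;;
  esac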

Bernd
-- 

Bernd Lentes 
System Administrator 
Institute for Metabolism and Cell Death (MCD) 
Building 25 - office 122 
HelmholtzZentrum München 
bernd.len...@helmholtz-muenchen.de 
phone: +49 89 3187 1241 
fax: +49 89 3187 2294 
http://www.helmholtz-muenchen.de/mcd 




Re: [ClusterLabs] crm resource stop VirtualDomain - but VirtualDomain shutdown start some minutes later

2022-02-18 Thread Lentes, Bernd


- On Feb 17, 2022, at 4:25 PM, kgaillot kgail...@redhat.com wrote:
>> So for me the big question is:
>> When a transition is happening, and there is a change in the cluster,
>> is the transition "aborted"
>> (delayed or interrupted would be better) or not ?
>> Is this behaviour consistent ? If no, from what does it depend ?
>> 
>> Bernd
> 
> Yes, anytime the DC sees a change that could affect resources, it will
> abort the current transition and calculate a new one. Aborting means
> not initiating any new actions from the transition -- but any actions
> currently in flight must complete before the new transition can be
> calculated.
> 
> Changes that abort a transition include configuration changes, a node
> joining or leaving, an unexpected action result being received, a node
> attribute changing, the cluster-recheck-interval passing since the last
> transition, or a timer popping for a time-based event (failure timeout,
> rule, etc.). I may be forgetting some, but you get the idea.
> --

Hi Ken,

thanks for your explanation. 
Now I'll try to summarize, to see whether I understood everything correctly:
I started the shutdown of several VirtualDomains with "crm resource stop 
vm_xxx",
not concurrently but one by one, with a delay of about 30 sec.
But there was already one VirtualDomain shutting down before that.
The cluster said the transition was aborted, but in reality it couldn't be aborted. 
How do you abort a running shutdown ?
So we had to wait for the shutdown of that domain.
It has been switched off by libvirt with "virsh destroy" after 10 minutes.
After that the shutdown of the other domains was initiated, and the domains 
shutdown cleanly.

So, to conclude:
I forgot that I already had one domain shutting down. I should have waited for 
this to finish before starting to stop the other resources.
The cluster tried to "abort" the transition, but a running shutdown can't be aborted.
And I had bad luck that the shutdown of this domain took so long.

Correct ?

Bernd





Re: [ClusterLabs] crm resource stop VirtualDomain - but VirtualDomain shutdown start some minutes later

2022-02-17 Thread Lentes, Bernd

- On Feb 16, 2022, at 6:48 PM, arvidjaar arvidj...@gmail.com wrote:
> 
> 
> Splitting logs between different messages does not really help in interpreting
> them.

I agree.
Here is the complete excerpt from the respective time:
https://nc-mcd.helmholtz-muenchen.de/nextcloud/s/eY8SA8pe4HZBBc8

> 
> I guess the real question here is why "Transition aborted" is logged although
> transition apparently continues. Transition 128 started at 20:54:30 and
> completed
> at 21:04:26, but there were multiple "Transition 128 aborted" messages in
> between

That's correct. The shutdown_timeout for the domain is set with 600 sec. in the 
CIB.
The RA says:
# The "shutdown_timeout" we use here is the operation
# timeout specified in the CIB, minus 5 seconds
And between 20:54:30 and 21:04:26 we have very close to 595 sec.

> It looks like "Transition aborted" is more "we try to abort this transition if
> possible". My guess is that pacemaker must wait for currently running 
> action(s)
> which can take quite some time when stopping virtual domain. Transition 128
> was initiated when stopping vm_pathway, but we have no idea when it was 
> stopped.

We have:
Feb 15 21:04:26 [15370] ha-idg-2   crmd:   notice: run_graph:   
Transition 128 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-3548.bz2): Complete

and the log from libvirt confirms it:
/var/log/libvirtd/qemu/vm_pathway.log:
2022-02-15T20:04:26.569471Z qemu-system-x86_64: terminating on signal 15 from 
pid 7368 (/usr/sbin/libvirtd)
2022-02-15 20:04:26.769+: shutting down, reason=destroyed

Time in libvirt logs is UTC, and in Munich we have currently UTC+1, so the time 
differs in the logs.
We see that the domain is "switched off" via libvirt exactly at 21:04:26.

So for me the big question is:
When a transition is happening, and there is a change in the cluster, is the 
transition "aborted"
(delayed or interrupted would be better) or not ?
Is this behaviour consistent ? If not, what does it depend on ?

Bernd






Re: [ClusterLabs] Antw: Antw: Re: Antw: [EXT] Re: crm resource stop VirtualDomain ‑ but VirtualDomain shutdown start some minutes later

2022-02-17 Thread Lentes, Bernd


- On Feb 17, 2022, at 10:26 AM, Ulrich Windl 
ulrich.wi...@rz.uni-regensburg.de wrote:

 "Ulrich Windl"  schrieb am 17.02.2022

> 
> To correct myself: crm was a "-w" (wait) option that will wait until the DC is
> idle. In most cases it just waits until the operation requeszted has completed
> (or failed).
Hi Ulrich,

but stopping the domains with -w would take very long. We have 20 
VirtualDomains.
Our UPS does not have enough capacity to wait that long.
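
Just to illustrate the difference I mean (a sketch only; the VM names are examples):

  # with -w: each stop waits until the DC is idle again, so the stops never overlap
  for vm in vm_greensql vm_ssh vm_sim; do crm -w resource stop "$vm"; done

  # without -w, with a small delay: much faster overall, but the stops overlap
  for vm in vm_greensql vm_ssh vm_sim; do crm resource stop "$vm"; sleep 20; done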

Bernd



Re: [ClusterLabs] crm resource stop VirtualDomain - but VirtualDomain shutdown start some minutes later

2022-02-16 Thread Lentes, Bernd


- On Feb 16, 2022, at 12:52 AM, kgaillot kgail...@redhat.com wrote:

> A transition is the set of actions that need to be taken in response to
> current conditions. A transition is aborted any time conditions change
> (here, the target-role being changed in the configuration), so that a
> new set of actions can be calculated.
> 
> Someone once defined a transition as an "action plan", and I'm tempted
> to use that instead. Plus maybe replace "aborted" with "interrupted",
> so then we'd have "Action plan interrupted" which is maybe a little
> more understandable.

These "transition  aborted" happen quite often.

Feb 15 20:53:25 [15370] ha-idg-2   crmd:   notice: abort_transition_graph:  
Transition 126 aborted by vm_documents-oo-meta_attributes-target-role doing 
modify target-role=Stopped: Configuration change | cib=7.27453.0 
source=te_update
_diff_v2:483 
path=/cib/configuration/resources/primitive[@id='vm_documents-oo']/meta_attributes[@id='vm_documents-oo-meta_attributes']/nvpair[@id='vm_documents-oo-meta_attributes-target-role']
 complete=false
  
Feb 15 20:53:00 [15370] ha-idg-2   crmd: info: abort_transition_graph:  
Transition 125 aborted by vm_amok-meta_attributes-target-role doing modify 
target-role=Stopped: Configuration change | cib=7.27452.0 
source=te_update_diff_v2
:483 
path=/cib/configuration/resources/primitive[@id='vm_amok']/meta_attributes[@id='vm_amok-meta_attributes']/nvpair[@id='vm_amok-meta_attributes-target-role']
 complete=true

Why is there sometimes "complete=true" and sometimes "complete=false" ?
What does that mean ?

Bernd



Re: [ClusterLabs] Antw: [EXT] Re: crm resource stop VirtualDomain ‑ but VirtualDomain shutdown start some minutes later

2022-02-16 Thread Lentes, Bernd
- On Feb 16, 2022, at 1:01 PM, Ulrich Windl 
ulrich.wi...@rz.uni-regensburg.de wrote:
> Bernd,
> 
> I guess the syslog(/journal of the DC has better logs.

Unfortunately the journal didn't reveal something.

> As I see it now, it seems stop of vm_pathway takes a few minutes, and no other
> action is started befor that is done.
> I think I once said it "Clusters are not for the impatient", i.e.: Don't 
> start a
> noew action when the previous action did not complete yet.

Does that mean that when I want to shut down some VirtualDomains I have to do 
this one by one,
always waiting for the complete shutdown before stopping the next one ?
That could take very long.

Bernd




Re: [ClusterLabs] crm resource stop VirtualDomain - but VirtualDomain shutdown start some minutes later

2022-02-16 Thread Lentes, Bernd


- On Feb 16, 2022, at 12:52 AM, kgaillot kgail...@redhat.com wrote:


>> Any idea ?
>> What is about that transition 128, which is aborted ?
> 
> A transition is the set of actions that need to be taken in response to
> current conditions. A transition is aborted any time conditions change
> (here, the target-role being changed in the configuration), so that a
> new set of actions can be calculated.
> 
> Someone once defined a transition as an "action plan", and I'm tempted
> to use that instead. Plus maybe replace "aborted" with "interrupted",
> so then we'd have "Action plan interrupted" which is maybe a little
> more understandable.
> 
>> 
>> Transition 128 is finished:
>> Feb 15 21:04:26 [15370] ha-idg-2   crmd:   notice:
>> run_graph:   Transition 128 (Complete=1, Pending=0, Fired=0,
>> Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-
>> 3548.bz2): Complete
>> 
>> And one second later the shutdown starts. Is that normal that there
>> is such a big time gap ?
>>
> 
> No, there should be another transition calculated (with a "saving
> input" message) immediately after the original transition is aborted.
> What's the timestamp on that?
> --

Hi Ken,

this is what i found:

Feb 15 20:54:30 [15369] ha-idg-2pengine:   notice: process_pe_message:  
Calculated transition 128, saving inputs in 
/var/lib/pacemaker/pengine/pe-input-3548.bz2
Feb 15 20:54:30 [15370] ha-idg-2   crmd: info: do_state_transition: 
State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE | input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response
Feb 15 20:54:30 [15370] ha-idg-2   crmd:   notice: do_te_invoke:
Processing graph 128 (ref=pe_calc-dc-1644954870-403) derived from 
/var/lib/pacemaker/pengine/pe-input-3548.bz2
Feb 15 20:54:30 [15370] ha-idg-2   crmd:   notice: te_rsc_command:  
Initiating stop operation vm_pathway_stop_0 locally on ha-idg-2 | action 76

Feb 15 21:04:26 [15369] ha-idg-2pengine:   notice: process_pe_message:  
Calculated transition 129, saving inputs in 
/var/lib/pacemaker/pengine/pe-input-3549.bz2
Feb 15 21:04:26 [15370] ha-idg-2   crmd: info: do_state_transition: 
State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE | input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response
Feb 15 21:04:26 [15370] ha-idg-2   crmd:   notice: do_te_invoke:
Processing graph 129 (ref=pe_calc-dc-1644955466-405) derived from 
/var/lib/pacemaker/pengine/pe-input-3549.bz2

Bernd



[ClusterLabs] crm resource stop VirtualDomain - but VirtualDomain shutdown start some minutes later

2022-02-15 Thread Lentes, Bernd
Hi,

i have a weird behaviour in my 2-node-cluster.
I stopped several VirtualDomains via "crm resource stop VirtualDomain", but the 
respective shutdown starts minutes later.
All on the same host.

.bash_history:
 
 3520  2022-02-15 20:55:44 crm resource stop vm_greensql
 3521  2022-02-15 20:56:34 crm resource stop vm_ssh
 3522  2022-02-15 20:57:23 crm resource stop vm_sim
 3523  2022-02-15 20:58:38 crm resource stop vm_mouseidgenes
 3524  2022-02-15 21:00:24 crm resource stop vm_genetrap
 3525  2022-02-15 21:01:25 crm resource stop vm_severin
 3526  2022-02-15 21:01:34 crm resource stop vm_idcc_devel

/var/log/cluster/corosync.log:

Feb 15 20:55:45 [15365] ha-idg-2cib: info: cib_perform_op:  Diff: 
--- 7.27455.0 2
Feb 15 20:55:45 [15365] ha-idg-2cib: info: cib_perform_op:  Diff: 
+++ 7.27456.0 138c70d41548c4cb1d767dd578a98b8f
Feb 15 20:55:45 [15365] ha-idg-2cib: info: cib_perform_op:  +  
/cib:  @epoch=27456
Feb 15 20:55:45 [15365] ha-idg-2cib: info: cib_perform_op:  +  
/cib/configuration/resources/primitive[@id='vm_greensql']/meta_attributes[@id='vm_greensql-meta_attributes']/nvpair[@id='vm_greensql-meta_attributes-target-role']:
  @value=Stopped
Feb 15 20:55:45 [15365] ha-idg-2cib: info: cib_process_request: 
Completed cib_apply_diff operation for section 'all': OK (rc=0, 
origin=ha-idg-1/cibadmin/2, version=7.27456.0)
Feb 15 20:55:45 [15370] ha-idg-2   crmd: info: abort_transition_graph:  
Transition 128 aborted by vm_greensql-meta_attributes-target-role doing modify 
target-role=Stopped: Configuration change | cib=7.27456.0 
source=te_update_diff_v2:483 
path=/cib/configuration/resources/primitive[@id='vm_greensql']/meta_att
ributes[@id='vm_greensql-meta_attributes']/nvpair[@id='vm_greensql-meta_attributes-target-role']
 complete=false
 ...
Feb 15 20:56:35 [15365] ha-idg-2cib: info: cib_perform_op:  Diff: 
--- 7.27456.0 2
Feb 15 20:56:35 [15365] ha-idg-2cib: info: cib_perform_op:  Diff: 
+++ 7.27457.0 (null)
Feb 15 20:56:35 [15365] ha-idg-2cib: info: cib_perform_op:  +  
/cib:  @epoch=27457
Feb 15 20:56:35 [15365] ha-idg-2cib: info: cib_perform_op:  +  
/cib/configuration/resources/primitive[@id='vm_ssh']/meta_attributes[@id='vm_ssh-meta_attributes']/nvpair[@id='vm_ssh-meta_attributes-target-role']:
  @value=Stopped
Feb 15 20:56:35 [15365] ha-idg-2cib: info: cib_process_request: 
Completed cib_apply_diff operation for section 'all': OK (rc=0, 
origin=ha-idg-1/cibadmin/2, version=7.27457.0)
Feb 15 20:56:35 [15370] ha-idg-2   crmd: info: abort_transition_graph:  
Transition 128 aborted by vm_ssh-meta_attributes-target-role doing modify 
target-role=Stopped: Configuration change | cib=7.27457.0 
source=te_update_diff_v2:483 
path=/cib/configuration/resources/primitive[@id='vm_ssh']/meta_attributes[@i
d='vm_ssh-meta_attributes']/nvpair[@id='vm_ssh-meta_attributes-target-role'] 
complete=false
 ...
Feb 15 20:57:24 [15365] ha-idg-2cib: info: cib_perform_op:  Diff: 
--- 7.27457.0 2
Feb 15 20:57:24 [15365] ha-idg-2cib: info: cib_perform_op:  Diff: 
+++ 7.27458.0 7f91d8e52c8ff0887916ad921703fadd
Feb 15 20:57:24 [15365] ha-idg-2cib: info: cib_perform_op:  +  
/cib:  @epoch=27458
Feb 15 20:57:24 [15365] ha-idg-2cib: info: cib_perform_op:  +  
/cib/configuration/resources/primitive[@id='vm_sim']/meta_attributes[@id='vm_sim-meta_attributes']/nvpair[@id='vm_sim-meta_attributes-target-role']:
  @value=Stopped
Feb 15 20:57:24 [15365] ha-idg-2cib: info: cib_process_request: 
Completed cib_apply_diff operation for section 'all': OK (rc=0, 
origin=ha-idg-1/cibadmin/2, version=7.27458.0)
Feb 15 20:57:24 [15370] ha-idg-2   crmd: info: abort_transition_graph:  
Transition 128 aborted by vm_sim-meta_attributes-target-role doing modify 
target-role=Stopped: Configuration change | cib=7.27458.0 
source=te_update_diff_v2:483 
path=/cib/configuration/resources/primitive[@id='vm_sim']/meta_attributes[@i
d='vm_sim-meta_attributes']/nvpair[@id='vm_sim-meta_attributes-target-role'] 
complete=false
 ...
Feb 15 20:58:39 [15365] ha-idg-2cib: info: cib_perform_op:  Diff: 
--- 7.27458.0 2
Feb 15 20:58:39 [15365] ha-idg-2cib: info: cib_perform_op:  Diff: 
+++ 7.27459.0 727c5953b33542602028bf903b0578bc
Feb 15 20:58:39 [15365] ha-idg-2cib: info: cib_perform_op:  +  
/cib:  @epoch=27459
Feb 15 20:58:39 [15365] ha-idg-2cib: info: cib_perform_op:  +  
/cib/configuration/resources/primitive[@id='vm_mouseidgenes']/meta_attributes[@id='vm_mouseidgenes-meta_attributes']/nvpair[@id='vm_mouseidgenes-meta_attributes-target-role']:
  @value=Stopped
Feb 15 20:58:39 [15370] ha-idg-2   crmd: info: abort_transition_graph:  
Transition 128 aborted by vm_mouseidgenes-meta_attributes-target-role doing 
modify target-role=Stopped: Configuration change | cib=7.27459.0 

Re: [ClusterLabs] what is the "best" way to completely shutdown a two-node cluster ?

2022-02-10 Thread Lentes, Bernd


- On Feb 10, 2022, at 4:40 PM, Jehan-Guillaume de Rorthais j...@dalibo.com 
wrote:

> 
> I wonder if after the cluster shutdown complete, the target-role=Stopped could
> be removed/edited offline with eg. crmadmin? That would make VirtualDomain
> startable on boot.
> 
> I suppose this would not be that simple as it would require to update it on 
> all
> nodes, taking care of the CIB version, hash, etc... But maybe some tooling
> could take care of this?
> 
> Last, if Bernd need to stop gracefully the VirtualDomain paying attention to
> the I/O load, maybe he doesn't want them start automatically on boot for the
> exact same reason anyway?

I start the cluster manually (systemctl start pacemaker) and have no 
problem starting the VirtualDomains
by hand one after the other. I prefer that over an "automatic" solution.

Bernd



Re: [ClusterLabs] what is the "best" way to completely shutdown a two-node cluster ?

2022-02-09 Thread Lentes, Bernd


- On Feb 9, 2022, at 11:26 AM, Jehan-Guillaume de Rorthais j...@dalibo.com 
wrote:


> 
> I'm not sure how "crm resource stop " actually stop a resource. I thought
> it would set "target-role=Stopped", but I might be wrong.
> 
> If "crm resource stop" actually use "target-role=Stopped", I believe the
> resources would not start automatically after setting
> "stop-all-resources=false".
> 

ha-idg-2:~ # crm resource help stop
Stop resources

Stop one or more resources using the target-role attribute. If there
are multiple meta attributes sets, the attribute is set in all of
them. If the resource is a clone, all target-role attributes are
removed from the children resources.

For details on group management see
options manage-children.

Usage:

stop <rsc> [<rsc> ...]
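
So stop only sets a meta attribute; one way to see the effect in the configuration
afterwards (the resource name is just an example):

  crm configure show vm_greensql | grep target-role
  # expected after a stop: a line containing something like  meta target-role=Stopped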

Bernd



Re: [ClusterLabs] what is the "best" way to completely shutdown a two-node cluster ?

2022-02-09 Thread Lentes, Bernd


- On Feb 7, 2022, at 4:13 PM, Jehan-Guillaume de Rorthais j...@dalibo.com 
wrote:

> On Mon, 7 Feb 2022 14:24:44 +0100 (CET)
> "Lentes, Bernd"  wrote:
> 
>> Hi,
>> 
>> i'm currently changing a bit in my cluster because i realized that my
>> configuration for a power outtage didn't work as i expected. My idea is
>> currently:
>> - first stop about 20 VirtualDomains, which are my services. This will surely
>> takes some minutes. I'm thinking of stopping each with a time difference of
>> about 20 seconds for not getting to much IO load. and then ...
>> - how to stop the other resources ?
> 
> I would set cluster option "stop-all-resources" so all remaining resources are
> stopped gracefully by the cluster.
> 
> Then you can stop both nodes using eg. "crm cluster stop".
> 
> On restart, after both nodes are up and joined to the cluster, you can set
> "stop-all-resources=false", then start your VirtualDomains.

Aren't  the VirtualDomains already started by "stop-all-resources=false" ?

I wrote a script for the whole procedure which is triggered by the UPS.
As I am not a big shell-script writer, please have a look and tell me your 
opinion.
You find it here: 
https://nc-mcd.helmholtz-muenchen.de/nextcloud/s/rEA9bFxs5Ay6fYG
Thanks.

Bernd




Re: [ClusterLabs] Antw: [EXT] what is the "best" way to completely shutdown a two‑node cluster ?

2022-02-07 Thread Lentes, Bernd


- On Feb 7, 2022, at 2:36 PM, Ulrich Windl 
ulrich.wi...@rz.uni-regensburg.de wrote:


> 
> Bernd,
> 
> what if you set the node affected to standby, or shut down the cluster
> services? Or all all nodes powered by the same UPS?

All nodes are powered by the same UPS.

> 
> 
>> 
>> And what is if both nodes are running ? Can i do that simultaneously on both
> 
>> nodes ?
> 
> I guess that should work.
> 
>> My OS is SLES 12 SP5, pacemaker is 1.1.23, corosync is 2.3.6-9.13.1
> 
> Your action plan depends on what the VMNs are doing: basically every HA
> resource should survive a hard restart without much damange.

Well, some vm's have databases, i'd like to shutdown these cleanly.

> So maybe an option could be: Do nothing, or do an emergency shutdown of the
> node without properly migrating all the VMs elsewhere.

The VM's don't need to be migrated, the whole cluster should stop in a 
reasonable time and manner.

> You cannot make an application HA by putting it in a VM; at least not in
> general.

I know. But the time gap between one node having problems and booting the vm on 
the other node
is ok for us.

Bernd



[ClusterLabs] what is the "best" way to completely shutdown a two-node cluster ?

2022-02-07 Thread Lentes, Bernd
Hi,

I'm currently changing a bit in my cluster because I realized that my 
configuration for a power outage didn't work as I expected.
My idea is currently:
- first stop about 20 VirtualDomains, which are my services. This will surely 
take some minutes.
I'm thinking of stopping each with a time difference of about 20 seconds to 
avoid too much IO load.
and then ...
- how to stop the other resources ?
- put the nodes into standby or offline ?
- do a systemctl stop pacemaker ?
- or do a crm cluster stop ?

And what is if both nodes are running ? Can i do that simultaneously on both 
nodes ?
My OS is SLES 12 SP5, pacemaker is 1.1.23, corosync is 2.3.6-9.13.1

Thanks for your help.

Bernd

-- 

Bernd Lentes 
System Administrator 
Institute for Metabolism and Cell Death (MCD) 
Building 25 - office 122 
HelmholtzZentrum München 
bernd.len...@helmholtz-muenchen.de 
phone: +49 89 3187 1241 
fax: +49 89 3187 2294 
http://www.helmholtz-muenchen.de/mcd 




[ClusterLabs] Is there a python package for pacemaker ?

2022-02-02 Thread Lentes, Bernd
Hi,

I need to write some scripts for our cluster. Until now I have written bash scripts.
But I'd like to learn Python. Is there a package for pacemaker ?
What i found is: https://pypi.org/project/pacemaker/ and i'm not sure what that 
is.

Thanks.

Bernd
-- 

Bernd Lentes 
System Administrator 
Institute for Metabolism and Cell Death (MCD) 
Building 25 - office 122 
HelmholtzZentrum München 
bernd.len...@helmholtz-muenchen.de 
phone: +49 89 3187 1241 
fax: +49 89 3187 2294 
http://www.helmholtz-muenchen.de/mcd 




[ClusterLabs] HA-Cluster, UPS and power outage - how is your setup ?

2022-02-01 Thread Lentes, Bernd
Hi,

we just experienced two power outages in a few days.
This showed me that our UPS configuration and the handling of resources on the 
cluster are insufficient.
We have a two-node cluster with SLES 12 SP5 and a Smart-UPS SRT 3000 from APC 
with Network Management Card.
The UPS is able to buffer the two nodes and some Hardware (SAN, Monitor) for 
about one hour.
Our resources are Virtual Domains, about 20 of different flavor and version.

Our primary goal is not to ride out a power outage for as long as possible, but to 
shut down all domains correctly after a defined time.

I'm currently thinking of waiting for a fixed time (maybe 15 minutes) and 
then doing a "crm resource stop" for the VirtualDomains in a script.
I would give the cluster some time for the shutdown (5-10 minutes) and 
afterwards shut down the nodes (via script).
I have to keep an eye on whether both nodes are running or only one of them.
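
Roughly along these lines (only a sketch; resource names, timings and the final
power-off are placeholders, not the real script):

  sleep 900                                    # stay on UPS power for 15 minutes first
  for vm in vm_greensql vm_ssh vm_sim; do      # ... all ~20 VirtualDomains
      crm resource stop "$vm"
      sleep 20
  done
  sleep 600                                    # give the cluster 5-10 minutes for the shutdowns
  crm cluster stop                             # stop pacemaker/corosync on this node
  systemctl poweroff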

How is your approach ?

Bernd

-- 

Bernd Lentes 
System Administrator 
Institute for Metabolism and Cell Death (MCD) 
Building 25 - office 122 
HelmholtzZentrum München 
bernd.len...@helmholtz-muenchen.de 
phone: +49 89 3187 1241 
fax: +49 89 3187 2294 
http://www.helmholtz-muenchen.de/mcd 




Re: [ClusterLabs] Problem with high load (IO)

2021-09-30 Thread Lentes, Bernd


- On Sep 30, 2021, at 3:55 AM, Gang He g...@suse.com wrote:


>> 
>> 1) No problems during this step, the procedure just needs a few seconds.
>> reflink is a binary. See reflink --help
>> Yes, it is a cluster filesystem. I do the procedure just on one node,
>> so i don't have duplicates.
>> 
>> 2) just with "cp source destination" to a NAS.
>> Yes, the problems appear during this step.
> Ok, when you cp the cloned file to the NAS directory,
> the NAS directory should be another file system, right?
> During the copying process, the original VM running will be affected,
> right?
> 

Yes, it's another fs. Yes, the running machine is affected.
It's getting slower and sometimes does not react, according to our monitoring 
software.

Bernd



Re: [ClusterLabs] Problem with high load (IO)

2021-09-29 Thread Lentes, Bernd


- On Sep 29, 2021, at 4:37 AM, Gang He g...@suse.com wrote:

> Hi Lentes,
> 
> Thank for your feedback.
> I have some questions as below,
> 1) how to clone these VM images from each ocfs2 nodes via reflink?
> do you encounter any problems during this step?
> I want to say, this is a shared file system, you do not clone all VM
> images from each node, duplicated.
> 2) after the cloned VM images are created, how do you copy these VM
> images? copy to another backup file system, right?
> The problem usually happened during this step?
> 
> Thanks
> Gang

1) No problems during this step, the procedure just needs a few seconds.
reflink is a binary. See reflink --help
Yes, it is a cluster filesystem. I do the procedure just on one node,
so i don't have duplicates.

2) just with "cp source destination" to a NAS.
Yes, the problems appear during this step.

Bernd



Re: [ClusterLabs] Problem with high load (IO)

2021-09-28 Thread Lentes, Bernd


- On Sep 27, 2021, at 2:51 PM, Pacemaker ML users@clusterlabs.org wrote:

> I would use something liek this:
> 
> ionice -c 2 -n 7 nice cp XXX YYY
> 
> Best Regards,
> Strahil Nikolov

Just for a better understanding:

As I read this command line, ionice does not apply to the copy itself, but to the 
nice program.
What is the advantage if only nice treats IO a bit more carefully ?
Is there a way in this command line to make ionice apply to the copy program ?

What about ionice -c 2 -n 7 (nice cp XXX YYY) ? With the parentheses both 
programs are executed in the same (sub)shell.
Would that help ?
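
(My current assumption, which I'd like to have confirmed: the IO scheduling class set
by ionice is inherited across fork/exec, so the cp started via nice should already run
with it. A way to check this on a sufficiently large test file — paths are placeholders:

  ionice -c 2 -n 7 nice cp /path/to/big_file /path/to/copy &
  ionice -p $!          # should report something like: best-effort: prio 7
)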

Bernd




Re: [ClusterLabs] Problem with high load (IO)

2021-09-27 Thread Lentes, Bernd


- On Sep 27, 2021, at 2:51 PM, Pacemaker ML users@clusterlabs.org wrote:

> I would use something liek this:
> 
> ionice -c 2 -n 7 nice cp XXX YYY
> 
> Best Regards,
> Strahil Nikolov
> 

Hi Strahil,

that sounds interesting, i didn't know ionice.
I will have a look on the man-pages.

Thanks.

Bernd



[ClusterLabs] Problem with high load (IO)

2021-09-27 Thread Lentes, Bernd
Hi,

i have a two-node cluster running on SLES 12SP5 with two HP servers and a 
common FC SAN.
Most of my resources are virtual domains offering databases and web pages.
The disks from the domains reside on a OCFS2 Volume on a FC SAN.
Each night at 9 pm all domains are snapshotted with the OCFS2 tool reflink.
After the snapshot is created the disks of the domains are copied to a NAS, 
while the domains are still running.
The copy procedure occupies the CPU and IO intensively: IO is about 90% busy with 
the copy, and the CPU sometimes shows a wait of about 50%.
Because of that the domains aren't responsive, so the monitor operation 
from the RA sometimes fails.
In the worst case a domain is fenced.
What would you do in such a situation ?
I'm thinking of making the cp procedure nicer, with nice. Maybe about 10.

More ideas ?
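
For reference, the nightly job boils down to something like this (a simplified sketch
with the throttling idea added; paths and image names are placeholders, not the real
script):

  for img in /vmstore/ocfs2/*.raw; do
      snap="${img}.nightly"
      reflink "$img" "$snap"                                   # instant COW snapshot on OCFS2
      ionice -c 2 -n 7 nice -n 10 cp "$snap" /mnt/nas/backup/  # throttled copy while the domain keeps running
      rm -f "$snap"
  done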


Bernd

-- 

Bernd Lentes 
System Administrator 
Institute for Metabolism and Cell Death (MCD) 
Building 25 - office 122 
HelmholtzZentrum München 
bernd.len...@helmholtz-muenchen.de 
phone: +49 89 3187 1241 
phone: +49 89 3187 3827 
fax: +49 89 3187 2294 
http://www.helmholtz-muenchen.de/mcd 




Re: [ClusterLabs] cofigured trace for Virtual Domains - automatic restart ?

2021-09-23 Thread Lentes, Bernd

- On Sep 20, 2021, at 10:14 PM, kgaillot kgail...@redhat.com wrote:

> 
> As far as I know, only a few of the ocf:pacemaker agents support OCF
> 1.1 currently. The resource-agents package doesn't.
> 
> To check a given agent, run "crm_resource --show-metadata
> ocf:$PROVIDER:$AGENT | grep '<version>'" using the desired provider and
> agent name.
> 

ha-idg-1:/var/log/atop # crm_resource --show-meta 
ocf:heartbeat:VirtualDomain|grep -i version

<version>1.1</version>

So it's version 1.1.
But there is no operation "reload":

Operations' defaults (advisory minimum):

start timeout=90s
stop  timeout=90s
statustimeout=30s interval=10s
monitor   timeout=30s interval=10s
migrate_from  timeout=60s
migrate_totimeout=120s

Bernd




Re: [ClusterLabs] cofigured trace for Virtual Domains - automatic restart ?

2021-09-18 Thread Lentes, Bernd


- On Sep 18, 2021, at 1:19 AM, kgaillot kgail...@redhat.com wrote:

> 
> If the agent meta-data advertises support for the 1.1 standard and
> indicates that the trace_ra parameter is reloadable, then Pacemaker
> will automatically do a reload instead of restart for the resource if
> the parameter changes.

From where do I know that my RA supports 1.1 ?
I'm running on SLES 12 SP5 with resource-agents-4.3.018.a7fb5035-3.51.1.x86_64.
My crm does not support "crm resource reload":

ha-idg-1:~ # crm resource help
Resource management

At this level resources may be managed.

All (or almost all) commands are implemented with the CRM tools
such as crm_resource(8).

Commands:
   ...
refresh  Recheck current resource status and drop failure 
history
restart  Restart resources
  ...

> 
> trace_ra is unusual in that resource agents don't define the parameter
> themselves, the ocf-shellfuncs shell include looks for it instead. It
> would be nice to come up with a general solution that all agents can
> use rather than modify each agent's meta-data individually, but either
> approach would work.

Bernd




Re: [ClusterLabs] cofigured trace for Virtual Domains - automatic restart ?

2021-09-17 Thread Lentes, Bernd


- On Sep 17, 2021, at 9:13 PM, kgaillot kgail...@redhat.com wrote:

>> Bernd
> 
> Tracing works by setting a special parameter, which to pacemaker looks
> like a configuration change that requires a restart. With the new OCF
> 1.1 standard, the trace parameter could be marked reloadable, but the
> agents need to be updated to do that.
> --
> Ken Gaillot 

Hi Ken,

but does pacemaker do it automatically or do i have to initiate that ?

Bernd



[ClusterLabs] cofigured trace for Virtual Domains - automatic restart ?

2021-09-17 Thread Lentes, Bernd
Hi,

today i configured tracing for some VirtualDomains:

ha-idg-2:~ # crm resource trace vm_documents-oo migrate_from
INFO: Trace for vm_documents-oo:migrate_from is written to 
/var/lib/heartbeat/trace_ra/
INFO: Trace set, restart vm_documents-oo to trace the migrate_from operation

ha-idg-2:~ # crm resource trace vm_genetrap migrate_from
INFO: Trace for vm_genetrap:migrate_from is written to 
/var/lib/heartbeat/trace_ra/
INFO: Trace set, restart vm_genetrap to trace the migrate_from operation

I thought "Trace set, restart vm_genetrap to trace the migrate_from operation" 
is a hint to not forget the restart of the resource.
But all resources i configured tracing for did an automatic restart. 
Is that behaviour intended ?

Bernd

-- 

Bernd Lentes 
System Administrator 
Institute for Metabolism and Cell Death (MCD) 
Building 25 - office 122 
HelmholtzZentrum München 
bernd.len...@helmholtz-muenchen.de 
phone: +49 89 3187 1241 
phone: +49 89 3187 3827 
fax: +49 89 3187 2294 
http://www.helmholtz-muenchen.de/mcd 




[ClusterLabs] virtual domains not migrated

2021-09-14 Thread Lentes, Bernd
Hi,

Today I couldn't migrate several virtual domains. I have a two-node cluster 
with SuSE SLES 12 SP5.
Pacemaker is 
pacemaker-1.1.23+20200622.28dd98fad-3.9.2.20591.0.PTF.1177212.x86_64, corosync 
is corosync-2.3.6-9.13.1.x86_64.

The migrations just stopped after some time.
This is what i found in the logs:

Sep 14 12:28:54 [10498] ha-idg-2   lrmd:   notice: operation_finished:  
vm_genetrap_stop_0:22559:stderr [ error: Failed to shutdown domain vm_genetrap ]
Sep 14 12:28:54 [10498] ha-idg-2   lrmd:   notice: operation_finished:  
vm_genetrap_stop_0:22559:stderr [ error: Timed out during operation: cannot 
acquire state change lock (held by 
monitor=remoteDispatchDomainMigratePerform3Params) ]

Sep 14 12:37:36 [7831] ha-idg-1   lrmd:   notice: operation_finished:   
vm_crispor-server_stop_0:8002:stderr [ error: Failed to shutdown domain 
vm_crispor-server ]
Sep 14 12:37:36 [7831] ha-idg-1   lrmd:   notice: operation_finished:   
vm_crispor-server_stop_0:8002:stderr [ error: Timed out during operation: 
cannot acquire state change lock (held by 
monitor=remoteDispatchDomainMigratePrepareTunnel3Params) ]

Sep 14 12:17:16 [7831] ha-idg-1   lrmd:   notice: operation_finished:   
vm_seneca_stop_0:1546:stderr [ error: Failed to shutdown domain vm_seneca ]
Sep 14 12:17:16 [7831] ha-idg-1   lrmd:   notice: operation_finished:   
vm_seneca_stop_0:1546:stderr [ error: Timed out during operation: cannot 
acquire state change lock (held by 
monitor=remoteDispatchDomainMigratePrepareTunnel3Params) ]
Sep 14 12:17:16 [7831] ha-idg-1   lrmd:   notice: operation_finished:   
vm_seneca_stop_0:1546:stderr [  ]

Sep 14 12:07:40 [7831] ha-idg-1   lrmd:   notice: operation_finished:   
vm_geneious_stop_0:1545:stderr [ error: Failed to shutdown domain vm_geneious ]
Sep 14 12:07:40 [7831] ha-idg-1   lrmd:   notice: operation_finished:   
vm_geneious_stop_0:1545:stderr [ error: Timed out during operation: cannot 
acquire state change lock (held by 
monitor=remoteDispatchDomainMigratePrepareTunnel3Params) ]



Any ideas ?

Bernd


-- 

Bernd Lentes 
System Administrator 
Institute for Metabolism and Cell Death (MCD) 
Building 25 - office 122 
HelmholtzZentrum München 
bernd.len...@helmholtz-muenchen.de 
phone: +49 89 3187 1241 
phone: +49 89 3187 3827 
fax: +49 89 3187 2294 
http://www.helmholtz-muenchen.de/mcd 




Re: [ClusterLabs] Live migration possible with KSM ?

2021-03-31 Thread Lentes, Bernd


- On Mar 30, 2021, at 7:54 PM, hunter86 bg hunter86...@yahoo.com wrote:

> Keep in mind that KSM is highly cpu intensive and is most suitable for same 
> type
> of VMs,so similar memory pages will be merged until a change happen (and that
> change is allocated elsewhere).

> In oVirt migration is possible with KSM actively working, so it should work 
> with
> pacemaker.

> I doubt that KSM would be a problem... most probably performance would not be
> optimal.

> Best Regards,
> Strahil Nikolov

>> On Tue, Mar 30, 2021 at 19:47, Andrei Borzenkov
>>  wrote:
>> On 30.03.2021 18:16, Lentes, Bernd wrote:
>> > Hi,

>>> currently i'm reading "Mastering KVM Virtualization", published by Packt
>> > Publishing, a book i can really recommend.
>>> There are some proposals for tuning guests. One is KSM (kernel samepage
>> > merging), which sounds quite interesting.
>>> Especially in a system with lots of virtual machines with the same OS this 
>>> could
>> > lead to significant merory saving.
>>> I'd like to test, but i don't know if KSM maybe prevents live migration in a
>> > pacemaker cluster.

>> I do not think pacemaker cares or is aware about KSM. It just tells
>> resource agent to perform migration; what happens is entirely up to
>> resource agent.

>> If you can migrate without pacemaker you can also migrate with pacemaker.

Just to give some feedback:
I configured KSM on both nodes. On one of them it saves nearly 20 GB of RAM.
I checked live migration and it worked.
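
For reference, a minimal sketch of how to enable and check this (assuming the plain
sysfs interface of the ksm kernel thread; on SLES the ksm/ksmtuned services may manage
these values for you, so treat the numbers as illustrative):

echo 1 > /sys/kernel/mm/ksm/run        # 0 = off, 1 = run, 2 = unmerge everything
cat /sys/kernel/mm/ksm/pages_sharing   # pages currently deduplicated
# rough saving in MiB, assuming 4 KiB pages
echo $(( $(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024 ))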

Bernd

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Live migration possible with KSM ?

2021-03-30 Thread Lentes, Bernd
Hi,

currently I'm reading "Mastering KVM Virtualization", published by Packt
Publishing, a book I can really recommend.
There are some proposals for tuning guests. One is KSM (kernel samepage
merging), which sounds quite interesting.
Especially in a system with lots of virtual machines running the same OS this
could lead to significant memory savings.
I'd like to test it, but I don't know whether KSM perhaps prevents live migration
in a pacemaker cluster.
Does anyone know ?

Thanks.


Bernd

-- 

Bernd Lentes 
System Administrator 
Institute for Metabolism and Cell Death (MCD) 
Building 25 - office 122 
HelmholtzZentrum München 
bernd.len...@helmholtz-muenchen.de 
phone: +49 89 3187 1241 
phone: +49 89 3187 3827 
fax: +49 89 3187 2294 
http://www.helmholtz-muenchen.de/mcd 


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] alert is not executed - solved

2021-02-16 Thread Lentes, Bernd


- On Feb 15, 2021, at 10:24 PM, Bernd Lentes 
bernd.len...@helmholtz-muenchen.de wrote:

> - On Feb 15, 2021, at 9:00 PM, kgaillot kgail...@redhat.com wrote:
> 
>> On Mon, 2021-02-15 at 20:47 +0100, Lentes, Bernd wrote:
>>> - On Feb 15, 2021, at 4:53 PM, kgaillot kgail...@redhat.com
>>> wrote:
>>> 
>>> > I'd check for SELinux denials.
>>> > 
>>> 
>>> SELinux isn't installed and the AppArmor service does not start.
>>> I changed the subject.
>> 

I found it. It was a permission problem. The script was stored in root's home
directory, but alert scripts are executed as the user hacluster.
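
For the archives, a sketch of the kind of fix (the target path is only an example;
the point is that hacluster must be able to reach, read and execute the script,
which it cannot while the script lives under /root):

install -o hacluster -g haclient -m 0755 /root/skripte/alert_smtp.sh /usr/local/sbin/alert_smtp.sh
sudo -u hacluster /usr/local/sbin/alert_smtp.sh   # sanity check; the CRM_alert_* vars are missing here, so only look for permission errors
crm configure edit smtp_alert                     # then point the alert at the new path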


Bernd
Helmholtz Zentrum München

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] alert is not executed

2021-02-15 Thread Lentes, Bernd


- On Feb 15, 2021, at 9:00 PM, kgaillot kgail...@redhat.com wrote:

> On Mon, 2021-02-15 at 20:47 +0100, Lentes, Bernd wrote:
>> - On Feb 15, 2021, at 4:53 PM, kgaillot kgail...@redhat.com
>> wrote:
>> 
>> > I'd check for SELinux denials.
>> > 
>> 
>> SELinux isn't installed and the AppArmor service does not start.
>> I changed the subject.
> 
> Maybe "exec 2>/some/file" and "set +x" as the first things in the
> script.
> 

That does not help. The script is not executed.
I inserted a "logger" command, but there is nothing written to the syslog.
And the atime of the script is not updated.

Bernd
Helmholtz Zentrum München

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] alert is not executed

2021-02-15 Thread Lentes, Bernd

- On Feb 15, 2021, at 4:53 PM, kgaillot kgail...@redhat.com wrote:

> I'd check for SELinux denials.
> 
SELinux isn't installed and the AppArmor service does not start.
I changed the subject.


Bernd 
Helmholtz Zentrum München

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: weird xml snippet in "crm configure show"

2021-02-15 Thread Lentes, Bernd


- On Feb 15, 2021, at 9:55 AM, Ulrich Windl 
ulrich.wi...@rz.uni-regensburg.de wrote:
>> Hi,
>> 
>> i could configure the following:
>> 
>> ha-idg-1:~ # crm configure show smtp_alert
>> alert smtp_alert "/root/skripte/alert_smtp.sh" \
>> attributes email_sender="bernd.len...@helmholtz-muenchen.de" \
>> meta timestamp-format="%D %H:%M" \
>> to "informatic@helmholtz-muenchen.de"
>> 
>> Script is available:
>> ha-idg-1:~ # ll /root/skripte/alert_smtp.sh
>> -rwxr-xr-x 1 root root 4080 Feb 13 01:10 /root/skripte/alert_smtp.sh
>> 
>> But it's not executed, although the cluster log says the alert is doing its
>> job:
>> Feb 13 01:10:57 [30760] ha-idg-1   crmd: info: exec_alert_list:
>> Sending resource alert via smtp_alert to
> informatic@helmholtz-muenchen.de
>> Feb 13 01:10:57 [30757] ha-idg-1   lrmd: info:
>> process_lrmd_alert_exec: Executing alert smtp_alert for
>> 621c8a64-13aa-46fa-a7f7-f7df8d384a86
>> 
>> I added a simple logger command into the script, but nothing is written to
>> system log.
> 
> How does the first line in your script look like?
> 

#!/bin/sh
#
# Copyright (C) 2016 Klaus Wenninger 
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public
# License as published by the Free Software Foundation; either
# version 2 of the License, or (at your option) any later version.

Bernd
Helmholtz Zentrum München

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] weird xml snippet in "crm configure show"

2021-02-12 Thread Lentes, Bernd


- On Feb 12, 2021, at 12:50 PM, Yan Gao y...@suse.com wrote:


> 
> 
> It seems that crmsh has difficulty parsing the "random" ids of the
> attribute sets here. I guess `crm configure edit` the part to be
> something like:
> 
> alert smtp_alert "/root/skripte/alert_smtp.sh" \
> attributes email_sender="bernd.len...@helmholtz-muenchen.de" \
> to "informatic@helmholtz-muenchen.de" meta
> timestamp-format="%D %H:%M"
> 
> will do.

Hi,

i could configure the following:

ha-idg-1:~ # crm configure show smtp_alert
alert smtp_alert "/root/skripte/alert_smtp.sh" \
attributes email_sender="bernd.len...@helmholtz-muenchen.de" \
meta timestamp-format="%D %H:%M" \
to "informatic@helmholtz-muenchen.de"

Script is available:
ha-idg-1:~ # ll /root/skripte/alert_smtp.sh
-rwxr-xr-x 1 root root 4080 Feb 13 01:10 /root/skripte/alert_smtp.sh

But it's not executed, although the cluster log says the alert is doing its job:
Feb 13 01:10:57 [30760] ha-idg-1   crmd: info: exec_alert_list: Sending 
resource alert via smtp_alert to informatic@helmholtz-muenchen.de
Feb 13 01:10:57 [30757] ha-idg-1   lrmd: info: process_lrmd_alert_exec: 
Executing alert smtp_alert for 621c8a64-13aa-46fa-a7f7-f7df8d384a86

I added a simple logger command to the script, but nothing is written to the
system log.
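
A minimal debug alert script that would narrow this down (a sketch; /tmp/alert_debug.log
is just an example target that the hacluster user can certainly write to):

#!/bin/sh
{
        date
        echo "running as: $(id -un)"
        env | grep '^CRM_alert_' | sort
        echo "---"
} >> /tmp/alert_debug.log 2>&1

If that file stays empty the script is never executed at all (e.g. not readable or
executable for hacluster); if it fills up, the problem is inside the real script.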

Bernd

Helmholtz Zentrum München

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] weird xml snippet in "crm configure show"

2021-02-12 Thread Lentes, Bernd

- On Feb 12, 2021, at 5:00 PM, hunter86 bg hunter86...@yahoo.com wrote:

> WARNING: cib-bootstrap-options: unknown attribute 'no-quirum-policy'

> That looks like a typo.

> Best Regards,
> Strahil Nikolov

Thanks, i found that already.

Bernd
Helmholtz Zentrum München

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] weird xml snippet in "crm configure show"

2021-02-12 Thread Lentes, Bernd


- On Feb 12, 2021, at 11:18 AM, Ulrich Windl 
ulrich.wi...@rz.uni-regensburg.de wrote:

> 
> What is the output of "crm configure verify"?

ha-idg-1:~ # crm configure verify
WARNING: cib-bootstrap-options: unknown attribute 'no-quirum-policy'
WARNING: clvmd: specified timeout 20 for monitor is smaller than the advised 90s
WARNING: dlm: specified timeout 80 for stop is smaller than the advised 100
WARNING: dlm: specified timeout 80 for start is smaller than the advised 90
WARNING: fs_ocfs2: specified timeout 20 for monitor is smaller than the advised 
40s
WARNING: fs_test_ocfs2: specified timeout 20 for monitor is smaller than the 
advised 40s
WARNING: gfs2_share: specified timeout 20 for monitor is smaller than the 
advised 40s
WARNING: gfs2_snap: specified timeout 20 for monitor is smaller than the 
advised 40s
WARNING: vm_amok: specified timeout 25 for monitor is smaller than the advised 
30s
WARNING: vm_crispor: specified timeout 25 for monitor is smaller than the 
advised 30s
WARNING: vm_crispor-server: specified timeout 25 for monitor is smaller than 
the advised 30s
WARNING: vm_dietrich: specified timeout 25 for monitor is smaller than the 
advised 30s
WARNING: vm_documents-oo: specified timeout 25 for monitor is smaller than the 
advised 30s
WARNING: vm_geneious: specified timeout 25 for monitor is smaller than the 
advised 30s
WARNING: vm_geneious-license: specified timeout 25 for monitor is smaller than 
the advised 30s
WARNING: vm_geneious-license-mcd: specified timeout 25 for monitor is smaller 
than the advised 30s
WARNING: vm_genetrap: specified timeout 25 for monitor is smaller than the 
advised 30s
WARNING: vm_greensql: specified timeout 25 for monitor is smaller than the 
advised 30s
WARNING: vm_idcc_devel: specified timeout 25 for monitor is smaller than the 
advised 30s
WARNING: vm_mausdb: specified timeout 25 for monitor is smaller than the 
advised 30s
WARNING: vm_mouseidgenes: specified timeout 25 for monitor is smaller than the 
advised 30s
WARNING: vm_nextcloud: specified timeout 25 for monitor is smaller than the 
advised 30s
WARNING: vm_pathway: specified timeout 25 for monitor is smaller than the 
advised 30s
WARNING: vm_photoshop: specified timeout 25 for monitor is smaller than the 
advised 30s
WARNING: vm_seneca: specified timeout 25 for monitor is smaller than the 
advised 30s
WARNING: vm_severin: specified timeout 25 for monitor is smaller than the 
advised 30s
WARNING: vm_sim: specified timeout 25 for monitor is smaller than the advised 
30s
WARNING: vm_snipanalysis: specified timeout 25 for monitor is smaller than the 
advised 30s
WARNING: vm_ssh: specified timeout 25 for monitor is smaller than the advised 
30s



Bernd
Helmholtz Zentrum München

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: Re: Antw: [EXT] Re: VirtualDomain does not stop via "crm resource stop" - modify RA ?

2020-10-26 Thread Lentes, Bernd


- On Oct 26, 2020, at 4:09 PM, Ulrich Windl 
ulrich.wi...@rz.uni-regensburg.de wrote:


> 
> AFAIK you can even kill processes in Linux that are in "D" state (contrary to
> other operating systems).

How ?


Bernd
Helmholtz Zentrum München

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?

2020-10-26 Thread Lentes, Bernd


- On Oct 23, 2020, at 11:18 PM, Bernd Lentes 
bernd.len...@helmholtz-muenchen.de wrote:

> - On Oct 23, 2020, at 11:11 PM, arvidjaar arvidj...@gmail.com wrote:
> 
> 
>>> I need something like that which waits for some time (maybe 30s) if the
>>> domain
>>> nevertheless stops although
>>> "virsh destroy" gave an error back. Because the SIGKILL is delivered if the
>>> process wakes up from D state.
>> 
>> So why not ignore virsh error and just wait always? You probably need to
>> retain "domain not found" exit condition still.

Hi,

here is my rewritten RA: 
https://hmgubox2.helmholtz-muenchen.de/index.php/s/iYjRyJiWb5XNfXm

As I'm not a coder, feedback is welcome.

Bernd
Helmholtz Zentrum München

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: VirtualDomain does not stop via "crm resource stop" - modify RA ?

2020-10-26 Thread Lentes, Bernd


- On Oct 26, 2020, at 8:41 AM, Ulrich Windl 
ulrich.wi...@rz.uni-regensburg.de wrote:

> "SIGKILL: Device or resource busy" is nonsense: kill does not wait; it either
> fails or succeeds.

Yes and no. When you send a SIGKILL to a process which is in 'D' state, the
signal can't be acted on immediately, e.g. because the domain is doing heavy IO.
But the signal stays pending, and when the process wakes up and switches to 'S' or
'R' state, the signal is delivered and the process is killed.

That's exactly what I want to address. If you just check the rc from kill and
you get something -ne 0, you think it failed.
But it's possible your kill takes effect 20 seconds later. So, with a delay,
the SIGKILL would succeed. But the node the domain is running on has already been
fenced, with all its consequences.

And the "device or resource busy" message is exactly the one you get when you
do a "virsh destroy" on a domain in 'D' state.

Bernd
Helmholtz Zentrum München

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?

2020-10-23 Thread Lentes, Bernd


- On Oct 23, 2020, at 11:11 PM, arvidjaar arvidj...@gmail.com wrote:


>> I need something like that which waits for some time (maybe 30s) if the domain
>> nevertheless stops although
>> "virsh destroy" gave an error back. Because the SIGKILL is delivered if the
>> process wakes up from D state.
> 
> So why not ignore virsh error and just wait always? You probably need to
> retain "domain not found" exit condition still.

That's my plan.

Bernd
Helmholtz Zentrum München

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?

2020-10-23 Thread Lentes, Bernd


- On Oct 23, 2020, at 8:45 PM, Valentin Vidić vvi...@valentin-vidic.from.hr 
wrote:

> On Fri, Oct 23, 2020 at 08:08:31PM +0200, Lentes, Bernd wrote:
>> But when the timeout has run out the RA tries to kill the machine with a
>> "virsh destroy".
>> And if that does not work (which is occasionally my problem) because the
>> domain is in uninterruptible sleep (D state), the RA gives $OCF_ERR_GENERIC
>> back, which causes pacemaker to fence the lazy node. Or am I wrong ?
> 
> What does the log look like when this happens?
> 

/var/log/cluster/corosync.log:

VirtualDomain(vm_amok)[8998]:   2020/09/27_22:34:11 INFO: Issuing graceful 
shutdown request for domain vm_amok.

VirtualDomain(vm_amok)[8998]:   2020/09/27_22:37:06 INFO: Issuing forced 
shutdown (destroy) request for domain vm_amok.
Sep 27 22:37:11 [11282] ha-idg-2   lrmd:  warning: child_timeout_callback:  
vm_amok_stop_0 process (PID 8998) timed out
Sep 27 22:37:11 [11282] ha-idg-2   lrmd:  warning: operation_finished:  
vm_amok_stop_0:8998 - timed out after 18ms
  timeout of the domain is 180 sec.

/var/log/libvirt/libvirtd.log (time is UTC):

2020-09-27 20:37:21.489+: 18583: error : virProcessKillPainfully:401 : 
Failed to terminate process 14037 with SIGKILL: Device or resource busy
2020-09-27 20:37:21.505+: 6610: error : virNetSocketWriteWire:1852 : Cannot 
write data: Broken pipe
2020-09-27 20:37:31.962+: 6610: error : qemuMonitorIO:719 : internal error: 
End of file from qemu monitor

SIGKILL didn't work. Nevertheless the process finished about 20 seconds after the
destroy, surely because it woke up from D state and acted on the signal.

/var/log/cluster/corosync.log on the DC:

Sep 27 22:37:11 [3580] ha-idg-1   crmd:  warning: status_from_rc:   Action 
93 (vm_amok_stop_0) on ha-idg-2 failed (target: 0 vs. rc: 1): Error
  Stop (also sigkill) failed
Sep 27 22:37:11 [3579] ha-idg-1pengine:   notice: native_stop_constraints:  
Stop of failed resource vm_amok is implicit after ha-idg-2 is fenced
  cluster decides to fence the node although resource is stopped 10 seconds 
later

atop log:
14037  - S 261% /usr/bin/qemu-system-x86_64 -machine accel=kvm -name 
guest=vm_amok,debug-threads=on -S -object secret,id=masterKey0 ...
  PID of the domain is 14037

14037  - E   0% worker   (at 22:37:31)
  domain has stopped


Bernd
Helmholtz Zentrum München

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?

2020-10-23 Thread Lentes, Bernd


- On Oct 23, 2020, at 5:06 PM, Strahil Nikolov hunter86...@yahoo.com wrote:

> why don't you work with something like this: 'op stop interval =300
> timeout=600'.
> The stop operation will timeout at your requirements without modifying the
> script.
> 
> Best Regards,
> Strahil Nikolov

But when the timeout has run out the RA tries to kill the machine with a "virsh
destroy".
And if that does not work (which is occasionally my problem) because the domain
is in uninterruptible sleep (D state), the RA gives $OCF_ERR_GENERIC back, which
causes pacemaker to fence the lazy node. Or am I wrong ?
Where is the benefit of the shorter interval ?

The return value of the "virsh destroy" operation is set immediately,
and it's -ne 0 when the "virsh destroy" didn't succeed.
Even if the domain stops 20 seconds later, the return value is not changed;
it is sent to the LRM as-is, so the cluster wants to stonith that node.

Surprisingly, if the virsh destroy is successful, the RA waits until the domain
isn't running anymore:

force_stop()
{
        ...
        0*)
                while [ $status != $OCF_NOT_RUNNING ]; do
                        VirtualDomain_status
                        status=$?
                done ;;

I need something like that which waits for some time (maybe 30s) in case the domain
nevertheless stops although "virsh destroy" gave an error back, because the SIGKILL
is delivered once the process wakes up from D state.
For that amount of time the RA has to wait, and make sure that the return value is
zero if the domain stopped, or -ne 0 if the waiting didn't help either.
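
A sketch of the kind of loop I mean, inside force_stop() of the RA (so
VirtualDomain_status and the OCF_* variables are assumed to be available; the 30s
is arbitrary and should probably become a parameter; the "domain not found"
handling of the original code is omitted here for brevity):

out=$(LANG=C virsh $VIRSH_OPTIONS destroy ${DOMAIN_NAME} 2>&1)
ex=$?
if [ $ex -ne 0 ]; then
        # destroy reported an error, but the SIGKILL may still be pending:
        # give the domain up to 30s to die before declaring the stop failed
        for i in $(seq 1 30); do
                VirtualDomain_status
                [ $? -eq $OCF_NOT_RUNNING ] && return $OCF_SUCCESS
                sleep 1
        done
        ocf_exit_reason "forced stop failed"
        return $OCF_ERR_GENERIC
fi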

Bernd
Helmholtz Zentrum München

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?

2020-10-23 Thread Lentes, Bernd


- On Oct 23, 2020, at 7:11 AM, Andrei Borzenkov arvidj...@gmail.com wrote:


>> 
>>  ocf_log info "Issuing forced shutdown (destroy) request for domain
>>  ${DOMAIN_NAME}."
>> out=$(LANG=C virsh $VIRSH_OPTIONS destroy ${DOMAIN_NAME} 2>&1)
>> ex=$?
>> sleep (10)< (or maybe configurable)
>> translate=$(echo $out|tr 'A-Z' 'a-z')
>> 
>> 
>> What do you think ?
>> 
> 
> 
> It makes no difference. You wait 10 seconds before parsing output of
> "virsh destroy", that's all. It does not change output itself, so if
> output indicates that "virsh destroy" failed, it will still indicate
> that after 10 seconds.
> 
> Either you need to repeat "virsh destroy" in a loop, or virsh itself
> should be more robust.

Hi Andrei,

yes, you are right. I saw it already after sending the e-mail.
I will change that.

Bernd
Helmholtz Zentrum München

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?

2020-10-22 Thread Lentes, Bernd
Hi guys,

Occasionally stopping a VirtualDomain resource via "crm resource stop" does not
work, and in the end the node is fenced, which is ugly.
I had a look at the RA to see what it does. After trying to stop the domain via
"virsh shutdown ..." within a configurable time, it switches to "virsh destroy".
I assume "virsh destroy" sends a SIGKILL to the respective process. But when the
host is doing heavy IO it's possible that the process is in "D" state
(uninterruptible sleep), in which it can't be finished with a SIGKILL. Then the
node the domain is running on is fenced because of that.
I dug deeper and found out that the signal is often delivered a bit later (just
some seconds) and the process is killed, but pacemaker has already decided to
fence the node.
It's all about this excerpt in the RA:

force_stop()
{
        local out ex translate
        local status=0

        ocf_log info "Issuing forced shutdown (destroy) request for domain ${DOMAIN_NAME}."
        out=$(LANG=C virsh $VIRSH_OPTIONS destroy ${DOMAIN_NAME} 2>&1)
        ex=$?
        translate=$(echo $out|tr 'A-Z' 'a-z')
        echo >&2 "$translate"
        case $ex$translate in
                *"error:"*"domain is not running"*|*"error:"*"domain not found"*|\
                *"error:"*"failed to get domain"*)
                        : ;; # unexpected path to the intended outcome, all is well
                [!0]*)
                        ocf_exit_reason "forced stop failed"
                        return $OCF_ERR_GENERIC ;;
                0*)
                        while [ $status != $OCF_NOT_RUNNING ]; do
                                VirtualDomain_status
                                status=$?
                        done ;;
        esac
        return $OCF_SUCCESS
}

I'm thinking about the following:
How about letting the script wait a bit after "virsh destroy"? I saw that
usually it just takes a few seconds until "virsh destroy" is successful.
I'm thinking about this change:

        ocf_log info "Issuing forced shutdown (destroy) request for domain ${DOMAIN_NAME}."
        out=$(LANG=C virsh $VIRSH_OPTIONS destroy ${DOMAIN_NAME} 2>&1)
        ex=$?
        sleep 10                              <=== (or maybe configurable)
        translate=$(echo $out|tr 'A-Z' 'a-z')


What do you think ?

Bernd


-- 

Bernd Lentes 
Systemadministration 
Institute for Metabolism and Cell Death (MCD) 
Building 25 - office 122 
HelmholtzZentrum München 
bernd.len...@helmholtz-muenchen.de 
phone: +49 89 3187 1241 
phone: +49 89 3187 3827 
fax: +49 89 3187 2294 
http://www.helmholtz-muenchen.de/mcd 

stay healthy
Helmholtz Zentrum München

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] mess in the CIB

2020-10-07 Thread Lentes, Bernd


> 
> It's unlikely that changed at any time; more likely it was created like
> that. Whatever was used to create the initial configuration would be
> where to look for clues.
> 
> As long as the IDs are unique, their content doesn't matter to
> pacemaker, so it's just a cosmetic issue.
> 

What do you propose ?
Delete and re-create correctly ?
These domains can be stopped for a short time.

Bernd
Helmholtz Zentrum München

Helmholtz Zentrum München

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] mess in the CIB

2020-10-06 Thread Lentes, Bernd
Hi guys,

i have a very strange problem with my CIB.
We have a two-node cluster running about 15 VirtualDomains as resources.
Two of them seem to be messed up.
Here is the config from crm:

primitive vm_ssh VirtualDomain \
params config="/mnt/share/vm_ssh.xml" \
params hypervisor="qemu:///system" \
params migration_transport=ssh \
params migrate_options="--p2p --tunnelled" \
op start interval=0 timeout=120 \
op stop interval=0 timeout=180 \
op monitor interval=30 timeout=25 \
op migrate_from interval=0 timeout=300 \
op migrate_to interval=0 timeout=300 \
meta allow-migrate=true target-role=Started is-managed=true 
maintenance=false \
utilization cpu=2 hv_memory=4096

ha-idg-1:/mnt/share # crm configure show vm_snipanalysis
primitive vm_snipanalysis VirtualDomain \
params config="/mnt/share/vm_snipanalysis.xml" \
params hypervisor="qemu:///system" \
params migration_transport=ssh \
params migrate_options="--p2p --tunnelled" \
op start interval=0 timeout=120 \
op stop interval=0 timeout=180 \
op monitor interval=30 timeout=25 \
op migrate_from interval=0 timeout=300 \
op migrate_to interval=0 timeout=300 \
meta allow-migrate=true target-role=Stopped is-managed=false 
maintenance=false
Everything looks OK to me.

Here are the two config files for libvirt:

ha-idg-1:/etc/libvirt/qemu # less /mnt/share/vm_snipanalysis.xml

[ libvirt domain XML - the element markup was stripped by the list archive.
  Recoverable values: name vm_snipanalysis, uuid b3b91a8c-b13f-4368-8439-7d8a4108ef3b,
  memory/currentMemory 32768000, 12 vcpus, os type hvm, on_poweroff/on_crash destroy,
  on_reboot restart, emulator /usr/bin/qemu-kvm ]

and

ha-idg-1:/etc/libvirt/qemu # less /mnt/share/vm_ssh.xml



[ libvirt domain XML - the element markup was stripped by the list archive.
  Recoverable values: name vm_ssh, uuid b3b91a8d-b13f-4368-8439-7d8a4109ef3b,
  memory/currentMemory 4194304, 2 vcpus, os type hvm, on_poweroff/on_crash destroy,
  on_reboot restart, emulator /usr/bin/qemu-kvm ]

Also in the libvirt config files I don't see a problem.

BUT in the cib:

 

[ CIB XML of one of the two primitives - the markup was stripped by the list archive.
  The three "<==" arrows marked instance_attributes/nvpair entries with the
  unexpected ids discussed below. ]
and

[ CIB XML of the second primitive - the markup was completely stripped by the
  list archive ]

The config of vm_snipanalysis seems to be OK.
But vm_ssh ... why are some of its instance attributes named after vm_snipanalysis?
I didn't change the configuration of either of them in the last weeks.

Does anyone have a clue ?
Thanks.
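
In case it helps others hitting the same thing: the id mixup is easiest to see in
the raw XML of a single resource, e.g. with one of these (exact option spelling may
vary with the crmsh/pacemaker version):

crm configure show xml vm_ssh
cibadmin --query --xpath "//primitive[@id='vm_ssh']"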

Bernd

-- 

Bernd Lentes 
Systemadministration 
Institute for Metabolism and Cell Death (MCD) 
Building 25 - office 122 
HelmholtzZentrum München 
bernd.len...@helmholtz-muenchen.de 
phone: +49 89 3187 1241 
phone: +49 89 3187 3827 
fax: +49 89 3187 2294 
http://www.helmholtz-muenchen.de/mcd 

stay healthy
Helmholtz Zentrum München

Helmholtz Zentrum München

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] VirtualDomain stop operation traced - but nothing appears in /var/lib/heartbeat/trace_ra/

2020-10-02 Thread Lentes, Bernd


- On Sep 30, 2020, at 7:24 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote:

> Hi

> Try to enable trace_ra for start op.

I'm now tracing start and stop, and that works fine.

[ XML snippet (presumably the operation definitions with trace_ra=1 as stored in
  the CIB) - the markup was stripped by the list archive ]
Thanks for any hint.

Bernd
Helmholtz Zentrum München

Helmholtz Zentrum München

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] VirtualDomain stop operation traced - but nothing appears in /var/lib/heartbeat/trace_ra/

2020-09-28 Thread Lentes, Bernd
Hi,

currently I have a VirtualDomain resource which sometimes fails to stop.
To investigate further I'm tracing the stop operation of this resource.
But although I have already stopped it several times now, nothing appears in
/var/lib/heartbeat/trace_ra/.

This is my config:
primitive vm_amok VirtualDomain \
params config="/mnt/share/vm_amok.xml" \
params hypervisor="qemu:///system" \
params migration_transport=ssh \
params migrate_options="--p2p --tunnelled" \
op start interval=0 timeout=120 \
op monitor interval=30 timeout=25 \
op migrate_from interval=0 timeout=300 \
op migrate_to interval=0 timeout=300 \
op stop interval=0 timeout=180 \
op_params trace_ra=1 \
meta allow-migrate=true target-role=Started is-managed=true 
maintenance=false

[ XML snippet (presumably the resource as it appears in the CIB, including the op
  instance attributes with trace_ra=1) - the markup was stripped by the list archive ]

Any ideas ?
SLES 12 SP4, pacemaker-1.1.19+20181105.ccd6b5b10-3.13.1.x86_64
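
For reference, what the stock ocf-shellfuncs tracing is expected to do (a sketch,
not verified on this exact version): the traces land in a per-agent subdirectory,
and crmsh can toggle the attribute per operation:

ls -l /var/lib/heartbeat/trace_ra/VirtualDomain/   # one file per traced run should appear here
crm resource trace vm_amok stop                    # adds trace_ra=1 to the stop op
crm resource untrace vm_amok stop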

Bernd

-- 

Bernd Lentes 
Systemadministration 
Institute for Metabolism and Cell Death (MCD) 
Building 25 - office 122 
HelmholtzZentrum München 
bernd.len...@helmholtz-muenchen.de 
phone: +49 89 3187 1241 
phone: +49 89 3187 3827 
fax: +49 89 3187 2294 
http://www.helmholtz-muenchen.de/mcd 

stay healthy
Helmholtz Zentrum München

Helmholtz Zentrum München

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] why is node fenced ?

2020-08-19 Thread Lentes, Bernd

- On Aug 19, 2020, at 4:04 PM, kgaillot kgail...@redhat.com wrote:
>> This appears to be a scheduler bug.
> 
> Fix is in master branch and will land in 2.0.5 expected at end of the
> year
> 
> https://github.com/ClusterLabs/pacemaker/pull/2146

A question of principle:
I have SLES 12 and I'm using the pacemaker version provided with the
distribution.
Whether this fix gets backported depends on SUSE.

If I install and update pacemaker manually (not the version provided by SUSE),
I lose my support from them, but always have the most recent code and fixes.

If I stay with the version from SUSE I have support from them, but maybe not
all fixes and not the most recent code.

What is your approach ?
Recommendations ?

Thanks.

Bernd
Helmholtz Zentrum München

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] why is node fenced ?

2020-08-19 Thread Lentes, Bernd

- On Aug 18, 2020, at 7:30 PM, kgaillot kgail...@redhat.com wrote:


>> > I'm not sure, I'd have to see the pe input.
>> 
>> You find it here:
>> https://hmgubox2.helmholtz-muenchen.de/index.php/s/WJGtodMZ9k7rN29
> 
> This appears to be a scheduler bug.
> 
> The scheduler considers a migration to be "dangling" if it has a record
> of a failed migrate_to on the source node, but no migrate_from on the
> target node (and no migrate_from or start on the source node, which
> would indicate a later full restart or reverse migration).
> 
> In this case, any migrate_from on the target has since been superseded
> by a failed start and a successful stop, so there is no longer a record
> of it. Therefore the migration is considered dangling, which requires a
> full stop on the source node.
> 
> However in this case we already have a successful stop on the source
> node after the failed migrate_to, and I believe that should be
> sufficient to consider it no longer dangling.
> 

Thanks for your explanation, Ken.
For me, a fence I don't understand is the worst thing that can happen to an HA cluster.

Bernd
Helmholtz Zentrum München

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] why is node fenced ?

2020-08-18 Thread Lentes, Bernd


- On Aug 17, 2020, at 5:09 PM, kgaillot kgail...@redhat.com wrote:


>> I checked all relevant pe-files in this time period.
>> This is what i found out (i just write the important entries):
 
>> Executing cluster transition:
>>  * Resource action: vm_nextcloud stop on ha-idg-2
>> Revised cluster status:
>>  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped
>> 
>> ha-idg-1:~/why-fenced/ha-idg-1/pengine # crm_simulate -S -x pe-input-
>> 3118 -G transition-4516.xml -D transition-4516.dot
>> Current cluster status:
>> Node ha-idg-1 (1084777482): standby
>> Online: [ ha-idg-2 ]
>>  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped
>> <== vm_nextcloud is stopped
>> Transition Summary:
>>  * Shutdown ha-idg-1
>> Executing cluster transition:
>>  * Resource action: vm_nextcloud stop on ha-idg-1 <=== why stop ?
>> It is already stopped
> 
> I'm not sure, I'd have to see the pe input.

You find it here: 
https://hmgubox2.helmholtz-muenchen.de/index.php/s/WJGtodMZ9k7rN29

>>  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped <===
>> vm_nextcloud is stopped
>> Transition Summary:
>>  * Fence (Off) ha-idg-1 'resource actions are unrunnable'
>> Executing cluster transition:
>>  * Fencing ha-idg-1 (Off)
>>  * Pseudo action:   vm_nextcloud_stop_0 <=== why stop ? It is
>> already stopped ?
>> Revised cluster status:
>> Node ha-idg-1 (1084777482): OFFLINE (standby)
>> Online: [ ha-idg-2 ]
>>  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped
>> 
>> I don't understand why the cluster tries to stop a resource which is
>> already stopped.

Bernd
Helmholtz Zentrum München

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] why is node fenced ?

2020-08-14 Thread Lentes, Bernd
- On Aug 9, 2020, at 10:17 PM, Bernd Lentes 
bernd.len...@helmholtz-muenchen.de wrote:


>> So this appears to be the problem. From these logs I would guess the
>> successful stop on ha-idg-1 did not get written to the CIB for some
>> reason. I'd look at the pe input from this transition on ha-idg-2 to
>> confirm that.
>> 
>> Without the DC knowing about the stop, it tries to schedule a new one,
>> but the node is shutting down so it can't do it, which means it has to
>> be fenced.

I checked all relevant pe-files in this time period.
This is what I found out (I just list the important entries):

ha-idg-1:~/why-fenced/ha-idg-1/pengine # crm_simulate -S -x pe-input-3116 -G 
transition-3116.xml -D transition-3116.dot
Current cluster status:
 ...
 vm_nextcloud   (ocf::heartbeat:VirtualDomain): Started ha-idg-1
Transition Summary:
 ...
* Migratevm_nextcloud   ( ha-idg-1 -> ha-idg-2 )
Executing cluster transition:
 * Resource action: vm_nextcloud migrate_from on ha-idg-2 <=== migrate vm_nextcloud
 * Resource action: vm_nextcloud stop on ha-idg-1
 * Pseudo action:   vm_nextcloud_start_0
Revised cluster status:
Node ha-idg-1 (1084777482): standby
Online: [ ha-idg-2 ]
vm_nextcloud   (ocf::heartbeat:VirtualDomain): Started ha-idg-2


ha-idg-1:~/why-fenced/ha-idg-1/pengine # crm_simulate -S -x pe-error-48 -G 
transition-4514.xml -D transition-4514.dot
Current cluster status:
Node ha-idg-1 (1084777482): standby
Online: [ ha-idg-2 ]
...
 vm_nextcloud   (ocf::heartbeat:VirtualDomain): FAILED[ ha-idg-2 ha-idg-1 ] 
<== migration failed
Transition Summary:
..
 * Recovervm_nextcloud( ha-idg-2 )
Executing cluster transition:
 * Resource action: vm_nextcloud stop on ha-idg-2
 * Resource action: vm_nextcloud stop on ha-idg-1
 * Resource action: vm_nextcloud start on ha-idg-2
 * Resource action: vm_nextcloud monitor=3 on ha-idg-2
Revised cluster status:
 vm_nextcloud   (ocf::heartbeat:VirtualDomain): Started ha-idg-2

ha-idg-1:~/why-fenced/ha-idg-1/pengine # crm_simulate -S -x pe-input-3117 -G 
transition-3117.xml -D transition-3117.dot
Current cluster status:
Node ha-idg-1 (1084777482): standby
Online: [ ha-idg-2 ]
 vm_nextcloud   (ocf::heartbeat:VirtualDomain): FAILED ha-idg-2 <== start 
on ha-idg-2 failed
Transition Summary:
 * Stop   vm_nextcloud ( ha-idg-2 )   due to node availability <=== stop vm_nextcloud (what does "due to node availability" mean ?)
Executing cluster transition:
 * Resource action: vm_nextcloud stop on ha-idg-2
Revised cluster status:
 vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped

ha-idg-1:~/why-fenced/ha-idg-1/pengine # crm_simulate -S -x pe-input-3118 -G 
transition-4516.xml -D transition-4516.dot
Current cluster status:
Node ha-idg-1 (1084777482): standby
Online: [ ha-idg-2 ]
 vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped <== 
vm_nextcloud is stopped
Transition Summary:
 * Shutdown ha-idg-1
Executing cluster transition:
 * Resource action: vm_nextcloud stop on ha-idg-1 <=== why stop ? It is
already stopped
Revised cluster status:
 vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped

ha-idg-1:~/why-fenced/ha-idg-2/pengine # crm_simulate -S -x pe-input-3545 -G 
transition-0.xml -D transition-0.dot
Current cluster status:
Node ha-idg-1 (1084777482): pending
Online: [ ha-idg-2 ]
 vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped <== vm_nextcloud is 
stopped
Transition Summary:

Executing cluster transition:
Using the original execution date of: 2020-07-20 15:05:33Z
Revised cluster status:
vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped

ha-idg-1:~/why-fenced/ha-idg-2/pengine # crm_simulate -S -x pe-warn-749 -G 
transition-1.xml -D transition-1.dot
Current cluster status:
Node ha-idg-1 (1084777482): OFFLINE (standby)
Online: [ ha-idg-2 ]
 vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped <=== vm_nextcloud 
is stopped
Transition Summary:
 * Fence (Off) ha-idg-1 'resource actions are unrunnable'
Executing cluster transition:
 * Fencing ha-idg-1 (Off)
 * Pseudo action:   vm_nextcloud_stop_0 <=== why stop ? It is already 
stopped ?
Revised cluster status:
Node ha-idg-1 (1084777482): OFFLINE (standby)
Online: [ ha-idg-2 ]
 vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped

I don't understand why the cluster tries to stop a resource which is already 
stopped.
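
As an aside, for anyone replaying these transitions: the .dot files written by -D
can be rendered with graphviz to see the ordering the scheduler intended, e.g.:

crm_simulate -S -x pe-input-3118 -G transition-4516.xml -D transition-4516.dot
dot -Tsvg transition-4516.dot -o transition-4516.svg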

Bernd
Helmholtz Zentrum München

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] why is node fenced ?

2020-08-14 Thread Lentes, Bernd


- On Aug 10, 2020, at 11:59 PM, kgaillot kgail...@redhat.com wrote:
> The most recent transition is aborted, but since all its actions are
> complete, the only effect is to trigger a new transition.
> 
> We should probably rephrase the log message. In fact, the whole
> "transition" terminology is kind of obscure. It's hard to come up with
> something better though.
> 
Hi Ken,

I don't get it. How can something be aborted which is already completed ?

Bernd
Helmholtz Zentrum München

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] why is node fenced ?

2020-08-10 Thread Lentes, Bernd


- On Jul 29, 2020, at 6:53 PM, kgaillot kgail...@redhat.com wrote:

> On Wed, 2020-07-29 at 17:26 +0200, Lentes, Bernd wrote:
>> Hi,
>> 
>> a few days ago one of my nodes was fenced and i don't know why, which
>> is something i really don't like.
>> What i did:
>> I put one node (ha-idg-1) in standby. The resources on it (most of
>> all virtual domains) were migrated to ha-idg-2,
>> except one domain (vm_nextcloud). On ha-idg-2 a mountpoint was
>> missing the xml of the domain points to.
>> Then the cluster tries to start vm_nextcloud on ha-idg-2 which of
>> course also failed.
>> Then ha-idg-1 was fenced.
>> I did a "crm history" over the respective time period, you find it
>> here:
>> https://hmgubox2.helmholtz-muenchen.de/index.php/s/529dfcXf5a72ifF
>> 
>> Here, from my point of view, the most interesting from the logs:
>> ha-idg-1:
>> Jul 20 16:59:33 [23763] ha-idg-1cib: info:
>> cib_perform_op:  Diff: --- 2.16196.19 2
>> Jul 20 16:59:33 [23763] ha-idg-1cib: info:
>> cib_perform_op:  Diff: +++ 2.16197.0 bc9a558dfbe6d7196653ce56ad1ee758
>> Jul 20 16:59:33 [23763] ha-idg-1cib: info:
>> cib_perform_op:  +  /cib:  @epoch=16197, @num_updates=0
>> Jul 20 16:59:33 [23763] ha-idg-1cib: info:
>> cib_perform_op:  +  /cib/configuration/nodes/node[@id='1084777482']/i
>> nstance_attributes[@id='nodes-108
>> 4777482']/nvpair[@id='nodes-1084777482-standby']:  @value=on
>> ha-idg-1 set to standby
>> 
>> Jul 20 16:59:34 [23768] ha-idg-1   crmd:   notice:
>> process_lrm_event:   ha-idg-1-vm_nextcloud_migrate_to_0:3169 [
>> error: Cannot access storage file
>> '/mnt/mcd/AG_BioInformatik/Technik/software_und_treiber/linux/ubuntu/
>> ubuntu-18.04.4-live-server-amd64.iso': No such file or
>> directory\nocf-exit-reason:vm_nextcloud: live migration to ha-idg-2
>> failed: 1\n ]
>> migration failed
>> 
>> Jul 20 17:04:01 [23767] ha-idg-1pengine:error:
>> native_create_actions:   Resource vm_nextcloud is active on 2 nodes
>> (attempting recovery)
>> ???
> 
> This is standard for a failed live migration -- the cluster doesn't
> know how far the migration actually got before failing, so it has to
> assume the VM could be active on either node. (The log message would
> make more sense saying "might be active" rather than "is active".)
> 
>> Jul 20 17:04:01 [23767] ha-idg-1pengine:   notice:
>> LogAction:*
>> Recovervm_nextcloud   ( ha-idg-2 )
> 
> The recovery from that situation is a full stop on both nodes, and
> start on one of them.
> 
>> Jul 20 17:04:01 [23768] ha-idg-1   crmd:   notice:
>> te_rsc_command:  Initiating stop operation vm_nextcloud_stop_0 on ha-
>> idg-2 | action 106
>> Jul 20 17:04:01 [23768] ha-idg-1   crmd:   notice:
>> te_rsc_command:  Initiating stop operation vm_nextcloud_stop_0
>> locally on ha-idg-1 | action 2
>> 
>> Jul 20 17:04:01 [23768] ha-idg-1   crmd: info:
>> match_graph_event:   Action vm_nextcloud_stop_0 (106) confirmed
>> on ha-idg-2 (rc=0)
>> 
>> Jul 20 17:04:06 [23768] ha-idg-1   crmd:   notice:
>> process_lrm_event:   Result of stop operation for vm_nextcloud on
>> ha-idg-1: 0 (ok) | call=3197 key=vm_nextcloud_stop_0 confirmed=true
>> cib-update=5960
> 
> It looks like both stops succeeded.
> 
>> Jul 20 17:05:29 [23761] ha-idg-1 pacemakerd:   notice:
>> crm_signal_dispatch: Caught 'Terminated' signal | 15 (invoking
>> handler)
>> systemctl stop pacemaker.service
>> 
>> 
>> ha-idg-2:
>> Jul 20 17:04:03 [10691] ha-idg-2   crmd:   notice:
>> process_lrm_event:   Result of stop operation for vm_nextcloud on
>> ha-idg-2: 0 (ok) | call=157 key=vm_nextcloud_stop_0 confirmed=true
>> cib-update=57
>> the log from ha-idg-2 is two seconds ahead of ha-idg-1
>> 
>> Jul 20 17:04:08 [10688] ha-idg-2   lrmd:   notice:
>> log_execute: executing - rsc:vm_nextcloud action:start
>> call_id:192
>> Jul 20 17:04:09 [10688] ha-idg-2   lrmd:   notice:
>> operation_finished:  vm_nextcloud_start_0:29107:stderr [ error:
>> Failed to create domain from /mnt/share/vm_nextcloud.xml ]
>> Jul 20 17:04:09 [10688] ha-idg-2   lrmd:   notice:
>> operation_finished:  vm_nextcloud_start_0:29107:stderr [ error:
>> Cannot access storage file
>> '/mnt/mcd/AG_BioInformatik/Technik/software_und_treiber/linux/ubuntu/
>> ubuntu-18.04.4-live-server-amd64.iso': No such file or directory ]
>> J

Re: [ClusterLabs] why is node fenced ?

2020-08-10 Thread Lentes, Bernd

- On Jul 29, 2020, at 6:53 PM, kgaillot kgail...@redhat.com wrote:

 
> Since the ha-idg-2 is now shutting down, ha-idg-1 becomes DC.

The other way round.

>> Jul 20 17:05:33 [10690] ha-idg-2pengine:  warning:
>> unpack_rsc_op_failure:   Processing failed migrate_to of vm_nextcloud
>> on ha-idg-1: unknown error | rc=1
>> Jul 20 17:05:33 [10690] ha-idg-2pengine:  warning:
>> unpack_rsc_op_failure:   Processing failed start of vm_nextcloud on
>> ha-idg-2: unknown error | rc
>> 
>> Jul 20 17:05:33 [10690] ha-idg-2pengine: info:
>> native_color:Resource vm_nextcloud cannot run anywhere
>> logical
>> 
>> Jul 20 17:05:33 [10690] ha-idg-2pengine:  warning:
>> custom_action:   Action vm_nextcloud_stop_0 on ha-idg-1 is unrunnable
>> (pending)
>> ???
> 
> So this appears to be the problem. From these logs I would guess the
> successful stop on ha-idg-1 did not get written to the CIB for some
> reason. I'd look at the pe input from this transition on ha-idg-2 to
> confirm that.
> 
> Without the DC knowing about the stop, it tries to schedule a new one,
> but the node is shutting down so it can't do it, which means it has to
> be fenced.
> 
>> Jul 20 17:05:35 [10690] ha-idg-2pengine:  warning:
>> custom_action:   Action vm_nextcloud_stop_0 on ha-idg-1 is unrunnable
>> (offline)
>> Jul 20 17:05:35 [10690] ha-idg-2pengine:  warning:
>> pe_fence_node:   Cluster node ha-idg-1 will be fenced: resource
>> actions are unrunnable
>> Jul 20 17:05:35 [10690] ha-idg-2pengine:  warning:
>> stage6:  Scheduling Node ha-idg-1 for STONITH
>> Jul 20 17:05:35 [10690] ha-idg-2pengine: info:
>> native_stop_constraints: vm_nextcloud_stop_0 is implicit after ha-
>> idg-1 is fenced
>> Jul 20 17:05:35 [10690] ha-idg-2pengine:   notice:
>> LogNodeActions:   * Fence (Off) ha-idg-1 'resource actions are
>> unrunnable'
>> 
>> 
>> Why does it say "Jul 20 17:05:35 [10690] ha-idg-
>> 2pengine:  warning: custom_action:   Action vm_nextcloud_stop_0
>> on ha-idg-1 is unrunnable (offline)" although
>> "Jul 20 17:04:06 [23768] ha-idg-1   crmd:   notice:
>> process_lrm_event:   Result of stop operation for vm_nextcloud on
>> ha-idg-1: 0 (ok) | call=3197 key=vm_nextcloud_stop_0 confirmed=true
>> cib-update=5960"
>> says that stop was ok ?

Bernd
Helmholtz Zentrum München

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: Re: Antw: [EXT] why is node fenced ?

2020-07-31 Thread Lentes, Bernd


- On Jul 31, 2020, at 8:03 AM, Ulrich Windl 
ulrich.wi...@rz.uni-regensburg.de wrote:


>>> 
>>> My guess is that ha-idg-1 was fenced because a failed migration from
>> ha-idg-2
>>> is treated like a stop failure on ha-idg-2. Stop failures cause fencing.
> You
>>> should have tested your resource before going productive.
>> 
>> Migration failed at 16:59:34.
>> Node is fenced at 17:05:35. 6 minutes later.
>> The cluster needs 6 minutes to decide to fence the node ?
>> I don't believe that the failed migration is the cause for the fencing.
> 
> What are the values for migration timeout and for stop timeout?
> 


primitive vm_nextcloud VirtualDomain \
params config="/mnt/share/vm_nextcloud.xml" \
params hypervisor="qemu:///system" \
params migration_transport=ssh \
params migrate_options="--p2p --tunnelled" \
op start interval=0 timeout=120 \
op stop interval=0 timeout=180 \  <==
op monitor interval=30 timeout=25 \
op migrate_from interval=0 timeout=300 \ <=
op migrate_to interval=0 timeout=300 \ <
meta allow-migrate=true target-role=Started is-managed=true 
maintenance=false \
utilization cpu=1 hv_memory=4096

3 or 5 minutes, not 6 minutes.

Bernd
Helmholtz Zentrum München

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] why is node fenced ?

2020-07-30 Thread Lentes, Bernd


- On Jul 30, 2020, at 9:28 AM, Ulrich Windl
ulrich.wi...@rz.uni-regensburg.de wrote:

>>>> "Lentes, Bernd"  schrieb am 29.07.2020
> um
> 17:26 in Nachricht
> <1894379294.27456141.1596036406000.javamail.zim...@helmholtz-muenchen.de>:
>> Hi,
>> 
>> a few days ago one of my nodes was fenced and i don't know why, which is
>> something i really don't like.
>> What i did:
>> I put one node (ha-idg-1) in standby. The resources on it (most of all
>> virtual domains) were migrated to ha-idg-2,
>> except one domain (vm_nextcloud). On ha-idg-2 a mountpoint was missing the
>> xml of the domain points to.
>> Then the cluster tries to start vm_nextcloud on ha-idg-2 which of course
>> also failed.
>> Then ha-idg-1 was fenced.
> 
> My guess is that ha-idg-1 was fenced because a failed migration from ha-idg-2
> is treated like a stop failure on ha-idg-2. Stop failures cause fencing. You
> should have tested your resource before going productive.

Migration failed at 16:59:34.
Node is fenced at 17:05:35. 6 minutes later.
The cluster needs 6 minutes to decide to fence the node ?
I don't believe that the failed migration is the cause for the fencing.


Bernd

Helmholtz Zentrum München

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] why is node fenced ?

2020-07-29 Thread Lentes, Bernd


- On Jul 29, 2020, at 5:26 PM, Bernd Lentes
bernd.len...@helmholtz-muenchen.de wrote:

Hi,

Sorry, I missed:
OS: SLES 12 SP4
kernel: 4.12.14-95.32
pacemaker: pacemaker-1.1.19+20181105.ccd6b5b10-3.13.1.x86_64


Bernd
Helmholtz Zentrum München

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] why is node fenced ?

2020-07-29 Thread Lentes, Bernd
Hi,

A few days ago one of my nodes was fenced and I don't know why, which is
something I really don't like.
What I did:
I put one node (ha-idg-1) in standby. The resources on it (mostly virtual
domains) were migrated to ha-idg-2, except one domain (vm_nextcloud): on
ha-idg-2, a mount point that the domain's XML points to was missing.
Then the cluster tried to start vm_nextcloud on ha-idg-2, which of course also
failed.
Then ha-idg-1 was fenced.
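
For reference, the standby step above is a single crmsh command (only a sketch;
given allow-migrate=true on the VirtualDomain resources, the cluster then
performs the live migrations itself):

crm node standby ha-idg-1   # the cluster live-migrates the domains to ha-idg-2
crm node online ha-idg-1    # later, to take the node out of standby again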

I did a "crm history" over the respective time period; you can find it here:
https://hmgubox2.helmholtz-muenchen.de/index.php/s/529dfcXf5a72ifF

Here are, from my point of view, the most interesting parts of the logs:
ha-idg-1:
Jul 20 16:59:33 [23763] ha-idg-1cib: info: cib_perform_op:  Diff: 
--- 2.16196.19 2
Jul 20 16:59:33 [23763] ha-idg-1cib: info: cib_perform_op:  Diff: 
+++ 2.16197.0 bc9a558dfbe6d7196653ce56ad1ee758
Jul 20 16:59:33 [23763] ha-idg-1cib: info: cib_perform_op:  +  
/cib:  @epoch=16197, @num_updates=0
Jul 20 16:59:33 [23763] ha-idg-1cib: info: cib_perform_op:  +  
/cib/configuration/nodes/node[@id='1084777482']/instance_attributes[@id='nodes-108
4777482']/nvpair[@id='nodes-1084777482-standby']:  @value=on
ha-idg-1 set to standby

Jul 20 16:59:34 [23768] ha-idg-1   crmd:   notice: process_lrm_event:   
ha-idg-1-vm_nextcloud_migrate_to_0:3169 [ error: Cannot access storage file 
'/mnt/mcd/AG_BioInformatik/Technik/software_und_treiber/linux/ubuntu/ubuntu-18.04.4-live-server-amd64.iso':
 No such file or directory\nocf-exit-reason:vm_nextcloud: live migration to 
ha-idg-2 failed: 1\n ]
migration failed

Jul 20 17:04:01 [23767] ha-idg-1pengine:error: native_create_actions:   
Resource vm_nextcloud is active on 2 nodes (attempting recovery)
???

Jul 20 17:04:01 [23767] ha-idg-1pengine:   notice: LogAction:* 
Recovervm_nextcloud   ( ha-idg-2 )

Jul 20 17:04:01 [23768] ha-idg-1   crmd:   notice: te_rsc_command:  
Initiating stop operation vm_nextcloud_stop_0 on ha-idg-2 | action 106
Jul 20 17:04:01 [23768] ha-idg-1   crmd:   notice: te_rsc_command:  
Initiating stop operation vm_nextcloud_stop_0 locally on ha-idg-1 | action 2

Jul 20 17:04:01 [23768] ha-idg-1   crmd: info: match_graph_event:   
Action vm_nextcloud_stop_0 (106) confirmed on ha-idg-2 (rc=0)

Jul 20 17:04:06 [23768] ha-idg-1   crmd:   notice: process_lrm_event:   
Result of stop operation for vm_nextcloud on ha-idg-1: 0 (ok) | call=3197 
key=vm_nextcloud_stop_0 confirmed=true cib-update=5960

Jul 20 17:05:29 [23761] ha-idg-1 pacemakerd:   notice: crm_signal_dispatch: 
Caught 'Terminated' signal | 15 (invoking handler)
systemctl stop pacemaker.service


ha-idg-2:
Jul 20 17:04:03 [10691] ha-idg-2   crmd:   notice: process_lrm_event:   
Result of stop operation for vm_nextcloud on ha-idg-2: 0 (ok) | call=157 
key=vm_nextcloud_stop_0 confirmed=true cib-update=57
the log from ha-idg-2 is two seconds ahead of ha-idg-1

Jul 20 17:04:08 [10688] ha-idg-2   lrmd:   notice: log_execute: 
executing - rsc:vm_nextcloud action:start call_id:192
Jul 20 17:04:09 [10688] ha-idg-2   lrmd:   notice: operation_finished:  
vm_nextcloud_start_0:29107:stderr [ error: Failed to create domain from 
/mnt/share/vm_nextcloud.xml ]
Jul 20 17:04:09 [10688] ha-idg-2   lrmd:   notice: operation_finished:  
vm_nextcloud_start_0:29107:stderr [ error: Cannot access storage file 
'/mnt/mcd/AG_BioInformatik/Technik/software_und_treiber/linux/ubuntu/ubuntu-18.04.4-live-server-amd64.iso':
 No such file or directory ]
Jul 20 17:04:09 [10688] ha-idg-2   lrmd:   notice: operation_finished:  
vm_nextcloud_start_0:29107:stderr [ ocf-exit-reason:Failed to start virtual 
domain vm_nextcloud. ]
Jul 20 17:04:09 [10688] ha-idg-2   lrmd:   notice: log_finished:
finished - rsc:vm_nextcloud action:start call_id:192 pid:29107 exit-code:1 
exec-time:581ms queue-time:0ms
start on ha-idg-2 failed
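
The reason is visible in the error above: the backing ISO file is not accessible
on ha-idg-2. A quick check on the target node would be something like the
following (sketch only; path taken from the log):

ls -l '/mnt/mcd/AG_BioInformatik/Technik/software_und_treiber/linux/ubuntu/ubuntu-18.04.4-live-server-amd64.iso'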

Jul 20 17:05:32 [10691] ha-idg-2   crmd: info: do_dc_takeover:  Taking 
over DC status for this partition
ha-idg-1 stopped pacemaker

Jul 20 17:05:33 [10690] ha-idg-2pengine:  warning: unpack_rsc_op_failure:   
Processing failed migrate_to of vm_nextcloud on ha-idg-1: unknown error | rc=1
Jul 20 17:05:33 [10690] ha-idg-2pengine:  warning: unpack_rsc_op_failure:   
Processing failed start of vm_nextcloud on ha-idg-2: unknown error | rc

Jul 20 17:05:33 [10690] ha-idg-2pengine: info: native_color:
Resource vm_nextcloud cannot run anywhere
logical

Jul 20 17:05:33 [10690] ha-idg-2pengine:  warning: custom_action:   Action 
vm_nextcloud_stop_0 on ha-idg-1 is unrunnable (pending)
???

Jul 20 17:05:35 [10690] ha-idg-2pengine:  warning: custom_action:   Action 
vm_nextcloud_stop_0 on ha-idg-1 is unrunnable (offline)
Jul 20 17:05:35 [10690] ha-idg-2pengine:  warning: pe_fence_node:   Cluster 
node ha-idg-1 will be fenced: resource actions are 

[ClusterLabs] pacemaker together with ovirt or Kimchi ?

2020-07-11 Thread Lentes, Bernd
Hi,

I have a two-node cluster with pacemaker and about 10 virtual domains as
resources.
It's running fine.
I configure and administer everything with the crm shell.
But I'm also looking for a web interface.
I'm not much impressed by HAWK.
Is it possible to use Kimchi or ovirt together with a pacemaker HA cluster?

Bernd


-- 

Bernd Lentes 
Systemadministration 
Institute for Metabolism and Cell Death (MCD) 
Building 25 - office 122 
HelmholtzZentrum münchen 
bernd.len...@helmholtz-muenchen.de 
phone: +49 89 3187 1241 
phone: +49 89 3187 3827 
fax: +49 89 3187 2294 
http://www.helmholtz-muenchen.de/mcd 

stay healthy 
stay at home


Re: [ClusterLabs] Antw: DLM, cLVM, GFS2 and OCFS2 managed by systemd instead of crm ?

2019-10-16 Thread Lentes, Bernd


- On Oct 16, 2019, at 8:27 AM, Digimer li...@alteeve.ca wrote:

> On 2019-10-16 2:16 a.m., Ulrich Windl wrote:
>>>>> "Lentes, Bernd"  wrote on 15.10.2019 at 21:35 in message
>> <1922568650.3402980.1571168140600.javamail.zim...@helmholtz-muenchen.de>:
>>> Hi,
>>>
>>> I'm a big fan of simple solutions (KISS).
>>> Currently I have DLM, cLVM, GFS2 and OCFS2 managed by pacemaker.
>>> They are all fundamental prerequisites for my resources (Virtual Domains).
>>> To configure them I used clones and groups.
>>> Why not have them managed by systemd, to make the cluster setup easier to
>>> oversee?
>>>
>>> Is there a strong reason that pacemaker has to manage them?
>> 
>> AFAIK, DLM (others maybe too) need the cluster infrastructure (communication
>> layer) to be operable.
>> Also I consider systemd handling resources being worse than pacemaker.
>> What is your specific problem? Keeping the cluster configuration simple while
>> moving complexity to systemd?
>> 
>> Do you know one command to describe your systemd configuration as short as 
>> the
>> cluster configuration (like crm configuration show)?
>> 
>> Regards,
>> Ulrich
> 
> This is correct. DLM uses corosync.

OK. I understand. I will stay with pacemaker.
Thanks for all the answers.


Bernd
 



[ClusterLabs] DLM, cLVM, GFS2 and OCFS2 managed by systemd instead of crm ?

2019-10-15 Thread Lentes, Bernd
Hi,

I'm a big fan of simple solutions (KISS).
Currently I have DLM, cLVM, GFS2 and OCFS2 managed by pacemaker.
They are all fundamental prerequisites for my resources (Virtual Domains).
To configure them I used clones and groups.
Why not have them managed by systemd, to make the cluster setup easier to
oversee?

Is there a strong reason that pacemaker has to manage them?
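
For context, the pacemaker side of such a stack is usually a small group wrapped
in a clone (a minimal sketch in crm shell syntax, assuming the
ocf:pacemaker:controld and ocf:heartbeat:clvm agents shipped with SLES 12;
resource names and the device/mount paths are illustrative only):

primitive dlm ocf:pacemaker:controld \
        op monitor interval=60 timeout=60
primitive clvmd ocf:heartbeat:clvm \
        op monitor interval=60 timeout=60
primitive fs_ocfs2 ocf:heartbeat:Filesystem \
        params device="/dev/vg_cluster/lv_share" directory="/mnt/share" fstype="ocfs2" \
        op monitor interval=20 timeout=40
group base-group dlm clvmd fs_ocfs2
clone base-clone base-group \
        meta interleave=true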

Bernd 

-- 

Bernd Lentes 
Systemadministration 
Institut für Entwicklungsgenetik 
Building 35.34 - Room 208 
HelmholtzZentrum münchen 
bernd.len...@helmholtz-muenchen.de 
phone: +49 89 3187 1241 
phone: +49 89 3187 3827 
fax: +49 89 3187 2294 
http://www.helmholtz-muenchen.de/idg 

Only those who make no mistakes are perfect 
So the dead are perfect
 


Re: [ClusterLabs] trace of Filesystem RA does not log

2019-10-14 Thread Lentes, Bernd

>> -Original Message-
>> From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Lentes,
>> Bernd
>> Sent: October 11, 2019 22:32
>> To: Pacemaker ML 
>> Subject: [ClusterLabs] trace of Filesystem RA does not log
>> 
>> Hi,
>> 
>> occasionally the stop of a Filesystem resource for an OCFS2 Partition fails 
>> to
>> stop.
> Which SLE version are you using?
> When ocfs2 file system stop fails, that means the umount process is hung?
> Could you cat that process stack via /proc/xxx/stack?
> Of course, you also can use o2locktop to identify if there is any 
> active/hanged
> dlm lock at that moment.
> 

I'm using SLES 12 SP4. I don't know exactly why umount isn't working or whether
it hangs; that's why I tried to trace the stop operation to get more information.
I will test o2locktop.
What do you mean by "/proc/xxx/stack"?
The stack of which process should I investigate? umount?
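
What is presumably meant: dump the kernel stack of the hung umount process,
roughly like this (a sketch; the PID is hypothetical):

ps -eo pid,stat,cmd | grep '[u]mount'   # find the hung umount and its PID
cat /proc/12345/stack                   # kernel stack of that PID (needs root)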


Bernd
 


Re: [ClusterLabs] trace of Filesystem RA does not log

2019-10-14 Thread Lentes, Bernd



- On Oct 14, 2019, at 6:27 AM, Roger Zhou zz...@suse.com wrote:
> The stop failure is very bad, and is crucial for HA system.

Yes, that's true.
 
> You can try o2locktop cli to find the potential INODE to be blamed[1].
> 
> `o2locktop --help` gives you more usage details

I will try that.

> 
> [1] o2locktop package
> https://software.opensuse.org/package/o2locktop?search_term=o2locktop
> 

Thanks.


Bernd
 



  1   2   3   >