Re: [ClusterLabs] 2-Node cluster - both nodes unclean - can't start cluster
> -----Original Message-----
> From: Reid Wahl
> Sent: Friday, March 10, 2023 10:30 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> ; Lentes, Bernd muenchen.de>
> Subject: Re: [ClusterLabs] 2-Node cluster - both nodes unclean - can't start cluster
>
> On Fri, Mar 10, 2023 at 10:49 AM Lentes, Bernd muenchen.de> wrote:
> > (192.168.100.10:2340) was formed. Members joined: 1084777482
>
> Is this really the corosync node ID of one of your nodes? If not, what's your
> corosync version? Is the number the same every time the issue happens?
> The number is so large and seemingly random that I wonder if there's some
> kind of memory corruption.

Yes, it's correct.

ha-idg-1:~ # crm node show
ha-idg-1(1084777482): member
        maintenance=off standby=off
ha-idg-2(1084777492): member(offline)
        maintenance=off standby=off
ha-idg-1:~ #

> > Cluster node ha-idg-1 is now in unknown state  <== is that the problem?
>
> Probably a normal part of the startup process, but I haven't tested it yet.
>
> > Mar 10 19:36:34 [31046] ha-idg-1 stonith-ng: notice: handle_request:
> > Received manual confirmation that ha-idg-1 is fenced
>
> Yes
>
> > tengine_stonith_notify: We were allegedly just fenced by a human for
> > ha-idg-1!  <== what does that mean? I didn't fence it
>
> It means you ran `stonith_admin -C`
> https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-1.1.24/fencing/remote.c#L945-L961
>
> > Mar 10 19:36:34 [31050] ha-idg-1 crmd: info: crm_xml_cleanup:
> > Cleaning up memory from libxml2
> > Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd: warning: pcmk_child_exit:
> > Shutting cluster down because crmd[31050] had fatal failure  <=== ???
>
> Pacemaker is shutting down on the local node because it just received
> confirmation that it was fenced (because you ran `stonith_admin -C`).
> This is expected behavior.

OK. If it is expected, then it's fine.

> Can you help me understand the issue here? You started the cluster on this node at 19:36:24.
> 10 seconds later, you ran `stonith_admin -C`, and the local node shut down
> Pacemaker, as expected. It doesn't look like Pacemaker stopped until you ran
> that command.

I didn't know that this is expected.

Bernd

Helmholtz Zentrum Muenchen Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH), Ingolstadter Landstr. 1, 85764 Neuherberg, www.helmholtz-munich.de. Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther, Daniela Sommer (kom.) | Aufsichtsratsvorsitzende: Prof. Dr. Veronika von Messling | Registergericht: Amtsgericht Muenchen HRB 6466 | USt-IdNr. DE 129521671

___
Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/
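[Archive editor's note: for readers following this thread, the manual fencing confirmation discussed above can be sketched as below. This is a sketch for Pacemaker 1.1-era tooling; the node name is taken from the thread as an example.]

```shell
# Tell the cluster that node ha-idg-1 has already been fenced (e.g. it was
# power-cycled by hand), without triggering a fence device. As discussed
# above, Pacemaker on the node that is confirmed fenced will shut itself
# down - this is expected behavior.
stonith_admin --confirm ha-idg-1     # long form of `stonith_admin -C ha-idg-1`

# Afterwards the cluster stack on that node can be started again:
systemctl start pacemaker.service
```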
[ClusterLabs] 2-Node cluster - both nodes unclean - can't start cluster
Hi,

I don't get my cluster running. I had problems with an OCFS2 volume, and both nodes have been fenced. When I now do a "systemctl start pacemaker.service", crm_mon shows both nodes as UNCLEAN for a few seconds, then pacemaker stops. I try to confirm the fencing with "stonith_admin -C", but it doesn't work. Maybe the time window is too short, since pacemaker only runs for a few seconds. Here is the log:

Mar 10 19:36:24 [31037] ha-idg-1 corosync notice [MAIN  ] Corosync Cluster Engine ('2.3.6'): started and ready to provide service.
Mar 10 19:36:24 [31037] ha-idg-1 corosync info   [MAIN  ] Corosync built-in features: debug testagents augeas systemd pie relro bindnow
Mar 10 19:36:24 [31037] ha-idg-1 corosync notice [TOTEM ] Initializing transport (UDP/IP Multicast).
Mar 10 19:36:24 [31037] ha-idg-1 corosync notice [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice [TOTEM ] The network interface [192.168.100.10] is now up.
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice [SERV  ] Service engine loaded: corosync configuration map access [0]
Mar 10 19:36:25 [31037] ha-idg-1 corosync info   [QB    ] server name: cmap
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice [SERV  ] Service engine loaded: corosync configuration service [1]
Mar 10 19:36:25 [31037] ha-idg-1 corosync info   [QB    ] server name: cfg
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Mar 10 19:36:25 [31037] ha-idg-1 corosync info   [QB    ] server name: cpg
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice [SERV  ] Service engine loaded: corosync profile loading service [4]
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice [QUORUM] Using quorum provider corosync_votequorum
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice [QUORUM] This node is within the primary component and will provide service.
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice [QUORUM] Members[0]:
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Mar 10 19:36:25 [31037] ha-idg-1 corosync info   [QB    ] server name: votequorum
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Mar 10 19:36:25 [31037] ha-idg-1 corosync info   [QB    ] server name: quorum
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice [TOTEM ] A new membership (192.168.100.10:2340) was formed. Members joined: 1084777482
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice [QUORUM] Members[1]: 1084777482
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice [MAIN  ] Completed service synchronization, ready to provide service.
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd: notice: main: Starting Pacemaker 1.1.24+20210811.f5abda0ee-3.27.1 | build=1.1.24+20210811.f5abda0ee features: generated-manpages agent-manpages ncurses libqb-logging libqb-ipc lha-fencing systemd nagios corosync-native atomic-attrd snmp libesmtp acls cibsecrets
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd: info: main: Maximum core file size is: 18446744073709551615
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd: info: qb_ipcs_us_publish: server name: pacemakerd
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd: info: pcmk__ipc_is_authentic_process_active: Could not connect to lrmd IPC: Connection refused
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd: info: pcmk__ipc_is_authentic_process_active: Could not connect to cib_ro IPC: Connection refused
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd: info: pcmk__ipc_is_authentic_process_active: Could not connect to crmd IPC: Connection refused
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd: info: pcmk__ipc_is_authentic_process_active: Could not connect to attrd IPC: Connection refused
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd: info: pcmk__ipc_is_authentic_process_active: Could not connect to pengine IPC: Connection refused
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd: info: pcmk__ipc_is_authentic_process_active: Could not connect to stonith-ng IPC: Connection refused
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd: info: corosync_node_name: Unable to get node name for nodeid 1084777482
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd: notice: get_node_name: Could not obtain a node name for corosync nodeid 1084777482
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd: info: crm_get_peer: Created entry 3c2499de-58a8-44f7-bf1e-03ff1fbec774/0x1456550 for node (null)/1084777482 (1 total)
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd: info: crm_get_peer: Node 1084777482 has uuid 1084777482
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd: info: crm_update_peer_proc: cluster_connect_cpg: Node (null)[1084777482] - corosync-cpg is now online
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd: notice: cluster_connect_quorum: Quorum
[ClusterLabs] VirtualDomain did not stop although "crm resource stop"
Hi,

i think i found the reason, but i want to be sure. I wanted to stop a VirtualDomain and did a "crm resource stop ...", but it didn't shut down. After waiting several minutes i stopped it with libvirt, circumventing the cluster software. First i wondered "why didn't it shut down?", but then i realized that a live migration of another VirtualDomain was also running and had been started before the stop of the resource. Am i correct that it didn't shut down because of the running live migration, and that it would have started the shutdown once the live migration finished?

Bernd
--
Bernd Lentes
System Administrator
Institute for Metabolism and Cell Death (MCD)
Building 25 - office 122
HelmholtzZentrum München
bernd.len...@helmholtz-muenchen.de
phone: +49 89 3187 1241
       +49 89 3187 49123
fax: +49 89 3187 2294
https://www.helmholtz-munich.de/en/mcd
Public key: 30 82 01 0a 02 82 01 01 00 b3 72 3e ce 2c 0a 6f 58 49 2c 92 23 c7 b9 c1 ff 6c 3a 53 be f7 9e e9 24 b7 49 fa 3c e8 de 28 85 2c d3 ed f7 70 03 3f 4d 82 fc cc 96 4f 18 27 1f df 25 b3 13 00 db 4b 1d ec 7f 1b cf f9 cd e8 5b 1f 11 b3 a7 48 f8 c8 37 ed 41 ff 18 9f d7 83 51 a9 bd 86 c2 32 b3 d6 2d 77 ff 32 83 92 67 9e ae ae 9c 99 ce 42 27 6f bf d8 c2 a1 54 fd 2b 6b 12 65 0e 8a 79 56 be 53 89 70 51 02 6a eb 76 b8 92 25 2d 88 aa 57 08 42 ef 57 fb fe 00 71 8e 90 ef b2 e3 22 f3 34 4f 7b f1 c4 b1 7c 2f 1d 6f bd c8 a6 a1 1f 25 f3 e4 4b 6a 23 d3 d2 fa 27 ae 97 80 a3 f0 5a c4 50 4a 45 e3 45 4d 82 9f 8b 87 90 d0 f9 92 2d a7 d2 67 53 e6 ae 1e 72 3e e9 e0 c9 d3 1c 23 e0 75 78 4a 45 60 94 f8 e3 03 0b 09 85 08 d0 6c f3 ff ce fa 50 25 d9 da 81 7b 2a dc 9e 28 8b 83 04 b4 0a 9f 37 b8 ac 58 f1 38 43 0e 72 af 02 03 01 00 01
Re: [ClusterLabs] crm resource trace
- On 24 Oct, 2022, at 10:08, Klaus Wenninger kwenn...@redhat.com wrote:

> On Mon, Oct 24, 2022 at 9:50 AM Xin Liang via Users <users@clusterlabs.org> wrote:
> Did you try a cleanup in between?

When i do a cleanup before trace/untrace, the resource is not restarted. When i don't do a cleanup, it is restarted.

Bernd
Re: [ClusterLabs] crm resource trace
- On 17 Oct, 2022, at 21:41, Ken Gaillot kgail...@redhat.com wrote:

> This turned out to be interesting.
>
> In the first case, the resource history contains a start action and a
> recurring monitor. The parameters to both change, so the resource
> requires a restart.
>
> In the second case, the resource's history was apparently cleaned at
> some point, so the cluster re-probed it and found it running. That
> means its history contained only the probe and the recurring monitor.
> Neither probe nor recurring monitor changes require a restart, so
> nothing is done.
>
> It would probably make sense to distinguish between probes that found
> the resource running and probes that found it not running. Parameter
> changes in the former should probably be treated like start.

Is that a bug or by design? And what is the conclusion of it all? Do a "crm resource cleanup" before each "crm resource [un]trace"? And test everything with ptest before commit?

Bernd
Re: [ClusterLabs] crm resource trace
- On 17 Oct, 2022, at 21:41, Ken Gaillot kgail...@redhat.com wrote:

> This turned out to be interesting.
>
> In the first case, the resource history contains a start action and a
> recurring monitor. The parameters to both change, so the resource
> requires a restart.
>
> In the second case, the resource's history was apparently cleaned at
> some point, so the cluster re-probed it and found it running. That
> means its history contained only the probe and the recurring monitor.
> Neither probe nor recurring monitor changes require a restart, so
> nothing is done.

"vm-genetrap_monitor_0" - is that a probe?

Bernd
Re: [ClusterLabs] Antw: [EXT] trace of resource ‑ sometimes restart, sometimes not
- On 18 Oct, 2022, at 14:35, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

> # crm configure
> edit ...
> verify
> ptest nograph #!!!
> commit

That's very helpful. I didn't know that, thanks.

> If you used that, you would have seen the restart.
> Despite of that I wonder why enabling tracing to start/stop must induce a
> resource restart.
>
> Bernd, are you sure that was the only thing changed? Do you have a record of
> commands issued?

I'm pretty sure it was the only thing.

Bernd
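[Archive editor's note: Ulrich's recommendation above can be spelled out as the following interactive crmsh session. This is a sketch of the workflow, not a transcript; the point is that ptest previews the transition (including any restarts) before commit.]

```shell
crm configure
# crm(live)configure# edit            <- change the CIB in $EDITOR; stays pending
# crm(live)configure# verify          <- check the pending change for errors
# crm(live)configure# ptest nograph   <- preview which actions the cluster WOULD
#                                        run (e.g. an unexpected resource restart)
# crm(live)configure# commit          <- apply only if the preview looks right
```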
Re: [ClusterLabs] crm resource trace
Hi,

i try to find out why there is sometimes a restart of the resource and sometimes not. Unpredictable behaviour is something i expect from Windows, not from Linux. Here you see two runs of "crm resource trace <resource>". In the first case the resource is restarted, in the second it is not. The command i used is identical in both cases.

ha-idg-2:~/trace-untrace # date; crm resource trace vm-genetrap
Fri Oct 14 19:05:51 CEST 2022
INFO: Trace for vm-genetrap is written to /var/lib/heartbeat/trace_ra/
INFO: Trace set, restart vm-genetrap to trace non-monitor operations

== 1st try:

Oct 14 19:05:52 [25996] ha-idg-1 cib: info: cib_perform_op: Diff: --- 7.28974.3 2
Oct 14 19:05:52 [25996] ha-idg-1 cib: info: cib_perform_op: Diff: +++ 7.28975.0 299af44e1c8a3867f9e7a4b25f2c3d6a
Oct 14 19:05:52 [25996] ha-idg-1 cib: info: cib_perform_op: + /cib: @epoch=28975, @num_updates=0
Oct 14 19:05:52 [25996] ha-idg-1 cib: info: cib_perform_op: ++ /cib/configuration/resources/primitive[@id='vm-genetrap']/operations/op[@id='vm-genetrap-monitor-30']:
Oct 14 19:05:52 [25996] ha-idg-1 cib: info: cib_perform_op: ++
Oct 14 19:05:52 [25996] ha-idg-1 cib: info: cib_perform_op: ++
Oct 14 19:05:52 [25996] ha-idg-1 cib: info: cib_perform_op: ++ /cib/configuration/resources/primitive[@id='vm-genetrap']/operations/op[@id='vm-genetrap-stop-0']:
Oct 14 19:05:52 [25996] ha-idg-1 cib: info: cib_perform_op: ++
Oct 14 19:05:52 [25996] ha-idg-1 cib: info: cib_perform_op: ++
Oct 14 19:05:52 [25996] ha-idg-1 cib: info: cib_perform_op: ++ /cib/configuration/resources/primitive[@id='vm-genetrap']/operations/op[@id='vm-genetrap-start-0']:
Oct 14 19:05:52 [25996] ha-idg-1 cib: info: cib_perform_op: ++
Oct 14 19:05:52 [25996] ha-idg-1 cib: info: cib_perform_op: ++
Oct 14 19:05:52 [25996] ha-idg-1 cib: info: cib_perform_op: ++ /cib/configuration/resources/primitive[@id='vm-genetrap']/operations/op[@id='vm-genetrap-migrate_from-0']:
Oct 14 19:05:52 [25996] ha-idg-1 cib: info: cib_perform_op: ++
Oct 14 19:05:52 [25996] ha-idg-1 cib: info: cib_perform_op: ++
Oct 14 19:05:52 [25996] ha-idg-1 cib: info: cib_perform_op: ++ /cib/configuration/resources/primitive[@id='vm-genetrap']/operations/op[@id='vm-genetrap-migrate_to-0']:
Oct 14 19:05:52 [25996] ha-idg-1 cib: info: cib_perform_op: ++
Oct 14 19:05:52 [25996] ha-idg-1 cib: info: cib_perform_op: ++
Oct 14 19:05:52 [26001] ha-idg-1 crmd: info: abort_transition_graph: Transition 791 aborted by instance_attributes.vm-genetrap-monitor-30-instance_attributes 'create': Configuration change | cib=7.28975.0 source=te_update_diff_v2:483 path=/cib/configuration/resources/primitive[@id='vm-genetrap']/operations/op[@id='vm-genetrap-monitor-30'] complete=true
Oct 14 19:05:52 [26001] ha-idg-1 crmd: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
Oct 14 19:05:52 [25996] ha-idg-1 cib: info: cib_process_request: Completed cib_apply_diff operation for section 'all': OK (rc=0, origin=ha-idg-2/cibadmin/2, version=7.28975.0)
Oct 14 19:05:52 [25997] ha-idg-1 stonith-ng: info: update_cib_stonith_devices_v2: Updating device list from the cib: create op[@id='vm-genetrap-monitor-30']
Oct 14 19:05:52 [25997] ha-idg-1 stonith-ng: info: cib_devices_update: Updating devices to version 7.28975.0
Oct 14 19:05:52 [25997] ha-idg-1 stonith-ng: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 14 19:05:52 [25996] ha-idg-1 cib: info: cib_file_backup: Archived previous version as
Re: [ClusterLabs] crm resource trace (Was: Re: trace of resource - sometimes restart, sometimes not)
- On 7 Oct, 2022, at 21:37, Reid Wahl nw...@redhat.com wrote:

> On Fri, Oct 7, 2022 at 6:02 AM Lentes, Bernd wrote:
>> - On 7 Oct, 2022, at 01:18, Reid Wahl nw...@redhat.com wrote:
>>
>> > How did you set a trace just for monitor?
>>
>> crm resource trace dlm monitor.
>
> crm resource trace adds "trace_ra=1" to the end of the
> monitor operation:
> https://github.com/ClusterLabs/crmsh/blob/8cf6a9d13af6496fdd384c18c54680ceb354b72d/crmsh/ui_resource.py#L638-L646
>
> That's a schema violation and pcs doesn't even allow it. I installed
> `crmsh` and tried to reproduce... `trace_ra=1` shows up in the
> configuration for the monitor operation but it gets ignored. I don't
> get *any* trace logs. That makes sense -- ocf-shellfuncs.in enables
> tracing only if OCF_RESKEY_trace_ra is true. Pacemaker doesn't add
> operation attributes to the OCF_RESKEY_* environment variables... at
> least in the current upstream main.
>
> Apparently (since you got logs) this works in some way, or worked at
> some point in the past. Out of curiosity, what version are you on?

SLES 12 SP5:

ha-idg-1:/usr/lib/ocf/resource.d/heartbeat # rpm -qa|grep -iE 'pacemaker|corosync'
libpacemaker3-1.1.24+20210811.f5abda0ee-3.21.9.x86_64
corosync-2.3.6-9.22.1.x86_64
pacemaker-debugsource-1.1.23+20200622.28dd98fad-3.9.2.20591.0.PTF.1177212.x86_64
libcorosync4-2.3.6-9.22.1.x86_64
pacemaker-cli-1.1.24+20210811.f5abda0ee-3.21.9.x86_64
pacemaker-cts-1.1.24+20210811.f5abda0ee-3.21.9.x86_64
pacemaker-1.1.24+20210811.f5abda0ee-3.21.9.x86_64

Bernd
Re: [ClusterLabs] trace of resource - sometimes restart, sometimes not
- On 7 Oct, 2022, at 01:08, Ken Gaillot kgail...@redhat.com wrote:

> Yes, trace_ra is an agent-defined resource parameter, not a Pacemaker-
> defined meta-attribute. Resources are restarted anytime a parameter
> changes (unless the parameter is set up for reloads).
>
> trace_ra is unusual in that it's supported automatically by the OCF
> shell functions, rather than by the agents directly. That means it's
> not advertised in metadata. Otherwise agents could mark it as
> reloadable, and reload would be a quick no-op.

OK. But why no restart if i just set "crm resource trace dlm monitor"?

Bernd
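[Archive editor's note: to make the trace_ra mechanism in this thread concrete - per the crmsh source linked later in the thread, `crm resource trace dlm monitor` adds a trace_ra instance attribute only to the monitor operation, roughly like the CIB fragment below. This is an illustrative sketch; the ids are abbreviated and not taken from the poster's actual CIB.]

```
<op id="dlm-monitor-60" name="monitor" interval="60">
  <instance_attributes id="dlm-monitor-60-instance_attributes">
    <nvpair id="..." name="trace_ra" value="1"/>
  </instance_attributes>
</op>
```

Because only a recurring-monitor parameter changes, Pacemaker sees no change to the start operation's definition, which fits Ken's explanation of why no restart occurs in that case.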
[ClusterLabs] expected_votes in cluster conf
Dear all,

while checking my cluster with "crm status xml" i stumbled across:

ha-idg-1:/usr/lib/ocf/resource.d/heartbeat # crm status xml
[...] expected_votes="unknown" <= ?

I didn't find expected_votes in the pacemaker doc (https://clusterlabs.org/pacemaker/doc/deprecated/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html). It's a setting for corosync.conf, and in mine it is: expected_votes: 2. I have a two-node cluster. I don't know where expected_votes="unknown" comes from in my case, maybe a typo. Can you confirm that it isn't an option for the pacemaker configuration? Or maybe an undocumented one? Thanks.

Bernd
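[Archive editor's note: for context, expected_votes is a corosync votequorum setting, not a Pacemaker cluster property. In a two-node cluster the relevant stanza of /etc/corosync/corosync.conf typically looks like the sketch below; the two_node line is the usual companion setting and is an assumption here, not quoted from the poster's configuration.]

```
quorum {
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
}
```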
Re: [ClusterLabs] trace of resource - sometimes restart, sometimes not
- On 7 Oct, 2022, at 01:18, Reid Wahl nw...@redhat.com wrote:

> How did you set a trace just for monitor?

crm resource trace dlm monitor.

> Wish I could help with that -- it's mostly a mystery to me too ;)

:-))
[ClusterLabs] trace of resource - sometimes restart, sometimes not
Hi,

i have some problems with our DLM, so i wanted to trace it. Yesterday i just set a trace for "monitor". No restart of DLM afterwards. It went fine, as expected; i got logs in /var/lib/heartbeat/trace_ra. After some monitor operations i stopped tracing. Today i set a trace for all operations. Now resource DLM restarted:

* Restart dlm:0 ( ha-idg-1 ) due to resource definition change

I didn't expect that, so i had some trouble. Is the difference in this behaviour intentional? If yes, why? Is there a rule?

Furthermore i'd like to ask where i can find more information about DLM, because it is a mystery to me. Sometimes the DLM does not respond to the "monitor", so it needs to be restarted, and therefore all dependent resources as well (which is a lot). This happens under some load (although not completely overwhelmed).

Thanks.

Bernd
Re: [ClusterLabs] Cluster does not start resources
- On 24 Aug, 2022, at 16:26, kwenning kwenn...@redhat.com wrote:

>> if I get Ulrich right - and my fading memory of when I really used crmsh the
>> last time is telling me the same thing ...

I get the impression many people prefer pcs to crm. Is there any reason for that? And can i use pcs on SUSE? If yes, how?

Bernd
Re: [ClusterLabs] Cluster does not start resources
- On 24 Aug, 2022, at 16:26, kwenning kwenn...@redhat.com wrote:

> Guess the resources running now are those you tried to enable before
> while they were globally stopped

No. First i set stop-all-resources to false. Then SOME resources started. Then i tried several times to start some VirtualDomains using "crm resource start", which didn't succeed. Some time later i tried it again and it succeeded ...

Bernd
Re: [ClusterLabs] Antw: [EXT] Re: Cluster does not start resources
- On 24 Aug, 2022, at 16:01, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

>> Now with "crm resource start" all resources started. I didn't change
>> anything !?!
>
> I guess that command set the roles of all resources to "started", so you changed
> something ;-)

I did it before and nothing happened ...

Bernd
Re: [ClusterLabs] Cluster does not start resources
Hi,

now with "crm resource start" all resources started. I didn't change anything !?!

Bernd
Re: [ClusterLabs] Cluster does not start resources
- On 24 Aug, 2022, at 07:21, Reid Wahl nw...@redhat.com wrote:

> As a result, your command might start the virtual machines, but
> Pacemaker will still show that the resources are "Stopped (disabled)".
> To fix that, you'll need to enable the resources.

How do i achieve that?

Bernd
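[Archive editor's note: in crmsh, "enabling" a disabled resource means clearing its target-role=Stopped meta-attribute, which is what `crm resource start` does. A sketch, using vm-mausdb (a resource name from this thread) as the example; the explicit meta form is shown for clarity and assumes current crmsh syntax.]

```shell
# Set target-role back to Started so the cluster is allowed to run it:
crm resource start vm-mausdb

# Equivalent, more explicit form - set the meta-attribute directly:
crm resource meta vm-mausdb set target-role Started
```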
Re: [ClusterLabs] Antw: [EXT] Re: Cluster does not start resources
- On 24 Aug, 2022, at 08:17, Reid Wahl nw...@redhat.com wrote:

> I'm not sure off the top of my head what (if anything) gets sent to
> the logs. Do note that Bernd is using pacemaker v1, which hasn't been
> receiving new features for quite a while.

So an update is recommended?

Bernd
Re: [ClusterLabs] Antw: [EXT] Re: Cluster does not start resources
- On 24 Aug, 2022, at 08:10, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

> Bernd,
>
> that command would simply set the role to "started", but I guess it already is.
> Obviously, to be effective, stop-all-resources must have precedence. You see?
>
> Regards,
> Ulrich

Yes, but i set stop-all-resources to false. Some resources started afterwards, some didn't. And the target-role for all "disabled" VirtualDomains is still stopped.

Bernd
Re: [ClusterLabs] Cluster does not start resources
> There is no resource with the name "virtual_domain" in your list. All
> non-active resources in your list are either disabled or unmanaged.
> Without actual commands that list resource state before "crm resource
> start", "crm resource start" itself and once more resource state after
> this command any answer will be just a wild guess.

"crm resource start virtual_domain" is just an example; virtual_domain is a placeholder for the name of the VM.

Bernd
Re: [ClusterLabs] Cluster does not start resources
- On 24 Aug, 2022, at 07:22, Reid Wahl nw...@redhat.com wrote:

> Are the VMs running after your start command?

No.

Bernd
Re: [ClusterLabs] Cluster does not start resources
- On 24 Aug, 2022, at 07:03, arvidjaar arvidj...@gmail.com wrote:

> On 24.08.2022 07:34, Lentes, Bernd wrote:
>>
>> - On 24 Aug, 2022, at 05:33, Reid Wahl nw...@redhat.com wrote:
>>
>>> The stop-all-resources cluster property is set to true. Is that intentional?
>> OMG. Thanks Reid !
>>
>> But unfortunately not all virtual domains are running:
>
> what exactly is not clear in this output? All these resources are
> explicitly disabled (target-role=stopped) and so will not be started.

That's clear. But a manual "crm resource start virtual_domain" should start them, and it doesn't.

Bernd
Re: [ClusterLabs] Cluster does not start resources
- On 24 Aug, 2022, at 05:33, Reid Wahl nw...@redhat.com wrote:

> The stop-all-resources cluster property is set to true. Is that intentional?

OMG. Thanks, Reid! But unfortunately not all virtual domains are running:

Stack: corosync
Current DC: ha-idg-2 (version 1.1.24+20210811.f5abda0ee-3.21.9-1.1.24+20210811.f5abda0ee) - partition with quorum
Last updated: Wed Aug 24 06:14:37 2022
Last change: Wed Aug 24 06:04:24 2022 by root via cibadmin on ha-idg-1

2 nodes configured
40 resource instances configured (21 DISABLED)

Node ha-idg-1: online
        fence_ilo_ha-idg-2  (stonith:fence_ilo2):           Started  (fences ha-idg-2 via ILO)
        dlm                 (ocf::pacemaker:controld):      Started
        clvmd               (ocf::heartbeat:clvm):          Started
        vm-mausdb           (ocf::lentes:VirtualDomain):    Started
        fs_ocfs2            (ocf::lentes:Filesystem.new):   Started
        vm-nc-mcd           (ocf::lentes:VirtualDomain):    Started
        fs_test_ocfs2       (ocf::lentes:Filesystem.new):   Started
        gfs2_snap           (ocf::heartbeat:Filesystem):    Started
        gfs2_share          (ocf::heartbeat:Filesystem):    Started
Node ha-idg-2: online
        fence_ilo_ha-idg-1  (stonith:fence_ilo4):           Started  (fences ha-idg-1 via ILO)
        clvmd               (ocf::heartbeat:clvm):          Started
        dlm                 (ocf::pacemaker:controld):      Started
        vm-sim              (ocf::lentes:VirtualDomain):    Started
        gfs2_snap           (ocf::heartbeat:Filesystem):    Started
        fs_ocfs2            (ocf::lentes:Filesystem.new):   Started
        gfs2_share          (ocf::heartbeat:Filesystem):    Started
        vm-seneca           (ocf::lentes:VirtualDomain):    Started
        vm-ssh              (ocf::lentes:VirtualDomain):    Started

Inactive resources:

 Clone Set: ClusterMon-clone [ClusterMon-SMTP]
     Stopped (disabled): [ ha-idg-1 ha-idg-2 ]
 vm-geneious             (ocf::lentes:VirtualDomain):    Stopped (disabled)
 vm-idcc-devel           (ocf::lentes:VirtualDomain):    Stopped (disabled)
 vm-genetrap             (ocf::lentes:VirtualDomain):    Stopped (disabled)
 vm-mouseidgenes         (ocf::lentes:VirtualDomain):    Stopped (disabled)
 vm-greensql             (ocf::lentes:VirtualDomain):    Stopped (disabled)
 vm-severin              (ocf::lentes:VirtualDomain):    Stopped (disabled)
 ping_19216810010        (ocf::pacemaker:ping):          Stopped (disabled)
 ping_19216810020        (ocf::pacemaker:ping):          Stopped (disabled)
 vm_crispor              (ocf::heartbeat:VirtualDomain): Stopped (unmanaged)
 vm-dietrich             (ocf::lentes:VirtualDomain):    Stopped (disabled)
 vm-pathway              (ocf::lentes:VirtualDomain):    Stopped (disabled)
 vm-crispor-server       (ocf::lentes:VirtualDomain):    Stopped (disabled)
 vm-geneious-license     (ocf::lentes:VirtualDomain):    Stopped (disabled)
 vm-amok                 (ocf::lentes:VirtualDomain):    Stopped (disabled)
 vm-geneious-license-mcd (ocf::lentes:VirtualDomain):    Stopped (disabled)
 vm-documents-oo         (ocf::lentes:VirtualDomain):    Stopped (disabled)
 vm_snipanalysis         (ocf::lentes:VirtualDomain):    Stopped (disabled, unmanaged)
 vm-photoshop            (ocf::lentes:VirtualDomain):    Stopped (disabled)
 vm-check-mk             (ocf::lentes:VirtualDomain):    Stopped (disabled)
 vm-encore               (ocf::lentes:VirtualDomain):    Stopped (disabled)

Migration Summary:
* Node ha-idg-1:
* Node ha-idg-2:

Also a manual "crm resource start" wasn't successful.

Bernd

smime.p7s
Description: S/MIME Cryptographic Signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
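For anyone hitting the same symptom: the stop-all-resources property can be inspected and cleared from the crm shell. A minimal sketch (crmsh syntax; run on any cluster node — the property name is the real Pacemaker cluster option, the rest is illustrative):

```
# Show the current cluster properties (stop-all-resources lives in
# the cib-bootstrap-options property set):
crm configure show cib-bootstrap-options

# Clear the property so resources are allowed to start again:
crm configure property stop-all-resources=false
```

Resources that are individually disabled (target-role=Stopped) still stay down after this; those need a per-resource "crm resource start".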
Re: [ClusterLabs] Cluster does not start resources
- On 24 Aug, 2022, at 04:04, Reid Wahl nw...@redhat.com wrote:

> Can you share your CIB? Not sure off hand what everything means (resource not
> found, IPC error, crmd failure and respawn), and pacemaker v1 logs aren't the
> easiest to interpret. But perhaps something in the CIB will show itself as an
> issue.

Attached

> --
> Regards,
> Reid Wahl (He/Him)
> Senior Software Engineer, Red Hat
> RHEL High Availability - Pacemaker

cib.xml
Description: XML document
[ClusterLabs] Cluster does not start resources
Hi,

currently I can't start resources on our 2-node cluster. The cluster itself seems to be OK:

Stack: corosync
Current DC: ha-idg-1 (version 1.1.24+20210811.f5abda0ee-3.21.9-1.1.24+20210811.f5abda0ee) - partition with quorum
Last updated: Wed Aug 24 02:56:46 2022
Last change: Wed Aug 24 02:56:41 2022 by hacluster via crmd on ha-idg-1

2 nodes configured
40 resource instances configured (26 DISABLED)

Node ha-idg-1: online
Node ha-idg-2: online

Inactive resources:

 fence_ilo_ha-idg-2      (stonith:fence_ilo2):           Stopped
 fence_ilo_ha-idg-1      (stonith:fence_ilo4):           Stopped
 Clone Set: cl_share [gr_share]
     Stopped: [ ha-idg-1 ha-idg-2 ]
 Clone Set: ClusterMon-clone [ClusterMon-SMTP]
     Stopped (disabled): [ ha-idg-1 ha-idg-2 ]
 vm-mausdb               (ocf::lentes:VirtualDomain):    Stopped (disabled)
 vm-sim                  (ocf::lentes:VirtualDomain):    Stopped (disabled)
 vm-geneious             (ocf::lentes:VirtualDomain):    Stopped (disabled)
 vm-idcc-devel           (ocf::lentes:VirtualDomain):    Stopped (disabled)
 vm-genetrap             (ocf::lentes:VirtualDomain):    Stopped (disabled)
 vm-mouseidgenes         (ocf::lentes:VirtualDomain):    Stopped (disabled)
 vm-greensql             (ocf::lentes:VirtualDomain):    Stopped (disabled)
 vm-severin              (ocf::lentes:VirtualDomain):    Stopped (disabled)
 ping_19216810010        (ocf::pacemaker:ping):          Stopped (disabled)
 ping_19216810020        (ocf::pacemaker:ping):          Stopped (disabled)
 vm_crispor              (ocf::heartbeat:VirtualDomain): Stopped (unmanaged)
 vm-dietrich             (ocf::lentes:VirtualDomain):    Stopped (disabled)
 vm-pathway              (ocf::lentes:VirtualDomain):    Stopped (disabled)
 vm-crispor-server       (ocf::lentes:VirtualDomain):    Stopped (disabled)
 vm-geneious-license     (ocf::lentes:VirtualDomain):    Stopped (disabled)
 vm-nc-mcd               (ocf::lentes:VirtualDomain):    Stopped (disabled, unmanaged)
 vm-amok                 (ocf::lentes:VirtualDomain):    Stopped (disabled)
 vm-geneious-license-mcd (ocf::lentes:VirtualDomain):    Stopped (disabled)
 vm-documents-oo         (ocf::lentes:VirtualDomain):    Stopped (disabled)
 fs_test_ocfs2           (ocf::lentes:Filesystem.new):   Stopped
 vm-ssh                  (ocf::lentes:VirtualDomain):    Stopped (disabled)
 vm_snipanalysis         (ocf::lentes:VirtualDomain):    Stopped (disabled, unmanaged)
 vm-seneca               (ocf::lentes:VirtualDomain):    Stopped (disabled)
 vm-photoshop            (ocf::lentes:VirtualDomain):    Stopped (disabled)
 vm-check-mk             (ocf::lentes:VirtualDomain):    Stopped (disabled)
 vm-encore               (ocf::lentes:VirtualDomain):    Stopped (disabled)

Migration Summary:
* Node ha-idg-1:
* Node ha-idg-2:

Fencing History:
* Off of ha-idg-2 successful: delegate=ha-idg-1, client=crmd.27356, origin=ha-idg-1,
  last-successful='Wed Aug 24 01:53:49 2022'

Trying to start e.g. cl_share, which is a prerequisite for the virtual domains ... nothing happens. I did a "crm resource cleanup" (although crm_mon shows no error) hoping this would help ... it didn't.

My command history (to correlate with the log):

 1471  2022-08-24 03:11:27 crm resource cleanup
 1472  2022-08-24 03:11:52 crm resource cleanup cl_share
 1473  2022-08-24 03:12:45 crm resource start cl_share

I found some weird entries in the log after the "crm resource cleanup":

Aug 24 03:11:28 [27351] ha-idg-1        cib:  warning: do_local_notify: A-Sync reply to crmd failed: No message of desired type
Aug 24 03:11:33 [27351] ha-idg-1        cib:     info: cib_process_ping: Reporting our current digest to ha-idg-1: ed5bb7d32532ebf1ce3c45d0067c55b3 for 7.28627.70 (0x15073e0 0)
Aug 24 03:11:52 [27353] ha-idg-1       lrmd:     info: process_lrmd_get_rsc_info: Resource 'dlm:0' not found (0 active resources)
Aug 24 03:11:52 [27356] ha-idg-1       crmd:   notice: do_lrm_invoke: Not registering resource 'dlm:0' for a delete event | get-rc=-19 (No such device) transition-key=(null)

What does "Resource not found" mean?

...

Aug 24 03:11:57 [27351] ha-idg-1        cib:     info: cib_process_ping: Reporting our current digest to ha-idg-1: 0b3e9ad9ad8103ce2da3b6b8d41e6716 for 7.28628.0 (0x1352bf0 0)
Aug 24 03:11:58 [27356] ha-idg-1       crmd:    error: do_pe_invoke_callback: Could not retrieve the Cluster Information Base: Timer expired | rc=-62 call=222
Aug 24 03:11:58 [27356] ha-idg-1       crmd:     info: register_fsa_error_adv: Resetting the current action list
Aug 24 03:11:58 [27356] ha-idg-1       crmd:    error: do_log: Input I_ERROR received in state S_POLICY_ENGINE from do_pe_invoke_callback
Aug 24 03:11:58 [27356] ha-idg-1       crmd:  warning: do_state_transition: State transition S_POLICY_ENGINE -> S_RECOVERY | input=I_ERROR cause=C_FSA_INTERNAL origin=do_pe_invoke_callback
Aug 24 03:11:58 [27356] ha-idg-1       crmd:  warning: do_recover: Fast-tracking shutdown in response to errors
Aug 24 03:11:58 [27356] ha-idg-1       crmd:  warning: do_election_vote: Not voting in election, we're in state S_RECOVERY
Aug 24 03:11:58 [27356] ha-idg-1       crmd:     info: do_dc_release: DC role
Re: [ClusterLabs] 2-Node Cluster - fencing with just one node running ?
- On 4 Aug, 2022, at 19:46, Reid Wahl nw...@redhat.com wrote:

>> It shuts down ha-idg-2:
>> 2022-08-03T01:19:51.866200+02:00 ha-idg-2 systemd-logind[1535]: Power key pressed.
>> 2022-08-03T01:19:52.048335+02:00 ha-idg-2 systemd-logind[1535]: System is powering down.
>> 2022-08-03T01:19:52.051815+02:00 ha-idg-2 systemd[1]: Stopped target resource-agents dependencies.
>> ...
>
> Yes, but it thought it was shutting down ha-idg-1.
>
>> Then it stops cluster software on ha-idg-1:
>> Aug 03 01:19:58 [19361] ha-idg-1 pacemakerd:  warning: pcmk_child_exit: Shutting cluster down because crmd[19368] had fatal failure
>> Aug 03 01:19:58 [19361] ha-idg-1 pacemakerd:   notice: pcmk_shutdown_worker: Shutting down Pacemaker
>> Aug 03 01:19:58 [19361] ha-idg-1 pacemakerd:   notice: stop_child: Stopping pengine | sent signal 15 to process 19367
>> ...
>
> Node ha-idg-1 received a notification from the fencer that said "hey,
> we just fenced ha-idg-1!" Then it said "oh no, that's me! I'll shut
> myself down now."
>
> That can be helpful if we're using fabric fencing. That's not supposed
> to happen with power fencing. The shutdown on ha-idg-1 didn't hurt
> anything, but it should have gotten powered off (instead of powering
> off ha-idg-2).

What is "fabric" fencing and what is "power" fencing? Is fabric fencing something like ILO or IPMI? And is power fencing cutting off power via a controllable UPS or power switches?

Bernd
Re: [ClusterLabs] 2-Node Cluster - fencing with just one node running ?
- On 4 Aug, 2022, at 15:14, arvidjaar arvidj...@gmail.com wrote:

> On 04.08.2022 16:06, Lentes, Bernd wrote:
>>
>> - On 4 Aug, 2022, at 00:27, Reid Wahl nw...@redhat.com wrote:
>>
>> What do you mean by "banned"? "crm resource ban ..."?
>> Is that something different than a location constraint?
>
> "crm resource ban" creates a location constraint, but not every location
> constraint is created by "crm resource ban".

OK.

It seems that the cluster realizes that something went wrong. It wants to shut down ha-idg-1:

Aug 03 01:19:12 [19367] ha-idg-1    pengine:  warning: pe_fence_node: Cluster node ha-idg-1 will be fenced: vm-mausdb failed there
Aug 03 01:19:12 [19367] ha-idg-1    pengine:     info: native_stop_constraints: fence_ilo_ha-idg-2_stop_0 is implicit after ha-idg-1 is fenced
Aug 03 01:19:12 [19367] ha-idg-1    pengine:   notice: LogNodeActions:  * Fence (Off) ha-idg-1 'vm-mausdb failed there'
Aug 03 01:19:14 [19367] ha-idg-1    pengine:  warning: pe_fence_node: Cluster node ha-idg-1 will be fenced: vm-mausdb failed there
Aug 03 01:19:15 [19368] ha-idg-1       crmd:   notice: te_fence_node: Requesting fencing (Off) of node ha-idg-1 | action=8 timeout=6
...

It shuts down ha-idg-2:

2022-08-03T01:19:51.866200+02:00 ha-idg-2 systemd-logind[1535]: Power key pressed.
2022-08-03T01:19:52.048335+02:00 ha-idg-2 systemd-logind[1535]: System is powering down.
2022-08-03T01:19:52.051815+02:00 ha-idg-2 systemd[1]: Stopped target resource-agents dependencies.
...

Then it stops the cluster software on ha-idg-1:

Aug 03 01:19:58 [19361] ha-idg-1 pacemakerd:  warning: pcmk_child_exit: Shutting cluster down because crmd[19368] had fatal failure
Aug 03 01:19:58 [19361] ha-idg-1 pacemakerd:   notice: pcmk_shutdown_worker: Shutting down Pacemaker
Aug 03 01:19:58 [19361] ha-idg-1 pacemakerd:   notice: stop_child: Stopping pengine | sent signal 15 to process 19367
...

Bernd
Re: [ClusterLabs] 2-Node Cluster - fencing with just one node running ?
- On 4 Aug, 2022, at 00:27, Reid Wahl nw...@redhat.com wrote:

> Such constraints are unnecessary.
>
> Let's say we have two stonith devices called "fence_dev1" and
> "fence_dev2" that fence nodes 1 and 2, respectively. If node 2 needs
> to be fenced, and fence_dev2 is running on node 2, node 1 will still
> use fence_dev2 to fence node 2. The current location of the stonith
> device only tells us which node is running the recurring monitor
> operation for that stonith device. The device is available to ALL
> nodes, unless it's disabled or it's banned from a given node. So these
> constraints serve no purpose in most cases.

What do you mean by "banned"? "crm resource ban ..."?
Is that something different than a location constraint?

> If you ban fence_dev2 from node 1, then node 1 won't be able to use
> fence_dev2 to fence node 2. Likewise, if you ban fence_dev1 from node
> 1, then node 1 won't be able to use fence_dev1 to fence itself.
> Usually that's unnecessary anyway, but it may be preferable to power
> ourselves off if we're the last remaining node and a stop operation
> fails.

So banning a fencing device from a node means that this node can't use that fencing device?

> If ha-idg-2 is in standby, it can still fence ha-idg-1. Since it
> sounds like you've banned fence_ilo_ha-idg-1 from ha-idg-1, so that it
> can't run anywhere when ha-idg-2 is in standby, I'm not sure off the
> top of my head whether fence_ilo_ha-idg-1 is available in this
> situation. It may not be.

ha-idg-2 was not only in standby, I also stopped pacemaker on that node. Then ha-idg-2 can't fence ha-idg-1, I assume.

> A solution would be to stop banning the stonith devices from their
> respective nodes. Surely if fence_ilo_ha-idg-1 had been running on
> ha-idg-1, ha-idg-2 would have been able to use it to fence ha-idg-1.
> (Again, I'm not sure if that's still true if ha-idg-2 is in standby
> **and** fence_ilo_ha-idg-1 is banned from ha-idg-1.)
>
>> Aug 03 01:19:58 [19364] ha-idg-1 stonith-ng:   notice: log_operation: Operation 'Off' [20705] (call 2 from crmd.19368) for host 'ha-idg-1' with device 'fence_ilo_ha-idg-2' returned: 0 (OK)
>> So the cluster uses the fence resource running on ha-idg-1 and cuts off ha-idg-2, which isn't necessary.
>
> Here, it sounds like the pcmk_host_list setting is either missing or
> misconfigured for fence_ilo_ha-idg-2. fence_ilo_ha-idg-2 should NOT be
> usable for fencing ha-idg-1.
>
> fence_ilo_ha-idg-1 should be configured with pcmk_host_list=ha-idg-1,
> and fence_ilo_ha-idg-2 should be configured with
> pcmk_host_list=ha-idg-2.

I will check that.

> What happened is that ha-idg-1 used fence_ilo_ha-idg-2 to fence
> itself. Of course, this only rebooted ha-idg-2. But based on the
> stonith device configuration, pacemaker on ha-idg-1 believed that
> ha-idg-1 had been fenced. Hence the "allegedly just fenced" message.
>
>> Finally the cluster seems to realize that something went wrong:
>> Aug 03 01:19:58 [19368] ha-idg-1 crmd:  crit: tengine_stonith_notify: We were allegedly just fenced by ha-idg-1 for ha-idg-1!

Bernd
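Reid's pcmk_host_list advice translates to something like the following (crmsh syntax; the ILO addresses and credentials are placeholders, and in practice you would edit the existing primitives rather than re-create them):

```
primitive fence_ilo_ha-idg-1 stonith:fence_ilo4 \
        params ipaddr=<ilo-ip-of-ha-idg-1> login=<user> passwd=<secret> \
               pcmk_host_list=ha-idg-1
primitive fence_ilo_ha-idg-2 stonith:fence_ilo2 \
        params ipaddr=<ilo-ip-of-ha-idg-2> login=<user> passwd=<secret> \
               pcmk_host_list=ha-idg-2
```

With pcmk_host_list set this way, the fencer knows each device can only kill its own target, so fence_ilo_ha-idg-2 can never be selected to fence ha-idg-1.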
[ClusterLabs] 2-Node Cluster - fencing with just one node running ?
Hi,

I have the following situation: a 2-node cluster with just one node running (ha-idg-1); the second node (ha-idg-2) is in standby.

The DLM monitor on ha-idg-1 times out, and the cluster tries to restart all services depending on DLM:

Aug 03 01:07:11 [19367] ha-idg-1    pengine:   notice: LogAction:  * Recover  dlm:0          ( ha-idg-1 )
Aug 03 01:07:11 [19367] ha-idg-1    pengine:   notice: LogAction:  * Restart  clvmd:0        ( ha-idg-1 )  due to required dlm:0 start
Aug 03 01:07:11 [19367] ha-idg-1    pengine:   notice: LogAction:  * Restart  gfs2_share:0   ( ha-idg-1 )  due to required clvmd:0 start
Aug 03 01:07:11 [19367] ha-idg-1    pengine:   notice: LogAction:  * Restart  gfs2_snap:0    ( ha-idg-1 )  due to required gfs2_share:0 start
Aug 03 01:07:11 [19367] ha-idg-1    pengine:   notice: LogAction:  * Restart  fs_ocfs2:0     ( ha-idg-1 )  due to required gfs2_snap:0 start
Aug 03 01:07:11 [19367] ha-idg-1    pengine:     info: LogActions: Leave  dlm:1              (Stopped)
Aug 03 01:07:11 [19367] ha-idg-1    pengine:     info: LogActions: Leave  clvmd:1            (Stopped)
Aug 03 01:07:11 [19367] ha-idg-1    pengine:     info: LogActions: Leave  gfs2_share:1       (Stopped)
Aug 03 01:07:11 [19367] ha-idg-1    pengine:     info: LogActions: Leave  gfs2_snap:1        (Stopped)
Aug 03 01:07:11 [19367] ha-idg-1    pengine:     info: LogActions: Leave  fs_ocfs2:1         (Stopped)
Aug 03 01:07:11 [19367] ha-idg-1    pengine:     info: LogActions: Leave  ClusterMon-SMTP:0  (Stopped)
Aug 03 01:07:11 [19367] ha-idg-1    pengine:     info: LogActions: Leave  ClusterMon-SMTP:1  (Stopped)
Aug 03 01:07:11 [19367] ha-idg-1    pengine:   notice: LogAction:  * Restart  vm-mausdb      ( ha-idg-1 )  due to required cl_share running
Aug 03 01:07:11 [19367] ha-idg-1    pengine:   notice: LogAction:  * Restart  vm-sim         ( ha-idg-1 )  due to required cl_share running
Aug 03 01:07:11 [19367] ha-idg-1    pengine:   notice: LogAction:  * Restart  vm-geneious    ( ha-idg-1 )  due to required cl_share running
Aug 03 01:07:11 [19367] ha-idg-1    pengine:   notice: LogAction:  * Restart  vm-idcc-devel  ( ha-idg-1 )  due to required cl_share running
...

The restart of vm-mausdb failed because its stop timed out:

VirtualDomain(vm-mausdb)[32415]: 2022/08/03_01:19:06 INFO: Issuing forced shutdown (destroy) request for domain vm-mausdb.
Aug 03 01:19:11 [19365] ha-idg-1       lrmd:  warning: child_timeout_callback: vm-mausdb_stop_0 process (PID 32415) timed out
Aug 03 01:19:11 [19365] ha-idg-1       lrmd:  warning: operation_finished: vm-mausdb_stop_0:32415 - timed out after 72ms
...
Aug 03 01:19:14 [19367] ha-idg-1    pengine:  warning: pe_fence_node: Cluster node ha-idg-1 will be fenced: vm-mausdb failed there
Aug 03 01:19:15 [19368] ha-idg-1       crmd:   notice: te_fence_node: Requesting fencing (Off) of node ha-idg-1 | action=8 timeout=6

I have two fencing resources defined, one for ha-idg-1 and one for ha-idg-2. Both are HP ILO network adapters. I have two location constraints: together they ensure that the resource for fencing node ha-idg-1 runs on ha-idg-2 and vice versa. I never thought it would be necessary for a node to be able to fence itself. So now that ha-idg-2 is in standby, there is no fence device left to stonith ha-idg-1.

Aug 03 01:19:58 [19364] ha-idg-1 stonith-ng:   notice: log_operation: Operation 'Off' [20705] (call 2 from crmd.19368) for host 'ha-idg-1' with device 'fence_ilo_ha-idg-2' returned: 0 (OK)

So the cluster uses the fence resource running on ha-idg-1 (fence_ilo_ha-idg-2) and cuts off ha-idg-2, which wasn't necessary.

Finally the cluster seems to realize that something went wrong:

Aug 03 01:19:58 [19368] ha-idg-1       crmd:     crit: tengine_stonith_notify: We were allegedly just fenced by ha-idg-1 for ha-idg-1!

So my question now: is it necessary to have a fencing device with which a node can commit suicide?
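For context, the two flavors of "keep the fence device away from its own node" that are discussed later in this thread look roughly like this in crmsh (the constraint ids and scores are hypothetical, not taken from this cluster's CIB):

```
# Preference form -- harmless; the device stays usable from both nodes,
# only its recurring monitor prefers to run on the peer:
location l_fence1_prefer_node2 fence_ilo_ha-idg-1 100: ha-idg-2

# Ban form -- this is what actually prevents ha-idg-1 from using
# its own fence device when it is the last node standing:
location l_fence1_ban_node1 fence_ilo_ha-idg-1 -inf: ha-idg-1
```

Per the replies in this thread, only the -inf ban takes the device away from the node; a plain preference leaves it available to all nodes regardless of where the monitor runs.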
Bernd -- Bernd Lentes System Administrator Institute for Metabolism and Cell Death (MCD) Building 25 - office 122 HelmholtzZentrum München bernd.len...@helmholtz-muenchen.de phone: +49 89 3187 1241 +49 89 3187 49123 fax: +49 89 3187 2294 http://www.helmholtz-muenchen.de/mcd Public key: 30 82 01 0a 02 82 01 01 00 b3 72 3e ce 2c 0a 6f 58 49 2c 92 23 c7 b9 c1 ff 6c 3a 53 be f7 9e e9 24 b7 49 fa 3c e8 de 28 85 2c d3 ed f7 70 03 3f 4d 82 fc cc 96 4f 18 27 1f df 25 b3 13 00 db 4b 1d ec 7f 1b cf f9 cd e8 5b 1f 11 b3 a7 48 f8 c8 37 ed 41 ff 18 9f d7 83 51 a9 bd 86 c2 32 b3 d6 2d 77 ff 32 83 92 67 9e ae ae 9c 99 ce 42 27 6f bf d8 c2 a1 54 fd 2b 6b 12 65 0e 8a 79 56 be 53 89 70 51 02 6a eb 76 b8 92 25 2d 88 aa 57 08 42 ef 57 fb fe 00 71 8e 90 ef b2 e3 22 f3 34 4f 7b f1 c4 b1 7c 2f 1d 6f bd c8 a6 a1 1f 25 f3 e4 4b 6a 23 d3 d2 fa 27 ae 97 80 a3 f0 5a c4 50 4a 45
[ClusterLabs] cluster log not unambiguous about state of VirtualDomains
Hi,

I found some strange behaviour in the cluster log (/var/log/cluster/corosync.log). I KNOW that I put one node (ha-idg-2) in standby mode and then stopped the pacemaker service on that node. The shell history says:

 993  2022-08-02 18:28:25 crm node standby ha-idg-2
 994  2022-08-02 18:28:58 systemctl stop pacemaker.service

Later on I had some trouble with high load, and I found contradictory entries in the log on the DC (ha-idg-1):

Aug 03 00:14:04 [19367] ha-idg-1    pengine:     info: determine_op_status: Operation monitor found resource vm-documents-oo active on ha-idg-1
Aug 03 00:14:04 [19367] ha-idg-1    pengine:     info: determine_op_status: Operation monitor found resource vm-documents-oo active on ha-idg-1
Aug 03 00:14:04 [19367] ha-idg-1    pengine:     info: determine_op_status: Operation monitor found resource vm-mausdb active on ha-idg-1
Aug 03 00:14:04 [19367] ha-idg-1    pengine:     info: determine_op_status: Operation monitor found resource vm-mausdb active on ha-idg-1
Aug 03 00:14:04 [19367] ha-idg-1    pengine:     info: determine_op_status: Operation monitor found resource vm-photoshop active on ha-idg-1
Aug 03 00:14:04 [19367] ha-idg-1    pengine:     info: determine_op_status: Operation monitor found resource vm-photoshop active on ha-idg-1
Aug 03 00:14:04 [19367] ha-idg-1    pengine:     info: determine_op_status: Operation monitor found resource vm-encore active on ha-idg-1
Aug 03 00:14:04 [19367] ha-idg-1    pengine:     info: determine_op_status: Operation monitor found resource vm-encore active on ha-idg-1
Aug 03 00:14:04 [19367] ha-idg-1    pengine:     info: determine_op_status: Operation monitor found resource dlm:1 active on ha-idg-2
Aug 03 00:14:04 [19367] ha-idg-1    pengine:     info: determine_op_status: Operation monitor found resource vm-seneca active on ha-idg-2  <===
Aug 03 00:14:04 [19367] ha-idg-1    pengine:     info: determine_op_status: Operation monitor found resource vm-pathway active on ha-idg-2  <===
Aug 03 00:14:04 [19367] ha-idg-1    pengine:     info: determine_op_status: Operation monitor found resource vm-dietrich active on ha-idg-2  <===
Aug 03 00:14:04 [19367] ha-idg-1    pengine:     info: determine_op_status: Operation monitor found resource vm-sim active on ha-idg-2  <===
Aug 03 00:14:04 [19367] ha-idg-1    pengine:     info: determine_op_status: Operation monitor found resource vm-ssh active on ha-idg-2  <===
Aug 03 00:14:04 [19367] ha-idg-1    pengine:     info: determine_op_status: Operation monitor found resource vm-nextcloud active on ha-idg-2  <===
Aug 03 00:14:04 [19367] ha-idg-1    pengine:     info: determine_op_status: Operation monitor found resource fs_ocfs2:1 active on ha-idg-2
Aug 03 00:14:04 [19367] ha-idg-1    pengine:     info: determine_op_status: Operation monitor found resource gfs2_share:1 active on ha-idg-2
Aug 03 00:14:04 [19367] ha-idg-1    pengine:     info: determine_op_status: Operation monitor found resource vm-geneious active on ha-idg-2  <===
Aug 03 00:14:04 [19367] ha-idg-1    pengine:     info: determine_op_status: Operation monitor found resource gfs2_snap:1 active on ha-idg-2  <===
Aug 03 00:14:04 [19367] ha-idg-1    pengine:     info: determine_op_status: Operation monitor found resource vm-geneious-license-mcd active on ha-idg-2  <===
Aug 03 00:14:04 [19367] ha-idg-1    pengine:     info: determine_op_status: Operation monitor found resource clvmd:1 active on ha-idg-2

The log says some VirtualDomains are running on ha-idg-2 !?!

But just a few lines later the log says all VirtualDomains are running on ha-idg-1:

Aug 03 00:14:04 [19367] ha-idg-1    pengine:     info: common_print: vm-mausdb       (ocf::lentes:VirtualDomain):    Started ha-idg-1
Aug 03 00:14:04 [19367] ha-idg-1    pengine:     info: common_print: vm-sim          (ocf::lentes:VirtualDomain):    Started ha-idg-1  <===
Aug 03 00:14:04 [19367] ha-idg-1    pengine:     info: common_print: vm-geneious     (ocf::lentes:VirtualDomain):    Started ha-idg-1  <===
Aug 03 00:14:04 [19367] ha-idg-1    pengine:     info: common_print: vm-idcc-devel   (ocf::lentes:VirtualDomain):    Started ha-idg-1
Aug 03 00:14:04 [19367] ha-idg-1    pengine:     info: common_print: vm-genetrap     (ocf::lentes:VirtualDomain):    Started ha-idg-1
Aug 03 00:14:04 [19367] ha-idg-1    pengine:     info: common_print: vm-mouseidgenes (ocf::lentes:VirtualDomain):    Started ha-idg-1
Aug 03 00:14:04 [19367] ha-idg-1    pengine:     info: common_print: vm-greensql     (ocf::lentes:VirtualDomain):    Started ha-idg-1
Aug 03 00:14:04 [19367] ha-idg-1    pengine:     info: common_print: vm-severin      (ocf::lentes:VirtualDomain):    Started ha-idg-1
Aug 03 00:14:04 [19367] ha-idg-1    pengine:     info: common_print: ping_19216810010 (ocf::pacemaker:ping):         Stopped (disabled)
Aug 03 00:14:04 [19367] ha-idg-1    pengine:
Re: [ClusterLabs] [EXT] Problem with DLM
- On 26 Jul, 2022, at 20:06, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

> Hi Bernd!
>
> I think the answer may be some time before the timeout was reported; maybe a
> network issue? Or a very high load. It's hard to say from the logs...

Yes, I had a high load before:

Jul 20 00:17:42 [32512] ha-idg-1       crmd:   notice: throttle_check_thresholds: High CPU load detected: 90.080002
Jul 20 00:18:12 [32512] ha-idg-1       crmd:   notice: throttle_check_thresholds: High CPU load detected: 76.169998
Jul 20 00:18:42 [32512] ha-idg-1       crmd:   notice: throttle_check_thresholds: High CPU load detected: 85.629997
Jul 20 00:19:12 [32512] ha-idg-1       crmd:   notice: throttle_check_thresholds: High CPU load detected: 70.660004
Jul 20 00:19:42 [32512] ha-idg-1       crmd:   notice: throttle_check_thresholds: High CPU load detected: 58.34
Jul 20 00:20:12 [32512] ha-idg-1       crmd:     info: throttle_check_thresholds: Moderate CPU load detected: 48.740002
Jul 20 00:20:12 [32512] ha-idg-1       crmd:     info: throttle_send_command: New throttle mode: 0010 (was 0100)
Jul 20 00:20:42 [32512] ha-idg-1       crmd:     info: throttle_check_thresholds: Moderate CPU load detected: 41.88
Jul 20 00:21:12 [32512] ha-idg-1       crmd:     info: throttle_send_command: New throttle mode: 0001 (was 0010)
Jul 20 00:21:56 [12204] ha-idg-1       lrmd:  warning: child_timeout_callback: dlm_monitor_3 process (PID 11816) timed out
Jul 20 00:21:56 [12204] ha-idg-1       lrmd:  warning: operation_finished: dlm_monitor_3:11816 - timed out after 2ms
Jul 20 00:21:56 [32512] ha-idg-1       crmd:    error: process_lrm_event: Result of monitor operation for dlm on ha-idg-1: Timed Out | call=1255 key=dlm_monitor_3 timeout=2ms
Jul 20 00:21:56 [32512] ha-idg-1       crmd:     info: exec_alert_list: Sending resource alert via smtp_alert to informatic@helmholtz-muenchen.de
Jul 20 00:21:56 [12204] ha-idg-1       lrmd:     info: process_lrmd_alert_exec: Executing alert smtp_alert for 8f934e90-12f5-4bad-b4f4-55ac933f01c6

Can that interfere with DLM?

Bernd
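If load spikes are the root cause, one common mitigation is a more generous monitor timeout on the dlm primitive, so a slow-but-healthy check is not declared dead. A sketch (crmsh syntax; the interval and timeout values are illustrative, not recommendations — pick them from your own worst-case load measurements):

```
primitive dlm ocf:pacemaker:controld \
        op monitor interval=30s timeout=90s \
        op start timeout=90s \
        op stop timeout=100s
```

This only papers over the symptom, of course; it does not remove the underlying load problem, and an over-long timeout delays detection of a real DLM failure.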
[ClusterLabs] Problem with DLM
Hi,

it seems my DLM has gone crazy.

/var/log/cluster/corosync.log:

Jul 20 00:21:56 [12204] ha-idg-1       lrmd:  warning: child_timeout_callback: dlm_monitor_3 process (PID 11816) timed out
Jul 20 00:21:56 [12204] ha-idg-1       lrmd:  warning: operation_finished: dlm_monitor_3:11816 - timed out after 2ms
Jul 20 00:21:56 [32512] ha-idg-1       crmd:    error: process_lrm_event: Result of monitor operation for dlm on ha-idg-1: Timed Out | call=1255 key=dlm_monitor_3 timeout=2ms
Jul 20 00:21:56 [32512] ha-idg-1       crmd:     info: exec_alert_list: Sending resource alert via smtp_alert to informatic@helmholtz-muenchen.de

/var/log/messages:

2022-07-20T00:21:56.644677+02:00 ha-idg-1 Cluster: alert_smtp.sh
2022-07-20T00:22:16.076936+02:00 ha-idg-1 kernel: [2366794.757496] dlm: FD5D3C7CE9104CF5916A84DA0DBED302: leaving the lockspace group...
2022-07-20T00:22:16.364971+02:00 ha-idg-1 kernel: [2366795.045657] dlm: FD5D3C7CE9104CF5916A84DA0DBED302: group event done 0 0
2022-07-20T00:22:16.364982+02:00 ha-idg-1 kernel: [2366795.045777] dlm: FD5D3C7CE9104CF5916A84DA0DBED302: release_lockspace final free
2022-07-20T00:22:15.533571+02:00 ha-idg-1 Cluster: message repeated 22 times: [ alert_smtp.sh]
2022-07-20T00:22:17.164442+02:00 ha-idg-1 ocfs2_hb_ctl[19106]: ocfs2_hb_ctl /sbin/ocfs2_hb_ctl -K -u FD5D3C7CE9104CF5916A84DA0DBED302
2022-07-20T00:22:18.904936+02:00 ha-idg-1 kernel: [2366797.586278] ocfs2: Unmounting device (254,24) on (node 1084777482)
2022-07-20T00:22:19.116701+02:00 ha-idg-1 Cluster: alert_smtp.sh

What do these kernel messages mean? Why did DLM stop? I think this is the second time this has happened. It is really a show stopper, because the node is fenced some minutes later:

00:34:40.709002 ha-idg: Fencing Operation Off of ha-idg-1 by ha-idg-2 for crmd.28253@ha-idg-2: OK (ref=9710f0e2-a9a9-42c3-a294-ed0bd78bba1a)

What can I do? Is there an alternative to DLM? The system is SLES 12 SP5. Update to SLES 15 SP3?

Bernd
[ClusterLabs] is there a way to cancel a running live migration or a "resource stop" ?
Hi,

is there a way to cancel a running live migration or a "resource stop"?

Bernd
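Not a cluster-level answer, but at the libvirt layer an in-progress live migration can be aborted with virsh. Note this happens behind Pacemaker's back: the cluster will most likely record the migrate_to operation as failed and run its normal recovery for the resource afterwards. A sketch (the domain name is just an example from this thread):

```
# Inspect the running migration job first:
virsh domjobinfo vm-mausdb

# Abort the migration job (run on the source host):
virsh domjobabort vm-mausdb
```

There is no comparable clean way to cancel a stop operation that the cluster has already dispatched to the resource agent; letting it finish (or time out) is usually safer than killing the agent process.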
Re: [ClusterLabs] modified RA can't be used
- On Jun 27, 2022, at 3:54 PM, kgaillot kgail...@redhat.com wrote:

> As an aside, the preferred naming for custom agents is to change the
> provider (ocf:PROVIDER:AGENT), putting them in
> /usr/lib/ocf/resource.d/PROVIDER/AGENT.
>
> For example, ocf:local:VirtualDomain or ocf:mcd:VirtualDomain
>
> The main advantage is having your own namespace and not having to worry
> about a current or future resource-agents package having any side
> effects on your agent.

Hi Ken,

I did it this way.

Bernd
Re: [ClusterLabs] modified RA can't be used
- On Jun 27, 2022, at 2:57 PM, Oyvind Albrigtsen oalbr...@redhat.com wrote:

> You need to update the agent name in the metadata section to be the
> same as the filename.
>
> Oyvind

OMG. Thank you!!!

Bernd
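Concretely: the agent's meta_data() function prints an XML document whose root element carries the agent name, and that name has to match the new filename. A sketch of the one-line change (the version attribute is whatever the copied agent already contains):

```
<!-- in VirtualDomain.ssh's meta_data() output, change -->
<resource-agent name="VirtualDomain" ...>
<!-- to -->
<resource-agent name="VirtualDomain.ssh" ...>
```

crmsh compares this metadata name against the requested agent, which is why the copy reported "got no meta-data" even though the file existed with correct permissions.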
[ClusterLabs] modified RA can't be used
Hi, i adapted the RA ocf/heartbeat/VirtualDomain to my needs and renamed it to VirtualDomain.ssh When i try to use it now, i get an error message. I start e.g. "crm configure edit vm-idcc-devel" to modify an existing VirtualDomain that it uses the new RA and want to save it i get the following error: ERROR: ocf:heartbeat:VirtualDomain.ssh: got no meta-data, does this RA exist? ERROR: ocf:heartbeat:VirtualDomain.ssh: got no meta-data, does this RA exist? ERROR: ocf:heartbeat:VirtualDomain.ssh: no such resource agent The RA exists in the filesystem and has the same permissions as the original: ha-idg-1:~ # ll /usr/lib/ocf/resource.d/heartbeat/Virt* -rwxr-xr-x 1 root root 35607 Feb 15 07:21 /usr/lib/ocf/resource.d/heartbeat/VirtualDomain -rwxr-xr-x 1 root root 35747 Jun 27 14:22 /usr/lib/ocf/resource.d/heartbeat/VirtualDomain.ssh The difference is only in one line i added: ha-idg-1:~ # diff /usr/lib/ocf/resource.d/heartbeat/VirtualDomain /usr/lib/ocf/resource.d/heartbeat/VirtualDomain.ssh 732a733,734 > ssh -i /root/ssh/id_rsa.mcd.shutdown mcd.shutdown@${DOMAIN_NAME} shutdown.bat > ## new by bernd.len...@helmholtz-muenchen.de 26062022 I also copied the new RA to another folder ... same problem. When i try to get info about the new RA i get the same error: ha-idg-1:~ # crm ra info ocf:heartbeat:VirtualDomain.ssh ERROR: ocf:heartbeat:VirtualDomain.ssh: got no meta-data, does this RA exist? The VirtualDomain is shutdown. It's a two-node cluster with SLES 12 SP5, RA exists on both nodes and is identical: ha-idg-1:~ # sha1sum /usr/lib/ocf/resource.d/heartbeat/VirtualDomain.ssh 8d075cb0745c674525802f94d4d7d2b88af8156c /usr/lib/ocf/resource.d/heartbeat/VirtualDomain.ssh ha-idg-2:~ # sha1sum /usr/lib/ocf/resource.d/heartbeat/VirtualDomain.ssh 8d075cb0745c674525802f94d4d7d2b88af8156c /usr/lib/ocf/resource.d/heartbeat/VirtualDomain.ssh Any ideas ? 
Bernd

--
Bernd Lentes
System Administrator
Institute for Metabolism and Cell Death (MCD)
Building 25 - office 122
HelmholtzZentrum München
bernd.len...@helmholtz-muenchen.de
phone: +49 89 3187 1241
fax: +49 89 3187 2294
http://www.helmholtz-muenchen.de/mcd
Public key: 30 82 01 0a 02 82 01 01 00 b3 72 3e ce 2c 0a 6f 58 49 2c 92 23 c7 b9 c1 ff 6c 3a 53 be f7 9e e9 24 b7 49 fa 3c e8 de 28 85 2c d3 ed f7 70 03 3f 4d 82 fc cc 96 4f 18 27 1f df 25 b3 13 00 db 4b 1d ec 7f 1b cf f9 cd e8 5b 1f 11 b3 a7 48 f8 c8 37 ed 41 ff 18 9f d7 83 51 a9 bd 86 c2 32 b3 d6 2d 77 ff 32 83 92 67 9e ae ae 9c 99 ce 42 27 6f bf d8 c2 a1 54 fd 2b 6b 12 65 0e 8a 79 56 be 53 89 70 51 02 6a eb 76 b8 92 25 2d 88 aa 57 08 42 ef 57 fb fe 00 71 8e 90 ef b2 e3 22 f3 34 4f 7b f1 c4 b1 7c 2f 1d 6f bd c8 a6 a1 1f 25 f3 e4 4b 6a 23 d3 d2 fa 27 ae 97 80 a3 f0 5a c4 50 4a 45 e3 45 4d 82 9f 8b 87 90 d0 f9 92 2d a7 d2 67 53 e6 ae 1e 72 3e e9 e0 c9 d3 1c 23 e0 75 78 4a 45 60 94 f8 e3 03 0b 09 85 08 d0 6c f3 ff ce fa 50 25 d9 da 81 7b 2a dc 9e 28 8b 83 04 b4 0a 9f 37 b8 ac 58 f1 38 43 0e 72 af 02 03 01 00 01
[ClusterLabs] how does the VirtualDomain RA know with which options it's called ?
Hi,

from my understanding the resource agents in /usr/lib/ocf/resource.d/heartbeat are quite similar to the old scripts in /etc/init.d started by init. Init starts these scripts as "script [start|stop|reload|restart|status]". Inside the script there is a case construct which checks the option the script was started with and calls the appropriate function. Similar to the init scripts, the cluster calls the RA as "script [start|stop|monitor ...]". But I'm missing this construct in the VirtualDomain RA. How does it know how it is invoked? I don't see any logic which checks the options the script is called with.

Bernd
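The construct is there, just near the bottom of the file, after several hundred lines of function definitions, so it is easy to miss. A hedged, self-contained sketch of the kind of dispatch an OCF agent such as VirtualDomain performs on its first argument (the function names here are illustrative, not copied from the agent):

```shell
# Illustrative OCF-style dispatch on the action argument; a real agent calls
# its start/stop/monitor functions and exits with the matching OCF return code.
dispatch_demo() {
    case "$1" in
        meta-data) echo "would print the meta-data XML" ;;
        start)     echo "would call VirtualDomain_start" ;;
        stop)      echo "would call VirtualDomain_stop" ;;
        monitor)   echo "would call VirtualDomain_monitor" ;;
        *)         echo "unimplemented action: $1"; return 3 ;;  # OCF_ERR_UNIMPLEMENTED
    esac
}
dispatch_demo monitor
# -> would call VirtualDomain_monitor
```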
Re: [ClusterLabs] crm resource stop VirtualDomain - but VirtualDomain shutdown start some minutes later
- On Feb 17, 2022, at 4:25 PM, kgaillot kgail...@redhat.com wrote:

>> So for me the big question is:
>> When a transition is happening, and there is a change in the cluster,
>> is the transition "aborted" (delayed or interrupted would be better) or not?
>> Is this behaviour consistent? If not, what does it depend on?
>>
>> Bernd
>
> Yes, anytime the DC sees a change that could affect resources, it will
> abort the current transition and calculate a new one. Aborting means
> not initiating any new actions from the transition -- but any actions
> currently in flight must complete before the new transition can be
> calculated.
>
> Changes that abort a transition include configuration changes, a node
> joining or leaving, an unexpected action result being received, a node
> attribute changing, the cluster-recheck-interval passing since the last
> transition, or a timer popping for a time-based event (failure timeout,
> rule, etc.). I may be forgetting some, but you get the idea.

Hi Ken,

thanks for your explanation. Let me summarize to check whether I understood everything correctly: I started the shutdown of several VirtualDomains with "crm resource stop vm_xxx" -- not concurrently, but one by one with a delay of about 30 seconds. But one VirtualDomain was already shutting down before that. The cluster said the transition was aborted, but in reality it couldn't be aborted -- how would you abort a running shutdown? So we had to wait for the shutdown of that domain; it was switched off by libvirt with "virsh destroy" after 10 minutes. After that, the shutdown of the other domains was initiated, and they shut down cleanly.

So, to conclude: I forgot that I already had one domain in shutdown. I should have waited for it to finish before stopping the other resources. The cluster tried to "abort" the transition, but the running stop couldn't be aborted, and I had bad luck that the shutdown of this domain took so long. Correct?
Bernd
Re: [ClusterLabs] crm resource stop VirtualDomain - but VirtualDomain shutdown start some minutes later
- On Feb 16, 2022, at 6:48 PM, arvidjaar arvidj...@gmail.com wrote:
>
> Splitting logs between different messages does not really help in interpreting them.

I agree. Here is the complete excerpt from the respective time:
https://nc-mcd.helmholtz-muenchen.de/nextcloud/s/eY8SA8pe4HZBBc8

> I guess the real question here is why "Transition aborted" is logged although
> the transition apparently continues. Transition 128 started at 20:54:30 and
> completed at 21:04:26, but there were multiple "Transition 128 aborted"
> messages in between.

That's correct. The "shutdown_timeout" for the domain is set to 600 sec in the CIB. The RA says:

# The "shutdown_timeout" we use here is the operation
# timeout specified in the CIB, minus 5 seconds

And between 20:54:30 and 21:04:26 we have very close to 595 seconds.

> It looks like "Transition aborted" is more "we try to abort this transition if
> possible". My guess is that pacemaker must wait for currently running action(s),
> which can take quite some time when stopping a virtual domain. Transition 128
> was initiated when stopping vm_pathway, but we have no idea when it was stopped.

We have:

Feb 15 21:04:26 [15370] ha-idg-2 crmd: notice: run_graph: Transition 128 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3548.bz2): Complete

and the log from libvirt confirms it:

/var/log/libvirtd/qemu/vm_pathway.log:
2022-02-15T20:04:26.569471Z qemu-system-x86_64: terminating on signal 15 from pid 7368 (/usr/sbin/libvirtd)
2022-02-15 20:04:26.769+: shutting down, reason=destroyed

Time in libvirt logs is UTC, and in Munich we currently have UTC+1, so the timestamps differ between the logs. We see that the domain is "switched off" via libvirt at exactly 21:04:26.

So for me the big question is: when a transition is happening and there is a change in the cluster, is the transition "aborted" (delayed or interrupted would be better) or not? Is this behaviour consistent? If not, what does it depend on?
Bernd
Re: [ClusterLabs] Antw: Antw: Re: Antw: [EXT] Re: crm resource stop VirtualDomain ‑ but VirtualDomain shutdown start some minutes later
- On Feb 17, 2022, at 10:26 AM, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:
>
> To correct myself: crm has a "-w" (wait) option that will wait until the DC is
> idle. In most cases it just waits until the requested operation has completed
> (or failed).

Hi Ulrich,

but stopping the domains with -w would take very long. We have 20 VirtualDomains, and our UPS does not have enough capacity to wait for such a long time.

Bernd
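One way around waiting per domain is to issue all the stops first and then wait only once for the DC to go idle. A sketch under the assumption that `crm_resource --wait` is available in this pacemaker version; it is a dry run that only prints the commands (pipe the output to `sh` on a cluster node to execute them):

```shell
# Print the stop commands for a batch of domains, then a single wait.
# Domain names are taken from the thread; adjust to the real resource list.
batch_stop_plan() {
    for vm in "$@"; do
        echo "crm resource stop $vm"
    done
    echo "crm_resource --wait"   # one wait for the whole batch, not one per stop
}
batch_stop_plan vm_greensql vm_ssh vm_sim
```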
Re: [ClusterLabs] crm resource stop VirtualDomain - but VirtualDomain shutdown start some minutes later
- On Feb 16, 2022, at 12:52 AM, kgaillot kgail...@redhat.com wrote:
> A transition is the set of actions that need to be taken in response to
> current conditions. A transition is aborted any time conditions change
> (here, the target-role being changed in the configuration), so that a
> new set of actions can be calculated.
>
> Someone once defined a transition as an "action plan", and I'm tempted
> to use that instead. Plus maybe replace "aborted" with "interrupted",
> so then we'd have "Action plan interrupted" which is maybe a little
> more understandable.

These "transition aborted" messages happen quite often:

Feb 15 20:53:25 [15370] ha-idg-2 crmd: notice: abort_transition_graph: Transition 126 aborted by vm_documents-oo-meta_attributes-target-role doing modify target-role=Stopped: Configuration change | cib=7.27453.0 source=te_update_diff_v2:483 path=/cib/configuration/resources/primitive[@id='vm_documents-oo']/meta_attributes[@id='vm_documents-oo-meta_attributes']/nvpair[@id='vm_documents-oo-meta_attributes-target-role'] complete=false

Feb 15 20:53:00 [15370] ha-idg-2 crmd: info: abort_transition_graph: Transition 125 aborted by vm_amok-meta_attributes-target-role doing modify target-role=Stopped: Configuration change | cib=7.27452.0 source=te_update_diff_v2:483 path=/cib/configuration/resources/primitive[@id='vm_amok']/meta_attributes[@id='vm_amok-meta_attributes']/nvpair[@id='vm_amok-meta_attributes-target-role'] complete=true

Why is there sometimes "complete=true" and sometimes "complete=false"? What does that mean?

Bernd
Re: [ClusterLabs] Antw: [EXT] Re: crm resource stop VirtualDomain ‑ but VirtualDomain shutdown start some minutes later
- On Feb 16, 2022, at 1:01 PM, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:
> Bernd,
>
> I guess the syslog/journal of the DC has better logs.

Unfortunately the journal didn't reveal anything.

> As I see it now, it seems stopping vm_pathway takes a few minutes, and no other
> action is started before that is done.
> I think I once said "clusters are not for the impatient", i.e.: don't start a
> new action when the previous action has not completed yet.

Does that mean that when I want to shut down some VirtualDomains, I have to do it one by one, always waiting for the complete shutdown before stopping the next one? That could take very long.

Bernd
Re: [ClusterLabs] crm resource stop VirtualDomain - but VirtualDomain shutdown start some minutes later
- On Feb 16, 2022, at 12:52 AM, kgaillot kgail...@redhat.com wrote:

>> Any idea ?
>> What is about that transition 128, which is aborted ?
>
> A transition is the set of actions that need to be taken in response to
> current conditions. A transition is aborted any time conditions change
> (here, the target-role being changed in the configuration), so that a
> new set of actions can be calculated.
>
> Someone once defined a transition as an "action plan", and I'm tempted
> to use that instead. Plus maybe replace "aborted" with "interrupted",
> so then we'd have "Action plan interrupted" which is maybe a little
> more understandable.
>
>> Transition 128 is finished:
>> Feb 15 21:04:26 [15370] ha-idg-2 crmd: notice: run_graph: Transition 128 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3548.bz2): Complete
>>
>> And one second later the shutdown starts. Is it normal that there
>> is such a big time gap ?
>
> No, there should be another transition calculated (with a "saving
> input" message) immediately after the original transition is aborted.
> What's the timestamp on that?

Hi Ken,

this is what I found:

Feb 15 20:54:30 [15369] ha-idg-2 pengine: notice: process_pe_message: Calculated transition 128, saving inputs in /var/lib/pacemaker/pengine/pe-input-3548.bz2
Feb 15 20:54:30 [15370] ha-idg-2 crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE | input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response
Feb 15 20:54:30 [15370] ha-idg-2 crmd: notice: do_te_invoke: Processing graph 128 (ref=pe_calc-dc-1644954870-403) derived from /var/lib/pacemaker/pengine/pe-input-3548.bz2
Feb 15 20:54:30 [15370] ha-idg-2 crmd: notice: te_rsc_command: Initiating stop operation vm_pathway_stop_0 locally on ha-idg-2 | action 76
Feb 15 21:04:26 [15369] ha-idg-2 pengine: notice: process_pe_message: Calculated transition 129, saving inputs in /var/lib/pacemaker/pengine/pe-input-3549.bz2
Feb 15 21:04:26 [15370] ha-idg-2 crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE | input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response
Feb 15 21:04:26 [15370] ha-idg-2 crmd: notice: do_te_invoke: Processing graph 129 (ref=pe_calc-dc-1644955466-405) derived from /var/lib/pacemaker/pengine/pe-input-3549.bz2

Bernd
[ClusterLabs] crm resource stop VirtualDomain - but VirtualDomain shutdown start some minutes later
Hi,

I see weird behaviour in my two-node cluster. I stopped several VirtualDomains via "crm resource stop", but the respective shutdowns start minutes later. All on the same host.

.bash_history:
3520 2022-02-15 20:55:44 crm resource stop vm_greensql
3521 2022-02-15 20:56:34 crm resource stop vm_ssh
3522 2022-02-15 20:57:23 crm resource stop vm_sim
3523 2022-02-15 20:58:38 crm resource stop vm_mouseidgenes
3524 2022-02-15 21:00:24 crm resource stop vm_genetrap
3525 2022-02-15 21:01:25 crm resource stop vm_severin
3526 2022-02-15 21:01:34 crm resource stop vm_idcc_devel

/var/log/cluster/corosync.log:
Feb 15 20:55:45 [15365] ha-idg-2 cib: info: cib_perform_op: Diff: --- 7.27455.0 2
Feb 15 20:55:45 [15365] ha-idg-2 cib: info: cib_perform_op: Diff: +++ 7.27456.0 138c70d41548c4cb1d767dd578a98b8f
Feb 15 20:55:45 [15365] ha-idg-2 cib: info: cib_perform_op: + /cib: @epoch=27456
Feb 15 20:55:45 [15365] ha-idg-2 cib: info: cib_perform_op: + /cib/configuration/resources/primitive[@id='vm_greensql']/meta_attributes[@id='vm_greensql-meta_attributes']/nvpair[@id='vm_greensql-meta_attributes-target-role']: @value=Stopped
Feb 15 20:55:45 [15365] ha-idg-2 cib: info: cib_process_request: Completed cib_apply_diff operation for section 'all': OK (rc=0, origin=ha-idg-1/cibadmin/2, version=7.27456.0)
Feb 15 20:55:45 [15370] ha-idg-2 crmd: info: abort_transition_graph: Transition 128 aborted by vm_greensql-meta_attributes-target-role doing modify target-role=Stopped: Configuration change | cib=7.27456.0 source=te_update_diff_v2:483 path=/cib/configuration/resources/primitive[@id='vm_greensql']/meta_attributes[@id='vm_greensql-meta_attributes']/nvpair[@id='vm_greensql-meta_attributes-target-role'] complete=false
...
Feb 15 20:56:35 [15365] ha-idg-2 cib: info: cib_perform_op: Diff: --- 7.27456.0 2
Feb 15 20:56:35 [15365] ha-idg-2 cib: info: cib_perform_op: Diff: +++ 7.27457.0 (null)
Feb 15 20:56:35 [15365] ha-idg-2 cib: info: cib_perform_op: + /cib: @epoch=27457
Feb 15 20:56:35 [15365] ha-idg-2 cib: info: cib_perform_op: + /cib/configuration/resources/primitive[@id='vm_ssh']/meta_attributes[@id='vm_ssh-meta_attributes']/nvpair[@id='vm_ssh-meta_attributes-target-role']: @value=Stopped
Feb 15 20:56:35 [15365] ha-idg-2 cib: info: cib_process_request: Completed cib_apply_diff operation for section 'all': OK (rc=0, origin=ha-idg-1/cibadmin/2, version=7.27457.0)
Feb 15 20:56:35 [15370] ha-idg-2 crmd: info: abort_transition_graph: Transition 128 aborted by vm_ssh-meta_attributes-target-role doing modify target-role=Stopped: Configuration change | cib=7.27457.0 source=te_update_diff_v2:483 path=/cib/configuration/resources/primitive[@id='vm_ssh']/meta_attributes[@id='vm_ssh-meta_attributes']/nvpair[@id='vm_ssh-meta_attributes-target-role'] complete=false
...
Feb 15 20:57:24 [15365] ha-idg-2 cib: info: cib_perform_op: Diff: --- 7.27457.0 2
Feb 15 20:57:24 [15365] ha-idg-2 cib: info: cib_perform_op: Diff: +++ 7.27458.0 7f91d8e52c8ff0887916ad921703fadd
Feb 15 20:57:24 [15365] ha-idg-2 cib: info: cib_perform_op: + /cib: @epoch=27458
Feb 15 20:57:24 [15365] ha-idg-2 cib: info: cib_perform_op: + /cib/configuration/resources/primitive[@id='vm_sim']/meta_attributes[@id='vm_sim-meta_attributes']/nvpair[@id='vm_sim-meta_attributes-target-role']: @value=Stopped
Feb 15 20:57:24 [15365] ha-idg-2 cib: info: cib_process_request: Completed cib_apply_diff operation for section 'all': OK (rc=0, origin=ha-idg-1/cibadmin/2, version=7.27458.0)
Feb 15 20:57:24 [15370] ha-idg-2 crmd: info: abort_transition_graph: Transition 128 aborted by vm_sim-meta_attributes-target-role doing modify target-role=Stopped: Configuration change | cib=7.27458.0 source=te_update_diff_v2:483 path=/cib/configuration/resources/primitive[@id='vm_sim']/meta_attributes[@id='vm_sim-meta_attributes']/nvpair[@id='vm_sim-meta_attributes-target-role'] complete=false
...
Feb 15 20:58:39 [15365] ha-idg-2 cib: info: cib_perform_op: Diff: --- 7.27458.0 2
Feb 15 20:58:39 [15365] ha-idg-2 cib: info: cib_perform_op: Diff: +++ 7.27459.0 727c5953b33542602028bf903b0578bc
Feb 15 20:58:39 [15365] ha-idg-2 cib: info: cib_perform_op: + /cib: @epoch=27459
Feb 15 20:58:39 [15365] ha-idg-2 cib: info: cib_perform_op: + /cib/configuration/resources/primitive[@id='vm_mouseidgenes']/meta_attributes[@id='vm_mouseidgenes-meta_attributes']/nvpair[@id='vm_mouseidgenes-meta_attributes-target-role']: @value=Stopped
Feb 15 20:58:39 [15370] ha-idg-2 crmd: info: abort_transition_graph: Transition 128 aborted by vm_mouseidgenes-meta_attributes-target-role doing modify target-role=Stopped: Configuration change | cib=7.27459.0
Re: [ClusterLabs] what is the "best" way to completely shutdown a two-node cluster ?
- On Feb 10, 2022, at 4:40 PM, Jehan-Guillaume de Rorthais j...@dalibo.com wrote:
>
> I wonder if after the cluster shutdown completes, the target-role=Stopped could
> be removed/edited offline with eg. crmadmin? That would make VirtualDomain
> startable on boot.
>
> I suppose this would not be that simple as it would require to update it on all
> nodes, taking care of the CIB version, hash, etc... But maybe some tooling
> could take care of this?
>
> Last, if Bernd needs to stop the VirtualDomains gracefully, paying attention to
> the I/O load, maybe he doesn't want them to start automatically on boot for the
> exact same reason anyway?

I start the cluster manually (systemctl start pacemaker) and have no problem starting the VirtualDomains by hand, one after the other. I prefer that over an "automatic" solution.

Bernd
Re: [ClusterLabs] what is the "best" way to completely shutdown a two-node cluster ?
- On Feb 9, 2022, at 11:26 AM, Jehan-Guillaume de Rorthais j...@dalibo.com wrote:
>
> I'm not sure how "crm resource stop" actually stops a resource. I thought
> it would set "target-role=Stopped", but I might be wrong.
>
> If "crm resource stop" actually uses "target-role=Stopped", I believe the
> resources would not start automatically after setting
> "stop-all-resources=false".

ha-idg-2:~ # crm resource help stop
Stop resources

Stop one or more resources using the target-role attribute. If there are
multiple meta attributes sets, the attribute is set in all of them. If the
resource is a clone, all target-role attributes are removed from the
children resources. For details on group management see options
manage-children.

Usage:
stop <rsc> [<rsc> ...]

Bernd
Re: [ClusterLabs] what is the "best" way to completely shutdown a two-node cluster ?
- On Feb 7, 2022, at 4:13 PM, Jehan-Guillaume de Rorthais j...@dalibo.com wrote:
> On Mon, 7 Feb 2022 14:24:44 +0100 (CET) "Lentes, Bernd" wrote:
>
>> Hi,
>>
>> i'm currently changing a bit in my cluster because i realized that my
>> configuration for a power outage didn't work as i expected. My idea is
>> currently:
>> - first stop about 20 VirtualDomains, which are my services. This will surely
>> take some minutes. I'm thinking of stopping each with a time difference of
>> about 20 seconds for not getting too much IO load. and then ...
>> - how to stop the other resources ?
>
> I would set cluster option "stop-all-resources" so all remaining resources are
> stopped gracefully by the cluster.
>
> Then you can stop both nodes using eg. "crm cluster stop".
>
> On restart, after both nodes are up and joined to the cluster, you can set
> "stop-all-resources=false", then start your VirtualDomains.

Aren't the VirtualDomains already started by "stop-all-resources=false"?

I wrote a script for the whole procedure which is triggered by the UPS. As I am not a big shell-script writer, please have a look and tell me your opinion. You find it here:
https://nc-mcd.helmholtz-muenchen.de/nextcloud/s/rEA9bFxs5Ay6fYG

Thanks.

Bernd
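The procedure discussed in this thread can be sketched as a dry run (the domain names, the 20-second stagger, and the 10-minute grace period are assumptions taken from the thread, not from the actual script; the sketch only prints the commands, so pipe the output to `sh` on a node to execute them):

```shell
# Emit the UPS-triggered shutdown sequence: staggered VM stops, then the
# remaining resources, then the node itself.
ups_shutdown_plan() {
    for vm in "$@"; do
        echo "crm resource stop $vm"
        echo "sleep 20"    # stagger stops to limit I/O load
    done
    echo "sleep 600"       # grace period for the domains to shut down cleanly
    echo "crm configure property stop-all-resources=true"
    echo "crm cluster stop"
    echo "systemctl poweroff"
}
ups_shutdown_plan vm_greensql vm_ssh
```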
Re: [ClusterLabs] Antw: [EXT] what is the "best" way to completely shutdown a two‑node cluster ?
- On Feb 7, 2022, at 2:36 PM, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:
>
> Bernd,
>
> what if you set the affected node to standby, or shut down the cluster
> services? Or are all nodes powered by the same UPS?

All nodes are powered by the same UPS.

>> And what if both nodes are running ? Can i do that simultaneously on both
>> nodes ?
>
> I guess that should work.
>
>> My OS is SLES 12 SP5, pacemaker is 1.1.23, corosync is 2.3.6-9.13.1
>
> Your action plan depends on what the VMs are doing: basically every HA
> resource should survive a hard restart without much damage.

Well, some VMs have databases; I'd like to shut those down cleanly.

> So maybe an option could be: do nothing, or do an emergency shutdown of the
> node without properly migrating all the VMs elsewhere.

The VMs don't need to be migrated; the whole cluster should stop in a reasonable time and manner.

> You cannot make an application HA by putting it in a VM; at least not in
> general.

I know. But the time gap between one node having problems and booting the VM on the other node is OK for us.

Bernd
[ClusterLabs] what is the "best" way to completely shutdown a two-node cluster ?
Hi,

I'm currently changing a few things in my cluster because I realized that my configuration for a power outage didn't work as I expected. My current idea:

- First stop about 20 VirtualDomains, which are my services. This will surely take some minutes. I'm thinking of stopping them with a time difference of about 20 seconds each, to avoid too much I/O load. And then ...
- How to stop the other resources?
- Put the nodes into standby or offline?
- Do a "systemctl stop pacemaker"?
- Or do a "crm cluster stop"?

And what if both nodes are running? Can I do that simultaneously on both nodes?

My OS is SLES 12 SP5, pacemaker is 1.1.23, corosync is 2.3.6-9.13.1.

Thanks for your help.

Bernd
[ClusterLabs] Is there a python package for pacemaker ?
Hi,

I need to write some scripts for our cluster. Until now I wrote bash scripts, but I'd like to learn Python. Is there a Python package for pacemaker? What I found is https://pypi.org/project/pacemaker/ and I'm not sure what that is.

Thanks.

Bernd
[ClusterLabs] HA-Cluster, UPS and power outage - how is your setup ?
Hi,

we just experienced two power outages within a few days. This showed me that our UPS configuration and the handling of resources on the cluster are insufficient.

We have a two-node cluster with SLES 12 SP5 and a Smart-UPS SRT 3000 from APC with a Network Management Card. The UPS is able to buffer the two nodes and some hardware (SAN, monitor) for about one hour. Our resources are virtual domains, about 20 of different flavor and version.

Our primary goal is not to ride out a power outage for as long as possible, but to shut down all domains cleanly after a defined time. I'm currently thinking of waiting for a defined time (maybe 15 minutes) and then doing a "crm resource stop" for the VirtualDomains in a script. I would give the cluster some time for the shutdown (5-10 minutes) and afterwards shut down the nodes (via script). I have to keep an eye on whether both nodes are running or only one of them.

What is your approach?

Bernd
Re: [ClusterLabs] Problem with high load (IO)
- On Sep 30, 2021, at 3:55 AM, Gang He g...@suse.com wrote:
>>
>> 1) No problems during this step, the procedure just needs a few seconds.
>> reflink is a binary. See reflink --help
>> Yes, it is a cluster filesystem. I do the procedure just on one node,
>> so i don't have duplicates.
>>
>> 2) just with "cp source destination" to a NAS.
>> Yes, the problems appear during this step.
>
> OK, when you cp the cloned file to the NAS directory,
> the NAS directory should be another file system, right?
> During the copying process, the original running VM will be affected,
> right?

Yes, it's another filesystem. And yes, the running machine is affected: it gets slower and sometimes does not react, according to our monitoring software.

Bernd
Re: [ClusterLabs] Problem with high load (IO)
- On Sep 29, 2021, at 4:37 AM, Gang He g...@suse.com wrote:
> Hi Lentes,
>
> Thanks for your feedback.
> I have some questions as below:
> 1) How do you clone these VM images from each ocfs2 node via reflink?
> Do you encounter any problems during this step?
> I want to say, this is a shared file system; you do not clone all VM
> images from each node, duplicated.
> 2) After the cloned VM images are created, how do you copy these VM
> images? Copy to another backup file system, right?
> The problem usually happens during this step?
>
> Thanks
> Gang

1) No problems during this step; the procedure just needs a few seconds. reflink is a binary, see "reflink --help". Yes, it is a cluster filesystem. I do the procedure on just one node, so I don't have duplicates.

2) Just with "cp source destination" to a NAS. Yes, the problems appear during this step.

Bernd
Re: [ClusterLabs] Problem with high load (IO)
- On Sep 27, 2021, at 2:51 PM, Pacemaker ML users@clusterlabs.org wrote:
> I would use something like this:
>
> ionice -c 2 -n 7 nice cp XXX YYY
>
> Best Regards,
> Strahil Nikolov

Just for a better understanding: ionice does not relate to the copy procedure in this command line, but to the nice program. What is the advantage if nice treats I/O a bit more carefully? Is there a way in this command line to make ionice relate to the copy program? What about "ionice -c 2 -n 7 (nice cp XXX YYY)"? With the brackets, both programs are executed in the same shell. Would that help?

Bernd
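One point that can be verified directly (a sketch; `ionice` is the util-linux tool): the I/O scheduling class set by ionice is inherited by child processes, so in "ionice -c 2 -n 7 nice cp XXX YYY" the cp at the end of the chain does run with best-effort priority 7 -- no brackets or subshell needed. Running `ionice` without arguments prints the calling process's own class, which lets us check this:

```shell
# The innermost ionice reports the I/O class it inherited through nice
# from the outermost ionice invocation.
ionice -c 2 -n 7 nice ionice
# prints: best-effort: prio 7
```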
Re: [ClusterLabs] Problem with high load (IO)
- On Sep 27, 2021, at 2:51 PM, Pacemaker ML users@clusterlabs.org wrote: > I would use something like this: > > ionice -c 2 -n 7 nice cp XXX YYY > > Best Regards, > Strahil Nikolov > Hi Strahil, that sounds interesting, I didn't know ionice. I will have a look at the man pages. Thanks. Bernd
[ClusterLabs] Problem with high load (IO)
Hi, I have a two-node cluster running on SLES 12 SP5 with two HP servers and a shared FC SAN. Most of my resources are virtual domains offering databases and web pages. The disks of the domains reside on an OCFS2 volume on the FC SAN. Each night at 9 pm all domains are snapshotted with the OCFS2 tool reflink. After the snapshot is created the disks of the domains are copied to a NAS while the domains are still running. The copy procedure occupies the CPU and IO heavily: IO is about 90% busy with the copy, and CPU iowait sometimes reaches about 50%. Because of that the domains aren't responsive, so the monitor operation from the RA sometimes fails. In the worst case a domain is fenced. What would you do in such a situation? I'm thinking of making the cp procedure nicer, with nice, maybe about 10. More ideas? Bernd -- Bernd Lentes System Administrator Institute for Metabolism and Cell Death (MCD) Building 25 - office 122 HelmholtzZentrum München bernd.len...@helmholtz-muenchen.de phone: +49 89 3187 1241 phone: +49 89 3187 3827 fax: +49 89 3187 2294 http://www.helmholtz-muenchen.de/mcd Public key: 30 82 01 0a 02 82 01 01 00 b3 72 3e ce 2c 0a 6f 58 49 2c 92 23 c7 b9 c1 ff 6c 3a 53 be f7 9e e9 24 b7 49 fa 3c e8 de 28 85 2c d3 ed f7 70 03 3f 4d 82 fc cc 96 4f 18 27 1f df 25 b3 13 00 db 4b 1d ec 7f 1b cf f9 cd e8 5b 1f 11 b3 a7 48 f8 c8 37 ed 41 ff 18 9f d7 83 51 a9 bd 86 c2 32 b3 d6 2d 77 ff 32 83 92 67 9e ae ae 9c 99 ce 42 27 6f bf d8 c2 a1 54 fd 2b 6b 12 65 0e 8a 79 56 be 53 89 70 51 02 6a eb 76 b8 92 25 2d 88 aa 57 08 42 ef 57 fb fe 00 71 8e 90 ef b2 e3 22 f3 34 4f 7b f1 c4 b1 7c 2f 1d 6f bd c8 a6 a1 1f 25 f3 e4 4b 6a 23 d3 d2 fa 27 ae 97 80 a3 f0 5a c4 50 4a 45 e3 45 4d 82 9f 8b 87 90 d0 f9 92 2d a7 d2 67 53 e6 ae 1e 72 3e e9 e0 c9 d3 1c 23 e0 75 78 4a 45 60 94 f8 e3 03 0b 09 85 08 d0 6c f3 ff ce fa 50 25 d9 da 81 7b 2a dc 9e 28 8b 83 04 b4 0a 9f 37 b8 ac 58 f1 38 43 0e 72 af 02 03 01 00 01
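Besides lowering priorities as discussed in this thread, one hedged option for the nightly backup above is to cap the copy's bandwidth directly so the SAN keeps enough headroom for the running domains. A sketch; the paths and the 50 MB/s cap are placeholders, and throwaway temp files stand in for the reflink clone and the NAS target so the commands can run anywhere:

```shell
# Sketch of a throttled nightly backup step. On the cluster, "$src" would be
# the reflink snapshot on the OCFS2 volume and "$dst" a file on the NAS mount.
src=$(mktemp)
head -c 1048576 /dev/urandom > "$src"      # 1 MiB of dummy VM image data
snap="$src.snap"
cp --reflink=auto "$src" "$snap"           # cheap COW clone where the fs supports it
dst=$(mktemp -u)
# rsync --bwlimit caps throughput (KB/s); nice/ionice additionally deprioritize it.
ionice -c 2 -n 7 nice -n 10 rsync --bwlimit=50000 "$snap" "$dst" 2>/dev/null \
  || cp "$snap" "$dst"                     # fall back to plain cp if rsync is absent
cmp -s "$src" "$dst" && echo "backup copy ok"
```

A bandwidth cap has a different failure mode than niceness alone: niceness only yields when there is competition, while --bwlimit bounds the load unconditionally, which makes monitor-operation timeouts more predictable.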
Re: [ClusterLabs] configured trace for Virtual Domains - automatic restart ?
- On Sep 20, 2021, at 10:14 PM, kgaillot kgail...@redhat.com wrote: > > As far as I know, only a few of the ocf:pacemaker agents support OCF > 1.1 currently. The resource-agents package doesn't. > > To check a given agent, run "crm_resource --show-metadata > ocf:$PROVIDER:$AGENT | grep ''" using the desired provider and > agent name. > ha-idg-1:/var/log/atop # crm_resource --show-meta ocf:heartbeat:VirtualDomain|grep -i version 1.1 So it's version 1.1. But there is no "reload" operation: Operations' defaults (advisory minimum): start timeout=90s stop timeout=90s status timeout=30s interval=10s monitor timeout=30s interval=10s migrate_from timeout=60s migrate_to timeout=120s Bernd
Re: [ClusterLabs] configured trace for Virtual Domains - automatic restart ?
- On Sep 18, 2021, at 1:19 AM, kgaillot kgail...@redhat.com wrote: > > If the agent meta-data advertises support for the 1.1 standard and > indicates that the trace_ra parameter is reloadable, then Pacemaker > will automatically do a reload instead of a restart for the resource if > the parameter changes. From where do I know that my RA supports 1.1? I'm running on SLES 12 SP5 with resource-agents-4.3.018.a7fb5035-3.51.1.x86_64. My crm does not support "crm resource reload": ha-idg-1:~ # crm resource help Resource management At this level resources may be managed. All (or almost all) commands are implemented with the CRM tools such as crm_resource(8). Commands: ... refresh Recheck current resource status and drop failure history restart Restart resources ... > > trace_ra is unusual in that resource agents don't define the parameter > themselves, the ocf-shellfuncs shell include looks for it instead. It > would be nice to come up with a general solution that all agents can > use rather than modify each agent's meta-data individually, but either > approach would work. Bernd
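To make the metadata check above concrete: OCF 1.1 agents advertise a `<version>` element and mark parameters with a `reloadable` attribute. The XML below is a hypothetical excerpt to show what to grep for; on a live node the real metadata comes from `crm_resource --show-metadata`:

```shell
# Hypothetical excerpt of agent metadata. On a live cluster node, fetch it with:
#   meta=$(crm_resource --show-metadata ocf:heartbeat:VirtualDomain)
meta='<?xml version="1.0"?>
<resource-agent name="VirtualDomain">
  <version>1.1</version>
  <parameter name="trace_ra" reloadable="1"/>
</resource-agent>'
echo "$meta" | grep -o '<version>[^<]*</version>'   # OCF standard version
echo "$meta" | grep -c 'reloadable="1"'             # >0: agent has reloadable params
```

If the second grep finds nothing, the agent reports 1.1 but marks no parameter reloadable, which matches the behaviour described in this thread: a trace_ra change then still triggers a full restart.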
Re: [ClusterLabs] configured trace for Virtual Domains - automatic restart ?
- On Sep 17, 2021, at 9:13 PM, kgaillot kgail...@redhat.com wrote: >> Bernd > > Tracing works by setting a special parameter, which to pacemaker looks > like a configuration change that requires a restart. With the new OCF > 1.1 standard, the trace parameter could be marked reloadable, but the > agents need to be updated to do that. > -- > Ken Gaillot Hi Ken, but does pacemaker do the reload automatically, or do I have to initiate it? Bernd
[ClusterLabs] configured trace for Virtual Domains - automatic restart ?
Hi, today I configured tracing for some VirtualDomains: ha-idg-2:~ # crm resource trace vm_documents-oo migrate_from INFO: Trace for vm_documents-oo:migrate_from is written to /var/lib/heartbeat/trace_ra/ INFO: Trace set, restart vm_documents-oo to trace the migrate_from operation ha-idg-2:~ # crm resource trace vm_genetrap migrate_from INFO: Trace for vm_genetrap:migrate_from is written to /var/lib/heartbeat/trace_ra/ INFO: Trace set, restart vm_genetrap to trace the migrate_from operation I thought "Trace set, restart vm_genetrap to trace the migrate_from operation" was a hint not to forget to restart the resource. But all resources I configured tracing for restarted automatically. Is that behaviour intended? Bernd
[ClusterLabs] virtual domains not migrated
Hi, today I couldn't migrate several virtual domains. I have a two-node cluster with SUSE SLES 12 SP5. Pacemaker is pacemaker-1.1.23+20200622.28dd98fad-3.9.2.20591.0.PTF.1177212.x86_64, corosync is corosync-2.3.6-9.13.1.x86_64. Migration just stopped after some time. This is what I found in the logs: Sep 14 12:28:54 [10498] ha-idg-2 lrmd: notice: operation_finished: vm_genetrap_stop_0:22559:stderr [ error: Failed to shutdown domain vm_genetrap ] Sep 14 12:28:54 [10498] ha-idg-2 lrmd: notice: operation_finished: vm_genetrap_stop_0:22559:stderr [ error: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePerform3Params) ] Sep 14 12:37:36 [7831] ha-idg-1 lrmd: notice: operation_finished: vm_crispor-server_stop_0:8002:stderr [ error: Failed to shutdown domain vm_crispor-server ] Sep 14 12:37:36 [7831] ha-idg-1 lrmd: notice: operation_finished: vm_crispor-server_stop_0:8002:stderr [ error: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepareTunnel3Params) ] Sep 14 12:17:16 [7831] ha-idg-1 lrmd: notice: operation_finished: vm_seneca_stop_0:1546:stderr [ error: Failed to shutdown domain vm_seneca ] Sep 14 12:17:16 [7831] ha-idg-1 lrmd: notice: operation_finished: vm_seneca_stop_0:1546:stderr [ error: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepareTunnel3Params) ] Sep 14 12:17:16 [7831] ha-idg-1 lrmd: notice: operation_finished: vm_seneca_stop_0:1546:stderr [ ] Sep 14 12:07:40 [7831] ha-idg-1 lrmd: notice: operation_finished: vm_geneious_stop_0:1545:stderr [ error: Failed to shutdown domain vm_geneious ] Sep 14 12:07:40 [7831] ha-idg-1 lrmd: notice: operation_finished: vm_geneious_stop_0:1545:stderr [ error: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepareTunnel3Params) ] Any ideas?
Bernd
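When a stop fails with "cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePerform3Params)" as in the logs above, the domain is usually stuck in a hanging migration job, which libvirt can show and cancel. A hedged sketch; the domain name is an example, and the commands are guarded so the snippet runs even where libvirt is not installed:

```shell
# Inspect and cancel a hanging migration job that holds the state change lock.
dom=vm_genetrap                      # example domain name
if command -v virsh >/dev/null 2>&1; then
    virsh domjobinfo "$dom"          # show the active job, if any
    virsh domjobabort "$dom"         # cancel it so the pending stop can proceed
else
    echo "virsh not available, skipping"
fi
checked=yes
```

Whether domjobabort frees the lock depends on where the migration is stuck; if the qemu process itself is in uninterruptible sleep, only the underlying IO problem clearing will help.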
Re: [ClusterLabs] Live migration possible with KSM ?
- On Mar 30, 2021, at 7:54 PM, hunter86 bg hunter86...@yahoo.com wrote: > Keep in mind that KSM is highly cpu intensive and is most suitable for same type > of VMs, so similar memory pages will be merged until a change happens (and that > change is allocated elsewhere). > In oVirt migration is possible with KSM actively working, so it should work with > pacemaker. > I doubt that KSM would be a problem... most probably performance would not be > optimal. > Best Regards, > Strahil Nikolov >> On Tue, Mar 30, 2021 at 19:47, Andrei Borzenkov >> wrote: >> On 30.03.2021 18:16, Lentes, Bernd wrote: >> > Hi, >>> currently i'm reading "Mastering KVM Virtualization", published by Packt >> > Publishing, a book i can really recommend. >>> There are some proposals for tuning guests. One is KSM (kernel samepage >> > merging), which sounds quite interesting. >>> Especially in a system with lots of virtual machines with the same OS this >>> could >> > lead to significant memory saving. >>> I'd like to test, but i don't know if KSM maybe prevents live migration in a >> > pacemaker cluster. >> I do not think pacemaker cares or is aware about KSM. It just tells >> resource agent to perform migration; what happens is entirely up to >> resource agent. >> If you can migrate without pacemaker you can also migrate with pacemaker. Just to give a feedback. I configured KSM on both nodes. On one it saves me nearly 20GB RAM. I checked live migration and it worked. Bernd
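The reported ~20 GB saving can be verified from sysfs: the memory deduplicated by KSM is `pages_sharing` times the page size, so with 4 KiB pages roughly 5.24 million shared pages correspond to 20 GiB. A sketch with a hypothetical counter value:

```shell
# Memory deduplicated by KSM = pages_sharing * page size.
# On a live host read the real counter:
#   pages_sharing=$(cat /sys/kernel/mm/ksm/pages_sharing)
pages_sharing=5242880            # hypothetical value for illustration
page_size=$(getconf PAGE_SIZE)   # typically 4096 on x86_64
echo "KSM saves $(( pages_sharing * page_size / 1024 / 1024 )) MiB"
```

The sibling counters `pages_shared` (surviving copies) and `pages_unshared` (scanned but unmergeable) in the same directory indicate how effective scanning is relative to its CPU cost.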
[ClusterLabs] Live migration possible with KSM ?
Hi, currently I'm reading "Mastering KVM Virtualization", published by Packt Publishing, a book I can really recommend. There are some proposals for tuning guests. One is KSM (kernel samepage merging), which sounds quite interesting. Especially in a system with lots of virtual machines running the same OS this could lead to significant memory savings. I'd like to test it, but I don't know whether KSM prevents live migration in a pacemaker cluster. Does anyone know? Thanks. Bernd
Re: [ClusterLabs] alert is not executed - solved
- On Feb 15, 2021, at 10:24 PM, Bernd Lentes bernd.len...@helmholtz-muenchen.de wrote: > - On Feb 15, 2021, at 9:00 PM, kgaillot kgail...@redhat.com wrote: > >> On Mon, 2021-02-15 at 20:47 +0100, Lentes, Bernd wrote: >>> - On Feb 15, 2021, at 4:53 PM, kgaillot kgail...@redhat.com >>> wrote: >>> >>> > I'd check for SELinux denials. >>> > >>> >>> SELinux isn't installed and the AppArmor service does not start. >>> I changed the subject. >> I found it. It was a permission problem. The script was stored in root's home directory, but alert scripts are executed as the user hacluster. Bernd Helmholtz Zentrum München Helmholtz Zentrum Muenchen Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH) Ingolstaedter Landstr. 1 85764 Neuherberg www.helmholtz-muenchen.de Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther Registergericht: Amtsgericht Muenchen HRB 6466 USt-IdNr: DE 129521671
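The root cause above (alert agents run as hacluster, not root, and /root is typically mode 700) suggests a quick way to check that a script is reachable by a non-root daemon user. On a real node the direct test is `sudo -u hacluster test -x /path/to/alert_smtp.sh`; the function below is a crude stand-in that only inspects the "other" permission bits, so it ignores group membership and ACLs:

```shell
# Returns 0 if the file is executable and every parent directory is
# traversable by "other" (last octal mode digit odd), else 1.
path_ok_for_other() {
    f=$1
    [ -x "$f" ] || return 1
    d=$(dirname "$f")
    while [ "$d" != "/" ]; do
        case $(stat -c %a "$d") in
            *[1357]) ;;          # other-execute bit set: traversable
            *) return 1 ;;       # e.g. /root with mode 700 fails here
        esac
        d=$(dirname "$d")
    done
}
```

For example, `path_ok_for_other /root/skripte/alert_smtp.sh` would typically fail, which is exactly the trap described in this thread; installing the script under a world-readable path such as /usr/share/pacemaker/alerts/ (mode 0755) avoids it.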
Re: [ClusterLabs] alert is not executed
- On Feb 15, 2021, at 9:00 PM, kgaillot kgail...@redhat.com wrote: > On Mon, 2021-02-15 at 20:47 +0100, Lentes, Bernd wrote: >> - On Feb 15, 2021, at 4:53 PM, kgaillot kgail...@redhat.com >> wrote: >> >> > I'd check for SELinux denials. >> > >> >> SELinux isn't installed and the AppArmor service does not start. >> I changed the subject. > > Maybe "exec 2>/some/file" and "set +x" as the first things in the > script. > That does not help. The script is not executed. I inserted a "logger" command, but there is nothing written to the syslog. And the atime of the script is not updated. Bernd
[ClusterLabs] alert is not executed
- On Feb 15, 2021, at 4:53 PM, kgaillot kgail...@redhat.com wrote: > I'd check for SELinux denials. > SELinux isn't installed and the AppArmor service does not start. I changed the subject. Bernd
Re: [ClusterLabs] Antw: [EXT] Re: weird xml snippet in "crm configure show"
- On Feb 15, 2021, at 9:55 AM, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: >> Hi, >> >> i could configure the following: >> >> ha-idg-1:~ # crm configure show smtp_alert >> alert smtp_alert "/root/skripte/alert_smtp.sh" \ >> attributes email_sender="bernd.len...@helmholtz-muenchen.de" \ >> meta timestamp-format="%D %H:%M" \ >> to "informatic@helmholtz-muenchen.de" >> >> Script is available: >> ha-idg-1:~ # ll /root/skripte/alert_smtp.sh >> -rwxr-xr-x 1 root root 4080 Feb 13 01:10 /root/skripte/alert_smtp.sh >> >> But it's not executed, although Cluster log says the alert is doing his job: >> Feb 13 01:10:57 [30760] ha-idg-1 crmd: info: exec_alert_list: >> Sending resource alert via smtp_alert to informatic@helmholtz-muenchen.de >> Feb 13 01:10:57 [30757] ha-idg-1 lrmd: info: >> process_lrmd_alert_exec: Executing alert smtp_alert for >> 621c8a64-13aa-46fa-a7f7-f7df8d384a86 >> >> I added a simple logger command into the script, but nothing is written to >> system log. > > What does the first line of your script look like? > #!/bin/sh # # Copyright (C) 2016 Klaus Wenninger # # This program is free software; you can redistribute it and/or # modify it under the terms of the GNU General Public # License as published by the Free Software Foundation; either # version 2 of the License, or (at your option) any later version. Bernd
Re: [ClusterLabs] weird xml snippet in "crm configure show"
- On Feb 12, 2021, at 12:50 PM, Yan Gao y...@suse.com wrote: > > > It seems that crmsh has difficulty parsing the "random" ids of the > attribute sets here. I guess `crm configure edit` the part to be > something like: > > alert smtp_alert "/root/skripte/alert_smtp.sh" \ > attributes email_sender="bernd.len...@helmholtz-muenchen.de" \ > to "informatic@helmholtz-muenchen.de" meta > timestamp-format="%D %H:%M" > > will do. Hi, I could configure the following: ha-idg-1:~ # crm configure show smtp_alert alert smtp_alert "/root/skripte/alert_smtp.sh" \ attributes email_sender="bernd.len...@helmholtz-muenchen.de" \ meta timestamp-format="%D %H:%M" \ to "informatic@helmholtz-muenchen.de" The script is available: ha-idg-1:~ # ll /root/skripte/alert_smtp.sh -rwxr-xr-x 1 root root 4080 Feb 13 01:10 /root/skripte/alert_smtp.sh But it's not executed, although the cluster log says the alert is doing its job: Feb 13 01:10:57 [30760] ha-idg-1 crmd: info: exec_alert_list: Sending resource alert via smtp_alert to informatic@helmholtz-muenchen.de Feb 13 01:10:57 [30757] ha-idg-1 lrmd: info: process_lrmd_alert_exec: Executing alert smtp_alert for 621c8a64-13aa-46fa-a7f7-f7df8d384a86 I added a simple logger command to the script, but nothing is written to the system log. Bernd
Re: [ClusterLabs] Antw: [EXT] weird xml snippet in "crm configure show"
- On Feb 12, 2021, at 5:00 PM, hunter86 bg hunter86...@yahoo.com wrote: > WARNING: cib-bootstrap-options: unknown attribute 'no-quirum-policy' > That looks like a typo. > Best Regards, > Strahil Nikolov Thanks, I found that already. Bernd
Re: [ClusterLabs] Antw: [EXT] weird xml snippet in "crm configure show"
- On Feb 12, 2021, at 11:18 AM, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: > > What is the output of "crm configure verify"? ha-idg-1:~ # crm configure verify WARNING: cib-bootstrap-options: unknown attribute 'no-quirum-policy' WARNING: clvmd: specified timeout 20 for monitor is smaller than the advised 90s WARNING: dlm: specified timeout 80 for stop is smaller than the advised 100 WARNING: dlm: specified timeout 80 for start is smaller than the advised 90 WARNING: fs_ocfs2: specified timeout 20 for monitor is smaller than the advised 40s WARNING: fs_test_ocfs2: specified timeout 20 for monitor is smaller than the advised 40s WARNING: gfs2_share: specified timeout 20 for monitor is smaller than the advised 40s WARNING: gfs2_snap: specified timeout 20 for monitor is smaller than the advised 40s WARNING: vm_amok: specified timeout 25 for monitor is smaller than the advised 30s WARNING: vm_crispor: specified timeout 25 for monitor is smaller than the advised 30s WARNING: vm_crispor-server: specified timeout 25 for monitor is smaller than the advised 30s WARNING: vm_dietrich: specified timeout 25 for monitor is smaller than the advised 30s WARNING: vm_documents-oo: specified timeout 25 for monitor is smaller than the advised 30s WARNING: vm_geneious: specified timeout 25 for monitor is smaller than the advised 30s WARNING: vm_geneious-license: specified timeout 25 for monitor is smaller than the advised 30s WARNING: vm_geneious-license-mcd: specified timeout 25 for monitor is smaller than the advised 30s WARNING: vm_genetrap: specified timeout 25 for monitor is smaller than the advised 30s WARNING: vm_greensql: specified timeout 25 for monitor is smaller than the advised 30s WARNING: vm_idcc_devel: specified timeout 25 for monitor is smaller than the advised 30s WARNING: vm_mausdb: specified timeout 25 for monitor is smaller than the advised 30s WARNING: vm_mouseidgenes: specified timeout 25 for monitor is smaller than the advised 30s WARNING: vm_nextcloud: 
specified timeout 25 for monitor is smaller than the advised 30s WARNING: vm_pathway: specified timeout 25 for monitor is smaller than the advised 30s WARNING: vm_photoshop: specified timeout 25 for monitor is smaller than the advised 30s WARNING: vm_seneca: specified timeout 25 for monitor is smaller than the advised 30s WARNING: vm_severin: specified timeout 25 for monitor is smaller than the advised 30s WARNING: vm_sim: specified timeout 25 for monitor is smaller than the advised 30s WARNING: vm_snipanalysis: specified timeout 25 for monitor is smaller than the advised 30s WARNING: vm_ssh: specified timeout 25 for monitor is smaller than the advised 30s Bernd
Re: [ClusterLabs] Antw: Re: Antw: [EXT] Re: VirtualDomain does not stop via "crm resource stop" - modify RA ?
- On Oct 26, 2020, at 4:09 PM, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: > > AFAIK you can even kill processes in Linux that are in "D" state (contrary to > other operating systems). How? Bernd
Re: [ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?
- On Oct 23, 2020, at 11:18 PM, Bernd Lentes bernd.len...@helmholtz-muenchen.de wrote: > - On Oct 23, 2020, at 11:11 PM, arvidjaar arvidj...@gmail.com wrote: > > >>> I need someting like that which waits for some time (maybe 30s) if the >>> domain >>> nevertheless stops although >>> "virsh destroy" gaves an error back. Because the SIGKILL is delivered if the >>> process wakes up from D state. >> >> So why not ignore virsh error and just wait always? You probably need to >> retain "domain not found" exit condition still. Hi, here is my rewritten RA: https://hmgubox2.helmholtz-muenchen.de/index.php/s/iYjRyJiWb5XNfXm As I'm not a coder, feedback is welcome. Bernd
Re: [ClusterLabs] Antw: [EXT] Re: VirtualDomain does not stop via "crm resource stop" - modify RA ?
- On Oct 26, 2020, at 8:41 AM, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: > "SIGKILL: Device or resource busy" is nonsense: kill does not wait; it either > fails or succeeds. Yes and no. When you send a SIGKILL to a process which is in 'D' state, the signal can't be delivered, e.g. because the domain is doing heavy IO. But the signal is pending, and when the process wakes up and switches to 'S' or 'R' state, the signal is delivered and the process is killed. That's exactly what I want to address. If you just check the rc from kill and get something nonzero, you think it failed. But it's possible your kill takes effect 20 seconds later. So, with a delay, SIGKILL would succeed. But the node the domain is running on is already fenced, with all its consequences. And the "device or resource busy" message is exactly the one you get when you do a "virsh destroy" on a domain in 'D' state. Bernd
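The pending-SIGKILL behaviour described above means the immediate exit status of the kill is not the whole story: the process may die seconds later when it leaves uninterruptible sleep. Polling for a bounded time after the kill is more robust. A minimal sketch:

```shell
# Wait up to TIMEOUT seconds for PID to disappear; a SIGKILL queued against a
# D-state process is delivered as soon as the process becomes runnable again.
wait_gone() {   # wait_gone PID TIMEOUT_SECONDS -> 0 if the process disappeared
    pid=$1 t=$2
    while [ "$t" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # gone (or never existed)
        sleep 1
        t=$((t - 1))
    done
    return 1                                     # still alive after timeout
}
```

Usage in the scenario from this thread (names hypothetical): after a failed `virsh destroy`, `wait_gone "$qemu_pid" 30` decides between reporting success and escalating to fencing, instead of trusting the destroy's immediate return code.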
Re: [ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?
- On Oct 23, 2020, at 11:11 PM, arvidjaar arvidj...@gmail.com wrote: >> I need someting like that which waits for some time (maybe 30s) if the domain >> nevertheless stops although >> "virsh destroy" gaves an error back. Because the SIGKILL is delivered if the >> process wakes up from D state. > > So why not ignore virsh error and just wait always? You probably need to > retain "domain not found" exit condition still. That's my plan. Bernd
Re: [ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?
- On Oct 23, 2020, at 8:45 PM, Valentin Vidić vvi...@valentin-vidic.from.hr wrote: > On Fri, Oct 23, 2020 at 08:08:31PM +0200, Lentes, Bernd wrote: >> But when the timeout has run out the RA tries to kill the machine with a >> "virsh >> destroy". >> And if that does not work (what is occasionally my problem) because the >> domain >> is in uninterruptable sleep (D state) the RA gives a $OCF_ERR_GENERIC back >> which >> cause pacemaker to fence the lazy node. Or am i wrong ? > > What does the log look like when this happens? > /var/log/cluster/corosync.log: VirtualDomain(vm_amok)[8998]: 2020/09/27_22:34:11 INFO: Issuing graceful shutdown request for domain vm_amok. VirtualDomain(vm_amok)[8998]: 2020/09/27_22:37:06 INFO: Issuing forced shutdown (destroy) request for domain vm_amok. Sep 27 22:37:11 [11282] ha-idg-2 lrmd: warning: child_timeout_callback: vm_amok_stop_0 process (PID 8998) timed out Sep 27 22:37:11 [11282] ha-idg-2 lrmd: warning: operation_finished: vm_amok_stop_0:8998 - timed out after 180000ms The timeout of the domain is 180 sec. /var/log/libvirt/libvirtd.log (time is UTC): 2020-09-27 20:37:21.489+0000: 18583: error : virProcessKillPainfully:401 : Failed to terminate process 14037 with SIGKILL: Device or resource busy 2020-09-27 20:37:21.505+0000: 6610: error : virNetSocketWriteWire:1852 : Cannot write data: Broken pipe 2020-09-27 20:37:31.962+0000: 6610: error : qemuMonitorIO:719 : internal error: End of file from qemu monitor SIGKILL didn't work. Nevertheless the process finished 20 seconds after the destroy, surely because it woke up from D state and received the signal. /var/log/cluster/corosync.log on the DC: Sep 27 22:37:11 [3580] ha-idg-1 crmd: warning: status_from_rc: Action 93 (vm_amok_stop_0) on ha-idg-2 failed (target: 0 vs.
rc: 1): Error Stop (also sigkill) failed Sep 27 22:37:11 [3579] ha-idg-1 pengine: notice: native_stop_constraints: Stop of failed resource vm_amok is implicit after ha-idg-2 is fenced

cluster decides to fence the node although resource is stopped 10 seconds later

atop log:
14037 - S 261% /usr/bin/qemu-system-x86_64 -machine accel=kvm -name guest=vm_amok,debug-threads=on -S -object secret,id=masterKey0 ...

PID of the domain is 14037

14037 - E 0% worker (at 22:37:31)

domain has stopped

Bernd
Re: [ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?
- On Oct 23, 2020, at 5:06 PM, Strahil Nikolov hunter86...@yahoo.com wrote:
> why don't you work with something like this: 'op stop interval=300
> timeout=600'.
> The stop operation will timeout at your requirements without modifying the
> script.
>
> Best Regards,
> Strahil Nikolov

But when the timeout has run out the RA tries to kill the machine with a "virsh destroy". And if that does not work (which is occasionally my problem) because the domain is in uninterruptible sleep (D state), the RA gives an $OCF_ERR_GENERIC back, which causes pacemaker to fence the lazy node. Or am I wrong ? Where is the benefit of the shorter interval ? The return value of the "virsh destroy" operation is set immediately, and it's -ne 0 when the "virsh destroy" didn't succeed. No matter if the domain stops 20 sec. later, the return value is not changed and is sent to the LRM, so the cluster wants to stonith that node. Surprisingly, if the virsh destroy is successful the RA waits until the domain isn't running anymore:

force_stop() {
    ...
    0*)
        while [ $status != $OCF_NOT_RUNNING ]; do
            VirtualDomain_status
            status=$?
        done ;;

I need something like that which waits for some time (maybe 30s) in case the domain stops nevertheless, although "virsh destroy" gave an error back, because the SIGKILL is delivered when the process wakes up from D state. For this amount of time the RA has to wait and make sure that the return value is zero if the domain stopped, or -ne 0 if the waiting also didn't help.

Bernd
Re: [ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?
- On Oct 23, 2020, at 7:11 AM, Andrei Borzenkov arvidj...@gmail.com wrote:
>>
>> ocf_log info "Issuing forced shutdown (destroy) request for domain ${DOMAIN_NAME}."
>> out=$(LANG=C virsh $VIRSH_OPTIONS destroy ${DOMAIN_NAME} 2>&1)
>> ex=$?
>> sleep 10    <=== (or maybe configurable)
>> translate=$(echo $out|tr 'A-Z' 'a-z')
>>
>> What do you think ?
>
> It makes no difference. You wait 10 seconds before parsing the output of
> "virsh destroy", that's all. It does not change the output itself, so if
> the output indicates that "virsh destroy" failed, it will still indicate
> that after 10 seconds.
>
> Either you need to repeat "virsh destroy" in a loop, or virsh itself
> should be more robust.

Hi Andrei,
yes, you are right. I saw it already after sending the e-mail. I will change that.

Bernd
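Andrei's suggestion (repeat "virsh destroy" in a loop instead of sleeping once) can be sketched as a small POSIX-sh helper. This is an illustrative, self-contained sketch, not the RA's code: `retry` and `flaky_destroy` are hypothetical names, and `flaky_destroy` merely simulates a destroy that keeps failing while the qemu process sits in D state.

```shell
#!/bin/sh
# Generic retry helper: run a command up to $1 times, sleeping 1s between
# attempts; returns 0 on the first success, 1 if all attempts fail.
retry() {
    tries=$1; shift
    n=1
    while :; do
        "$@" && return 0
        [ "$n" -ge "$tries" ] && return 1
        n=$((n + 1))
        sleep 1
    done
}

# Stub standing in for "virsh destroy": fails twice (simulating the SIGKILL
# being queued while the process is in D state), then succeeds.
attempts=0
flaky_destroy() {
    attempts=$((attempts + 1))
    [ "$attempts" -ge 3 ]
}

if retry 5 flaky_destroy; then
    echo "destroy succeeded after $attempts attempts"   # prints: destroy succeeded after 3 attempts
else
    echo "destroy kept failing"
fi
```

In the real RA, `flaky_destroy` would be the actual `virsh $VIRSH_OPTIONS destroy ${DOMAIN_NAME}` call, and the retry count would come from a parameter rather than a literal.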
[ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?
Hi guys,
occasionally stopping a VirtualDomain resource via "crm resource stop" does not work, and in the end the node is fenced, which is ugly. I had a look at the RA to see what it does. After trying to stop the domain via "virsh shutdown ...", it switches to "virsh destroy" after a configurable time. I assume "virsh destroy" sends a SIGKILL to the respective process. But when the host is doing heavy IO it's possible that the process is in "D" state (uninterruptible sleep), in which it can't be finished with a SIGKILL. Then the node the domain is running on is fenced due to that. I dug deeper and found out that the signal is often delivered a bit later (just some seconds) and the process is killed, but pacemaker has already decided to fence the node. It's all about this excerpt in the RA:

force_stop() {
    local out ex translate
    local status=0
    ocf_log info "Issuing forced shutdown (destroy) request for domain ${DOMAIN_NAME}."
    out=$(LANG=C virsh $VIRSH_OPTIONS destroy ${DOMAIN_NAME} 2>&1)
    ex=$?
    translate=$(echo $out|tr 'A-Z' 'a-z')
    echo >&2 "$translate"
    case $ex$translate in
        *"error:"*"domain is not running"*|*"error:"*"domain not found"*|\
        *"error:"*"failed to get domain"*)
            : ;; # unexpected path to the intended outcome, all is well
        [!0]*)
            ocf_exit_reason "forced stop failed"
            return $OCF_ERR_GENERIC ;;
        0*)
            while [ $status != $OCF_NOT_RUNNING ]; do
                VirtualDomain_status
                status=$?
            done ;;
    esac
    return $OCF_SUCCESS
}

I'm thinking about the following: how about letting the script wait a bit after "virsh destroy"? I saw that usually it just takes some seconds until the "virsh destroy" takes effect. I'm thinking about this change:

ocf_log info "Issuing forced shutdown (destroy) request for domain ${DOMAIN_NAME}."
out=$(LANG=C virsh $VIRSH_OPTIONS destroy ${DOMAIN_NAME} 2>&1)
ex=$?
sleep 10    <=== (or maybe configurable)
translate=$(echo $out|tr 'A-Z' 'a-z')

What do you think ?
Bernd -- Bernd Lentes Systemadministration Institute for Metabolism and Cell Death (MCD) Building 25 - office 122 HelmholtzZentrum München bernd.len...@helmholtz-muenchen.de phone: +49 89 3187 1241 phone: +49 89 3187 3827 fax: +49 89 3187 2294 http://www.helmholtz-muenchen.de/mcd stay healthy
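The grace-period idea from the message above (treat a failed "virsh destroy" as success if the domain still disappears within, say, 30 seconds) could look roughly like this. A self-contained sketch under stated assumptions: `wait_for_stop` and `domain_running` are hypothetical names, and `domain_running` is a stub standing in for the RA's VirtualDomain_status check; the marker file simulates the late delivery of the queued SIGKILL.

```shell
#!/bin/sh
# Hypothetical helper: after a failed "virsh destroy", poll for up to $1
# seconds in case the queued SIGKILL is delivered once the process leaves
# D state. Returns 0 if the domain stopped in time, 1 otherwise.
wait_for_stop() {
    grace=$1
    waited=0
    while domain_running; do
        [ "$waited" -ge "$grace" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Stub standing in for VirtualDomain_status: "running" until a marker file
# appears (here, created 2 seconds later by a background job).
marker="/tmp/vm_stopped.$$"
domain_running() { [ ! -e "$marker" ]; }
( sleep 2; : > "$marker" ) &

if wait_for_stop 30; then
    echo "domain stopped within grace period"   # prints this after ~2s
else
    echo "domain survived SIGKILL"
fi
rm -f "$marker"
```

In the RA itself, the success path would then return 0 to the LRM instead of $OCF_ERR_GENERIC, and only fall through to the error (and thus fencing) when the grace period expires with the domain still running.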
Re: [ClusterLabs] mess in the CIB
> > It's unlikely that changed at any time; more likely it was created like
> that. Whatever was used to create the initial configuration would be
> where to look for clues.
>
> As long as the IDs are unique, their content doesn't matter to
> pacemaker, so it's just a cosmetic issue.

What do you propose ? Delete and re-create correctly ? These domains can be stopped for a short time.

Bernd
[ClusterLabs] mess in the CIB
Hi guys,
i have a very strange problem with my CIB. We have a two-node cluster running about 15 VirtualDomains as resources. Two of them seem to be messed up. Here is the config from crm:

primitive vm_ssh VirtualDomain \
    params config="/mnt/share/vm_ssh.xml" \
    params hypervisor="qemu:///system" \
    params migration_transport=ssh \
    params migrate_options="--p2p --tunnelled" \
    op start interval=0 timeout=120 \
    op stop interval=0 timeout=180 \
    op monitor interval=30 timeout=25 \
    op migrate_from interval=0 timeout=300 \
    op migrate_to interval=0 timeout=300 \
    meta allow-migrate=true target-role=Started is-managed=true maintenance=false \
    utilization cpu=2 hv_memory=4096

ha-idg-1:/mnt/share # crm configure show vm_snipanalysis
primitive vm_snipanalysis VirtualDomain \
    params config="/mnt/share/vm_snipanalysis.xml" \
    params hypervisor="qemu:///system" \
    params migration_transport=ssh \
    params migrate_options="--p2p --tunnelled" \
    op start interval=0 timeout=120 \
    op stop interval=0 timeout=180 \
    op monitor interval=30 timeout=25 \
    op migrate_from interval=0 timeout=300 \
    op migrate_to interval=0 timeout=300 \
    meta allow-migrate=true target-role=Stopped is-managed=false maintenance=false

Everything looks ok to me. Here are the two config files for libvirt:

ha-idg-1:/etc/libvirt/qemu # less /mnt/share/vm_snipanalysis.xml
vm_snipanalysis b3b91a8c-b13f-4368-8439-7d8a4108ef3b 32768000 32768000 12 hvm destroy restart destroy /usr/bin/qemu-kvm

and

ha-idg-1:/etc/libvirt/qemu # less /mnt/share/vm_ssh.xml
vm_ssh b3b91a8d-b13f-4368-8439-7d8a4109ef3b 4194304 4194304 2 hvm destroy restart destroy /usr/bin/qemu-kvm

Also in the libvirt config files i don't see a problem. BUT in the cib: <== <== <=== and

The config of vm_snipanalysis seems to be ok. But vm_ssh ... why are some instance-attributes of it named with snipanalysis? I didn't change the configuration of either in the last weeks. Does anyone have a clue ? Thanks.
Bernd
Re: [ClusterLabs] VirtualDomain stop operation traced - but nothing appears in /var/lib/heartbeat/trace_ra/
- On Sep 30, 2020, at 7:24 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote:
> Hi
> Try to enable trace_ra for start op.

I'm now tracing both start and stop, and that works fine. Thanks for any hint.

Bernd
[ClusterLabs] VirtualDomain stop operation traced - but nothing appears in /var/lib/heartbeat/trace_ra/
Hi,
currently I have a VirtualDomain resource which sometimes fails to stop. To investigate further I'm tracing the stop operation of this resource. But although I have already stopped it several times, nothing appears in /var/lib/heartbeat/trace_ra/. This is my config:

primitive vm_amok VirtualDomain \
    params config="/mnt/share/vm_amok.xml" \
    params hypervisor="qemu:///system" \
    params migration_transport=ssh \
    params migrate_options="--p2p --tunnelled" \
    op start interval=0 timeout=120 \
    op monitor interval=30 timeout=25 \
    op migrate_from interval=0 timeout=300 \
    op migrate_to interval=0 timeout=300 \
    op stop interval=0 timeout=180 \
    op_params trace_ra=1 \
    meta allow-migrate=true target-role=Started is-managed=true maintenance=false

Any ideas ? SLES 12 SP4, pacemaker-1.1.19+20181105.ccd6b5b10-3.13.1.x86_64

Bernd
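For reference, the resolution that worked (per the reply upthread) was to trace the start operation as well as the stop. A sketch of how that might look in the crm configuration above (untested here; syntax adapted from the config already shown):

```
op start interval=0 timeout=120 \
op_params trace_ra=1 \
op stop interval=0 timeout=180 \
op_params trace_ra=1 \
```

With trace_ra=1 in effect, each traced operation should then drop a shell trace file under /var/lib/heartbeat/trace_ra/ (typically in a per-agent subdirectory).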
Re: [ClusterLabs] why is node fenced ?
- On Aug 19, 2020, at 4:04 PM, kgaillot kgail...@redhat.com wrote:
>> This appears to be a scheduler bug.
>
> Fix is in master branch and will land in 2.0.5 expected at end of the
> year
>
> https://github.com/ClusterLabs/pacemaker/pull/2146

A general question: I have SLES 12 and I'm using the pacemaker version provided with the distribution. Whether this fix is backported depends on Suse. If I install and update pacemaker manually (not the version provided by Suse), I lose their support, but always have the most recent code and fixes. If I stay with the version from Suse I have their support, but maybe not all fixes and not the most recent code. What is your approach ? Recommendations ? Thanks.

Bernd
Re: [ClusterLabs] why is node fenced ?
- On Aug 18, 2020, at 7:30 PM, kgaillot kgail...@redhat.com wrote:
>> > I'm not sure, I'd have to see the pe input.
>>
>> You find it here:
>> https://hmgubox2.helmholtz-muenchen.de/index.php/s/WJGtodMZ9k7rN29
>
> This appears to be a scheduler bug.
>
> The scheduler considers a migration to be "dangling" if it has a record
> of a failed migrate_to on the source node, but no migrate_from on the
> target node (and no migrate_from or start on the source node, which
> would indicate a later full restart or reverse migration).
>
> In this case, any migrate_from on the target has since been superseded
> by a failed start and a successful stop, so there is no longer a record
> of it. Therefore the migration is considered dangling, which requires a
> full stop on the source node.
>
> However in this case we already have a successful stop on the source
> node after the failed migrate_to, and I believe that should be
> sufficient to consider it no longer dangling.

Thanks for your explanation, Ken. For me, a fence I don't understand is the worst thing that can happen to an HA cluster.

Bernd
Re: [ClusterLabs] why is node fenced ?
- On Aug 17, 2020, at 5:09 PM, kgaillot kgail...@redhat.com wrote:
>> I checked all relevant pe-files in this time period.
>> This is what i found out (i just write the important entries):
>> Executing cluster transition:
>> * Resource action: vm_nextcloud stop on ha-idg-2
>> Revised cluster status:
>> vm_nextcloud (ocf::heartbeat:VirtualDomain): Stopped
>>
>> ha-idg-1:~/why-fenced/ha-idg-1/pengine # crm_simulate -S -x pe-input-3118 -G transition-4516.xml -D transition-4516.dot
>> Current cluster status:
>> Node ha-idg-1 (1084777482): standby
>> Online: [ ha-idg-2 ]
>> vm_nextcloud (ocf::heartbeat:VirtualDomain): Stopped <== vm_nextcloud is stopped
>> Transition Summary:
>> * Shutdown ha-idg-1
>> Executing cluster transition:
>> * Resource action: vm_nextcloud stop on ha-idg-1 <== why stop ? It is already stopped
>
> I'm not sure, I'd have to see the pe input.

You find it here: https://hmgubox2.helmholtz-muenchen.de/index.php/s/WJGtodMZ9k7rN29

>> vm_nextcloud (ocf::heartbeat:VirtualDomain): Stopped <=== vm_nextcloud is stopped
>> Transition Summary:
>> * Fence (Off) ha-idg-1 'resource actions are unrunnable'
>> Executing cluster transition:
>> * Fencing ha-idg-1 (Off)
>> * Pseudo action: vm_nextcloud_stop_0 <=== why stop ? It is already stopped ?
>> Revised cluster status:
>> Node ha-idg-1 (1084777482): OFFLINE (standby)
>> Online: [ ha-idg-2 ]
>> vm_nextcloud (ocf::heartbeat:VirtualDomain): Stopped
>>
>> I don't understand why the cluster tries to stop a resource which is already stopped.

Bernd
Re: [ClusterLabs] why is node fenced ?
- On Aug 9, 2020, at 10:17 PM, Bernd Lentes bernd.len...@helmholtz-muenchen.de wrote: >> So this appears to be the problem. From these logs I would guess the >> successful stop on ha-idg-1 did not get written to the CIB for some >> reason. I'd look at the pe input from this transition on ha-idg-2 to >> confirm that. >> >> Without the DC knowing about the stop, it tries to schedule a new one, >> but the node is shutting down so it can't do it, which means it has to >> be fenced. I checked all relevant pe-files in this time period. This is what i found out (i just write the important entries): ha-idg-1:~/why-fenced/ha-idg-1/pengine # crm_simulate -S -x pe-input-3116 -G transition-3116.xml -D transition-3116.dot Current cluster status: ... vm_nextcloud (ocf::heartbeat:VirtualDomain): Started ha-idg-1 Transition Summary: ... * Migratevm_nextcloud ( ha-idg-1 -> ha-idg-2 ) Executing cluster transition: * Resource action: vm_nextcloudmigrate_from on ha-idg-2 <=== migrate vm_nextcloud * Resource action: vm_nextcloudstop on ha-idg-1 * Pseudo action: vm_nextcloud_start_0 Revised cluster status: Node ha-idg-1 (1084777482): standby Online: [ ha-idg-2 ] vm_nextcloud (ocf::heartbeat:VirtualDomain): Started ha-idg-2 ha-idg-1:~/why-fenced/ha-idg-1/pengine # crm_simulate -S -x pe-error-48 -G transition-4514.xml -D transition-4514.dot Current cluster status: Node ha-idg-1 (1084777482): standby Online: [ ha-idg-2 ] ... vm_nextcloud (ocf::heartbeat:VirtualDomain): FAILED[ ha-idg-2 ha-idg-1 ] <== migration failed Transition Summary: .. 
* Recover vm_nextcloud ( ha-idg-2 )
Executing cluster transition:
* Resource action: vm_nextcloud stop on ha-idg-2
* Resource action: vm_nextcloud stop on ha-idg-1
* Resource action: vm_nextcloud start on ha-idg-2
* Resource action: vm_nextcloud monitor=3 on ha-idg-2
Revised cluster status:
vm_nextcloud (ocf::heartbeat:VirtualDomain): Started ha-idg-2

ha-idg-1:~/why-fenced/ha-idg-1/pengine # crm_simulate -S -x pe-input-3117 -G transition-3117.xml -D transition-3117.dot
Current cluster status:
Node ha-idg-1 (1084777482): standby
Online: [ ha-idg-2 ]
vm_nextcloud (ocf::heartbeat:VirtualDomain): FAILED ha-idg-2 <== start on ha-idg-2 failed
Transition Summary:
* Stop vm_nextcloud ( ha-idg-2 ) due to node availability <== stop vm_nextcloud (what does "due to node availability" mean ?)
Executing cluster transition:
* Resource action: vm_nextcloud stop on ha-idg-2
Revised cluster status:
vm_nextcloud (ocf::heartbeat:VirtualDomain): Stopped

ha-idg-1:~/why-fenced/ha-idg-1/pengine # crm_simulate -S -x pe-input-3118 -G transition-4516.xml -D transition-4516.dot
Current cluster status:
Node ha-idg-1 (1084777482): standby
Online: [ ha-idg-2 ]
vm_nextcloud (ocf::heartbeat:VirtualDomain): Stopped <== vm_nextcloud is stopped
Transition Summary:
* Shutdown ha-idg-1
Executing cluster transition:
* Resource action: vm_nextcloud stop on ha-idg-1 <== why stop ?
It is already stopped Revised cluster status: vm_nextcloud (ocf::heartbeat:VirtualDomain): Stopped ha-idg-1:~/why-fenced/ha-idg-2/pengine # crm_simulate -S -x pe-input-3545 -G transition-0.xml -D transition-0.dot Current cluster status: Node ha-idg-1 (1084777482): pending Online: [ ha-idg-2 ] vm_nextcloud (ocf::heartbeat:VirtualDomain): Stopped <== vm_nextcloud is stopped Transition Summary: Executing cluster transition: Using the original execution date of: 2020-07-20 15:05:33Z Revised cluster status: vm_nextcloud (ocf::heartbeat:VirtualDomain): Stopped ha-idg-1:~/why-fenced/ha-idg-2/pengine # crm_simulate -S -x pe-warn-749 -G transition-1.xml -D transition-1.dot Current cluster status: Node ha-idg-1 (1084777482): OFFLINE (standby) Online: [ ha-idg-2 ] vm_nextcloud (ocf::heartbeat:VirtualDomain): Stopped <=== vm_nextcloud is stopped Transition Summary: * Fence (Off) ha-idg-1 'resource actions are unrunnable' Executing cluster transition: * Fencing ha-idg-1 (Off) * Pseudo action: vm_nextcloud_stop_0 <=== why stop ? It is already stopped ? Revised cluster status: Node ha-idg-1 (1084777482): OFFLINE (standby) Online: [ ha-idg-2 ] vm_nextcloud (ocf::heartbeat:VirtualDomain): Stopped I don't understand why the cluster tries to stop a resource which is already stopped. Bernd
Re: [ClusterLabs] why is node fenced ?
- On Aug 10, 2020, at 11:59 PM, kgaillot kgail...@redhat.com wrote:
> The most recent transition is aborted, but since all its actions are
> complete, the only effect is to trigger a new transition.
>
> We should probably rephrase the log message. In fact, the whole
> "transition" terminology is kind of obscure. It's hard to come up with
> something better though.

Hi Ken,
I don't get it. How can something be aborted which is already completed ?

Bernd
Re: [ClusterLabs] why is node fenced ?
- Am 29. Jul 2020 um 18:53 schrieb kgaillot kgail...@redhat.com: > On Wed, 2020-07-29 at 17:26 +0200, Lentes, Bernd wrote: >> Hi, >> >> a few days ago one of my nodes was fenced and i don't know why, which >> is something i really don't like. >> What i did: >> I put one node (ha-idg-1) in standby. The resources on it (most of >> all virtual domains) were migrated to ha-idg-2, >> except one domain (vm_nextcloud). On ha-idg-2 a mountpoint was >> missing the xml of the domain points to. >> Then the cluster tries to start vm_nextcloud on ha-idg-2 which of >> course also failed. >> Then ha-idg-1 was fenced. >> I did a "crm history" over the respective time period, you find it >> here: >> https://hmgubox2.helmholtz-muenchen.de/index.php/s/529dfcXf5a72ifF >> >> Here, from my point of view, the most interesting from the logs: >> ha-idg-1: >> Jul 20 16:59:33 [23763] ha-idg-1cib: info: >> cib_perform_op: Diff: --- 2.16196.19 2 >> Jul 20 16:59:33 [23763] ha-idg-1cib: info: >> cib_perform_op: Diff: +++ 2.16197.0 bc9a558dfbe6d7196653ce56ad1ee758 >> Jul 20 16:59:33 [23763] ha-idg-1cib: info: >> cib_perform_op: + /cib: @epoch=16197, @num_updates=0 >> Jul 20 16:59:33 [23763] ha-idg-1cib: info: >> cib_perform_op: + /cib/configuration/nodes/node[@id='1084777482']/i >> nstance_attributes[@id='nodes-108 >> 4777482']/nvpair[@id='nodes-1084777482-standby']: @value=on >> ha-idg-1 set to standby >> >> Jul 20 16:59:34 [23768] ha-idg-1 crmd: notice: >> process_lrm_event: ha-idg-1-vm_nextcloud_migrate_to_0:3169 [ >> error: Cannot access storage file >> '/mnt/mcd/AG_BioInformatik/Technik/software_und_treiber/linux/ubuntu/ >> ubuntu-18.04.4-live-server-amd64.iso': No such file or >> directory\nocf-exit-reason:vm_nextcloud: live migration to ha-idg-2 >> failed: 1\n ] >> migration failed >> >> Jul 20 17:04:01 [23767] ha-idg-1pengine:error: >> native_create_actions: Resource vm_nextcloud is active on 2 nodes >> (attempting recovery) >> ??? 
> > This is standard for a failed live migration -- the cluster doesn't > know how far the migration actually got before failing, so it has to > assume the VM could be active on either node. (The log message would > make more sense saying "might be active" rather than "is active".) > >> Jul 20 17:04:01 [23767] ha-idg-1pengine: notice: >> LogAction:* >> Recovervm_nextcloud ( ha-idg-2 ) > > The recovery from that situation is a full stop on both nodes, and > start on one of them. > >> Jul 20 17:04:01 [23768] ha-idg-1 crmd: notice: >> te_rsc_command: Initiating stop operation vm_nextcloud_stop_0 on ha- >> idg-2 | action 106 >> Jul 20 17:04:01 [23768] ha-idg-1 crmd: notice: >> te_rsc_command: Initiating stop operation vm_nextcloud_stop_0 >> locally on ha-idg-1 | action 2 >> >> Jul 20 17:04:01 [23768] ha-idg-1 crmd: info: >> match_graph_event: Action vm_nextcloud_stop_0 (106) confirmed >> on ha-idg-2 (rc=0) >> >> Jul 20 17:04:06 [23768] ha-idg-1 crmd: notice: >> process_lrm_event: Result of stop operation for vm_nextcloud on >> ha-idg-1: 0 (ok) | call=3197 key=vm_nextcloud_stop_0 confirmed=true >> cib-update=5960 > > It looks like both stops succeeded. 
> >> Jul 20 17:05:29 [23761] ha-idg-1 pacemakerd: notice: >> crm_signal_dispatch: Caught 'Terminated' signal | 15 (invoking >> handler) >> systemctl stop pacemaker.service >> >> >> ha-idg-2: >> Jul 20 17:04:03 [10691] ha-idg-2 crmd: notice: >> process_lrm_event: Result of stop operation for vm_nextcloud on >> ha-idg-2: 0 (ok) | call=157 key=vm_nextcloud_stop_0 confirmed=true >> cib-update=57 >> the log from ha-idg-2 is two seconds ahead of ha-idg-1 >> >> Jul 20 17:04:08 [10688] ha-idg-2 lrmd: notice: >> log_execute: executing - rsc:vm_nextcloud action:start >> call_id:192 >> Jul 20 17:04:09 [10688] ha-idg-2 lrmd: notice: >> operation_finished: vm_nextcloud_start_0:29107:stderr [ error: >> Failed to create domain from /mnt/share/vm_nextcloud.xml ] >> Jul 20 17:04:09 [10688] ha-idg-2 lrmd: notice: >> operation_finished: vm_nextcloud_start_0:29107:stderr [ error: >> Cannot access storage file >> '/mnt/mcd/AG_BioInformatik/Technik/software_und_treiber/linux/ubuntu/ >> ubuntu-18.04.4-live-server-amd64.iso': No such file or directory ] >> J
Re: [ClusterLabs] why is node fenced ?
- Am 29. Jul 2020 um 18:53 schrieb kgaillot kgail...@redhat.com: > Since the ha-idg-2 is now shutting down, ha-idg-1 becomes DC. The other way round. >> Jul 20 17:05:33 [10690] ha-idg-2pengine: warning: >> unpack_rsc_op_failure: Processing failed migrate_to of vm_nextcloud >> on ha-idg-1: unknown error | rc=1 >> Jul 20 17:05:33 [10690] ha-idg-2pengine: warning: >> unpack_rsc_op_failure: Processing failed start of vm_nextcloud on >> ha-idg-2: unknown error | rc >> >> Jul 20 17:05:33 [10690] ha-idg-2pengine: info: >> native_color:Resource vm_nextcloud cannot run anywhere >> logical >> >> Jul 20 17:05:33 [10690] ha-idg-2pengine: warning: >> custom_action: Action vm_nextcloud_stop_0 on ha-idg-1 is unrunnable >> (pending) >> ??? > > So this appears to be the problem. From these logs I would guess the > successful stop on ha-idg-1 did not get written to the CIB for some > reason. I'd look at the pe input from this transition on ha-idg-2 to > confirm that. > > Without the DC knowing about the stop, it tries to schedule a new one, > but the node is shutting down so it can't do it, which means it has to > be fenced. 
> >> Jul 20 17:05:35 [10690] ha-idg-2pengine: warning: >> custom_action: Action vm_nextcloud_stop_0 on ha-idg-1 is unrunnable >> (offline) >> Jul 20 17:05:35 [10690] ha-idg-2pengine: warning: >> pe_fence_node: Cluster node ha-idg-1 will be fenced: resource >> actions are unrunnable >> Jul 20 17:05:35 [10690] ha-idg-2pengine: warning: >> stage6: Scheduling Node ha-idg-1 for STONITH >> Jul 20 17:05:35 [10690] ha-idg-2pengine: info: >> native_stop_constraints: vm_nextcloud_stop_0 is implicit after ha- >> idg-1 is fenced >> Jul 20 17:05:35 [10690] ha-idg-2pengine: notice: >> LogNodeActions: * Fence (Off) ha-idg-1 'resource actions are >> unrunnable' >> >> >> Why does it say "Jul 20 17:05:35 [10690] ha-idg- >> 2pengine: warning: custom_action: Action vm_nextcloud_stop_0 >> on ha-idg-1 is unrunnable (offline)" although >> "Jul 20 17:04:06 [23768] ha-idg-1 crmd: notice: >> process_lrm_event: Result of stop operation for vm_nextcloud on >> ha-idg-1: 0 (ok) | call=3197 key=vm_nextcloud_stop_0 confirmed=true >> cib-update=5960" >> says that stop was ok ? Bernd
Re: [ClusterLabs] Antw: Re: Antw: [EXT] why is node fenced ?
- On Jul 31, 2020, at 8:03 AM, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:
>>> My guess is that ha-idg-1 was fenced because a failed migration from ha-idg-2
>>> is treated like a stop failure on ha-idg-2. Stop failures cause fencing. You
>>> should have tested your resource before going productive.
>>
>> Migration failed at 16:59:34.
>> Node is fenced at 17:05:35. 6 minutes later.
>> The cluster needs 6 minutes to decide to fence the node ?
>> I don't believe that the failed migration is the cause for the fencing.
>
> What are the values for migration timeout and for stop timeout?

primitive vm_nextcloud VirtualDomain \
    params config="/mnt/share/vm_nextcloud.xml" \
    params hypervisor="qemu:///system" \
    params migration_transport=ssh \
    params migrate_options="--p2p --tunnelled" \
    op start interval=0 timeout=120 \
    op stop interval=0 timeout=180 \ <==
    op monitor interval=30 timeout=25 \
    op migrate_from interval=0 timeout=300 \ <=
    op migrate_to interval=0 timeout=300 \ <
    meta allow-migrate=true target-role=Started is-managed=true maintenance=false \
    utilization cpu=1 hv_memory=4096

3 or 5 minutes, not 6 minutes.

Bernd
Re: [ClusterLabs] Antw: [EXT] why is node fenced ?
- On Jul 30, 2020, at 9:28 AM, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:
>>>> "Lentes, Bernd" wrote on 29.07.2020 at 17:26 in message
> <1894379294.27456141.1596036406000.javamail.zim...@helmholtz-muenchen.de>:
>> Hi,
>>
>> a few days ago one of my nodes was fenced and i don't know why, which is
>> something i really don't like.
>> What i did:
>> I put one node (ha-idg-1) in standby. The resources on it (most of all
>> virtual domains) were migrated to ha-idg-2,
>> except one domain (vm_nextcloud). On ha-idg-2 a mountpoint was missing the
>> xml of the domain points to.
>> Then the cluster tries to start vm_nextcloud on ha-idg-2 which of course
>> also failed.
>> Then ha-idg-1 was fenced.
>
> My guess is that ha-idg-1 was fenced because a failed migration from ha-idg-2
> is treated like a stop failure on ha-idg-2. Stop failures cause fencing. You
> should have tested your resource before going productive.

Migration failed at 16:59:34. Node is fenced at 17:05:35. 6 minutes later. The cluster needs 6 minutes to decide to fence the node ? I don't believe that the failed migration is the cause for the fencing.

Bernd
Re: [ClusterLabs] why is node fenced ?
- On Jul 29, 2020, at 5:26 PM, Bernd Lentes bernd.len...@helmholtz-muenchen.de wrote:

Hi,

sorry, I missed:
OS: SLES 12 SP4
kernel: 4.12.14-95.32
pacemaker: pacemaker-1.1.19+20181105.ccd6b5b10-3.13.1.x86_64

Bernd
[ClusterLabs] why is node fenced ?
Hi,

a few days ago one of my nodes was fenced and I don't know why, which is
something I really don't like.

What I did:
I put one node (ha-idg-1) in standby. The resources on it (most of all
virtual domains) were migrated to ha-idg-2, except one domain (vm_nextcloud).
On ha-idg-2 a mount point the XML of the domain points to was missing.
Then the cluster tried to start vm_nextcloud on ha-idg-2, which of course
also failed. Then ha-idg-1 was fenced.

I did a "crm history" over the respective time period; you find it here:
https://hmgubox2.helmholtz-muenchen.de/index.php/s/529dfcXf5a72ifF

Here, from my point of view, is the most interesting part of the logs:

ha-idg-1:
Jul 20 16:59:33 [23763] ha-idg-1 cib: info: cib_perform_op: Diff: --- 2.16196.19 2
Jul 20 16:59:33 [23763] ha-idg-1 cib: info: cib_perform_op: Diff: +++ 2.16197.0 bc9a558dfbe6d7196653ce56ad1ee758
Jul 20 16:59:33 [23763] ha-idg-1 cib: info: cib_perform_op: + /cib: @epoch=16197, @num_updates=0
Jul 20 16:59:33 [23763] ha-idg-1 cib: info: cib_perform_op: + /cib/configuration/nodes/node[@id='1084777482']/instance_attributes[@id='nodes-1084777482']/nvpair[@id='nodes-1084777482-standby']: @value=on
  => ha-idg-1 set to standby

Jul 20 16:59:34 [23768] ha-idg-1 crmd: notice: process_lrm_event: ha-idg-1-vm_nextcloud_migrate_to_0:3169 [ error: Cannot access storage file '/mnt/mcd/AG_BioInformatik/Technik/software_und_treiber/linux/ubuntu/ubuntu-18.04.4-live-server-amd64.iso': No such file or directory\nocf-exit-reason:vm_nextcloud: live migration to ha-idg-2 failed: 1\n ]
  => migration failed

Jul 20 17:04:01 [23767] ha-idg-1 pengine: error: native_create_actions: Resource vm_nextcloud is active on 2 nodes (attempting recovery)
  => ???
Jul 20 17:04:01 [23767] ha-idg-1 pengine: notice: LogAction: * Recover vm_nextcloud ( ha-idg-2 )
Jul 20 17:04:01 [23768] ha-idg-1 crmd: notice: te_rsc_command: Initiating stop operation vm_nextcloud_stop_0 on ha-idg-2 | action 106
Jul 20 17:04:01 [23768] ha-idg-1 crmd: notice: te_rsc_command: Initiating stop operation vm_nextcloud_stop_0 locally on ha-idg-1 | action 2
Jul 20 17:04:01 [23768] ha-idg-1 crmd: info: match_graph_event: Action vm_nextcloud_stop_0 (106) confirmed on ha-idg-2 (rc=0)
Jul 20 17:04:06 [23768] ha-idg-1 crmd: notice: process_lrm_event: Result of stop operation for vm_nextcloud on ha-idg-1: 0 (ok) | call=3197 key=vm_nextcloud_stop_0 confirmed=true cib-update=5960
Jul 20 17:05:29 [23761] ha-idg-1 pacemakerd: notice: crm_signal_dispatch: Caught 'Terminated' signal | 15 (invoking handler)
  => systemctl stop pacemaker.service

ha-idg-2:
Jul 20 17:04:03 [10691] ha-idg-2 crmd: notice: process_lrm_event: Result of stop operation for vm_nextcloud on ha-idg-2: 0 (ok) | call=157 key=vm_nextcloud_stop_0 confirmed=true cib-update=57
  => the log from ha-idg-2 is two seconds ahead of ha-idg-1
Jul 20 17:04:08 [10688] ha-idg-2 lrmd: notice: log_execute: executing - rsc:vm_nextcloud action:start call_id:192
Jul 20 17:04:09 [10688] ha-idg-2 lrmd: notice: operation_finished: vm_nextcloud_start_0:29107:stderr [ error: Failed to create domain from /mnt/share/vm_nextcloud.xml ]
Jul 20 17:04:09 [10688] ha-idg-2 lrmd: notice: operation_finished: vm_nextcloud_start_0:29107:stderr [ error: Cannot access storage file '/mnt/mcd/AG_BioInformatik/Technik/software_und_treiber/linux/ubuntu/ubuntu-18.04.4-live-server-amd64.iso': No such file or directory ]
Jul 20 17:04:09 [10688] ha-idg-2 lrmd: notice: operation_finished: vm_nextcloud_start_0:29107:stderr [ ocf-exit-reason:Failed to start virtual domain vm_nextcloud. ]
Jul 20 17:04:09 [10688] ha-idg-2 lrmd: notice: log_finished: finished - rsc:vm_nextcloud action:start call_id:192 pid:29107 exit-code:1 exec-time:581ms queue-time:0ms
  => start on ha-idg-2 failed
Jul 20 17:05:32 [10691] ha-idg-2 crmd: info: do_dc_takeover: Taking over DC status for this partition
  => ha-idg-1 stopped pacemaker
Jul 20 17:05:33 [10690] ha-idg-2 pengine: warning: unpack_rsc_op_failure: Processing failed migrate_to of vm_nextcloud on ha-idg-1: unknown error | rc=1
Jul 20 17:05:33 [10690] ha-idg-2 pengine: warning: unpack_rsc_op_failure: Processing failed start of vm_nextcloud on ha-idg-2: unknown error | rc
Jul 20 17:05:33 [10690] ha-idg-2 pengine: info: native_color: Resource vm_nextcloud cannot run anywhere
  => logical
Jul 20 17:05:33 [10690] ha-idg-2 pengine: warning: custom_action: Action vm_nextcloud_stop_0 on ha-idg-1 is unrunnable (pending)
  => ???
Jul 20 17:05:35 [10690] ha-idg-2 pengine: warning: custom_action: Action vm_nextcloud_stop_0 on ha-idg-1 is unrunnable (offline)
Jul 20 17:05:35 [10690] ha-idg-2 pengine: warning: pe_fence_node: Cluster node ha-idg-1 will be fenced: resource actions are
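For questions like this it helps to reduce the log to the scheduler's decision chain first. A minimal sketch below filters on the two message types that matter here (`pe_fence_node` and "unrunnable"); the sample lines are copied from the excerpts above, and on the real system the input would be the pacemaker log on the DC rather than a `printf`:

```shell
# Filter the fencing decision chain out of a log excerpt.
# Sample lines copied from this thread; on a live cluster, grep the
# pacemaker log on the DC instead of this printf.
printf '%s\n' \
  "Jul 20 17:05:33 pengine: warning: unpack_rsc_op_failure: Processing failed migrate_to of vm_nextcloud on ha-idg-1" \
  "Jul 20 17:05:33 pengine: info: native_color: Resource vm_nextcloud cannot run anywhere" \
  "Jul 20 17:05:35 pengine: warning: custom_action: Action vm_nextcloud_stop_0 on ha-idg-1 is unrunnable (offline)" \
  "Jul 20 17:05:35 pengine: warning: pe_fence_node: Cluster node ha-idg-1 will be fenced" \
  | grep -cE 'pe_fence_node|unrunnable'   # prints the number of matching lines (2 for this sample)
```

Dropping `-c` prints the matching lines themselves, which gives the "unrunnable stop leads to fencing" sequence at a glance.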
[ClusterLabs] pacemaker together with ovirt or Kimchi ?
Hi,

I'm having a two-node cluster with pacemaker and about 10 virtual domains as
resources. It's running fine. I configure/administrate everything with the
crm shell. But I'm also looking for a web interface. I'm not much impressed
by HAWK.
Is it possible to use Kimchi or oVirt together with a pacemaker HA cluster?

Bernd

--
Bernd Lentes
Systemadministration
Institute for Metabolism and Cell Death (MCD)
Building 25 - office 122
HelmholtzZentrum münchen
bernd.len...@helmholtz-muenchen.de
phone: +49 89 3187 1241
phone: +49 89 3187 3827
fax: +49 89 3187 2294
http://www.helmholtz-muenchen.de/mcd

stay healthy, stay at home
Re: [ClusterLabs] Antw: DLM, cLVM, GFS2 and OCFS2 managed by systemd instead of crm ?
- On Oct 16, 2019, at 8:27 AM, Digimer li...@alteeve.ca wrote:

> On 2019-10-16 2:16 a.m., Ulrich Windl wrote:
>>>>> "Lentes, Bernd" wrote on 15.10.2019 at 21:35 in message
>>>>> <1922568650.3402980.1571168140600.javamail.zim...@helmholtz-muenchen.de>:
>>> Hi,
>>>
>>> i'm a big fan of simple solutions (KISS).
>>> Currently i have DLM, cLVM, GFS2 and OCFS2 managed by pacemaker.
>>> They all are fundamental prerequisites for my resources (Virtual Domains).
>>> To configure them i used clones and groups.
>>> Why not having them managed by systemd to make the cluster setup more
>>> overseeable ?
>>>
>>> Is there a strong reason that pacemaker cares about them ?
>>
>> AFAIK, DLM (others maybe too) needs the cluster infrastructure (communication
>> layer) to be operable.
>> Also I consider systemd handling resources to be worse than pacemaker.
>> What is your specific problem? Keeping the cluster configuration simple while
>> moving complexity to systemd?
>>
>> Do you know one command to describe your systemd configuration as short as the
>> cluster configuration (like crm configure show)?
>>
>> Regards,
>> Ulrich
>
> This is correct. DLM uses corosync.

OK, I understand. I will stay with pacemaker.
Thanks for all answers.

Bernd
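For reference, the pacemaker-side layout the thread argues for is usually a cloned group of the base resources, so that DLM starts before cLVM and cLVM before the filesystem, and the whole stack follows corosync membership. A hypothetical crm shell sketch (agent names, device paths, and timeouts are assumptions; check the resource agents your distribution actually ships):

```
primitive dlm ocf:pacemaker:controld \
    op monitor interval=60 timeout=60
primitive clvmd ocf:heartbeat:clvm \
    op monitor interval=60 timeout=90
primitive fs_ocfs2 ocf:heartbeat:Filesystem \
    params device="/dev/vg_cluster/lv_shared" directory="/mnt/share" fstype=ocfs2 \
    op monitor interval=20 timeout=40
group g_storage dlm clvmd fs_ocfs2
clone cl_storage g_storage meta interleave=true
```

The group encodes the ordering, and the clone runs one copy of the stack per node; systemd units could not coordinate any of this with corosync membership or with fencing decisions.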
[ClusterLabs] DLM, cLVM, GFS2 and OCFS2 managed by systemd instead of crm ?
Hi,

I'm a big fan of simple solutions (KISS).
Currently I have DLM, cLVM, GFS2 and OCFS2 managed by pacemaker.
They all are fundamental prerequisites for my resources (virtual domains).
To configure them I used clones and groups.
Why not have them managed by systemd to make the cluster setup more
overseeable?

Is there a strong reason that pacemaker cares about them?

Bernd

--
Bernd Lentes
Systemadministration
Institut für Entwicklungsgenetik
Building 35.34 - Room 208
HelmholtzZentrum münchen
bernd.len...@helmholtz-muenchen.de
phone: +49 89 3187 1241
phone: +49 89 3187 3827
fax: +49 89 3187 2294
http://www.helmholtz-muenchen.de/idg

Perfect is whoever makes no mistakes.
So the dead are perfect.
Re: [ClusterLabs] trace of Filesystem RA does not log
>> -----Original Message-----
>> From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Lentes, Bernd
>> Sent: October 11, 2019, 10:32 PM
>> To: Pacemaker ML
>> Subject: [ClusterLabs] trace of Filesystem RA does not log
>>
>> Hi,
>>
>> occasionally the stop of a Filesystem resource for an OCFS2 partition fails.
>
> Which SLE version are you using?
> When the ocfs2 file system stop fails, does that mean the umount process is hung?
> Could you cat that process's stack via /proc/xxx/stack?
> Of course, you can also use o2locktop to identify whether there is any
> active/hanged dlm lock at that moment.

I'm using SLES 12 SP4.
I don't know exactly why umount isn't working or whether it hangs; that's why
I tried to trace the stop operation to get more information.
I will test o2locktop.
What do you mean by "/proc/xxx/stack"? The stack of which process should I
investigate? umount?

Bernd
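To make the /proc suggestion concrete: the idea is to find the PID of the hung process and read its kernel stack from /proc. A hedged sketch, assuming the stuck process is literally named `umount` (reading /proc/&lt;pid&gt;/stack requires root):

```shell
# Find a (possibly hung) umount process and dump its kernel stack.
# Assumption: the stuck process is named 'umount'. Run as root, since
# /proc/<pid>/stack is only readable by privileged users.
pid=$(pgrep -x umount | head -n1)
if [ -n "$pid" ]; then
    cat "/proc/$pid/stack"   # shows where in the kernel the process is blocked
else
    echo "no umount process found"
fi
```

If the stack ends in dlm_* or ocfs2_* functions, the process is blocked on a cluster lock, which is exactly the case where o2locktop is the next tool to reach for.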
Re: [ClusterLabs] trace of Filesystem RA does not log
- On Oct 14, 2019, at 6:27 AM, Roger Zhou zz...@suse.com wrote:

> The stop failure is very bad, and is crucial for an HA system.

Yes, that's true.

> You can try the o2locktop CLI to find the potential inode to be blamed [1].
>
> `o2locktop --help` gives you more usage details.

I will try that.

> [1] o2locktop package
> https://software.opensuse.org/package/o2locktop?search_term=o2locktop

Thanks.

Bernd