Re: [ClusterLabs] What's the number in "Servant pcmk is outdated (age: 682915)"
Hi Ulrich,

On 2022/6/1 7:59, Ulrich Windl wrote:

Hi! I'm wondering what the number in parentheses is for these messages:

    sbd[6809]: warning: inquisitor_child: pcmk health check: UNHEALTHY
    sbd[6809]: warning: inquisitor_child: Servant pcmk is outdated (age: 682915)

Regards,
Ulrich

As we know, each sbd watcher daemon (servant) is supposed to report a "healthy" status to the sbd inquisitor daemon in a timely manner as long as the watched object is fine. Here, the sbd pcmk servant proactively reported an "unhealthy" status.

The "age" value in the log message is indeed confusing in this pcmk/cluster servant case, though, since internally there is a coding trick that intentionally makes any previous "healthy" status immediately aged (outdated). The value itself is basically the tv_sec value from clock_gettime() minus 1, which is not really meaningful for users.

Regards,
Yan
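For the curious, a rough shell analogue of where that number comes from (assuming the monotonic clock is used, i.e. tv_sec counts seconds since boot; this is an illustration, not the actual sbd code):

    # The logged "age" is effectively "monotonic seconds minus one",
    # not a real duration since the last health report:
    awk '{ print int($1) - 1 }' /proc/uptime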
Re: [ClusterLabs] Failed migration causing fencing loop
Hi Ulrich,

On 2022/3/31 11:18, Gao,Yan via Users wrote:

On 2022/3/31 9:03, Ulrich Windl wrote:

Hi! I just wanted to point out one thing that hit us with SLES15 SP3: a failed live VM migration caused node fencing and resulted in a fencing loop, for two reasons:

1) Pacemaker thinks that even _after_ fencing there is some migration to "clean up". Pacemaker treats the situation as if the VM is running on both nodes, thus (50% chance?) trying to stop the VM on the node that just booted after fencing. That's stupid, but it shouldn't be fatal IF there weren't...

2) The stop operation of the VM (which actually isn't running) fails. AFAICT it could not connect to the hypervisor, but the logic in the RA is kind of arguable: the probe (monitor) of the VM returned "not running", yet the stop right after that returned failure...

OTOH, the point about Pacemaker is that the stop of the resource on the fenced and rejoined node is not really necessary. There have been discussions about this here, and we are trying to figure out a solution for it:
https://github.com/ClusterLabs/pacemaker/pull/2146#discussion_r828204919

FYI, this issue has been addressed with:
https://github.com/ClusterLabs/pacemaker/pull/2705

Regards,
Yan

For now it requires the administrator's intervention if the situation happens:
1) Fix the access to the hypervisor before the fenced node rejoins.
2) Manually clean up the resource, which tells Pacemaker it can safely forget the historical migrate_to failure.

Regards,
Yan

...causing a node fence. So the loop is complete. Some details (many unrelated messages left out):

    Mar 30 16:06:14 h16 libvirtd[13637]: internal error: libxenlight failed to restore domain 'v15'
    Mar 30 16:06:15 h19 pacemaker-schedulerd[7350]: warning: Unexpected result (error: v15: live migration to h16 failed: 1) was recorded for migrate_to of prm_xen_v15 on h18 at Mar 30 16:06:13 2022
    Mar 30 16:13:37 h19 pacemaker-schedulerd[7350]: warning: Unexpected result (OCF_TIMEOUT) was recorded for stop of prm_libvirtd:0 on h18 at Mar 30 16:13:36 2022
    Mar 30 16:13:37 h19 pacemaker-schedulerd[7350]: warning: Unexpected result (OCF_TIMEOUT) was recorded for stop of prm_libvirtd:0 on h18 at Mar 30 16:13:36 2022
    Mar 30 16:13:37 h19 pacemaker-schedulerd[7350]: warning: Cluster node h18 will be fenced: prm_libvirtd:0 failed there
    Mar 30 16:19:00 h19 pacemaker-schedulerd[7350]: warning: Unexpected result (error: v15: live migration to h18 failed: 1) was recorded for migrate_to of prm_xen_v15 on h16 at Mar 29 23:58:40 2022
    Mar 30 16:19:00 h19 pacemaker-schedulerd[7350]: error: Resource prm_xen_v15 is active on 2 nodes (attempting recovery)
    Mar 30 16:19:00 h19 pacemaker-schedulerd[7350]: notice: * Restart prm_xen_v15 ( h18 )
    Mar 30 16:19:04 h18 VirtualDomain(prm_xen_v15)[8768]: INFO: Virtual domain v15 currently has no state, retrying.
    Mar 30 16:19:05 h18 VirtualDomain(prm_xen_v15)[8787]: INFO: Virtual domain v15 currently has no state, retrying.
    Mar 30 16:19:07 h18 VirtualDomain(prm_xen_v15)[8822]: ERROR: Virtual domain v15 has no state during stop operation, bailing out.
    Mar 30 16:19:07 h18 VirtualDomain(prm_xen_v15)[8836]: INFO: Issuing forced shutdown (destroy) request for domain v15.
    Mar 30 16:19:07 h18 VirtualDomain(prm_xen_v15)[8860]: ERROR: forced stop failed
    Mar 30 16:19:07 h19 pacemaker-controld[7351]: notice: Transition 124 action 115 (prm_xen_v15_stop_0 on h18): expected 'ok' but got 'error'

Note: Our cluster nodes start pacemaker during boot. Yesterday I was there when the problem happened.
But as we had another boot loop some time ago, I wrote a systemd service that counts boots; if too many happen within a short time, pacemaker will be disabled on that node. As it is set now, the counter is reset if the node stays up for at least 15 minutes; if it fails to do so more than 4 times, pacemaker will be disabled. If someone wants to try that or give feedback, drop me a line, so I could provide the RPM (boot-loop-handler-0.0.5-0.0.noarch)...

Regards,
Ulrich
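For reference, a minimal sketch of such a boot-loop guard (all names and paths here are made up; Ulrich's actual boot-loop-handler RPM may work differently):

    #!/bin/sh
    # boot-loop-guard: run once per boot from a oneshot systemd unit.
    # Counts boots in a state file and disables pacemaker after too many
    # boots without a reset; the counter is reset once the node has
    # stayed up for 15 minutes.
    STATE=/var/lib/boot-loop-guard/count
    MAX_BOOTS=4

    mkdir -p "$(dirname "$STATE")"
    count=$(( $(cat "$STATE" 2>/dev/null || echo 0) + 1 ))
    echo "$count" > "$STATE"

    if [ "$count" -gt "$MAX_BOOTS" ]; then
        logger -t boot-loop-guard "boot loop suspected ($count boots), disabling pacemaker"
        systemctl disable --now pacemaker.service
    else
        # Reset the counter if the node stays up for 15 minutes:
        ( sleep 900 && echo 0 > "$STATE" ) &
    fi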
Re: [ClusterLabs] More pacemaker oddities while stopping DC
On 2022/5/25 8:10, Ulrich Windl wrote:

Hi! We are still suffering from kernel RAM corruption on the Xen hypervisor when a VM or the hypervisor is doing I/O (three months since the bug report at SUSE, but no fix or workaround, meaning the whole Xen cluster project was canceled after 20 years, but that's a different topic). All VMs will be migrated to VMware, dumping the whole SLES15 Xen cluster very soon.

My script that detected RAM corruption tried to shut down pacemaker, hoping for the best (i.e. VMs to be live-migrated away). However, there are very strange decisions made (pacemaker-2.0.5+20201202.ba59be712-150300.4.21.1.x86_64):

    May 24 17:05:07 h16 VirtualDomain(prm_xen_test-jeos7)[24460]: INFO: test-jeos7: live migration to h19 succeeded.
    May 24 17:05:07 h16 VirtualDomain(prm_xen_test-jeos9)[24463]: INFO: test-jeos9: live migration to h19 succeeded.
    May 24 17:05:07 h16 pacemaker-execd[7504]: notice: prm_xen_test-jeos7 migrate_to (call 321, PID 24281) exited with status 0 (execution time 5500ms, queue time 0ms)
    May 24 17:05:07 h16 pacemaker-controld[7509]: notice: Result of migrate_to operation for prm_xen_test-jeos7 on h16: ok
    May 24 17:05:07 h16 pacemaker-execd[7504]: notice: prm_xen_test-jeos9 migrate_to (call 323, PID 24283) exited with status 0 (execution time 5514ms, queue time 0ms)
    May 24 17:05:07 h16 pacemaker-controld[7509]: notice: Result of migrate_to operation for prm_xen_test-jeos9 on h16: ok

Would you agree that the migration was successful? I'd say YES!

Maybe practically yes, given what migrate_to achieved with the VirtualDomain RA, but technically no from Pacemaker's point of view. Following the migrate_to on the source node, a migrate_from operation on the target node and a stop operation on the source node are still needed to eventually make a successful live migration.
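To spell out what Yan describes, a sketch of the operation sequence Pacemaker schedules for a full live migration (operation keys in Pacemaker's usual naming; resource and node names follow this thread):

    # Pacemaker decomposes a live migration into three ordered operations:
    #   prm_xen_test-jeos7_migrate_to_0    on h16   # push the domain to h19
    #   prm_xen_test-jeos7_migrate_from_0  on h19   # finish/verify on the target
    #   prm_xen_test-jeos7_stop_0          on h16   # clean up the source
    # Only when all three succeed does Pacemaker record the migration as
    # successful; aborting in between leaves a "partial" migration, which
    # is recovered by a full stop on both nodes plus a fresh start.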
However, this is what happened:

    May 24 17:05:19 h16 pacemaker-controld[7509]: notice: Transition 2460 (Complete=16, Pending=0, Fired=0, Skipped=7, Incomplete=57, Source=/var/lib/pacemaker/pengine/pe-input-89.bz2): Stopped
    May 24 17:05:19 h16 pacemaker-schedulerd[7508]: warning: Unexpected result (error) was recorded for stop of prm_ping_gw1:1 on h16 at May 24 17:05:02 2022
    May 24 17:05:19 h16 pacemaker-schedulerd[7508]: warning: Unexpected result (error) was recorded for stop of prm_ping_gw1:1 on h16 at May 24 17:05:02 2022
    May 24 17:05:19 h16 pacemaker-schedulerd[7508]: warning: Cluster node h16 will be fenced: prm_ping_gw1:1 failed there
    May 24 17:05:19 h16 pacemaker-schedulerd[7508]: warning: Unexpected result (error) was recorded for stop of prm_iotw-md10:1 on h16 at May 24 17:05:02 2022
    May 24 17:05:19 h16 pacemaker-schedulerd[7508]: warning: Unexpected result (error) was recorded for stop of prm_iotw-md10:1 on h16 at May 24 17:05:02 2022
    May 24 17:05:19 h16 pacemaker-schedulerd[7508]: warning: Forcing cln_ping_gw1 away from h16 after 100 failures (max=100)
    May 24 17:05:19 h16 pacemaker-schedulerd[7508]: warning: Forcing cln_ping_gw1 away from h16 after 100 failures (max=100)
    May 24 17:05:19 h16 pacemaker-schedulerd[7508]: warning: Forcing cln_ping_gw1 away from h16 after 100 failures (max=100)
    May 24 17:05:19 h16 pacemaker-schedulerd[7508]: warning: Forcing cln_iotw-md10 away from h16 after 100 failures (max=100)
    May 24 17:05:19 h16 pacemaker-schedulerd[7508]: warning: Forcing cln_iotw-md10 away from h16 after 100 failures (max=100)
    May 24 17:05:19 h16 pacemaker-schedulerd[7508]: warning: Forcing cln_iotw-md10 away from h16 after 100 failures (max=100)
    May 24 17:05:19 h16 pacemaker-schedulerd[7508]: notice: Resource prm_xen_test-jeos7 can no longer migrate from h16 to h19 (will stop on both nodes)
    May 24 17:05:19 h16 pacemaker-schedulerd[7508]: notice: Resource prm_xen_test-jeos9 can no longer migrate from h16 to h19 (will stop on both nodes)
    May 24 17:05:19 h16 pacemaker-schedulerd[7508]: warning: Scheduling Node h16 for STONITH

So the DC considers the migration to have failed, even though it was reported as success!

A so-called partial live migration could no longer continue here.

Regards,
Yan

(The ping had dumped core due to RAM corruption before:)

    May 24 17:03:12 h16 kernel: ping[23973]: segfault at 213e6 ip 000213e6 sp 7ffc249fab78 error 14 in bash[5655262bc000+f1000]

So it stopped the VMs that were migrated successfully before:

    May 24 17:05:19 h16 pacemaker-controld[7509]: notice: Initiating stop operation prm_xen_test-jeos7_stop_0 on h19
    May 24 17:05:19 h16 pacemaker-controld[7509]: notice: Initiating stop operation prm_xen_test-jeos9_stop_0 on h19
    May 24 17:05:19 h16 pacemaker-controld[7509]: notice: Requesting fencing (reboot) of node h16

Those test VMs were not important, but the important part is that due to the failure to stop the ping resource, it did not even try to migrate the other (non-test) VMs away, so those were hard-fenced. For completeness I should add
Re: [ClusterLabs] Antw: Instable SLES15 SP3 kernel
Hi Ulrich,

On 2022/4/27 11:13, Ulrich Windl wrote:

Update for the update: I had installed SLES updates in one VM and rebooted it via the cluster. While installing the updates in the VM, the Xen host got RAM corruption (it seems any disk I/O on the host, either locally or via a VM image, causes RAM corruption):

I totally understand your frustration with this, but I don't really see how the potential kernel issue is relevant to this mailing list. I believe SUSE support has been working on it and trying to address it, and they will update you once there's further progress. About the topics related to the cluster, please find my comments below.

    Apr 27 10:56:44 h19 kernel: pacemaker-execd[39797]: segfault at 3a46 ip 3a46 sp 7ffd1c92e8e8 error 14 in pacemaker-execd[5565921cc000+b000]

Fortunately that wasn't fatal, and my rescue script kicked in before things got really bad:

    Apr 27 11:00:01 h19 reboot-before-panic[40630]: RAM corruption detected, starting pro-active reboot

All VMs could be live-migrated away before the reboot, but this SLES release is completely unusable! Regards, Ulrich

Ulrich Windl wrote on 27.04.2022 at 08:02 in message <6268DC91.C1D : 161 : 60728>:

Hi! I want to give a non-update on the issue: the kernel still segfaults random processes, and in two months there has been really nothing from support that could improve the situation. The cluster is logging all kinds of non-funny messages like these:

    Apr 27 02:20:49 h18 systemd-coredump[22319]: [] Process 22317 (controld) of user 0 dumped core.
    Apr 27 02:20:49 h18 kernel: BUG: Bad rss-counter state mm:246ea08b idx:1 val:3
    Apr 27 02:20:49 h18 kernel: BUG: Bad rss-counter state mm:259b58a0 idx:1 val:7
    Apr 27 02:20:49 h18 controld(prm_DLM)[22330]: ERROR: Uncontrolled lockspace exists, system must reboot. Executing suicide fencing

For a hypervisor host this means that many VMs are reset the hard way! Other resources weren't stopped properly either, of course. There were also two NULL-pointer outputs in the messages on the DC:

    Apr 27 02:21:06 h16 dlm_stonith[39797]: stonith_api_time: Found 18 entries for 118/(null): 0 in progress, 17 completed
    Apr 27 02:21:06 h16 dlm_stonith[39797]: stonith_api_time: Node 118/(null) last kicked at: 1650418762

I guess that NULL pointer should have been the host name (h18) in reality.

It's expected to be NULL here. DLM requests fencing through Pacemaker's stonith API targeting a node by its corosync nodeid (118 here), which is what it knows, rather than by the node name. Pacemaker does the interpretation and eventually issues the fencing.

Also it seems h18 fenced itself, and the DC h16, seeing that, wants to fence again (to make sure, maybe), but there is some odd problem:

    Apr 27 02:21:07 h16 pacemaker-controld[7453]: notice: Requesting fencing (reboot) of node h18
    Apr 27 02:21:07 h16 pacemaker-fenced[7443]: notice: Client pacemaker-controld.7453.a9d67c8b wants to fence (reboot) 'h18' with device '(any)'
    Apr 27 02:21:07 h16 pacemaker-fenced[7443]: notice: Merging stonith action 'reboot' targeting h18 originating from client pacemaker-controld.7453.73d8bbd6 with identical request from stonith-api.39797@h16.ea22f429 (360>

This is also as expected when DLM is used. Despite the fencing previously proactively requested by DLM, Pacemaker also has its own reason to issue a fencing targeting the node. The fenced daemon is aware that there is already a pending/ongoing fencing targeting the same node, so it doesn't really need to issue it once again.
    Apr 27 02:22:52 h16 pacemaker-fenced[7443]: warning: fence_legacy_reboot_1 process (PID 39749) timed out
    Apr 27 02:22:52 h16 pacemaker-fenced[7443]: warning: fence_legacy_reboot_1[39749] timed out after 12ms
    Apr 27 02:22:52 h16 pacemaker-fenced[7443]: error: Operation 'reboot' [39749] (call 2 from stonith_admin.controld.22336) for host 'h18' with device 'prm_stonith_sbd' returned: -62 (Timer expired)

Please make sure: stonith-timeout > sbd_msgwait + pcmk_delay_max. If that was already the case, probably sbd was encountering certain difficulties writing the poison pill at that time...

Regards,
Yan

I never saw such a message before. Eventually:

    Apr 27 02:24:53 h16 pacemaker-controld[7453]: notice: Stonith operation 31/1:3347:0:48bafcab-fecf-4ea0-84a8-c31ab1694b3a: OK (0)
    Apr 27 02:24:53 h16 pacemaker-controld[7453]: notice: Peer h18 was terminated (reboot) by h16 on behalf of pacemaker-controld.7453: OK

The only thing I found out was that the kernel running without Xen does not show RAM corruption.

Regards,
Ulrich
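As an aside, the relation Yan recommends above can be checked from the shell (a hedged sketch; the sbd device path is a placeholder, the fencing resource name follows this thread, and "cib-bootstrap-options" assumes the default property-set id):

    # Cluster-wide fencing timeout:
    crm configure show cib-bootstrap-options | grep stonith-timeout
    # On-disk sbd timeouts, including msgwait:
    sbd -d /dev/disk/by-id/SBD_DEVICE dump
    # Static delay settings on the fencing resource, if any:
    crm configure show prm_stonith_sbd | grep pcmk_delay
    # Required: stonith-timeout > msgwait + pcmk_delay_max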
Re: [ClusterLabs] Antw: Re: Antw: [EXT] Re: Failed migration causing fencing loop
On 2022/4/4 8:58, Ulrich Windl wrote:

Andrei Borzenkov wrote on 04.04.2022 at 06:39:

On 31.03.2022 14:02, Ulrich Windl wrote:

"Gao,Yan" wrote on 31.03.2022 at 11:18 in message <67785c2f-f875-cb16-608b-77d63d9b0...@suse.com>:

On 2022/3/31 9:03, Ulrich Windl wrote:

Hi! I just wanted to point out one thing that hit us with SLES15 SP3: a failed live VM migration caused node fencing and resulted in a fencing loop, for two reasons:

1) Pacemaker thinks that even _after_ fencing there is some migration to "clean up". Pacemaker treats the situation as if the VM is running on both nodes, thus (50% chance?) trying to stop the VM on the node that just booted after fencing. That's stupid, but it shouldn't be fatal IF there weren't...

2) The stop operation of the VM (which actually isn't running) fails. AFAICT it could not connect to the hypervisor, but the logic in the RA is kind of arguable: the probe (monitor) of the VM returned "not running", yet the stop right after that returned failure...

OTOH, the point about Pacemaker is that the stop of the resource on the fenced and rejoined node is not really necessary. There have been discussions about this here, and we are trying to figure out a solution for it:
https://github.com/ClusterLabs/pacemaker/pull/2146#discussion_r828204919

For now it requires the administrator's intervention if the situation happens:
1) Fix the access to the hypervisor before the fenced node rejoins.

Thanks for the explanation! Unfortunately this can be tricky if libvirtd is involved (as it is here): libvirtd uses locking (virtlockd), which in turn needs a cluster-wide filesystem for locks across the nodes. When that filesystem is provided by the cluster, it's hard to delay node joining until the filesystem, virtlockd, and libvirtd are running.

So do not use a filesystem provided by the same cluster. Use a separate filesystem mounted outside of the cluster, like a separate highly available NFS.

Hi! Having a second cluster just to provide VM locking seems big overkill. Actually I absolutely regret that I ever followed the advice to use libvirt and VirtualDomain, as it seems to have no real benefit for Xen and PVMs. As a matter of fact, after more than 10 years of using Xen PVMs in a cluster, we will move to VMware, as SLES15 SP3 is the most unstable SLES I have ever seen (I started with SLES 8). SUSE support seems unable to either fix the memory corruption or provide a kernel that does not have it (it seems SP2 did not have it).

Sounds like there's a certain kernel issue related to Xen? Probably ask SUSE support to raise the priority of the ticket?

Regards,
Yan

Regards,
Ulrich
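For illustration, one way to realize Andrei's suggestion (a hedged sketch; server name and export path are placeholders, and /var/lib/libvirt/lockd assumes virtlockd's default lockspace directory):

    # Mount virtlockd's lock directory from an NFS server that is NOT
    # managed by this cluster, so locking is available before the node joins:
    mount -t nfs -o _netdev nfs-server:/export/virtlockd /var/lib/libvirt/lockd
    # or persistently via /etc/fstab:
    #   nfs-server:/export/virtlockd  /var/lib/libvirt/lockd  nfs  defaults,_netdev  0 0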
Re: [ClusterLabs] Failed migration causing fencing loop
On 2022/3/31 9:03, Ulrich Windl wrote:

Hi! I just wanted to point out one thing that hit us with SLES15 SP3: a failed live VM migration caused node fencing and resulted in a fencing loop, for two reasons:

1) Pacemaker thinks that even _after_ fencing there is some migration to "clean up". Pacemaker treats the situation as if the VM is running on both nodes, thus (50% chance?) trying to stop the VM on the node that just booted after fencing. That's stupid, but it shouldn't be fatal IF there weren't...

2) The stop operation of the VM (which actually isn't running) fails. AFAICT it could not connect to the hypervisor, but the logic in the RA is kind of arguable: the probe (monitor) of the VM returned "not running", yet the stop right after that returned failure...

OTOH, the point about Pacemaker is that the stop of the resource on the fenced and rejoined node is not really necessary. There have been discussions about this here, and we are trying to figure out a solution for it:
https://github.com/ClusterLabs/pacemaker/pull/2146#discussion_r828204919

For now it requires the administrator's intervention if the situation happens:
1) Fix the access to the hypervisor before the fenced node rejoins.
2) Manually clean up the resource, which tells Pacemaker it can safely forget the historical migrate_to failure.

Regards,
Yan

...causing a node fence. So the loop is complete. Some details (many unrelated messages left out):

    Mar 30 16:06:14 h16 libvirtd[13637]: internal error: libxenlight failed to restore domain 'v15'
    Mar 30 16:06:15 h19 pacemaker-schedulerd[7350]: warning: Unexpected result (error: v15: live migration to h16 failed: 1) was recorded for migrate_to of prm_xen_v15 on h18 at Mar 30 16:06:13 2022
    Mar 30 16:13:37 h19 pacemaker-schedulerd[7350]: warning: Unexpected result (OCF_TIMEOUT) was recorded for stop of prm_libvirtd:0 on h18 at Mar 30 16:13:36 2022
    Mar 30 16:13:37 h19 pacemaker-schedulerd[7350]: warning: Unexpected result (OCF_TIMEOUT) was recorded for stop of prm_libvirtd:0 on h18 at Mar 30 16:13:36 2022
    Mar 30 16:13:37 h19 pacemaker-schedulerd[7350]: warning: Cluster node h18 will be fenced: prm_libvirtd:0 failed there
    Mar 30 16:19:00 h19 pacemaker-schedulerd[7350]: warning: Unexpected result (error: v15: live migration to h18 failed: 1) was recorded for migrate_to of prm_xen_v15 on h16 at Mar 29 23:58:40 2022
    Mar 30 16:19:00 h19 pacemaker-schedulerd[7350]: error: Resource prm_xen_v15 is active on 2 nodes (attempting recovery)
    Mar 30 16:19:00 h19 pacemaker-schedulerd[7350]: notice: * Restart prm_xen_v15 ( h18 )
    Mar 30 16:19:04 h18 VirtualDomain(prm_xen_v15)[8768]: INFO: Virtual domain v15 currently has no state, retrying.
    Mar 30 16:19:05 h18 VirtualDomain(prm_xen_v15)[8787]: INFO: Virtual domain v15 currently has no state, retrying.
    Mar 30 16:19:07 h18 VirtualDomain(prm_xen_v15)[8822]: ERROR: Virtual domain v15 has no state during stop operation, bailing out.
    Mar 30 16:19:07 h18 VirtualDomain(prm_xen_v15)[8836]: INFO: Issuing forced shutdown (destroy) request for domain v15.
    Mar 30 16:19:07 h18 VirtualDomain(prm_xen_v15)[8860]: ERROR: forced stop failed
    Mar 30 16:19:07 h19 pacemaker-controld[7351]: notice: Transition 124 action 115 (prm_xen_v15_stop_0 on h18): expected 'ok' but got 'error'

Note: Our cluster nodes start pacemaker during boot. Yesterday I was there when the problem happened. But as we had another boot loop some time ago, I wrote a systemd service that counts boots; if too many happen within a short time, pacemaker will be disabled on that node.
As it is set now, the counter is reset if the node stays up for at least 15 minutes; if it fails to do so more than 4 times, pacemaker will be disabled. If someone wants to try that or give feedback, drop me a line, so I could provide the RPM (boot-loop-handler-0.0.5-0.0.noarch)...

Regards,
Ulrich
Re: [ClusterLabs] weird xml snippet in "crm configure show"
Hi,

On 2021/2/12 11:05, Lentes, Bernd wrote:

Hi, I have problems with a configured alert which does not alert anymore. I played around with it a bit and changed the configuration several times with cibadmin. Sometimes I had trouble with the admin_epoch, sometimes with the schema. When I now invoke "crm configure show", at the end I see:

    ...
    rsc_defaults rsc-options: \
        resource-stickiness=200
    xml \ ... \   [the raw XML of the alert section follows here; its markup was lost in the archive]

It seems that crmsh has difficulty parsing the "random" ids of the attribute sets here. I guess using `crm configure edit` to change that part to something like:

    alert smtp_alert "/root/skripte/alert_smtp.sh" \
        attributes email_sender="bernd.len...@helmholtz-muenchen.de" \
        to "informatic@helmholtz-muenchen.de" meta timestamp-format="%D %H:%M"

will do.

Regards,
Yan

Is that normal?
Bernd
Re: [ClusterLabs] Q: List resources affected by utilization limits
On 1/13/21 9:14 AM, Ulrich Windl wrote:

Hi! I ran a test: I configured RAM requirements for some test VMs together with node RAM capacities. Things were running fine. Then, as a test, I reduced the RAM capacity of all nodes, and test VMs were stopped due to not enough RAM. Now I wonder: is there a command that can list those resources that couldn't start because of "not enough node capacity"? Preferably combined with the utilization attribute that could not be fulfilled?

crm_simulate -LU should give some hints.

Regards,
Yan

Regards,
Ulrich
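For context, a hedged sketch of the kind of configuration being tested and the suggested command (crmsh syntax; node, resource names, and numbers are made up):

    # Declare node capacity and per-VM RAM requirements:
    crm node utilization h16 set memory 65536
    crm resource utilization prm_xen_vm1 set memory 8192
    # Make the scheduler honor utilization:
    crm configure property placement-strategy=balanced
    # Inspect the live cluster, including utilization and what cannot run
    # (-L: use the live CIB, -U: show utilization):
    crm_simulate -LU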
Re: [ClusterLabs] "crm verify": ".. stonith-watchdog-timeout is nonzero"
On 11/26/20 8:31 AM, Ulrich Windl wrote:

Hi! Using SBD, I got this message from crm's top-level "verify":

    crm(live/h16)# verify
    Current cluster status:
    Online: [ h16 h18 h19 ]
    prm_stonith_sbd (stonith:external/sbd): Started h18
    (unpack_config) notice: Watchdog will be used via SBD if fencing is required and stonith-watchdog-timeout is nonzero

The message simply tells us "what if stonith-watchdog-timeout is nonzero". The message is indeed confusing when it appears constantly. It has been dropped to info level, and the relevant documentation has been improved, as of:
https://github.com/ClusterLabs/pacemaker/pull/2142

Regards,
Yan

Interestingly this message does not change even after this:

    crm(live/h16)configure# property stonith-watchdog-timeout=0
    crm(live/h16)configure# verify
    crm(live/h16)configure# commit

So what's going on? Most notably, what's the difference between configure's verify and the top-level verify?

Regards,
Ulrich
Re: [ClusterLabs] Antw: [EXT] Re: Coming in Pacemaker 2.0.4: fencing delay based on what resources are where
On 2020/3/23 14:04, Gao,Yan wrote:

On 2020/3/23 8:00, Ulrich Windl wrote:

Andrei Borzenkov wrote on 21.03.2020 at 18:22 in message <14318_1584811393_5E764D80_14318_174_1_6ab730d7-8cf0-2c7d-7ae5-8d0ea8402758@gmai.com>:

21.03.2020 20:07, Ken Gaillot writes:

Hi all, I am happy to announce a feature that was discussed on this list a while back. It will be in Pacemaker 2.0.4 (the first release candidate is expected in about three weeks).

A longstanding concern in two-node clusters is that in a split brain, one side must get a fencing delay to avoid simultaneous fencing of both nodes, but there is no perfect way to determine which node gets the delay. The most common approach is to configure a static delay on one node. This is particularly useful in an active/passive setup where one particular node is normally assigned the active role.

Another approach is to use the relatively new fence_heuristics_ping agent in a topology with your real fencing agent. A node that can ping a configured IP will be more likely to survive.

In addition, we now have a new cluster-wide property, priority-fencing-delay, that bases the delay on what resources were known to be active where just before the split. If you set the new property, and configure priorities for your resources, the node with the highest combined priority of all resources running on it will be more likely to survive.

As an example, if you set a default priority of 1 for all resources, and set priority-fencing-delay to 15s, then the node running the most resources will be more likely to survive because the other node will wait 15 seconds before initiating fencing. If a particular resource is more important than the rest, you can give it a higher priority.

That sounds good except for one consideration: "priority" also affects resource placement, and changing it may have rather unexpected results, especially in cases where scores are carefully selected to achieve resource distribution.

I've always seen priorities as "super ordering" constraints: try to run the important resources first (whatever their dependencies or scores are).

The fact about priority is, during the calculation, which resources the scheduler should "consider" first, so that in cases where there are conflicting colocation/anti-colocation constraints,

I mean conflicting situations in regard to colocation/anti-colocation constraints.

Regards,
Yan

or lack of utilization capacity, the resources with higher priority will get "decided" first.

So does it affect the order of the resources listed in the output of crm_mon?

Yes. But it doesn't reflect the order in which cluster transitions actually start the resources. That's what ordering constraints are for.

Regards,
Yan

[...]

Regards,
Ulrich
Re: [ClusterLabs] Antw: [EXT] Re: Coming in Pacemaker 2.0.4: fencing delay based on what resources are where
On 2020/3/23 8:00, Ulrich Windl wrote:

Andrei Borzenkov wrote on 21.03.2020 at 18:22 in message <14318_1584811393_5E764D80_14318_174_1_6ab730d7-8cf0-2c7d-7ae5-8d0ea8402758@gmai.com>:

21.03.2020 20:07, Ken Gaillot writes:

Hi all, I am happy to announce a feature that was discussed on this list a while back. It will be in Pacemaker 2.0.4 (the first release candidate is expected in about three weeks).

A longstanding concern in two-node clusters is that in a split brain, one side must get a fencing delay to avoid simultaneous fencing of both nodes, but there is no perfect way to determine which node gets the delay. The most common approach is to configure a static delay on one node. This is particularly useful in an active/passive setup where one particular node is normally assigned the active role.

Another approach is to use the relatively new fence_heuristics_ping agent in a topology with your real fencing agent. A node that can ping a configured IP will be more likely to survive.

In addition, we now have a new cluster-wide property, priority-fencing-delay, that bases the delay on what resources were known to be active where just before the split. If you set the new property, and configure priorities for your resources, the node with the highest combined priority of all resources running on it will be more likely to survive.

As an example, if you set a default priority of 1 for all resources, and set priority-fencing-delay to 15s, then the node running the most resources will be more likely to survive because the other node will wait 15 seconds before initiating fencing. If a particular resource is more important than the rest, you can give it a higher priority.

That sounds good except for one consideration: "priority" also affects resource placement, and changing it may have rather unexpected results, especially in cases where scores are carefully selected to achieve resource distribution.

I've always seen priorities as "super ordering" constraints: try to run the important resources first (whatever their dependencies or scores are).

The fact about priority is, during the calculation, which resources the scheduler should "consider" first, so that in cases where there are conflicting colocation/anti-colocation constraints or lack of utilization capacity, the resources with higher priority will get "decided" first.

So does it affect the order of the resources listed in the output of crm_mon?

Yes. But it doesn't reflect the order in which cluster transitions actually start the resources. That's what ordering constraints are for.

Regards,
Yan

[...]

Regards,
Ulrich
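A hedged crmsh sketch of the example Ken gives (the resource name in the last command is a placeholder):

    # Default priority of 1 for all resources:
    crm configure rsc_defaults priority=1
    # The node with the highest combined priority gets the survival advantage:
    crm configure property priority-fencing-delay=15s
    # Give a particularly important resource more weight:
    crm resource meta prm_important_vm set priority 10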
Re: [ClusterLabs] Antw: [EXT] Coming in Pacemaker 2.0.4: fencing delay based on what resources are where
On 2020/3/23 7:57, Ulrich Windl wrote:

Ken Gaillot wrote on 21.03.2020 at 18:07 in message <15250_1584810570_5E764A4A_15250_638_1_c8c5a180d8ad9327dfd1e743d4352556f3fd.a...@redhat.com>:

Hi all, I am happy to announce a feature that was discussed on this list a while back. It will be in Pacemaker 2.0.4 (the first release candidate is expected in about three weeks).

A longstanding concern in two-node clusters is that in a split brain, one side must get a fencing delay to avoid simultaneous fencing of both nodes, but there is no perfect way to determine which node gets the delay. The most common approach is to configure a static delay on one node. This is particularly useful in an active/passive setup where one particular node is normally assigned the active role.

Actually with sbd there could be a simpler approach: allocate a pseudo-node named "DC" or "locker" and then use a SCSI lock mechanism to update that slot atomically. Only the node that "has the lock" may issue fence commands. Once the fencing is confirmed, the locker slot is released (wiped)...

It doesn't sound as simple as directly introducing a delay. What if the lock holder itself somehow runs into an issue or dies after the fencing is issued but before it's confirmed? The other node would then have to somehow gain the lock after, well, a "delay" anyway?

Another approach is to use the relatively new fence_heuristics_ping agent in a topology with your real fencing agent. A node that can ping a configured IP will be more likely to survive.

In addition, we now have a new cluster-wide property, priority-fencing-delay, that bases the delay on what resources were known to be active where just before the split. If you set the new property, and configure priorities for your resources, the node with the highest combined priority of all resources running on it will be more likely to survive.

Or combined with a ping-like mechanism: each node periodically sends an "I'm alive" message that updates the node's timestamp in the CIB status. The node that was alive last will survive. If it doesn't react within the fencing timeout, the second-newest (in case of two nodes: the other) node may fence and try to form a cluster.

Why would such an outdated node state matter more than what corosync tells us? And the point here is not to pick "a" node. The point is to pick the more "significant" node, the one potentially hosting the more significant resources/instances, to help it win the inevitable fencing match in case of split-brain.

Regards,
Yan

As an example, if you set a default priority of 1 for all resources, and set priority-fencing-delay to 15s, then the node running the most resources will be more likely to survive because the other node will wait 15 seconds before initiating fencing. If a particular resource is more important than the rest, you can give it a higher priority. The master role of promotable clones will get an extra 1 point if a priority has been configured for that clone. If both nodes have equal priority, or fencing is needed for some reason other than node loss (e.g. on-fail=fencing for some monitor), then the usual delay properties apply (pcmk_delay_base, etc.).
I'd like to recognize the primary authors of the 2.0.4 features announced so far:

- shutdown locks: myself
- switch to clock_gettime() for monotonic clock: Jan Pokorný
- crm_mon --include/--exclude: Chris Lumens
- priority-fencing-delay: Gao,Yan

--
Ken Gaillot
Re: [ClusterLabs] Coming in Pacemaker 2.0.4: fencing delay based on what resources are where
On 2020/3/21 18:22, Andrei Borzenkov wrote:

21.03.2020 20:07, Ken Gaillot writes:

Hi all, I am happy to announce a feature that was discussed on this list a while back. It will be in Pacemaker 2.0.4 (the first release candidate is expected in about three weeks).

A longstanding concern in two-node clusters is that in a split brain, one side must get a fencing delay to avoid simultaneous fencing of both nodes, but there is no perfect way to determine which node gets the delay. The most common approach is to configure a static delay on one node. This is particularly useful in an active/passive setup where one particular node is normally assigned the active role.

Another approach is to use the relatively new fence_heuristics_ping agent in a topology with your real fencing agent. A node that can ping a configured IP will be more likely to survive.

In addition, we now have a new cluster-wide property, priority-fencing-delay, that bases the delay on what resources were known to be active where just before the split. If you set the new property, and configure priorities for your resources, the node with the highest combined priority of all resources running on it will be more likely to survive.

As an example, if you set a default priority of 1 for all resources, and set priority-fencing-delay to 15s, then the node running the most resources will be more likely to survive because the other node will wait 15 seconds before initiating fencing. If a particular resource is more important than the rest, you can give it a higher priority.

That sounds good except for one consideration: "priority" also affects resource placement, and changing it may have rather unexpected results, especially in cases where scores are carefully selected to achieve resource distribution.

Despite the fact that resource locations and placement-strategy are more intended for that purpose, it's true that resource priority can now imply more things, which users might want to think through for either new deployments or existing ones.

Thanks for the thorough introduction, Ken.

Regards,
Yan

The master role of promotable clones will get an extra 1 point if a priority has been configured for that clone. If both nodes have equal priority, or fencing is needed for some reason other than node loss (e.g. on-fail=fencing for some monitor), then the usual delay properties apply (pcmk_delay_base, etc.).

I'd like to recognize the primary authors of the 2.0.4 features announced so far:

- shutdown locks: myself
- switch to clock_gettime() for monotonic clock: Jan Pokorný
- crm_mon --include/--exclude: Chris Lumens
- priority-fencing-delay: Gao,Yan
Re: [ClusterLabs] SuSE12SP3 HAE SBD Communication Issue
On 2/11/19 9:49 AM, Fulong Wang wrote:

Thanks Yan, you gave me more valuable hints on the SBD operation! Now I can see the verbose output after the service restart.

Be aware that pacemaker integration (-P) is enabled by default, which means that despite the sbd failure, if the node itself is clean and "healthy" from Pacemaker's point of view, and if it's in the cluster partition with quorum, it won't self-fence -- meaning a node just being unable to fence doesn't necessarily need to be fenced. As described in the sbd man page, "this allows sbd to survive temporary outages of the majority of devices. However, while the cluster is in such a degraded state, it can neither successfully fence nor be shut down cleanly (as taking the cluster below the quorum threshold will immediately cause all remaining nodes to self-fence). In short, it will not tolerate any further faults. Please repair the system before continuing."

Yes, I can see that "pacemaker integration" was enabled in my sbd config file by default. So you mean that in some sbd failure cases, if the node is considered "healthy" from Pacemaker's point of view, it still won't self-fence? Honestly speaking, I didn't get you at this point. I have "no-quorum-policy=ignore" set in my setup, and it's a two-node cluster.

Not directly related to the behaviors of sbd: starting from corosync-2, with a properly configured "quorum" service in corosync.conf, no-quorum-policy=ignore in Pacemaker should be avoided, meaning Pacemaker should follow the decisions on quorum made by corosync:
https://www.suse.com/documentation/sle-ha-12/book_sleha/data/sec_ha_config_basics_global.html#sec_ha_config_basics_corosync_2-node

Can you show me a sample situation for this?

For example, if a node loses access to the sbd device but every node is still "clean" online, there's no need to fence anyone at that point. The node will continue functioning in such a degraded state. But of course the administrator needs to fix the sbd issue as soon as possible.

Be aware that a 2-node cluster is a common but special use case. If we lose one node and also lose access to sbd, the single remaining online node will self-fence even if corosync's votequorum service considers it "quorate". This is the safest approach in case of split-brain. This already works correctly with the fix regarding 2-node clusters from Klaus.

Regards,
Yan

Many thanks!!!

Regards
Fulong

From: Gao,Yan
Sent: Thursday, January 3, 2019 20:43
To: Fulong Wang; Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] SuSE12SP3 HAE SBD Communication Issue

On 12/24/18 7:10 AM, Fulong Wang wrote:

Yan, Klaus and everyone,

Merry Christmas!!!

Many thanks for your advice! I added the "-v" param in "SBD_OPTS", but didn't see any apparent change in the system message log; am I looking at a wrong place?

Did you restart all cluster services, for example by "crm cluster stop" and then "crm cluster start"? Basically sbd.service needs to be restarted. Be aware that "systemctl restart pacemaker" only restarts pacemaker.

SBD daemons log to syslog. When an sbd watcher receives a "test" command, a syslog line like this should show up: "servant: Received command test from ...". sbd won't actually do anything about a "test" command other than logging a message.

If you are not yet running a late version of sbd (maintenance update), a single "-v" will already make sbd very verbose. But of course you could use grep.
By the way, we want to test that when the disk access paths (multipath devices) are lost, sbd can fence the node automatically.

Be aware that pacemaker integration (-P) is enabled by default, which means that despite the sbd failure, if the node itself is clean and "healthy" from Pacemaker's point of view, and if it's in the cluster partition with quorum, it won't self-fence -- meaning a node just being unable to fence doesn't necessarily need to be fenced. As described in the sbd man page, "this allows sbd to survive temporary outages of the majority of devices. However, while the cluster is in such a degraded state, it can neither successfully fence nor be shut down cleanly (as taking the cluster below the quorum threshold will immediately cause all remaining nodes to self-fence). In short, it will not tolerate any further faults. Please repair the system before continuing."

Regards,
Yan

What's your recommendation for this scenario?

The "crm node fence" did the work.

Regards
Fulong

---
Re: [ClusterLabs] SuSE12SP3 HAE SBD Communication Issue
On 2/12/19 3:38 AM, Fulong Wang wrote:

Klaus, thanks for the info! Did you mean I should compile sbd from the github source myself to include the fixes you mentioned? The corosync, pacemaker and sbd versions in my setup are as below:

    corosync:  2.3.6-9.13.1
    pacemaker: 1.1.16-6.5.1
    sbd:       1.3.1+20180507

I'm pretty sure this version has the fix regarding 2-node clusters from Klaus.

Regards,
Yan

Regards
Fulong

From: Klaus Wenninger
Sent: Monday, February 11, 2019 18:51
To: Cluster Labs - All topics related to open-source clustering welcomed; Fulong Wang; Gao,Yan
Subject: Re: [ClusterLabs] SuSE12SP3 HAE SBD Communication Issue

On 02/11/2019 09:49 AM, Fulong Wang wrote:

Thanks Yan, you gave me more valuable hints on the SBD operation! Now I can see the verbose output after the service restart.

> Be aware since pacemaker integration (-P) is enabled by default, which
> means despite the sbd failure, if the node itself was clean and
> "healthy" from pacemaker's point of view and if it's in the cluster
> partition with the quorum, it wouldn't self-fence -- meaning a node just
> being unable to fence doesn't necessarily need to be fenced.
> As described in the sbd man page, "this allows sbd to survive temporary
> outages of the majority of devices. However, while the cluster is in
> such a degraded state, it can neither successfully fence nor be shut down
> cleanly (as taking the cluster below the quorum threshold will
> immediately cause all remaining nodes to self-fence). In short, it will
> not tolerate any further faults. Please repair the system before
> continuing."

Yes, I can see the "pacemaker integration" was enabled in my sbd config file by default. So you mean in some sbd failure cases, if the node is considered "healthy" from Pacemaker's point of view, it still won't self-fence. Honestly speaking, I didn't get you at this point. I have "no-quorum-policy=ignore" in my setup and it's a two-node cluster. Can you show me a sample situation for this?

When using sbd with 2-node clusters and pacemaker integration, you might check that
https://github.com/ClusterLabs/sbd/commit/4bd0a66da3ac9c9afaeb8a2468cdd3ed51ad3377
is included in your sbd version. This is relevant when 2-node is configured in corosync.

Regards,
Klaus

Many thanks!!!

Regards
Fulong

From: Gao,Yan <mailto:y...@suse.com>
Sent: Thursday, January 3, 2019 20:43
To: Fulong Wang; Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] SuSE12SP3 HAE SBD Communication Issue

On 12/24/18 7:10 AM, Fulong Wang wrote:
> Yan, Klaus and everyone,
>
> Merry Christmas!!!
>
> Many thanks for your advice!
> I added the "-v" param in "SBD_OPTS", but didn't see any apparent change
> in the system message log, am I looking at a wrong place?

Did you restart all cluster services, for example by "crm cluster stop" and then "crm cluster start"? Basically sbd.service needs to be restarted. Be aware that "systemctl restart pacemaker" only restarts pacemaker.

SBD daemons log to syslog. When an sbd watcher receives a "test" command, a syslog line like this should show up: "servant: Received command test from ...". sbd won't actually do anything about a "test" command other than logging a message.

If you are not yet running a late version of sbd (maintenance update), a single "-v" will already make sbd very verbose. But of course you could use grep.

> By the way, we want to test that when the disk access paths (multipath
> devices) are lost, sbd can fence the node automatically.
Be aware that pacemaker integration (-P) is enabled by default, which means that despite the sbd failure, if the node itself is clean and "healthy" from Pacemaker's point of view, and if it's in the cluster partition with quorum, it won't self-fence -- meaning a node just being unable to fence doesn't necessarily need to be fenced. As described in the sbd man page, "this allows sbd to survive temporary outages of the majority of devices. However, while the cluster is in such a degraded state, it can neither successfully fence nor be shut down cleanly (as taking the cluster below the quorum threshold will immediately cause all remaining nodes to self-fence). In short, it will not tolerate any further faults. Please repair the system before continuing."

Regards,
Yan

> What's your recommendation for this scenario?
>
> The "crm node fence" did the work.
Re: [ClusterLabs] SuSE12SP3 HAE SBD Communication Issue
On 12/22/18 5:27 AM, Andrei Borzenkov wrote:

21.12.2018 12:09, Klaus Wenninger writes:

On 12/21/2018 08:15 AM, Fulong Wang wrote:

Hello experts, I'm new to this mailing list. Please kindly forgive me if this mail has disturbed you! Our company is currently evaluating the usage of SuSE HAE on the x86 platform. When simulating the storage disaster fail-over, I found that the SBD communication functioned normally on SuSE11 SP4 but abnormally on SuSE12 SP3.

I have no experience with SBD on SLES, but I know that handling of the logging verbosity levels has changed recently in the upstream repo. Given that it was done by Yan Gao, IIRC, I'd assume it went into SLES. So changing the verbosity of the sbd daemon might get you back these logs.

Do you mean

    commit 2dbdee29736fcbf0fe1d41c306959b22d05f72b0
    Author: Gao,Yan
    Date:   Mon Apr 30 18:02:04 2018 +0200

        Log: upgrade important messages and downgrade unimportant ones

? This commit actually increased the severity of the message on the target node:

    @@ -1180,7 +1180,7 @@ int servant(const char *diskname, int mode, const void* argp)
         }
         if (s_mbox->cmd > 0) {
    -        cl_log(LOG_INFO,
    +        cl_log(LOG_NOTICE,
                 "Received command %s from %s on disk %s",
                 char2cmd(s_mbox->cmd), s_mbox->from, diskname);

and did not change the severity of the messages on the source node (they are still INFO).

True. I'm not sure any of them should be at notice level if everything works well... The sbd commands that send messages can be supplied with -v as well, of course.

Regards,
Yan

And of course you can use the list command on the other node to verify as well.

Klaus

The SBD device was added during the initialization of the first cluster node. I have requested help from the SuSE guys, but they haven't given me any valuable feedback yet!

Below are some screenshots to explain what I have encountered. [screenshots were not preserved in the archive]

On a SuSE11 SP4 HAE cluster, I ran the sbd test command as below: [screenshot] Then some information showed up in the local system message log on the second node; we can see that the communication is normal. But when I turned to a SuSE12 SP3 HAE cluster and ran the same command as above, I didn't get any response in the system message log. "systemctl status sbd" also doesn't give me any clue on this.

What could be the reason for this abnormal behavior? Are there any problems with my setup? Any suggestions are appreciated! Thanks!

Regards
FuLong
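For reference, the sort of commands being discussed (a hedged sketch; the device path and node name are placeholders):

    # Send a "test" message to another node's slot; a running sbd watcher
    # on node2 should then log "Received command test from ..." there:
    sbd -d /dev/disk/by-id/SBD_DEVICE message node2 test
    # Inspect the slots on the device from either node:
    sbd -d /dev/disk/by-id/SBD_DEVICE list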
Re: [ClusterLabs] SuSE12SP3 HAE SBD Communication Issue
On 12/24/18 7:10 AM, Fulong Wang wrote:

Yan, Klaus and everyone,

Merry Christmas!!!

Many thanks for your advice! I added the "-v" param in "SBD_OPTS", but didn't see any apparent change in the system message log; am I looking at a wrong place?

Did you restart all cluster services, for example by "crm cluster stop" and then "crm cluster start"? Basically sbd.service needs to be restarted. Be aware that "systemctl restart pacemaker" only restarts pacemaker.

SBD daemons log to syslog. When an sbd watcher receives a "test" command, a syslog line like this should show up: "servant: Received command test from ...". sbd won't actually do anything about a "test" command other than logging a message.

If you are not yet running a late version of sbd (maintenance update), a single "-v" will already make sbd very verbose. But of course you could use grep.

By the way, we want to test that when the disk access paths (multipath devices) are lost, sbd can fence the node automatically.

Be aware that pacemaker integration (-P) is enabled by default, which means that despite the sbd failure, if the node itself is clean and "healthy" from Pacemaker's point of view, and if it's in the cluster partition with quorum, it won't self-fence -- meaning a node just being unable to fence doesn't necessarily need to be fenced. As described in the sbd man page, "this allows sbd to survive temporary outages of the majority of devices. However, while the cluster is in such a degraded state, it can neither successfully fence nor be shut down cleanly (as taking the cluster below the quorum threshold will immediately cause all remaining nodes to self-fence). In short, it will not tolerate any further faults. Please repair the system before continuing."

Regards,
Yan

What's your recommendation for this scenario?

The "crm node fence" did the work.

Regards
Fulong

From: Gao,Yan
Sent: Friday, December 21, 2018 20:43
To: kwenn...@redhat.com; Cluster Labs - All topics related to open-source clustering welcomed; Fulong Wang
Subject: Re: [ClusterLabs] SuSE12SP3 HAE SBD Communication Issue

First, thanks for your reply, Klaus!

On 2018/12/21 10:09, Klaus Wenninger wrote:

On 12/21/2018 08:15 AM, Fulong Wang wrote:

Hello experts, I'm new to this mailing list. Please kindly forgive me if this mail has disturbed you! Our company is currently evaluating the usage of SuSE HAE on the x86 platform. When simulating the storage disaster fail-over, I found that the SBD communication functioned normally on SuSE11 SP4 but abnormally on SuSE12 SP3.

I have no experience with SBD on SLES, but I know that handling of the logging verbosity levels has changed recently in the upstream repo. Given that it was done by Yan Gao, IIRC, I'd assume it went into SLES. So changing the verbosity of the sbd daemon might get you back these logs.

Yes, I think that's the issue. Could you please retrieve the latest maintenance update for SLE12SP3 and try? Otherwise, of course, you could temporarily enable verbose/debug logging by adding a couple of "-v" into "SBD_OPTS" in /etc/sysconfig/sbd.

But frankly, it makes more sense to manually trigger fencing, for example by "crm node fence", and see if it indeed works correctly. And of course you can use the list command on the other node to verify as well. The "test" message in the slot might get overwritten soon by a "clear" if the sbd daemon is running.

Regards,
Yan

Klaus

The SBD device was added during the initialization of the first cluster node. I have requested help from the SuSE guys, but they haven't given me any valuable feedback yet!
Below are some screenshots to explain what I have encountered. [screenshots were not preserved in the archive]

On a SuSE11 SP4 HAE cluster, I ran the sbd test command as below: [screenshot] Then some information showed up in the local system message log on the second node; we can see that the communication is normal. But when I turned to a SuSE12 SP3 HAE cluster and ran the same command as above, I didn't get any response in the system message log. "systemctl status sbd" also doesn't give me any clue on this.

What could be the reason for this abnormal behavior? Are there any problems with my setup? Any suggestions are appreciated! Thanks!

Regards
FuLong
Re: [ClusterLabs] SuSE12SP3 HAE SBD Communication Issue
First, thanks for your reply, Klaus!

On 2018/12/21 10:09, Klaus Wenninger wrote:

On 12/21/2018 08:15 AM, Fulong Wang wrote:

Hello experts, I'm new to this mailing list. Please kindly forgive me if this mail has disturbed you! Our company is currently evaluating the usage of SuSE HAE on the x86 platform. When simulating the storage disaster fail-over, I found that the SBD communication functioned normally on SuSE11 SP4 but abnormally on SuSE12 SP3.

I have no experience with SBD on SLES, but I know that handling of the logging verbosity levels has changed recently in the upstream repo. Given that it was done by Yan Gao, IIRC, I'd assume it went into SLES. So changing the verbosity of the sbd daemon might get you back these logs.

Yes, I think that's the issue. Could you please retrieve the latest maintenance update for SLE12SP3 and try? Otherwise, of course, you could temporarily enable verbose/debug logging by adding a couple of "-v" into "SBD_OPTS" in /etc/sysconfig/sbd.

But frankly, it makes more sense to manually trigger fencing, for example by "crm node fence", and see if it indeed works correctly. And of course you can use the list command on the other node to verify as well. The "test" message in the slot might get overwritten soon by a "clear" if the sbd daemon is running.

Regards,
Yan

Klaus

The SBD device was added during the initialization of the first cluster node. I have requested help from the SuSE guys, but they haven't given me any valuable feedback yet!

Below are some screenshots to explain what I have encountered. [screenshots were not preserved in the archive]

On a SuSE11 SP4 HAE cluster, I ran the sbd test command as below: [screenshot] Then some information showed up in the local system message log on the second node; we can see that the communication is normal. But when I turned to a SuSE12 SP3 HAE cluster and ran the same command as above, I didn't get any response in the system message log. "systemctl status sbd" also doesn't give me any clue on this.

What could be the reason for this abnormal behavior? Are there any problems with my setup? Any suggestions are appreciated! Thanks!

Regards
FuLong
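To make the suggestion concrete, a hedged excerpt of /etc/sysconfig/sbd (the device path is a placeholder; the variable names are the standard sysconfig options):

    SBD_DEVICE="/dev/disk/by-id/SBD_DEVICE"
    SBD_PACEMAKER="yes"        # pacemaker integration (-P)
    SBD_OPTS="-v"              # one -v for verbose; repeat for more detail
    # Then restart the whole stack so sbd.service gets restarted:
    #   crm cluster stop && crm cluster start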
Re: [ClusterLabs] Wrong sbd.service dependencies
On 2017/12/16 16:59, Andrei Borzenkov wrote: 04.12.2017 21:55, Andrei Borzenkov writes: ... I tried it (on openSUSE Tumbleweed, which is what I have at hand; it has SBD 1.3.0) and with SBD_DELAY_START=yes sbd does not appear to watch the disk at all. It simply waits that long on startup, before starting the rest of the cluster stack, to make sure any fencing that targeted it has returned. It intentionally doesn't watch anything during this period of time. Unfortunately it waits too long.
ha1:~ # systemctl status sbd.service
● sbd.service - Shared-storage based fencing daemon
   Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor preset: disabled)
   Active: failed (Result: timeout) since Mon 2017-12-04 21:47:03 MSK; 4min 16s ago
  Process: 1861 ExecStop=/usr/bin/kill -TERM $MAINPID (code=exited, status=0/SUCCESS)
  Process: 2058 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid watch (code=killed, signa
 Main PID: 1792 (code=exited, status=0/SUCCESS)
Dec 04 21:45:32 ha1 systemd[1]: Starting Shared-storage based fencing daemon...
Dec 04 21:47:02 ha1 systemd[1]: sbd.service: Start operation timed out. Terminating.
Dec 04 21:47:03 ha1 systemd[1]: Failed to start Shared-storage based fencing daemon.
Dec 04 21:47:03 ha1 systemd[1]: sbd.service: Unit entered failed state.
Dec 04 21:47:03 ha1 systemd[1]: sbd.service: Failed with result 'timeout'.
But the real problem is: in spite of SBD failing to start, the whole cluster stack continues to run; and because SBD blindly trusts in well-behaving nodes, fencing appears to succeed after the timeout ... without anyone taking any action on the poison pill ... That's an sbd bug. It declares itself as RequiredBy=corosync.service but puts itself Before=pacemaker.service. Due to systemd design, service A *MUST* have a Before dependency on service B if failure to start A should cause failure to start B. *Or* use BindsTo ... but that sounds wrong, because it would cause B to start briefly and then be killed. So the question is what is intended here. Should sbd.service be a prerequisite for corosync or for pacemaker? It should be so only if it's enabled. Try this: https://github.com/ClusterLabs/sbd/pull/39 Thanks to Klaus, btw. Regards, Yan Should failure to start SBD be fatal for startup of the dependent service? Finally, does sbd need an explicit dependency on pacemaker.service at all (in addition to corosync.service)? Adding the Before dependency fixes the startup logic for me.
ha1:~ # systemctl start pacemaker.service
A dependency job for pacemaker.service failed. See 'journalctl -xe' for details.
ha1:~ # systemctl -l --no-pager status pacemaker.service
● pacemaker.service - Pacemaker High Availability Cluster Manager
   Loaded: loaded (/etc/systemd/system/pacemaker.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
     Docs: man:pacemakerd
           http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/index.html
Dec 16 18:56:06 ha1 systemd[1]: Dependency failed for Pacemaker High Availability Cluster Manager.
Dec 16 18:56:06 ha1 systemd[1]: pacemaker.service: Job pacemaker.service/start failed with result 'dependency'.
ha1:~ # systemctl -l --no-pager status corosync.service
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/usr/lib/systemd/system/corosync.service; static; vendor preset: disabled)
   Active: inactive (dead)
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
Dec 16 18:56:06 ha1 systemd[1]: Dependency failed for Corosync Cluster Engine.
Dec 16 18:56:06 ha1 systemd[1]: corosync.service: Job corosync.service/start failed with result 'dependency'.
ha1:~ # systemctl -l --no-pager status sbd.service
● sbd.service - Shared-storage based fencing daemon
   Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/sbd.service.d
           └─before-corosync.conf
   Active: failed (Result: timeout) since Sat 2017-12-16 18:56:06 MSK; 50s ago
  Process: 3675 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid watch (code=killed, signal=TERM)
Dec 16 18:54:36 ha1 systemd[1]: Starting Shared-storage based fencing daemon...
Dec 16 18:56:06 ha1 systemd[1]: sbd.service: Start operation timed out. Terminating.
Dec 16 18:56:06 ha1 systemd[1]: Failed to start Shared-storage based fencing daemon.
Dec 16 18:56:06 ha1 systemd[1]: sbd.service: Unit entered failed state.
Dec 16 18:56:06 ha1 systemd[1]: sbd.service: Failed with result 'timeout'.
ha1:~ # cat /etc/systemd/system/sbd.service.d/before-corosync.conf
[Unit]
Before=corosync.service
ha1:~ #
___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
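[Editor's sketch] For anyone wanting to reproduce Andrei's workaround before the proper fix from the pull request above lands, the drop-in he shows can be created like this; it is only a sketch of what he already did, using standard systemd locations:

mkdir -p /etc/systemd/system/sbd.service.d
cat > /etc/systemd/system/sbd.service.d/before-corosync.conf <<'EOF'
[Unit]
Before=corosync.service
EOF
systemctl daemon-reload
systemctl show sbd.service -p Before   # sanity-check that corosync.service is now listed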
Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: pacemaker with sbd fails to start if node reboots too fast.
On 12/05/2017 03:11 PM, Ulrich Windl wrote: "Gao,Yan" wrote on 05.12.2017 at 15:04 in message: On 12/05/2017 12:41 PM, Ulrich Windl wrote: "Gao,Yan" wrote on 01.12.2017 at 20:36 in message: [...] I meant: There are three delays:
1) The delay until the data is on the disk. It takes several IOs for the sender to do this -- read the device header, look up the slot, write the message and verify the message is written (timeout_io defaults to 3s). As mentioned, the sender's msgwait timer starts only after the message has been verified to be written. We just need to make sure stonith-timeout is configured long enough to cover the sum.
2) The delay until the data is read from the disk. It's already taken into account by msgwait. Since the recipient keeps reading in a loop, we don't know exactly when it starts the read for this specific message. But once it starts reading, it has to finish within timeout_watchdog, otherwise the watchdog triggers. So even in a bad case, the message should be read within 2 * timeout_watchdog. That's the reason the sender has to wait msgwait, which is 2 * timeout_watchdog.
3) The delay until the host is killed. The kill is triggered basically immediately once the poison pill is read.
Considering that the response time of a SAN disk system with cache is typically a few microseconds, writing to disk may be even "more immediate" than killing the node via watchdog reset ;-) Well, it's possible :) Timeouts matter for the "bad cases" though. Compared with a disk IO facing difficulties like path failure and so on, triggering the watchdog is trivial. So you can't easily say one is immediate while the other has to be waited for, IMHO. Of course an even longer msgwait, with all the factors you can think of taken into account, will be even safer. Regards, Yan Regards, Ulrich A confirmation before 3) could shorten the total wait that includes 2) and 3), right? As mentioned in another email, an alive node, even one indeed coming back from death, cannot actually confirm its own death or even give a confirmation about whether it was ever dead. And a successful fencing means the node being dead. Regards, Yan Regards, Ulrich Regards, Yan [...]
___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
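[Editor's sketch] To make the sums in this thread concrete, here is a worked example using the timeouts from Andrei's cluster; the 20% safety margin is this editor's assumption, not a project recommendation:

timeout_watchdog = 60s
timeout_msgwait  = 2 * timeout_watchdog = 120s
timeout_io       = 3s (default)
stonith-timeout  > sender-side IOs (a few seconds, bounded by timeout_io) + timeout_msgwait
                   e.g. 1.2 * (120s + a few seconds of IO) ≈ 150s as a conservative value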
Re: [ClusterLabs] Antw: Re: Antw: Re: pacemaker with sbd fails to start if node reboots too fast.
On 12/05/2017 12:41 PM, Ulrich Windl wrote: "Gao,Yan" wrote on 01.12.2017 at 20:36 in message: On 11/30/2017 06:48 PM, Andrei Borzenkov wrote: 30.11.2017 16:11, Klaus Wenninger writes: On 11/30/2017 01:41 PM, Ulrich Windl wrote: "Gao,Yan" wrote on 30.11.2017 at 11:48 in message: On 11/22/2017 08:01 PM, Andrei Borzenkov wrote: SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two-node cluster with VMs on VSphere using a shared VMDK as SBD. During basic tests, by killing corosync and forcing STONITH, pacemaker was not started after reboot. In the logs I see during boot:
Nov 22 16:04:56 sapprod01s crmd[3151]: crit: We were allegedly just fenced by sapprod01p for sapprod01p
Nov 22 16:04:56 sapprod01s pacemakerd[3137]: warning: The crmd process (3151) can no longer be respawned,
Nov 22 16:04:56 sapprod01s pacemakerd[3137]: notice: Shutting down Pacemaker
SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that stonith with SBD always takes msgwait (at least, visually the host is not declared OFFLINE until 120s have passed). But the VM reboots lightning fast and is up and running long before the timeout expires. As msgwait was intended for the message to arrive, and not for the reboot time (I guess), this just shows a fundamental problem in SBD design: receipt of the fencing command is not confirmed (other than by seeing the consequences of its execution). The 2 x msgwait is not for confirmations but for writing the poison pill and for having it read by the target side. Yes, of course, but that's not what Ulrich likely intended to say. msgwait must account for worst-case storage path latency, while in normal cases it happens much faster. If the fenced node could acknowledge having been killed after reboot, the stonith agent could return success much earlier. How could an alive man be sure he died before? ;) I meant: There are three delays:
1) The delay until the data is on the disk. It takes several IOs for the sender to do this -- read the device header, look up the slot, write the message and verify the message is written (timeout_io defaults to 3s). As mentioned, the sender's msgwait timer starts only after the message has been verified to be written. We just need to make sure stonith-timeout is configured long enough to cover the sum.
2) The delay until the data is read from the disk. It's already taken into account by msgwait. Since the recipient keeps reading in a loop, we don't know exactly when it starts the read for this specific message. But once it starts reading, it has to finish within timeout_watchdog, otherwise the watchdog triggers. So even in a bad case, the message should be read within 2 * timeout_watchdog. That's the reason the sender has to wait msgwait, which is 2 * timeout_watchdog.
3) The delay until the host is killed. The kill is triggered basically immediately once the poison pill is read.
A confirmation before 3) could shorten the total wait that includes 2) and 3), right? As mentioned in another email, an alive node, even one indeed coming back from death, cannot actually confirm its own death or even give a confirmation about whether it was ever dead. And a successful fencing means the node being dead.
Regards, Yan Regards, Ulrich Regards, Yan
___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] pacemaker with sbd fails to start if node reboots too fast.
On 12/05/2017 08:57 AM, Dejan Muhamedagic wrote: On Mon, Dec 04, 2017 at 09:55:46PM +0300, Andrei Borzenkov wrote: 04.12.2017 14:48, Gao,Yan writes: On 12/02/2017 07:19 PM, Andrei Borzenkov wrote: 30.11.2017 13:48, Gao,Yan writes: On 11/22/2017 08:01 PM, Andrei Borzenkov wrote: SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two-node cluster with VMs on VSphere using a shared VMDK as SBD. During basic tests, by killing corosync and forcing STONITH, pacemaker was not started after reboot. In the logs I see during boot:
Nov 22 16:04:56 sapprod01s crmd[3151]: crit: We were allegedly just fenced by sapprod01p for sapprod01p
Nov 22 16:04:56 sapprod01s pacemakerd[3137]: warning: The crmd process (3151) can no longer be respawned,
Nov 22 16:04:56 sapprod01s pacemakerd[3137]: notice: Shutting down Pacemaker
SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that stonith with SBD always takes msgwait (at least, visually the host is not declared OFFLINE until 120s have passed). But the VM reboots lightning fast and is up and running long before the timeout expires. I think I have seen a similar report already. Is it something that can be fixed by SBD/pacemaker tuning? SBD_DELAY_START=yes in /etc/sysconfig/sbd is the solution. I tried it (on openSUSE Tumbleweed, which is what I have at hand; it has SBD 1.3.0) and with SBD_DELAY_START=yes sbd does not appear to watch the disk at all. It simply waits that long on startup, before starting the rest of the cluster stack, to make sure any fencing that targeted it has returned. It intentionally doesn't watch anything during this period of time. Unfortunately it waits too long.
ha1:~ # systemctl status sbd.service
● sbd.service - Shared-storage based fencing daemon
   Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor preset: disabled)
   Active: failed (Result: timeout) since Mon 2017-12-04 21:47:03 MSK; 4min 16s ago
  Process: 1861 ExecStop=/usr/bin/kill -TERM $MAINPID (code=exited, status=0/SUCCESS)
  Process: 2058 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid watch (code=killed, signa
 Main PID: 1792 (code=exited, status=0/SUCCESS)
Dec 04 21:45:32 ha1 systemd[1]: Starting Shared-storage based fencing daemon...
Dec 04 21:47:02 ha1 systemd[1]: sbd.service: Start operation timed out. Terminating.
Dec 04 21:47:03 ha1 systemd[1]: Failed to start Shared-storage based fencing daemon.
Dec 04 21:47:03 ha1 systemd[1]: sbd.service: Unit entered failed state.
Dec 04 21:47:03 ha1 systemd[1]: sbd.service: Failed with result 'timeout'.
But the real problem is: in spite of SBD failing to start, the whole cluster stack continues to run; and because SBD blindly trusts in well-behaving nodes, fencing appears to succeed after the timeout ... without anyone taking any action on the poison pill ... That's something I always wondered about: if a node is capable of reading a poison pill, then it could before shutdown also write an "I'm leaving" message into its slot. Wouldn't that make sbd more reliable? Any reason not to implement that? Probably it's not considered necessary :) SBD is a fencing mechanism, which only needs to ensure fencing works. SBD on the fencing target is either there eating the pill or getting reset by the watchdog; otherwise it's not there at all, which is supposed to imply the whole cluster stack is not running, so that it doesn't need to actually eat the pill. How systemd should handle the service dependencies is another topic...
Regards, Yan Thanks, Dejan
___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] pacemaker with sbd fails to start if node reboots too fast.
On 12/04/2017 07:55 PM, Andrei Borzenkov wrote: 04.12.2017 14:48, Gao,Yan writes: On 12/02/2017 07:19 PM, Andrei Borzenkov wrote: 30.11.2017 13:48, Gao,Yan writes: On 11/22/2017 08:01 PM, Andrei Borzenkov wrote: SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two-node cluster with VMs on VSphere using a shared VMDK as SBD. During basic tests, by killing corosync and forcing STONITH, pacemaker was not started after reboot. In the logs I see during boot:
Nov 22 16:04:56 sapprod01s crmd[3151]: crit: We were allegedly just fenced by sapprod01p for sapprod01p
Nov 22 16:04:56 sapprod01s pacemakerd[3137]: warning: The crmd process (3151) can no longer be respawned,
Nov 22 16:04:56 sapprod01s pacemakerd[3137]: notice: Shutting down Pacemaker
SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that stonith with SBD always takes msgwait (at least, visually the host is not declared OFFLINE until 120s have passed). But the VM reboots lightning fast and is up and running long before the timeout expires. I think I have seen a similar report already. Is it something that can be fixed by SBD/pacemaker tuning? SBD_DELAY_START=yes in /etc/sysconfig/sbd is the solution. I tried it (on openSUSE Tumbleweed, which is what I have at hand; it has SBD 1.3.0) and with SBD_DELAY_START=yes sbd does not appear to watch the disk at all. It simply waits that long on startup, before starting the rest of the cluster stack, to make sure any fencing that targeted it has returned. It intentionally doesn't watch anything during this period of time. Unfortunately it waits too long.
ha1:~ # systemctl status sbd.service
● sbd.service - Shared-storage based fencing daemon
   Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor preset: disabled)
   Active: failed (Result: timeout) since Mon 2017-12-04 21:47:03 MSK; 4min 16s ago
  Process: 1861 ExecStop=/usr/bin/kill -TERM $MAINPID (code=exited, status=0/SUCCESS)
  Process: 2058 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid watch (code=killed, signa
 Main PID: 1792 (code=exited, status=0/SUCCESS)
Dec 04 21:45:32 ha1 systemd[1]: Starting Shared-storage based fencing daemon...
Dec 04 21:47:02 ha1 systemd[1]: sbd.service: Start operation timed out. Terminating.
Dec 04 21:47:03 ha1 systemd[1]: Failed to start Shared-storage based fencing daemon.
Dec 04 21:47:03 ha1 systemd[1]: sbd.service: Unit entered failed state.
Dec 04 21:47:03 ha1 systemd[1]: sbd.service: Failed with result 'timeout'.
But the real problem is: in spite of SBD failing to start, the whole cluster stack continues to run; and because SBD blindly trusts in well-behaving nodes, fencing appears to succeed after the timeout ... without anyone taking any action on the poison pill ... The start of sbd runs into systemd's timeout for starting units, and systemd proceeds... TimeoutStartSec in sbd.service should accordingly be configured to be longer than msgwait.
Regards, Yan
ha1:~ # systemctl show sbd.service -p RequiredBy
RequiredBy=corosync.service
but
ha1:~ # systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/usr/lib/systemd/system/corosync.service; static; vendor preset: disabled)
   Active: active (running) since Mon 2017-12-04 21:45:33 MSK; 7min ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
  Process: 1860 ExecStop=/usr/share/corosync/corosync stop (code=exited, status=0/SUCCESS)
  Process: 2059 ExecStart=/usr/share/corosync/corosync start (code=exited, status=0/SUCCESS)
 Main PID: 2073 (corosync)
    Tasks: 2 (limit: 4915)
   CGroup: /system.slice/corosync.service
           └─2073 corosync
and
ha1:~ # crm_mon -1r
Stack: corosync
Current DC: ha1 (version 1.1.17-3.3-36d2962a8) - partition with quorum
Last updated: Mon Dec 4 21:53:24 2017
Last change: Mon Dec 4 21:47:25 2017 by hacluster via crmd on ha1
2 nodes configured
1 resource configured
Online: [ ha1 ha2 ]
Full list of resources:
stonith-sbd (stonith:external/sbd): Started ha1
and if I now sever the connection between the two nodes, I will get two single-node clusters, each believing it won ...
___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
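[Editor's sketch] Yan's TimeoutStartSec remark as a systemd drop-in; the 180s figure is an assumption sized against the 120s msgwait used in this thread, not an upstream default:

# /etc/systemd/system/sbd.service.d/timeout.conf
[Service]
TimeoutStartSec=180
# must comfortably exceed msgwait (120s here); otherwise a delayed sbd start
# is killed by systemd exactly as in the status output above

Run "systemctl daemon-reload" after creating the file.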
Re: [ClusterLabs] Antw: Re: pacemaker with sbd fails to start if node reboots too fast.
On 12/02/2017 08:30 AM, Andrei Borzenkov wrote: 01.12.2017 22:36, Gao,Yan writes: On 11/30/2017 06:48 PM, Andrei Borzenkov wrote: 30.11.2017 16:11, Klaus Wenninger writes: On 11/30/2017 01:41 PM, Ulrich Windl wrote: "Gao,Yan" wrote on 30.11.2017 at 11:48 in message: On 11/22/2017 08:01 PM, Andrei Borzenkov wrote: SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two-node cluster with VMs on VSphere using a shared VMDK as SBD. During basic tests, by killing corosync and forcing STONITH, pacemaker was not started after reboot. In the logs I see during boot:
Nov 22 16:04:56 sapprod01s crmd[3151]: crit: We were allegedly just fenced by sapprod01p for sapprod01p
Nov 22 16:04:56 sapprod01s pacemakerd[3137]: warning: The crmd process (3151) can no longer be respawned,
Nov 22 16:04:56 sapprod01s pacemakerd[3137]: notice: Shutting down Pacemaker
SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that stonith with SBD always takes msgwait (at least, visually the host is not declared OFFLINE until 120s have passed). But the VM reboots lightning fast and is up and running long before the timeout expires. As msgwait was intended for the message to arrive, and not for the reboot time (I guess), this just shows a fundamental problem in SBD design: receipt of the fencing command is not confirmed (other than by seeing the consequences of its execution). The 2 x msgwait is not for confirmations but for writing the poison pill and for having it read by the target side. Yes, of course, but that's not what Ulrich likely intended to say. msgwait must account for worst-case storage path latency, while in normal cases it happens much faster. If the fenced node could acknowledge having been killed after reboot, the stonith agent could return success much earlier. How could an alive man be sure he died before? ;) It does not need to. It simply needs to write something on startup to indicate it is back. It does that. The thing is, the sender cannot just assume from that that the target was ever really gone. And it doesn't make sense for a fencing operation to return success when the target appears to be alive. If the sender kept watching the slot, it would probably make more sense to let the fencing return failure and try again. Regards, Yan Actually, the fenced side already does it - it clears the pending message when sbd is started. It is the fencing side that simply unconditionally sleeps for msgwait:
if (mbox_write_verify(st, mbox, s_mbox) < -1) {
    rc = -1;
    goto out;
}
if (strcasecmp(cmd, "exit") != 0) {
    cl_log(LOG_INFO, "Messaging delay: %d", (int)timeout_msgwait);
    sleep(timeout_msgwait);
}
What if we did not sleep but rather periodically checked the slot for an acknowledgement, up to the msgwait timeout? Then we could return earlier.
___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
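[Editor's sketch] A self-contained C illustration of Andrei's "poll instead of sleep" idea; it is not sbd code. check_slot_cleared() is a hypothetical stand-in for reading the target's slot from the shared disk, stubbed out here since the real on-disk format lives in sbd:

#include <stdio.h>
#include <unistd.h>

/* Stub: in real sbd this would read the target's slot and report whether
 * the recipient has replaced the "reset" message after coming back up. */
static int check_slot_cleared(void)
{
    return 0; /* pretend the target never acknowledges */
}

static int wait_for_fencing(long timeout_msgwait)
{
    long waited = 0;

    while (waited < timeout_msgwait) {
        if (check_slot_cleared()) {
            printf("target acknowledged after %lds, returning early\n", waited);
            return 0;
        }
        sleep(1);
        waited++;
    }
    printf("full msgwait (%lds) elapsed; per sbd's contract the target is dead\n",
           timeout_msgwait);
    return 0;
}

int main(void)
{
    return wait_for_fencing(5); /* use 120 for the msgwait in this thread */
}

Yan's objection above still stands, though: an early acknowledgement only proves the node is back, not that returning fencing success is the right semantic, which is presumably why the quoted code keeps the plain sleep.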
Re: [ClusterLabs] pacemaker with sbd fails to start if node reboots too fast.
On 12/02/2017 07:19 PM, Andrei Borzenkov wrote: 30.11.2017 13:48, Gao,Yan writes: On 11/22/2017 08:01 PM, Andrei Borzenkov wrote: SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two-node cluster with VMs on VSphere using a shared VMDK as SBD. During basic tests, by killing corosync and forcing STONITH, pacemaker was not started after reboot. In the logs I see during boot:
Nov 22 16:04:56 sapprod01s crmd[3151]: crit: We were allegedly just fenced by sapprod01p for sapprod01p
Nov 22 16:04:56 sapprod01s pacemakerd[3137]: warning: The crmd process (3151) can no longer be respawned,
Nov 22 16:04:56 sapprod01s pacemakerd[3137]: notice: Shutting down Pacemaker
SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that stonith with SBD always takes msgwait (at least, visually the host is not declared OFFLINE until 120s have passed). But the VM reboots lightning fast and is up and running long before the timeout expires. I think I have seen a similar report already. Is it something that can be fixed by SBD/pacemaker tuning? SBD_DELAY_START=yes in /etc/sysconfig/sbd is the solution. I tried it (on openSUSE Tumbleweed, which is what I have at hand; it has SBD 1.3.0) and with SBD_DELAY_START=yes sbd does not appear to watch the disk at all. It simply waits that long on startup, before starting the rest of the cluster stack, to make sure any fencing that targeted it has returned. It intentionally doesn't watch anything during this period of time. Regards, Yan First, at startup no slot is allocated for a node at all (confirmed with "sbd list"). I manually allocated slots for both nodes; then I see that the stonith agent does post the "reboot" message (confirmed with "sbd list" again) and sbd never reacts to it. Even after a system reboot, the message on disk is not cleared. Removing SBD_DELAY_START and restarting pacemaker (with an implicit SBD restart) immediately cleared the pending messages.
___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
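[Editor's sketch] The slot operations Andrei describes correspond to these sbd commands; the device path is a placeholder for your own SBD disk:

sbd -d /dev/disk/by-id/<your-sbd-disk> list              # show slots and any pending messages
sbd -d /dev/disk/by-id/<your-sbd-disk> allocate <node>   # manually allocate a slot for <node>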
Re: [ClusterLabs] Antw: Re: pacemaker with sbd fails to start if node reboots too fast.
On 11/30/2017 06:48 PM, Andrei Borzenkov wrote: 30.11.2017 16:11, Klaus Wenninger writes: On 11/30/2017 01:41 PM, Ulrich Windl wrote: "Gao,Yan" wrote on 30.11.2017 at 11:48 in message: On 11/22/2017 08:01 PM, Andrei Borzenkov wrote: SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two-node cluster with VMs on VSphere using a shared VMDK as SBD. During basic tests, by killing corosync and forcing STONITH, pacemaker was not started after reboot. In the logs I see during boot:
Nov 22 16:04:56 sapprod01s crmd[3151]: crit: We were allegedly just fenced by sapprod01p for sapprod01p
Nov 22 16:04:56 sapprod01s pacemakerd[3137]: warning: The crmd process (3151) can no longer be respawned,
Nov 22 16:04:56 sapprod01s pacemakerd[3137]: notice: Shutting down Pacemaker
SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that stonith with SBD always takes msgwait (at least, visually the host is not declared OFFLINE until 120s have passed). But the VM reboots lightning fast and is up and running long before the timeout expires. As msgwait was intended for the message to arrive, and not for the reboot time (I guess), this just shows a fundamental problem in SBD design: receipt of the fencing command is not confirmed (other than by seeing the consequences of its execution). The 2 x msgwait is not for confirmations but for writing the poison pill and for having it read by the target side. Yes, of course, but that's not what Ulrich likely intended to say. msgwait must account for worst-case storage path latency, while in normal cases it happens much faster. If the fenced node could acknowledge having been killed after reboot, the stonith agent could return success much earlier. How could an alive man be sure he died before? ;) Regards, Yan
___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] Antw: Re: pacemaker with sbd fails to start if node reboots too fast.
On 11/30/2017 01:41 PM, Ulrich Windl wrote: "Gao,Yan" wrote on 30.11.2017 at 11:48 in message: On 11/22/2017 08:01 PM, Andrei Borzenkov wrote: SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two-node cluster with VMs on VSphere using a shared VMDK as SBD. During basic tests, by killing corosync and forcing STONITH, pacemaker was not started after reboot. In the logs I see during boot:
Nov 22 16:04:56 sapprod01s crmd[3151]: crit: We were allegedly just fenced by sapprod01p for sapprod01p
Nov 22 16:04:56 sapprod01s pacemakerd[3137]: warning: The crmd process (3151) can no longer be respawned,
Nov 22 16:04:56 sapprod01s pacemakerd[3137]: notice: Shutting down Pacemaker
SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that stonith with SBD always takes msgwait (at least, visually the host is not declared OFFLINE until 120s have passed). But the VM reboots lightning fast and is up and running long before the timeout expires. As msgwait was intended for the message to arrive, and not for the reboot time (I guess), The msgwait timer on the sender starts only after a successful write. The recipient will either eat the pill or get killed by the watchdog within the watchdog timeout. As mentioned in the sbd man page, msgwait should be twice the watchdog timeout, so that the sender can safely assume the target is dead when the msgwait timer pops. Regards, Yan this just shows a fundamental problem in SBD design: receipt of the fencing command is not confirmed (other than by seeing the consequences of its execution). So the fencing node will see the other host is down (on the network), but it won't believe it until SBD msgwait is over. OTOH, if your msgwait is very low and the storage has a problem (exceeding msgwait), the node will assume a successful fencing when in fact it didn't complete. So maybe there should be two timeouts: one for the command to be delivered (without needing a confirmation, though a confirmation could shorten the wait), and another for executing the command (how long it will take from receipt of the command until the host is definitely down). Again, a confirmation could stop the waiting before the timeout is reached. Regards, Ulrich I think I have seen a similar report already. Is it something that can be fixed by SBD/pacemaker tuning? SBD_DELAY_START=yes in /etc/sysconfig/sbd is the solution. Regards, Yan I can provide full logs tomorrow if needed. TIA -andrei
___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
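[Editor's sketch] Since the watchdog/msgwait relationship comes up repeatedly here, this is how the timeouts from this thread would be set when initializing the device; the device path is a placeholder:

sbd -d /dev/disk/by-id/<your-sbd-disk> -1 60 -4 120 create   # watchdog 60s, msgwait 2*60=120s
sbd -d /dev/disk/by-id/<your-sbd-disk> dump                  # verify what is actually on the disk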
Re: [ClusterLabs] pacemaker with sbd fails to start if node reboots too fast.
On 11/22/2017 08:01 PM, Andrei Borzenkov wrote: SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two-node cluster with VMs on VSphere using a shared VMDK as SBD. During basic tests, by killing corosync and forcing STONITH, pacemaker was not started after reboot. In the logs I see during boot:
Nov 22 16:04:56 sapprod01s crmd[3151]: crit: We were allegedly just fenced by sapprod01p for sapprod01p
Nov 22 16:04:56 sapprod01s pacemakerd[3137]: warning: The crmd process (3151) can no longer be respawned,
Nov 22 16:04:56 sapprod01s pacemakerd[3137]: notice: Shutting down Pacemaker
SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that stonith with SBD always takes msgwait (at least, visually the host is not declared OFFLINE until 120s have passed). But the VM reboots lightning fast and is up and running long before the timeout expires. I think I have seen a similar report already. Is it something that can be fixed by SBD/pacemaker tuning? SBD_DELAY_START=yes in /etc/sysconfig/sbd is the solution. Regards, Yan I can provide full logs tomorrow if needed. TIA -andrei
___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
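[Editor's sketch] The one-line fix Yan names, for reference; note from the other messages in these threads that systemd's start timeout for sbd.service must then accommodate the delay:

# /etc/sysconfig/sbd
SBD_DELAY_START=yes   # on boot, delay starting the cluster stack so any
                      # in-flight poison-pill/msgwait cycle can complete first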
Re: [ClusterLabs] questions about startup fencing
On 11/30/2017 09:14 AM, Andrei Borzenkov wrote: On Wed, Nov 29, 2017 at 6:54 PM, Ken Gaillot wrote: The same scenario is why a single node can't have quorum at start-up in a cluster with "two_node" set. Both nodes have to see each other at least once before they can assume it's safe to do anything. Unless we set no-quorum-policy=ignore, in which case it will proceed after fencing another node. As far as I understand, this is the only way to get the number of active cluster nodes below quorum, right? To be safe, "two_node: 1" automatically enables "wait_for_all". Of course one can explicitly disable "wait_for_all" if they know what they are doing. Regards, Yan
___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
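[Editor's sketch] For illustration, the settings under discussion live in the quorum section of corosync.conf; this assumes a corosync 2.x votequorum setup:

quorum {
    provider: corosync_votequorum
    two_node: 1
    # two_node: 1 implies wait_for_all: 1; explicitly setting
    # wait_for_all: 0 disables that safety, only for those who know what they are doing
}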
Re: [ClusterLabs] questions about startup fencing
On 11/29/2017 04:54 PM, Ken Gaillot wrote: On Wed, 2017-11-29 at 14:22 +, Adam Spiers wrote: The same questions apply if this troublesome node was actually a remote node running pacemaker_remoted, rather than the 5th node in the cluster. Remote nodes don't join at the crmd level as cluster nodes do, so they don't "start up" in the same sense, and start-up fencing doesn't apply to them. Instead, the cluster initiates the connection when called for (I don't remember for sure whether it fences the remote node if the connection fails, but that would make sense). According to link_rsc2remotenode() and handle_startup_fencing(), a similar "startup fencing" applies to remote nodes too. So if a remote resource fails to start, the remote node will be fenced. The global setting startup-fencing=false changes the behavior for remote nodes too. Regards, Yan
___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
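[Editor's sketch] For completeness, the global setting Yan mentions is an ordinary cluster property; shown here with crmsh, and bear in mind that disabling startup fencing trades safety for convenience:

crm configure property startup-fencing=false   # affects remote nodes as well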
Re: [ClusterLabs] [Pacemaker1.0.13] [hbagent] The hbagent does not stop.
Hi Hideo, On 09/08/2015 04:28 AM, renayama19661...@ybb.ne.jp wrote: > Hi All, > > We hit a problem with Pacemaker 1.0.13. > > * RHEL6.4 (kernel-2.6.32-358.23.2.el6.x86_64) > * SNMP: > * net-snmp-libs-5.5-49.el6_5.1.x86_64 > * hp-snmp-agents-9.50-2564.40.rhel6.x86_64 > * net-snmp-utils-5.5-49.el6_5.1.x86_64 > * net-snmp-5.5-49.el6_5.1.x86_64 > * Pacemaker 1.0.13 > * pacemaker-mgmt-2.0.1 > > We started hbagent via respawn in this environment, but hbagent did not stop when we stopped Heartbeat. > From the log, SIGTERM seemed to be sent by Heartbeat, but there was no trace that hbagent received SIGTERM. > > We have tried to reproduce the problem, but so far it has not reappeared. > > We suppose that pacemaker-mgmt (hbagent) or snmp has a problem. > > Does anyone know of a similar problem? > Or the cause of the problem? Sounds weird. I've never encountered the issue before. Actually I haven't run it with heartbeat for years ;-) We'd probably have to find the pattern and reproduce it. Regards, Yan -- Gao,Yan Senior Software Engineer SUSE LINUX GmbH
___ Users mailing list: Users@clusterlabs.org http://clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org