>>> Andrei Borzenkov <arvidj...@gmail.com> wrote on 17.12.2020 at 09:50 in message <caa91j0vuv4nmtetcpqnimf-xrrv_9kqkcnpvman4xbonbqp...@mail.gmail.com>:
...
> According to logs from xstha1, it started to activate resources only
> after stonith was confirmed
>
> Dec 16 15:08:12 [708] stonith-ng: notice: log_operation:
> Operation 'off' [1273] (call 4 from crmd.712) for host 'xstha2' with
> device 'xstha2-stonith' returned: 0 (OK)
> Dec 16 15:08:12 [708] stonith-ng: notice: remote_op_done:
> Operation 'off' targeting xstha2 on xstha1 for
> crmd.712@xstha1.e487e7cc: OK
>
> It is possible that your IPMI/BMC/whatever implementation responds
> with success before it actually completes this action. I have seen at

Shouldn't a reasonable "stonith-timeout=180" do? Even sbd needs one, because after sending the fence command, it has to be read and processed. For example what I see in the DC logs here around fencing is:

Nov 30 11:31:56 h18 pacemaker-fenced[49409]: notice: prm_stonith_sbd is eligible to fence (reboot) h16: dynamic-list
Nov 30 11:32:03 h18 corosync[49399]: [TOTEM ] A processor failed, forming new configuration.
Nov 30 11:32:09 h18 corosync[49399]: [TOTEM ] A new membership (172.20.16.18:42032) was formed. Members left: 116
...
Nov 30 11:32:09 h18 pacemaker-controld[49413]: notice: Our peer on the DC (h16) is dead
...
Nov 30 11:33:57 h18 pacemaker-controld[49413]: notice: Peer h16 was terminated (reboot) by h18 on behalf of pacemaker-controld.69600: OK
...note the delay between node being dead and confirmation...
Nov 30 11:36:05 h18 corosync[49399]: [TOTEM ] A new membership (172.20.16.16:42036) was formed. Members joined: 116
...node re-joined cluster after being fenced

> least some delays in the past. There is not really much that can be
> done here except adding artificial delay to stonith resource agent.
> You need to test IPMI functionality before using it in pacemaker.

Another example:

Dec 16 14:34:35 h18 pacemaker-controld[4478]: notice: Requesting fencing (reboot) of node h18
...
Dec 16 14:34:38 h18 pacemaker-fenced[4474]: notice: Requesting that h16 perform 'reboot' action targeting h18
...
Dec 16 14:34:40 h18 sbd[3717]: /dev/disk/by-id/dm-name-SBD_1-3P2: notice: servant_md: Received command reset from h16 on disk...
...
Dec 16 14:34:40 h18 sbd[3697]: warning: inquisitor_child: /dev/disk/by-id/dm-name-SBD_1-3P2 requested a reset
Dec 16 14:34:40 h18 sbd[3697]: emerg: do_exit: Rebooting system: reboot
...
Dec 16 14:34:45 h16 corosync[3617]: [TOTEM ] A processor failed, forming new configuration.
...
Dec 16 14:35:50 h16 dlm_controld[4802]: 170858 91E73809FE224F2495FE617D556E1800 wait for fencing
...
Dec 16 14:36:39 h16 pacemaker-controld[4527]: notice: Peer h18 was terminated (reboot) by h16 on behalf of pacemaker-controld.4478: OK
...
Dec 16 14:38:55 h16 corosync[3617]: [TOTEM ] A new membership (172.20.16.16:42128) was formed. Members joined: 118

The timeout (3 min) may be excessive here, but it shows what's going on.

Regards,
Ulrich
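
P.S. To put numbers on the delays visible in the two log excerpts, here is a minimal Python sketch that subtracts the syslog timestamps and compares the result with the 180-second stonith-timeout mentioned above. The helper name `seconds_between` and the year 2020 are my own assumptions (syslog lines carry no year), and `%b` parsing assumes an English/C locale. It is only an illustration of the arithmetic, not a tool that ships with pacemaker or sbd.

```python
from datetime import datetime

STONITH_TIMEOUT = 180  # seconds; the value suggested in this mail


def seconds_between(start: str, end: str, year: int = 2020) -> int:
    """Whole seconds between two syslog-style stamps such as 'Nov 30 11:32:09'.

    The year is an assumption, since syslog timestamps do not include one.
    """
    fmt = "%Y %b %d %H:%M:%S"
    t0 = datetime.strptime(f"{year} {start}", fmt)
    t1 = datetime.strptime(f"{year} {end}", fmt)
    return int((t1 - t0).total_seconds())


# (label, start, end) with timestamps copied from the log lines above
cases = [
    ("h16 reported dead -> fencing confirmed",
     "Nov 30 11:32:09", "Nov 30 11:33:57"),
    ("h16 asked to fence h18 -> fencing confirmed",
     "Dec 16 14:34:38", "Dec 16 14:36:39"),
]

delays = {label: seconds_between(a, b) for label, a, b in cases}

for label, delay in delays.items():
    margin = STONITH_TIMEOUT - delay
    print(f"{label}: {delay} s (margin to timeout: {margin} s)")
```

With the timestamps above this gives 108 s for the Nov 30 case and 121 s for the Dec 16 case, so both confirmations arrived within the 180-second timeout.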