>>> Andrei Borzenkov <arvidj...@gmail.com> wrote on 03.06.2022 at 17:04 in message <99f7746a-c962-33bb-6737-f88ba0128...@gmail.com>:
> On 03.06.2022 16:51, Zoran Bošnjak wrote:
>> Thanks for all your answers. Sorry, my mistake. The ipmi_watchdog is indeed
>> OK. I was first experimenting with "softdog", which is blacklisted. So the
>> reasonable question is how to properly start "softdog" on ubuntu.
>>
>
> blacklist prevents autoloading of modules by alias during hardware
> detection. Neither softdog nor ipmi_watchdog has any alias so they
> cannot be autoloaded and blacklist is irrelevant here.
>
>> The reason to unload watchdog module (ipmi or softdog) is that there seems
>> to be a difference between normal reboot and watchdog reboot.
>> In case of ipmi watchdog timer reboot:
>> - the system hangs at the end of reboot cycle for some time
>> - restart seems to be harder (like power off/on cycle), BIOS runs more
>>   diagnostics at startup

Maybe kdump is enabled in that case?

>> - it turns on HW diagnostic indication on the server front panel (dell
>>   server) which stays on forever
>> - it logs the event to IDRAC, which is unnecessary, because it was not a
>>   hardware event, but just a normal reboot

If the hardware watchdog times out and fires, it is considered to be an
exceptional event that will be logged and reported.

>>
>> In case of "sudo reboot" command, I would like to skip this... so the idea
>> is to fully stop the watchdog just before reboot. I am not sure how to do
>> this properly.
>>
>> The "softdog" is better in this respect. It does not trigger anything from
>> the list above, but I still get the message during reboot
>> [ ... ] watchdog: watchdog0: watchdog did not stop!
>> ... with some small timeout.
>>
>
> The first obvious question - is there only one watchdog? Some watchdog
> drivers *are* autoloaded.
>
> Is there only one user of watchdog? systemd may use it too, for example.

Don't mix timers with a watchdog: It makes little sense to have multiple
watchdogs enabled IMHO.
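To answer the "only one watchdog?" question on a given box, something like
the following could help. It is a rough sketch that assumes the kernel
exposes /sys/class/watchdog (CONFIG_WATCHDOG_SYSFS) with "identity" and
"nowayout" attributes; the function name list_watchdogs is made up for this
example. As far as I know, ipmi_watchdog registers /dev/watchdog on its own
rather than through the watchdog core, so it may not appear in this listing.

```shell
#!/bin/sh
# Sketch: list watchdog devices known to the kernel's watchdog core.
# Assumption: sysfs attributes "identity" and "nowayout" are available
# (CONFIG_WATCHDOG_SYSFS); missing attributes are reported as "unknown".
# Usage: list_watchdogs [SYSFS_DIR]   (SYSFS_DIR defaults to /sys/class/watchdog)
list_watchdogs() {
    base="${1:-/sys/class/watchdog}"
    for dev in "$base"/watchdog*; do
        # Skip the unexpanded glob when no device directory exists.
        [ -d "$dev" ] || continue
        name=$(basename "$dev")
        identity=$(cat "$dev/identity" 2>/dev/null || echo unknown)
        nowayout=$(cat "$dev/nowayout" 2>/dev/null || echo unknown)
        echo "$name identity=$identity nowayout=$nowayout"
    done
}

list_watchdogs
```

More than one line of output would mean more than one registered device; a
nowayout value of 1 would mean that device cannot be stopped once started.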
>
>> So after some additional testing, the situation is the following:
>>
>> - without any watchdog and without sbd package, the server reboots normally
>> - with "softdog" module loaded, I only get the "watchdog did not stop"
>>   message at reboot
>> - with "softdog" loaded, but unloaded with "ExecStop=...rmmod", reboot is
>>   normal again
>> - same as above, but with "sbd" package loaded, I am getting the "watchdog
>>   did not stop" message again
>> - switching from "softdog" to "ipmi_watchdog" gets me to the original list
>>   of problems
>>
>> It looks like the "sbd" is preventing the watchdog from closing, so that
>> the watchdog always triggers, even in the case of normal reboot. What am I
>> missing here?

The watchdog may have a "no way out" parameter that prevents disabling it
once it has been enabled.

>
> While the only way I can reproduce it on my QEMU VM is "reboot -f"
> (without stopping all services), there is certainly a race condition in
> sbd.service.
>
> ExecStop=@bindir@/kill -TERM $MAINPID
>
>
> systemd will continue as soon as "kill" completes without waiting for
> sbd to actually stop. It means systemd may complete shutdown sequence
> before sbd had chance to react on signal and then simply kill it. Which
> leaves watchdog armed.
>
> For test purpose try to use script that loops until sbd is actually
> stopped for ExecStop.
>
> Note that systemd strongly recommends to use synchronous command for
> ExecStop (we may argue that this should be handled by service manager
> itself, but well ...).
>
>>
>> Zoran
>>
>> ----- Original Message -----
>> From: "Andrei Borzenkov" <arvidj...@gmail.com>
>> To: "users" <users@clusterlabs.org>
>> Sent: Friday, June 3, 2022 11:24:03 AM
>> Subject: Re: [ClusterLabs] normal reboot with active sbd does not work
>>
>> On 03.06.2022 11:18, Zoran Bošnjak wrote:
>>> Hi all,
>>> I would appreciate some advice about sbd fencing (without shared storage).
>>>
>>> I am using ubuntu 20.04., with default packages from the repository
>>> (pacemaker, corosync, fence-agents, ipmitool, pcs...).
>>>
>>> HW watchdog is present on servers. The first problem was to load/unload the
>>> watchdog module. For some reason the module is blacklisted on ubuntu,
>>
>> What makes you think so?
>>
>> bor@bor-Latitude-E5450:~$ lsb_release -d
>> Description: Ubuntu 20.04.4 LTS
>> bor@bor-Latitude-E5450:~$ modprobe -c | grep ipmi_watchdog
>> bor@bor-Latitude-E5450:~$
>>
>>> so I've created a service for this purpose.
>>>
>>
>> man modules-load.d
>>
>>> --- file: /etc/systemd/system/watchdog.service
>>> [Unit]
>>> Description=Load watchdog timer module
>>> After=syslog.target
>>>
>>
>> Without any explicit dependencies stop will be attempted as soon as
>> possible.
>>
>>> [Service]
>>> Type=oneshot
>>> RemainAfterExit=yes
>>> ExecStart=/sbin/modprobe ipmi_watchdog
>>> ExecStop=/sbin/rmmod ipmi_watchdog
>>>
>>
>> Why on earth do you need to unload kernel driver when system reboots?
>>
>>> [Install]
>>> WantedBy=multi-user.target
>>> ---
>>>
>>> Is this a proper way to load watchdog module under ubuntu?
>>>
>>
>> There is standard way to load non-autoloaded drivers on *any* systemd
>> based distribution. Which is modules-load.d.
>>
>>> Anyway, once the module is loaded, the /dev/watchdog (which is required by
>>> 'sbd') is present.
>>> Next, the 'sbd' is installed by
>>>
>>> sudo apt install sbd
>>> (followed by one reboot to get the sbd active)
>>>
>>> The configuration of the 'sbd' is default. The sbd reacts to network failure
>>> as expected (reboots the server). However, when the 'sbd' is active, the
>>> server won't reboot normally any more. For example from the command line
>>> "sudo reboot", it gets stuck at the end of the reboot sequence. There is a
>>> message on the console:
>>>
>>> ... reboot progress
>>> [ OK ] Finished Reboot.
>>> [ OK ] Reached target Reboot.
>>> [ ... ] IPMI Watchdog: Unexpected close, not stopping watchdog!
>>> [ ... ] IPMI Watchdog: Unexpected close, not stopping watchdog!
>>> ... it gets stuck at this point
>>>
>>> After some long timeout, it looks like the watchdog timer expires and server
>>> boots, but the failure indication remains on the front panel of the server.
>>> If I uninstall the 'sbd' package, the "sudo reboot" works normally again.
>>>
>>> My question is: How do I configure the system, to have the 'sbd' function
>>> present, but still be able to reboot the system normally.
>>>
>>
>> As the first step - do not unload watchdog driver on shutdown.
>> _______________________________________________
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> ClusterLabs home: https://www.clusterlabs.org/
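PS: To make Andrei's suggestion of a looping ExecStop script concrete, here
is a minimal sketch of a synchronous stop helper. The function name
stop_and_wait and the 30 second default timeout are my own assumptions, not
anything shipped with the sbd package; treat it as a test aid only.

```shell
#!/bin/sh
# Sketch of a synchronous stop helper for testing the ExecStop race.
# Sends SIGTERM, then blocks until the process has gone away.
# Usage: stop_and_wait PID [TIMEOUT_SECONDS]
# Returns 0 once the process is gone, 1 if it is still alive after the timeout.
stop_and_wait() {
    pid="$1"
    timeout="${2:-30}"   # assumed default, adjust as needed
    # If the signal cannot be delivered, the process is already gone.
    kill -TERM "$pid" 2>/dev/null || return 0
    waited=0
    # "kill -0" only tests whether the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# When run as a script with arguments, act on them directly.
if [ "$#" -gt 0 ]; then
    stop_and_wait "$@"
fi
```

Saved under some path of your choice and made executable, it could be tried
from a drop-in for sbd.service as ExecStop=<that path> $MAINPID, in place of
the plain "kill -TERM" line quoted above.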