On Tue, Jun 7, 2022 at 7:53 AM Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:
>
> >>> Andrei Borzenkov <arvidj...@gmail.com> wrote on 03.06.2022 at 17:04 in
> message <99f7746a-c962-33bb-6737-f88ba0128...@gmail.com>:
> > On 03.06.2022 16:51, Zoran Bošnjak wrote:
> >> Thanks for all your answers. Sorry, my mistake. The ipmi_watchdog is indeed
> >> OK. I was first experimenting with "softdog", which is blacklisted. So the
> >> reasonable question is how to properly start "softdog" on ubuntu.
> >>
> >
> > blacklist prevents autoloading of modules by alias during hardware
> > detection. Neither softdog nor ipmi_watchdog has any alias, so they
> > cannot be autoloaded and blacklist is irrelevant here.
> >
> >> The reason to unload watchdog module (ipmi or softdog) is that there seems
> >> to be a difference between normal reboot and watchdog reboot.
> >> In case of ipmi watchdog timer reboot:
> >> - the system hangs at the end of reboot cycle for some time
> >> - restart seems to be harder (like power off/on cycle), BIOS runs more
> >> diagnostics at startup
>
> maybe kdump is enabled in that case?
>
> >> - it turns on HW diagnostic indication on the server front panel (dell
> >> server) which stays on forever
> >> - it logs the event to IDRAC, which is unnecessary, because it was not a
> >> hardware event, but just a normal reboot
>
> If the hardware watchdog times out and fires, it is considered to be an
> exceptional event that will be logged and reported.
>
> >>
> >> In case of "sudo reboot" command, I would like to skip this... so the idea
> >> is to fully stop the watchdog just before reboot. I am not sure how to do
> >> this properly.
> >>
> >> The "softdog" is better in this respect. It does not trigger anything from
> >> the list above, but I still get the message during reboot
> >> [ ... ] watchdog: watchdog0: watchdog did not stop!
> >> ... with some small timeout.
> >>
> >
> > The first obvious question - is there only one watchdog? Some watchdog
> > drivers *are* autoloaded.
> >
> > Is there only one user of watchdog? systemd may use it too as example.
>
> Don't mix timers with a watchdog: It makes little sense to have multiple
> watchdogs enabled IMHO.

Yep, that is an issue at the moment. When you have multiple users of a
hardware watchdog (watchdog-daemon, sbd, corosync, systemd, ...), I'm not
aware of an implementation that would provide multiple watchdog timers with
the usual char-device interface out of one physical device. Of course this
should be relatively easy to implement, even in user space.

On our embedded devices we usually had something like a service that would
offer multiple timers to other instances. The service itself was guarded by
a hardware watchdog, so that the derived timers would be as reliable as a
hardware watchdog. The last implementation was built into watchdog-daemon
and offered a D-Bus interface.

What systemd has implemented is similarly interesting. However, the current
systemd implementation has a suspicious loop around it that prevents it from
being fit for sbd purposes, as it doesn't guarantee a reboot within a
reasonably short time. This is why I haven't yet implemented the systemd
file-descriptor approach in sbd (as a configurable alternative to going for
the device directly). Approaching the systemd guys and asking why it is
implemented the way it is has been on my todo list for a while now.

If you are running multiple services on a host that don't offer something
like a common supervision main loop, it may make sense to offer a common
instance that provides something like a watchdog service. For a node that
has all services under pacemaker control this shouldn't be needed, as we
have sbd observing pacemakerd. Pacemakerd in turn observes the other
pacemaker subdaemons (released with RHEL-8.6 and, iirc, 2.1.3 upstream),
guaranteeing that the monitors on the resources don't get stuck.
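
To make the idea of deriving several timers from one physical watchdog a bit
more concrete, here is a minimal Python sketch. It is not the watchdog-daemon
implementation mentioned above and not anything sbd or systemd ships: the
class name WatchdogMultiplexer, its methods and the pet_hardware callback are
all hypothetical and only illustrate the principle that the hardware is
petted solely while every registered client is on time.

```python
import time


class WatchdogMultiplexer:
    """Derive several software timers from one hardware watchdog (sketch).

    All names here are hypothetical; pet_hardware is whatever callable
    kicks the real device in a given setup.
    """

    def __init__(self, pet_hardware, clock=time.monotonic):
        self._pet_hardware = pet_hardware
        self._clock = clock
        self._timers = {}  # client name -> (timeout in seconds, last kick time)

    def register(self, name, timeout):
        """Add a client timer; it counts as freshly kicked on registration."""
        self._timers[name] = (timeout, self._clock())

    def kick(self, name):
        """Called by a client to signal that it is still alive."""
        timeout, _ = self._timers[name]
        self._timers[name] = (timeout, self._clock())

    def expired(self):
        """Return the sorted names of clients that missed their deadline."""
        now = self._clock()
        return sorted(
            name
            for name, (timeout, last) in self._timers.items()
            if now - last > timeout
        )

    def tick(self):
        """Pet the hardware only while every client is on time."""
        if self.expired():
            return False  # stop petting, so the hardware watchdog fires
        self._pet_hardware()
        return True
```

A supervising loop would call tick() at an interval shorter than the hardware
timeout. Once any client stops kicking, the loop stops petting and the
physical watchdog takes the node down, which is what makes the derived timers
as reliable as the hardware one.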

Klaus

> >
> >> So after some additional testing, the situation is the following:
> >>
> >> - without any watchdog and without sbd package, the server reboots
> >> normally
> >> - with "softdog" module loaded, I only get "watchdog did not stop message"
> >> at reboot
> >> - with "softdog" loaded, but unloaded with "ExecStop=...rmmod", reboot is
> >> normal again
> >> - same as above, but with "sbd" package loaded, I am getting "watchdog did
> >> not stop message" again
> >> - switching from "softdog" to "ipmi_watchdog" gets me to the original list
> >> of problems
> >>
> >> It looks like the "sbd" is preventing the watchdog to close, so that
> >> watchdog triggers always, even in the case of normal reboot. What am I
> >> missing here?
>
> The watchdog may have a "no way out" parameter that prevents disabling it
> after enabled once.
>
> >
> > While the only way I can reproduce it on my QEMU VM is "reboot -f"
> > (without stopping all services), there is certainly a race condition in
> > sbd.service.
> >
> > ExecStop=@bindir@/kill -TERM $MAINPID
> >
> >
> > systemd will continue as soon as "kill" completes without waiting for
> > sbd to actually stop. It means systemd may complete shutdown sequence
> > before sbd had chance to react on signal and then simply kill it. Which
> > leaves watchdog armed.
> >
> > For test purpose try to use script that loops until sbd is actually
> > stopped for ExecStop.
> >
> > Note that systemd strongly recommends to use synchronous command for
> > ExecStop (we may argue that this should be handled by service manager
> > itself, but well ...).
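
For testing the race described above, a synchronous stop helper could look
roughly like the following Python sketch. It only shows the "loop until the
process is actually gone" part; the function names, the default timeout and
the polling interval are assumptions made for illustration, and this is not
how sbd.service is shipped.

```python
import os
import time


def pid_alive(pid):
    """Return True while a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 checks existence without delivering a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def wait_until_stopped(pid, timeout=10.0, interval=0.1, alive=pid_alive):
    """Poll until the process is gone; return False if the timeout expires first."""
    deadline = time.monotonic() + timeout
    while alive(pid):
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
    return True
```

A hypothetical ExecStop wrapper would send SIGTERM to the main PID and then
call wait_until_stopped() on it, so that the stop command only returns once
sbd has had a chance to close the watchdog device.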
> >
> >>
> >> Zoran
> >>
> >> ----- Original Message -----
> >> From: "Andrei Borzenkov" <arvidj...@gmail.com>
> >> To: "users" <users@clusterlabs.org>
> >> Sent: Friday, June 3, 2022 11:24:03 AM
> >> Subject: Re: [ClusterLabs] normal reboot with active sbd does not work
> >>
> >> On 03.06.2022 11:18, Zoran Bošnjak wrote:
> >>> Hi all,
> >>> I would appreciate an advice about sbd fencing (without shared storage).
> >>>
> >>> I am using ubuntu 20.04., with default packages from the repository
> >>> (pacemaker, corosync, fence-agents, ipmitool, pcs...).
> >>>
> >>> HW watchdog is present on servers. The first problem was to load/unload the
> >>> watchdog module. For some reason the module is blacklisted on ubuntu,
> >>
> >> What makes you think so?
> >>
> >> bor@bor-Latitude-E5450:~$ lsb_release -d
> >>
> >> Description: Ubuntu 20.04.4 LTS
> >>
> >> bor@bor-Latitude-E5450:~$ modprobe -c | grep ipmi_watchdog
> >>
> >> bor@bor-Latitude-E5450:~$
> >>
> >>
> >>> so I've created a service for this purpose.
> >>>
> >>
> >> man modules-load.d
> >>
> >>
> >>> --- file: /etc/systemd/system/watchdog.service
> >>> [Unit]
> >>> Description=Load watchdog timer module
> >>> After=syslog.target
> >>>
> >>
> >> Without any explicit dependencies stop will be attempted as soon as
> >> possible.
> >>
> >>> [Service]
> >>> Type=oneshot
> >>> RemainAfterExit=yes
> >>> ExecStart=/sbin/modprobe ipmi_watchdog
> >>> ExecStop=/sbin/rmmod ipmi_watchdog
> >>>
> >>
> >> Why on earth do you need to unload kernel driver when system reboots?
> >>
> >>> [Install]
> >>> WantedBy=multi-user.target
> >>> ---
> >>>
> >>> Is this a proper way to load watchdog module under ubuntu?
> >>>
> >>
> >> There is standard way to load non-autoloaded drivers on *any* systemd
> >> based distribution. Which is modules-load.d.
> >>
> >>> Anyway, once the module is loaded, the /dev/watchdog (which is required by
> >>> 'sbd') is present.
> >>> Next, the 'sbd' is installed by
> >>>
> >>> sudo apt install sbd
> >>> (followed by one reboot to get the sbd active)
> >>>
> >>> The configuration of the 'sbd' is default. The sbd reacts to network failure
> >>> as expected (reboots the server). However, when the 'sbd' is active, the
> >>> server won't reboot normally any more. For example from the command line
> >>> "sudo reboot", it gets stuck at the end of the reboot sequence. There is a
> >>> message on the console:
> >>>
> >>> ... reboot progress
> >>> [ OK ] Finished Reboot.
> >>> [ OK ] Reached target Reboot.
> >>> [ ... ] IPMI Watchdog: Unexpected close, not stopping watchdog!
> >>> [ ... ] IPMI Watchdog: Unexpected close, not stopping watchdog!
> >>> ... it gets stuck at this point
> >>>
> >>> After some long timeout, it looks like the watchdog timer expires and server
> >>> boots, but the failure indication remains on the front panel of the server.
> >>> If I uninstall the 'sbd' package, the "sudo reboot" works normally again.
> >>>
> >>> My question is: How do I configure the system, to have the 'sbd' function
> >>> present, but still be able to reboot the system normally.
> >>>
> >>
> >> As the first step - do not unload watchdog driver on shutdown.
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/