On Tue, Apr 5, 2022 at 7:27 PM lejeczek wrote:
>
>
>
> On 29/03/2022 20:25, Nir Soffer wrote:
> > On Wed, Mar 16, 2022 at 1:55 PM lejeczek wrote:
> >>
> >>
> >> On 15/03/2022 11:21, Daniel P. Berrangé wrote:
> >>> On Tue, Mar 15, 2022 at 10:39:50AM +, lejeczek wrote:
> Hi guys.
>
> Without explicitly, manually using a watchdog device for a VM, the VM
> (CentOS 8 Stream 4.18.0-365.el8.x86_64) shows that '/dev/watchdog' exists.
> To double check - 'dumpxml' does not show any such device - what kind of
> 'watchdog' is that?
> >>> The kernel can always provide a pure software watchdog IIRC. It can be
> >>> useful if a userspace app wants a watchdog. The limitation is that it
> >>> relies on the kernel remaining functional, as there's no hardware
> >>> backing it up.
> >>>
> >>> Regards,
> >>> Daniel
> >> On a related note - with 'i6300esb' watchdog which I tested
> >> and I believe is working.
> >> I often get this in 'dmesg' in my VMs:
> >> ...
> >> watchdog: BUG: soft lockup - CPU#0 stuck for xxxs! [swapper/0:0]
> >> rcu: INFO: rcu_sched self-detected stall on CPU
> >> ...
> >> This above is from Ubuntu and CentOS alike and when this
> >> happens, console via VNC responds until the first 'enter',
> >> then is non-responsive.
> >> This happens after VM(s) was migrated between hosts, but
> >> anyway..
> >> I do not see what I expected from 'watchdog' - there is no
> >> action whatsoever, which should be 'reset'. VM remains in
> >> such 'frozen' state forever.
> >>
> >> any & all shared thoughts much appreciated.
> >> L.
> > You need to run some userspace tool that will open the watchdog
> > device, and pet it periodically, telling the kernel that userspace is alive.
> >
> > If this tool stops petting the watchdog, maybe because of a soft lockup
> > or other trouble, the watchdog device will reset the VM.
> >
> > watchdog(8) may be the tool you need.
> >
> > See also
> > https://www.kernel.org/doc/Documentation/watchdog/watchdog-api.rst
> >
> > Nir
> >
> I do not think that the 'i6300esb' watchdog works under those
> soft-lockups; whether it's the qemu or the OS end I cannot say.
> With:
>
> in dom xml OS sees:
> -> $ llr /dev/watchdog*
> crw---. 1 root root 10, 130 Apr 5 16:59 /dev/watchdog
> crw---. 1 root root 248, 0 Apr 5 16:59 /dev/watchdog0
> crw---. 1 root root 248, 1 Apr 5 16:59 /dev/watchdog1
> and
> -> $ wdctl
> Device:        /dev/watchdog
> Identity:      i6300ESB timer [version 0]
> Timeout:       30 seconds
> Pre-timeout:   0 seconds
> FLAG           DESCRIPTION                STATUS  BOOT-STATUS
> KEEPALIVEPING  Keep alive ping reply           1            0
> MAGICCLOSE     Supports magic close char       0            0
> SETTIMEOUT     Set timeout (in seconds)        0            0
>
> If it worked, the HW watchdog, then 'i6300esb' should reset
> the VM if nothing is pinging the watchdog - I read that it's
> possible to exit the 'software' watchdog without causing the HW
> watchdog to take action. I do not know if that's happening here
> when I just 'systemctl stop watchdog'.
> In '/etc/watchdog.conf' I do not point to any specific
> device, which I believe makes watchdogd do its things.
> Simple test:
> -> $ cat >> /dev/watchdog
> & 'Enter' press twice
> does invoke the 'reset' action, and going by 'wdctl' I believed
> that the HW watchdog is working. But!...
> The main issue I have is those "soft lockups" where the VM's OS
> becomes frozen, but there is nothing from the watchdog, no action -
> though, while the VM is in such a frozen state, the host shows high
> CPU for the VM.
>
> I do not do anything fancy so I really wonder if what I see is
> that rare.
> Soft lockups occur, I think, usually (though I cannot say
> exclusively) during or after VM live-migration.
>
> thanks, L.
On my Fedora 35 VM, I see that /dev/watchdog0 is the right device:
# wdctl
Device:        /dev/watchdog0
Identity:      i6300ESB timer [version 0]
Timeout:       30 seconds
Pre-timeout:   0 seconds
FLAG           DESCRIPTION                STATUS  BOOT-STATUS
KEEPALIVEPING  Keep alive ping reply           1            0
MAGICCLOSE     Supports magic close char       0            0
SETTIMEOUT     Set timeout (in seconds)        0            0
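
With several /dev/watchdog* nodes in the guest, it can help to check which
driver is behind each one. A minimal sketch, assuming the guest kernel
exposes the watchdog sysfs attributes (CONFIG_WATCHDOG_SYSFS) under
/sys/class/watchdog; the function name watchdog_identities is made up for
this example and is not part of any package:

```python
from pathlib import Path


def watchdog_identities(sysfs_root="/sys/class/watchdog"):
    """Map each watchdog name (e.g. 'watchdog0') to its identity string."""
    result = {}
    for entry in sorted(Path(sysfs_root).glob("watchdog*")):
        identity = entry / "identity"
        # The attribute is only present when the kernel exposes it in sysfs.
        if identity.is_file():
            result[entry.name] = identity.read_text().strip()
    return result
```

Calling watchdog_identities() in the guest should return one entry per
registered watchdog, so the node whose identity matches the "i6300ESB timer"
string reported by wdctl is the one to configure.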
I tested this script:
# cat watchdog-test.py
import os
import time

# Opening the device arms the watchdog. The script never writes to it,
# so the timer is expected to expire.
fd = os.open("/dev/watchdog0", os.O_WRONLY)
print("Opened /dev/watchdog0")

# Count the seconds until the VM is reset.
for i in range(1, 120):
    time.sleep(1)
    print(i)
# python3 watchdog-test.py
Opened /dev/watchdog0
1
2
3
...
30
The VM was reset after 30 seconds, showing that the hardware watchdog works.
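
For comparison, here is a minimal sketch of the opposite case: a process that
keeps petting the device and then closes it cleanly. It assumes the behavior
described in the kernel watchdog API document (any write counts as a
keepalive, and writing 'V' just before close disarms the timer when the
driver supports magic close and "nowayout" is not set). The function name
pet_watchdog and its parameters are invented for this example:

```python
import os
import time


def pet_watchdog(path, interval, count):
    """Open the watchdog at `path`, ping it `count` times, then close cleanly."""
    fd = os.open(path, os.O_WRONLY)  # opening the device arms the timer
    try:
        for _ in range(count):
            os.write(fd, b"\0")      # any write counts as a keepalive ping
            time.sleep(interval)     # must stay below the watchdog timeout
        os.write(fd, b"V")           # magic close: ask the driver to disarm
    finally:
        # If an error skipped the 'V' write, the timer stays armed on purpose.
        os.close(fd)
```

With the 30 second timeout shown by wdctl, something like
pet_watchdog("/dev/watchdog0", 10, 6) should keep the VM alive for about a
minute and leave the watchdog disarmed afterwards. If the loop stops without
reaching the 'V' write, the reset is expected, as in the test above.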
I also tested the watchdog package, with this configuration:
# cat /etc/watchdog.conf
...
watchdog-device = /dev/watchdog0
Then, after starting the service:
# systemctl status watchdog
● watchdog.service - watchdog daemon
     Loaded: loaded (/usr/lib/systemd/system/watchdog.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2022-04-08 23:23:54 I