Re: "default" watchdog device - ?

2022-04-08 Thread Nir Soffer
On Tue, Apr 5, 2022 at 7:27 PM lejeczek  wrote:
>
>
>
> On 29/03/2022 20:25, Nir Soffer wrote:
> > On Wed, Mar 16, 2022 at 1:55 PM lejeczek  wrote:
> >>
> >>
> >> On 15/03/2022 11:21, Daniel P. Berrangé wrote:
> >>> On Tue, Mar 15, 2022 at 10:39:50AM +, lejeczek wrote:
> >>>> Hi guys.
> >>>>
> >>>> Without explicitly, manually using watchdog device for a VM, the VM (centOS
> >>>> 8 Stream 4.18.0-365.el8.x86_64) shows '/dev/watchdog' exists.
> >>>> To double check - 'dumpxml' does not show any such device - what kind of a
> >>>> 'watchdog' that is?
> >>> The kernel can always provide a pure software watchdog IIRC. It can be
> >>> useful if a userspace app wants a watchdog. The limitation is that it
> >>> relies on the kernel remaining functional, as there's no hardware
> >>> backing it up.
> >>>
> >>> Regards,
> >>> Daniel
> >> On a related note - with 'i6300esb' watchdog which I tested
> >> and I believe is working.
> >> I get often in my VMs from 'dmesg':
> >> ...
> >> watchdog: BUG: soft lockup - CPU#0 stuck for xxxs! [swapper/0:0]
> >> rcu: INFO: rcu_sched self-detected stall on CPU
> >> ...
> >> The above is from Ubuntu and CentOS alike, and when this
> >> happens, the console via VNC responds until the first 'enter',
> >> then is non-responsive.
> >> This happens after VM(s) was migrated between hosts, but
> >> anyway..
> >> I do not see what I expected from 'watchdog' - there is no
> >> action whatsoever, which should be 'reset'. VM remains in
> >> such 'frozen' state forever.
> >>
> >> any & all shared thoughts much appreciated.
> >> L.
> > You need to run some userspace tool that will open the watchdog
> > device, and pet it periodically, telling the kernel that userspace is alive.
> >
> > If this tool stops petting the watchdog, maybe because of a soft lockup
> > or other trouble, the watchdog device will reset the VM.
> >
> > watchdog(8) may be the tool you need.
> >
> > See also
> > https://www.kernel.org/doc/Documentation/watchdog/watchdog-api.rst
> >
> > Nir
> >
> I do not think that the 'i6300esb' watchdog works under those
> soft lockups; whether it's the qemu or the OS end I cannot say.
> With:
>  
> in dom xml OS sees:
> -> $ llr /dev/watchdog*
> crw---. 1 root root  10, 130 Apr  5 16:59 /dev/watchdog
> crw---. 1 root root 248,   0 Apr  5 16:59 /dev/watchdog0
> crw---. 1 root root 248,   1 Apr  5 16:59 /dev/watchdog1
> and
> -> $ wdctl
> Device:        /dev/watchdog
> Identity:      i6300ESB timer [version 0]
> Timeout:       30 seconds
> Pre-timeout:    0 seconds
> FLAG           DESCRIPTION               STATUS BOOT-STATUS
> KEEPALIVEPING  Keep alive ping reply          1           0
> MAGICCLOSE     Supports magic close char      0           0
> SETTIMEOUT     Set timeout (in seconds)       0           0
>
> If the HW watchdog worked, then 'i6300esb' should reset
> the VM when nothing is pinging the watchdog - I read that it's
> possible to exit the 'software' watchdog without causing the HW
> watchdog to take action. I do not know if that's happening here
> when I just 'systemctl stop watchdog'.
> In '/etc/watchdog.conf' I do not point to any specific
> device, which I believe makes watchdogd do its thing.
> Simple test:
> -> $ cat >> /dev/watchdog
> & pressing 'Enter' twice
> does invoke the 'reset' action, and if I am to believe 'wdctl', that
> is the HW watchdog working. But!...
> The main issue I have are those "soft lockups" where VM's OS
> becomes frozen, but nothing from the watchdog, no action -
> though, as VM is in such frozen state host shows high CPU
> for the VM.
>
> I do not do anything fancy, so I really wonder if what I see is
> that rare.
> The soft lockups occur, I think, usually (though I cannot say
> exclusively) during or after VM live-migration.
>
> thanks, L.

On my fedora 35 vm, I see that /dev/watchdog0 is the right device:

# wdctl
Device:        /dev/watchdog0
Identity:      i6300ESB timer [version 0]
Timeout:       30 seconds
Pre-timeout:    0 seconds
FLAG           DESCRIPTION               STATUS BOOT-STATUS
KEEPALIVEPING  Keep alive ping reply          1           0
MAGICCLOSE     Supports magic close char      0           0
SETTIMEOUT     Set timeout (in seconds)       0           0
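
To see which /dev/watchdogN node belongs to which driver without opening any of them (opening a node arms its timer), the identity strings can also be read from sysfs. This is only a rough sketch: it assumes the guest kernel exposes /sys/class/watchdog/watchdogN/identity (watchdog sysfs support enabled), and list_watchdogs is a helper name made up for this example.

```python
from pathlib import Path


def list_watchdogs(sysfs_root="/sys/class/watchdog"):
    """Return {device_name: identity} for each registered watchdog.

    Only sysfs is read; /dev/watchdogN is never opened, so no timer is armed.
    sysfs_root is a parameter only so the function can be pointed at a
    test directory (assumption made for this sketch).
    """
    result = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return result
    for entry in sorted(root.iterdir()):
        try:
            result[entry.name] = (entry / "identity").read_text().strip()
        except OSError:
            # The identity attribute is missing or unreadable.
            result[entry.name] = None
    return result
```

Calling list_watchdogs() in the guest and printing the result should show which node carries the i6300ESB identity, so that watchdog-device in /etc/watchdog.conf can point at it.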

I tested this script:

# cat watchdog-test.py
import os
import time

# Opening the device arms the watchdog timer.
fd = os.open("/dev/watchdog0", os.O_WRONLY)

print("Opened /dev/watchdog0")

# Never write to fd: the watchdog is not petted, so it should fire.
for i in range(1, 120):
    time.sleep(1)
    print(i)

# python3 watchdog-test.py
Opened /dev/watchdog0
1
2
3
...
30

The VM was reset after 30 seconds, showing that the hardware watchdog works.

I also tested the watchdog package, with this configuration:

# cat /etc/watchdog.conf
...
watchdog-device = /dev/watchdog0

Then starting the service:

# systemctl status watchdog
● watchdog.service - watchdog daemon
 Loaded: loaded (/usr/lib/systemd/system/watchdog.service; enabled; vendor preset: disabled)
 Active: active (running) since Fri 2022-04-08 23:23:54 

Re: "default" watchdog device - ?

2022-03-29 Thread Nir Soffer
On Wed, Mar 16, 2022 at 1:55 PM lejeczek  wrote:
>
>
>
> On 15/03/2022 11:21, Daniel P. Berrangé wrote:
> > On Tue, Mar 15, 2022 at 10:39:50AM +, lejeczek wrote:
> >> Hi guys.
> >>
> >> Without explicitly, manually using watchdog device for a VM, the VM (centOS
> >> 8 Stream 4.18.0-365.el8.x86_64) shows '/dev/watchdog' exists.
> >> To double check - 'dumpxml' does not show any such device - what kind of a
> >> 'watchdog' that is?
> > The kernel can always provide a pure software watchdog IIRC. It can be
> > useful if a userspace app wants a watchdog. The limitation is that it
> > relies on the kernel remaining functional, as there's no hardware
> > backing it up.
> >
> > Regards,
> > Daniel
> On a related note - with 'i6300esb' watchdog which I tested
> and I believe is working.
> I get often in my VMs from 'dmesg':
> ...
> watchdog: BUG: soft lockup - CPU#0 stuck for xxxs! [swapper/0:0]
> rcu: INFO: rcu_sched self-detected stall on CPU
> ...
> The above is from Ubuntu and CentOS alike, and when this
> happens, the console via VNC responds until the first 'enter',
> then is non-responsive.
> This happens after VM(s) was migrated between hosts, but
> anyway..
> I do not see what I expected from 'watchdog' - there is no
> action whatsoever, which should be 'reset'. VM remains in
> such 'frozen' state forever.
>
> any & all shared thoughts much appreciated.
> L.

You need to run some userspace tool that will open the watchdog
device, and pet it periodically, telling the kernel that userspace is alive.

If this tool stops petting the watchdog, maybe because of a soft lockup
or other trouble, the watchdog device will reset the VM.

watchdog(8) may be the tool you need.

See also
https://www.kernel.org/doc/Documentation/watchdog/watchdog-api.rst
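
For illustration, the petting described above could look roughly like this. It is a minimal sketch, not what watchdog(8) actually does: the function name pet_watchdog, the /dev/watchdog0 default, the 10 second interval and the injectable sleep parameter are all assumptions made for the example. The interval has to stay below the device timeout.

```python
import os
import time


def pet_watchdog(device="/dev/watchdog0", interval=10, rounds=6,
                 sleep=time.sleep):
    """Open the watchdog, pet it `rounds` times, then disarm and close.

    Opening the device arms the timer; every write restarts the countdown.
    """
    fd = os.open(device, os.O_WRONLY)
    try:
        for _ in range(rounds):
            os.write(fd, b"\0")  # any write counts as a keepalive
            sleep(interval)
        # "Magic close": writing 'V' right before close asks the driver to
        # stop the timer, if it supports MAGICCLOSE and nowayout is not set.
        os.write(fd, b"V")
    finally:
        # If the loop fails, we close without 'V', so the timer keeps
        # running and the watchdog is expected to fire.
        os.close(fd)
```

A real daemon would loop forever and pet only after checking that the system is healthy; if the process or the kernel gets stuck, the writes stop and the device resets the VM once the timeout expires.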

Nir



Re: "default" watchdog device - ?

2022-03-15 Thread Daniel P . Berrangé
On Tue, Mar 15, 2022 at 10:39:50AM +, lejeczek wrote:
> Hi guys.
> 
> Without explicitly, manually using watchdog device for a VM, the VM (centOS
> 8 Stream 4.18.0-365.el8.x86_64) shows '/dev/watchdog' exists.
> To double check - 'dumpxml' does not show any such device - what kind of a
> 'watchdog' that is?

The kernel can always provide a pure software watchdog IIRC. It can be
useful if a userspace app wants a watchdog. The limitation is that it
relies on the kernel remaining functional, as there's no hardware
backing it up.

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|