On Thu, Feb 17, 2022 at 11:58 AM Nir Soffer <nsof...@redhat.com> wrote:
>
> On Thu, Feb 17, 2022 at 11:20 AM Pablo Olivera <p.oliv...@telfy.com> wrote:
> >
> > Hi Nir,
> >
> > Thank you very much for your detailed explanations.
> >
> > The pid 6398 looks like it's HostedEngine:
> >
> > audit/audit.log:type=VIRT_CONTROL msg=audit(1644587639.935:7895): pid=3629
> > uid=0 auid=4294967295 ses=4294967295
> > subj=system_u:system_r:virtd_t:s0-s0:c0.c1023 msg='virt=kvm op=start
> > reason=booted vm="HostedEngine" uuid=37a75c8e-50a2-4abd-a887-8a62a75814cc
> > vm-pid=6398 exe="/usr/sbin/libvirtd" hostname=? addr=? terminal=?
> > res=success'UID="root" AUID="unset"
> >
> > So, I understand that sanlock has problems with the storage (it loses
> > the connection with the NFS storage). The watchdog begins to check
> > connectivity with the VM and, after the established time, issues the
> > order to reboot the machine.
> >
> > I don't know if I can somehow increase these timeouts, or make sanlock
> > force a reconnection or renewal with the storage, and in this way avoid
> > host reboots for this reason.
>
> You can do one of these:
> 1. Use lower timeouts on the NFS mount, so the NFS request fails at the
>    same time the sanlock lease times out.
> 2. Use a larger sanlock timeout, so the sanlock lease times out when the
>    NFS server times out.
> 3. Both 1 and 2
>
> The problem is that NFS timeouts are not predictable. In the past we used
> "timeo=600,retrans=6", which can lead to a 21 minute timeout, but in
> practice we saw timeouts of up to 30 minutes.
>
> In
> https://github.com/oVirt/vdsm/commit/672a98bbf3e55d1077669f06c37305185fbdc289
> we changed this to the recommended setting:
> "timeo=100,retrans=3"
>
> According to the docs, this should fail in 60 seconds if all retries
> fail, but in practice we saw timeouts of up to 270 seconds with this
> setting, which does not play well with sanlock.
>
> We assumed that the timeout value should not be less than the sanlock io
> timeout (10 seconds), but I'm not sure this assumption is correct.
>
> You can set smaller timeout values in the engine storage domain
> "custom connection parameters":
> - Retransmissions - mapped to the "retrans" mount option
> - Timeout (deciseconds) - mapped to the "timeo" mount option
>
> For example:
> Retransmissions: 3
> Timeout: 50 (50 deciseconds, 5 seconds)
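>
> For example, to verify that the options were applied, you can check the
> mount on a host. The server name and export path below are just
> placeholders, and the output is abbreviated and wrapped:
>
> $ mount | grep /rhev/data-center/mnt
> # example output - your server name and export path will differ:
> nfs1:/export/data on /rhev/data-center/mnt/nfs1:_export_data type nfs4
>     (rw,relatime,...,timeo=50,retrans=3,...)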
>
> Theoretically this will behave like this:
>
> 00:00 retry 1 (5 second timeout)
> 00:05 retry 2 (10 second timeout)
> 00:15 retry 3 (15 second timeout)
> 00:30 request fails
>
> But based on what we see with the defaults, this is likely to take more
> time. If it fails before 140 seconds, the VM will be killed and the host
> will not reboot.
>
> The other way is to increase the sanlock timeout in the vdsm
> configuration. Note that changing the sanlock timeout also requires
> changing other settings (e.g. spm:watchdog_interval).
>
> Add this file on all hosts:
>
> $ cat /etc/vdsm/vdsm.conf.d/99-local.conf
> [spm]
>
> # If enabled, monitor the SPM lease status and panic if the lease
> # status is not expected. The SPM host will lose the SPM role, and
> # engine will select a new SPM host. (default true)
> # watchdog_enable = true
>
> # Watchdog check interval in seconds. The recommended value is
> # sanlock:io_timeout * 2. (default 20)
> watchdog_interval = 40
>
> [sanlock]
>
> # I/O timeout in seconds. All sanlock timeouts are computed based on
> # this value. Using a larger timeout will make VMs more resilient to
> # short storage outages, but will increase VM failover time and the time
> # to acquire a host id. For more info on sanlock timeouts please check
> # the sanlock source:
> # https://pagure.io/sanlock/raw/master/f/src/timeouts.h. If your
> # storage requires larger timeouts, you can increase the value to 15
> # or 20 seconds. If you change this you also need to update the multipath
> # no_path_retry option. For more info on configuring multipath please
> # check /etc/multipath.conf. oVirt is tested only with the default value
> # (10 seconds).
> io_timeout = 20
>
> You can check https://github.com/oVirt/vdsm/blob/master/doc/io-timeouts.md
> to learn more about sanlock timeouts.
>
> Alternatively, you can make a small change in the NFS timeout and a small
> change in the sanlock timeout to make them work better together.
>
> All this is of course to handle the case when the NFS server is not
> accessible, but this is something that should not happen in a healthy
> cluster. You need to check why the server was not accessible and fix that
> problem.
>
> Nir
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/5MSXZ6PCKQFTMCC3KIFJJWZJXAKCPIAP/