On Thu, Feb 17, 2022 at 11:58 AM Nir Soffer <nsof...@redhat.com> wrote:
>
> On Thu, Feb 17, 2022 at 11:20 AM Pablo Olivera <p.oliv...@telfy.com> wrote:
> >
> > Hi Nir,
> >
> > Thank you very much for your detailed explanations.
> >
> > The pid 6398 looks like it's HostedEngine:
> >
> > audit/audit.log:type=VIRT_CONTROL msg=audit(1644587639.935:7895): pid=3629
> > uid=0 auid=4294967295 ses=4294967295
> > subj=system_u:system_r:virtd_t:s0-s0:c0.c1023 msg='virt=kvm op=start
> > reason=booted vm="HostedEngine" uuid=37a75c8e-50a2-4abd-a887-8a62a75814cc
> > vm-pid=6398 exe="/usr/sbin/libvirtd" hostname=? addr=? terminal=?
> > res=success'UID="root" AUID="unset"
> >
> > So, I understand that sanlock has problems with the storage (it loses
> > the connection with the NFS storage). The watchdog begins to check
> > connectivity with the VM and, after the established time, issues the
> > order to reboot the machine.
> >
> > I don't know if I can somehow increase these timeouts, or make sanlock
> > force a reconnection or renewal with the storage, and in this way avoid
> > host reboots for this reason.
>
> You can do one of these:
> 1. Use lower timeouts on the NFS mount, so the NFS request fails at the
>    same time the sanlock lease times out.
> 2. Use a larger sanlock timeout, so the sanlock lease times out when the
>    NFS server times out.
> 3. Both 1 and 2
>
> The problem is that NFS timeouts are not predictable. In the past we used
> "timeo=600,retrans=6", which can lead to a 21 minute timeout, but in
> practice we saw timeouts of up to 30 minutes.
>
> In
> https://github.com/oVirt/vdsm/commit/672a98bbf3e55d1077669f06c37305185fbdc289
> we changed this to the recommended setting:
> "timeo=100,retrans=3"
>
> According to the docs, this should fail in 60 seconds if all retries
> fail, but in practice we saw timeouts of up to 270 seconds with this
> setting, which does not play well with sanlock.
>
> We assumed that the timeout value should not be less than the sanlock io
> timeout (10 seconds), but I'm not sure this assumption is correct.
>
> You can set smaller timeout values in the engine storage domain
> "custom connection parameters":
> - Retransmissions - mapped to the "retrans" mount option
> - Timeout (deciseconds) - mapped to the "timeo" mount option
>
> For example:
> Retransmissions: 3
> Timeout: 50 (50 deciseconds, 5 seconds)
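>
> For example, to verify that the options were applied, you can check the
> mount on a host. The server name and export path below are just
> placeholders, and the output is abbreviated and wrapped:
>
> $ mount | grep /rhev/data-center/mnt
> # example output - your server name and export path will differ:
> nfs1:/export/data on /rhev/data-center/mnt/nfs1:_export_data type nfs4
>     (rw,relatime,...,timeo=50,retrans=3,...)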
>
> Theoretically this will behave like this:
>
> 00:00 retry 1 (5 second timeout)
> 00:05 retry 2 (10 second timeout)
> 00:15 retry 3 (15 second timeout)
> 00:30 request fails
>
> But based on what we see with the defaults, this is likely to take more
> time. If it fails before 140 seconds, the VM will be killed and the host
> will not reboot.
>
> The other way is to increase the sanlock timeout in the vdsm
> configuration. Note that changing the sanlock timeout also requires
> changing other settings (e.g. spm:watchdog_interval).
>
> Add this file on all hosts:
>
> $ cat /etc/vdsm/vdsm.conf.d/99-local.conf
> [spm]
>
> # If enabled, monitor the SPM lease status and panic if the lease
> # status is not expected. The SPM host will lose the SPM role, and
> # engine will select a new SPM host. (default true)
> # watchdog_enable = true
>
> # Watchdog check interval in seconds. The recommended value is
> # sanlock:io_timeout * 2. (default 20)
> watchdog_interval = 40
>
> [sanlock]
>
> # I/O timeout in seconds. All sanlock timeouts are computed based on
> # this value. Using a larger timeout will make VMs more resilient to
> # short storage outages, but will increase VM failover time and the time
> # to acquire a host id. For more info on sanlock timeouts please check
> # the sanlock source:
> # https://pagure.io/sanlock/raw/master/f/src/timeouts.h. If your
> # storage requires larger timeouts, you can increase the value to 15
> # or 20 seconds. If you change this you also need to update the multipath
> # no_path_retry option. For more info on configuring multipath please
> # check /etc/multipath.conf. oVirt is tested only with the default value
> # (10 seconds).
> io_timeout = 20
>
> You can check https://github.com/oVirt/vdsm/blob/master/doc/io-timeouts.md
> to learn more about sanlock timeouts.
>
> Alternatively, you can make a small change in the NFS timeout and a small
> change in the sanlock timeout to make them work better together.
>
> All this is of course to handle the case when the NFS server is not
> accessible, but this is something that should not happen in a healthy
> cluster. You need to check why the server was not accessible and fix that
> problem.
>
> Nir
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/5MSXZ6PCKQFTMCC3KIFJJWZJXAKCPIAP/