[ovirt-users] Re: Host reboots when network switch goes down

2021-09-29 Thread Strahil Nikolov via Users
Tinkering with timeouts could be risky, so in case you can't have a second
switch, your solution (shutting down all VMs, maintenance mode, etc.) should be
the safest.
If possible, test it first on a cluster built from VMs, so you get used to the
whole procedure.

Best Regards,
Strahil Nikolov
 
 
On Wed, Sep 29, 2021 at 16:16, cen wrote:

On 29. 09. 21 13:31, Vojtech Juranek wrote:
> this is possible, but changing sanlock timeouts can be very tricky and can
> have unwanted/unexpected consequences, so be very careful. Here is a guideline
> on how to do it:
>
> https://github.com/oVirt/vdsm/blob/master/doc/io-timeouts.md

Thank you for your feedback, this seems to be exactly what is happening.

After reading the doc, my gut feeling tells me it would be smarter to 
shut down our VMs, go into maintenance mode and then perform any switch 
upgrades/reboots instead of trying to tweak the timeouts to survive a 
possible 3min+ reboot. We don't have any serious uptime requirements so 
this seems like the easiest and safest way forward.


Best regards,

cen
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/3TYARHX4QMALGVGF7DWUJV7LRC7LVJWP/


[ovirt-users] Re: Host reboots when network switch goes down

2021-09-29 Thread cen

On 29. 09. 21 13:31, Vojtech Juranek wrote:

this is possible, but changing sanlock timeouts can be very tricky and can
have unwanted/unexpected consequences, so be very careful. Here is a guideline
on how to do it:

https://github.com/oVirt/vdsm/blob/master/doc/io-timeouts.md


Thank you for your feedback, this seems to be exactly what is happening.

After reading the doc, my gut feeling tells me it would be smarter to 
shut down our VMs, go into maintenance mode and then perform any switch 
upgrades/reboots instead of trying to tweak the timeouts to survive a 
possible 3min+ reboot. We don't have any serious uptime requirements so 
this seems like the easiest and safest way forward.
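
A rough sketch of that procedure using the oVirt Python SDK (ovirtsdk4), in case
it helps someone automate it; the engine URL, credentials and host name below are
placeholders rather than values from this setup, and the same steps can of course
be done from the Administration Portal:

# Sketch only: cleanly shut down the VMs on a host and move the host to
# maintenance before a planned switch outage. Assumes the ovirtsdk4 package;
# URL, credentials and host name are placeholders.
import time
import ovirtsdk4 as sdk

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    ca_file='ca.pem',
)
try:
    system = connection.system_service()
    vms_service = system.vms_service()
    hosts_service = system.hosts_service()

    # Ask every VM running on this host to shut down cleanly.
    for vm in vms_service.list(search='host=ovirtnode02 and status=up'):
        vms_service.vm_service(vm.id).shutdown()

    # Simplified wait; a real script should poll VM status with a timeout.
    time.sleep(120)

    # With no running VMs left, put the host into maintenance.
    host = hosts_service.list(search='name=ovirtnode02')[0]
    hosts_service.host_service(host.id).deactivate()
finally:
    connection.close()

In a hosted-engine setup you would additionally want global maintenance
(hosted-engine --set-maintenance --mode=global) before touching the engine VM,
so the HA agents don't try to restart it during the outage.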



Best regards,

cen
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/JEJMMLI2WH72J4PPRKSYHAEQFTIEBPZ5/


[ovirt-users] Re: Host reboots when network switch goes down

2021-09-29 Thread Vojtech Juranek
On Wednesday, 29 September 2021 09:43:56 CEST cen wrote:
> Hi,
> 
> we are experiencing a weird issue with our Ovirt setup. We have two 
> physical hosts (DC1 and DC2) and mounted Lenovo NAS storage for all VM
> data.
>
> They are connected via a managed network switch.
> 
> What happens is that if switch goes down for whatever reason (firmware 
> update etc), physical host reboots. Not sure if this is an action 
> performed by Ovirt but I suspect it is because connection to mounted 
> storage is lost and it  performs some kind of an emergency action. I 
> would need to get some direction pointers to find out
> 
> a) who triggers the reboot and why

sanlock, or rather wdmd, because it cannot renew the lease of some HA resource (it 
renews the lease by writing to the storage) and it failed to kill the process using 
this resource (it should first try to kill the process and reboot the host only if 
that fails)

> c) a way to prevent reboots by increasing storage? timeouts

this is possible, but changing sanlock timeouts can be very tricky and can 
have unwanted/unexpected consequences, so be very careful. Here is a guideline 
on how to do it:

https://github.com/oVirt/vdsm/blob/master/doc/io-timeouts.md
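
For anyone who does go down that road, that doc boils down to raising sanlock's
I/O timeout in vdsm's configuration on every host and restarting the services in
the documented order. A minimal sketch of the sizing arithmetic only, assuming
sanlock's usual rule that a lease expires about 8 * io_timeout seconds after the
last successful renewal (the 80 s visible in the log below with the default 10 s
timeout); the outage length and the drop-in path are illustrative, so check the
doc itself before changing anything:

# Rough sizing of a sanlock io_timeout that survives a planned outage.
# Assumes lease expiry ~= 8 * io_timeout, per the io-timeouts doc above.
import math

DEFAULT_IO_TIMEOUT = 10      # seconds, the vdsm/sanlock default
LEASE_EXPIRY_FACTOR = 8      # lease expires ~8 * io_timeout after last renewal

def required_io_timeout(outage_seconds, margin_seconds=20):
    """Smallest io_timeout whose expiry window covers the outage plus a margin."""
    return math.ceil((outage_seconds + margin_seconds) / LEASE_EXPIRY_FACTOR)

switch_reboot = 180          # the 2-3 minute switch reboot from this thread
print("default expiry window:", DEFAULT_IO_TIMEOUT * LEASE_EXPIRY_FACTOR, "s")  # 80 s
print("io_timeout needed for a %ds outage: %ds"
      % (switch_reboot, required_io_timeout(switch_reboot)))                    # 25 s

# The chosen value would then go into vdsm configuration on every host, e.g. a
# drop-in such as /etc/vdsm/vdsm.conf.d/99-local.conf containing:
#     [sanlock]
#     io_timeout = 25
# followed by restarting the storage stack in the order the doc prescribes.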


> Switch reboot takes 2-3 minutes.
> 
> 
> These are the host /var/log/messages just before reboot occurs:
> 
> Sep 28 16:20:00 ovirtnode02 sanlock[10993]: 2021-09-28 16:20:00 7690984 
> [10993]: s11 check_our_lease warning 72 last_success 7690912
> Sep 28 16:20:00 ovirtnode02 sanlock[10993]: 2021-09-28 16:20:00 7690984 
> [10993]: s3 check_our_lease warning 76 last_success 7690908
> Sep 28 16:20:00 ovirtnode02 sanlock[10993]: 2021-09-28 16:20:00 7690984 
> [10993]: s1 check_our_lease warning 68 last_success 7690916
> Sep 28 16:20:00 ovirtnode02 sanlock[10993]: 2021-09-28 16:20:00 7690984 
> [27983]: s11 delta_renew read timeout 10 sec offset 0 
> /var/run/vdsm/storage/15514c65-5d45-4ba7-bcd4-cc772351c940/fce598a8-11c3-44f9-8aaf-8712c96e00ce/65413499-6970-4a4c-af04-609ef78891a2
> Sep 28 16:20:00 ovirtnode02 sanlock[10993]: 2021-09-28 16:20:00 7690984
> [27983]: s11 renewal error -202 delta_length 20 last_success 7690912
> Sep 28 16:20:00 ovirtnode02 wdmd[11102]: test warning now 7690984 ping
> 7690970 close 7690980 renewal 7690912 expire 7690992 client 10993
> sanlock_hosted-engine:2
> Sep 28 16:20:00 ovirtnode02 wdmd[11102]: test warning now 7690984 ping 
> 7690970 close 7690980 renewal 7690908 expire 7690988 client 10993 
> sanlock_3cb12f04-5d68-4d79-8663-f33c0655baa6:2
> Sep 28 16:20:01 ovirtnode02 systemd: Created slice User Slice of root.
> Sep 28 16:20:01 ovirtnode02 systemd: Started Session 15148 of user root.
> Sep 28 16:20:01 ovirtnode02 systemd: Removed slice User Slice of root.
> Sep 28 16:20:01 ovirtnode02 sanlock[10993]: 2021-09-28 16:20:01 7690985 
> [10993]: s11 check_our_lease warning 73 last_success 7690912
> Sep 28 16:20:01 ovirtnode02 sanlock[10993]: 2021-09-28 16:20:01 7690985 
> [10993]: s3 check_our_lease warning 77 last_success 7690908
> Sep 28 16:20:01 ovirtnode02 sanlock[10993]: 2021-09-28 16:20:01 7690985 
> [10993]: s1 check_our_lease warning 69 last_success 7690916
> Sep 28 16:20:01 ovirtnode02 wdmd[11102]: test warning now 7690985 ping 
> 7690970 close 7690980 renewal 7690912 expire 7690992 client 10993 
> sanlock_hosted-engine:2
> Sep 28 16:20:01 ovirtnode02 wdmd[11102]: test warning now 7690985 ping 
> 7690970 close 7690980 renewal 7690908 expire 7690988 client 10993 
> sanlock_3cb12f04-5d68-4d79-8663-f33c0655baa6:2
> Sep 28 16:20:02 ovirtnode02 sanlock[10993]: 2021-09-28 16:20:02 7690986 
> [10993]: s11 check_our_lease warning 74 last_success 7690912
> Sep 28 16:20:02 ovirtnode02 sanlock[10993]: 2021-09-28 16:20:02 7690986 
> [10993]: s3 check_our_lease warning 78 last_success 7690908
> Sep 28 16:20:02 ovirtnode02 sanlock[10993]: 2021-09-28 16:20:02 7690986 
> [10993]: s1 check_our_lease warning 70 last_success 7690916
> Sep 28 16:20:02 ovirtnode02 wdmd[11102]: test warning now 7690986 ping 
> 7690970 close 7690980 renewal 7690916 expire 7690996 client 10993 
> sanlock_15514c65-5d45-4ba7-bcd4-cc772351c940:2
> Sep 28 16:20:02 ovirtnode02 wdmd[11102]: test warning now 7690986 ping 
> 7690970 close 7690980 renewal 7690912 expire 7690992 client 10993 
> sanlock_hosted-engine:2
> Sep 28 16:20:02 ovirtnode02 wdmd[11102]: test warning now 7690986 ping 
> 7690970 close 7690980 renewal 7690908 expire 7690988 client 10993 
> sanlock_3cb12f04-5d68-4d79-8663-f33c0655baa6:2
> Sep 28 16:20:03 ovirtnode02 sanlock[10993]: 2021-09-28 16:20:03 7690987 
> [10993]: s11 check_our_lease warning 75 last_success 7690912
> Sep 28 16:20:03 ovirtnode02 sanlock[10993]: 2021-09-28 16:20:03 7690987 
> [10993]: s3 check_our_lease warning 79 last_success 7690908
> Sep 28 16:20:03 ovirtnode02 sanlock[10993]: 2021-09-28 16:20:03 7690987 
> [10993]: s1 check_our_lease warning 71 last_success 7690916
> 
> 

[ovirt-users] Re: Host reboots when network switch goes down

2021-09-29 Thread Nir Soffer
On Wed, Sep 29, 2021 at 2:08 PM cen  wrote:
>
> Hi,
>
> we are experiencing a weird issue with our Ovirt setup. We have two
> physical hosts (DC1 and DC2) and mounted Lenovo NAS storage for all VM data.
>
> They are connected via a managed network switch.
>
> What happens is that if switch goes down for whatever reason (firmware
> update etc), physical host reboots. Not sure if this is an action
> performed by Ovirt but I suspect it is because connection to mounted
> storage is lost and it  performs some kind of an emergency action. I
> would need to get some direction pointers to find out
>
> a) who triggers the reboot and why
>
> c) a way to prevent reboots by increasing storage? timeouts
>
> Switch reboot takes 2-3 minutes.
>
>
> These are the host /var/log/messages just before reboot occurs:
>
> Sep 28 16:20:00 ovirtnode02 sanlock[10993]: 2021-09-28 16:20:00 7690984
> [10993]: s11 check_our_lease warning 72 last_success 7690912
> Sep 28 16:20:00 ovirtnode02 sanlock[10993]: 2021-09-28 16:20:00 7690984
> [10993]: s3 check_our_lease warning 76 last_success 7690908
> Sep 28 16:20:00 ovirtnode02 sanlock[10993]: 2021-09-28 16:20:00 7690984
> [10993]: s1 check_our_lease warning 68 last_success 7690916
> Sep 28 16:20:00 ovirtnode02 sanlock[10993]: 2021-09-28 16:20:00 7690984
> [27983]: s11 delta_renew read timeout 10 sec offset 0
> /var/run/vdsm/storage/15514c65-5d45-4ba7-bcd4-cc772351c940/fce598a8-11c3-44f9-8aaf-8712c96e00ce/65413499-6970-4a4c-af04-609ef78891a2
> Sep 28 16:20:00 ovirtnode02 sanlock[10993]: 2021-09-28 16:20:00 7690984
> [27983]: s11 renewal error -202 delta_length 20 last_success 7690912
> Sep 28 16:20:00 ovirtnode02 wdmd[11102]: test warning now 7690984 ping
> 7690970 close 7690980 renewal 7690912 expire 7690992 client 10993
> sanlock_hosted-engine:2
> Sep 28 16:20:00 ovirtnode02 wdmd[11102]: test warning now 7690984 ping
> 7690970 close 7690980 renewal 7690908 expire 7690988 client 10993
> sanlock_3cb12f04-5d68-4d79-8663-f33c0655baa6:2
> Sep 28 16:20:01 ovirtnode02 systemd: Created slice User Slice of root.
> Sep 28 16:20:01 ovirtnode02 systemd: Started Session 15148 of user root.
> Sep 28 16:20:01 ovirtnode02 systemd: Removed slice User Slice of root.
> Sep 28 16:20:01 ovirtnode02 sanlock[10993]: 2021-09-28 16:20:01 7690985
> [10993]: s11 check_our_lease warning 73 last_success 7690912
> Sep 28 16:20:01 ovirtnode02 sanlock[10993]: 2021-09-28 16:20:01 7690985
> [10993]: s3 check_our_lease warning 77 last_success 7690908
> Sep 28 16:20:01 ovirtnode02 sanlock[10993]: 2021-09-28 16:20:01 7690985
> [10993]: s1 check_our_lease warning 69 last_success 7690916
> Sep 28 16:20:01 ovirtnode02 wdmd[11102]: test warning now 7690985 ping
> 7690970 close 7690980 renewal 7690912 expire 7690992 client 10993
> sanlock_hosted-engine:2
> Sep 28 16:20:01 ovirtnode02 wdmd[11102]: test warning now 7690985 ping
> 7690970 close 7690980 renewal 7690908 expire 7690988 client 10993
> sanlock_3cb12f04-5d68-4d79-8663-f33c0655baa6:2
> Sep 28 16:20:02 ovirtnode02 sanlock[10993]: 2021-09-28 16:20:02 7690986
> [10993]: s11 check_our_lease warning 74 last_success 7690912
> Sep 28 16:20:02 ovirtnode02 sanlock[10993]: 2021-09-28 16:20:02 7690986
> [10993]: s3 check_our_lease warning 78 last_success 7690908
> Sep 28 16:20:02 ovirtnode02 sanlock[10993]: 2021-09-28 16:20:02 7690986
> [10993]: s1 check_our_lease warning 70 last_success 7690916
> Sep 28 16:20:02 ovirtnode02 wdmd[11102]: test warning now 7690986 ping
> 7690970 close 7690980 renewal 7690916 expire 7690996 client 10993
> sanlock_15514c65-5d45-4ba7-bcd4-cc772351c940:2
> Sep 28 16:20:02 ovirtnode02 wdmd[11102]: test warning now 7690986 ping
> 7690970 close 7690980 renewal 7690912 expire 7690992 client 10993
> sanlock_hosted-engine:2
> Sep 28 16:20:02 ovirtnode02 wdmd[11102]: test warning now 7690986 ping
> 7690970 close 7690980 renewal 7690908 expire 7690988 client 10993
> sanlock_3cb12f04-5d68-4d79-8663-f33c0655baa6:2
> Sep 28 16:20:03 ovirtnode02 sanlock[10993]: 2021-09-28 16:20:03 7690987
> [10993]: s11 check_our_lease warning 75 last_success 7690912
> Sep 28 16:20:03 ovirtnode02 sanlock[10993]: 2021-09-28 16:20:03 7690987
> [10993]: s3 check_our_lease warning 79 last_success 7690908

Leases on lockspace s3 will expire in one second after this message...

> Sep 28 16:20:03 ovirtnode02 sanlock[10993]: 2021-09-28 16:20:03 7690987
> [10993]: s1 check_our_lease warning 71 last_success 7690916

When leases expire, sanlock tries to terminate the lease owner (e.g. vdsm, qemu).
If the owner of the lease cannot be terminated (in ~40 seconds), sanlock must
reboot the host.

So the host running the hosted engine may be rebooted because the storage is
inaccessible and qemu is stuck on the storage.

Other hosts may have the same issue if they run HA VMs, serve as the SPM, or run
storage tasks that use a lease.
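
The wdmd numbers in the quoted log already show the budget involved; a small
sketch of the arithmetic, using only the values visible in the "test warning"
lines above:

# Timing reconstructed from the wdmd "test warning" lines quoted above.
last_renewal = 7690912   # last successful lease renewal (monotonic seconds)
expire       = 7690992   # wdmd "expire" for the same lease
now          = 7690984   # timestamp of the warning lines

print("lease expiry window:", expire - last_renewal, "s")          # 80 s
print("time left when the warnings started:", expire - now, "s")   # 8 s

# Once the lease expires, sanlock tries to kill the lease owners (vdsm, qemu);
# if they are stuck in I/O on the unreachable storage and cannot be killed,
# the watchdog is allowed to fire and the host resets. A 2-3 minute switch
# reboot therefore always exceeds the default 80-second window.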

To understand if this is the case, we need the complete sanlock.log and vdsm.log
from the hosts from when the issue happens.

Please file an oVirt vdsm bug for this, and attach the relevant logs.

Nir