[ovirt-users] Ovirt 4.5 HA over NFS fails when a single host goes down

youssef . khristo Tue, 27 Feb 2024 10:30:33 -0800

Greetings,

we recently installed oVirt as a hosted engine with high availability on six 
nodes over NFS storage (no Gluster), with power management through the on-board 
IPMI device, and the setup was successful. All the nodes (from Supermicro) are 
identical in every respect, so there are no hardware differences, and no 
modifications were made to the servers' hardware. The hosted engine was also 
deployed on a second host, as only two of the six hosts were required to host 
the HE VM.


The network interface on each node is a bond of two physical fiber-optic NICs 
in LACP mode with a VLAN on top, serving as the sole network interface for the 
node. No separate VM or storage networks were needed, as the host OS, 
hosted-engine VM, and storage are required to be on the same network and VLAN.

We started by testing the high availability of the hosted-engine VM (as it was 
deployed on two of the six nodes) by rebooting or powering off one of the 
hosts, and the VM would migrate successfully to the second HE node. The main 
goal of our experiments is to test the robustness of the setup, as the cluster 
is required to remain functional even when up to two hosts are brought down 
(whether due to a network or power issue). However, when we reboot or power off 
one of the hosts, the HE VM goes down and takes the entire cluster with it, to 
the point where we can't even access the web portal. Once the host is rebooted, 
the HE VM and the cluster become functional again. Sometimes the HE VM stays 
down for a set amount of time (5 to 6 minutes) and then comes back up, and 
sometimes it stays down until the problematic host is back up. This behavior 
affects other VMs as well, not just the HE.

We suspected an issue with the NFS storage; however, during oVirt operation it 
is mounted properly under /rhev/data-center/mnt/<nfs:directory>. The expected 
behavior is for the cluster to stay operational and for any other VMs to be 
migrated to other hosts. During one of the tests, we mounted the NFS storage on 
a different directory and there was no problem: we were able to run commands 
such as ls without any issues, write a text file at the directory's root, and 
modify it normally.
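
For anyone who wants to repeat that manual check in a timed, repeatable way, 
here is a minimal sketch in Python. It is not something oVirt or VDSM provides; 
the function name probe_directory is our own, and the directory passed to it is 
assumed to be wherever the NFS export is test-mounted. It writes a small file, 
syncs it, reads it back, deletes it, and reports how long that took.

```python
import os
import tempfile
import time


def probe_directory(path):
    """Write, sync, read back and delete a small file under `path`.

    Returns the elapsed time in seconds; raises OSError on I/O failure.
    `path` is a hypothetical test mount point chosen by the caller.
    """
    payload = b"nfs-probe\n"
    start = time.monotonic()
    # Use a uniquely named file so that concurrent probes do not collide.
    fd, name = tempfile.mkstemp(prefix="probe-", dir=path)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # ask the OS to flush the data to the backing storage
    finally:
        os.close(fd)
    try:
        with open(name, "rb") as handle:
            if handle.read() != payload:
                raise OSError("read-back mismatch in " + name)
    finally:
        os.remove(name)
    return time.monotonic() - start
```

Running it in a loop while a host is being powered off would show whether the 
mount stalls at that moment, which a one-off ls cannot tell us.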

We suspected a couple of things. The first is that the HE is unable to fence 
the problematic host (the one we took down); however, power management is set 
up properly.

The other thing we suspected is that the cluster hosts (after taking down one 
of them) are unable to acquire the storage lease, which is weird since the host 
in question is down and non-operational, hence no locks should be in place. The 
reason behind this suspicion is the following two errors, which we receive 
frequently in the engine\ovirt-engine\engine.log file when one or more hosts go 
down:
1- "EVENT_ID: VM_DOWN_ERROR(119), VM HostedEngine is down with error. Exit 
message: resource busy: Failed to acquire lock: Lease is held by another host."
2- "[<id>] Command 'GetVmLeaseInfoVDSCommand( 
VmLeaseVDSParameters:{expectedEngineErrors='[NoSuchVmLeaseOnDomain]', 
storagePoolId='<pool-id>', ignoreFailoverLimit='false', leaseId='<lease-id>', 
storageDomainId='<domain-id>'})' execution failed: IRSGenericException: 
IRSErrorException: No such lease: 'lease=<lease-id>'"

This is a third message, a warning from the /var/log/vdsm/vdsm.log file:
1- "WARN  (check/loop) [storage.check] Checker 
'/rhev/data-center/mnt/<nfs-domain:/directory>/<id>/dom_md/metadata' is blocked 
for 310.00 seconds (check:265)"
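
In case it helps to see how long the checker stays blocked across a test run, 
here is a minimal Python sketch that pulls these warnings out of vdsm.log. The 
regular expression is based only on the single message quoted above, and it 
assumes each warning sits on one line in the actual log (it is wrapped here by 
the mail client); the names BLOCKED and blocked_checks are our own.

```python
import re

# Pattern derived from the one warning quoted above; other VDSM versions
# may word the message differently (assumption).
BLOCKED = re.compile(
    r"Checker '(?P<path>[^']+)' is blocked for "
    r"(?P<seconds>\d+(?:\.\d+)?) seconds"
)


def blocked_checks(lines):
    """Yield (path, seconds) for every blocked-checker warning in `lines`."""
    for line in lines:
        match = BLOCKED.search(line)
        if match:
            yield match.group("path"), float(match.group("seconds"))


# The warning from this message, joined back into a single log line.
sample = [
    "WARN  (check/loop) [storage.check] Checker "
    "'/rhev/data-center/mnt/<nfs-domain:/directory>/<id>/dom_md/metadata' "
    "is blocked for 310.00 seconds (check:265)",
]

for path, seconds in blocked_checks(sample):
    print(path, seconds)
```

To run it against the real log, pass an open file instead of the sample list, 
e.g. blocked_checks(open("/var/log/vdsm/vdsm.log")).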

All the tests are done without setting nodes into maintenance mode, as we are 
simulating an emergency situation. No HE configuration was modified via the 
engine-config command; the default values are used.

Is this normal behavior? Are we missing something? Do we need to tweak a 
certain configuration using the engine-config command to get better behavior 
(e.g., a shorter down period)?

Best regards
