Hi,

I always thought the SPM role was also "managed" by a storage lease :) But that does not seem to be the case.

So this means a storage lease is only useful if the failed host is not the SPM? If the SPM host is completely unreachable, neither via the OS nor via power management, then the storage lease won't help to restart VMs on other hosts automatically? This is definitely something I did not consider when building my environment.


Greetings

Klaas


On 8/9/21 6:25 PM, Nir Soffer wrote:
On Thu, Aug 5, 2021 at 5:45 PM Gianluca Cecchi
<gianluca.cec...@gmail.com> wrote:
Hello,
suppose a latest 4.4.7 environment installed with an external engine and two 
hosts, one in each of two sites.
For storage I have one FC storage domain.
I am trying to simulate a sort of "site failure" scenario to see what kind of HA I 
should expect.

The two hosts have power management configured through fence_ipmilan.

I have 2 VMs, one configured as HA with a lease on storage (Resume Behavior: 
kill) and one not marked as HA.
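As a side note, one way to check that the lease is really attached to the running VM is to look for a <lease> device in its libvirt domain XML on the host where it runs; a minimal sketch, where "vm_ha" is only a placeholder for the VM name:

# virsh -r list
# virsh -r dumpxml vm_ha | grep -A 7 "<lease>"    # vm_ha is a placeholder name

The <lease> element should reference the storage domain (lockspace) and an offset on its lease volume.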

Initially host1 is the SPM and it is the host that runs the two VMs.

Fencing of host1 from host2 initially works OK. I can also test it from the command 
line:
# fence_ipmilan -a 10.10.193.152 -P -l my_fence_user -A password -L operator -S 
/usr/local/bin/pwd.sh -o status
Status: ON

On host2 I then prevent reaching host1's iDRAC:
firewall-cmd --direct --add-rule ipv4 filter OUTPUT 0 -d 10.10.193.152 -p udp 
--dport 623 -j DROP
firewall-cmd --direct --add-rule ipv4 filter OUTPUT 1 -j ACCEPT
Why do you need to prevent access from host2 to host1? Hosts do not
access each other unless you migrate VMs between hosts.

so that:

# fence_ipmilan -a 10.10.193.152 -P -l my_fence_user -A password -L operator -S 
/usr/local/bin/pwd.sh -o status
2021-08-05 15:06:07,254 ERROR: Failed: Unable to obtain correct plug status or 
plug is not available
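
For completeness, the direct rules can be listed, and removed again after the test, with the matching firewall-cmd calls (a sketch, assuming the same rule arguments as above):

# firewall-cmd --direct --get-all-rules
# firewall-cmd --direct --remove-rule ipv4 filter OUTPUT 0 -d 10.10.193.152 -p udp --dport 623 -j DROP
# firewall-cmd --direct --remove-rule ipv4 filter OUTPUT 1 -j ACCEPT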

On host1 I generate a kernel panic:
# date ; echo 1 > /proc/sys/kernel/sysrq ; echo c > /proc/sysrq-trigger
Thu Aug  5 15:06:24 CEST 2021
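
As a side check, whether kdump is armed and sysrq is enabled can be confirmed beforehand with something like (a minimal sketch):

# systemctl is-active kdump      # "active" means the crash kernel is loaded
# cat /proc/sys/kernel/sysrq     # 1 means all sysrq functions are enabled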

host1 correctly completes its crash dump (kdump integration is enabled) and 
reboots, but I stop it at the GRUB prompt, so that host1 is unreachable from host2's 
point of view and its power state cannot be determined via fencing either.
Crashing the host and preventing it from booting is fine, but isn't it
simpler to stop the host using power management?

At this point I thought that the VM lease functionality would have come into play 
and host2 would be able to restart the HA VM, as it can see that the 
lease is not held by the other host and so it can acquire the lock itself....
Once host1 disappears from the system, the engine should detect that the HA VM
is in an unknown status, and start it on the other host.

But you killed the SPM, and without an SPM some operations cannot
proceed until a new SPM is selected. And we don't have a way to start the SPM
on another host *before* the old SPM host reboots and we can
verify that the old host is no longer the SPM.
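
One way to see this from the host side is to check which sanlock lockspaces and leases a host currently holds; the SPM lease and any VM leases should show up there. A minimal sketch, run as root on a host:

# sanlock client status       # lockspaces and resources (leases) held on this host
# sanlock client host_status  # liveness of the hosts registered in each lockspace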

Instead it goes into a loop of power-fence attempts.
I wait about 25 minutes with no effect other than continuous attempts.

After 2 minutes host2 correctly becomes the SPM and the VMs are marked as unknown.
I wonder how host2 became the SPM. This should not be possible before
host1 is rebooted. Did you use "Confirm host was rebooted" in the engine?

At a certain point after the failures in power fencing host1, I see the event:

Failed to power fence host host1. Please check the host status and it's power management 
settings, and then manually reboot it and click "Confirm Host Has Been Rebooted"

If I select the host and choose "Confirm Host Has Been Rebooted", then the two VMs 
are marked as down and the HA one is correctly started on host2.

But this requires my manual intervention.
So host2 became the SPM after you chose "Confirm Host Has Been Rebooted"?

Is the behavior above the expected one, or should the use of VM leases have 
allowed host2 to bypass the inability to fence and start the HA VM with the lease? 
Otherwise I don't understand the reason for having the lease at all....
The VM lease allows the engine to start an HA VM on another host when it cannot
access the original host the VM was running on.

The VM can be started only if it is not running on the original host. If the VM
is running, it will keep the lease alive, and other hosts will not be able to acquire it.
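
If you want to inspect a lease directly on storage, sanlock can dump the storage domain's xleases volume, where external VM leases live. A minimal sketch, assuming a block (FC) storage domain, with SD_UUID as a placeholder for its UUID and the xleases LV already active on the host:

# sanlock direct dump /dev/SD_UUID/xleases    # SD_UUID is a placeholder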

I suggest you file an ovirt-engine bug with clear instructions on how to reproduce the issue.

You can check this presentation on this topic:
https://www.youtube.com/watch?v=WLnU_YsHWtU

Nir
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/7RYGVJO52IHZPIKRJLCBTKTRV2O56DM2/
