Dear community,

I want to share my concerns about upgrading from 4.9 to 4.11 with regard to how
the host HA framework works and how various failure conditions are handled.

Since we have been running CS 4.9.3 with NFS primary storage on KVM, VM HA has
been working as expected when a hypervisor crashed. I agree we might have been
lucky, knowing the limitations of the KVM investigator, and the possibility of
firing the same VM on 2 KVM hosts is real when you know the recipe for it.

Still, on 4.9.3 we were tolerant of transient primary NFS storage access
issues, typical of a network problem (we recently saw a 22-minute
disconnection).  Although these events are quite rare, when they do happen
their blast radius can have a huge impact on the business.

So when we initially tested CS 4.9.3 we purposely blocked access to NFS and
observed the results.  Changing the kvmheartbeat.sh script so it doesn't
reboot the node after 5 minutes has been essential to defuse the potential for
a massive reboot of KVM hosts.  In the end, it does far less damage to let NFS
recover than to have all those VMs rebooted.  On 4.9.3 the cloudstack-agent
remains "Up" and does not fire any VM twice if the NFS storage becomes
available again within 30 minutes.
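
For reference, the change itself is tiny: the stock script hard-resets the node
when it can no longer write the heartbeat file on the NFS primary storage.
This is roughly the kind of patch we carry (path, flag and variable names are
from memory and may differ slightly between versions, so treat it as an
illustration rather than the upstream script):

  # /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh
  # Stock behaviour when the agent calls it with -c (stale heartbeat):
  #   sync &
  #   sleep 5
  #   echo b > /proc/sysrq-trigger   # immediate hard reset of the host
  # Our version logs loudly and lets NFS recover instead:
  if [ "$cflag" == "1" ]; then
    /usr/bin/logger -t heartbeat "kvmheartbeat.sh: heartbeat to primary storage lost, NOT resetting host (local policy)"
    exit 0
  fi

The obvious trade-off is that a genuinely wedged host will never fence itself,
which is exactly the gap the new host HA framework is meant to close.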

Now, testing the upgrade from 4.9 to 4.11 in our lab under the same failure
conditions, we quickly saw a different behavior, although not a perfectly
consistent one.  On 4.11.2 without host HA enabled, we see the agent "try" to
disconnect after 5 minutes; sometimes the KVM host goes into Disconnected
state and sometimes it goes straight to Down state.  In the latter case we see
a duplicate VM created in no time, and once the NFS issue is resolved we have
2 copies of that VM, with CloudStack only knowing about the last copy.  This
is obviously a disaster, forcing us to look at how host HA can help.

Now with host HA enabled and simulating the same NFS hiccup, we don't get
duplicate VMs but we do get a KVM host reset.  The problem here is that, yes,
host HA ensures we don't have duplicate VMs, but at scale this would also
provoke a lot of KVM host resets (if not all of them).  If host HA puts us at
risk of massive KVM host resets, then I might prefer to disable host/VM HA
entirely and handle KVM host failures manually.  That is super annoying for
the ops team, but far less risky for the business.
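
In case it helps anyone reproduce this, host HA can be enabled or disabled per
host via the 4.11 API rather than cluster-wide.  Roughly, via cloudmonkey (API
and provider names as I understand the host HA framework, <host-uuid> is a
placeholder; please double-check against your install):

  # enable the KVM HA provider on a single host
  cloudmonkey configure haforhost hostid=<host-uuid> provider=kvmhaprovider
  cloudmonkey enable haforhost hostid=<host-uuid>
  # back out and handle that host manually instead
  cloudmonkey disable haforhost hostid=<host-uuid>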

I'm trying to find out if there's a middle ground here between the 4.9
tolerance of NFS hiccups and the reliability of the new host HA framework.

best,

Jean-Francois
