Dear community, I want to share a concern about upgrading from 4.9 to 4.11, specifically regarding how the new host HA framework works and how it handles various failure conditions.
We have been running CloudStack 4.9.3 with NFS on KVM, and VM HA has been working as expected when a hypervisor crashed... I agree we may have been lucky, knowing the limitations of the KVM investigator: firing up the same VM on two KVM hosts is a real possibility once you know the recipe for it. Still, on 4.9.3 we were tolerant to transient primary NFS storage access issues, typical of a network problem (and we have seen one lately, a 22-minute disconnection). These events are quite rare, but when they do happen their blast radius can have a huge impact on the business.

So when we initially tested CloudStack 4.9.3 we purposely blocked access to NFS and observed the results. Changing the kvmheartbeat.sh script so it doesn't reboot the node after 5 minutes was essential to defuse the potential for a massive reboot of KVM hosts (sketch in the PS below). In the end, it's far less damage to let NFS recover than to have all those VMs rebooted. On 4.9.3 the cloudstack-agent remains "Up" and does not fire any VM twice if the NFS storage becomes available again within 30 minutes.

Testing the upgrade from 4.9 to 4.11 in our lab under the same failure conditions, we quickly saw a different behavior, although not a perfectly consistent one. On 4.11.2 without host HA enabled, the agent "tries" to disconnect after 5 minutes, though sometimes the KVM host goes into Disconnected state and sometimes it goes straight to Down. In the Down case a duplicate VM is created in no time, and once the NFS issue is resolved we have two copies of that VM, with CloudStack only knowing about the last copy. This is obviously a disaster, and it forced us to look at how host HA can help.

With host HA enabled and the same simulated NFS hiccup, we don't get duplicate VMs, but we do get a KVM host reset. The problem here is that, yes, host HA ensures we don't have duplicate VMs, but at scale a transient NFS outage would also provoke a lot of KVM host resets (if not all of them). If host HA puts us at risk of massive KVM host resets, then I might prefer to disable host/VM HA entirely and just handle KVM host failures manually. This is super annoying for the ops team, but far less risky for the business.

I'm trying to find out if there is a middle ground between the 4.9 behavior with NFS hiccups and the reliability of the new host HA framework.

best,
Jean-Francois
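PS: For reference, here is the kind of change we made to kvmheartbeat.sh on 4.9. This is only a sketch from memory against our copy of the script; the exact flags and lines may differ in your build, but the idea is to log instead of triggering the sysrq reset when the heartbeat can't be written:

    # reboot branch of kvmheartbeat.sh (entered when the agent invokes the
    # script with -c after the heartbeat file on NFS could not be updated)
    if [ "$cflag" == "1" ]
    then
        # stock behavior: log, sync and hard-reset the host via sysrq:
        #   sync &
        #   sleep 5
        #   echo b > /proc/sysrq-trigger
        # our change: just log it and bail out, let NFS recover on its own
        /usr/bin/logger -t heartbeat "kvmheartbeat.sh: heartbeat write failed, reboot disabled, waiting for NFS to recover"
        exit 1
    fi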
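PPS: On the 4.11 side, the only knobs I'm aware of are the kvm.ha.* global settings that come with the new HA framework. Below is a sketch of the kind of tuning I have in mind, applied via cloudmonkey; the values are purely illustrative (the intent is to stretch the activity checks so a short NFS hiccup doesn't end in a host reset) and not something we have validated yet:

    # check VM disk activity over a longer window before declaring the host bad
    cloudmonkey update configuration name=kvm.ha.activity.check.interval value=120
    cloudmonkey update configuration name=kvm.ha.activity.check.max.attempts value=20
    # require a higher ratio of failed checks before recovery kicks in
    cloudmonkey update configuration name=kvm.ha.activity.check.failure.ratio value=0.9
    # let the host sit longer in Degraded before a recover (reset) is attempted
    cloudmonkey update configuration name=kvm.ha.degraded.max.period value=600

If anyone has actually tuned these against NFS blips, I'd love to hear which combination worked for you.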