Hi Jean,

I have previously done some HA testing and came to much the same conclusions
as you have. My testing showed that HA is unreliable at best and loses data in
the worst cases. I have seen the following outcomes across various testing
scenarios:

1. Works as expected (very rarely)
2. Starts 2 VMs on different hosts (data loss / corruption)
3. Reboots ALL KVM hosts (even hosts that do not have a single VM with NFS
volumes)

Now, I cannot justify running HA with even a slim chance of hitting scenario 2
or 3 above. Honestly, I do not know a single business that would be happy to
accept those outcomes. Frankly speaking, for me the CloudStack HA options
create more problems than they solve, so I have not enabled them. I have
decided that ACS with KVM is not HA friendly, full stop. Having said that,
I've not tested the latest couple of releases, so I will give them the benefit
of the doubt and wait for users' reports to prove my conclusion wrong. I've
wasted enough of my own time on KVM HA.

My HA approach with ACS is more of a manual nature, which in my experience is
far more reliable and less prone to issues. I have a monitoring system sending
me alerts when VMs, host servers or storage become unreachable. It is not as
convenient as fully working automatic HA, I agree, but it is far better to be
woken up at 3am to restart a handful of VMs and perhaps force-reboot a KVM
host than to deal with mass KVM host reboots and/or hunt for duplicate VMs
lurking somewhere on the host servers. Been there, done that - NO THANKS!
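
For what it's worth, the monitoring side is nothing fancy - just a standalone
check running from a box outside the cluster, roughly along the lines of the
sketch below (the host names, NFS export and mail address are placeholders,
not our real setup):

  #!/bin/bash
  # Rough sketch of an out-of-band check: alert if a KVM host stops
  # answering pings or the primary NFS export can no longer be mounted.

  ALERT_MAIL="ops@example.com"        # placeholder
  KVM_HOSTS="kvm01 kvm02 kvm03"       # placeholder host names
  NFS_SERVER="nfs01"                  # placeholder
  NFS_EXPORT="/export/primary"        # placeholder

  alert() {
      echo "$1" | mail -s "ACS alert: $1" "$ALERT_MAIL"
  }

  # 1. Are the KVM hosts reachable at all?
  for h in $KVM_HOSTS; do
      ping -c 3 -W 2 "$h" > /dev/null || alert "KVM host $h unreachable"
  done

  # 2. Is the primary NFS export still mountable (needs root)?
  tmp=$(mktemp -d)
  if timeout 30 mount -o ro,soft "$NFS_SERVER:$NFS_EXPORT" "$tmp"; then
      umount "$tmp"
  else
      alert "primary storage $NFS_SERVER:$NFS_EXPORT not mountable"
  fi
  rmdir "$tmp"

Nothing CloudStack-specific in it - the whole point is just to get a human in
the loop quickly instead of letting automation decide to reboot hosts.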

Cheers

Andrei

----- Original Message -----
> From: "Jean-Francois Nadeau" <the.jfnad...@gmail.com>
> To: "users" <users@cloudstack.apache.org>
> Sent: Monday, 22 October, 2018 22:13:35
> Subject: Host HA vs transient NFS problems on KVM

> Dear community,
> 
> I want to share my concerns about upgrading from 4.9 to 4.11 with regard to
> how the host HA framework works and how it handles various failure conditions.
> 
> Since we have been running CS 4.9.3 with NFS on KVM, VM HA has been working
> as expected when a hypervisor crashed... and I agree we might have been lucky,
> knowing the limitations of the KVM investigator: the possibility of firing the
> same VM on 2 KVM hosts is real when you know the recipe for it.
> 
> Still, on 4.9.3 we were tolerant of transient primary NFS storage access
> issues, typical of a network problem (and we saw one recently: a 22-minute
> disconnection). Although these events are quite rare, when they do happen
> their blast radius can have a huge impact on the business.
> 
> So when we initially tested CS 4.9.3 we purposely blocked access to NFS and
> observed the results. Changing the kvmheartbeat.sh script so it doesn't reboot
> the node after 5 minutes has been essential to defuse the potential for a
> massive reboot of KVM hosts. In the end, it does far less damage to let NFS
> recover than to have all those VMs rebooted. On 4.9.3 the cloudstack-agent
> will remain "Up" and not fire any VM twice if the NFS storage becomes
> available again within 30 minutes.
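> 
> For reference, the change is essentially to neutralise the reboot branch of
> that script and just log the failure instead. Very roughly (the variable name
> below is a placeholder and the exact structure of kvmheartbeat.sh differs
> between releases, so take this as a sketch rather than the literal diff):
> 
>   # Failed-heartbeat path in kvmheartbeat.sh: instead of forcing an
>   # immediate host reset (e.g. "echo b > /proc/sysrq-trigger"), log the
>   # condition and exit so that NFS gets a chance to recover.
>   if [ "$heartbeat_failed" = "1" ]; then   # placeholder condition
>       /usr/bin/logger -t heartbeat "heartbeat write failed; reboot suppressed"
>       exit 1
>   fi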
> 
> Now, testing the upgrade from 4.9 to 4.11 in our lab under the same failure
> conditions, we rapidly saw a different behavior, although not a perfectly
> consistent one. On 4.11.2 without host HA enabled, we see the agent "try" to
> disconnect after 5 minutes, though sometimes the KVM host goes into the
> Disconnected state and sometimes it goes straight to the Down state. In the
> latter case we'll see a duplicate VM created in no time, and once the NFS
> issue is resolved we have 2 copies of that VM, with CloudStack only knowing
> about the last copy. This is obviously a disaster, forcing us to look at how
> host HA can help.
> 
> Now, with host HA enabled and simulating the same NFS hiccup, we don't get
> duplicate VMs but we do get a KVM host reset. The problem here is that, yes,
> host HA does ensure we don't have duplicate VMs, but at scale this would also
> provoke a lot of KVM host resets (if not all of them). If host HA puts us at
> risk of massive KVM host resets, then I might prefer to disable host/VM HA
> entirely and just handle KVM host failures manually. This is super annoying
> for the ops team, but far less risky for the business.
> 
> I'm trying to find out if there's a middle ground here between the 4.9
> behavior with NFS hiccups and the reliability of the new host HA framework.
> 
> best,
> 
> Jean-Francois
