Would you please file JIRA bugs describing in exact detail: 1) your setup, 2) what was done or what happened, and 3) the expected result?
I imagine this will be fixed in the next point release if the issues are
confirmed. We've yet to try this framework, and if it does not work as
anticipated we will have a lot of problems.

On Tue, Oct 23, 2018 at 8:30 AM Andrei Mikhailovsky
<and...@arhont.com.invalid> wrote:

> Hi Jean,
>
> I have previously done some HA testing and have pretty much come to
> similar conclusions as you have. My testing showed that HA is unreliable
> at best and loses data in the worst cases. I have had the following
> outcomes from various testing scenarios:
>
> 1. It works as expected (very rarely)
> 2. It starts the same VM on 2 different hosts (data loss / corruption)
> 3. It reboots ALL KVM hosts (even hosts that do not have a single VM
> with NFS volumes)
>
> Now, I cannot justify enabling HA with even a slim chance of outcomes 2
> or 3 above. Honestly, I do not know a single business that would be
> happy to accept those scenarios. Frankly speaking, for me the CloudStack
> HA options create more problems than they solve, so I have not enabled
> them. I have decided that ACS with KVM is not HA friendly, full stop.
> Having said this, I have not tested the last couple of releases, so I
> will give them the benefit of the doubt and wait for users' reports to
> prove my conclusion wrong. I have wasted enough of my own time on KVM
> HA.
>
> My HA approach to ACS is manual in nature, which in my experience is
> far more reliable and less prone to issues. I have a monitoring system
> that sends me alerts when VMs, host servers and storage become
> unreachable. It is not as convenient as a fully working automatic HA, I
> agree, but it is far better to be woken up at 3am to restart a handful
> of VMs and perhaps force-reboot a KVM host than to deal with mass KVM
> host reboots and/or hunt for duplicate VMs lurking somewhere on the
> host servers. Been there, done that - NO THANKS!
>
> Cheers
>
> Andrei
>
> ----- Original Message -----
> > From: "Jean-Francois Nadeau" <the.jfnad...@gmail.com>
> > To: "users" <users@cloudstack.apache.org>
> > Sent: Monday, 22 October, 2018 22:13:35
> > Subject: Host HA vs transient NFS problems on KVM
>
> > Dear community,
> >
> > I want to share my concerns about upgrading from 4.9 to 4.11 with
> > regard to how the host HA framework works and how it handles various
> > failure conditions.
> >
> > We have been running CS 4.9.3 with NFS on KVM, and VM HA has been
> > working as expected when a hypervisor crashed.... and I agree we
> > might have been lucky: knowing the limitations of the KVM
> > investigator, the possibility of firing the same VM on 2 KVM hosts is
> > real when you know the recipe for it.
> >
> > Still, on 4.9.3 we were tolerant of transient primary NFS storage
> > access issues, typical of a network problem (and we saw one lately: a
> > 22-minute disconnection). Although these events are quite rare, when
> > they do happen their blast radius can have a huge impact on the
> > business.
> >
> > So when we initially tested CS 4.9.3 we purposely blocked access to
> > NFS and observed the results. Changing the kvmheartbeat.sh script so
> > it doesn't reboot the node after 5 minutes was essential to defuse
> > the potential for a massive KVM host reboot. In the end, it does far
> > less damage to let NFS recover than to have all those VMs rebooted.
> > On 4.9.3 the cloudstack-agent will remain "Up" and not fire any VM
> > twice if the NFS storage becomes available again within 30 minutes.
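> >
> > For reference, here is roughly how we ran that test and what the
> > script change amounts to. This is only a sketch: the NFS server IP is
> > a placeholder, and the script path/contents vary by CloudStack
> > version, so check your own copy before touching anything.
> >
> >   # On a KVM host: simulate a transient NFS outage (NFS over TCP/2049)
> >   iptables -I OUTPUT -d <nfs-server-ip> -p tcp --dport 2049 -j DROP
> >   # ... leave it blocked for several minutes, watch the agent and
> >   # host state in the management server ...
> >   iptables -D OUTPUT -d <nfs-server-ip> -p tcp --dport 2049 -j DROP
> >
> >   # /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh
> >   # In the branch that fences the host after failed heartbeat writes,
> >   # we neutered the reboot and only log instead:
> >   #   sync &
> >   #   sleep 5
> >   #   echo b > /proc/sysrq-trigger    <- commented out in our lab
> >   /usr/bin/logger -t heartbeat "heartbeat write failed; reboot disabled"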
> >
> > Now, testing the upgrade from 4.9 to 4.11 in our lab under the same
> > failure conditions, we quickly saw a different behavior, although not
> > a perfectly consistent one. On 4.11.2 without host HA enabled, we see
> > the agent "try" to disconnect after 5 minutes, though sometimes the
> > KVM host goes into the Disconnected state and sometimes it goes
> > straight to the Down state. In that case we see a duplicate VM
> > created in no time, and once the NFS issue is resolved we have 2
> > copies of that VM while CloudStack only knows about the last copy.
> > This is obviously a disaster, forcing us to look at how host HA can
> > help.
> >
> > Now with host HA enabled and simulating the same NFS hiccup, we won't
> > get duplicate VMs, but we will get a KVM host reset. The problem here
> > is that, yes, host HA does ensure we don't have duplicate VMs, but at
> > scale this would also provoke a lot of KVM host resets (if not all of
> > them). If host HA puts us at risk of massive KVM host resets, then I
> > might prefer to disable host/VM HA entirely and handle KVM host
> > failures manually. That is super annoying for the ops team, but far
> > less risky for the business.
> >
> > I'm trying to find out whether there is a middle ground here between
> > the 4.9 behavior with NFS hiccups and the reliability of the new host
> > HA framework.
> >
> > best,
> >
> > Jean-Francois
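
P.S. On the middle-ground question: if I read the 4.11 host HA docs
correctly, the framework's aggressiveness is tunable, and it can be
switched on/off per host through the API. A sketch with CloudMonkey
(the setting and parameter names below are from my reading of the 4.11
docs, so please verify them against your build before relying on this):

  # enable/disable the HA framework for a given KVM host
  configure haforhost hostid=<host-uuid> provider=kvmhaprovider
  enable haforhost hostid=<host-uuid>
  disable haforhost hostid=<host-uuid>

  # settings that control how long the framework tolerates failed
  # activity/health checks before it decides to recover (reset) a host:
  #   kvm.ha.activity.check.interval
  #   kvm.ha.activity.check.max.attempts
  #   kvm.ha.activity.check.failure.ratio
  #   kvm.ha.degraded.max.period

Raising the activity-check attempts/interval should, in principle, let a
host ride out a short NFS hiccup before being fenced - worth testing in
your lab against that 22-minute scenario.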