I will file a Jira for the issue.

On Tue, Oct 23, 2018 at 11:36 AM ilya musayev <ilya.mailing.li...@gmail.com> wrote:
> Would you please file the JIRA bugs describing in exact detail:
>
> 1) your setup
> 2) what was done or what happened
> 3) the expected result
>
> I imagine this will be fixed in the next point release if the issues are
> indeed correct. We've yet to try this framework, and if it does not work
> as anticipated we will have lots of issues.
>
> On Tue, Oct 23, 2018 at 8:30 AM Andrei Mikhailovsky
> <and...@arhont.com.invalid> wrote:
>
> > Hi Jean,
> >
> > I have previously done some HA testing and have pretty much come to
> > the same conclusions as you have. My testing showed that HA is
> > unreliable at best and loses data in the worst cases. I saw the
> > following outcomes across various testing scenarios:
> >
> > 1. Works as expected (very rarely)
> > 2. Starts 2 VMs on different hosts (data loss / corruption)
> > 3. Reboots ALL KVM hosts (even hosts that do not have a single VM
> >    with NFS volumes)
> >
> > Now, I cannot justify enabling HA with even a slim chance of hitting 2
> > or 3 above. Honestly, I do not know a single business that would be
> > happy to accept those scenarios. Frankly speaking, for me the
> > CloudStack HA options create more problems than they solve, so I have
> > not enabled them. I have decided that ACS with KVM is not HA friendly,
> > full stop. Having said this, I've not tested the last couple of
> > releases, so I will give them the benefit of the doubt and wait for
> > user reports to prove my conclusion otherwise. I've wasted enough of
> > my own time on KVM HA.
> >
> > My HA approach to ACS is more manual in nature, which in my experience
> > is far more reliable and less prone to issues. I have a monitoring
> > system that sends me alerts when VMs, host servers or storage become
> > unreachable. It is not as convenient as a fully working automatic HA,
> > I agree, but it is far better to be woken up at 3am to restart a
> > handful of VMs and perhaps force-reboot a KVM host than to deal with
> > mass KVM host reboots and/or hunt for duplicate VMs lurking somewhere
> > on the host servers. Been there, done that - NO THANKS!
> >
> > Cheers
> >
> > Andrei
> >
> > ----- Original Message -----
> > > From: "Jean-Francois Nadeau" <the.jfnad...@gmail.com>
> > > To: "users" <users@cloudstack.apache.org>
> > > Sent: Monday, 22 October, 2018 22:13:35
> > > Subject: Host HA vs transient NFS problems on KVM
> >
> > > Dear community,
> > >
> > > I want to share my concerns about upgrading from 4.9 to 4.11 with
> > > regard to how the host HA framework works and how it handles various
> > > failure conditions.
> > >
> > > Since we have been running CS 4.9.3 with NFS on KVM, VM HA has been
> > > working as expected when a hypervisor crashed.... and I agree we
> > > might have been lucky, knowing the limitations of the KVM
> > > investigator; the possibility of firing the same VM on 2 KVM hosts
> > > is real when you know the recipe for it.
> > >
> > > Still, on 4.9.3 we were tolerant of transient primary NFS storage
> > > access issues, typical of a network problem (and we've seen one
> > > recently: a 22-minute disconnection). Although these events are
> > > quite rare, when they do happen their blast radius can be huge for
> > > the business.
> > >
> > > So when we initially tested CS 4.9.3 we purposely blocked access to
> > > NFS and observed the results. Changing the kvmheartbeat.sh script so
> > > it doesn't reboot the node after 5 minutes was essential to defuse
> > > the potential for a massive KVM host reboot. In the end, it's far
> > > less damage to let NFS recover than to have all those VMs rebooted.
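> > >
> > > For reference, the branch we neutered looks roughly like this. I'm
> > > quoting from memory of our agent's copy (under
> > > /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/ on our
> > > hosts), and the exact contents vary between releases, so verify
> > > against the script your version actually ships before touching it:
> > >
> > >   # Fencing branch of kvmheartbeat.sh: the agent invokes the script
> > >   # with -c after it has repeatedly failed to write the heartbeat
> > >   # to the NFS heartbeat file.
> > >   if [ "$cflag" == "1" ]
> > >   then
> > >     /usr/bin/logger -t heartbeat "kvmheartbeat.sh unable to write the heartbeat to storage, rebooting!"
> > >     sync &
> > >     sleep 5
> > >     # Commenting out the next line is what defuses the forced
> > >     # reboot; the failure is still logged above, so monitoring can
> > >     # page on it instead.
> > >     # echo b > /proc/sysrq-trigger
> > >     exit $?
> > >   fi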
> > >
> > > On 4.9.3 the cloudstack-agent will remain "Up" and will not fire
> > > any VM twice if the NFS storage becomes available again within 30
> > > minutes.
> > >
> > > Now, testing the upgrade from 4.9 to 4.11 in our lab under the same
> > > failure conditions, we rapidly saw a different behavior, although
> > > not a perfectly consistent one. On 4.11.2 without host HA enabled,
> > > we see the agent "try" to disconnect after 5 minutes, though
> > > sometimes the KVM host goes into Disconnected state and sometimes it
> > > goes straight to Down state. In the latter case we'll see a
> > > duplicate VM created in no time, and once the NFS issue is resolved
> > > we have 2 copies of that VM while CloudStack only knows about the
> > > last copy. This is obviously a disaster, which forced us to look at
> > > how host HA can help.
> > >
> > > Now with host HA enabled and simulating the same NFS hiccup, we
> > > don't get duplicate VMs, but we do get a KVM host reset. The problem
> > > here is that, yes, host HA ensures we don't have duplicate VMs, but
> > > at scale this would also provoke a lot of KVM host resets (if not
> > > all of them). If host HA puts us at risk of massive KVM host resets,
> > > then I might prefer to disable host/VM HA entirely and just handle
> > > KVM host failures manually. That is super annoying for the ops team,
> > > but far less risky for the business.
> > >
> > > I'm trying to find whether there's a middle ground here between the
> > > 4.9 behavior with NFS hiccups and the reliability of the new host HA
> > > framework.
> > >
> > > best,
> > >
> > > Jean-Francois
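
For what it's worth, until the Jira gets traction: the closest thing to the
middle ground Jean-Francois describes may be that host HA in 4.11 is opt-in
per host and tunable per cluster. A rough cloudmonkey sketch follows; the
kvm.ha.* setting names are as documented for the 4.11 host HA framework, but
double-check them against your version, and the values and UUIDs below are
illustrative placeholders, not recommendations:

  # Make the KVM HA provider slower to escalate, so a short NFS hiccup
  # has a chance to clear before the host is declared unhealthy and fenced.
  update configuration name=kvm.ha.degraded.max.period value=600 clusterid=<cluster-uuid>
  update configuration name=kvm.ha.activity.check.interval value=60 clusterid=<cluster-uuid>
  update configuration name=kvm.ha.activity.check.max.attempts value=20 clusterid=<cluster-uuid>

  # Host HA is enabled explicitly per host, so it can be trialed on a few
  # hosts before a wider rollout.
  configureHAForHost hostid=<host-uuid> provider=kvmhaprovider
  enableHAForHost hostid=<host-uuid>

Whether those knobs stretch far enough to ride out something like a
22-minute NFS outage without a host reset is exactly what the Jira testing
should establish.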