We opened this one for the cloudstack agent; it closely relates to the
problem and the change in behavior since 4.9:

https://github.com/apache/cloudstack/issues/2890

On Tue, Oct 23, 2018 at 2:26 PM Simon Weller <swel...@ena.com.invalid>
wrote:

> JF,
>
>
> I suggest you open a GitHub issue instead. It will get a lot more
> attention than Jira.
>
>
> - Si
>
>
> ________________________________
> From: Jean-Francois Nadeau <the.jfnad...@gmail.com>
> Sent: Tuesday, October 23, 2018 11:32 AM
> To: users@cloudstack.apache.org
> Subject: Re: Host HA vs transient NFS problems on KVM
>
> I will file a Jira for the issue.
>
> On Tue, Oct 23, 2018 at 11:36 AM ilya musayev <
> ilya.mailing.li...@gmail.com>
> wrote:
>
> > Would you please file JIRA bugs describing in exact detail
> >
> > 1) your setup
> > 2) what was done or happened
> > 3) expected result
> >
> > I imagine this will be fixed in the next point release if issues are
> indeed
> > correct. We’ve yet to try this framework and if it does not work as
> > anticipated we will have lots of issues.
> >
> >
> >
> > On Tue, Oct 23, 2018 at 8:30 AM Andrei Mikhailovsky
> > <and...@arhont.com.invalid> wrote:
> >
> > > Hi Jean,
> > >
> > > > I have previously done some HA testing and have pretty much come to
> > > > similar conclusions as you have. My testing showed that using HA is
> very
> > > > unreliable at best and causes data loss at worst. I have had the
> > > following outcome from various testing scenarios:
> > >
> > > 1. Works as expected (very rarely)
> > > 2. Starts 2 vms on different hosts (data loss / corruption)
> > > 3. Reboots ALL KVM hosts (even those hosts that do not have a single vm
> > > with nfs volumes)
> > >
> > > > Now, I cannot justify having HA with even a slim chance of hitting 2
> or
> > 3
> > > above. Honestly, I do not know a single business that is happy to
> accept
> > > those scenarios. Frankly speaking, for me the cloudstack HA options
> > create
> > > > more problems than they solve and thus I've not enabled them. I have decided
> > > that ACS with KVM is not HA friendly, full stop. Having said this, I've
> > not
> > > > tested the latest couple of releases, so I will give it the benefit of
> the
> > > > doubt and wait for users' reports to prove my conclusion wrong.
> I've
> > > wasted enough of my own time on KVM HA.
> > >
> > > My HA approach to ACS is more of a manual nature, which is far more
> > > reliable and is less prone to issues in my experience. I have a
> > monitoring
> > > system sending me alerts when VMs, host servers and storage become
> > > unreachable. It is not as convenient as a fully working automatic HA, I
> > > agree, but it is far better to be woken up at 3am to deal with
> > restarting a
> > > handful of vms and perhaps a KVM host force reboot than dealing with
> mass
> > > > KVM host reboots and/or trying to find duplicate vms lurking somewhere
> > on
> > > the host servers. Been there, done that - NO THANKS!
> > >
> > > Cheers
> > >
> > > Andrei
> > >
> > > ----- Original Message -----
> > > > From: "Jean-Francois Nadeau" <the.jfnad...@gmail.com>
> > > > To: "users" <users@cloudstack.apache.org>
> > > > Sent: Monday, 22 October, 2018 22:13:35
> > > > Subject: Host HA vs transient NFS problems on KVM
> > >
> > > > Dear community,
> > > >
> > > > > I want to share my concern about upgrading from 4.9 to 4.11 with regard to
> how
> > > the
> > > > host HA framework works and the handling of various failure
> conditions.
> > > >
> > > > > Since we have been running CS on 4.9.3 with NFS on KVM, VM HA has
> > > been
> > > > > working as expected when a hypervisor crashed... and I agree we might
> > > have
> > > > > been lucky, knowing the limitations of the KVM investigator, and the
> > > > > possibility of firing the same VM on 2 KVM hosts is real when you know
> > the
> > > > recipe for it.
> > > >
> > > > > Still, on 4.9.3 we were tolerant of transient primary NFS storage
> > > access
> > > > > issues, typical of a network problem (we recently saw one last
> 22
> > > > > minutes). Although these events are quite rare, when
> > > they
> > > > > do happen their blast radius can be huge for the business.
> > > >
> > > > So when we initially tested CS on 4.9.3 we purposely blocked access
> to
> > > NFS
> > > > > and observed the results. Changing the kvmheartbeat.sh script so
> it
> > > > > doesn't reboot the node after 5 minutes has been essential to defuse
> > > the
> > > > > potential for a massive KVM host reboot. In the end, it does far
> less
> > > > > damage to let NFS recover than to have all those VMs rebooted. On
> > > 4.9.3
> > > > > the cloudstack-agent will remain "Up" and not fire any VM twice if
> the
> > > NFS
> > > > storage becomes available again within 30 minutes.
> > > >
> > > > > Now, testing the upgrade from 4.9 to 4.11 in our lab under the same
> > > > failure
> > > > > conditions, we rapidly saw different behavior, although not perfectly
> > > > > consistent. On 4.11.2 without host HA enabled, we will see the
> agent
> > > > > "try" to disconnect after 5 minutes, though sometimes the KVM host goes
> > into
> > > > Disconnect state and sometimes it goes straight to Down state.  In
> that
> > > > case we'll see a duplicate VM created in no time and once the NFS
> issue
> > > is
> > > > > resolved, we have 2 copies of that VM and cloudstack only knows
> about
> > > > > the last copy. This is obviously a disaster, forcing us to look at
> > how
> > > > host HA can help.
> > > >
> > > > Now with host HA enabled and simulating the same NFS hiccup,  we
> won't
> > > get
> > > > duplicate VMs but we will get a KVM host reset.  The problem here is
> > > that,
> > > > yes the host HA does ensure we don't have dup VMs but at scale this
> > would
> > > > also provoke a lot of KVM host resets (if not all of them).   If we
> are
> > > at
> > > > risk with host HA to have massive KVM host resets,  then I might
> prefer
> > > to
> > > > disable host/VM HA entirely and just handle KVM host failures
> manually.
> > > > > This is super annoying for the ops team, but far less risky for the
> > > > business.
> > > >
> > > > > I'm trying to find out if there's a middle ground here between the 4.9
> > > behavior
> > > > with NFS hiccups and the reliability of the new host HA framework.
> > > >
> > > > best,
> > > >
> > > > Jean-Francois
> > >
> >
>
