We opened this issue for the cloudstack-agent; it closely relates to the problem and the change in behavior since 4.9:
https://github.com/apache/cloudstack/issues/2890

On Tue, Oct 23, 2018 at 2:26 PM Simon Weller <swel...@ena.com.invalid> wrote:

> JF,
>
> I suggest you open a GitHub issue instead. It will get a lot more
> attention than Jira.
>
> - Si
>
> ________________________________
> From: Jean-Francois Nadeau <the.jfnad...@gmail.com>
> Sent: Tuesday, October 23, 2018 11:32 AM
> To: users@cloudstack.apache.org
> Subject: Re: Host HA vs transient NFS problems on KVM
>
> I will file a Jira for the issue.
>
> On Tue, Oct 23, 2018 at 11:36 AM ilya musayev
> <ilya.mailing.li...@gmail.com> wrote:
>
> > Would you please file the JIRA bugs describing in exact detail
> >
> > 1) your setup
> > 2) what was done or happened
> > 3) expected result
> >
> > I imagine this will be fixed in the next point release if the issues
> > are indeed valid. We’ve yet to try this framework and if it does not
> > work as anticipated we will have lots of issues.
> >
> > On Tue, Oct 23, 2018 at 8:30 AM Andrei Mikhailovsky
> > <and...@arhont.com.invalid> wrote:
> >
> > > Hi Jean,
> > >
> > > I have previously done some HA testing and have pretty much come
> > > to similar conclusions as you have. My testing showed that using HA
> > > is very unreliable at best and causes data loss at worst. I have
> > > had the following outcomes from various testing scenarios:
> > >
> > > 1. Works as expected (very rarely)
> > > 2. Starts 2 vms on different hosts (data loss / corruption)
> > > 3. Reboots ALL KVM hosts (even those hosts that do not have a
> > >    single vm with nfs volumes)
> > >
> > > Now, I cannot justify having HA with even a slim chance of
> > > hitting 2 or 3 above. Honestly, I do not know a single business
> > > that is happy to accept those scenarios. Frankly speaking, for me
> > > the cloudstack HA options create more problems than they solve and
> > > thus I've not enabled them. I have decided
> > > that ACS with KVM is not HA friendly, full stop.
> > > Having said this, I've not tested the latest couple of releases,
> > > so I will give it the benefit of the doubt and wait for users'
> > > reports to prove my conclusion wrong. I've wasted enough of my
> > > own time on KVM HA.
> > >
> > > My HA approach to ACS is more of a manual nature, which is far
> > > more reliable and less prone to issues in my experience. I have a
> > > monitoring system sending me alerts when VMs, host servers and
> > > storage become unreachable. It is not as convenient as a fully
> > > working automatic HA, I agree, but it is far better to be woken up
> > > at 3am to deal with restarting a handful of vms and perhaps a KVM
> > > host force reboot than dealing with mass KVM host reboots and/or
> > > trying to find duplicate vms lurking somewhere on the host
> > > servers. Been there, done that - NO THANKS!
> > >
> > > Cheers
> > >
> > > Andrei
> > >
> > > ----- Original Message -----
> > > > From: "Jean-Francois Nadeau" <the.jfnad...@gmail.com>
> > > > To: "users" <users@cloudstack.apache.org>
> > > > Sent: Monday, 22 October, 2018 22:13:35
> > > > Subject: Host HA vs transient NFS problems on KVM
> > >
> > > > Dear community,
> > > >
> > > > I want to share my concern about upgrading from 4.9 to 4.11 with
> > > > regard to how the host HA framework works and how it handles
> > > > various failure conditions.
> > > >
> > > > Since we have been running CS 4.9.3 with NFS on KVM, VM HA has
> > > > been working as expected when a hypervisor crashed... and I agree
> > > > we might have been lucky, knowing the limitations of the KVM
> > > > investigator; the possibility of firing the same VM on 2 KVM
> > > > hosts is real when you know the recipe for it.
> > > >
> > > > Still, on 4.9.3 we were tolerant to transient primary NFS storage
> > > > access issues, typical of a network problem (and we've seen it
> > > > lately for a 22-minute disconnection).
> > > > Although these events are quite rare, when they do happen their
> > > > blast radius can have a huge impact on the business.
> > > >
> > > > So when we initially tested CS on 4.9.3 we purposely blocked
> > > > access to NFS and observed the results. Changing the
> > > > kvmheartbeat.sh script so it doesn't reboot the node after 5
> > > > minutes has been essential to defuse the potential of a massive
> > > > KVM host reboot. In the end, it does far less damage to let NFS
> > > > recover than to have all those VMs rebooted. On 4.9.3 the
> > > > cloudstack-agent will remain "Up" and not fire any VM twice if
> > > > the NFS storage becomes available again within 30 minutes.
> > > >
> > > > Now, testing the upgrade from 4.9 to 4.11 in our lab under the
> > > > same failure conditions, we rapidly saw different behavior,
> > > > although not perfectly consistent. On 4.11.2 without host HA
> > > > enabled, we will see the agent "try" to disconnect after 5
> > > > minutes, though sometimes the KVM host goes into Disconnect state
> > > > and sometimes it goes straight to Down state. In that case we'll
> > > > see a duplicate VM created in no time, and once the NFS issue is
> > > > resolved we have 2 copies of that VM and cloudstack only knows
> > > > about the last copy. This is obviously a disaster, forcing us to
> > > > look at how host HA can help.
> > > >
> > > > Now with host HA enabled and simulating the same NFS hiccup, we
> > > > won't get duplicate VMs but we will get a KVM host reset. The
> > > > problem here is that, yes, host HA does ensure we don't have dup
> > > > VMs, but at scale this would also provoke a lot of KVM host
> > > > resets (if not all of them). If host HA puts us at risk of
> > > > massive KVM host resets, then I might prefer to disable host/VM
> > > > HA entirely and just handle KVM host failures manually.
> > > > This is super annoying for the ops team, but far less risky for
> > > > the business.
> > > >
> > > > I'm trying to find out whether there's a middle ground here
> > > > between the 4.9 behavior with NFS hiccups and the reliability of
> > > > the new host HA framework.
> > > >
> > > > best,
> > > >
> > > > Jean-Francois
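
To make the trade-off in the quoted message concrete, here is a minimal sketch of the two tolerance windows it mentions: roughly 30 minutes on 4.9.3 with the patched heartbeat script, and roughly 5 minutes as observed on 4.11.2. Everything in it (the `StoragePolicy` class, the `affected_hosts` helper, the host names) is hypothetical and only models the timings reported in the thread; it is not CloudStack code and says nothing about how the agent or the HA framework is actually implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoragePolicy:
    """Hypothetical model: how long a host may lose NFS before it is treated as failed."""
    name: str
    tolerance_minutes: int

    def host_treated_as_failed(self, outage_minutes: float) -> bool:
        # The host is considered failed once the outage reaches the tolerance window.
        return outage_minutes >= self.tolerance_minutes


# Windows taken from the timings reported in the thread (assumed, not measured here).
TOLERANT = StoragePolicy("4.9.3 with patched heartbeat script", 30)
STRICT = StoragePolicy("4.11.2 as observed", 5)


def affected_hosts(policy: StoragePolicy, outages: dict) -> list:
    """Return the sorted names of hosts whose NFS outage reaches the policy's window."""
    return sorted(
        host for host, minutes in outages.items()
        if policy.host_treated_as_failed(minutes)
    )


if __name__ == "__main__":
    # A 22-minute NFS disconnection like the one mentioned, plus one short blip.
    outages = {"kvm01": 22, "kvm02": 22, "kvm03": 3}
    for policy in (TOLERANT, STRICT):
        print(policy.name, "->", affected_hosts(policy, outages))
```

With a 22-minute outage, the 30-minute window leaves every host alone, while the 5-minute window marks both long-outage hosts as failed, which is the point at which duplicate VMs or host resets were reported.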