I will file a Jira for the issue.

On Tue, Oct 23, 2018 at 11:36 AM ilya musayev <ilya.mailing.li...@gmail.com> wrote:
> Would you please file the JIRA bugs describing in exact detail:
>
> 1) your setup
> 2) what was done or what happened
> 3) the expected result
>
> I imagine this will be fixed in the next point release if the issues are
> indeed correct. We've yet to try this framework, and if it does not work
> as anticipated we will have lots of issues.
>
> On Tue, Oct 23, 2018 at 8:30 AM Andrei Mikhailovsky
> <and...@arhont.com.invalid> wrote:
>
> > Hi Jean,
> >
> > I have previously done some HA testing and have pretty much come to
> > the same conclusions as you have. My testing showed that HA is
> > unreliable at best and loses data in the worst cases. I saw the
> > following outcomes across various testing scenarios:
> >
> > 1. Works as expected (very rarely)
> > 2. Starts 2 VMs on different hosts (data loss / corruption)
> > 3. Reboots ALL KVM hosts (even hosts that do not have a single VM
> >    with NFS volumes)
> >
> > Now, I cannot justify enabling HA with even a slim chance of hitting 2
> > or 3 above. Honestly, I do not know a single business that would be
> > happy to accept those scenarios. Frankly speaking, for me the
> > CloudStack HA options create more problems than they solve, so I have
> > not enabled them. I have decided that ACS with KVM is not HA friendly,
> > full stop. Having said this, I've not tested the last couple of
> > releases, so I will give them the benefit of the doubt and wait for
> > user reports to prove my conclusion otherwise. I've wasted enough of
> > my own time on KVM HA.
> >
> > My HA approach to ACS is more manual in nature, which in my experience
> > is far more reliable and less prone to issues. I have a monitoring
> > system that sends me alerts when VMs, host servers or storage become
> > unreachable. It is not as convenient as a fully working automatic HA,
> > I agree, but it is far better to be woken up at 3am to restart a
> > handful of VMs and perhaps force-reboot a KVM host than to deal with
> > mass KVM host reboots and/or hunt for duplicate VMs lurking somewhere
> > on the host servers. Been there, done that - NO THANKS!
> >
> > Cheers
> >
> > Andrei
> >
> > ----- Original Message -----
> > > From: "Jean-Francois Nadeau" <the.jfnad...@gmail.com>
> > > To: "users" <users@cloudstack.apache.org>
> > > Sent: Monday, 22 October, 2018 22:13:35
> > > Subject: Host HA vs transient NFS problems on KVM
> >
> > > Dear community,
> > >
> > > I want to share my concerns about upgrading from 4.9 to 4.11 with
> > > regard to how the host HA framework works and how it handles various
> > > failure conditions.
> > >
> > > Since we have been running CS 4.9.3 with NFS on KVM, VM HA has been
> > > working as expected when a hypervisor crashed.... and I agree we
> > > might have been lucky, knowing the limitations of the KVM
> > > investigator; the possibility of firing the same VM on 2 KVM hosts
> > > is real when you know the recipe for it.
> > >
> > > Still, on 4.9.3 we were tolerant of transient primary NFS storage
> > > access issues, typical of a network problem (and we've seen one
> > > recently: a 22-minute disconnection). Although these events are
> > > quite rare, when they do happen their blast radius can be huge for
> > > the business.
> > >
> > > So when we initially tested CS 4.9.3 we purposely blocked access to
> > > NFS and observed the results. Changing the kvmheartbeat.sh script so
> > > it doesn't reboot the node after 5 minutes was essential to defuse
> > > the potential for a massive KVM host reboot. In the end, it's far
> > > less damage to let NFS recover than to have all those VMs rebooted.
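> > >
> > > For reference, the branch we neutered looks roughly like this. I'm
> > > quoting from memory of our agent's copy (under
> > > /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/ on our
> > > hosts), and the exact contents vary between releases, so verify
> > > against the script your version actually ships before touching it:
> > >
> > >   # Fencing branch of kvmheartbeat.sh: the agent invokes the script
> > >   # with -c after it has repeatedly failed to write the heartbeat
> > >   # to the NFS heartbeat file.
> > >   if [ "$cflag" == "1" ]
> > >   then
> > >     /usr/bin/logger -t heartbeat "kvmheartbeat.sh unable to write the heartbeat to storage, rebooting!"
> > >     sync &
> > >     sleep 5
> > >     # Commenting out the next line is what defuses the forced
> > >     # reboot; the failure is still logged above, so monitoring can
> > >     # page on it instead.
> > >     # echo b > /proc/sysrq-trigger
> > >     exit $?
> > >   fi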
> > >
> > > On 4.9.3 the cloudstack-agent will remain "Up" and will not fire
> > > any VM twice if the NFS storage becomes available again within 30
> > > minutes.
> > >
> > > Now, testing the upgrade from 4.9 to 4.11 in our lab under the same
> > > failure conditions, we rapidly saw a different behavior, although
> > > not a perfectly consistent one. On 4.11.2 without host HA enabled,
> > > we see the agent "try" to disconnect after 5 minutes, though
> > > sometimes the KVM host goes into Disconnected state and sometimes it
> > > goes straight to Down state. In the latter case we'll see a
> > > duplicate VM created in no time, and once the NFS issue is resolved
> > > we have 2 copies of that VM while CloudStack only knows about the
> > > last copy. This is obviously a disaster, which forced us to look at
> > > how host HA can help.
> > >
> > > Now with host HA enabled and simulating the same NFS hiccup, we
> > > don't get duplicate VMs, but we do get a KVM host reset. The problem
> > > here is that, yes, host HA ensures we don't have duplicate VMs, but
> > > at scale this would also provoke a lot of KVM host resets (if not
> > > all of them). If host HA puts us at risk of massive KVM host resets,
> > > then I might prefer to disable host/VM HA entirely and just handle
> > > KVM host failures manually. That is super annoying for the ops team,
> > > but far less risky for the business.
> > >
> > > I'm trying to find whether there's a middle ground here between the
> > > 4.9 behavior with NFS hiccups and the reliability of the new host HA
> > > framework.
> > >
> > > best,
> > >
> > > Jean-Francois
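
For what it's worth, until the Jira gets traction: the closest thing to the
middle ground Jean-Francois describes may be that host HA in 4.11 is opt-in
per host and tunable per cluster. A rough cloudmonkey sketch follows; the
kvm.ha.* setting names are as documented for the 4.11 host HA framework, but
double-check them against your version, and the values and UUIDs below are
illustrative placeholders, not recommendations:

  # Make the KVM HA provider slower to escalate, so a short NFS hiccup
  # has a chance to clear before the host is declared unhealthy and fenced.
  update configuration name=kvm.ha.degraded.max.period value=600 clusterid=<cluster-uuid>
  update configuration name=kvm.ha.activity.check.interval value=60 clusterid=<cluster-uuid>
  update configuration name=kvm.ha.activity.check.max.attempts value=20 clusterid=<cluster-uuid>

  # Host HA is enabled explicitly per host, so it can be trialed on a few
  # hosts before a wider rollout.
  configureHAForHost hostid=<host-uuid> provider=kvmhaprovider
  enableHAForHost hostid=<host-uuid>

Whether those knobs stretch far enough to ride out something like a
22-minute NFS outage without a host reset is exactly what the Jira testing
should establish.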