Hi Remi,

So we started here with Andrei (v4.5) complaining that slow NFS causes a mass reboot:
http://mail-archives.apache.org/mod_mbox/cloudstack-dev/201510.mbox/%3C18886119.904.1444382474932.JavaMail.andrei%40tuchka%3E

My claim that the VM is not started until the HV is back is not based on
personal testing, alas, but on Marcus' statement below as well as Simon
Weller's reply in the "slow nfs = reboot all hosts" thread above:
http://mail-archives.apache.org/mod_mbox/cloudstack-dev/201508.mbox/%3CCALFpzo5CotX0Qz%2Bd_OXEZJGYTau%2BfA%2Bmzxg_yQEUzswi_9gz5w%40mail.gmail.com%3E

If what you say is true about the HV not having to come back then this is
great; we need to double-check that this is actually the case.
We could then try to tweak the settings in the heartbeat script to be more
forgiving about timeouts and/or add logic such as checking whether other
nodes or the mgmt server are online (which would show the HV still has
network) before rebooting.

Any further thoughts are welcome. I'll try to set up HA on my test
deployment and check.

Lucian

--
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro

----- Original Message -----
> From: "Remi Bergsma" <rberg...@schubergphilis.com>
> To: d...@cloudstack.apache.org
> Cc: "Cloudstack Users List" <users@cloudstack.apache.org>
> Sent: Saturday, 10 October, 2015 11:35:36
> Subject: Re: KVM HA is broken, let's fix it

> Hi Lucian,
>
> Can you please explain what the issue is with KVM HA? In my tests, HA
> starts all VMs just fine without the hypervisor coming back. At least that
> is on current 4.6. Assuming a cluster of multiple nodes of course. It will
> then do a neighbor check from another host in the same cluster.
>
> Also, malfunctioning NFS leads to corruption and therefore we fence a box
> when the shared storage is unreliable. Combining primary and secondary NFS
> is not a good idea for production in my opinion.
>
> I'm happy to help and if you have a scenario I can replay I will try that
> in my lab.
>
> Regards, Remi
>
> Sent from my iPhone
>
>> On 10 Oct 2015, at 00:19, Nux! <n...@li.nux.ro> wrote:
>>
>> Hello,
>>
>> Following a recent thread on the users ml where slow NFS caused a mass
>> reboot, I have opened the following issue about improving HA on KVM.
>> https://issues.apache.org/jira/browse/CLOUDSTACK-8943
>>
>> I know there are many people around here who use KVM and are interested
>> in a more robust way of doing HA.
>>
>> Please share your ideas, comments, suggestions, let's see what we can
>> come up with to make this better.
>>
>> Regards,
>> Lucian
>>
>> --
>> Sent from the Delta quadrant using Borg technology!
>>
>> Nux!
>> www.nux.ro
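
To make the heartbeat idea from the reply at the top a bit more concrete,
here is a rough Python sketch of the decision logic only. It is not the
actual CloudStack heartbeat script. All names (HeartbeatPolicy, max_failures,
network_ok, the Action values) are made up for illustration, and the policy
itself is just one possible reading of the proposal: tolerate a few missed
heartbeat writes, and self-fence only if the host also appears to have lost
its network. Whether holding off on the reboot is safe when storage is
unreliable is exactly the point still under discussion.

```python
import socket
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    OK = "ok"          # heartbeat write succeeded
    WAIT = "wait"      # failures seen, still within tolerance
    ALERT = "alert"    # storage failing but network is up: do not self-fence
    REBOOT = "reboot"  # storage failing and host looks isolated


@dataclass
class HeartbeatPolicy:
    # Hypothetical setting: consecutive failed heartbeat writes tolerated
    # before any action is taken.
    max_failures: int = 5

    def decide(self, consecutive_failures: int, network_ok: bool) -> Action:
        if consecutive_failures == 0:
            return Action.OK
        if consecutive_failures < self.max_failures:
            return Action.WAIT
        # Assumption: a host that can still reach its peers or the mgmt
        # server is not isolated, so it raises an alert instead of rebooting.
        return Action.ALERT if network_ok else Action.REBOOT


def can_reach(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def network_ok(peers, timeout: float = 2.0) -> bool:
    """True if at least one (host, port) pair in peers is reachable."""
    return any(can_reach(host, port, timeout) for host, port in peers)
```

A caller would count failed heartbeat writes itself and pass the count in,
e.g. HeartbeatPolicy(max_failures=3).decide(failures, network_ok(peers)),
where peers is a list of (host, port) pairs for other cluster nodes or the
mgmt server.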