> On 23 Apr 2018, at 10:52, Daniel Menzel <daniel.men...@hhi.fraunhofer.de> wrote:
>
> Hi Michal,
>
> in your last mail you wrote that the values can be turned down - how can
> this be done?
>
this is not anything we change very often, as it then decreases the system’s
tolerance to short network glitches.
You’d have to take a look at vdc_options and play with some of those
parameters… Martin/Eli may have some suggestions, otherwise you’d have to
read the source code and experiment.

> Best
> Daniel
>
> On 12.04.2018 20:29, Michal Skrivanek wrote:
>>
>>> On 12 Apr 2018, at 13:13, Daniel Menzel <daniel.men...@hhi.fraunhofer.de> wrote:
>>>
>>> Hi there,
>>>
>>> does anyone have an idea how to decrease a virtual machine's downtime?
>>>
>>> Best
>>> Daniel
>>>
>>> On 06.04.2018 13:34, Daniel Menzel wrote:
>>>> Hi Michal,
>>
>> Hi Daniel,
>> adding Martin to review fencing behavior
>>
>>>> (sorry for misspelling your name in my first mail).
>>
>> that’s not the reason I’m replying late! :-))
>>
>>>> The settings for the VMs are the following (oVirt 4.2):
>>>>
>>>> HA checkbox enabled of course
>>>> "Target Storage Domain for VM Lease" -> left empty
>>
>> if you need faster reactions then try to use VM Leases as well, it won’t
>> make a difference in this case but will help in case of network issues.
>> E.g. if you use iSCSI and the storage connection breaks while host
>> connection still works it would restart the VM in about 80s; otherwise it
>> would take >5 mins.
>>
>>>> "Resume Behavior" -> AUTO_RESUME
>>>> Priority for Migration -> High
>>>> "Watchdog Model" -> No-Watchdog
>>>>
>>>> For testing we did not kill any VM but the host. So basically we
>>>> simulated an instantaneous crash by manually turning the machine off via
>>>> IPMI-Interface (not via operating system!) and ping the guest(s). What
>>>> happens then?
>>>>
>>>> 2-3 seconds after we press the host's shutdown button we lose ping
>>>> contact to the VM(s).
>>>> After another 20s oVirt changes the host's status to "connecting", the
>>>> VM's status is set to a question mark.
>>>> After ~1:30 the host is flagged to "non responsive"
>>
>> that sounds about right. Now fencing action should have been initiated, if
>> you can share the engine logs we can confirm that. IIRC we first try soft
>> fencing - try to ssh to that host, that might take some time to time out I
>> guess. Martin?
>>
>>>> After ~2:10 the host's reboot is initiated by oVirt, 5-10s later the
>>>> guest is back online.
>>>>
>>>> So, there seems to be one mistake I made in the first mail: The downtime
>>>> is "only" 2.5min. But still I think this time can be decreased as for
>>>> some services it is still quite a long time.
>>
>> these values can be tuned down, but then you may be more susceptible to
>> fencing power cycling a host in case of shorter network outages. It may be
>> ok…depending on your requirements.
>>
>>>> Best
>>>> Daniel
>>>>
>>>> On 06.04.2018 12:49, Michal Skrivanek wrote:
>>>>>> On 6 Apr 2018, at 12:45, Daniel Menzel <daniel.men...@hhi.fraunhofer.de> wrote:
>>>>>>
>>>>>> Hi Michael,
>>>>>> thanks for your mail. Sorry, I forgot to write that. Yes, we have power
>>>>>> management and fencing enabled on all hosts. We also tested this and
>>>>>> found out that it works perfectly. So this cannot be the reason I guess.
>>>>>
>>>>> Hi Daniel,
>>>>> ok, then it’s worth looking into details. Can you describe in more detail
>>>>> what happens? What exact settings you’re using for such VM? Are you
>>>>> killing the HE VM or other VMs or both? Would be good to narrow it down a
>>>>> bit and then review the exact flow
>>>>>
>>>>> Thanks,
>>>>> michal
>>>>>
>>>>>> Daniel
>>>>>>
>>>>>> On 06.04.2018 11:11, Michal Skrivanek wrote:
>>>>>>>> On 4 Apr 2018, at 15:36, Daniel Menzel <daniel.men...@hhi.fraunhofer.de> wrote:
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> we're successfully using a setup with 4 Nodes and a replicated Gluster
>>>>>>>> for storage.
>>>>>>>> The engine is self hosted. What we're dealing with at the moment is
>>>>>>>> the high availability: If a node fails (for example simulated by a
>>>>>>>> forced power loss) the engine comes back up online within ~2min. But
>>>>>>>> guests (having the HA option enabled) come back online only after a
>>>>>>>> very long grace time of ~5min. As we have a reliable network (40 GbE)
>>>>>>>> and reliable servers I think that the default grace times are way too
>>>>>>>> high for us - is there any possibility to change those values?
>>>>>>>
>>>>>>> And do you have Power Management (iLO, iDRAC, etc.) configured for your
>>>>>>> hosts? Otherwise we have to resort to relatively long timeouts to make
>>>>>>> sure the host is really dead
>>>>>>> Thanks,
>>>>>>> michal
>>>>>>>
>>>>>>>> Thanks in advance!
>>>>>>>> Daniel
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Users mailing list
>>>>>>>> Users@ovirt.org
>>>>>>>> http://lists.ovirt.org/mailman/listinfo/users
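
To see where the ~2.5 min of downtime goes, the milestones reported in the
test quoted above can be tallied phase by phase. This is only a rough sketch
under stated assumptions: the ranges from the mail ("2-3 seconds", "5-10s
later") are collapsed to single representative values, "~1:30" and "~2:10"
are read as time since the host was powered off, and all names in the code
are made up for illustration. It does not query oVirt in any way.

```python
# Milestones from the test described in this thread, in seconds after the
# host was powered off via IPMI. Ranges are collapsed to one representative
# value each; that choice is an assumption of this sketch.
MILESTONES = [
    ("ping to guest lost", 3),                  # "2-3 seconds"
    ("host set to 'connecting'", 23),           # "another 20s"
    ("host flagged 'non responsive'", 90),      # "~1:30"
    ("host reboot initiated by oVirt", 130),    # "~2:10"
    ("guest back online", 140),                 # "5-10s later"
]


def phase_durations(milestones):
    """Return (phase label, seconds) for each gap between milestones."""
    phases = []
    previous_name, previous_t = "host powered off", 0
    for name, t in milestones:
        phases.append((f"{previous_name} -> {name}", t - previous_t))
        previous_name, previous_t = name, t
    return phases


def guest_downtime(milestones):
    """Seconds between losing ping to the guest and the guest being back."""
    return milestones[-1][1] - milestones[0][1]


if __name__ == "__main__":
    for label, seconds in phase_durations(MILESTONES):
        print(f"{seconds:4d}s  {label}")
    print(f"guest unreachable for about {guest_downtime(MILESTONES)}s")
```

With these values the two longest phases are the wait from "connecting" to
"non responsive" and the wait from "non responsive" to the reboot, which are
the intervals the replies in this thread describe as tunable at the cost of
less tolerance to short network outages.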