Re: [ovirt-users] Decrease downtime for HA

Michal Skrivanek Mon, 23 Apr 2018 10:07:16 -0700


> On 23 Apr 2018, at 10:52, Daniel Menzel <daniel.men...@hhi.fraunhofer.de> 
> wrote:
> 
> Hi Michal,
> 
> in your last mail you wrote that the values can be tuned down - how can 
> this be done?
> 
>


This is not something we change very often, as it decreases the system’s 
tolerance to short network glitches.
You’d have to take a look at vdc_options and experiment with some of those 
parameters. Martin/Eli may have some suggestions; otherwise you’d have to read 
the source code and experiment.
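
Before tuning anything it helps to see which phase of the failover costs the most time. The sketch below only restates the timeline reported in the quoted test further down (host powered off via IPMI, guest back after roughly 2:20). The event labels and the choice of the upper bound where a range was given ("2-3 seconds", "5-10s") are my assumptions, not oVirt terminology or measured values.

```python
# Timeline from the test described in this thread, in seconds after power-off.
# Labels are illustrative; where the mail gives a range, the upper bound is used.
EVENTS = [
    ("host powered off via IPMI", 0),
    ("ping to guest lost", 3),                 # "2-3 seconds"
    ("host status: connecting", 23),           # "after another 20s"
    ("host status: non responsive", 90),       # "~1:30"
    ("host reboot initiated", 130),            # "~2:10"
    ("guest back online", 140),                # "5-10s later"
]


def phase_durations(events):
    """Return (from_label, to_label, seconds) for each consecutive pair of events."""
    return [
        (a_label, b_label, b_time - a_time)
        for (a_label, a_time), (b_label, b_time) in zip(events, events[1:])
    ]


def guest_downtime(events):
    """Seconds between losing ping to the guest and the guest answering again."""
    times = dict(events)
    return times["guest back online"] - times["ping to guest lost"]


if __name__ == "__main__":
    for start, end, seconds in phase_durations(EVENTS):
        print(f"{seconds:4d}s  {start} -> {end}")
    print(f"guest downtime: {guest_downtime(EVENTS)}s")
```

With these numbers the two longest phases are the wait between "connecting" and "non responsive" and the wait between "non responsive" and the reboot, so those are the timeouts worth looking for in vdc_options first.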
> Best
> Daniel
> 
> On 12.04.2018 20:29, Michal Skrivanek wrote:
>> 
>> 
>>> On 12 Apr 2018, at 13:13, Daniel Menzel <daniel.men...@hhi.fraunhofer.de> wrote:
>>> 
>>> Hi there,
>>> 
>>> does anyone have an idea how to decrease a virtual machine's downtime?
>>> 
>>> Best
>>> Daniel
>>> 
>>> On 06.04.2018 13:34, Daniel Menzel wrote:
>>>> Hi Michal,
>>>> 
>>>> 
>> 
>> Hi Daniel,
>> adding Martin to review fencing behavior
>>>> (sorry for misspelling your name in my first mail).
>>>> 
>>>> 
>> 
>> that’s not the reason I’m replying late!:-))
>> 
>>>> The settings for the VMs are the following (oVirt 4.2):
>>>> 
>>>> HA checkbox enabled of course
>>>> "Target Storage Domain for VM Lease" -> left empty
>> 
>> if you need faster reactions then try to use VM Leases as well. It won’t 
>> make a difference in this case but will help in case of network issues. E.g. 
>> if you use iSCSI and the storage connection breaks while the host connection 
>> still works, it would restart the VM in about 80s; otherwise it would take >5 
>> mins. 
>>>> "Resume Behavior" -> AUTO_RESUME
>>>> Priority for Migration -> High
>>>> "Watchdog Model" -> No-Watchdog
>>>> For testing we did not kill any VM but the host. So basically we simulated 
>>>> an instantaneous crash by manually turning the machine off via the 
>>>> IPMI interface (not via the operating system!) and pinged the guest(s). What 
>>>> happens then?
>>>> 
>>>> 2-3 seconds after we press the host's shutdown button we lose ping 
>>>> contact to the VM(s).
>>>> After another 20s oVirt changes the host's status to "connecting", the 
>>>> VM's status is set to a question mark.
>>>> After ~1:30 the host is flagged as "non responsive".
>> 
>> that sounds about right. Now fencing action should have been initiated, if 
>> you can share the engine logs we can confirm that. IIRC we first try soft 
>> fencing - try to ssh to that host, that might take some time to time out I 
>> guess. Martin?
>>>> 
>>>> After ~2:10 the host's reboot is initiated by oVirt, 5-10s later the guest 
>>>> is back online.
>>>> So, there seems to be one mistake I made in the first mail: The downtime 
>>>> is "only" 2.5min. But still I think this time can be decreased as for some 
>>>> services it is still quite a long time.
>>>> 
>>>> 
>> 
>> these values can be tuned down, but then you may be more susceptible to 
>> fencing power cycling a host in case of shorter network outages. It may be 
>> ok…depending on your requirements.
>>>> Best
>>>> Daniel
>>>> 
>>>> On 06.04.2018 12:49, Michal Skrivanek wrote:
>>>>>> On 6 Apr 2018, at 12:45, Daniel Menzel <daniel.men...@hhi.fraunhofer.de> wrote:
>>>>>> 
>>>>>> Hi Michael,
>>>>>> thanks for your mail. Sorry, I forgot to write that. Yes, we have power 
>>>>>> management and fencing enabled on all hosts. We also tested this and 
>>>>>> found out that it works perfectly. So this cannot be the reason I guess.
>>>>> Hi Daniel,
>>>>> ok, then it’s worth looking into details. Can you describe in more detail 
>>>>> what happens? What exact settings are you using for such a VM? Are you 
>>>>> killing the HE VM or other VMs or both? Would be good to narrow it down a 
>>>>> bit and then review the exact flow.
>>>>> 
>>>>> Thanks,
>>>>> michal
>>>>> 
>>>>>> Daniel
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 06.04.2018 11:11, Michal Skrivanek wrote:
>>>>>>>> On 4 Apr 2018, at 15:36, Daniel Menzel 
>>>>>>>> <daniel.men...@hhi.fraunhofer.de> wrote:
>>>>>>>> 
>>>>>>>> Hello,
>>>>>>>> 
>>>>>>>> we're successfully using a setup with 4 Nodes and a replicated Gluster 
>>>>>>>> for storage. The engine is self hosted. What we're dealing with at the 
>>>>>>>> moment is the high availability: If a node fails (for example 
>>>>>>>> simulated by a forced power loss) the engine comes back online 
>>>>>>>> within ~2min. But guests (having the HA option enabled) come back 
>>>>>>>> online only after a very long grace time of ~5min. As we have a 
>>>>>>>> reliable network (40 GbE) and reliable servers I think that the 
>>>>>>>> default grace times are way too high for us - is there any possibility 
>>>>>>>> to change those values?
>>>>>>> And do you have Power Management (iLO, iDRAC, etc.) configured for your 
>>>>>>> hosts? Otherwise we have to resort to relatively long timeouts to make 
>>>>>>> sure the host is really dead.
>>>>>>> Thanks,
>>>>>>> michal
>>>>>>>> Thanks in advance!
>>>>>>>> Daniel
>>>>>>>> 
>>>>>>>> _______________________________________________
>>>>>>>> Users mailing list
>>>>>>>> Users@ovirt.org <mailto:Users@ovirt.org>
>>>>>>>> http://lists.ovirt.org/mailman/listinfo/users 
>>>>>>>> <http://lists.ovirt.org/mailman/listinfo/users>
>>>>>>>> 
>>>>>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>> 
>

