[ovirt-users] Re: Host Reboot Timeout of 10 Minutes

Peter H Mon, 27 Feb 2023 02:31:42 -0800

On Wed, Jan 25, 2023 at 8:22 AM Yedidyah Bar David <d...@redhat.com> wrote:
>
> On Wed, Jan 25, 2023 at 2:08 AM Peter H <pe...@hashbang.org> wrote:
> > Does a config file exist where this timeout can be set to a lower value?
>
> I intended to provide a short reply just pointing out what value to
> change, then realized this might not be helpful, so decided to give up
> and not reply. Then I decided to take this opportunity and write the
> following.
>
> For background, please see:
> https://lists.ovirt.org/archives/list/users@ovirt.org/thread/HEKKBM6MZEKBEAXTJT45N5BZT72VI67T/


Thanks for taking the time to answer me at such length. I was unaware
of the tool engine-config(1) even after maintaining an oVirt cluster
for 3 years... The tool was just what I was looking for.

> You do not need to be a developer, to search and read source code. One
> of the biggest advantages of FOSS is that you can do this, even
> without knowing how to write/update it.

I installed my first Linux OS (Slackware) back in '93 so I have
downloaded and compiled my fair share of Free/OpenSource projects. In
the last century I also got a couple of kernel patches accepted. I
actually checked the code a couple of years ago while investigating
the logic behind the dropdown menu regarding VM types. I found out the
UI code was written in some Java framework that was quite hard to
understand for me.

> My main work in oVirt was in packaging/setup/backup/restore, not in
> the engine itself or vdsm - the two main parts of the project. But I
> know enough to guess that the error message you got is from the
> engine. I already have the engine source code git cloned on my laptop,
> so grepped it for 'for server to finish reboot', and found this in
> backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/VdsCommand.java:

I acknowledge that I could have found the error message in the code,
but I'm not sure I would then have made the connection that led me to
the engine-config(1) tool.

>     private void sleepOnReboot(final VDSStatus status) {
>         int sleepTimeInSec = Config.<Integer>getValue(ConfigValues.ServerRebootTimeout);
>         log.info("Waiting {} seconds, for server to finish reboot process.",
>                 sleepTimeInSec);
>
> Even without knowing Java, ServerRebootTimeout seems relevant.
> grepping for this, finds it also in:
>
> packaging/dbscripts/upgrade/pre_upgrade/0000_config.sql:582:select
> fn_db_add_config_value('ServerRebootTimeout','600','general');
> packaging/dbscripts/upgrade/pre_upgrade/0000_config.sql:1460:--
> Increase default ServerRebootTimeout from 5 to 10 minutes
> packaging/dbscripts/upgrade/pre_upgrade/0000_config.sql:1461:select
> fn_db_update_default_config_value('ServerRebootTimeout', '300', '600',
> 'general', false);
>
> where it's set and then updated, and in:
>
> packaging/etc/engine-config/engine-config.properties:119:ServerRebootTimeout.description="Host
> Reboot Timeout (in seconds)"
> packaging/etc/engine-config/engine-config.properties:120:ServerRebootTimeout.type=Integer
>
> where it's exposed to engine-config. So if all you want is to get this
> error message earlier, this should be enough.

I can confirm that the ServerRebootTimeout is set to 300 in our
current 4.4.1 installation.
I have also tested that I can change it in my 4.5.4 test system using:

engine-config -s ServerRebootTimeout=300
systemctl restart ovirt-engine
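
For anyone who wants to check the value before changing it, here is a
minimal sketch. It assumes that "engine-config -g" prints a line of the
form "ServerRebootTimeout: 600 version: general" and that it is run as
root on the engine machine. The names wanted and parse_timeout are made
up for illustration.

```shell
#!/bin/sh
# Sketch only: set ServerRebootTimeout when it differs from the wanted value.
# Assumption: "engine-config -g KEY" prints "KEY: VALUE version: general".

wanted=300   # seconds; hypothetical target value

# Pull the numeric value out of the engine-config -g output line.
parse_timeout() {
    sed -n 's/^ServerRebootTimeout: \([0-9][0-9]*\).*/\1/p'
}

# Only touch the engine when the tool is actually installed on this machine.
if command -v engine-config >/dev/null 2>&1; then
    current=$(engine-config -g ServerRebootTimeout | parse_timeout)
    if [ "$current" != "$wanted" ]; then
        engine-config -s ServerRebootTimeout="$wanted"
        # Same restart as in the commands above, so the engine picks up the value.
        systemctl restart ovirt-engine
    fi
fi
```

If the output line cannot be parsed, current stays empty and the value
is set anyway, which seemed the safer default for a sketch.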

> However, I also checked the git log (or blame, if you want, but I
> prefer the log) for the former file, trying to understand when and why
> it was changed from 5 to 10 minutes. 'git log -u
> packaging/dbscripts/upgrade/pre_upgrade/0000_config.sql' and then
> searching for 'ServerRebootTimeout' finds
> https://github.com/oVirt/ovirt-engine/commit/d324bbdd . This links to
> https://bugzilla.redhat.com/1947403 . That one sadly does not provide
> many more details. It does show that it was done in 4.4.6. So I can
> only guess that one of two things happened:
>
> 1. Someone complained that hosts become non-operational e.g. because
> their boot sequence/POST/whatever takes more than 5 minutes. Perhaps
> this was rare enough to be reported and handled only recently (two
> years ago, and not, say, 10). (Although I personally managed machines
> that needed more than 5 minutes to reboot, or even just test the RAM -
> but that's indeed rare).

We have some HP servers in our 4.4.1 installation that take around 6-7
minutes to reboot after selecting Restart through the SSH Management
(dropdown menu). Even though this is longer than 5 minutes (300 secs),
the installation never fails. The state is Reboot for 5 minutes, then
it goes into NonResponsive for some minutes until the state is set to
Up.

> 2. Something else changed, and made this less comfortable. E.g.
> perhaps the engine didn't move them in the past to non-operational and
> now does, or something like that.
>
> Not sure which of these, if at all.

I'm seeing some differences between 4.4.1 and 4.5.4.
In 4.4.1, with the default timeout of 5 minutes (300 secs, as noted
above), a host would be set to Up just after reboot. That could be
within 3, 5 or 7 minutes.

In 4.5.4 this seems to have changed.
With a 10 minute timeout, a host that reboots in 5 minutes will not be
set to "Up" until the full 10 minutes have passed. It's a long time to
sit and wait when you know the host is up.
If I lower the timeout to 1 minute the host state will go from Reboot
to NonResponsive and finally Up very shortly after the host has
started up again.

So in 4.5.4 it seems that hosts can't connect until
ServerRebootTimeout seconds after the reboot was initiated.
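
To make the observation concrete, here is a toy model of what I see in
4.5.4. It is not the engine's actual logic. It assumes the host is shown
as Up at whichever comes later, the end of the reboot or the end of
ServerRebootTimeout. The function name seconds_until_up is made up.

```shell
#!/bin/sh
# Toy model of the behaviour observed in 4.5.4, not the engine's code.
# Assumption: the host becomes Up at max(reboot duration, ServerRebootTimeout).

# usage: seconds_until_up REBOOT_SECONDS TIMEOUT_SECONDS
seconds_until_up() {
    if [ "$1" -gt "$2" ]; then
        echo "$1"
    else
        echo "$2"
    fi
}

seconds_until_up 300 600   # 10 minute timeout: prints 600
seconds_until_up 300 60    # 1 minute timeout: prints 300
```

Under this model a 5 minute reboot with a 1 minute timeout leaves the
host in NonResponsive for roughly 300 - 60 = 240 seconds.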

> You are welcome to change it to some low value using engine-config and
> see if it helps. If it's "just enough", you should notice no
> difference from previous versions.

Setting the timeout to 1 minute seems to work fine. We will probably
just need to work on our alerting rules because the hosts will be in
NonResponsive state for a while before coming "Up".

> 3. Try to find out why the above patch was needed and think if it's
> important enough for you to dive deeper and provide, or at least
> suggest, a fix/change/whatever that will make it unnecessary.
>
> That said, we (the RHV team) are still interested, and if you think
> it's a significant issue/bug, we might decide to spend the time and
> come up with a fix/change ourselves. In the past, I might have tried
> to guess who might know more details and Cced them. But at this point
> in time, I decided it's better to explain what users can do by
> themselves. I hope this is helpful, and that enough people will take
> the time to read, understand, and apply my explanation as applicable.
>
> Best regards,
> --
> Didi

The issue is not a blocker for us as things work. It's just a matter
of how long we have to wait or deal with the NonResponsive state. We
can install hosts and add them to the cluster without failures.

For versions 4.5.1 - 4.5.3 it was another matter. There the reboot
timeout caused the installations to fail. I wonder how this was not
noticed by others or caught as a regression.

Once again, thanks for your answer.

BR
Peter
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/YXDCFARDLFANQ3ANG7OXMAEVI5RVVQ25/
