Nir Soffer <nsof...@redhat.com> writes:

> On Fri, Nov 12, 2021 at 9:15 PM Milan Zamazal <mzama...@redhat.com> wrote:
>>
>> Hi,
>>
>>
>> Michal has observed occasional OST failures in test_vdsm_recovery in
>> recent days, which hadn't been seen before.  When `systemctl stop
>> vdsmd' is called (via Ansible) there, vdsmd (almost?) never finishes
>> its shutdown within the 10-second timeout and is then killed with
>> SIGKILL.  If this action is accompanied by a "Job for vdsmd.service
>> canceled." message then the test fails; otherwise OST continues
>> normally.
>>
>> The situation is reproducible by running OST basic-suite-master and
>> making it fail artificially after test_vdsm_recovery.  Then
>> `systemctl stop vdsmd' can be run manually on the given OST host
>> (repeatedly, which provides a good opportunity to examine the
>> problem).
>>
>> There are two problems there:
>>
>> - The "Job for vdsmd.service canceled." message that sometimes occurs
>>   after `systemctl stop vdsmd', in which case the test fails.  I don't
>>   know what it means and I can't identify any difference in the
>>   journal between runs where the message occurs and runs where it
>>   doesn't.
>
> This is likely caused by the hosted engine agent discovering that the
> hosted engine storage domain was deactivated, and panicking. systemd
> will try to restart the hosted engine agent, and since the agent
> requires vdsm, systemd may start vdsm while it is being shut down, and
> you will get this "canceled" message.
>
> Basically, if you want to stop vdsm, you must first stop the other
> services on the host that require it.
>
> If we want to avoid this craziness, we need to fix the other services
> not to require vdsm. It is not possible to manage a service when other
> services require it.
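>
> For illustration, on a hosted engine host something like this should
> avoid the canceled job (an assumption on my side; check what
> `systemctl list-dependencies --reverse vdsmd.service` actually reports
> on your host):
>
>     # Stop the services that require vdsmd first, then vdsmd itself.
>     systemctl stop ovirt-ha-agent ovirt-ha-broker
>     systemctl stop vdsmd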
>
>> - The fact that Vdsm doesn't stop within the timeout and must be killed.
>>   This doesn't happen in my normal oVirt installation.  It apparently
>>   blocks in self.irs.prepareForShutdown() call from clientIF.py.
>
> This has been normal for the last 7-8 years. Vdsm storage shutdown is
> broken and is likely to get stuck during shutdown. However, it never
> caused trouble in the past.
>
>>   Journal says:
>>
>>     systemd[1]: Stopping Virtual Desktop Server Manager...
>>     systemd[1]: vdsmd.service: State 'stop-sigterm' timed out. Killing.
>>     systemd[1]: vdsmd.service: Killing process 132608 (vdsmd) with signal SIGKILL.
>>     systemd[1]: vdsmd.service: Killing process 133445 (ioprocess) with signal SIGKILL.
>>     systemd[1]: vdsmd.service: Killing process 133446 (ioprocess) with signal SIGKILL.
>>     systemd[1]: vdsmd.service: Killing process 133447 (ioprocess) with signal SIGKILL.
>>     systemd[1]: vdsmd.service: Main process exited, code=killed, status=9/KILL
>>     systemd[1]: vdsmd.service: Failed with result 'timeout'.
>>     systemd[1]: Stopped Virtual Desktop Server Manager.
>>
>>   And vdsm.log (from a different run, sorry):
> ...
>>     2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor] Shutting down domain monitors (monitor:243)
>
> This is shutdown flow - good...
>
>>     2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor] Stop monitoring 15fa3d6c-671b-46ef-af9a-00337011fa26 (shutdown=True) (monitor:268)
>
> We stop the monitor with the shutdown flag - good...
>
>>     2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor] Stop monitoring a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c (shutdown=True) (monitor:268)
> ...
>
>>     2021-11-12 07:09:32,838+0000 INFO (monitor/a8bab4e) [storage.check] Stop checking '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_share1/a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c/dom_md/metadata' (check:135)
>
> We stopped checking the domain - good...
>
>>     2021-11-12 07:09:32,867+0000 INFO (monitor/186180c) [storage.blocksd] Tearing down domain 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (blockSD:996)
>
> Tearing down the storage domain - bad!
>
> This is a bug that was hidden for several years, since tearing down a
> storage domain did not really tear down the storage. Now that we really
> tear down the storage, it will deactivate logical volumes and remove
> device mapper devices.
>
>>     2021-11-12 07:09:32,867+0000 DEBUG (monitor/186180c) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/lvm vgchange --config 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|", "r|.*|"] hints="none" obtain_device_list_from_udev=0 } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 use_lvmpolld=1 } backup { retain_min=50 retain_days=0 }' --available n 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (cwd None) (commands:154)
>
> This is an attempt to deactivate a logical volume, which should not
> happen during shutdown...
>
>>     2021-11-12 07:09:38,063+0000 DEBUG (monitor/186180c) [common.commands] FAILED: <err> = b' Logical volume 186180cb-5cc5-4aa4-868a-9e1ed7965ddf/ids in use.\n Can\'t deactivate volume group "186180cb-5cc5-4aa4-868a-9e1ed7965ddf" with 1 open logical volume(s)\n'; <rc> = 5 (commands:186)
>
> And it fails, since the ids logical volume is in use (most likely held
> open by the sanlock lockspace).
>
>>   What is interesting here is that ioprocess is started rather than
>>   being killed, as on clean vdsmd shutdowns.
>
> This is normal: when systemd kills ioprocess, vdsm tries to restart it.
> We need to improve the ioprocess retry mechanism to wait before
> restarting, so it can learn that we are in the middle of a shutdown.
> Basically this is just noise, since systemd will kill these ioprocess
> processes anyway when vdsm terminates.
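>
> A minimal sketch of such a delay (hypothetical names, not the actual
> ioprocess client code):
>
>     import time
>
>     RESTART_DELAY = 1.0  # seconds; gives a shutdown a chance to be noticed
>
>     def _restart_ioprocess(client):
>         # Wait before respawning the child process, so a shutdown
>         # started in the meantime can mark the client as closing.
>         time.sleep(RESTART_DELAY)
>         if client.closing:
>             # vdsm is shutting down; do not respawn a doomed child.
>             return
>         client.start()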
>
> This should fix the wrong teardown during shutdown:
> https://gerrit.ovirt.org/c/vdsm/+/117616
>
> Didi: this may also be related to:
> https://bugzilla.redhat.com/1986732
>
> If vdsm is stopped during that flow, it will tear down the hosted
> engine storage domain during shutdown. Tearing down a storage domain
> should happen only when vdsm is asked to deactivate and disconnect from
> the storage, not during shutdown.
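>
> Conceptually the fix makes the shutdown path skip the teardown. A
> rough sketch of the idea (hypothetical names and structure; the real
> change is in the gerrit patch above):
>
>     # In the domain monitor stop flow (sketch, not actual vdsm code).
>     def _stop_monitor(self, shutdown=False):
>         self._stop_checkers()
>         if shutdown:
>             # vdsm is shutting down: leave the domain set up, keeping
>             # logical volumes and device mapper devices in place.
>             return
>         # Explicit deactivate/disconnect: really tear down the domain.
>         self._domain.teardown()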

Thank you for all the explanations; it makes sense now.

> If you know how to reproduce the issue, please try with this fix
> and report if it makes a difference.

vdsmd stops cleanly now in an environment where it previously didn't,
thank you for the fix.

It cannot be reliably tested with OST itself, because ost-*-host-1, on
which the test usually runs, is a Node host, which prevents package
upgrades from normal repos; a repo passed via --custom-repo is thus
effectively ignored :-O (I wonder what we actually test in gating on
Jenkins).  The best way to test whether the shutdown fix avoids the OST
failures would be to merge the patch, I think.

> Thanks for the detailed report!
>
> Nir