On Fri, Nov 12, 2021 at 9:15 PM Milan Zamazal wrote:
>
> Hi,
>
> Michal has observed occasional OST failures in test_vdsm_recovery last
> days, which hadn't been seen before. When `systemctl stop vdsmd' is
> called (via Ansible) there, vdsmd (almost?) never finishes its shutdown
> within the 10 seconds timeout and gets then killed with SIGKILL. If
> this action is accompanied by "Job for vdsmd.service canceled." message
> then the test fails; otherwise OST continues normally.
>
> The situation is reproducible by running OST basic-suite-master and
> making it artificially failing after test_vdsm_recovery. Then running
> `systemctl stop vdsmd' manually on the given OST host (can be done
> repeatedly, so it provides a good opportunity to examine the problem).
>
> There are two problems there:
>
> - "Job for vdsmd.service canceled." message that sometimes occurs after
> `systemctl stop vdsmd' and then the test fails. I don't know what it
> means and I can't identify any difference in journal between when the
> message occurs and when it doesn't.
This is likely caused by hosted engine agent discovering that hosted engine
storage domain was deactivated and panicking. systemd will try to restart
the hosted engine agent, and since it requires vdsm, systemd may start vdsm
while it is being shut down, and you will get this "canceled" message.
Basically if you want to stop vdsm you must stop other services requiring it
on the host.
If we want to avoid this craziness, we need to fix other services not to require
vdsm. It is not possible to manage a service when other services require it.
> - The fact that Vdsm doesn't stop within the timeout and must be killed.
> This doesn't happen in my normal oVirt installation. It apparently
> blocks in self.irs.prepareForShutdown() call from clientIF.py.
This is normal in the last 7-8 years. Vdsm storage shutdown is broken
and is likely
to get stuck during shutdown. However, It never caused trouble in the past.
> Journal says:
>
> systemd[1]: Stopping Virtual Desktop Server Manager...
> systemd[1]: vdsmd.service: State 'stop-sigterm' timed out. Killing.
> systemd[1]: vdsmd.service: Killing process 132608 (vdsmd) with signal
> SIGKILL.
> systemd[1]: vdsmd.service: Killing process 133445 (ioprocess) with signal
> SIGKILL.
> systemd[1]: vdsmd.service: Killing process 133446 (ioprocess) with signal
> SIGKILL.
> systemd[1]: vdsmd.service: Killing process 133447 (ioprocess) with signal
> SIGKILL.
> systemd[1]: vdsmd.service: Main process exited, code=killed, status=9/KILL
> systemd[1]: vdsmd.service: Failed with result 'timeout'.
> systemd[1]: Stopped Virtual Desktop Server Manager.
>
> And vdsm.log (from a different run, sorry):
...
> 2021-11-12 07:09:32,820+ INFO (MainThread) [storage.monitor]
> Shutting down domain monitors (monitor:243)
This is shutdown flow - good...
> 2021-11-12 07:09:32,820+ INFO (MainThread) [storage.monitor] Stop
> monitoring 15fa3d6c-671b-46ef-af9a-00337011fa26 (shutdown=True) (monitor:268)
We stop the monitor with the shutdown flag - good...
> 2021-11-12 07:09:32,820+ INFO (MainThread) [storage.monitor] Stop
> monitoring a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c (shutdown=True) (monitor:268)
...
> 2021-11-12 07:09:32,838+ INFO (monitor/a8bab4e) [storage.check] Stop
> checking
> '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_share1/a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c/dom_md/metadata'
> (check:135)
We stopped checking the domain - good...
> 2021-11-12 07:09:32,867+ INFO (monitor/186180c) [storage.blocksd]
> Tearing down domain 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (blockSD:996)
Tearing down the storage domain - bad!
This is a bug that was hidden for several years since tearing down a
storage domain
did not really tear down the storage. Now that we really tear down the
storage, it will
deactivate logical volumes and remove device mapper devices.
> 2021-11-12 07:09:32,867+ DEBUG (monitor/186180c) [common.commands]
> /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/lvm vgchange --config
> 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1
> write_cache_state=0 disable_after_error_count=3
> filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|",
> "r|.*|"] hints="none" obtain_device_list_from_udev=0 } global {
> locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0
> use_lvmpolld=1 } backup { retain_min=50 retain_days=0 }' --available n
> 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (cwd None) (commands:154)
This an attempt to deactivate a logical volume that should not happen
during shutdown...
> 2021-11-12 07:09:38,063+ DEBUG (monitor/186180c) [common.commands]
> FAILED: = b' Logical volume 186180cb-5cc5-4aa4-868a-9e1ed7965ddf/ids
> in use.\n Can\'t deactivate volume group
> "186180cb-