Nir Soffer <nsof...@redhat.com> writes:

> On Fri, Nov 12, 2021 at 9:15 PM Milan Zamazal <mzama...@redhat.com> wrote:
>>
>> Hi,
>>
>> Michal has observed occasional OST failures in test_vdsm_recovery in
>> recent days, which hadn't been seen before. When `systemctl stop vdsmd'
>> is called (via Ansible) there, vdsmd (almost?) never finishes its
>> shutdown within the 10 seconds timeout and then gets killed with
>> SIGKILL. If this action is accompanied by a "Job for vdsmd.service
>> canceled." message then the test fails; otherwise OST continues
>> normally.
>>
>> The situation is reproducible by running OST basic-suite-master and
>> making it fail artificially after test_vdsm_recovery. Then one can run
>> `systemctl stop vdsmd' manually on the given OST host (this can be done
>> repeatedly, so it provides a good opportunity to examine the problem).
>>
>> There are two problems there:
>>
>> - The "Job for vdsmd.service canceled." message that sometimes occurs
>>   after `systemctl stop vdsmd', in which case the test fails. I don't
>>   know what it means and I can't identify any difference in the journal
>>   between when the message occurs and when it doesn't.
>
> This is likely caused by the hosted engine agent discovering that the
> hosted engine storage domain was deactivated and panicking. systemd
> will try to restart the hosted engine agent, and since it requires
> vdsm, systemd may start vdsm while it is being shut down, and you will
> get this "canceled" message.
>
> Basically, if you want to stop vdsm, you must first stop the other
> services on the host that require it.
>
> If we want to avoid this craziness, we need to fix the other services
> not to require vdsm. It is not possible to manage a service when other
> services require it.
>
>> - The fact that Vdsm doesn't stop within the timeout and must be
>>   killed. This doesn't happen in my normal oVirt installation. It
>>   apparently blocks in the self.irs.prepareForShutdown() call from
>>   clientIF.py.
>
> This has been normal for the last 7-8 years. Vdsm storage shutdown is
> broken and is likely to get stuck. However, it never caused trouble in
> the past.
>
>> Journal says:
>>
>>   systemd[1]: Stopping Virtual Desktop Server Manager...
>>   systemd[1]: vdsmd.service: State 'stop-sigterm' timed out. Killing.
>>   systemd[1]: vdsmd.service: Killing process 132608 (vdsmd) with signal SIGKILL.
>>   systemd[1]: vdsmd.service: Killing process 133445 (ioprocess) with signal SIGKILL.
>>   systemd[1]: vdsmd.service: Killing process 133446 (ioprocess) with signal SIGKILL.
>>   systemd[1]: vdsmd.service: Killing process 133447 (ioprocess) with signal SIGKILL.
>>   systemd[1]: vdsmd.service: Main process exited, code=killed, status=9/KILL
>>   systemd[1]: vdsmd.service: Failed with result 'timeout'.
>>   systemd[1]: Stopped Virtual Desktop Server Manager.
>>
>> And vdsm.log (from a different run, sorry):
> ...
>> 2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor]
>> Shutting down domain monitors (monitor:243)
>
> This is the shutdown flow - good...
>
>> 2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor]
>> Stop monitoring 15fa3d6c-671b-46ef-af9a-00337011fa26 (shutdown=True)
>> (monitor:268)
>
> We stop the monitor with the shutdown flag - good...
>
>> 2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor]
>> Stop monitoring a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c (shutdown=True)
>> (monitor:268)
> ...
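(To check that I understand the shutdown flag correctly: I would expect
the stop flow to behave roughly like the following sketch -- names and
structure are mine, hypothetical and much simplified, not the actual
vdsm code -- where shutdown=True means "stop checking the domain, but
leave the storage alone":

    import threading

    class DomainMonitor:
        """Hypothetical sketch, not the real vdsm monitor code."""

        def __init__(self, sd_uuid):
            self.sd_uuid = sd_uuid
            self._stop_event = threading.Event()
            self._shutdown = False
            self._thread = threading.Thread(target=self._run, daemon=True)
            self._thread.start()

        def stop(self, shutdown=False):
            # shutdown=True: the whole process is going down, so only
            # stop checking the domain and leave the storage connected.
            self._shutdown = shutdown
            self._stop_event.set()
            self._thread.join()

        def _run(self):
            try:
                while not self._stop_event.wait(10):
                    pass  # periodic domain health checks would go here
            finally:
                if not self._shutdown:
                    # Deactivating LVs and removing device mapper
                    # devices belongs to an explicit stop only.
                    self._teardown()

        def _teardown(self):
            print("tearing down", self.sd_uuid)

That is, stop(shutdown=True) should never reach _teardown().)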
>
>> 2021-11-12 07:09:32,838+0000 INFO (monitor/a8bab4e) [storage.check]
>> Stop checking
>> '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_share1/a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c/dom_md/metadata'
>> (check:135)
>
> We stopped checking the domain - good...
>
>> 2021-11-12 07:09:32,867+0000 INFO (monitor/186180c) [storage.blocksd]
>> Tearing down domain 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (blockSD:996)
>
> Tearing down the storage domain - bad!
>
> This is a bug that was hidden for several years, since tearing down a
> storage domain did not really tear down the storage. Now that we really
> tear down the storage, it will deactivate logical volumes and remove
> device mapper devices.
>
>> 2021-11-12 07:09:32,867+0000 DEBUG (monitor/186180c) [common.commands]
>> /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/lvm vgchange
>> --config 'devices { preferred_names=["^/dev/mapper/"]
>> ignore_suspended_devices=1 write_cache_state=0
>> disable_after_error_count=3
>> filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|", "r|.*|"]
>> hints="none" obtain_device_list_from_udev=0 } global { locking_type=1
>> prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0
>> use_lvmpolld=1 } backup { retain_min=50 retain_days=0 }' --available n
>> 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (cwd None) (commands:154)
>
> This is an attempt to deactivate a logical volume that should not
> happen during shutdown...
>
>> 2021-11-12 07:09:38,063+0000 DEBUG (monitor/186180c) [common.commands]
>> FAILED: <err> = b' Logical volume
>> 186180cb-5cc5-4aa4-868a-9e1ed7965ddf/ids in use.\n Can\'t deactivate
>> volume group "186180cb-5cc5-4aa4-868a-9e1ed7965ddf" with 1 open
>> logical volume(s)\n'; <rc> = 5 (commands:186)
>
> And it fails, since the logical volume is in use.
>
>> What is interesting here is that ioprocess is started here rather than
>> being killed as on clean vdsmd shutdowns.
>
> This is normal: when systemd kills ioprocess, vdsm tries to restart it.
> We need to improve the ioprocess retry mechanism to wait before
> restarting, so it can learn that we are in the middle of a shutdown.
> Basically this is just noise, since systemd will kill these ioprocess
> processes anyway when vdsm terminates.
>
> This should fix the wrong teardown during shutdown:
> https://gerrit.ovirt.org/c/vdsm/+/117616
>
> Didi: this may also be related to:
> https://bugzilla.redhat.com/1986732
>
> If vdsm is stopped during that flow, it will tear down the hosted
> engine storage domain during shutdown. Tearing down the storage domain
> should happen only when vdsm is asked to deactivate and disconnect from
> the storage, not during shutdown.
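Regarding the ioprocess restarts: just to be sure I follow the "wait
before restarting" idea, is it something along these lines? (My own
hypothetical sketch, not the actual ioprocess client code.)

    import time

    def supervise(start_child, shutting_down, max_delay=30.0):
        """Restart a child process with increasing delays so that a
        shutdown in progress can be noticed before respawning.
        Hypothetical sketch, not the real ioprocess client."""
        delay = 1.0
        while not shutting_down():
            start_child()  # blocks until the child exits or is killed
            # The child died; back off before restarting, so that if
            # vdsm is being stopped we notice it here instead of
            # respawning into the middle of the shutdown.
            time.sleep(delay)
            delay = min(delay * 2, max_delay)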
Thank you for all the explanations, it makes sense now.

> If you know how to reproduce the issue, please try with this fix
> and report if it makes a difference.

vdsmd stops cleanly now in an environment where it previously didn't,
thank you for the fix.

It cannot be reliably tested with OST itself, because ost-*-host-1, on
which the test usually runs, is a Node host, which prevents package
upgrades from normal repos, and thus a repo passed via --custom-repo is
effectively ignored :-O (I wonder what we actually test in gating on
Jenkins.) The best way to test whether the shutdown fix avoids the OST
failures would be to merge the patch, I think.

> Thanks for the detailed report!
>
> Nir