Hi,

Michal has observed occasional OST failures in test_vdsm_recovery over
the last few days, which hadn't been seen before.  When `systemctl stop
vdsmd' is called there (via Ansible), vdsmd (almost?) never finishes its
shutdown within the 10-second timeout and is then killed with SIGKILL.
If this action is accompanied by a "Job for vdsmd.service canceled."
message, the test fails; otherwise OST continues normally.
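
The timeout can be watched directly on the host with plain systemd
tooling (nothing OST-specific here):

    # The configured stop timeout of the unit (the 10 seconds above)
    systemctl show vdsmd -p TimeoutStopUSec
    # Follow the unit's journal in one terminal ...
    journalctl -fu vdsmd
    # ... and time the stop in another; on the affected hosts it runs
    # into the timeout and ends with SIGKILL
    time systemctl stop vdsmd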

The situation is reproducible by running OST basic-suite-master and
making it fail artificially after test_vdsm_recovery.  Then `systemctl
stop vdsmd' can be run manually on the given OST host (this can be done
repeatedly, so it provides a good opportunity to examine the problem).
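
A loop like this works for repeated attempts (my ad-hoc sketch, not
part of OST; the sleep is an arbitrary settle time):

    for i in 1 2 3 4 5; do
        systemctl start vdsmd
        sleep 60    # let vdsmd finish starting up
        time systemctl stop vdsmd
    done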

There are two problems here:

- "Job for vdsmd.service canceled." message that sometimes occurs after
  `systemctl stop vdsmd' and then the test fails.  I don't know what it
  means and I can't identify any difference in journal between when the
  message occurs and when it doesn't.

- The fact that Vdsm doesn't stop within the timeout and must be killed.
  This doesn't happen in my normal oVirt installation.  It apparently
  blocks in the self.irs.prepareForShutdown() call from clientIF.py.
  The journal says:

    systemd[1]: Stopping Virtual Desktop Server Manager...
    systemd[1]: vdsmd.service: State 'stop-sigterm' timed out. Killing.
    systemd[1]: vdsmd.service: Killing process 132608 (vdsmd) with signal SIGKILL.
    systemd[1]: vdsmd.service: Killing process 133445 (ioprocess) with signal SIGKILL.
    systemd[1]: vdsmd.service: Killing process 133446 (ioprocess) with signal SIGKILL.
    systemd[1]: vdsmd.service: Killing process 133447 (ioprocess) with signal SIGKILL.
    systemd[1]: vdsmd.service: Main process exited, code=killed, status=9/KILL
    systemd[1]: vdsmd.service: Failed with result 'timeout'.
    systemd[1]: Stopped Virtual Desktop Server Manager.

  And vdsm.log (from a different run, sorry):

    2021-11-12 07:09:30,274+0000 INFO  (MainThread) [vdsm.api] START prepareForShutdown() from=internal, task_id=21b12bbd-1d61-4217-b92d-641a53d5f7bb (api:48)
    2021-11-12 07:09:30,317+0000 DEBUG (vmchannels) [vds] VM channels listener thread has ended. (vmchannels:214)
    2021-11-12 07:09:30,317+0000 DEBUG (vmchannels) [root] FINISH thread <Thread(vmchannels, stopped daemon 140163095193344)> (concurrent:261)
    2021-11-12 07:09:30,516+0000 DEBUG (mailbox-hsm/4) [root] FINISH thread <Thread(mailbox-hsm/4, stopped daemon 140162799089408)> (concurrent:261)
    2021-11-12 07:09:30,517+0000 INFO  (ioprocess/143197) [IOProcess] (3a729aa1-8e14-4ea0-8794-e3d67fbde542) Starting ioprocess (__init__:465)
    2021-11-12 07:09:30,521+0000 INFO  (ioprocess/143199) [IOProcess] (ost-he-basic-suite-master-storage:_exports_nfs_share1) Starting ioprocess (__init__:465)
    2021-11-12 07:09:30,535+0000 INFO  (ioprocess/143193) [IOProcess] (ost-he-basic-suite-master-storage:_exports_nfs_share2) Starting ioprocess (__init__:465)
    2021-11-12 07:09:30,679+0000 INFO  (ioprocess/143195) [IOProcess] (0187cf2f-2344-48de-a2a0-dd007315399f) Starting ioprocess (__init__:465)
    2021-11-12 07:09:30,719+0000 INFO  (ioprocess/143192) [IOProcess] (15fa3d6c-671b-46ef-af9a-00337011fa26) Starting ioprocess (__init__:465)
    2021-11-12 07:09:30,756+0000 INFO  (ioprocess/143194) [IOProcess] (ost-he-basic-suite-master-storage:_exports_nfs_exported) Starting ioprocess (__init__:465)
    2021-11-12 07:09:30,768+0000 INFO  (ioprocess/143198) [IOProcess] (ost-he-basic-suite-master-storage:_exports_nfs__he) Starting ioprocess (__init__:465)
    2021-11-12 07:09:30,774+0000 INFO  (ioprocess/143196) [IOProcess] (a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c) Starting ioprocess (__init__:465)
    2021-11-12 07:09:30,957+0000 DEBUG (mailbox-hsm/2) [root] FINISH thread <Thread(mailbox-hsm/2, stopped daemon 140162815874816)> (concurrent:261)
    2021-11-12 07:09:31,629+0000 INFO  (mailbox-hsm) [storage.mailbox.hsmmailmonitor] HSM_MailboxMonitor - Incoming mail monitoring thread stopped, clearing outgoing mail (mailbox:500)
    2021-11-12 07:09:31,629+0000 INFO  (mailbox-hsm) [storage.mailbox.hsmmailmonitor] HSM_MailMonitor sending mail to SPM - ['/usr/bin/dd', 'of=/rhev/data-center/f54c6052-437f-11ec-9094-54527d140533/mastersd/dom_md/inbox', 'iflag=fullblock', 'oflag=direct', 'conv=notrunc', 'bs=4096', 'count=1', 'seek=2'] (mailbox:382)
    2021-11-12 07:09:32,610+0000 DEBUG (mailbox-hsm/1) [root] FINISH thread <Thread(mailbox-hsm/1, stopped daemon 140162841052928)> (concurrent:261)
    2021-11-12 07:09:32,792+0000 DEBUG (mailbox-hsm/3) [root] FINISH thread <Thread(mailbox-hsm/3, stopped daemon 140162807482112)> (concurrent:261)
    2021-11-12 07:09:32,818+0000 DEBUG (mailbox-hsm/0) [root] FINISH thread <Thread(mailbox-hsm/0, stopped daemon 140162824267520)> (concurrent:261)
    2021-11-12 07:09:32,820+0000 INFO  (MainThread) [storage.monitor] Shutting down domain monitors (monitor:243)
    2021-11-12 07:09:32,820+0000 INFO  (MainThread) [storage.monitor] Stop monitoring 15fa3d6c-671b-46ef-af9a-00337011fa26 (shutdown=True) (monitor:268)
    2021-11-12 07:09:32,820+0000 INFO  (MainThread) [storage.monitor] Stop monitoring a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c (shutdown=True) (monitor:268)
    2021-11-12 07:09:32,820+0000 INFO  (MainThread) [storage.monitor] Stop monitoring 3a729aa1-8e14-4ea0-8794-e3d67fbde542 (shutdown=True) (monitor:268)
    2021-11-12 07:09:32,820+0000 INFO  (MainThread) [storage.monitor] Stop monitoring 0187cf2f-2344-48de-a2a0-dd007315399f (shutdown=True) (monitor:268)
    2021-11-12 07:09:32,820+0000 INFO  (MainThread) [storage.monitor] Stop monitoring 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (shutdown=True) (monitor:268)
    2021-11-12 07:09:32,831+0000 INFO  (monitor/0187cf2) [storage.check] Stop checking '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_exported/0187cf2f-2344-48de-a2a0-dd007315399f/dom_md/metadata' (check:135)
    2021-11-12 07:09:32,838+0000 INFO  (monitor/a8bab4e) [storage.check] Stop checking '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_share1/a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c/dom_md/metadata' (check:135)
    2021-11-12 07:09:32,844+0000 INFO  (monitor/15fa3d6) [storage.check] Stop checking '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs__he/15fa3d6c-671b-46ef-af9a-00337011fa26/dom_md/metadata' (check:135)
    2021-11-12 07:09:32,864+0000 INFO  (monitor/3a729aa) [storage.check] Stop checking '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_share2/3a729aa1-8e14-4ea0-8794-e3d67fbde542/dom_md/metadata' (check:135)
    2021-11-12 07:09:32,865+0000 INFO  (monitor/186180c) [storage.check] Stop checking '/dev/186180cb-5cc5-4aa4-868a-9e1ed7965ddf/metadata' (check:135)
    2021-11-12 07:09:32,867+0000 INFO  (monitor/186180c) [storage.blocksd] Tearing down domain 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (blockSD:996)
    2021-11-12 07:09:32,867+0000 DEBUG (monitor/186180c) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/lvm vgchange --config 'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3  filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|", "r|.*|"]  hints="none"  obtain_device_list_from_udev=0 } global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0  use_lvmpolld=1 } backup {  retain_min=50  retain_days=0 }' --available n 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (cwd None) (commands:154)
    2021-11-12 07:09:38,063+0000 DEBUG (monitor/186180c) [common.commands] FAILED: <err> = b'  Logical volume 186180cb-5cc5-4aa4-868a-9e1ed7965ddf/ids in use.\n  Can\'t deactivate volume group "186180cb-5cc5-4aa4-868a-9e1ed7965ddf" with 1 open logical volume(s)\n'; <rc> = 5 (commands:186)
    2021-11-12 07:09:38,066+0000 WARN  (monitor/186180c) [storage.lvm] Command with specific filter failed or returned no data, retrying with a wider filter: LVM command failed: 'cmd=[\'/sbin/lvm\', \'vgchange\', \'--config\', \'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3  filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|", "r|.*|"]  hints="none"  obtain_device_list_from_udev=0 } global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0  use_lvmpolld=1 } backup {  retain_min=50  retain_days=0 }\', \'--available\', \'n\', \'186180cb-5cc5-4aa4-868a-9e1ed7965ddf\'] rc=5 out=[] err=[\'  Logical volume 186180cb-5cc5-4aa4-868a-9e1ed7965ddf/ids in use.\', \'  Can\\\'t deactivate volume group "186180cb-5cc5-4aa4-868a-9e1ed7965ddf" with 1 open logical volume(s)\']' (lvm:532)
    2021-11-12 07:09:38,067+0000 DEBUG (monitor/186180c) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/lvm vgchange --config 'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3  filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405d40a7e761def4ed09aac63282$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|^/dev/mapper/36001405ece10ae48e99470295edda1bb$|^/dev/mapper/36001405f269e8e9efb34e5f9170b349b$|^/dev/mapper/36001405f8c5fe6c239e46a0976b842df$|", "r|.*|"]  hints="none"  obtain_device_list_from_udev=0 } global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0  use_lvmpolld=1 } backup {  retain_min=50  retain_days=0 }' --available n 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (cwd None) (commands:154)

  What is interesting here is that ioprocess instances get started
  during shutdown rather than being killed, as they are on clean vdsmd
  shutdowns.  (The sketch below shows how I plan to dig further.)
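
For completeness, the diagnostic sketch referenced above, i.e. what I
plan to try next on the host (plain shell; py-spy is an assumption
since it's not installed by default, and sanlock holding the "ids" LV
open is just my guess):

    # Problem 1: while the stop job is pending, look for another queued
    # job that might replace it; that would explain the "canceled"
    # message
    systemctl stop vdsmd &
    systemctl list-jobs

    # Problem 2: dump the Python stacks of the hanging vdsm process
    # while systemd is still waiting, to see where prepareForShutdown()
    # blocks (py-spy assumed available)
    py-spy dump --pid "$(pidof -s vdsmd)"

    # Check whether the "ids" LV that made the vgchange above fail is
    # still open; my suspect is the sanlock lockspace
    lvdisplay 186180cb-5cc5-4aa4-868a-9e1ed7965ddf/ids | grep -i open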

Any ideas?

Thanks,
Milan