[ovirt-devel] Re: OST: Vdsm: Occasional failures when stopping vdsmd

2021-11-14 Thread Nir Soffer
On Fri, Nov 12, 2021 at 9:15 PM Milan Zamazal  wrote:
>
> Hi,
>
> Michal has observed occasional OST failures in test_vdsm_recovery last
> days, which hadn't been seen before.  When `systemctl stop vdsmd' is
> called (via Ansible) there, vdsmd (almost?) never finishes its shutdown
> within the 10 seconds timeout and gets then killed with SIGKILL.  If
> this action is accompanied by "Job for vdsmd.service canceled." message
> then the test fails; otherwise OST continues normally.
>
> The situation is reproducible by running OST basic-suite-master and
> making it artificially failing after test_vdsm_recovery.  Then running
> `systemctl stop vdsmd' manually on the given OST host (can be done
> repeatedly, so it provides a good opportunity to examine the problem).
>
> There are two problems there:
>
> - "Job for vdsmd.service canceled." message that sometimes occurs after
>   `systemctl stop vdsmd' and then the test fails.  I don't know what it
>   means and I can't identify any difference in journal between when the
>   message occurs and when it doesn't.

This is likely caused by the hosted engine agent discovering that the hosted
engine storage domain was deactivated, and panicking. systemd then tries to
restart the hosted engine agent, and since that agent requires vdsm, systemd
may start vdsm while it is being shut down; that is when you get the
"canceled" message.

Basically, if you want to stop vdsm, you must first stop the other services
on the host that require it.

If we want to avoid this craziness, we need to fix the other services so that
they do not require vdsm. It is not possible to manage a service cleanly while
other services require it.
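For example, something like this should stop vdsm cleanly (a sketch only;
ovirt-ha-agent and ovirt-ha-broker are the usual hosted-engine services
assumed here to be the ones requiring vdsmd):

```shell
# Show which units require/want vdsmd - any of these can be the one
# that makes systemd start vdsmd again in the middle of the shutdown.
systemctl list-dependencies --reverse vdsmd.service

# Stop the dependent hosted-engine services first, then vdsmd itself.
# (Service names assumed from a typical hosted-engine deployment.)
systemctl stop ovirt-ha-agent ovirt-ha-broker
systemctl stop vdsmd
```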

> - The fact that Vdsm doesn't stop within the timeout and must be killed.
>   This doesn't happen in my normal oVirt installation.  It apparently
>   blocks in self.irs.prepareForShutdown() call from clientIF.py.

This has been normal for the last 7-8 years: Vdsm storage shutdown is broken
and is likely to get stuck. However, it never caused trouble in the past.
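If the 10 second limit itself is what breaks OST, one possible workaround (a
sketch, not a fix for the underlying storage shutdown bug; the 30 second value
is an arbitrary example) is a systemd drop-in that gives vdsmd more time
before the SIGKILL escalation:

```shell
# Create a drop-in that raises the stop timeout for vdsmd.service.
mkdir -p /etc/systemd/system/vdsmd.service.d
cat > /etc/systemd/system/vdsmd.service.d/99-stop-timeout.conf <<'EOF'
[Service]
TimeoutStopSec=30
EOF

# Make systemd pick up the drop-in.
systemctl daemon-reload
```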

>   Journal says:
>
> systemd[1]: Stopping Virtual Desktop Server Manager...
> systemd[1]: vdsmd.service: State 'stop-sigterm' timed out. Killing.
systemd[1]: vdsmd.service: Killing process 132608 (vdsmd) with signal SIGKILL.
systemd[1]: vdsmd.service: Killing process 133445 (ioprocess) with signal SIGKILL.
systemd[1]: vdsmd.service: Killing process 133446 (ioprocess) with signal SIGKILL.
systemd[1]: vdsmd.service: Killing process 133447 (ioprocess) with signal SIGKILL.
> systemd[1]: vdsmd.service: Main process exited, code=killed, status=9/KILL
> systemd[1]: vdsmd.service: Failed with result 'timeout'.
> systemd[1]: Stopped Virtual Desktop Server Manager.
>
>   And vdsm.log (from a different run, sorry):
...
> 2021-11-12 07:09:32,820+ INFO  (MainThread) [storage.monitor] Shutting down domain monitors (monitor:243)

This is the shutdown flow - good...

> 2021-11-12 07:09:32,820+ INFO  (MainThread) [storage.monitor] Stop monitoring 15fa3d6c-671b-46ef-af9a-00337011fa26 (shutdown=True) (monitor:268)

We stop the monitor with the shutdown flag - good...

> 2021-11-12 07:09:32,820+ INFO  (MainThread) [storage.monitor] Stop monitoring a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c (shutdown=True) (monitor:268)
...

> 2021-11-12 07:09:32,838+ INFO  (monitor/a8bab4e) [storage.check] Stop checking '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_share1/a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c/dom_md/metadata' (check:135)

We stopped checking the domain - good...

> 2021-11-12 07:09:32,867+ INFO  (monitor/186180c) [storage.blocksd] Tearing down domain 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (blockSD:996)

Tearing down the storage domain - bad!

This is a bug that was hidden for several years, because tearing down a
storage domain did not really tear down the storage. Now that we really tear
down the storage, it deactivates logical volumes and removes device mapper
devices.

> 2021-11-12 07:09:32,867+ DEBUG (monitor/186180c) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/lvm vgchange --config 'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3  filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|", "r|.*|"]  hints="none"  obtain_device_list_from_udev=0 } global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0  use_lvmpolld=1 } backup {  retain_min=50  retain_days=0 }' --available n 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (cwd None) (commands:154)

This is an attempt to deactivate a logical volume, which should not happen
during shutdown...

> 2021-11-12 07:09:38,063+ DEBUG (monitor/186180c) [common.commands] FAILED:  = b'  Logical volume 186180cb-5cc5-4aa4-868a-9e1ed7965ddf/ids in use.\n  Can\'t deactivate volume group "186180cb-

[ovirt-devel] Gerrit: ovirt-engine build is unstable

2021-11-14 Thread Saif Abu Saleh
Hi,

The Gerrit build of ovirt-engine on the master branch seems to be unstable.

I ran it twice, and it failed with two errors unrelated to my changes:


1- job link:
https://jenkins.ovirt.org/job/ovirt-engine_standard-check-patch/15074/

Errors:
fatal: unable to access 'https://gerrit.ovirt.org/ovirt-engine/': Could not
resolve host: gerrit.ovirt.org; Unknown error


2- job link:
https://jenkins.ovirt.org/job/ovirt-engine_standard-check-patch/15080/

Errors:

[2021-11-14T12:32:21.224Z] No matching package to install:
'ovirt-jboss-modules-maven-plugin >= 2.0.1'
[2021-11-14T12:32:21.224Z] Error: Some packages could not be found.


Is anyone aware of / looking into these failures?


Thanks,
Saif
___
Devel mailing list -- devel@ovirt.org
To unsubscribe send an email to devel-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/devel@ovirt.org/message/OA33VQKPHBRB7JL24OQ5ZHB2VVFD2YCR/


[ovirt-devel] Re: Gerrit: ovirt-engine build is unstable

2021-11-14 Thread Milan Zamazal
Saif Abu Saleh  writes:

> Hi,
>
> The Gerrit build of ovirt-engine on the master branch seems to be unstable.
>
> I ran it twice, and it failed with two errors unrelated to my changes:
>
>
> 1- job link:
> https://jenkins.ovirt.org/job/ovirt-engine_standard-check-patch/15074/
>
> Errors:
> fatal: unable to access 'https://gerrit.ovirt.org/ovirt-engine/': Could not
> resolve host: gerrit.ovirt.org; Unknown error

I have experienced this more than once in Vdsm Jenkins jobs too.  So it seems
to be a recurring problem that disturbs our CI and should be fixed.

> 2- job link:
> https://jenkins.ovirt.org/job/ovirt-engine_standard-check-patch/15080/
>
> Errors:
>
> [2021-11-14T12:32:21.224Z] No matching package to install:
> 'ovirt-jboss-modules-maven-plugin >= 2.0.1'
> [2021-11-14T12:32:21.224Z] Error: Some packages could not be found.
>
>
> Is anyone aware of / looking into these failures?
>
>
> Thanks,
> Saif
___
Devel mailing list -- devel@ovirt.org
To unsubscribe send an email to devel-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/devel@ovirt.org/message/ZGS4YFTO7BBQNCXLFQ47JG5W3ADGV26N/