Unfortunately I don't see anything wrong in either the engine or the vdsm
logs. There is one last thing that comes to mind that you could try: restart
the engine service. That is exactly the case I have been investigating.
But before restarting I would like to ask you, if possible, for a java
(jvm) thread dump.
The procedure is as follows:
1) find the jboss pid, e.g.
$ ps -ef | grep jboss | grep -v grep | awk '{ print $2 }'
2) trigger thread dump
$ kill -3 <jboss-pid>
3) the thread dump can be found in /var/log/ovirt-engine/console.log
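
The same steps can be wrapped in two small shell helpers. This is only a
sketch: the function names jboss_pids_from_ps and count_thread_dumps are made
up for illustration and are not part of oVirt, and the check assumes a
HotSpot/OpenJDK JVM, which starts each dump with the line "Full thread dump".

```shell
# Hypothetical helpers, for illustration only (not shipped with oVirt).

# Read `ps -ef`-style lines on stdin and print the PID column (field 2)
# of every line that mentions jboss, skipping the grep process itself.
jboss_pids_from_ps() {
    awk '/jboss/ && !/grep/ { print $2 }'
}

# Print how many thread dumps a console log contains.
# Assumption: each dump begins with a "Full thread dump" header line.
count_thread_dumps() {
    grep -c 'Full thread dump' "$1"
}

# Intended use on the engine machine (left commented out on purpose):
#   pid=$(ps -ef | jboss_pids_from_ps | head -n 1)
#   kill -3 "$pid"    # SIGQUIT: the JVM prints a dump and keeps running
#   count_thread_dumps /var/log/ovirt-engine/console.log
```

If the count goes up by one after the kill -3, the dump has been written and
the relevant part of console.log can be sent.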

And then restart the engine service to check if that helps.

Artur


On Mon, Aug 9, 2021 at 2:19 PM Andrei Verovski <andre...@starlett.lv> wrote:

> Hi, Artur,
>
> Small update with vdsm status, forgot to include in previous post.
>
> I partially fixed the problem with VDSM start.
>
> The bug "Failed to create session: Start job for unit user-0.slice failed
> with 'canceled'"
> is described here
> https://bugzilla.redhat.com/show_bug.cgi?id=1967962
> and a fix seems to be available here, so I have downgraded systemd to the
> build with the backported fix:
>
> http://people.redhat.com/dtardon/systemd/bz1642460-backport-UserStopDelaySec=/
>
> Now the vdsmd service starts successfully, but node14 still cannot be
> activated because of the same error. This is quite strange: before the
> restart on Friday the node just worked. There were no upgrades, nothing,
> just a restart.
>
> [root@node14 ~]# service vdsmd status
> Redirecting to /bin/systemctl status vdsmd.service
> ● vdsmd.service - Virtual Desktop Server Manager
>    Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled; vendor
> preset: disabled)
>    Active: active (running) since Mon 2021-08-09 15:12:59 EEST; 4min 20s
> ago
>   Process: 4066 ExecStartPre=/usr/libexec/vdsm/vdsmd_init_common.sh
> --pre-start (code=exited, status=0/SUCCESS)
>  Main PID: 4130 (vdsmd)
>     Tasks: 41 (limit: 615525)
>    Memory: 59.5M
>    CGroup: /system.slice/vdsmd.service
>            └─4130 /usr/bin/python3 /usr/share/vdsm/vdsmd
>
> Aug 09 15:12:55 node14.***.lv vdsmd_init_common.sh[4066]: vdsm: Running
> prepare_transient_repository
> Aug 09 15:12:57 node14.***.lv vdsmd_init_common.sh[4066]: vdsm: Running
> syslog_available
> Aug 09 15:12:57 node14.***.lv vdsmd_init_common.sh[4066]: vdsm: Running
> nwfilter
> Aug 09 15:12:58 node14.***.lv vdsmd_init_common.sh[4066]: vdsm: Running
> dummybr
> Aug 09 15:12:58 node14.***.lv vdsmd_init_common.sh[4066]: vdsm: Running
> tune_system
> Aug 09 15:12:58 node14.***.lv vdsmd_init_common.sh[4066]: vdsm: Running
> test_space
> Aug 09 15:12:59 node14.***.lv vdsmd_init_common.sh[4066]: vdsm: Running
> test_lo
> Aug 09 15:12:59 node14.***.lv systemd[1]: Started Virtual Desktop Server
> Manager.
> Aug 09 15:13:00 node14.***.lv vdsm[4130]: WARN MOM not available. Error:
> [Errno 111] Connection refused
> Aug 09 15:13:00 node14.***.lv vdsm[4130]: WARN MOM not available, KSM
> stats will be missing. Error:
>
>
> [root@node14]# firewall-cmd --list-all
> public (active)
>   target: default
>   icmp-block-inversion: no
>   interfaces: DMZ_node14 eno1 eno2 ovirtmgmt
>   sources:
>   services: cockpit dhcpv6-client libvirt-tls mountd nfs ovirt-imageio
> ovirt-vmconsole rpc-bind snmp ssh vdsm
>   ports: 2301/tcp 2381/tcp 22/tcp 6081/udp
>   protocols:
>   forward: no
>   masquerade: no
>   forward-ports:
>   source-ports:
>   icmp-blocks:
>   rich rules:
> [root@node14 andrei]#
>
>
> vdsm-client Host getStats and vdsm-client Host getCapabilities attached.
>
>
>
>
> On 9 Aug 2021, at 13:18, Artur Socha <aso...@redhat.com> wrote:
>
> Thanks for the logs. I am checking them at the moment. I have noticed so
> far that node14 is serving an NFS share which had been marked as
> problematic (probably because of the downtime during the migration) but it
> has recovered.
>
> In the meantime, is it possible to get some meaningful results when
> calling:
> $ vdsm-client Host getStats
> and
> $ vdsm-client Host getCapabilities
> on node14?
>
> What is the state of the vdsmd service when running systemctl status vdsmd?
> One other thing to rule out is the networking/firewall. Here is the list of
> the ports to be open for the host (the documentation is for hosted engine,
> but it applies to a standalone setup as well):
>
> https://www.ovirt.org/documentation/installing_ovirt_as_a_self-hosted_engine_using_the_command_line/index.html#host-firewall-requirements_SHE_cli_deploy
>
> btw. I have been hunting for a rare and hard to recreate bug for quite a
> long time (without success yet), so any reported connectivity issues
> between the manager and hosts are super interesting to me.
>
> Artur
>
> On Mon, Aug 9, 2021 at 11:44 AM Andrei Verovski <andreil1@***.lv
> <andre...@starlett.lv>> wrote:
>
>> Hi, Artur,
>>
>>
>> Thanks for the assistance. Zipped engine log starting from the day of the
>> upgrade is attached.
>> Restart via SSH from the oVirt Web GUI works.
>> The oVirt engine runs on a dedicated server, not hosted engine.
>>
>>
>>
>>
>> On 9 Aug 2021, at 11:24, Artur Socha <aso...@redhat.com> wrote:
>>
>> Hi Andrei,
>> Could you also post a relevant piece of engine.log? I don't have high
>> expectations of finding the answer there, but I just want to be sure.
>> VDSM.log does not show any trace of an error from the vdsm point of view.
>> For example, it looks like it started correctly and subscribed to receive
>> commands from the engine (yet that does not mean it connected to it; it is
>> only in listening mode).
>>
>> Can you confirm that 'SSH restart' from UI works - by 'works' I mean the
>> host is actually restarted after a few minutes and there are no ssh related
>> (public key etc) errors in engine.log?
>>
>> Artur
>>
>> On Mon, Aug 9, 2021 at 9:55 AM Andrei Verovski <andreil1@***.lv
>> <andre...@starlett.lv>> wrote:
>>
>>> Hi,
>>>
>>> I have oVirt 4.4.7.6-1.el8 and one problematic node (HP ProLiant with
>>> CentOS 8 stream).
>>> After replacing the server rack router switch and restarting, I got this
>>> error I can’t recover from:
>>>
>>> VDSM node14 command Get Host Capabilities failed: Message timeout which
>>> can be caused by communication issues
>>>
>>> vdsm-network is running fine, but vdsmd can’t start on node14 for
>>> whatever reason. All other nodes are running fine.
>>>
>>> Aug 09 10:24:12 node14.mydomain.lv vdsmd_init_common.sh[4825]: vdsm:
>>> Running dummybr
>>> Aug 09 10:24:13 node14.mydomain.lv vdsmd_init_common.sh[4825]: vdsm:
>>> Running tune_system
>>> Aug 09 10:24:13 node14.mydomain.lv vdsmd_init_common.sh[4825]: vdsm:
>>> Running test_space
>>> Aug 09 10:24:13 node14.mydomain.lv vdsmd_init_common.sh[4825]: vdsm:
>>> Running test_lo
>>> Aug 09 10:24:13 node14.mydomain.lv systemd[1]: Started Virtual Desktop
>>> Server Manager.
>>> Aug 09 10:24:16 node14.mydomain.lv sudo[7721]:
>>> pam_systemd(sudo:session): Failed to create session: Start job for unit
>>> user-0.slice failed with 'canceled'
>>> Aug 09 10:24:16 node14.mydomain.lv sudo[7721]: pam_unix(sudo:session):
>>> session opened for user root by (uid=0)
>>> Aug 09 10:24:16 node14.mydomain.lv sudo[7721]: pam_unix(sudo:session):
>>> session closed for user root
>>> Aug 09 10:24:17 node14.mydomain.lv vdsm[6754]: WARN MOM not available.
>>> Error: [Errno 2] No such file or directory
>>> Aug 09 10:24:17 node14.mydomain.lv vdsm[6754]: WARN MOM not available,
>>> KSM stats will be missing. Error:
>>>
>>>
>>> In the web GUI -> Management I can’t do anything with the host except
>>> restart. Stop aborts with an error, all other commands are grayed out.
>>> Status is “Unassigned”. The host is answering pings as usual.
>>> vdsm.log (from node14) attached.
>>>
>>> Thanks in advance for any help.
>>>
>>>
>>> _______________________________________________
>>> Users mailing list -- users@ovirt.org
>>> To unsubscribe send an email to users-le...@ovirt.org
>>> Privacy Statement: https://www.ovirt.org/privacy-policy.html
>>> oVirt Code of Conduct:
>>> https://www.ovirt.org/community/about/community-guidelines/
>>> List Archives:
>>> https://lists.ovirt.org/archives/list/users@ovirt.org/message/55M65W57Z43ZVPOARDTK7HKHCAMAUGO5/
>>>
>>
>>
>> --
>> Artur Socha
>> Senior Software Engineer, RHV
>> Red Hat
>>
>>
>>
>
> --
> Artur Socha
> Senior Software Engineer, RHV
> Red Hat
>
>
>

-- 
Artur Socha
Senior Software Engineer, RHV
Red Hat
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/WPDZIL4XS5LNX7O2U7VPVAZRUANPZOBZ/
