Actually, you could even take 3 thread dumps at 30-second intervals.
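
A rough sketch of how that could be scripted, assuming bash on the engine machine. The function name take_thread_dumps and its arguments are made up for illustration; kill -3 and the console.log location come from the procedure quoted below.

```shell
# Hypothetical helper (not part of oVirt): send SIGQUIT to a JVM several
# times so that it writes several thread dumps, pausing between them.
# Usage: take_thread_dumps <pid> [count] [interval-seconds]
take_thread_dumps() {
    local pid=$1 count=${2:-3} interval=${3:-30} i=1

    if [ -z "$pid" ]; then
        echo "usage: take_thread_dumps <pid> [count] [interval]" >&2
        return 2
    fi

    while [ "$i" -le "$count" ]; do
        # SIGQUIT (signal 3) asks the JVM to print a thread dump; by
        # default it does not stop the process.
        kill -3 "$pid" || return 1
        # Wait between dumps, but not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
}

# Example, as root on the engine machine, with <jboss-pid> found as in step 1:
#   take_thread_dumps <jboss-pid> 3 30
```

The dumps should then appear one after another in /var/log/ovirt-engine/console.log. Several dumps taken a short time apart make it easier to see which threads stay stuck in the same place.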
Artur

On Mon, Aug 9, 2021 at 4:53 PM Artur Socha <aso...@redhat.com> wrote:

> Unfortunately, I don't see anything wrong in either the engine or vdsm
> logs. There is one last thing that comes to mind that you could try:
> restarting the engine service. That is exactly the case I have been
> investigating. But before restarting, I would like to ask you, if
> possible, for a Java (JVM) thread dump.
> The procedure is as follows:
> 1) find the jboss pid, e.g.
> $ ps -ef | grep jboss | grep -v grep | awk '{ print $2 }'
> 2) trigger thread dump
> $ kill -3 <jboss-pid>
> 3) the thread dump can be found in /var/log/ovirt-engine/console.log
>
> And then restart engine service to check if that helps.
>
> Artur
>
>
> On Mon, Aug 9, 2021 at 2:19 PM Andrei Verovski <andre...@starlett.lv>
> wrote:
>
>> Hi, Artur,
>>
>> Small update with the vdsm status, which I forgot to include in my
>> previous post.
>>
>> I partially fixed the problem with VDSM startup.
>>
>> The bug "Failed to create session: Start job for unit user-0.slice failed
>> with 'canceled'"
>> is described here:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1967962
>> and a fix seems to be available at the link below, so I have downgraded
>> systemd to the build with the backported fix:
>>
>> http://people.redhat.com/dtardon/systemd/bz1642460-backport-UserStopDelaySec=/
>>
>> Now the vdsmd service starts successfully, but node14 still cannot be
>> activated because of the same error. This is quite strange: before the
>> restart on Friday the node just worked. There were no upgrades, nothing,
>> just a restart.
>>
>> [root@node14 ~]# service vdsmd status
>> Redirecting to /bin/systemctl status vdsmd.service
>> ● vdsmd.service - Virtual Desktop Server Manager
>>    Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled; vendor
>> preset: disabled)
>>    Active: active (running) since Mon 2021-08-09 15:12:59 EEST; 4min 20s
>> ago
>>   Process: 4066 ExecStartPre=/usr/libexec/vdsm/vdsmd_init_common.sh
>> --pre-start (code=exited, status=0/SUCCESS)
>>  Main PID: 4130 (vdsmd)
>>     Tasks: 41 (limit: 615525)
>>    Memory: 59.5M
>>    CGroup: /system.slice/vdsmd.service
>>            └─4130 /usr/bin/python3 /usr/share/vdsm/vdsmd
>>
>> Aug 09 15:12:55 node14.***.lv vdsmd_init_common.sh[4066]: vdsm: Running
>> prepare_transient_repository
>> Aug 09 15:12:57 node14.***.lv vdsmd_init_common.sh[4066]: vdsm: Running
>> syslog_available
>> Aug 09 15:12:57 node14.***.lv vdsmd_init_common.sh[4066]: vdsm: Running
>> nwfilter
>> Aug 09 15:12:58 node14.***.lv vdsmd_init_common.sh[4066]: vdsm: Running
>> dummybr
>> Aug 09 15:12:58 node14.***.lv vdsmd_init_common.sh[4066]: vdsm: Running
>> tune_system
>> Aug 09 15:12:58 node14.***.lv vdsmd_init_common.sh[4066]: vdsm: Running
>> test_space
>> Aug 09 15:12:59 node14.***.lv vdsmd_init_common.sh[4066]: vdsm: Running
>> test_lo
>> Aug 09 15:12:59 node14.***.lv systemd[1]: Started Virtual Desktop Server
>> Manager.
>> Aug 09 15:13:00 node14.***.lv vdsm[4130]: WARN MOM not available. Error:
>> [Errno 111] Connection refused
>> Aug 09 15:13:00 node14.***.lv vdsm[4130]: WARN MOM not available, KSM
>> stats will be missing. Error:
>>
>>
>> [root@node14]# firewall-cmd --list-all
>> public (active)
>>   target: default
>>   icmp-block-inversion: no
>>   interfaces: DMZ_node14 eno1 eno2 ovirtmgmt
>>   sources:
>>   services: cockpit dhcpv6-client libvirt-tls mountd nfs ovirt-imageio
>> ovirt-vmconsole rpc-bind snmp ssh vdsm
>>   ports: 2301/tcp 2381/tcp 22/tcp 6081/udp
>>   protocols:
>>   forward: no
>>   masquerade: no
>>   forward-ports:
>>   source-ports:
>>   icmp-blocks:
>>   rich rules:
>> [root@node14 andrei]#
>>
>>
>> Output of vdsm-client Host getStats and vdsm-client Host getCapabilities
>> is attached.
>>
>>
>>
>>
>> On 9 Aug 2021, at 13:18, Artur Socha <aso...@redhat.com> wrote:
>>
>> Thanks for the logs. I am checking them at the moment. I have noticed so
>> far that node14 is serving an NFS share which had been marked as
>> problematic (probably because of the downtime during the migration), but
>> it has recovered.
>>
>> In the meantime, is it possible to get some meaningful results when
>> calling:
>> $ vdsm-client Host getStats
>> and
>> $ vdsm-client Host getCapabilities
>> on node14?
>>
>> What is the state of the vdsmd service when running systemctl status
>> vdsmd? One other thing to rule out is the networking/firewall. Here is
>> the list of ports that must be open on the host (the documentation is for
>> hosted engine, but it applies to a standalone setup as well):
>>
>> https://www.ovirt.org/documentation/installing_ovirt_as_a_self-hosted_engine_using_the_command_line/index.html#host-firewall-requirements_SHE_cli_deploy
>>
>> By the way, I have been hunting a rare and hard-to-reproduce bug for
>> quite a long time (without success so far), so any reported connectivity
>> issues between the manager and hosts are super interesting to me.
>>
>> Artur
>>
>> On Mon, Aug 9, 2021 at 11:44 AM Andrei Verovski <andreil1@***.lv
>> <andre...@starlett.lv>> wrote:
>>
>>> Hi, Artur,
>>>
>>>
>>> Thanks for the assistance. Zipped engine logs, starting from the day of
>>> the upgrade, are attached.
>>> Restart via SSH from the oVirt Web GUI works.
>>> The oVirt engine runs on a dedicated server; it is not a hosted engine.
>>>
>>>
>>>
>>>
>>> On 9 Aug 2021, at 11:24, Artur Socha <aso...@redhat.com> wrote:
>>>
>>> Hi Andrei,
>>> Could you also post the relevant piece of engine.log? I don't have high
>>> expectations of finding the answer there, but I just want to be sure.
>>> vdsm.log does not show any trace of an error from the vdsm point of view.
>>> For example, it looks like it started correctly and subscribed to receive
>>> commands from the engine (yet that does not mean it is connected to it,
>>> only that it is in listening mode).
>>>
>>> Can you confirm that 'SSH restart' from the UI works? By 'works' I mean
>>> the host is actually restarted after a few minutes and there are no
>>> SSH-related (public key etc.) errors in engine.log.
>>>
>>> Artur
>>>
>>> On Mon, Aug 9, 2021 at 9:55 AM Andrei Verovski <andreil1@***.lv
>>> <andre...@starlett.lv>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have oVirt 4.4.7.6-1.el8 and one problematic node (HP ProLiant with
>>>> CentOS 8 Stream).
>>>> After replacing the server rack router switch and restarting, I got
>>>> this error, which I can’t recover from:
>>>>
>>>> VDSM node14 command Get Host Capabilities failed: Message timeout which
>>>> can be caused by communication issues
>>>>
>>>> vdsm-network is running fine, but vdsmd can’t start on node14 for
>>>> whatever reason. All other nodes are running fine.
>>>>
>>>> Aug 09 10:24:12 node14.mydomain.lv vdsmd_init_common.sh[4825]: vdsm:
>>>> Running dummybr
>>>> Aug 09 10:24:13 node14.mydomain.lv vdsmd_init_common.sh[4825]: vdsm:
>>>> Running tune_system
>>>> Aug 09 10:24:13 node14.mydomain.lv vdsmd_init_common.sh[4825]: vdsm:
>>>> Running test_space
>>>> Aug 09 10:24:13 node14.mydomain.lv vdsmd_init_common.sh[4825]: vdsm:
>>>> Running test_lo
>>>> Aug 09 10:24:13 node14.mydomain.lv systemd[1]: Started Virtual Desktop
>>>> Server Manager.
>>>> Aug 09 10:24:16 node14.mydomain.lv sudo[7721]:
>>>> pam_systemd(sudo:session): Failed to create session: Start job for unit
>>>> user-0.slice failed with 'canceled'
>>>> Aug 09 10:24:16 node14.mydomain.lv sudo[7721]: pam_unix(sudo:session):
>>>> session opened for user root by (uid=0)
>>>> Aug 09 10:24:16 node14.mydomain.lv sudo[7721]: pam_unix(sudo:session):
>>>> session closed for user root
>>>> Aug 09 10:24:17 node14.mydomain.lv vdsm[6754]: WARN MOM not available.
>>>> Error: [Errno 2] No such file or directory
>>>> Aug 09 10:24:17 node14.mydomain.lv vdsm[6754]: WARN MOM not available,
>>>> KSM stats will be missing. Error:
>>>>
>>>>
>>>> In the web GUI -> Management I can’t do anything with the host except
>>>> restart it. Stop aborts with an error; all other commands are grayed
>>>> out. Status is “Unassigned”. The host responds to pings as usual.
>>>> vdsm.log (from node14) attached.
>>>>
>>>> Thanks in advance for any help.
>>>>
>>>>
>>>> _______________________________________________
>>>> Users mailing list -- users@ovirt.org
>>>> To unsubscribe send an email to users-le...@ovirt.org
>>>> Privacy Statement: https://www.ovirt.org/privacy-policy.html
>>>> oVirt Code of Conduct:
>>>> https://www.ovirt.org/community/about/community-guidelines/
>>>> List Archives:
>>>> https://lists.ovirt.org/archives/list/users@ovirt.org/message/55M65W57Z43ZVPOARDTK7HKHCAMAUGO5/
>>>>
>>>
>>>
>>> --
>>> Artur Socha
>>> Senior Software Engineer, RHV
>>> Red Hat
>>>
>>>
>>>


-- 
Artur Socha
Senior Software Engineer, RHV
Red Hat
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/6JKAD7D4WJEIWRCCBV75WMQNCL46OJDI/
