Actually, you could even take 3 thread dumps at 30-second intervals.

Artur

On Mon, Aug 9, 2021 at 4:53 PM Artur Socha <aso...@redhat.com> wrote:
> Unfortunately, I don't see anything wrong in either the engine or the vdsm logs.
> There is one last thing that comes to mind that you could try: restarting the
> engine service. That is exactly the case I have been investigating.
> But before restarting, I would like to ask you, if possible, for a Java
> (JVM) thread dump.
> The procedure is as follows:
> 1) find the JBoss PID, e.g.:
> $ ps -ef | grep jboss | grep -v grep | awk '{ print $2 }'
> 2) trigger a thread dump:
> $ kill -3 <jboss-pid>
> 3) the thread dump output can be found in /var/log/ovirt-engine/console.log
>
> Then restart the engine service to check whether that helps.
>
> Artur
>
>
> On Mon, Aug 9, 2021 at 2:19 PM Andrei Verovski <andre...@starlett.lv>
> wrote:
>
>> Hi, Artur,
>>
>> A small update with the vdsm status, which I forgot to include in my previous post.
>>
>> I partially fixed the problem with VDSM start.
>>
>> The bug "Failed to create session: Start job for unit user-0.slice failed
>> with 'canceled'"
>> is described here
>> https://bugzilla.redhat.com/show_bug.cgi?id=1967962
>> and a fix seems to be available here, so I have downgraded systemd to a
>> build with the backported fix:
>>
>> http://people.redhat.com/dtardon/systemd/bz1642460-backport-UserStopDelaySec=/
>>
>> Now the vdsmd service starts successfully, but node14 still cannot be
>> activated because of the same error. This is quite strange: before the restart on
>> Friday the node just worked. There were no upgrades, nothing, just a restart.
>>
>> [root@node14 ~]# service vdsmd status
>> Redirecting to /bin/systemctl status vdsmd.service
>> ● vdsmd.service - Virtual Desktop Server Manager
>>    Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled; vendor preset: disabled)
>>    Active: active (running) since Mon 2021-08-09 15:12:59 EEST; 4min 20s ago
>>   Process: 4066 ExecStartPre=/usr/libexec/vdsm/vdsmd_init_common.sh --pre-start (code=exited, status=0/SUCCESS)
>>  Main PID: 4130 (vdsmd)
>>     Tasks: 41 (limit: 615525)
>>    Memory: 59.5M
>>    CGroup: /system.slice/vdsmd.service
>>            └─4130 /usr/bin/python3 /usr/share/vdsm/vdsmd
>>
>> Aug 09 15:12:55 node14.***.lv vdsmd_init_common.sh[4066]: vdsm: Running prepare_transient_repository
>> Aug 09 15:12:57 node14.***.lv vdsmd_init_common.sh[4066]: vdsm: Running syslog_available
>> Aug 09 15:12:57 node14.***.lv vdsmd_init_common.sh[4066]: vdsm: Running nwfilter
>> Aug 09 15:12:58 node14.***.lv vdsmd_init_common.sh[4066]: vdsm: Running dummybr
>> Aug 09 15:12:58 node14.***.lv vdsmd_init_common.sh[4066]: vdsm: Running tune_system
>> Aug 09 15:12:58 node14.***.lv vdsmd_init_common.sh[4066]: vdsm: Running test_space
>> Aug 09 15:12:59 node14.***.lv vdsmd_init_common.sh[4066]: vdsm: Running test_lo
>> Aug 09 15:12:59 node14.***.lv systemd[1]: Started Virtual Desktop Server Manager.
>> Aug 09 15:13:00 node14.***.lv vdsm[4130]: WARN MOM not available. Error: [Errno 111] Connection refused
>> Aug 09 15:13:00 node14.***.lv vdsm[4130]: WARN MOM not available, KSM stats will be missing. Error:
>>
>>
>> [root@node14]# firewall-cmd --list-all
>> public (active)
>>   target: default
>>   icmp-block-inversion: no
>>   interfaces: DMZ_node14 eno1 eno2 ovirtmgmt
>>   sources:
>>   services: cockpit dhcpv6-client libvirt-tls mountd nfs ovirt-imageio ovirt-vmconsole rpc-bind snmp ssh vdsm
>>   ports: 2301/tcp 2381/tcp 22/tcp 6081/udp
>>   protocols:
>>   forward: no
>>   masquerade: no
>>   forward-ports:
>>   source-ports:
>>   icmp-blocks:
>>   rich rules:
>> [root@node14 andrei]#
>>
>>
>> The output of vdsm-client Host getStats and vdsm-client Host getCapabilities is attached.
>>
>>
>> On 9 Aug 2021, at 13:18, Artur Socha <aso...@redhat.com> wrote:
>>
>> Thanks for the logs. I am checking them at the moment. So far I have noticed
>> that node14 is serving an NFS share which had been marked as problematic
>> (probably because of the downtime during the migration), but it has
>> recovered.
>>
>> In the meantime, is it possible to get some meaningful results when
>> calling:
>> $ vdsm-client Host getStats
>> and
>> $ vdsm-client Host getCapabilities
>> on node14?
>>
>> What is the state of the vdsmd service when running systemctl status vdsmd?
>> One other thing to rule out is the networking/firewall. Here is the list of
>> the ports that must be open on the host (the documentation is for a hosted engine,
>> but it applies to a standalone setup as well):
>>
>> https://www.ovirt.org/documentation/installing_ovirt_as_a_self-hosted_engine_using_the_command_line/index.html#host-firewall-requirements_SHE_cli_deploy
>>
>> btw. I have been hunting for a rare and hard-to-recreate bug for quite
>> a long time (without success yet), so any reported connectivity issues
>> between the manager and hosts are super interesting to me.
>>
>> Artur
>>
>> On Mon, Aug 9, 2021 at 11:44 AM Andrei Verovski <andreil1@***.lv
>> <andre...@starlett.lv>> wrote:
>>
>>> Hi, Artur,
>>>
>>>
>>> Thanks for the assistance. The zipped engine logs, starting from the day of the
>>> upgrade, are attached.
>>> Restart via SSH from the oVirt Web GUI works.
>>> The oVirt engine runs on a dedicated server, not as a hosted engine.
>>>
>>>
>>> On 9 Aug 2021, at 11:24, Artur Socha <aso...@redhat.com> wrote:
>>>
>>> Hi Andrei,
>>> Could you also post a relevant piece of engine.log? I don't have high
>>> expectations of finding the answer there, but I just want to be sure.
>>> vdsm.log does not show any trace of an error from the vdsm point of view.
>>> For example, it looks like vdsm started correctly and subscribed to receiving
>>> commands from the engine (yet that does not mean a connection was made; it is
>>> only in listening mode).
>>>
>>> Can you confirm that 'SSH restart' from the UI works? By 'works' I mean that the
>>> host is actually restarted after a few minutes and there are no SSH-related
>>> (public key etc.) errors in engine.log.
>>>
>>> Artur
>>>
>>> On Mon, Aug 9, 2021 at 9:55 AM Andrei Verovski <andreil1@***.lv
>>> <andre...@starlett.lv>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have oVirt 4.4.7.6-1.el8 and one problematic node (HP ProLiant with
>>>> CentOS 8 Stream).
>>>> After replacing the server rack router switch and restarting, I got this error
>>>> that I can’t recover from:
>>>>
>>>> VDSM node14 command Get Host Capabilities failed: Message timeout which
>>>> can be caused by communication issues
>>>>
>>>> vdsm-network is running fine, but vdsmd can’t start on node14 for whatever
>>>> reason. All other nodes are running fine.
>>>>
>>>> Aug 09 10:24:12 node14.mydomain.lv vdsmd_init_common.sh[4825]: vdsm: Running dummybr
>>>> Aug 09 10:24:13 node14.mydomain.lv vdsmd_init_common.sh[4825]: vdsm: Running tune_system
>>>> Aug 09 10:24:13 node14.mydomain.lv vdsmd_init_common.sh[4825]: vdsm: Running test_space
>>>> Aug 09 10:24:13 node14.mydomain.lv vdsmd_init_common.sh[4825]: vdsm: Running test_lo
>>>> Aug 09 10:24:13 node14.mydomain.lv systemd[1]: Started Virtual Desktop Server Manager.
>>>> Aug 09 10:24:16 node14.mydomain.lv sudo[7721]: pam_systemd(sudo:session): Failed to create session: Start job for unit user-0.slice failed with 'canceled'
>>>> Aug 09 10:24:16 node14.mydomain.lv sudo[7721]: pam_unix(sudo:session): session opened for user root by (uid=0)
>>>> Aug 09 10:24:16 node14.mydomain.lv sudo[7721]: pam_unix(sudo:session): session closed for user root
>>>> Aug 09 10:24:17 node14.mydomain.lv vdsm[6754]: WARN MOM not available. Error: [Errno 2] No such file or directory
>>>> Aug 09 10:24:17 node14.mydomain.lv vdsm[6754]: WARN MOM not available, KSM stats will be missing. Error:
>>>>
>>>>
>>>> In the web GUI -> Management I can’t do anything with the host except
>>>> restart. Stop aborts with an error; all other commands are grayed out.
>>>> Status is “Unassigned”. The host answers pings as usual.
>>>> vdsm.log (from node14) is attached.
>>>>
>>>> Thanks in advance for any help.
>>>>
>>>>
>>>> _______________________________________________
>>>> Users mailing list -- users@ovirt.org
>>>> To unsubscribe send an email to users-le...@ovirt.org
>>>> Privacy Statement: https://www.ovirt.org/privacy-policy.html
>>>> oVirt Code of Conduct:
>>>> https://www.ovirt.org/community/about/community-guidelines/
>>>> List Archives:
>>>> https://lists.ovirt.org/archives/list/users@ovirt.org/message/55M65W57Z43ZVPOARDTK7HKHCAMAUGO5/
>>>>
>>>
>>>
>>> --
>>> Artur Socha
>>> Senior Software Engineer, RHV
>>> Red Hat
>>>
>>
>> --
>> Artur Socha
>> Senior Software Engineer, RHV
>> Red Hat
>>
>
> --
> Artur Socha
> Senior Software Engineer, RHV
> Red Hat
>
--
Artur Socha
Senior Software Engineer, RHV
Red Hat
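The thread-dump procedure from this thread (find the JBoss PID, send kill -3, read /var/log/ovirt-engine/console.log), combined with the suggestion to take 3 dumps at 30-second intervals, could be scripted roughly as follows. This is only a minimal Bash sketch, not an oVirt tool: find_engine_pids and take_thread_dumps are hypothetical helper names, pgrep -f jboss is swapped in for the ps | grep | awk pipeline, and it assumes a single matching engine process whose JVM writes SIGQUIT dumps to stdout (which, per the thread, ends up in console.log).

```shell
#!/usr/bin/env bash
# Minimal sketch (hypothetical helpers, not part of oVirt): request several JVM
# thread dumps from the ovirt-engine JBoss process at a fixed interval.
# SIGQUIT (signal 3, i.e. "kill -3") makes a HotSpot JVM print a thread dump
# to its stdout; per the thread, that lands in /var/log/ovirt-engine/console.log.

# Print the PID(s) of processes whose full command line contains "jboss".
# pgrep never lists itself, so no "grep -v grep" step is needed.
find_engine_pids() {
    pgrep -f jboss
}

# take_thread_dumps PID [COUNT] [INTERVAL_SECONDS]
# Sends SIGQUIT to PID COUNT times (default 3), sleeping INTERVAL_SECONDS
# (default 30) between signals. Returns 2 for a missing or non-numeric PID
# (e.g. when several PIDs were matched) and 1 if a signal cannot be sent.
take_thread_dumps() {
    local pid="$1" count="${2:-3}" interval="${3:-30}" i=1
    case "$pid" in
        '' | *[!0-9]*)
            echo "take_thread_dumps: expected a single numeric PID, got '$pid'" >&2
            return 2
            ;;
    esac
    while [ "$i" -le "$count" ]; do
        kill -QUIT "$pid" || return 1
        echo "thread dump $i/$count requested from PID $pid"
        # No need to wait after the last dump.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
}
```

On the engine machine, as root, something like take_thread_dumps "$(find_engine_pids)" 3 30 should request three dumps; if find_engine_pids prints more than one PID, the helper refuses to guess, so pick the Java process by hand. The engine service can then be restarted as suggested above.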
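To rule out the networking/firewall side mentioned in the thread, one quick check from the engine machine is whether the host's ports accept TCP connections at all. The sketch below uses hypothetical helpers (check_tcp_port, check_host_ports), not an oVirt tool. It relies on Bash's /dev/tcp redirection and coreutils timeout, and it assumes that 54321/tcp (VDSM) and 22/tcp (SSH) are the first ports worth probing; the host firewall requirements page linked in the thread remains the authoritative list.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of oVirt): test whether TCP ports on a host
# accept connections, e.g. run on the engine machine against node14.
# Requires bash (for /dev/tcp redirections) and coreutils "timeout".

# check_tcp_port HOST PORT [TIMEOUT_SECONDS]
# Returns 0 if a TCP connection to HOST:PORT opens within the timeout
# (default 5 s), non-zero on refusal, timeout or name-resolution failure.
check_tcp_port() {
    local host="$1" port="$2" secs="${3:-5}"
    # The inner bash opens fd 3 on /dev/tcp/HOST/PORT and exits immediately,
    # which closes the connection again. HOST and PORT are passed as
    # positional arguments so they are never re-parsed as shell code.
    timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# check_host_ports HOST PORT...
# Prints one line per port and returns 1 if any port is unreachable.
check_host_ports() {
    local host="$1" port rc=0
    shift
    for port in "$@"; do
        if check_tcp_port "$host" "$port"; then
            echo "$host:$port reachable"
        else
            echo "$host:$port NOT reachable"
            rc=1
        fi
    done
    return "$rc"
}
```

For example, check_host_ports with node14's FQDN followed by 54321 22 prints one line per port. A "NOT reachable" on 54321 while ping still works would point at the network or firewall path. A "reachable" result only shows that the port accepts connections, not that the engine's session with VDSM succeeds.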
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/6JKAD7D4WJEIWRCCBV75WMQNCL46OJDI/