Hi Artur,

Hope you are well. Please see below; this is after I restarted the engine:
host:
[root@ovirt-aa-1-21 ~]# tcpdump -i ovirtmgmt -c 1000 -ttttnnvvS dst ovirt-engine-aa-1-01
tcpdump: listening on ovirtmgmt, link-type EN10MB (Ethernet), capture size 262144 bytes
2020-08-07 12:09:32.553543 ARP, Ethernet (len 6), IPv4 (len 4), Reply 172.140.220.111 is-at 00:25:b5:04:00:25, length 28
2020-08-07 12:10:05.584594 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    172.140.220.111.54321 > 172.140.220.23.56202: Flags [S.], cksum 0x5cd5 (incorrect -> 0xc8ca), seq 4036072905, ack 3265413231, win 28960, options [mss 1460,sackOK,TS val 3039504636 ecr 341411251,nop,wscale 7], length 0
2020-08-07 12:10:10.589276 ARP, Ethernet (len 6), IPv4 (len 4), Reply 172.140.220.111 is-at 00:25:b5:04:00:25, length 28
2020-08-07 12:10:15.596230 IP (tos 0x0, ttl 64, id 48438, offset 0, flags [DF], proto TCP (6), length 52)
    172.140.220.111.54321 > 172.140.220.23.56202: Flags [F.], cksum 0x5ccd (incorrect -> 0x40b8), seq 4036072906, ack 3265413231, win 227, options [nop,nop,TS val 3039514647 ecr 341411251], length 0
2020-08-07 12:10:20.596429 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 172.140.220.23 tell 172.140.220.111, length 28
2020-08-07 12:10:20.663699 IP (tos 0x0, ttl 64, id 64726, offset 0, flags [DF], proto TCP (6), length 40)
    172.140.220.111.54321 > 172.140.220.23.56202: Flags [R], cksum 0x1d20 (correct), seq 4036072907, win 0, length 0

engine:
[root@ovirt-engine-aa-1-01 ~]# tcpdump -i eth0 -c 1000 -ttttnnvvS src ovirt-aa-1-21
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
2020-08-07 12:09:31.891242 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    172.140.220.111.54321 > 172.140.220.23.56202: Flags [S.], cksum 0xc8ca (correct), seq 4036072905, ack 3265413231, win 28960, options [mss 1460,sackOK,TS val 3039504636 ecr 341411251,nop,wscale 7], length 0
2020-08-07 12:09:36.895502 ARP, Ethernet (len 6), IPv4 (len 4), Reply 172.140.220.111 is-at 00:25:b5:04:00:25, length 42
2020-08-07 12:09:41.901981 IP (tos 0x0, ttl 64, id 48438, offset 0, flags [DF], proto TCP (6), length 52)
    172.140.220.111.54321 > 172.140.220.23.56202: Flags [F.], cksum 0x40b8 (correct), seq 4036072906, ack 3265413231, win 227, options [nop,nop,TS val 3039514647 ecr 341411251], length 0
2020-08-07 12:09:46.901681 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 172.140.220.23 tell 172.140.220.111, length 42
2020-08-07 12:09:46.968911 IP (tos 0x0, ttl 64, id 64726, offset 0, flags [DF], proto TCP (6), length 40)
    172.140.220.111.54321 > 172.140.220.23.56202: Flags [R], cksum 0x1d20 (correct), seq 4036072907, win 0, length 0

Regards

Nar

On Fri, 7 Aug 2020 at 11:54, Artur Socha <aso...@redhat.com> wrote:

> Hi Nardus,
> There is one more thing to be checked.
>
> 1) Could you check if there are any packets sent from the affected host to
> the engine?
> On the host:
> # outgoing traffic
> sudo tcpdump -i <interface_name_on_host> -c 1000 -ttttnnvvS dst <engine_host>
>
> 2) The same the other way round: check if there are packets received on the
> engine side from the affected host.
> On the engine:
> # incoming traffic
> sudo tcpdump -i <interface_name_on_engine> -c 1000 -ttttnnvvS src <affected_host>
>
> Artur
>
> On Thu, Aug 6, 2020 at 4:51 PM Artur Socha <aso...@redhat.com> wrote:
>
>> Thanks Nardus,
>> After a quick look I found what I was suspecting - there are way too many
>> threads in the Blocked state. I don't know the reason yet, but this is very
>> helpful. I'll let you know about the findings/investigation. Meanwhile, you
>> may try restarting the engine as a (very brute and ugly) workaround.
>> You may also try to set up a slightly bigger thread pool - it may save you
>> some time until the next hiccup.
>> However, please be aware that this may come at a cost in memory usage and
>> higher CPU usage (due to increased context switching).
>> Here are some docs:
>>
>> # Specify the thread pool size for the jboss managed scheduled executor service
>> # used by commands to periodically execute methods. It is generally not
>> # necessary to increase the number of threads in this thread pool. To change
>> # the value permanently, create a conf file 99-engine-scheduled-thread-pool.conf
>> # in /etc/ovirt-engine/engine.conf.d/
>> ENGINE_SCHEDULED_THREAD_POOL_SIZE=100
>>
>> A.
>>
>> On Thu, Aug 6, 2020 at 4:19 PM Nardus Geldenhuys <nard...@gmail.com> wrote:
>>
>>> Hi Artur
>>>
>>> Please find attached; also let me know if I need to rerun. They are 5 min
>>> apart.
>>>
>>> [root@engine-aa-1-01 ovirt-engine]# ps -ef | grep jboss | grep -v grep | awk '{ print $2 }'
>>> 27390
>>> [root@engine-aa-1-01 ovirt-engine]# jstack -F 27390 > your_engine_thread_dump_1.txt
>>> [root@engine-aa-1-01 ovirt-engine]# jstack -F 27390 > your_engine_thread_dump_2.txt
>>> [root@engine-aa-1-01 ovirt-engine]# jstack -F 27390 > your_engine_thread_dump_3.txt
>>>
>>> Regards
>>>
>>> Nar
>>>
>>> On Thu, 6 Aug 2020 at 15:55, Artur Socha <aso...@redhat.com> wrote:
>>>
>>>> Sure thing.
>>>> On the engine host, please find the jboss pid. You can use this command:
>>>>
>>>> ps -ef | grep jboss | grep -v grep | awk '{ print $2 }'
>>>>
>>>> or the jps tool from the JDK. Sample output on my dev environment is:
>>>>
>>>> ± % jps
>>>> !2860
>>>> 64853 jboss-modules.jar
>>>> 196217 Jps
>>>>
>>>> Then use jstack from the JDK:
>>>>
>>>> jstack <pid> > your_engine_thread_dump.txt
>>>>
>>>> 2 or 3 dumps taken at approximately 5-minute intervals would be even
>>>> more useful.
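[Editor's note: the jstack steps quoted above produce plain-text dumps; a quick way to spot the "too many threads in Blocked state" condition Artur mentions is to tally thread states per dump. This is a sketch added while editing, not part of the original thread; it assumes the standard `java.lang.Thread.State:` lines that HotSpot's jstack emits.]

```python
# Editor's sketch (not from the thread): tally thread states in a jstack
# dump to spot an excess of BLOCKED threads.
import re
import sys
from collections import Counter

def thread_states(dump_text):
    """Count occurrences of each java.lang.Thread.State in a jstack dump."""
    # HotSpot prints lines such as:
    #    java.lang.Thread.State: BLOCKED (on object monitor)
    return Counter(re.findall(r"java\.lang\.Thread\.State: (\w+)", dump_text))

if __name__ == "__main__":
    # e.g. python thread_states.py your_engine_thread_dump_*.txt
    for path in sys.argv[1:]:
        with open(path) as fh:
            print(path, dict(thread_states(fh.read())))
```

Comparing the BLOCKED count across the 2-3 dumps taken minutes apart shows whether the threads are stuck or merely busy at one instant.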
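[Editor's note: the override file described in the quoted docs above can also be written with a small helper. A sketch added while editing, not part of the thread; the directory and file name come from the quoted docs, while the helper name and the size value 200 are illustrative assumptions, not tuning recommendations.]

```python
# Editor's sketch: write the ovirt-engine pool-size override file described
# in the quoted docs. conf_dir defaults to the documented path; pass another
# directory (e.g. a temp dir) for testing. The size is an example value only.
from pathlib import Path

def write_pool_conf(size, conf_dir="/etc/ovirt-engine/engine.conf.d"):
    path = Path(conf_dir) / "99-engine-scheduled-thread-pool.conf"
    path.write_text(f"ENGINE_SCHEDULED_THREAD_POOL_SIZE={size}\n")
    return path

# usage (as root on the engine host, then restart the ovirt-engine service):
# write_pool_conf(200)
```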
>>>>
>>>> Here you can find even more options:
>>>> https://www.baeldung.com/java-thread-dump
>>>>
>>>> Artur
>>>>
>>>> On Thu, Aug 6, 2020 at 3:15 PM Nardus Geldenhuys <nard...@gmail.com> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> I can create a thread dump; please send details on how to.
>>>>>
>>>>> Regards
>>>>>
>>>>> Nardus
>>>>>
>>>>> On Thu, 6 Aug 2020 at 14:17, Artur Socha <aso...@redhat.com> wrote:
>>>>>
>>>>>> Hi Nardus,
>>>>>> You might have hit an issue I have been hunting for some time ([1] and [2]).
>>>>>> [1] could not be properly resolved because at the time I was not able to
>>>>>> recreate the issue on a dev setup.
>>>>>> I suspect [2] is related.
>>>>>>
>>>>>> Would you be able to prepare a thread dump from your engine instance?
>>>>>> Additionally, please check for potential libvirt errors/warnings.
>>>>>> Can you also paste the output of:
>>>>>> sudo yum list installed | grep vdsm
>>>>>> sudo yum list installed | grep ovirt-engine
>>>>>> sudo yum list installed | grep libvirt
>>>>>>
>>>>>> Usually, according to previous reports, restarting the engine helps
>>>>>> to restore connectivity with hosts ... at least for some time.
>>>>>>
>>>>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1845152
>>>>>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1846338
>>>>>>
>>>>>> Regards,
>>>>>> Artur
>>>>>>
>>>>>> On Thu, Aug 6, 2020 at 8:01 AM Nardus Geldenhuys <nard...@gmail.com> wrote:
>>>>>>
>>>>>>> I also see this in the engine:
>>>>>>>
>>>>>>> Aug 6, 2020, 7:37:17 AM
>>>>>>> VDSM someserver command Get Host Capabilities failed: Message timeout
>>>>>>> which can be caused by communication issues
>>>>>>>
>>>>>>> On Thu, 6 Aug 2020 at 07:09, Strahil Nikolov <hunter86...@yahoo.com> wrote:
>>>>>>>
>>>>>>>> Can you check for errors on the affected host? Most probably you
>>>>>>>> need the vdsm logs.
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Strahil Nikolov
>>>>>>>>
>>>>>>>> On 6 August 2020 at 7:40:23 GMT+03:00, Nardus Geldenhuys
>>>>>>>> <nard...@gmail.com> wrote:
>>>>>>>> >Hi Strahil
>>>>>>>> >
>>>>>>>> >Hope you are well. I get the following error when I tried to confirm
>>>>>>>> >the reboot:
>>>>>>>> >
>>>>>>>> >Error while executing action: Cannot confirm 'Host has been rebooted'
>>>>>>>> >Host.
>>>>>>>> >Valid Host statuses are "Non operational", "Maintenance" or
>>>>>>>> >"Connecting".
>>>>>>>> >
>>>>>>>> >And I can't put it in maintenance; the only options are "restart" or
>>>>>>>> >"stop".
>>>>>>>> >
>>>>>>>> >Regards
>>>>>>>> >
>>>>>>>> >Nar
>>>>>>>> >
>>>>>>>> >On Thu, 6 Aug 2020 at 06:16, Strahil Nikolov <hunter86...@yahoo.com> wrote:
>>>>>>>> >
>>>>>>>> >> After rebooting the node, have you "marked" it that it was rebooted?
>>>>>>>> >>
>>>>>>>> >> Best Regards,
>>>>>>>> >> Strahil Nikolov
>>>>>>>> >>
>>>>>>>> >> On 5 August 2020 at 21:29:04 GMT+03:00, Nardus Geldenhuys
>>>>>>>> >> <nard...@gmail.com> wrote:
>>>>>>>> >> >Hi oVirt land
>>>>>>>> >> >
>>>>>>>> >> >Hope you are well. We've got a bit of an issue, actually a big issue.
>>>>>>>> >> >We had some sort of dip. All the VMs are still running, but some of
>>>>>>>> >> >the hosts are showing "Unassigned" or "NonResponsive". All the hosts
>>>>>>>> >> >were showing UP and were fine before our dip. So I increased
>>>>>>>> >> >vdsHeartbeatInSecond to 240, with no luck.
>>>>>>>> >> >
>>>>>>>> >> >I still get a timeout in the engine log even though I can connect to
>>>>>>>> >> >that host from the engine using nc to test port 54321. I also
>>>>>>>> >> >restarted vdsmd and rebooted the host, with no luck.
>>>>>>>> >> >
>>>>>>>> >> > nc -v someserver 54321
>>>>>>>> >> >Ncat: Version 7.50 ( https://nmap.org/ncat )
>>>>>>>> >> >Ncat: Connected to 172.40.2.172:54321.
>>>>>>>> >> > >>>>>>>> >> >2020-08-05 20:20:34,256+02 ERROR >>>>>>>> >> >>>>>>>> >>>>>>>> >>[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] >>>>>>>> >> >(EE-ManagedThreadFactory-engineScheduled-Thread-70) [] EVENT_ID: >>>>>>>> >> >VDS_BROKER_COMMAND_FAILURE(10,802), VDSM someserver command Get >>>>>>>> Host >>>>>>>> >> >Capabilities failed: Message timeout which can be caused by >>>>>>>> >> >communication >>>>>>>> >> >issues >>>>>>>> >> > >>>>>>>> >> >Any troubleshoot ideas will be gladly appreciated. >>>>>>>> >> > >>>>>>>> >> >Regards >>>>>>>> >> > >>>>>>>> >> >Nar >>>>>>>> >> >>>>>>>> >>>>>>> _______________________________________________ >>>>>>> Users mailing list -- users@ovirt.org >>>>>>> To unsubscribe send an email to users-le...@ovirt.org >>>>>>> Privacy Statement: https://www.ovirt.org/privacy-policy.html >>>>>>> oVirt Code of Conduct: >>>>>>> https://www.ovirt.org/community/about/community-guidelines/ >>>>>>> List Archives: >>>>>>> https://lists.ovirt.org/archives/list/users@ovirt.org/message/C4HB2J3MH76FI2325Z4AV4VCCEKH4M3S/ >>>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Artur Socha >>>>>> Senior Software Engineer, RHV >>>>>> Red Hat >>>>>> >>>>> >>>> >>>> -- >>>> Artur Socha >>>> Senior Software Engineer, RHV >>>> Red Hat >>>> >>> >> >> -- >> Artur Socha >> Senior Software Engineer, RHV >> Red Hat >> > > > -- > Artur Socha > Senior Software Engineer, RHV > Red Hat >
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/HESQGCIGK53EF7YPGUIQXMMDQYMTJWAR/