Hi Artur

Hope you are well. Please see below; this is after I restarted the engine:

host:
[root@ovirt-aa-1-21 ~]# tcpdump -i ovirtmgmt -c 1000 -ttttnnvvS dst
ovirt-engine-aa-1-01
tcpdump: listening on ovirtmgmt, link-type EN10MB (Ethernet), capture size
262144 bytes
2020-08-07 12:09:32.553543 ARP, Ethernet (len 6), IPv4 (len 4), Reply
172.140.220.111 is-at 00:25:b5:04:00:25, length 28
2020-08-07 12:10:05.584594 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF],
proto TCP (6), length 60)
    172.140.220.111.54321 > 172.140.220.23.56202: Flags [S.], cksum 0x5cd5
(incorrect -> 0xc8ca), seq 4036072905, ack 3265413231, win 28960, options
[mss 1460,sackOK,TS val 3039504636 ecr 341411251,nop,wscale 7], length 0
2020-08-07 12:10:10.589276 ARP, Ethernet (len 6), IPv4 (len 4), Reply
172.140.220.111 is-at 00:25:b5:04:00:25, length 28
2020-08-07 12:10:15.596230 IP (tos 0x0, ttl 64, id 48438, offset 0, flags
[DF], proto TCP (6), length 52)
    172.140.220.111.54321 > 172.140.220.23.56202: Flags [F.], cksum 0x5ccd
(incorrect -> 0x40b8), seq 4036072906, ack 3265413231, win 227, options
[nop,nop,TS val 3039514647 ecr 341411251], length 0
2020-08-07 12:10:20.596429 ARP, Ethernet (len 6), IPv4 (len 4), Request
who-has 172.140.220.23 tell 172.140.220.111, length 28
2020-08-07 12:10:20.663699 IP (tos 0x0, ttl 64, id 64726, offset 0, flags
[DF], proto TCP (6), length 40)
    172.140.220.111.54321 > 172.140.220.23.56202: Flags [R], cksum 0x1d20
(correct), seq 4036072907, win 0, length 0

engine:
[root@ovirt-engine-aa-1-01 ~]# tcpdump -i eth0 -c 1000 -ttttnnvvS src
ovirt-aa-1-21
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size
262144 bytes
2020-08-07 12:09:31.891242 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF],
proto TCP (6), length 60)
    172.140.220.111.54321 > 172.140.220.23.56202: Flags [S.], cksum 0xc8ca
(correct), seq 4036072905, ack 3265413231, win 28960, options [mss
1460,sackOK,TS val 3039504636 ecr 341411251,nop,wscale 7], length 0
2020-08-07 12:09:36.895502 ARP, Ethernet (len 6), IPv4 (len 4), Reply
172.140.220.111 is-at 00:25:b5:04:00:25, length 42
2020-08-07 12:09:41.901981 IP (tos 0x0, ttl 64, id 48438, offset 0, flags
[DF], proto TCP (6), length 52)
    172.140.220.111.54321 > 172.140.220.23.56202: Flags [F.], cksum 0x40b8
(correct), seq 4036072906, ack 3265413231, win 227, options [nop,nop,TS val
3039514647 ecr 341411251], length 0
2020-08-07 12:09:46.901681 ARP, Ethernet (len 6), IPv4 (len 4), Request
who-has 172.140.220.23 tell 172.140.220.111, length 42
2020-08-07 12:09:46.968911 IP (tos 0x0, ttl 64, id 64726, offset 0, flags
[DF], proto TCP (6), length 40)
    172.140.220.111.54321 > 172.140.220.23.56202: Flags [R], cksum 0x1d20
(correct), seq 4036072907, win 0, length 0

Regards

Nar

On Fri, 7 Aug 2020 at 11:54, Artur Socha <aso...@redhat.com> wrote:

> Hi Nardus,
> There is one more thing to be checked.
>
> 1) could you check if there are any packets sent from the affected host to
> the engine?
> on host:
> # outgoing traffic
>  sudo  tcpdump -i <interface_name_on_host> -c 1000 -ttttnnvvS dst
> <engine_host>
>
> 2) the same the other way round: check if there are packets received on the
> engine side from the affected host
> on engine:
> # incoming traffic
> sudo  tcpdump -i <interface_name_on_engine> -c 1000 -ttttnnvvS src
> <affected_host>
>
> Artur
>
>
> On Thu, Aug 6, 2020 at 4:51 PM Artur Socha <aso...@redhat.com> wrote:
>
>> Thanks Nardus,
>> After a quick look I found what I was suspecting - there are way too many
>> threads in a Blocked state. I don't know the reason yet, but this is very
>> helpful. I'll let you know about the findings/investigation. Meanwhile, you
>> may try restarting the engine as a (very brute and ugly) workaround.
>> You may also try setting up a slightly bigger thread pool; it may save you
>> some time until the next hiccup. However, please be aware that this may
>> come at a cost in memory usage and higher CPU usage (due to increased
>> context switching).
>> Here are some docs:
>>
>> # Specify the thread pool size for jboss managed scheduled executor service 
>> used by commands to periodically execute
>> # methods. It is generally not necessary to increase the number of threads 
>> in this thread pool. To change the value
>> # permanently create a conf file 99-engine-scheduled-thread-pool.conf in 
>> /etc/ovirt-engine/engine.conf.d/
>> ENGINE_SCHEDULED_THREAD_POOL_SIZE=100
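>>
>> For example, a minimal sketch (assuming a standalone engine install and
>> that 150 is a reasonable size for your environment; adjust as needed):
>>
>> # create the override file named in the comment above
>> echo 'ENGINE_SCHEDULED_THREAD_POOL_SIZE=150' \
>>   > /etc/ovirt-engine/engine.conf.d/99-engine-scheduled-thread-pool.conf
>> # restart the engine to pick up the new value
>> systemctl restart ovirt-engine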
>>
>>
>> A.
>>
>>
>> On Thu, Aug 6, 2020 at 4:19 PM Nardus Geldenhuys <nard...@gmail.com>
>> wrote:
>>
>>> Hi Artur
>>>
>>> Please find attached; let me know if I need to rerun. They are 5 minutes
>>> apart.
>>>
>>> [root@engine-aa-1-01 ovirt-engine]#  ps -ef | grep jboss | grep -v grep
>>> | awk '{ print $2 }'
>>> 27390
>>> [root@engine-aa-1-01 ovirt-engine]# jstack -F 27390 >
>>> your_engine_thread_dump_1.txt
>>> [root@engine-aa-1-01 ovirt-engine]# jstack -F 27390 >
>>> your_engine_thread_dump_2.txt
>>> [root@engine-aa-1-01 ovirt-engine]# jstack -F 27390 >
>>> your_engine_thread_dump_3.txt
>>>
>>> Regards
>>>
>>> Nar
>>>
>>> On Thu, 6 Aug 2020 at 15:55, Artur Socha <aso...@redhat.com> wrote:
>>>
>>>> Sure thing.
>>>> On the engine host, find the jboss PID. You can use this command:
>>>>
>>>>  ps -ef | grep jboss | grep -v grep | awk '{ print $2 }'
>>>>
>>>> or the jps tool from the JDK. Sample output on my dev environment is:
>>>>
>>>> ± % jps
>>>>                                                        !2860
>>>> 64853 jboss-modules.jar
>>>> 196217 Jps
>>>>
>>>> Then use jstack from the JDK:
>>>> jstack <pid> > your_engine_thread_dump.txt
>>>> 2 or 3 dumps taken at approximately 5-minute intervals would be even
>>>> more useful.
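>>>>
>>>> Something like this rough sketch would cover it (assumes bash, and that
>>>> the jboss pid found above is in $PID, which is just a placeholder):
>>>>
>>>> for i in 1 2 3; do
>>>>   jstack "$PID" > "your_engine_thread_dump_${i}.txt"
>>>>   [ "$i" -lt 3 ] && sleep 300   # wait ~5 minutes between dumps
>>>> done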
>>>>
>>>> Here you can find even more options:
>>>> https://www.baeldung.com/java-thread-dump
>>>>
>>>> Artur
>>>>
>>>> On Thu, Aug 6, 2020 at 3:15 PM Nardus Geldenhuys <nard...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> I can create a thread dump; please send details on how to.
>>>>>
>>>>> Regards
>>>>>
>>>>> Nardus
>>>>>
>>>>> On Thu, 6 Aug 2020 at 14:17, Artur Socha <aso...@redhat.com> wrote:
>>>>>
>>>>>> Hi Nardus,
>>>>>> You might have hit an issue I have been hunting for some time ([1]
>>>>>> and [2]).
>>>>>> [1] could not be properly resolved because at the time I was not able to
>>>>>> recreate the issue on a dev setup.
>>>>>> I suspect [2] is related.
>>>>>>
>>>>>> Would you be able to prepare a thread dump from your engine instance?
>>>>>> Additionally, please check for potential libvirt errors/warnings.
>>>>>> Can you also paste the output of:
>>>>>> sudo yum list installed | grep vdsm
>>>>>> sudo yum list installed | grep ovirt-engine
>>>>>> sudo yum list installed | grep libvirt
>>>>>>
>>>>>> Usually, according to previous reports, restarting the engine helps
>>>>>> to restore connectivity with hosts ... at least for some time.
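>>>>>>
>>>>>> For reference, on a standalone engine that is usually just (assuming a
>>>>>> systemd-based install):
>>>>>>
>>>>>> systemctl restart ovirt-engine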
>>>>>>
>>>>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1845152
>>>>>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1846338
>>>>>>
>>>>>> regards,
>>>>>> Artur
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 6, 2020 at 8:01 AM Nardus Geldenhuys <nard...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Also see this in the engine:
>>>>>>>
>>>>>>> Aug 6, 2020, 7:37:17 AM
>>>>>>> VDSM someserver command Get Host Capabilities failed: Message
>>>>>>> timeout which can be caused by communication issues
>>>>>>>
>>>>>>> On Thu, 6 Aug 2020 at 07:09, Strahil Nikolov <hunter86...@yahoo.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Can you check for errors on the affected host? Most probably you
>>>>>>>> need the vdsm logs.
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Strahil Nikolov
>>>>>>>>
>>>>>>>> On 6 August 2020 at 7:40:23 GMT+03:00, Nardus Geldenhuys <
>>>>>>>> nard...@gmail.com> wrote:
>>>>>>>> >Hi Strahil
>>>>>>>> >
>>>>>>>> >Hope you are well. I get the following error when I tried to confirm
>>>>>>>> >the reboot:
>>>>>>>> >
>>>>>>>> >Error while executing action: Cannot confirm 'Host has been rebooted'
>>>>>>>> >Host.
>>>>>>>> >Valid Host statuses are "Non operational", "Maintenance" or
>>>>>>>> >"Connecting".
>>>>>>>> >
>>>>>>>> >And I can't put it in maintenance; the only options are "restart" or
>>>>>>>> >"stop".
>>>>>>>> >
>>>>>>>> >Regards
>>>>>>>> >
>>>>>>>> >Nar
>>>>>>>> >
>>>>>>>> >On Thu, 6 Aug 2020 at 06:16, Strahil Nikolov <
>>>>>>>> hunter86...@yahoo.com>
>>>>>>>> >wrote:
>>>>>>>> >
>>>>>>>> >> After rebooting the node, have you "marked" it as rebooted?
>>>>>>>> >>
>>>>>>>> >> Best Regards,
>>>>>>>> >> Strahil Nikolov
>>>>>>>> >>
>>>>>>>> >> On 5 August 2020 at 21:29:04 GMT+03:00, Nardus Geldenhuys <
>>>>>>>> >> nard...@gmail.com> wrote:
>>>>>>>> >> >Hi oVirt land
>>>>>>>> >> >
>>>>>>>> >> >Hope you are well. Got a bit of an issue, actually a big issue. We
>>>>>>>> >> >had some sort of dip. All the VMs are still running, but some of
>>>>>>>> >> >the hosts show "Unassigned" or "NonResponsive". All the hosts were
>>>>>>>> >> >showing UP and were fine before our dip. I did increase
>>>>>>>> >> >vdsHeartbeatInSecond to 240, no luck.
>>>>>>>> >> >
>>>>>>>> >> >I still get a timeout in the engine log even though I can connect
>>>>>>>> >> >to that host from the engine using nc to test port 54321. I also
>>>>>>>> >> >restarted vdsmd and rebooted the host, with no luck.
>>>>>>>> >> >
>>>>>>>> >> > nc -v someserver 54321
>>>>>>>> >> >Ncat: Version 7.50 ( https://nmap.org/ncat )
>>>>>>>> >> >Ncat: Connected to 172.40.2.172:54321.
>>>>>>>> >> >
>>>>>>>> >> >2020-08-05 20:20:34,256+02 ERROR
>>>>>>>> >> >[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>>>>>> >> >(EE-ManagedThreadFactory-engineScheduled-Thread-70) [] EVENT_ID:
>>>>>>>> >> >VDS_BROKER_COMMAND_FAILURE(10,802), VDSM someserver command Get Host
>>>>>>>> >> >Capabilities failed: Message timeout which can be caused by
>>>>>>>> >> >communication issues
>>>>>>>> >> >
>>>>>>>> >> >Any troubleshooting ideas will be gladly appreciated.
>>>>>>>> >> >
>>>>>>>> >> >Regards
>>>>>>>> >> >
>>>>>>>> >> >Nar
>>>>>>>> >>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Artur Socha
>>>>>> Senior Software Engineer, RHV
>>>>>> Red Hat
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Artur Socha
>>>> Senior Software Engineer, RHV
>>>> Red Hat
>>>>
>>>
>>
>> --
>> Artur Socha
>> Senior Software Engineer, RHV
>> Red Hat
>>
>
>
> --
> Artur Socha
> Senior Software Engineer, RHV
> Red Hat
>
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/HESQGCIGK53EF7YPGUIQXMMDQYMTJWAR/
