We have another failing patch with the same test:
http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/360/

It's obviously not related to the patch, but something is causing these
failures randomly.
From what I can see in the current failed job, you are correct: the engine
does not even try to deploy host-1.

I can see host-1 getting error 127 in Lago for ntpdate, and I can see
NetworkManager errors in the host's messages log.
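As an aside, exit status 127 from a shell specifically means "command not
found", which matches the 'bash: ntpdate: command not found' further down in
this thread. A minimal, self-contained reproduction (pure illustration, not
OST code):

```python
import subprocess

# Running a nonexistent command through bash yields exit status 127,
# the same code Lago reported for the ntpdate invocation on host-1.
result = subprocess.run(
    ["bash", "-c", "no_such_command_xyz"],
    capture_output=True, text=True,
)
print(result.returncode)          # 127
print(result.stderr.strip())      # bash's "command not found" message
```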

In the host's messages log I can see several messages that I suspect may
cause issues in the communication between the engine and the host:

Mar 25 07:50:09 lago-basic-suite-4-3-host-1 sasldblistusers2:
_sasldb_getkeyhandle has failed
Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error deleting
entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error deleting
entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error deleting
entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found


Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding:
bondscan-UMJa2S is being created...
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding:
bondscan-UMJa2S is being deleted...
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: <info>
[1553514613.7774] manager: (bondscan-UMJa2S): new Bond device
(/org/freedesktop/NetworkManager/Devices/13)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered
net_device) (unregistering): Released all slaves
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding:
bondscan-liwvMR is being created...
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option
fail_over_mac: invalid value (3)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option
ad_select: invalid value (3)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option
lacp_rate: mode dependency failed, not supported in mode balance-rr(0)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option
arp_validate: invalid value (7)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option
xmit_hash_policy: invalid value (5)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: <info>
[1553514613.8002] manager: (bondscan-liwvMR): new Bond device
(/org/freedesktop/NetworkManager/Devices/14)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option
primary_reselect: invalid value (3)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option
arp_all_targets: invalid value (2)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding:
bondscan-liwvMR is being deleted...
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered
net_device) (unregistering): Released all slaves
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding:
bondscan-liwvMR is being created...
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option
fail_over_mac: invalid value (3)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option
ad_select: invalid value (3)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option
lacp_rate: mode dependency failed, not supported in mode active-backup(1)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option
arp_validate: invalid value (7)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: <info>
[1553514613.8429] manager: (bondscan-liwvMR): new Bond device
(/org/freedesktop/NetworkManager/Devices/15)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option
xmit_hash_policy: invalid value (5)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option
primary_reselect: invalid value (3)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option
arp_all_targets: invalid value (2)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding:
bondscan-liwvMR is being deleted...
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered
net_device) (unregistering): Released all slaves


On Mon, Mar 25, 2019 at 11:24 AM Marcin Sobczyk <msobc...@redhat.com> wrote:

> For the failed job, the engine didn't even try to deploy on host-1:
>
>
> https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-4.3/post-002_bootstrap.py/lago-basic-suite-4-3-engine/_var_log/ovirt-engine/host-deploy/
>
> Martin, do you know what could be the reason for that?
>
> I can see in the logs of both successful and unsuccessful basic-suite-4.3
> runs that there is no 'ntpdate' on host-1:
>
> 2019-03-25 10:14:46,350::ssh.py::ssh::58::lago.ssh::DEBUG::Running d0c49b54 
> on lago-basic-suite-4-3-host-1: ntpdate -4 lago-basic-suite-4-3-engine
> 2019-03-25 10:14:46,383::ssh.py::ssh::81::lago.ssh::DEBUG::Command d0c49b54 
> on lago-basic-suite-4-3-host-1 returned with 127
> 2019-03-25 10:14:46,384::ssh.py::ssh::96::lago.ssh::DEBUG::Command d0c49b54 
> on lago-basic-suite-4-3-host-1  errors:
>  bash: ntpdate: command not found
>
> On host-0 everything is ok:
>
> 2019-03-25 10:14:46,917::ssh.py::ssh::58::lago.ssh::DEBUG::Running d11b2a64 
> on lago-basic-suite-4-3-host-0: ntpdate -4 lago-basic-suite-4-3-engine
> 2019-03-25 10:14:53,088::ssh.py::ssh::81::lago.ssh::DEBUG::Command d11b2a64 
> on lago-basic-suite-4-3-host-0 returned with 0
> 2019-03-25 10:14:53,088::ssh.py::ssh::89::lago.ssh::DEBUG::Command d11b2a64 
> on lago-basic-suite-4-3-host-0 output:
>  25 Mar 06:14:53 ntpdate[6646]: adjust time server 192.168.202.2 offset 
> 0.017150 sec
>
> On 3/25/19 10:13 AM, Eyal Edri wrote:
>
> Still fails, now on a different component (ovirt-web-ui-extensions).
>
> https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/
>
> On Fri, Mar 22, 2019 at 3:59 PM Dan Kenigsberg <dan...@redhat.com> wrote:
>
>>
>>
>> On Fri, Mar 22, 2019 at 3:21 PM Marcin Sobczyk <msobc...@redhat.com>
>> wrote:
>>
>>> Dafna,
>>>
>>> in 'verify_add_hosts' we specifically wait, with a timeout, for a single
>>> host to be up:
>>>
>>>  144     up_hosts = hosts_service.list(search='datacenter={} AND status=up'.format(DC_NAME))
>>>  145     if len(up_hosts):
>>>  146         return True
>>>
>>> The log files say that it took ~50 seconds for one of the hosts to come
>>> up (which seems reasonable), and no timeout is reported.
>>> Just after running 'verify_add_hosts', we run
>>> 'add_master_storage_domain', which calls the '_hosts_in_dc' function.
>>> That function does the exact same check, but it fails:
>>>
>>>  113     hosts = hosts_service.list(search='datacenter={} AND status=up'.format(dc_name))
>>>  114     if hosts:
>>>  115         if random_host:
>>>  116             return random.choice(hosts)
>> I don't think it is relevant to our current failure, but I consider
>> random_host=True bad practice. As if we do not have enough moving
>> parts, we are adding intentional randomness. Reproducibility is far more
>> important than coverage, particularly for a shared system test like OST.
>>
>>>  117         else:
>>>  118             return sorted(hosts, key=lambda host: host.name)
>>>  119     raise RuntimeError('Could not find hosts that are up in DC %s' % dc_name)
>>>
>>>
>>> I'm also not able to reproduce this issue locally on my server. The
>>> investigation continues...
>>>
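The pattern described above, where 'verify_add_hosts' sees a host up but the
very next identical query in '_hosts_in_dc' finds none, suggests the host's
status flapped between the two calls. One way to make the second check
resilient is to re-poll rather than query once; a minimal sketch (the helper
below is illustrative, not the OST/testlib API):

```python
import time

def wait_for(predicate, timeout=60.0, interval=1.0, clock=time.monotonic):
    """Re-evaluate `predicate` until it returns truthy or `timeout` expires."""
    deadline = clock() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if clock() >= deadline:
            raise RuntimeError('condition not met within %s seconds' % timeout)
        time.sleep(interval)

# Toy usage: a condition that only becomes true on the third poll.
polls = iter([False, False, True])
print(wait_for(lambda: next(polls), timeout=5, interval=0))  # True
```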
>>
>> I think that it would be fair to take the filtering by host state out of
>> Engine and into the test, where we can easily log the current status of
>> each host. Then we'd have a better understanding of the next failure.
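That suggestion could look roughly like this: fetch all hosts in the DC
without the status filter, log each host's state, and filter client-side.
The SDK shapes below (a 'status' attribute with plain string values, the
helper name, and the host names) are assumptions for illustration only:

```python
from collections import namedtuple

# Stand-in for the SDK host type; the real objects would come from
# hosts_service.list(search='datacenter=...') without a status filter.
Host = namedtuple('Host', 'name status')

def up_hosts_with_logging(hosts, log=print):
    """Log every host's status, then return the 'up' ones sorted by name."""
    for host in hosts:
        log('host %s is %s' % (host.name, host.status))
    return sorted((h for h in hosts if h.status == 'up'),
                  key=lambda h: h.name)

hosts = [Host('lago-host-1', 'installing'), Host('lago-host-0', 'up')]
print([h.name for h in up_hosts_with_logging(hosts)])  # ['lago-host-0']
```

This way a failure leaves a log line per host instead of an opaque empty
result from the Engine-side search.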
>>
>> On 3/22/19 1:17 PM, Marcin Sobczyk wrote:
>>>
>>> Hi,
>>>
>>> sure, I'm on it. It's weird though: I ran the 4.3 basic suite for this
>>> patch manually and everything was OK.
>>> On 3/22/19 1:05 PM, Dafna Ron wrote:
>>>
>>> Hi,
>>>
>>> We are failing branch 4.3 for test:
>>> 002_bootstrap.add_master_storage_domain
>>>
>>> It seems that on one of the hosts vdsm is not starting;
>>> there is nothing in vdsm.log or in supervdsm.log.
>>>
>>> CQ identified this patch as the suspected root cause:
>>>
>>> https://gerrit.ovirt.org/#/c/98748/ - vdsm: client: Add support for
>>> flow id
>>>
>>> Milan, Marcin, can you please have a look?
>>>
>>> full logs:
>>>
>>>
>>> http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/326/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-4.3/post-002_bootstrap.py/
>>>
>>> The only error I can see is about the host not being up (which makes
>>> sense, as vdsm is not running):
>>>
>>> Stacktrace
>>>
>>>   File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
>>>     testMethod()
>>>   File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
>>>     self.test(*self.arg)
>>>   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 142, 
>>> in wrapped_test
>>>     test()
>>>   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 60, in 
>>> wrapper
>>>     return func(get_test_prefix(), *args, **kwargs)
>>>   File 
>>> "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py",
>>>  line 417, in add_master_storage_domain
>>>     add_iscsi_storage_domain(prefix)
>>>   File 
>>> "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py",
>>>  line 561, in add_iscsi_storage_domain
>>>     host=_random_host_from_dc(api, DC_NAME),
>>>   File 
>>> "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py",
>>>  line 122, in _random_host_from_dc
>>>     return _hosts_in_dc(api, dc_name, True)
>>>   File 
>>> "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py",
>>>  line 119, in _hosts_in_dc
>>>     raise RuntimeError('Could not find hosts that are up in DC %s' % 
>>> dc_name)
>>> 'Could not find hosts that are up in DC test-dc\n-------------------- >> 
>>> begin captured logging << --------------------\nlago.ssh: DEBUG: start 
>>> task:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get ssh client for 
>>> lago-basic-suite-4-3-engine:\nlago.ssh: DEBUG: end 
>>> task:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get ssh client for 
>>> lago-basic-suite-4-3-engine:\nlago.ssh: DEBUG: Running c07b5ee2 on 
>>> lago-basic-suite-4-3-engine: cat /root/multipath.txt\nlago.ssh: DEBUG: 
>>> Command c07b5ee2 on lago-basic-suite-4-3-engine returned with 0\nlago.ssh: 
>>> DEBUG: Command c07b5ee2 on lago-basic-suite-4-3-engine output:\n 
>>> 3600140516f88cafa71243648ea218995\n360014053e28f60001764fed9978ec4b3\n360014059edc777770114a6484891dcf1\n36001405d93d8585a50d43a4ad0bd8d19\n36001405e31361631de14bcf87d43e55a\n\n-----------
>>>
>>> _______________________________________________
>>> Devel mailing list -- devel@ovirt.org
>>> To unsubscribe send an email to devel-le...@ovirt.org
>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>>> oVirt Code of Conduct:
>>> https://www.ovirt.org/community/about/community-guidelines/
>>> List Archives:
>>> https://lists.ovirt.org/archives/list/devel@ovirt.org/message/J4NCHXTK5ZYLXWW36DZKAUL5DN7WBNW4/
>>>
>>
>
>
> --
>
> Eyal Edri
>
>
> MANAGER
>
> RHV/CNV DevOps
>
> EMEA VIRTUALIZATION R&D
>
>
> Red Hat EMEA <https://www.redhat.com/>
> <https://red.ht/sig> TRIED. TESTED. TRUSTED. <https://redhat.com/trusted>
> phone: +972-9-7692018
> irc: eedri (on #tlv #rhev-dev #rhev-integ)
>
>
