We have another failing patch with the same test: http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/360/
It's obviously not related to the patch, but something is causing these failures randomly. From what I can see in the current failed job, you are correct: the engine does not even try to deploy host-1. I can see host-1 getting error 127 (command not found) in lago for ntpdate, and there are NetworkManager errors in the host's messages log. In the messages log I can see several entries that I suspect may cause issues in communication between the engine and the host:

Mar 25 07:50:09 lago-basic-suite-4-3-host-1 sasldblistusers2: _sasldb_getkeyhandle has failed
Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-UMJa2S is being created...
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-UMJa2S is being deleted...
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: <info> [1553514613.7774] manager: (bondscan-UMJa2S): new Bond device (/org/freedesktop/NetworkManager/Devices/13)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered net_device) (unregistering): Released all slaves
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being created...
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option fail_over_mac: invalid value (3)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option ad_select: invalid value (3)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option lacp_rate: mode dependency failed, not supported in mode balance-rr(0)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_validate: invalid value (7)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option xmit_hash_policy: invalid value (5)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: <info> [1553514613.8002] manager: (bondscan-liwvMR): new Bond device (/org/freedesktop/NetworkManager/Devices/14)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option primary_reselect: invalid value (3)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_all_targets: invalid value (2)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being deleted...
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered net_device) (unregistering): Released all slaves
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being created...
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option fail_over_mac: invalid value (3)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option ad_select: invalid value (3)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option lacp_rate: mode dependency failed, not supported in mode active-backup(1)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_validate: invalid value (7)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: <info> [1553514613.8429] manager: (bondscan-liwvMR): new Bond device (/org/freedesktop/NetworkManager/Devices/15)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option xmit_hash_policy: invalid value (5)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option primary_reselect: invalid value (3)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_all_targets: invalid value (2)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being deleted...
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered net_device) (unregistering): Released all slaves

On Mon, Mar 25, 2019 at 11:24 AM Marcin Sobczyk <msobc...@redhat.com> wrote:

> For the failed job, the engine didn't even try to deploy on host-1:
>
> https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-4.3/post-002_bootstrap.py/lago-basic-suite-4-3-engine/_var_log/ovirt-engine/host-deploy/
>
> Martin, do you know what could be the reason for that?
>
> I can see in the logs, for both successful and unsuccessful basic-suite-4.3 runs, that there is no 'ntpdate' on host-1:
>
> 2019-03-25 10:14:46,350::ssh.py::ssh::58::lago.ssh::DEBUG::Running d0c49b54 on lago-basic-suite-4-3-host-1: ntpdate -4 lago-basic-suite-4-3-engine
> 2019-03-25 10:14:46,383::ssh.py::ssh::81::lago.ssh::DEBUG::Command d0c49b54 on lago-basic-suite-4-3-host-1 returned with 127
> 2019-03-25 10:14:46,384::ssh.py::ssh::96::lago.ssh::DEBUG::Command d0c49b54 on lago-basic-suite-4-3-host-1 errors:
> bash: ntpdate: command not found
>
> On host-0 everything is ok:
>
> 2019-03-25 10:14:46,917::ssh.py::ssh::58::lago.ssh::DEBUG::Running d11b2a64 on lago-basic-suite-4-3-host-0: ntpdate -4 lago-basic-suite-4-3-engine
> 2019-03-25 10:14:53,088::ssh.py::ssh::81::lago.ssh::DEBUG::Command d11b2a64 on lago-basic-suite-4-3-host-0 returned with 0
> 2019-03-25 10:14:53,088::ssh.py::ssh::89::lago.ssh::DEBUG::Command d11b2a64 on lago-basic-suite-4-3-host-0 output:
> 25 Mar 06:14:53 ntpdate[6646]: adjust time server 192.168.202.2 offset 0.017150 sec
>
> On 3/25/19 10:13 AM, Eyal Edri wrote:
>
> Still fails, now on a different component. (ovirt-web-ui-extentions)
>
> https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/
>
> On Fri, Mar 22, 2019 at 3:59 PM Dan Kenigsberg <dan...@redhat.com> wrote:
>
>> On Fri, Mar 22, 2019 at 3:21 PM Marcin Sobczyk <msobc...@redhat.com> wrote:
>>
>>> Dafna,
>>>
>>> in 'verify_add_hosts' we specifically wait for a single host to be up, with a timeout:
>>>
>>> 144 up_hosts = hosts_service.list(search='datacenter={} AND status=up'.format(DC_NAME))
>>> 145 if len(up_hosts):
>>> 146 return True
>>>
>>> The log files say that it took ~50 secs for one of the hosts to be up (seems reasonable) and no timeout is being reported.
>>> Just after running 'verify_add_hosts', we run 'add_master_storage_domain', which calls the '_hosts_in_dc' function.
>>> That function does the exact same check, but it fails:
>>>
>>> 113 hosts = hosts_service.list(search='datacenter={} AND status=up'.format(dc_name))
>>> 114 if hosts:
>>> 115 if random_host:
>>> 116 return random.choice(hosts)
>>
>> I don't think it is relevant to our current failure; but I consider random_host=True a bad practice. As if we do not have enough moving parts, we are adding intentional randomness. Reproducibility is far more important than coverage - particularly for a shared system test like OST.
>
>>> 117 else:
>>> 118 return sorted(hosts, key=lambda host: host.name)
>>> 119 raise RuntimeError('Could not find hosts that are up in DC %s' % dc_name)
>>>
>>> I'm also not able to reproduce this issue locally on my server. The investigation continues...
>>
>> I think that it would be fair to take the filtering by host state out of Engine and into the test, where we can easily log the current status of each host. Then we'd have a better understanding on the next failure.
>>
>> On 3/22/19 1:17 PM, Marcin Sobczyk wrote:
>>>
>>> Hi,
>>>
>>> sure, I'm on it - it's weird though, I did run the 4.3 basic suite for this patch manually and everything was ok.
>>>
>>> On 3/22/19 1:05 PM, Dafna Ron wrote:
>>>
>>> Hi,
>>>
>>> We are failing branch 4.3 for test: 002_bootstrap.add_master_storage_domain
>>>
>>> It seems that on one of the hosts, vdsm is not starting; there is nothing in vdsm.log or in supervdsm.log.
>>>
>>> CQ identified this patch as the suspected root cause:
>>>
>>> https://gerrit.ovirt.org/#/c/98748/ - vdsm: client: Add support for flow id
>>>
>>> Milan, Marcin, can you please have a look?
>>>
>>> full logs:
>>>
>>> http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/326/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-4.3/post-002_bootstrap.py/
>>>
>>> The only error I can see is about the host not being up (which makes sense, as vdsm is not running):
>>>
>>> Stacktrace
>>>
>>> File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
>>>   testMethod()
>>> File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
>>>   self.test(*self.arg)
>>> File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 142, in wrapped_test
>>>   test()
>>> File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 60, in wrapper
>>>   return func(get_test_prefix(), *args, **kwargs)
>>> File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 417, in add_master_storage_domain
>>>   add_iscsi_storage_domain(prefix)
>>> File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 561, in add_iscsi_storage_domain
>>>   host=_random_host_from_dc(api, DC_NAME),
>>> File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 122, in _random_host_from_dc
>>>   return _hosts_in_dc(api, dc_name, True)
>>> File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 119, in _hosts_in_dc
>>>   raise RuntimeError('Could not find hosts that are up in DC %s' % dc_name)
>>> 'Could not find hosts that are up in DC test-dc\n-------------------- >> begin captured logging << --------------------\nlago.ssh: DEBUG: start task:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get ssh client for lago-basic-suite-4-3-engine:\nlago.ssh: DEBUG: end task:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get ssh client for lago-basic-suite-4-3-engine:\nlago.ssh: DEBUG: Running c07b5ee2 on lago-basic-suite-4-3-engine: cat /root/multipath.txt\nlago.ssh: DEBUG: Command c07b5ee2 on lago-basic-suite-4-3-engine returned with 0\nlago.ssh: DEBUG: Command c07b5ee2 on lago-basic-suite-4-3-engine output:\n 3600140516f88cafa71243648ea218995\n360014053e28f60001764fed9978ec4b3\n360014059edc777770114a6484891dcf1\n36001405d93d8585a50d43a4ad0bd8d19\n36001405e31361631de14bcf87d43e55a\n\n-----------
>>>
>>> _______________________________________________
>>> Devel mailing list -- devel@ovirt.org
>>> To unsubscribe send an email to devel-le...@ovirt.org
>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>>> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
>>> List Archives: https://lists.ovirt.org/archives/list/devel@ovirt.org/message/J4NCHXTK5ZYLXWW36DZKAUL5DN7WBNW4/
>>>
>> _______________________________________________
>> Devel mailing list -- devel@ovirt.org
>> To unsubscribe send an email to devel-le...@ovirt.org
>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
>> List Archives: https://lists.ovirt.org/archives/list/devel@ovirt.org/message/ULS4OKU2YZFDQT5EDFYKLW5GFA52YZ7U/
>>
>
> --
>
> Eyal edri
>
> MANAGER
>
> RHV/CNV DevOps
>
> EMEA VIRTUALIZATION R&D
>
> Red Hat EMEA <https://www.redhat.com/>
> <https://red.ht/sig> TRIED. TESTED. TRUSTED.
> <https://redhat.com/trusted>
> phone: +972-9-7692018
> irc: eedri (on #tlv #rhev-dev #rhev-integ)
>
> _______________________________________________
> Devel mailing list -- devel@ovirt.org
> To unsubscribe send an email to devel-le...@ovirt.org
> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
> List Archives: https://lists.ovirt.org/archives/list/devel@ovirt.org/message/XOP6ZCUIDUVC2XNVBS2X7OAHGOXJZROL/
>
_______________________________________________
Devel mailing list -- devel@ovirt.org
To unsubscribe send an email to devel-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/devel@ovirt.org/message/SPALMYFTMZV6SVJVEZDV5PFECNFFWRVN/
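[Editor's note: a rough sketch of Dan's suggestion above - do the status filtering in the test itself and log every host's state, so the next failure shows why no host matched. The `Host` class and the plain `'up'` string below are illustrative stand-ins for the ovirtsdk4 types used in 002_bootstrap.py, not the real SDK objects.]

```python
import logging
import random

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger('ost.hosts')


class Host(object):
    """Minimal stand-in for an ovirtsdk4 host object; only the two
    attributes the test scenario actually reads."""
    def __init__(self, name, status):
        self.name = name
        self.status = status


def hosts_in_dc(all_hosts, random_host=False):
    # Filter by status on the client side instead of in the Engine-side
    # search query ('datacenter=... AND status=up'), logging each host's
    # status so a failure records the actual state of every host.
    for host in all_hosts:
        log.debug('host %s is %s', host.name, host.status)
    up = [h for h in all_hosts if h.status == 'up']
    if not up:
        raise RuntimeError(
            'Could not find hosts that are up: %s' %
            ', '.join('%s=%s' % (h.name, h.status) for h in all_hosts))
    if random_host:
        return random.choice(up)
    return sorted(up, key=lambda h: h.name)


hosts = [Host('lago-basic-suite-4-3-host-0', 'up'),
         Host('lago-basic-suite-4-3-host-1', 'installing')]
print([h.name for h in hosts_in_dc(hosts)])
```

With this, the next 'Could not find hosts that are up' failure would carry each host's name and status in the exception message instead of an empty Engine search result.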