And it's only failing on one host as well.
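For readers skimming the thread: the recurring failure code here is 127, which is the status POSIX shells return when the command to run cannot be found. It is the same code the engine gets for '/usr/bin/vdsm-tool vdsm-id' and lago gets for 'ntpdate' in the logs quoted below. A minimal local sketch of that convention (the missing command name is a deliberate placeholder, not a real tool):

```python
import subprocess

# Exit status 127 is reserved by POSIX shells for "command not found".
# 'definitely-not-a-real-command' is a placeholder, not a real tool.
missing = subprocess.call('definitely-not-a-real-command', shell=True,
                          stderr=subprocess.DEVNULL)

# A command that exists and succeeds exits with 0, as on host-0 below.
present = subprocess.call('true', shell=True)

print(missing)  # 127
print(present)  # 0
```

So a 127 from an SSH session means the remote shell never ran the tool at all, which points at a missing package rather than a failing one.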
On Mon, Mar 25, 2019 at 1:17 PM Marcin Sobczyk <msobc...@redhat.com> wrote:

> Hmmm, that would suggest that the 'vdsm-tool' command is not found on the
> host. There was no work done around 'vdsm-tool's 'vdsm-id' recently.
>
> On 3/25/19 2:08 PM, Dafna Ron wrote:
>
> Actually, I can see in engine that it is trying to start vdsm installation
> on the host:
>
> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] Entered SsoRestApiAuthFilter
> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] SsoRestApiAuthFilter authenticating with sso
> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] SsoRestApiAuthFilter authenticating using BEARER header
> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] SsoRestApiAuthFilter successfully authenticated using BEARER header
> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiNegotiationFilter] (default task-1) [] Entered SsoRestApiNegotiationFilter
> 2019-03-25 07:48:33,124-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiNegotiationFilter] (default task-1) [] SsoRestApiNegotiationFilter Not performing Negotiate Auth
> 2019-03-25 07:48:33,143-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getclusterforuserbyclustername(?, ?, ?)}]
> 2019-03-25 07:48:33,143-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [GetClusterForUserByClusterName] compiled
> 2019-03-25 07:48:33,145-04 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: runQuery, params: [GetClusterByName, NameQueryParameters:{refresh='false', filtered='false'}], timeElapsed: 7ms
> 2019-03-25 07:48:33,161-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [] Compiled stored procedure. Call string is [{call getvmstaticbyname(?)}]
> 2019-03-25 07:48:33,161-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [] SqlCall for procedure [GetVmStaticByName] compiled
> 2019-03-25 07:48:33,206-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getvdsbyname(?)}]
> 2019-03-25 07:48:33,206-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [GetVdsByName] compiled
> 2019-03-25 07:48:33,223-04 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: getByName, params: [lago-basic-suite-4-3-host-1], timeElapsed: 20ms
> 2019-03-25 07:48:33,225-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getvdsbyhostname(?)}]
> 2019-03-25 07:48:33,225-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [GetVdsByHostName] compiled
> 2019-03-25 07:48:33,252-04 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: getAllForHostname, params: [lago-basic-suite-4-3-host-1], timeElapsed: 28ms
> 2019-03-25 07:48:33,254-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getstorage_poolsbyclusterid(?)}]
> 2019-03-25 07:48:33,254-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [Getstorage_poolsByClusterId] compiled
> 2019-03-25 07:48:33,262-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Connecting 'root@lago-basic-suite-4-3-host-1'
> 2019-03-25 07:48:33,461-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Connected: 'root@lago-basic-suite-4-3-host-1'
> 2019-03-25 07:48:33,467-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Authenticating: 'root@lago-basic-suite-4-3-host-1'
> 2019-03-25 07:48:33,599-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Authenticated: 'root@lago-basic-suite-4-3-host-1'
> 2019-03-25 07:48:33,600-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Executing: '/usr/bin/vdsm-tool vdsm-id'
> 2019-03-25 07:48:33,902-04 WARN [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Failed to initiate vdsm-id request on host: Command returned failure code 127 during SSH session 'root@lago-basic-suite-4-3-host-1'
> 2019-03-25 07:48:33,920-04 INFO [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Running command: AddVdsCommand internal: false. Entities affected : ID: 319fcdf5-5941-46b1-a75c-d335bf500f82 Type: ClusterAction group CREATE_HOST with role type ADMIN
>
> On Mon, Mar 25, 2019 at 12:53 PM Dafna Ron <d...@redhat.com> wrote:
>
>> we have another failing patch with the same test:
>> http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/360/
>>
>> It's obviously not related to the patch, but there is something that is
>> causing these failures randomly.
>> From what I can see in the current failed job, you are correct: the
>> engine does not even try to deploy host-1.
>>
>> I can see host-1 is getting error 127 in lago for ntpdate, and I can see
>> NetworkManager errors in the host's messages log.
>>
>> In the host's messages log I can see several messages that I suspect may
>> cause issues in communication between engine and host:
>>
>> Mar 25 07:50:09 lago-basic-suite-4-3-host-1 sasldblistusers2: _sasldb_getkeyhandle has failed
>> Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
>> Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
>> Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
>>
>> ces/12)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-UMJa2S is being created...
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-UMJa2S is being deleted...
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: <info> [1553514613.7774] manager: (bondscan-UMJa2S): new Bond device (/org/freedesktop/NetworkManager/Devices/13)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered net_device) (unregistering): Released all slaves
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being created...
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option fail_over_mac: invalid value (3)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option ad_select: invalid value (3)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option lacp_rate: mode dependency failed, not supported in mode balance-rr(0)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_validate: invalid value (7)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option xmit_hash_policy: invalid value (5)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: <info> [1553514613.8002] manager: (bondscan-liwvMR): new Bond device (/org/freedesktop/NetworkManager/Devices/14)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option primary_reselect: invalid value (3)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_all_targets: invalid value (2)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being deleted...
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered net_device) (unregistering): Released all slaves
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being created...
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option fail_over_mac: invalid value (3)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option ad_select: invalid value (3)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option lacp_rate: mode dependency failed, not supported in mode active-backup(1)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_validate: invalid value (7)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: <info> [1553514613.8429] manager: (bondscan-liwvMR): new Bond device (/org/freedesktop/NetworkManager/Devices/15)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option xmit_hash_policy: invalid value (5)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option primary_reselect: invalid value (3)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_all_targets: invalid value (2)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being deleted...
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered net_device) (unregistering): Released all slaves
>>
>> On Mon, Mar 25, 2019 at 11:24 AM Marcin Sobczyk <msobc...@redhat.com> wrote:
>>
>>> For the failed job, the engine didn't even try to deploy on host-1:
>>>
>>> https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-4.3/post-002_bootstrap.py/lago-basic-suite-4-3-engine/_var_log/ovirt-engine/host-deploy/
>>>
>>> Martin, do you know what could be the reason for that?
>>>
>>> I can see in the logs, for both successful and unsuccessful
>>> basic-suite-4.3 runs, that there is no 'ntpdate' on host-1:
>>>
>>> 2019-03-25 10:14:46,350::ssh.py::ssh::58::lago.ssh::DEBUG::Running d0c49b54 on lago-basic-suite-4-3-host-1: ntpdate -4 lago-basic-suite-4-3-engine
>>> 2019-03-25 10:14:46,383::ssh.py::ssh::81::lago.ssh::DEBUG::Command d0c49b54 on lago-basic-suite-4-3-host-1 returned with 127
>>> 2019-03-25 10:14:46,384::ssh.py::ssh::96::lago.ssh::DEBUG::Command d0c49b54 on lago-basic-suite-4-3-host-1 errors:
>>> bash: ntpdate: command not found
>>>
>>> On host-0 everything is ok:
>>>
>>> 2019-03-25 10:14:46,917::ssh.py::ssh::58::lago.ssh::DEBUG::Running d11b2a64 on lago-basic-suite-4-3-host-0: ntpdate -4 lago-basic-suite-4-3-engine
>>> 2019-03-25 10:14:53,088::ssh.py::ssh::81::lago.ssh::DEBUG::Command d11b2a64 on lago-basic-suite-4-3-host-0 returned with 0
>>> 2019-03-25 10:14:53,088::ssh.py::ssh::89::lago.ssh::DEBUG::Command d11b2a64 on lago-basic-suite-4-3-host-0 output:
>>> 25 Mar 06:14:53 ntpdate[6646]: adjust time server 192.168.202.2 offset 0.017150 sec
>>>
>>> On 3/25/19 10:13 AM, Eyal Edri wrote:
>>>
>>> Still fails, now on a different component (ovirt-web-ui-extentions):
>>>
>>> https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/
>>>
>>> On Fri, Mar 22, 2019 at 3:59 PM Dan Kenigsberg <dan...@redhat.com> wrote:
>>>
>>>> On Fri, Mar 22, 2019 at 3:21 PM Marcin Sobczyk <msobc...@redhat.com> wrote:
>>>>
>>>>> Dafna,
>>>>>
>>>>> in 'verify_add_hosts' we specifically wait for a single host to be up,
>>>>> with a timeout:
>>>>>
>>>>> 144 up_hosts = hosts_service.list(search='datacenter={} AND status=up'.format(DC_NAME))
>>>>> 145 if len(up_hosts):
>>>>> 146     return True
>>>>>
>>>>> The log files say that it took ~50 secs for one of the hosts to be up
>>>>> (seems reasonable) and no timeout is being reported.
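The wait in 'verify_add_hosts' quoted above is a poll-until-timeout loop around the engine query. A minimal self-contained sketch of that pattern (the `wait_for` helper and `any_host_up` wrapper are illustrative names, not the actual OST helpers; the clock and sleep parameters are injectable only to make the sketch testable):

```python
import time


def wait_for(condition, timeout=600, interval=3,
             clock=time.monotonic, sleep=time.sleep):
    """Poll 'condition' until it returns a truthy value or 'timeout' passes."""
    deadline = clock() + timeout
    while clock() < deadline:
        result = condition()
        if result:
            return result
        sleep(interval)
    raise RuntimeError('condition not met within %s seconds' % timeout)


def any_host_up(hosts_service, dc_name):
    # Mirrors the quoted check: any host in the DC with status=up.
    return bool(hosts_service.list(
        search='datacenter={} AND status=up'.format(dc_name)))
```

With this shape, a success in 'verify_add_hosts' only proves at least one host was up at that instant; it says nothing about which host, which is exactly the gap the thread is circling.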
>>>>> Just after running 'verify_add_hosts', we run
>>>>> 'add_master_storage_domain', which calls the '_hosts_in_dc' function.
>>>>> That function does the exact same check, but it fails:
>>>>>
>>>>> 113 hosts = hosts_service.list(search='datacenter={} AND status=up'.format(dc_name))
>>>>> 114 if hosts:
>>>>> 115     if random_host:
>>>>> 116         return random.choice(hosts)
>>>>>
>>>> I don't think it is relevant to our current failure, but I consider
>>>> random_host=True a bad practice. As if we do not have enough moving
>>>> parts, we are adding intentional randomness. Reproducibility is far more
>>>> important than coverage - particularly for a shared system test like OST.
>>>>
>>>>> 117     else:
>>>>> 118         return sorted(hosts, key=lambda host: host.name)
>>>>> 119 raise RuntimeError('Could not find hosts that are up in DC %s' % dc_name)
>>>>>
>>>>> I'm also not able to reproduce this issue locally on my server. The
>>>>> investigation continues...
>>>>
>>>> I think that it would be fair to take the filtering by host state out
>>>> of Engine and into the test, where we can easily log the current status of
>>>> each host. Then we'd have a better understanding on the next failure.
>>>>
>>>>> On 3/22/19 1:17 PM, Marcin Sobczyk wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> sure, I'm on it - it's weird though, I did run the 4.3 basic suite for
>>>>> this patch manually and everything was ok.
>>>>>
>>>>> On 3/22/19 1:05 PM, Dafna Ron wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> We are failing branch 4.3 for test:
>>>>> 002_bootstrap.add_master_storage_domain
>>>>>
>>>>> It seems that on one of the hosts vdsm is not starting; there is
>>>>> nothing in vdsm.log or in supervdsm.log.
>>>>>
>>>>> CQ identified this patch as the suspected root cause:
>>>>>
>>>>> https://gerrit.ovirt.org/#/c/98748/ - vdsm: client: Add support for
>>>>> flow id
>>>>>
>>>>> Milan, Marcin, can you please have a look?
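Dan's point about random_host=True can be addressed with a fully deterministic variant of the quoted '_hosts_in_dc'. The sketch below is an illustration, not the actual OST code: `single_host` is a hypothetical replacement for the `random_host` flag, and it assumes only that hosts expose a `.name` attribute as in the quoted snippet.

```python
def hosts_in_dc(hosts_service, dc_name, single_host=False):
    """Return the up hosts of a DC in a reproducible order.

    'single_host=True' is a hypothetical replacement for 'random_host=True':
    it returns the first host by name instead of random.choice(hosts), so
    repeated runs always pick the same host.
    """
    hosts = hosts_service.list(
        search='datacenter={} AND status=up'.format(dc_name))
    if not hosts:
        # Same failure mode as the quoted code, raised up front.
        raise RuntimeError(
            'Could not find hosts that are up in DC %s' % dc_name)
    ordered = sorted(hosts, key=lambda host: host.name)
    return ordered[0] if single_host else ordered
```

This keeps the interface of the quoted function while removing the intentional randomness, which makes a failure like this one reproducible on the same host every run.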
>>>>>
>>>>> full logs:
>>>>>
>>>>> http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/326/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-4.3/post-002_bootstrap.py/
>>>>>
>>>>> the only error I can see is about the host not being up (makes sense,
>>>>> as vdsm is not running)
>>>>>
>>>>> Stacktrace
>>>>>
>>>>>   File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
>>>>>     testMethod()
>>>>>   File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
>>>>>     self.test(*self.arg)
>>>>>   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 142, in wrapped_test
>>>>>     test()
>>>>>   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 60, in wrapper
>>>>>     return func(get_test_prefix(), *args, **kwargs)
>>>>>   File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 417, in add_master_storage_domain
>>>>>     add_iscsi_storage_domain(prefix)
>>>>>   File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 561, in add_iscsi_storage_domain
>>>>>     host=_random_host_from_dc(api, DC_NAME),
>>>>>   File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 122, in _random_host_from_dc
>>>>>     return _hosts_in_dc(api, dc_name, True)
>>>>>   File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 119, in _hosts_in_dc
>>>>>     raise RuntimeError('Could not find hosts that are up in DC %s' % dc_name)
>>>>> 'Could not find hosts that are up in DC test-dc
>>>>> -------------------- >> begin captured logging << --------------------
>>>>> lago.ssh: DEBUG: start task:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get ssh client for lago-basic-suite-4-3-engine:
>>>>> lago.ssh: DEBUG: end task:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get ssh client for lago-basic-suite-4-3-engine:
>>>>> lago.ssh: DEBUG: Running c07b5ee2 on lago-basic-suite-4-3-engine: cat /root/multipath.txt
>>>>> lago.ssh: DEBUG: Command c07b5ee2 on lago-basic-suite-4-3-engine returned with 0
>>>>> lago.ssh: DEBUG: Command c07b5ee2 on lago-basic-suite-4-3-engine output:
>>>>> 3600140516f88cafa71243648ea218995
>>>>> 360014053e28f60001764fed9978ec4b3
>>>>> 360014059edc777770114a6484891dcf1
>>>>> 36001405d93d8585a50d43a4ad0bd8d19
>>>>> 36001405e31361631de14bcf87d43e55a
>>>>> -----------
>>>
>>> --
>>> Eyal Edri
>>> Manager, RHV/CNV DevOps
>>> EMEA Virtualization R&D
>>> Red Hat EMEA <https://www.redhat.com/>
>>> phone: +972-9-7692018
>>> irc: eedri (on #tlv #rhev-dev #rhev-integ)
_______________________________________________
Devel mailing list -- devel@ovirt.org
To unsubscribe send an email to devel-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/devel@ovirt.org/message/JPMAXXPSZRK6IJS3IQ5PJZG2HWGEECTH/