Hmm, that would suggest that the 'vdsm-tool' command is not found on the host. There has been no recent work around 'vdsm-tool's 'vdsm-id'.

On 3/25/19 2:08 PM, Dafna Ron wrote:
Actually, I can see in the engine log that it is trying to start the vdsm installation on the host:

2019-03-25 07:48:33,123-04 DEBUG 
[org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] 
Entered SsoRestApiAuthFilter
2019-03-25 07:48:33,123-04 DEBUG 
[org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] 
SsoRestApiAuthFilter authenticating with sso
2019-03-25 07:48:33,123-04 DEBUG 
[org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] 
SsoRestApiAuthFilter authenticating using BEARER header
2019-03-25 07:48:33,123-04 DEBUG 
[org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] 
SsoRestApiAuthFilter successfully authenticated using BEARER header
2019-03-25 07:48:33,123-04 DEBUG 
[org.ovirt.engine.core.aaa.filters.SsoRestApiNegotiationFilter] (default 
task-1) [] Entered SsoRestApiNegotiationFilter
2019-03-25 07:48:33,124-04 DEBUG 
[org.ovirt.engine.core.aaa.filters.SsoRestApiNegotiationFilter] (default 
task-1) [] SsoRestApiNegotiationFilter Not performing Negotiate Auth
2019-03-25 07:48:33,143-04 DEBUG 
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
 (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored 
procedure. Call string is [{call getclusterforuserbyclustername(?, ?, ?)}]
2019-03-25 07:48:33,143-04 DEBUG 
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
 (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure 
[GetClusterForUserByClusterName] compiled
2019-03-25 07:48:33,145-04 DEBUG 
[org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default 
task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: runQuery, params: 
[GetClusterByName, NameQueryParameters:{refresh='false', filtered='false'}], 
timeElapsed: 7ms
2019-03-25 07:48:33,161-04 DEBUG 
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
 (default task-1) [] Compiled stored procedure. Call string is [{call 
getvmstaticbyname(?)}]
2019-03-25 07:48:33,161-04 DEBUG 
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
 (default task-1) [] SqlCall for procedure [GetVmStaticByName] compiled
2019-03-25 07:48:33,206-04 DEBUG 
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
 (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored 
procedure. Call string is [{call getvdsbyname(?)}]
2019-03-25 07:48:33,206-04 DEBUG 
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
 (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure 
[GetVdsByName] compiled
2019-03-25 07:48:33,223-04 DEBUG 
[org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default 
task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: getByName, params: 
[lago-basic-suite-4-3-host-1], timeElapsed: 20ms
2019-03-25 07:48:33,225-04 DEBUG 
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
 (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored 
procedure. Call string is [{call getvdsbyhostname(?)}]
2019-03-25 07:48:33,225-04 DEBUG 
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
 (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure 
[GetVdsByHostName] compiled
2019-03-25 07:48:33,252-04 DEBUG 
[org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default 
task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: getAllForHostname, 
params: [lago-basic-suite-4-3-host-1], timeElapsed: 28ms
2019-03-25 07:48:33,254-04 DEBUG 
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
 (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored 
procedure. Call string is [{call getstorage_poolsbyclusterid(?)}]
2019-03-25 07:48:33,254-04 DEBUG 
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
 (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure 
[Getstorage_poolsByClusterId] compiled
2019-03-25 07:48:33,262-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] 
(default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Connecting 
'root@lago-basic-suite-4-3-host-1'
2019-03-25 07:48:33,461-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] 
(default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Connected: 
'root@lago-basic-suite-4-3-host-1'
2019-03-25 07:48:33,467-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] 
(default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Authenticating: 
'root@lago-basic-suite-4-3-host-1'
2019-03-25 07:48:33,599-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] 
(default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Authenticated: 
'root@lago-basic-suite-4-3-host-1'
2019-03-25 07:48:33,600-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] 
(default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Executing: 
'/usr/bin/vdsm-tool vdsm-id'
2019-03-25 07:48:33,902-04 WARN  
[org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) 
[6baa2a84-b322-42fd-855d-b25272a19a23] Failed to initiate vdsm-id request on 
host: Command returned failure code 127 during SSH session 
'root@lago-basic-suite-4-3-host-1'
2019-03-25 07:48:33,920-04 INFO  
[org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) 
[6baa2a84-b322-42fd-855d-b25272a19a23] Running command: AddVdsCommand internal: 
false. Entities affected :  ID: 319fcdf5-5941-46b1-a75c-d335bf500f82 Type: 
ClusterAction group CREATE_HOST with role type ADMIN
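For reference, exit status 127 from the SSH session means the remote shell could not find '/usr/bin/vdsm-tool'. A minimal sketch of the same check, runnable from the engine machine (Python 3; the host name and password-less root SSH access are assumptions, this is not the engine code):

    import subprocess

    def get_vdsm_id(host='lago-basic-suite-4-3-host-1'):
        # Same command the engine executes over SSH; 127 from the remote
        # shell means '/usr/bin/vdsm-tool' was not found, i.e. the vdsm
        # package is missing or the binary is not installed on the host.
        result = subprocess.run(
            ['ssh', 'root@{}'.format(host), '/usr/bin/vdsm-tool', 'vdsm-id'],
            capture_output=True, text=True)
        if result.returncode == 127:
            raise RuntimeError('vdsm-tool not found on {}'.format(host))
        result.check_returncode()
        return result.stdout.strip()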



On Mon, Mar 25, 2019 at 12:53 PM Dafna Ron <d...@redhat.com> wrote:

    we have another failing patch with the same test:
    http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/360/

    It's obviously not related to the patch, but something is causing these
    failures randomly.
    From what I can see in the current failed job, you are correct: the
    engine does not even try to deploy host-1.

    I can see host-1 is getting error 127 in lago for ntpdate, and I can see
    NetworkManager errors in the host's messages log.

    In the host messages log I can see several messages that I suspect
    may cause issues in the communication between engine and host:

    Mar 25 07:50:09 lago-basic-suite-4-3-host-1 sasldblistusers2:
    _sasldb_getkeyhandle has failed
    Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error
    deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching
    key/data pair found
    Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error
    deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching
    key/data pair found
    Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error
    deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching
    key/data pair found


    Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding:
    bondscan-UMJa2S is being created...
    Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding:
    bondscan-UMJa2S is being deleted...
    Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]:
    <info> [1553514613.7774] manager: (bondscan-UMJa2S): new Bond
    device (/org/freedesktop/NetworkManager/Devices/13)
    Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered
    net_device) (unregistering): Released all slaves
    Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding:
    bondscan-liwvMR is being created...
    Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
    bondscan-liwvMR: option fail_over_mac: invalid value (3)
    Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
    bondscan-liwvMR: option ad_select: invalid value (3)
    Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
    bondscan-liwvMR: option lacp_rate: mode dependency failed, not
    supported in mode balance-rr(0)
    Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
    bondscan-liwvMR: option arp_validate: invalid value (7)
    Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
    bondscan-liwvMR: option xmit_hash_policy: invalid value (5)
    Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]:
    <info> [1553514613.8002] manager: (bondscan-liwvMR): new Bond
    device (/org/freedesktop/NetworkManager/Devices/14)
    Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
    bondscan-liwvMR: option primary_reselect: invalid value (3)
    Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
    bondscan-liwvMR: option arp_all_targets: invalid value (2)
    Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding:
    bondscan-liwvMR is being deleted...
    Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered
    net_device) (unregistering): Released all slaves
    Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding:
    bondscan-liwvMR is being created...
    Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
    bondscan-liwvMR: option fail_over_mac: invalid value (3)
    Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
    bondscan-liwvMR: option ad_select: invalid value (3)
    Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
    bondscan-liwvMR: option lacp_rate: mode dependency failed, not
    supported in mode active-backup(1)
    Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
    bondscan-liwvMR: option arp_validate: invalid value (7)
    Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]:
    <info> [1553514613.8429] manager: (bondscan-liwvMR): new Bond
    device (/org/freedesktop/NetworkManager/Devices/15)
    Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
    bondscan-liwvMR: option xmit_hash_policy: invalid value (5)
    Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
    bondscan-liwvMR: option primary_reselect: invalid value (3)
    Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
    bondscan-liwvMR: option arp_all_targets: invalid value (2)
    Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding:
    bondscan-liwvMR is being deleted...
    Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered
    net_device) (unregistering): Released all slaves


    On Mon, Mar 25, 2019 at 11:24 AM Marcin Sobczyk
    <msobc...@redhat.com> wrote:

        For the failed job, the engine didn't even try to deploy on
        host-1:

        
https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-4.3/post-002_bootstrap.py/lago-basic-suite-4-3-engine/_var_log/ovirt-engine/host-deploy/

        Martin, do you know what could be the reason for that?

        I can see in the logs for both successful and unsuccessful
        basic-suite-4.3 runs that there is no 'ntpdate' on host-1:

        2019-03-25 10:14:46,350::ssh.py::ssh::58::lago.ssh::DEBUG::Running d0c49b54 on lago-basic-suite-4-3-host-1: ntpdate -4 lago-basic-suite-4-3-engine
        2019-03-25 10:14:46,383::ssh.py::ssh::81::lago.ssh::DEBUG::Command d0c49b54 on lago-basic-suite-4-3-host-1 returned with 127
        2019-03-25 10:14:46,384::ssh.py::ssh::96::lago.ssh::DEBUG::Command d0c49b54 on lago-basic-suite-4-3-host-1 errors:
          bash: ntpdate: command not found

        On host-0 everything is ok:

        2019-03-25 10:14:46,917::ssh.py::ssh::58::lago.ssh::DEBUG::Running d11b2a64 on lago-basic-suite-4-3-host-0: ntpdate -4 lago-basic-suite-4-3-engine
        2019-03-25 10:14:53,088::ssh.py::ssh::81::lago.ssh::DEBUG::Command d11b2a64 on lago-basic-suite-4-3-host-0 returned with 0
        2019-03-25 10:14:53,088::ssh.py::ssh::89::lago.ssh::DEBUG::Command d11b2a64 on lago-basic-suite-4-3-host-0 output:
          25 Mar 06:14:53 ntpdate[6646]: adjust time server 192.168.202.2 offset 0.017150 sec
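        Since exit code 127 here just means "ntpdate: command not found", one
        way to make the sync step tolerant is to probe for the binary first.
        A rough sketch (the 'host.ssh' helper returning an object with a
        '.code' attribute is an assumption modelled on the lago.ssh lines
        above, not the actual lago API):

            def sync_time(host, engine_name='lago-basic-suite-4-3-engine'):
                # Probe for ntpdate first; 'command -v' exits non-zero when
                # the binary is missing on the host.
                if host.ssh(['command', '-v', 'ntpdate']).code != 0:
                    return False  # not installed; leave the clock to chronyd
                # Same invocation the suite uses, run only when it can work.
                return host.ssh(['ntpdate', '-4', engine_name]).code == 0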

        On 3/25/19 10:13 AM, Eyal Edri wrote:
        Still fails, now on a different component (ovirt-web-ui-extentions).

        https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/

        On Fri, Mar 22, 2019 at 3:59 PM Dan Kenigsberg
        <dan...@redhat.com> wrote:



            On Fri, Mar 22, 2019 at 3:21 PM Marcin Sobczyk
            <msobc...@redhat.com> wrote:

                Dafna,

                in 'verify_add_hosts' we specifically wait, with a timeout,
                for a single host to be up:

                  144     up_hosts = hosts_service.list(search='datacenter={} AND status=up'.format(DC_NAME))
                  145     if len(up_hosts):
                  146         return True
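                The wait itself is a plain poll-with-timeout around that
                check - a generic sketch, not the actual ovirtlago testlib
                helper:

                    import time

                    def wait_for(condition, timeout=600, interval=3):
                        # Re-evaluate 'condition' until it returns truthy
                        # or the timeout expires.
                        deadline = time.time() + timeout
                        while time.time() < deadline:
                            if condition():
                                return True
                            time.sleep(interval)
                        raise AssertionError(
                            'condition not met within {}s'.format(timeout))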

                The log files say that it took ~50 secs for one of the
                hosts to come up (which seems reasonable) and no timeout
                is reported.
                Right after 'verify_add_hosts' we run
                'add_master_storage_domain', which calls the '_hosts_in_dc'
                function.
                That function does the exact same check, but it fails:

                  113     hosts = hosts_service.list(search='datacenter={} AND status=up'.format(dc_name))
                  114     if hosts:
                  115         if random_host:
                  116             return random.choice(hosts)

            I don't think it is relevant to our current failure, but I
            consider random_host=True a bad practice. As if we did not have
            enough moving parts, we are adding intentional randomness.
            Reproducibility is far more important than coverage -
            particularly for a shared system test like OST.

                  117         else:
                  118             return sorted(hosts, key=lambda host: host.name)
                  119     raise RuntimeError('Could not find hosts that are up in DC %s' % dc_name)

                I'm also not able to reproduce this issue locally on
                my server. The investigation continues...


            I think it would be fair to take the filtering by host state out
            of Engine and into the test, where we can easily log the current
            status of each host. Then we'd have a better understanding on
            the next failure.
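            Something along those lines, perhaps - a rough sketch with
            assumed names, based on the oVirt Python SDK v4 that the suite
            already uses; it filters and sorts in the test instead of using
            the engine-side 'status=up' search, so every host's status gets
            logged, and it picks hosts deterministically instead of with
            random.choice:

                import logging
                from ovirtsdk4 import types

                LOGGER = logging.getLogger(__name__)

                def _up_hosts_in_dc(api, dc_name):
                    hosts_service = api.system_service().hosts_service()
                    # Fetch all hosts of the DC, not only the 'up' ones, so
                    # we can log what state each host is in when this fails.
                    all_hosts = hosts_service.list(
                        search='datacenter={}'.format(dc_name))
                    for host in all_hosts:
                        LOGGER.debug('host %s is %s', host.name, host.status)
                    up = [h for h in all_hosts
                          if h.status == types.HostStatus.UP]
                    if not up:
                        raise RuntimeError(
                            'Could not find hosts that are up in DC %s' % dc_name)
                    # Deterministic order instead of random.choice,
                    # for reproducibility.
                    return sorted(up, key=lambda h: h.name)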

                On 3/22/19 1:17 PM, Marcin Sobczyk wrote:

                Hi,

                sure, I'm on it - it's weird though: I ran the 4.3 basic
                suite for this patch manually and everything was OK.

                On 3/22/19 1:05 PM, Dafna Ron wrote:
                Hi,

                We are failing branch 4.3 for test:
                002_bootstrap.add_master_storage_domain

                It seems that on one of the hosts vdsm is not starting;
                there is nothing in vdsm.log or in supervdsm.log.

                CQ identified this patch as the suspected root cause:

                https://gerrit.ovirt.org/#/c/98748/ - vdsm: client:
                Add support for flow id

                Milan, Marcin, can you please have a look?

                full logs:

                
http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/326/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-4.3/post-002_bootstrap.py/

                the only error I can see is about the host not being up
                (which makes sense, as vdsm is not running)


                      Stacktrace

                   File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
                     testMethod()
                   File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
                     self.test(*self.arg)
                   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 142, in wrapped_test
                     test()
                   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 60, in wrapper
                     return func(get_test_prefix(), *args, **kwargs)
                   File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 417, in add_master_storage_domain
                     add_iscsi_storage_domain(prefix)
                   File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 561, in add_iscsi_storage_domain
                     host=_random_host_from_dc(api, DC_NAME),
                   File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 122, in _random_host_from_dc
                     return _hosts_in_dc(api, dc_name, True)
                   File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 119, in _hosts_in_dc
                     raise RuntimeError('Could not find hosts that are up in DC %s' % dc_name)
                'Could not find hosts that are up in DC test-dc\n-------------------- 
>> begin captured logging << --------------------\nlago.ssh: DEBUG: 
starttask:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get  ssh client for 
lago-basic-suite-4-3-engine:\nlago.ssh: DEBUG: 
endtask:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get  ssh client for 
lago-basic-suite-4-3-engine:\nlago.ssh: DEBUG: Running c07b5ee2 on 
lago-basic-suite-4-3-engine: cat /root/multipath.txt\nlago.ssh: DEBUG: Command c07b5ee2 on 
lago-basic-suite-4-3-engine returned with 0\nlago.ssh: DEBUG: Command c07b5ee2 on 
lago-basic-suite-4-3-engine output:\n 
3600140516f88cafa71243648ea218995\n360014053e28f60001764fed9978ec4b3\n360014059edc777770114a6484891dcf1\n36001405d93d8585a50d43a4ad0bd8d19\n36001405e31361631de14bcf87d43e55a\n\n-----------




--
        Eyal Edri


        MANAGER

        RHV/CNV DevOps

        EMEA VIRTUALIZATION R&D


        Red Hat EMEA <https://www.redhat.com/>


        phone: +972-9-7692018
        irc: eedri (on #tlv #rhev-dev #rhev-integ)

