Ok, maybe I've found something. I was comparing two sets of log files - one for a basic-suite-4.3 run that succeeded and one for a run that did not.

The deployment procedure for host-1 does not complete in the unsuccessful case. The last log entry is:

2019-03-22 17:54:49,973-04 INFO  
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
(VdsDeploy) [6f2e097b] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing Host 
lago-basic-suite-4-3-host-1. Starting vdsm.

In the successful case we can see:

2019-03-25 06:16:31,789-04 INFO  
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
(VdsDeploy) [3add0cac] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing Host 
lago-basic-suite-4-3-host-1. Starting vdsm.
2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] 
(VdsDeploy) [3add0cac] Got: **%EventEnd STAGE closeup METHOD 
otopi.plugins.ovirt_host_deploy.vdsm.packages.Plugin._start 
(odeploycons.packages.vdsm.started)
2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] 
(VdsDeploy) [3add0cac] Got: **%EventStart STAGE closeup METHOD 
otopi.plugins.ovirt_host_deploy.vmconsole.packages.Plugin._start (None)
2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] 
(VdsDeploy) [3add0cac] Got: ***L:INFO Starting ovirt-vmconsole-host-sshd
2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] 
(VdsDeploy) [3add0cac] nextEvent: Log INFO Starting ovirt-vmconsole-host-sshd
2019-03-25 06:16:38,261-04 INFO  
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
(VdsDeploy) [3add0cac] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing Host 
lago-basic-suite-4-3-host-1. Starting ovirt-vmconsole-host-sshd.
2019-03-25 06:16:39,479-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] 
(VdsDeploy) [3add0cac] Got: **%EventEnd STAGE closeup METHOD 
otopi.plugins.ovirt_host_deploy.vmconsole.packages.Plugin._start (None)
...

It looks like the ovirt-host-deploy plugin did not respond?
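
To dig further, one could pull out just the otopi event lines from the two
engine.log files and diff them. A rough sketch (the paths are whatever
engine.log files you downloaded from the two jobs):

import sys

# Rough sketch: print only the otopi stage events and the install-progress
# messages from engine.log so the failed and successful runs can be diffed
# side by side.
MARKERS = ('%EventStart', '%EventEnd', 'VDS_INSTALL_IN_PROGRESS')

for path in sys.argv[1:]:
    print('==== {} ===='.format(path))
    with open(path) as log:
        for line in log:
            if any(marker in line for marker in MARKERS):
                print(line.rstrip())

Whichever EventStart has no matching EventEnd in the failed run is where the
deploy got stuck.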

The timing of:

https://gerrit.ovirt.org/#/c/98762/

roughly aligns with my patch, which was merged for vdsm-4.3 and was flagged as a suspect:

https://gerrit.ovirt.org/#/c/98748/

Sandro, are you aware of any issues with the ovirt-host-deploy plugin that might have caused this?

Dafna, can we temporarily blacklist the latest version of ovirt-host-deploy to see if it fixes CQ?

Marcin

On 3/25/19 2:45 PM, Dafna Ron wrote:
Yes, I suspect this is the problem.
In the host's yum log I can see the times at which the vdsm packages were installed:

[dron@dron post-002_bootstrap.py]$ less lago-basic-suite-4-3-host-1/_var_log/yum.log | grep vdsm
Mar 25 07:48:49 Installed: vdsm-api-4.30.11-23.git276b602.el7.noarch
Mar 25 07:48:52 Installed: vdsm-yajsonrpc-4.30.11-23.git276b602.el7.noarch
Mar 25 07:49:00 Installed: vdsm-common-4.30.11-23.git276b602.el7.noarch
Mar 25 07:49:00 Installed: vdsm-hook-vhostmd-4.30.11-23.git276b602.el7.noarch
Mar 25 07:49:38 Installed: vdsm-network-4.30.11-23.git276b602.el7.x86_64
Mar 25 07:49:38 Installed: vdsm-python-4.30.11-23.git276b602.el7.noarch
Mar 25 07:49:39 Installed: vdsm-client-4.30.11-23.git276b602.el7.noarch
Mar 25 07:49:39 Installed: vdsm-jsonrpc-4.30.11-23.git276b602.el7.noarch
Mar 25 07:49:39 Installed: vdsm-http-4.30.11-23.git276b602.el7.noarch
Mar 25 07:49:42 Installed: vdsm-hook-openstacknet-4.30.11-23.git276b602.el7.noarch
Mar 25 07:49:57 Installed: vdsm-hook-vmfex-dev-4.30.11-23.git276b602.el7.noarch
Mar 25 07:49:58 Installed: vdsm-4.30.11-23.git276b602.el7.x86_64
Mar 25 07:49:58 Installed: vdsm-hook-ethtool-options-4.30.11-23.git276b602.el7.noarch
Mar 25 07:49:58 Installed: vdsm-hook-fcoe-4.30.11-23.git276b602.el7.noarch

and in the engine log you can see that the engine had already failed the installation:
2019-03-25 07:48:33,920-04 INFO  
[org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) 
[6baa2a84-b322-42fd-855d-b25272a19a23] Running command: AddVdsCommand internal: 
false. Entities affected :  ID: 319fcdf5-5941-46b1-a75c-d335bf500f82 Type: 
ClusterAction group CREATE_HOST with role type ADMIN
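
To make the race explicit, the two timestamps can be pulled straight out of
the artifacts. A rough sketch (paths as in this job; both logs are in the
same -04 local time, so comparing times of day is enough):

# Rough sketch: show that the engine probed vdsm-id before yum had
# installed vdsm-python (which ships /usr/bin/vdsm-tool) on the host.

def yum_install_time(yum_log, package):
    # e.g. "Mar 25 07:49:38 Installed: vdsm-python-4.30.11-23..."
    with open(yum_log) as f:
        for line in f:
            if 'Installed: ' + package in line:
                return line.split()[2]

def engine_probe_time(engine_log):
    # e.g. "2019-03-25 07:48:33,600-04 DEBUG ... Executing: '/usr/bin/vdsm-tool vdsm-id'"
    with open(engine_log) as f:
        for line in f:
            if "Executing: '/usr/bin/vdsm-tool vdsm-id'" in line:
                return line.split()[1].split(',')[0]

print('vdsm-python installed at {}'.format(
    yum_install_time('lago-basic-suite-4-3-host-1/_var_log/yum.log',
                     'vdsm-python')))
print('engine probed vdsm-id at {}'.format(
    engine_probe_time(
        'lago-basic-suite-4-3-engine/_var_log/ovirt-engine/engine.log')))

If the install time is later than the probe time, the 127 from vdsm-id is
fully explained by vdsm-tool simply not being on the host yet.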

On Mon, Mar 25, 2019 at 1:35 PM Dafna Ron <d...@redhat.com> wrote:

    I am also adding Didi.
    Should this not be logged in host-deploy?


    On Mon, Mar 25, 2019 at 1:33 PM Marcin Sobczyk <msobc...@redhat.com> wrote:

        Indeed, in my failed job I can see that 'vdsm-python' is
        installed *after* the engine's attempt to run 'vdsm-tool':

        lago-basic-suite-4-3-host-1/_var_log/yum.log
        403:Mar 22 17:54:12 Installed: 
vdsm-python-4.30.11-23.git276b602.el7.noarch

        lago-basic-suite-4-3-engine/_var_log/ovirt-engine/engine.log
        1580:2019-03-22 17:53:03,793-04 DEBUG 
[org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) 
[d86d9b56-2564-44a9-a8d4-01a4d7f5a529] Executing: '/usr/bin/vdsm-tool vdsm-id'
        1581:2019-03-22 17:53:04,043-04 WARN  
[org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) 
[d86d9b56-2564-44a9-a8d4-01a4d7f5a529] Failed to initiate vdsm-id request on 
host: Command returned failure code 127 during SSH session 
'root@lago-basic-suite-4-3-host-1'


        On 3/25/19 2:22 PM, Dafna Ron wrote:
        And it's only failing on one host as well.


        On Mon, Mar 25, 2019 at 1:17 PM Marcin Sobczyk <msobc...@redhat.com> wrote:

            Hmmm, that would suggest that the 'vdsm-tool' command is not
            found on the host (exit code 127 means 'command not found').
            There was no recent work done around 'vdsm-tool's 'vdsm-id'.

            On 3/25/19 2:08 PM, Dafna Ron wrote:
            Actually, I can see in the engine log that it is trying to
            start the vdsm installation on the host:

            2019-03-25 07:48:33,123-04 DEBUG 
[org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] 
Entered SsoRestApiAuthFilter
            2019-03-25 07:48:33,123-04 DEBUG 
[org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] 
SsoRestApiAuthFilter authenticating with sso
            2019-03-25 07:48:33,123-04 DEBUG 
[org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] 
SsoRestApiAuthFilter authenticating using BEARER header
            2019-03-25 07:48:33,123-04 DEBUG 
[org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] 
SsoRestApiAuthFilter successfully authenticated using BEARER header
            2019-03-25 07:48:33,123-04 DEBUG 
[org.ovirt.engine.core.aaa.filters.SsoRestApiNegotiationFilter] (default 
task-1) [] Entered SsoRestApiNegotiationFilter
            2019-03-25 07:48:33,124-04 DEBUG 
[org.ovirt.engine.core.aaa.filters.SsoRestApiNegotiationFilter] (default 
task-1) [] SsoRestApiNegotiationFilter Not performing Negotiate Auth
            2019-03-25 07:48:33,143-04 DEBUG 
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
 (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored 
procedure. Call string is [{call getclusterforuserbyclustername(?, ?, ?)}]
            2019-03-25 07:48:33,143-04 DEBUG 
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
 (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure 
[GetClusterForUserByClusterName] compiled
            2019-03-25 07:48:33,145-04 DEBUG 
[org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default 
task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: runQuery, params: 
[GetClusterByName, NameQueryParameters:{refresh='false', filtered='false'}], 
timeElapsed: 7ms
            2019-03-25 07:48:33,161-04 DEBUG 
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
 (default task-1) [] Compiled stored procedure. Call string is [{call 
getvmstaticbyname(?)}]
            2019-03-25 07:48:33,161-04 DEBUG 
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
 (default task-1) [] SqlCall for procedure [GetVmStaticByName] compiled
            2019-03-25 07:48:33,206-04 DEBUG 
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
 (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored 
procedure. Call string is [{call getvdsbyname(?)}]
            2019-03-25 07:48:33,206-04 DEBUG 
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
 (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure 
[GetVdsByName] compiled
            2019-03-25 07:48:33,223-04 DEBUG 
[org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default 
task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: getByName, params: 
[lago-basic-suite-4-3-host-1], timeElapsed: 20ms
            2019-03-25 07:48:33,225-04 DEBUG 
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
 (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored 
procedure. Call string is [{call getvdsbyhostname(?)}]
            2019-03-25 07:48:33,225-04 DEBUG 
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
 (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure 
[GetVdsByHostName] compiled
            2019-03-25 07:48:33,252-04 DEBUG 
[org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default 
task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: getAllForHostname, 
params: [lago-basic-suite-4-3-host-1], timeElapsed: 28ms
            2019-03-25 07:48:33,254-04 DEBUG 
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
 (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored 
procedure. Call string is [{call getstorage_poolsbyclusterid(?)}]
            2019-03-25 07:48:33,254-04 DEBUG 
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
 (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure 
[Getstorage_poolsByClusterId] compiled
            2019-03-25 07:48:33,262-04 DEBUG 
[org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) 
[6baa2a84-b322-42fd-855d-b25272a19a23] Connecting 
'root@lago-basic-suite-4-3-host-1'
            2019-03-25 07:48:33,461-04 DEBUG 
[org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) 
[6baa2a84-b322-42fd-855d-b25272a19a23] Connected: 
'root@lago-basic-suite-4-3-host-1'
            2019-03-25 07:48:33,467-04 DEBUG 
[org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) 
[6baa2a84-b322-42fd-855d-b25272a19a23] Authenticating: 
'root@lago-basic-suite-4-3-host-1'
            2019-03-25 07:48:33,599-04 DEBUG 
[org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) 
[6baa2a84-b322-42fd-855d-b25272a19a23] Authenticated: 
'root@lago-basic-suite-4-3-host-1'
            2019-03-25 07:48:33,600-04 DEBUG 
[org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) 
[6baa2a84-b322-42fd-855d-b25272a19a23] Executing: '/usr/bin/vdsm-tool vdsm-id'
            2019-03-25 07:48:33,902-04 WARN  
[org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) 
[6baa2a84-b322-42fd-855d-b25272a19a23] Failed to initiate vdsm-id request on 
host: Command returned failure code 127 during SSH session 
'root@lago-basic-suite-4-3-host-1'
            2019-03-25 07:48:33,920-04 INFO  
[org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) 
[6baa2a84-b322-42fd-855d-b25272a19a23] Running command: AddVdsCommand internal: 
false. Entities affected :  ID: 319fcdf5-5941-46b1-a75c-d335bf500f82 Type: 
ClusterAction group CREATE_HOST with role type ADMIN



            On Mon, Mar 25, 2019 at 12:53 PM Dafna Ron <d...@redhat.com> wrote:

                we have another failing patch with the same test:
                http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/360/

                It's obviously not related to the patch, but something is
                causing these failures randomly. From what I can see in the
                current failed job, you are correct: the engine does not
                even try to deploy host-1.

                I can see host-1 is getting error 127 in lago for ntpd,
                and I can see NetworkManager errors in the host's messages
                log.

                In the host messages log I can see several messages
                that I suspect may cause issues in communication
                between engine and host:

                Mar 25 07:50:09 lago-basic-suite-4-3-host-1
                sasldblistusers2: _sasldb_getkeyhandle has failed
                Mar 25 07:50:10 lago-basic-suite-4-3-host-1
                saslpasswd2: error deleting entry from sasldb:
                BDB0073 DB_NOTFOUND: No matching key/data pair found
                Mar 25 07:50:10 lago-basic-suite-4-3-host-1
                saslpasswd2: error deleting entry from sasldb:
                BDB0073 DB_NOTFOUND: No matching key/data pair found
                Mar 25 07:50:10 lago-basic-suite-4-3-host-1
                saslpasswd2: error deleting entry from sasldb:
                BDB0073 DB_NOTFOUND: No matching key/data pair found


                Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
                bonding: bondscan-UMJa2S is being created...
                Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
                bonding: bondscan-UMJa2S is being deleted...
                Mar 25 07:50:13 lago-basic-suite-4-3-host-1
                NetworkManager[2658]: <info>  [1553514613.7774]
                manager: (bondscan-UMJa2S): new Bond device
                (/org/freedesktop/NetworkManager/Devices/13)
                Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
                (unregistered net_device) (unregistering): Released
                all slaves
                Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
                bonding: bondscan-liwvMR is being created...
                Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
                bondscan-liwvMR: option fail_over_mac: invalid value (3)
                Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
                bondscan-liwvMR: option ad_select: invalid value (3)
                Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
                bondscan-liwvMR: option lacp_rate: mode dependency
                failed, not supported in mode balance-rr(0)
                Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
                bondscan-liwvMR: option arp_validate: invalid value (7)
                Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
                bondscan-liwvMR: option xmit_hash_policy: invalid
                value (5)
                Mar 25 07:50:13 lago-basic-suite-4-3-host-1
                NetworkManager[2658]: <info>  [1553514613.8002]
                manager: (bondscan-liwvMR): new Bond device
                (/org/freedesktop/NetworkManager/Devices/14)
                Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
                bondscan-liwvMR: option primary_reselect: invalid
                value (3)
                Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
                bondscan-liwvMR: option arp_all_targets: invalid
                value (2)
                Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
                bonding: bondscan-liwvMR is being deleted...
                Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
                (unregistered net_device) (unregistering): Released
                all slaves
                Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
                bonding: bondscan-liwvMR is being created...
                Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
                bondscan-liwvMR: option fail_over_mac: invalid value (3)
                Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
                bondscan-liwvMR: option ad_select: invalid value (3)
                Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
                bondscan-liwvMR: option lacp_rate: mode dependency
                failed, not supported in mode active-backup(1)
                Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
                bondscan-liwvMR: option arp_validate: invalid value (7)
                Mar 25 07:50:13 lago-basic-suite-4-3-host-1
                NetworkManager[2658]: <info>  [1553514613.8429]
                manager: (bondscan-liwvMR): new Bond device
                (/org/freedesktop/NetworkManager/Devices/15)
                Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
                bondscan-liwvMR: option xmit_hash_policy: invalid
                value (5)
                Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
                bondscan-liwvMR: option primary_reselect: invalid
                value (3)
                Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
                bondscan-liwvMR: option arp_all_targets: invalid
                value (2)
                Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
                bonding: bondscan-liwvMR is being deleted...
                Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
                (unregistered net_device) (unregistering): Released
                all slaves


                On Mon, Mar 25, 2019 at 11:24 AM Marcin Sobczyk <msobc...@redhat.com> wrote:

                    For the failed job, the engine didn't even try
                    to deploy on host-1:

                    
https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-4.3/post-002_bootstrap.py/lago-basic-suite-4-3-engine/_var_log/ovirt-engine/host-deploy/

                    Martin, do you know what could be the reason for
                    that?

                    I can see in the logs for both the successful and the
                    unsuccessful basic-suite-4.3 runs that there is no
                    'ntpdate' on host-1:

                    2019-03-25 
10:14:46,350::ssh.py::ssh::58::lago.ssh::DEBUG::Running  d0c49b54 on 
lago-basic-suite-4-3-host-1: ntpdate -4 lago-basic-suite-4-3-engine
                    2019-03-25 
10:14:46,383::ssh.py::ssh::81::lago.ssh::DEBUG::Command  d0c49b54 on 
lago-basic-suite-4-3-host-1 returned with 127
                    2019-03-25 
10:14:46,384::ssh.py::ssh::96::lago.ssh::DEBUG::Command  d0c49b54 on 
lago-basic-suite-4-3-host-1  errors:
                      bash: ntpdate: command not found

                    On host-0 everything is ok:

                    2019-03-25 
10:14:46,917::ssh.py::ssh::58::lago.ssh::DEBUG::Running  d11b2a64 on 
lago-basic-suite-4-3-host-0: ntpdate -4 lago-basic-suite-4-3-engine
                    2019-03-25 
10:14:53,088::ssh.py::ssh::81::lago.ssh::DEBUG::Command  d11b2a64 on 
lago-basic-suite-4-3-host-0 returned with 0
                    2019-03-25 
10:14:53,088::ssh.py::ssh::89::lago.ssh::DEBUG::Command  d11b2a64 on 
lago-basic-suite-4-3-host-0 output:
                      25 Mar 06:14:53 ntpdate[6646]: adjust time server 
192.168.202.2 offset 0.017150 sec
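
                    In case it's not only ntpdate that's missing, here is
                    a rough sketch that lists every command that exited
                    with 127 ('command not found') on host-1 according to
                    the lago ssh debug log (the LOG path below is
                    hypothetical; point it at whichever lago log file
                    contains the lago.ssh DEBUG lines from the run):

import re

LOG = 'lago.log'  # hypothetical path - adjust to the actual artifact
HOST = 'lago-basic-suite-4-3-host-1'

running = {}  # command id -> command line
with open(LOG) as f:
    for line in f:
        # "Running  d0c49b54 on lago-basic-suite-4-3-host-1: ntpdate -4 ..."
        m = re.search(r'Running\s+(\S+) on {}: (.*)'.format(HOST), line)
        if m:
            running[m.group(1)] = m.group(2).strip()
            continue
        # "Command  d0c49b54 on lago-basic-suite-4-3-host-1 returned with 127"
        m = re.search(r'Command\s+(\S+) on {} returned with 127'.format(HOST),
                      line)
        if m:
            print(running.get(m.group(1), '<unknown command>'))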

                    On 3/25/19 10:13 AM, Eyal Edri wrote:
                    Still fails, now on a different component
                    (ovirt-web-ui-extentions):

                    
https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/

                    On Fri, Mar 22, 2019 at 3:59 PM Dan Kenigsberg <dan...@redhat.com> wrote:



                        On Fri, Mar 22, 2019 at 3:21 PM Marcin Sobczyk <msobc...@redhat.com> wrote:

                            Dafna,

                            in 'verify_add_hosts' we specifically wait for
                            a single host to be up, with a timeout:

                              144     up_hosts = hosts_service.list(search='datacenter={} AND status=up'.format(DC_NAME))
                              145     if len(up_hosts):
                              146         return True

                            The log files say that it took ~50 secs for
                            one of the hosts to come up (seems reasonable),
                            and no timeout is reported.
                            Just after running 'verify_add_hosts', we run
                            'add_master_storage_domain', which calls the
                            '_hosts_in_dc' function. That function does
                            the exact same check, but it fails:

                              113     hosts = hosts_service.list(search='datacenter={} AND status=up'.format(dc_name))
                              114     if hosts:
                              115         if random_host:
                              116             return random.choice(hosts)

                        I don't think it is relevant to our current
                        failure, but I consider random_host=True a bad
                        practice. As if we do not have enough moving
                        parts, we are adding intentional randomness.
                        Reproducibility is far more important than
                        coverage, particularly for a shared system test
                        like OST.

                              117         else:
                              118             return sorted(hosts, key=lambda host: host.name)
                              119     raise RuntimeError('Could not find hosts that are up in DC %s' % dc_name)

                            I'm also not able to reproduce this
                            issue locally on my server. The
                            investigation continues...


                        I think that it would be fair to take the
                        filtering by host state out of Engine and into the
                        test, where we can easily log the current status
                        of each host. Then we'd have a better
                        understanding on the next failure.
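
                        Something along these lines, perhaps; a rough
                        sketch against the same hosts_service the suite
                        already uses (ovirtsdk4 attribute and enum names
                        assumed):

import ovirtsdk4.types as types

def _hosts_in_dc_logged(api, dc_name):
    # Rough sketch: list *all* hosts in the DC (no status filter in the
    # engine-side search query), log each host's status, and do the "up"
    # filtering in the test itself, so the next failure tells us what
    # state the hosts were actually in.
    hosts_service = api.system_service().hosts_service()
    all_hosts = hosts_service.list(search='datacenter={}'.format(dc_name))
    for host in all_hosts:
        print('host {}: status={}'.format(host.name, host.status))
    up_hosts = [h for h in all_hosts if h.status == types.HostStatus.UP]
    if not up_hosts:
        raise RuntimeError(
            'Could not find hosts that are up in DC %s' % dc_name)
    return sorted(up_hosts, key=lambda host: host.name)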

                            On 3/22/19 1:17 PM, Marcin Sobczyk wrote:

                            Hi,

                            Sure, I'm on it. It's weird though; I did run
                            the 4.3 basic suite for this patch manually
                            and everything was OK.

                            On 3/22/19 1:05 PM, Dafna Ron wrote:
                            Hi,

                            We are failing branch 4.3 for test:
                            002_bootstrap.add_master_storage_domain

                            It seems that on one of the hosts vdsm is not
                            starting; there is nothing in vdsm.log or in
                            supervdsm.log.

                            CQ identified this patch as the
                            suspected root cause:

                            https://gerrit.ovirt.org/#/c/98748/ -
                            vdsm: client: Add support for flow id

                            Milan, Marcin, can you please have a
                            look?

                            full logs:

                            
http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/326/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-4.3/post-002_bootstrap.py/

                            The only error I can see is about the host
                            not being up (which makes sense, as vdsm is
                            not running):


                                  Stacktrace

                               File "/usr/lib64/python2.7/unittest/case.py", 
line 369, in run
                                 testMethod()
                               File 
"/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
                                 self.test(*self.arg)
                               File 
"/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 142, in 
wrapped_test
                                 test()
                               File 
"/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 60, in wrapper
                                 return func(get_test_prefix(), *args, **kwargs)
                               File 
"/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py",
 line 417, in add_master_storage_domain
                                 add_iscsi_storage_domain(prefix)
                               File 
"/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py",
 line 561, in add_iscsi_storage_domain
                                 host=_random_host_from_dc(api, DC_NAME),
                               File 
"/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py",
 line 122, in _random_host_from_dc
                                 return _hosts_in_dc(api, dc_name, True)
                               File 
"/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py",
 line 119, in _hosts_in_dc
                                 raise RuntimeError('Could not find hosts that 
are up in DC %s' % dc_name)
                            'Could not find hosts that are up in DC 
test-dc\n-------------------- >> begin captured logging << 
--------------------\nlago.ssh: DEBUG: starttask:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get  
ssh client for lago-basic-suite-4-3-engine:\nlago.ssh: DEBUG: 
endtask:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get  ssh client for 
lago-basic-suite-4-3-engine:\nlago.ssh: DEBUG: Running c07b5ee2 on 
lago-basic-suite-4-3-engine: cat /root/multipath.txt\nlago.ssh: DEBUG: Command c07b5ee2 on 
lago-basic-suite-4-3-engine returned with 0\nlago.ssh: DEBUG: Command c07b5ee2 on 
lago-basic-suite-4-3-engine output:\n 
3600140516f88cafa71243648ea218995\n360014053e28f60001764fed9978ec4b3\n360014059edc777770114a6484891dcf1\n36001405d93d8585a50d43a4ad0bd8d19\n36001405e31361631de14bcf87d43e55a\n\n-----------




--
                    Eyal Edri
                    MANAGER
                    RHV/CNV DevOps
                    EMEA VIRTUALIZATION R&D
                    Red Hat EMEA
                    phone: +972-9-7692018
                    irc: eedri (on #tlv #rhev-dev #rhev-integ)
