Thank you Marcin :) I am also adding Didi to see if he knows of any
changes.
Anton, who from the Engine/API side can help take it from here?


On Fri, Mar 29, 2019 at 9:37 AM Marcin Sobczyk <msobc...@redhat.com> wrote:

> Hi,
>
> so the thing boils down to this:
>
>
> https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/471/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-4.3/post-002_bootstrap.py/lago-basic-suite-4-3-engine/_etc_httpd/logs/ssl_access_log/*view*/
> This is the 'httpd' log file from the most recent job that failed
> 'add_master_storage_domain'.
> At the end of the log file, we can see:
>
> 192.168.202.1 - - [28/Mar/2019:11:05:38 -0400] "GET 
> /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dinstalling+or+status%3Dinitializing
>  HTTP/1.1" 200 11845
>
> 192.168.202.1 - - [28/Mar/2019:11:05:41 -0400] "GET 
> /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dinstalling+or+status%3Dinitializing
>  HTTP/1.1" 200 11847
>
> 192.168.202.1 - - [28/Mar/2019:11:05:44 -0400] "GET 
> /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dinstalling+or+status%3Dinitializing
>  HTTP/1.1" 200 11847
>
> 192.168.202.1 - - [28/Mar/2019:11:05:47 -0400] "GET 
> /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dinstalling+or+status%3Dinitializing
>  HTTP/1.1" 200 5959
>
> 192.168.202.1 - - [28/Mar/2019:11:05:47 -0400] "GET 
> /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dup HTTP/1.1" 
> 200 7826
>
> 192.168.202.1 - - [28/Mar/2019:11:05:47 -0400] "GET /ovirt-engine/api 
> HTTP/1.1" 200 5112
>
> 192.168.202.1 - - [28/Mar/2019:11:05:48 -0400] "GET 
> /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dup HTTP/1.1" 
> 200 65
>
> 192.168.202.2 - - [28/Mar/2019:11:05:50 -0400] "POST 
> /ovirt-engine/sso/oauth/token-info HTTP/1.1" 200 223
>
> The first batch of 'status=installing or status=initializing' requests and
> the first 'status=up' request come from 'verify_add_hosts'.
> The second 'status=up' request is the one that comes from the '_hosts_in_dc'
> call in 'add_master_storage_domain'.
> They are executed only 1 second apart, yet the results are not the same -
> 7826 bytes vs 65 bytes.
> I went through every line of 'engine.log' from that 1 sec interval and
> couldn't find anything that would indicate the host is going down.
> I asked Ravi for help yesterday and found out that these requests are backed
> by queries to our postgres db, and that the results are somehow cached.
> I see 2 options here - either something really changed the db contents
> during that 1 second, or there's a bug in our API/cache that returns the
> wrong info.
> I think we can eliminate the first option by adding collection of the
> postgres db to our OST suites and looking at its history after a failure
> (see the sketch below).
> I don't really have a solution for the second possibility - any help from
> someone who knows the engine's code better would be appreciated.
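>
> A minimal sketch of what that collection step could look like, assuming
> lago's 'vm.ssh()' helper and the default 'engine' database name (both
> would need to be verified against the suite):
>
> def collect_engine_db(prefix):
>     # Dump the engine database on the engine VM, so that its contents
>     # at failure time can be inspected later.
>     engine = prefix.virt_env.engine_vm()
>     result = engine.ssh([
>         'sudo', '-u', 'postgres',
>         'pg_dump', 'engine', '-f', '/tmp/engine-db.dump',
>     ])
>     assert result.code == 0
>
> Copying '/tmp/engine-db.dump' into the suite's test_logs after a failed run
> would then give us the db contents to inspect.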
>
>
> On 3/26/19 5:37 PM, Dafna Ron wrote:
>
> We never remove any packages from the 'tested' repo :)
> I actually suspect a problem caused by something that is
> shared by all projects, like the API.
> We currently have 3 different tests that are randomly failing because of
> engine/host communication, and I am starting to think they are all
> connected.
>
>
> On Tue, Mar 26, 2019 at 2:03 PM Eyal Edri <ee...@redhat.com> wrote:
>
>>
>>
>> On Tue, Mar 26, 2019 at 3:56 PM Sandro Bonazzola <sbona...@redhat.com>
>> wrote:
>>
>>>
>>>
>>> Il giorno mar 26 mar 2019 alle ore 14:42 Marcin Sobczyk <
>>> msobc...@redhat.com> ha scritto:
>>>
>>>> Ok, maybe I've found something. I was comparing two sets of log files -
>>>> one for a basic-suite-4.3 run that succeeded and one for a run that did not.
>>>>
>>>> The deployment procedure for host-1 did not complete in the unsuccessful
>>>> case. The last log entry is:
>>>>
>>>> 2019-03-22 17:54:49,973-04 INFO  
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
>>>> (VdsDeploy) [6f2e097b] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing 
>>>> Host lago-basic-suite-4-3-host-1. Starting vdsm.
>>>>
>>>> In the successful case we can see:
>>>>
>>>> 2019-03-25 06:16:31,789-04 INFO  
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
>>>> (VdsDeploy) [3add0cac] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing 
>>>> Host lago-basic-suite-4-3-host-1. Starting vdsm.
>>>> 2019-03-25 06:16:38,257-04 DEBUG 
>>>> [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: 
>>>> **%EventEnd STAGE closeup METHOD 
>>>> otopi.plugins.ovirt_host_deploy.vdsm.packages.Plugin._start 
>>>> (odeploycons.packages.vdsm.started)
>>>> 2019-03-25 06:16:38,257-04 DEBUG 
>>>> [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: 
>>>> **%EventStart STAGE closeup METHOD 
>>>> otopi.plugins.ovirt_host_deploy.vmconsole.packages.Plugin._start (None)
>>>> 2019-03-25 06:16:38,257-04 DEBUG 
>>>> [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: 
>>>> ***L:INFO Starting ovirt-vmconsole-host-sshd
>>>> 2019-03-25 06:16:38,257-04 DEBUG 
>>>> [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] 
>>>> nextEvent: Log INFO Starting ovirt-vmconsole-host-sshd
>>>> 2019-03-25 06:16:38,261-04 INFO  
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
>>>> (VdsDeploy) [3add0cac] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing 
>>>> Host lago-basic-suite-4-3-host-1. Starting ovirt-vmconsole-host-sshd.
>>>> 2019-03-25 06:16:39,479-04 DEBUG 
>>>> [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: 
>>>> **%EventEnd STAGE closeup METHOD 
>>>> otopi.plugins.ovirt_host_deploy.vmconsole.packages.Plugin._start (None)
>>>> ...
>>>>
>>>>
>>>> It looks like the ovirt-host-deploy plugin did not respond?
>>>>
>>>> The timing of:
>>>>
>>>> https://gerrit.ovirt.org/#/c/98762/
>>>>
>>>> kinda aligns with my patch, which was merged for vdsm-4.3 and was a
>>>> suspect:
>>>>
>>>> https://gerrit.ovirt.org/#/c/98748/
>>>>
>>>> Sandro, are you aware of any issues with the ovirt-host-deploy plugin
>>>> that might have caused this?
>>>>
>>>
>>> I don't really see how a patch touching the automation configuration for
>>> STDCI v2 can be the cause of this, but please go ahead and blacklist the
>>> last build.
>>> It should differ from the previous one only by the timestamp of the rpm :-)
>>>
>>
> Ehhhh, yeah, I'm so tired of looking at log files I'm starting to see
> patterns everywhere...
>
>
>
>> If you want to verify that a specific version of an RPM works, you can use
>> the manual job for it.
>>
>>
>>>
>>>
>>>> Dafna, can we temporarily blacklist the latest version of
>>>> ovirt-host-deploy to see if it fixes CQ?
>>>>
>>>> Marcin
>>>> On 3/25/19 2:45 PM, Dafna Ron wrote:
>>>>
>>>> Yes, I suspect this is the problem.
>>>> In the host's yum log I can see the times at which the vdsm packages were
>>>> installed:
>>>> [dron@dron post-002_bootstrap.py]$ less
>>>> lago-basic-suite-4-3-host-1/_var_log/yum.log |grep vdsm
>>>> Mar 25 07:48:49 Installed: vdsm-api-4.30.11-23.git276b602.el7.noarch
>>>> Mar 25 07:48:52 Installed:
>>>> vdsm-yajsonrpc-4.30.11-23.git276b602.el7.noarch
>>>> Mar 25 07:49:00 Installed: vdsm-common-4.30.11-23.git276b602.el7.noarch
>>>> Mar 25 07:49:00 Installed:
>>>> vdsm-hook-vhostmd-4.30.11-23.git276b602.el7.noarch
>>>> Mar 25 07:49:38 Installed: vdsm-network-4.30.11-23.git276b602.el7.x86_64
>>>> Mar 25 07:49:38 Installed: vdsm-python-4.30.11-23.git276b602.el7.noarch
>>>> Mar 25 07:49:39 Installed: vdsm-client-4.30.11-23.git276b602.el7.noarch
>>>> Mar 25 07:49:39 Installed: vdsm-jsonrpc-4.30.11-23.git276b602.el7.noarch
>>>> Mar 25 07:49:39 Installed: vdsm-http-4.30.11-23.git276b602.el7.noarch
>>>> Mar 25 07:49:42 Installed:
>>>> vdsm-hook-openstacknet-4.30.11-23.git276b602.el7.noarch
>>>> Mar 25 07:49:57 Installed:
>>>> vdsm-hook-vmfex-dev-4.30.11-23.git276b602.el7.noarch
>>>> Mar 25 07:49:58 Installed: vdsm-4.30.11-23.git276b602.el7.x86_64
>>>> Mar 25 07:49:58 Installed:
>>>> vdsm-hook-ethtool-options-4.30.11-23.git276b602.el7.noarch
>>>> Mar 25 07:49:58 Installed:
>>>> vdsm-hook-fcoe-4.30.11-23.git276b602.el7.noarch
>>>>
>>>> And in the engine log you can see that the engine had already failed the
>>>> installation:
>>>>
>>>> 2019-03-25 07:48:33,920-04 INFO  
>>>> [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) 
>>>> [6baa2a84-b322-42fd-855d-b25272a19a23] Running command: AddVdsCommand 
>>>> internal: false. Entities affected :  ID: 
>>>> 319fcdf5-5941-46b1-a75c-d335bf500f82 Type: ClusterAction group CREATE_HOST 
>>>> with role type ADMIN
>>>>
>>>>
>>>> On Mon, Mar 25, 2019 at 1:35 PM Dafna Ron <d...@redhat.com> wrote:
>>>>
>>>>> I am also adding Didi.
>>>>> Should this not be logged in host-deploy?
>>>>>
>>>>>
>>>>> On Mon, Mar 25, 2019 at 1:33 PM Marcin Sobczyk <msobc...@redhat.com>
>>>>> wrote:
>>>>>
>>>>>> Indeed, in my failed job I can see that 'vdsm-python' is installed
>>>>>> *after* the engine's attempt to run 'vdsm-tool':
>>>>>>
>>>>>> lago-basic-suite-4-3-host-1/_var_log/yum.log
>>>>>> 403:Mar 22 17:54:12 Installed: 
>>>>>> vdsm-python-4.30.11-23.git276b602.el7.noarch
>>>>>>
>>>>>> lago-basic-suite-4-3-engine/_var_log/ovirt-engine/engine.log
>>>>>> 1580:2019-03-22 17:53:03,793-04 DEBUG 
>>>>>> [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) 
>>>>>> [d86d9b56-2564-44a9-a8d4-01a4d7f5a529] Executing: '/usr/bin/vdsm-tool 
>>>>>> vdsm-id'
>>>>>> 1581:2019-03-22 17:53:04,043-04 WARN  
>>>>>> [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) 
>>>>>> [d86d9b56-2564-44a9-a8d4-01a4d7f5a529] Failed to initiate vdsm-id 
>>>>>> request on host: Command returned failure code 127 during SSH session 
>>>>>> 'root@lago-basic-suite-4-3-host-1'
>>>>>>
>>>>>>
>>>>>> On 3/25/19 2:22 PM, Dafna Ron wrote:
>>>>>>
>>>>>> And it's only failing on one host as well.
>>>>>>
>>>>>>
>>>>>> On Mon, Mar 25, 2019 at 1:17 PM Marcin Sobczyk <msobc...@redhat.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hmmm, that would suggest that the 'vdsm-tool' command is not found on
>>>>>>> the host. There has been no recent work around 'vdsm-tool's 'vdsm-id'.
>>>>>>> On 3/25/19 2:08 PM, Dafna Ron wrote:
>>>>>>>
>>>>>>> Actually, I can see in the engine log that it is trying to start the
>>>>>>> vdsm installation on the host:
>>>>>>>
>>>>>>> 2019-03-25 07:48:33,123-04 DEBUG 
>>>>>>> [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default 
>>>>>>> task-1) [] Entered SsoRestApiAuthFilter
>>>>>>> 2019-03-25 07:48:33,123-04 DEBUG 
>>>>>>> [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default 
>>>>>>> task-1) [] SsoRestApiAuthFilter authenticating with sso
>>>>>>> 2019-03-25 07:48:33,123-04 DEBUG 
>>>>>>> [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default 
>>>>>>> task-1) [] SsoRestApiAuthFilter authenticating using BEARER header
>>>>>>> 2019-03-25 07:48:33,123-04 DEBUG 
>>>>>>> [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default 
>>>>>>> task-1) [] SsoRestApiAuthFilter successfully authenticated using BEARER 
>>>>>>> header
>>>>>>> 2019-03-25 07:48:33,123-04 DEBUG 
>>>>>>> [org.ovirt.engine.core.aaa.filters.SsoRestApiNegotiationFilter] 
>>>>>>> (default task-1) [] Entered SsoRestApiNegotiationFilter
>>>>>>> 2019-03-25 07:48:33,124-04 DEBUG 
>>>>>>> [org.ovirt.engine.core.aaa.filters.SsoRestApiNegotiationFilter] 
>>>>>>> (default task-1) [] SsoRestApiNegotiationFilter Not performing 
>>>>>>> Negotiate Auth
>>>>>>> 2019-03-25 07:48:33,143-04 DEBUG 
>>>>>>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>>>>>>>  (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled 
>>>>>>> stored procedure. Call string is [{call 
>>>>>>> getclusterforuserbyclustername(?, ?, ?)}]
>>>>>>> 2019-03-25 07:48:33,143-04 DEBUG 
>>>>>>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>>>>>>>  (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for 
>>>>>>> procedure [GetClusterForUserByClusterName] compiled
>>>>>>> 2019-03-25 07:48:33,145-04 DEBUG 
>>>>>>> [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] 
>>>>>>> (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: 
>>>>>>> runQuery, params: [GetClusterByName, 
>>>>>>> NameQueryParameters:{refresh='false', filtered='false'}], timeElapsed: 
>>>>>>> 7ms
>>>>>>> 2019-03-25 07:48:33,161-04 DEBUG 
>>>>>>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>>>>>>>  (default task-1) [] Compiled stored procedure. Call string is [{call 
>>>>>>> getvmstaticbyname(?)}]
>>>>>>> 2019-03-25 07:48:33,161-04 DEBUG 
>>>>>>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>>>>>>>  (default task-1) [] SqlCall for procedure [GetVmStaticByName] compiled
>>>>>>> 2019-03-25 07:48:33,206-04 DEBUG 
>>>>>>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>>>>>>>  (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled 
>>>>>>> stored procedure. Call string is [{call getvdsbyname(?)}]
>>>>>>> 2019-03-25 07:48:33,206-04 DEBUG 
>>>>>>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>>>>>>>  (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for 
>>>>>>> procedure [GetVdsByName] compiled
>>>>>>> 2019-03-25 07:48:33,223-04 DEBUG 
>>>>>>> [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] 
>>>>>>> (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: 
>>>>>>> getByName, params: [lago-basic-suite-4-3-host-1], timeElapsed: 20ms
>>>>>>> 2019-03-25 07:48:33,225-04 DEBUG 
>>>>>>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>>>>>>>  (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled 
>>>>>>> stored procedure. Call string is [{call getvdsbyhostname(?)}]
>>>>>>> 2019-03-25 07:48:33,225-04 DEBUG 
>>>>>>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>>>>>>>  (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for 
>>>>>>> procedure [GetVdsByHostName] compiled
>>>>>>> 2019-03-25 07:48:33,252-04 DEBUG 
>>>>>>> [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] 
>>>>>>> (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: 
>>>>>>> getAllForHostname, params: [lago-basic-suite-4-3-host-1], timeElapsed: 
>>>>>>> 28ms
>>>>>>> 2019-03-25 07:48:33,254-04 DEBUG 
>>>>>>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>>>>>>>  (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled 
>>>>>>> stored procedure. Call string is [{call getstorage_poolsbyclusterid(?)}]
>>>>>>> 2019-03-25 07:48:33,254-04 DEBUG 
>>>>>>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>>>>>>>  (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for 
>>>>>>> procedure [Getstorage_poolsByClusterId] compiled
>>>>>>> 2019-03-25 07:48:33,262-04 DEBUG 
>>>>>>> [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) 
>>>>>>> [6baa2a84-b322-42fd-855d-b25272a19a23] Connecting 
>>>>>>> 'root@lago-basic-suite-4-3-host-1'
>>>>>>> 2019-03-25 07:48:33,461-04 DEBUG 
>>>>>>> [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) 
>>>>>>> [6baa2a84-b322-42fd-855d-b25272a19a23] Connected: 
>>>>>>> 'root@lago-basic-suite-4-3-host-1'
>>>>>>> 2019-03-25 07:48:33,467-04 DEBUG 
>>>>>>> [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) 
>>>>>>> [6baa2a84-b322-42fd-855d-b25272a19a23] Authenticating: 
>>>>>>> 'root@lago-basic-suite-4-3-host-1'
>>>>>>> 2019-03-25 07:48:33,599-04 DEBUG 
>>>>>>> [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) 
>>>>>>> [6baa2a84-b322-42fd-855d-b25272a19a23] Authenticated: 
>>>>>>> 'root@lago-basic-suite-4-3-host-1'
>>>>>>> 2019-03-25 07:48:33,600-04 DEBUG 
>>>>>>> [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) 
>>>>>>> [6baa2a84-b322-42fd-855d-b25272a19a23] Executing: '/usr/bin/vdsm-tool 
>>>>>>> vdsm-id'
>>>>>>> 2019-03-25 07:48:33,902-04 WARN  
>>>>>>> [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) 
>>>>>>> [6baa2a84-b322-42fd-855d-b25272a19a23] Failed to initiate vdsm-id 
>>>>>>> request on host: Command returned failure code 127 during SSH session 
>>>>>>> 'root@lago-basic-suite-4-3-host-1'
>>>>>>> 2019-03-25 07:48:33,920-04 INFO  
>>>>>>> [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) 
>>>>>>> [6baa2a84-b322-42fd-855d-b25272a19a23] Running command: AddVdsCommand 
>>>>>>> internal: false. Entities affected :  ID: 
>>>>>>> 319fcdf5-5941-46b1-a75c-d335bf500f82 Type: ClusterAction group 
>>>>>>> CREATE_HOST with role type ADMIN
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Mar 25, 2019 at 12:53 PM Dafna Ron <d...@redhat.com> wrote:
>>>>>>>
>>>>>>>> We have another failing patch with the same test:
>>>>>>>> http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/360/
>>>>>>>>
>>>>>>>> It's obviously not related to the patch, but there is something that
>>>>>>>> is causing these failures randomly.
>>>>>>>> From what I can see in the current failed job, you are correct about
>>>>>>>> host-1, and the engine does not even try to deploy it.
>>>>>>>>
>>>>>>>> I can see host-1 is getting error 127 in lago for ntpdate, and I can
>>>>>>>> see NetworkManager errors in the host's messages log.
>>>>>>>>
>>>>>>>> In the host's messages log I can see several entries that I suspect
>>>>>>>> may cause communication issues between the engine and the host:
>>>>>>>>
>>>>>>>> Mar 25 07:50:09 lago-basic-suite-4-3-host-1 sasldblistusers2:
>>>>>>>> _sasldb_getkeyhandle has failed
>>>>>>>> Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error
>>>>>>>> deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data 
>>>>>>>> pair
>>>>>>>> found
>>>>>>>> Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error
>>>>>>>> deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data 
>>>>>>>> pair
>>>>>>>> found
>>>>>>>> Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error
>>>>>>>> deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data 
>>>>>>>> pair
>>>>>>>> found
>>>>>>>>
>>>>>>>>
>>>>>>>> ces/12)
>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding:
>>>>>>>> bondscan-UMJa2S is being created...
>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding:
>>>>>>>> bondscan-UMJa2S is being deleted...
>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]:
>>>>>>>> <info>  [1553514613.7774] manager: (bondscan-UMJa2S): new Bond device
>>>>>>>> (/org/freedesktop/NetworkManager/Devices/13)
>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered
>>>>>>>> net_device) (unregistering): Released all slaves
>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding:
>>>>>>>> bondscan-liwvMR is being created...
>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>>>>>>>> bondscan-liwvMR: option fail_over_mac: invalid value (3)
>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>>>>>>>> bondscan-liwvMR: option ad_select: invalid value (3)
>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>>>>>>>> bondscan-liwvMR: option lacp_rate: mode dependency failed, not 
>>>>>>>> supported in
>>>>>>>> mode balance-rr(0)
>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>>>>>>>> bondscan-liwvMR: option arp_validate: invalid value (7)
>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>>>>>>>> bondscan-liwvMR: option xmit_hash_policy: invalid value (5)
>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]:
>>>>>>>> <info>  [1553514613.8002] manager: (bondscan-liwvMR): new Bond device
>>>>>>>> (/org/freedesktop/NetworkManager/Devices/14)
>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>>>>>>>> bondscan-liwvMR: option primary_reselect: invalid value (3)
>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>>>>>>>> bondscan-liwvMR: option arp_all_targets: invalid value (2)
>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding:
>>>>>>>> bondscan-liwvMR is being deleted...
>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered
>>>>>>>> net_device) (unregistering): Released all slaves
>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding:
>>>>>>>> bondscan-liwvMR is being created...
>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>>>>>>>> bondscan-liwvMR: option fail_over_mac: invalid value (3)
>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>>>>>>>> bondscan-liwvMR: option ad_select: invalid value (3)
>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>>>>>>>> bondscan-liwvMR: option lacp_rate: mode dependency failed, not 
>>>>>>>> supported in
>>>>>>>> mode active-backup(1)
>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>>>>>>>> bondscan-liwvMR: option arp_validate: invalid value (7)
>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]:
>>>>>>>> <info>  [1553514613.8429] manager: (bondscan-liwvMR): new Bond device
>>>>>>>> (/org/freedesktop/NetworkManager/Devices/15)
>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>>>>>>>> bondscan-liwvMR: option xmit_hash_policy: invalid value (5)
>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>>>>>>>> bondscan-liwvMR: option primary_reselect: invalid value (3)
>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>>>>>>>> bondscan-liwvMR: option arp_all_targets: invalid value (2)
>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding:
>>>>>>>> bondscan-liwvMR is being deleted...
>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered
>>>>>>>> net_device) (unregistering): Released all slaves
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Mar 25, 2019 at 11:24 AM Marcin Sobczyk <
>>>>>>>> msobc...@redhat.com> wrote:
>>>>>>>>
>>>>>>>>> For the failed job, the engine didn't even try to deploy on host-1:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-4.3/post-002_bootstrap.py/lago-basic-suite-4-3-engine/_var_log/ovirt-engine/host-deploy/
>>>>>>>>>
>>>>>>>>> Martin, do you know what could be the reason for that?
>>>>>>>>>
>>>>>>>>> I can see in the logs for both successful and unsuccessful
>>>>>>>>> basic-suite-4.3 runs that there is no 'ntpdate' on host-1:
>>>>>>>>>
>>>>>>>>> 2019-03-25 10:14:46,350::ssh.py::ssh::58::lago.ssh::DEBUG::Running 
>>>>>>>>> d0c49b54 on lago-basic-suite-4-3-host-1: ntpdate -4 
>>>>>>>>> lago-basic-suite-4-3-engine
>>>>>>>>> 2019-03-25 10:14:46,383::ssh.py::ssh::81::lago.ssh::DEBUG::Command 
>>>>>>>>> d0c49b54 on lago-basic-suite-4-3-host-1 returned with 127
>>>>>>>>> 2019-03-25 10:14:46,384::ssh.py::ssh::96::lago.ssh::DEBUG::Command 
>>>>>>>>> d0c49b54 on lago-basic-suite-4-3-host-1  errors:
>>>>>>>>>  bash: ntpdate: command not found
>>>>>>>>>
>>>>>>>>> On host-0 everything is ok:
>>>>>>>>>
>>>>>>>>> 2019-03-25 10:14:46,917::ssh.py::ssh::58::lago.ssh::DEBUG::Running 
>>>>>>>>> d11b2a64 on lago-basic-suite-4-3-host-0: ntpdate -4 
>>>>>>>>> lago-basic-suite-4-3-engine
>>>>>>>>> 2019-03-25 10:14:53,088::ssh.py::ssh::81::lago.ssh::DEBUG::Command 
>>>>>>>>> d11b2a64 on lago-basic-suite-4-3-host-0 returned with 0
>>>>>>>>> 2019-03-25 10:14:53,088::ssh.py::ssh::89::lago.ssh::DEBUG::Command 
>>>>>>>>> d11b2a64 on lago-basic-suite-4-3-host-0 output:
>>>>>>>>>  25 Mar 06:14:53 ntpdate[6646]: adjust time server 192.168.202.2 
>>>>>>>>> offset 0.017150 sec
>>>>>>>>>
>>>>>>>>> On 3/25/19 10:13 AM, Eyal Edri wrote:
>>>>>>>>>
>>>>>>>>> Still fails, now on a different component
>>>>>>>>> (ovirt-web-ui-extentions).
>>>>>>>>>
>>>>>>>>> https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/
>>>>>>>>>
>>>>>>>>> On Fri, Mar 22, 2019 at 3:59 PM Dan Kenigsberg <dan...@redhat.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Mar 22, 2019 at 3:21 PM Marcin Sobczyk <
>>>>>>>>>> msobc...@redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Dafna,
>>>>>>>>>>>
>>>>>>>>>>> in 'verify_add_hosts' we specifically wait for a single host to be
>>>>>>>>>>> up, with a timeout:
>>>>>>>>>>>
>>>>>>>>>>>  144     up_hosts = hosts_service.list(search='datacenter={} AND 
>>>>>>>>>>> status=up'.format(DC_NAME))
>>>>>>>>>>>  145     if len(up_hosts):
>>>>>>>>>>>  146         return True
>>>>>>>>>>>
>>>>>>>>>>> The log files say that it took ~50 secs for one of the hosts to
>>>>>>>>>>> be up (which seems reasonable), and no timeout is reported.
>>>>>>>>>>> Just after running 'verify_add_hosts', we run
>>>>>>>>>>> 'add_master_storage_domain', which calls the '_hosts_in_dc' function.
>>>>>>>>>>> That function does the exact same check, but it fails:
>>>>>>>>>>>
>>>>>>>>>>>  113     hosts = hosts_service.list(search='datacenter={} AND 
>>>>>>>>>>> status=up'.format(dc_name))
>>>>>>>>>>>  114     if hosts:
>>>>>>>>>>>  115         if random_host:
>>>>>>>>>>>  116             return random.choice(hosts)
>>>>>>>>>>>
>>>>>>>>>> I don't think it is relevant to our current failure, but I consider
>>>>>>>>>> random_host=True a bad practice. As if we did not have enough moving
>>>>>>>>>> parts, we are adding intentional randomness. Reproducibility is far
>>>>>>>>>> more important than coverage - particularly for a shared system test
>>>>>>>>>> like OST.
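>>>>>>>>>>
>>>>>>>>>> A deterministic alternative would be a one-line change - just a sketch:
>>>>>>>>>>
>>>>>>>>>> # Always pick the first host by name, so that repeated
>>>>>>>>>> # runs exercise the same code path.
>>>>>>>>>> return sorted(hosts, key=lambda host: host.name)[0]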
>>>>>>>>>>
>>>>>>>>>>>  117         else:
>>>>>>>>>>>  118             return sorted(hosts, key=lambda host: host.name)
>>>>>>>>>>>  119     raise RuntimeError('Could not find hosts that are up in DC 
>>>>>>>>>>> %s' % dc_name)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I'm also not able to reproduce this issue locally on my server.
>>>>>>>>>>> The investigation continues...
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I think that it would be fair to take the filtering by host state
>>>>>>>>>> out of the engine and into the test, where we can easily log the
>>>>>>>>>> current status of each host. Then we'd have a better understanding on
>>>>>>>>>> the next failure.
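>>>>>>>>>>
>>>>>>>>>> A rough sketch of what that could look like in the test (a
>>>>>>>>>> hypothetical variant of '_hosts_in_dc'; assumes the ovirtsdk4
>>>>>>>>>> 'types' module is imported as elsewhere in the suite):
>>>>>>>>>>
>>>>>>>>>> def _hosts_in_dc(api, dc_name):
>>>>>>>>>>     hosts_service = api.system_service().hosts_service()
>>>>>>>>>>     # Fetch every host in the DC and filter client-side, so that
>>>>>>>>>>     # the status of each host can be logged when the check fails.
>>>>>>>>>>     all_hosts = hosts_service.list(
>>>>>>>>>>         search='datacenter={}'.format(dc_name))
>>>>>>>>>>     for host in all_hosts:
>>>>>>>>>>         print('host %s is %s' % (host.name, host.status))
>>>>>>>>>>     up = [h for h in all_hosts if h.status == types.HostStatus.UP]
>>>>>>>>>>     if up:
>>>>>>>>>>         return sorted(up, key=lambda host: host.name)
>>>>>>>>>>     raise RuntimeError('Could not find hosts that are up in DC %s' % dc_name)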
>>>>>>>>>>
>>>>>>>>>> On 3/22/19 1:17 PM, Marcin Sobczyk wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Sure, I'm on it - it's weird though, I did run the 4.3 basic suite
>>>>>>>>>>> for this patch manually and everything was OK.
>>>>>>>>>>> On 3/22/19 1:05 PM, Dafna Ron wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> We are failing branch 4.3 for test:
>>>>>>>>>>> 002_bootstrap.add_master_storage_domain
>>>>>>>>>>>
>>>>>>>>>>> It seems that on one of the hosts vdsm is not starting;
>>>>>>>>>>> there is nothing in vdsm.log or in supervdsm.log.
>>>>>>>>>>>
>>>>>>>>>>> CQ identified this patch as the suspected root cause:
>>>>>>>>>>>
>>>>>>>>>>> https://gerrit.ovirt.org/#/c/98748/ - vdsm: client: Add support
>>>>>>>>>>> for flow id
>>>>>>>>>>>
>>>>>>>>>>> Milan, Marcin, can you please have a look?
>>>>>>>>>>>
>>>>>>>>>>> full logs:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/326/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-4.3/post-002_bootstrap.py/
>>>>>>>>>>>
>>>>>>>>>>> The only error I can see is about the host not being up (which
>>>>>>>>>>> makes sense, as vdsm is not running):
>>>>>>>>>>>
>>>>>>>>>>> Stacktrace
>>>>>>>>>>>
>>>>>>>>>>>   File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
>>>>>>>>>>>     testMethod()
>>>>>>>>>>>   File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, 
>>>>>>>>>>> in runTest
>>>>>>>>>>>     self.test(*self.arg)
>>>>>>>>>>>   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", 
>>>>>>>>>>> line 142, in wrapped_test
>>>>>>>>>>>     test()
>>>>>>>>>>>   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", 
>>>>>>>>>>> line 60, in wrapper
>>>>>>>>>>>     return func(get_test_prefix(), *args, **kwargs)
>>>>>>>>>>>   File 
>>>>>>>>>>> "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py",
>>>>>>>>>>>  line 417, in add_master_storage_domain
>>>>>>>>>>>     add_iscsi_storage_domain(prefix)
>>>>>>>>>>>   File 
>>>>>>>>>>> "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py",
>>>>>>>>>>>  line 561, in add_iscsi_storage_domain
>>>>>>>>>>>     host=_random_host_from_dc(api, DC_NAME),
>>>>>>>>>>>   File 
>>>>>>>>>>> "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py",
>>>>>>>>>>>  line 122, in _random_host_from_dc
>>>>>>>>>>>     return _hosts_in_dc(api, dc_name, True)
>>>>>>>>>>>   File 
>>>>>>>>>>> "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py",
>>>>>>>>>>>  line 119, in _hosts_in_dc
>>>>>>>>>>>     raise RuntimeError('Could not find hosts that are up in DC %s' 
>>>>>>>>>>> % dc_name)
>>>>>>>>>>> 'Could not find hosts that are up in DC 
>>>>>>>>>>> test-dc\n-------------------- >> begin captured logging << 
>>>>>>>>>>> --------------------\nlago.ssh: DEBUG: start 
>>>>>>>>>>> task:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get ssh client for 
>>>>>>>>>>> lago-basic-suite-4-3-engine:\nlago.ssh: DEBUG: end 
>>>>>>>>>>> task:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get ssh client for 
>>>>>>>>>>> lago-basic-suite-4-3-engine:\nlago.ssh: DEBUG: Running c07b5ee2 on 
>>>>>>>>>>> lago-basic-suite-4-3-engine: cat /root/multipath.txt\nlago.ssh: 
>>>>>>>>>>> DEBUG: Command c07b5ee2 on lago-basic-suite-4-3-engine returned 
>>>>>>>>>>> with 0\nlago.ssh: DEBUG: Command c07b5ee2 on 
>>>>>>>>>>> lago-basic-suite-4-3-engine output:\n 
>>>>>>>>>>> 3600140516f88cafa71243648ea218995\n360014053e28f60001764fed9978ec4b3\n360014059edc777770114a6484891dcf1\n36001405d93d8585a50d43a4ad0bd8d19\n36001405e31361631de14bcf87d43e55a\n\n-----------
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> Eyal edri
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> MANAGER
>>>>>>>>>
>>>>>>>>> RHV/CNV DevOps
>>>>>>>>>
>>>>>>>>> EMEA VIRTUALIZATION R&D
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Red Hat EMEA <https://www.redhat.com/>
>>>>>>>>> <https://red.ht/sig> TRIED. TESTED. TRUSTED.
>>>>>>>>> <https://redhat.com/trusted>
>>>>>>>>> phone: +972-9-7692018
>>>>>>>>> irc: eedri (on #tlv #rhev-dev #rhev-integ)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>
>>> --
>>>
>>> SANDRO BONAZZOLA
>>>
>>> MANAGER, SOFTWARE ENGINEERING, EMEA R&D RHV
>>>
>>> Red Hat EMEA <https://www.redhat.com/>
>>>
>>> sbona...@redhat.com
>>> <https://red.ht/sig>
>>>
>>
>>
>> --
>>
>> Eyal edri
>>
>>
>> MANAGER
>>
>> RHV/CNV DevOps
>>
>> EMEA VIRTUALIZATION R&D
>>
>>
>> Red Hat EMEA <https://www.redhat.com/>
>> <https://red.ht/sig> TRIED. TESTED. TRUSTED. <https://redhat.com/trusted>
>> phone: +972-9-7692018
>> irc: eedri (on #tlv #rhev-dev #rhev-integ)
>>
>
_______________________________________________
Devel mailing list -- devel@ovirt.org
To unsubscribe send an email to devel-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/devel@ovirt.org/message/DMBBLXJFPBIFEM2M7FVPMXAKXISYPNMB/
