On Tue, Mar 26, 2019 at 2:42 PM Marcin Sobczyk <msobc...@redhat.com>
wrote:

> OK, maybe I've found something. I was comparing two sets of log files:
> one for a basic-suite-4.3 run that succeeded and one that did not.
>
> The deployment procedure for host-1 is not complete in the unsuccessful
> case. The last log entry is:
>
> 2019-03-22 17:54:49,973-04 INFO  
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
> (VdsDeploy) [6f2e097b] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing 
> Host lago-basic-suite-4-3-host-1. Starting vdsm.
>
> In the successful case we can see:
>
> 2019-03-25 06:16:31,789-04 INFO  
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
> (VdsDeploy) [3add0cac] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing 
> Host lago-basic-suite-4-3-host-1. Starting vdsm.
> 2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] 
> (VdsDeploy) [3add0cac] Got: **%EventEnd STAGE closeup METHOD 
> otopi.plugins.ovirt_host_deploy.vdsm.packages.Plugin._start 
> (odeploycons.packages.vdsm.started)
> 2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] 
> (VdsDeploy) [3add0cac] Got: **%EventStart STAGE closeup METHOD 
> otopi.plugins.ovirt_host_deploy.vmconsole.packages.Plugin._start (None)
> 2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] 
> (VdsDeploy) [3add0cac] Got: ***L:INFO Starting ovirt-vmconsole-host-sshd
> 2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] 
> (VdsDeploy) [3add0cac] nextEvent: Log INFO Starting ovirt-vmconsole-host-sshd
> 2019-03-25 06:16:38,261-04 INFO  
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
> (VdsDeploy) [3add0cac] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing 
> Host lago-basic-suite-4-3-host-1. Starting ovirt-vmconsole-host-sshd.
> 2019-03-25 06:16:39,479-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] 
> (VdsDeploy) [3add0cac] Got: **%EventEnd STAGE closeup METHOD 
> otopi.plugins.ovirt_host_deploy.vmconsole.packages.Plugin._start (None)
> ...
>
>
> It looks like the ovirt-host-deploy plugin did not respond?
>
> The timing of:
>
> https://gerrit.ovirt.org/#/c/98762/
>
> roughly aligns with my patch, which was merged for vdsm-4.3 and was a suspect:
>
> https://gerrit.ovirt.org/#/c/98748/
>
> Sandro, are you aware of any issues with the ovirt-host-deploy plugin that
> might have caused this?
>

I don't really see how a patch touching the automation configuration for
STDCI v2 can be the cause of this, but please go ahead and blacklist the last
build.
It should differ from the previous one only by the timestamp of the RPM :-)


> Dafna, can we temporarily blacklist the latest version of
> ovirt-host-deploy to see if it fixes CQ?
>
> Marcin
> On 3/25/19 2:45 PM, Dafna Ron wrote:
>
> Yes, I suspect this is the problem.
> In the host's yum log I can see the times at which the vdsm packages were
> installed:
> [dron@dron post-002_bootstrap.py]$ less
> lago-basic-suite-4-3-host-1/_var_log/yum.log |grep vdsm
> Mar 25 07:48:49 Installed: vdsm-api-4.30.11-23.git276b602.el7.noarch
> Mar 25 07:48:52 Installed: vdsm-yajsonrpc-4.30.11-23.git276b602.el7.noarch
> Mar 25 07:49:00 Installed: vdsm-common-4.30.11-23.git276b602.el7.noarch
> Mar 25 07:49:00 Installed:
> vdsm-hook-vhostmd-4.30.11-23.git276b602.el7.noarch
> Mar 25 07:49:38 Installed: vdsm-network-4.30.11-23.git276b602.el7.x86_64
> Mar 25 07:49:38 Installed: vdsm-python-4.30.11-23.git276b602.el7.noarch
> Mar 25 07:49:39 Installed: vdsm-client-4.30.11-23.git276b602.el7.noarch
> Mar 25 07:49:39 Installed: vdsm-jsonrpc-4.30.11-23.git276b602.el7.noarch
> Mar 25 07:49:39 Installed: vdsm-http-4.30.11-23.git276b602.el7.noarch
> Mar 25 07:49:42 Installed:
> vdsm-hook-openstacknet-4.30.11-23.git276b602.el7.noarch
> Mar 25 07:49:57 Installed:
> vdsm-hook-vmfex-dev-4.30.11-23.git276b602.el7.noarch
> Mar 25 07:49:58 Installed: vdsm-4.30.11-23.git276b602.el7.x86_64
> Mar 25 07:49:58 Installed:
> vdsm-hook-ethtool-options-4.30.11-23.git276b602.el7.noarch
> Mar 25 07:49:58 Installed: vdsm-hook-fcoe-4.30.11-23.git276b602.el7.noarch
>
> and in the engine log you can see that the engine has already failed the installation:
>
> 2019-03-25 07:48:33,920-04 INFO  
> [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) 
> [6baa2a84-b322-42fd-855d-b25272a19a23] Running command: AddVdsCommand 
> internal: false. Entities affected :  ID: 
> 319fcdf5-5941-46b1-a75c-d335bf500f82 Type: ClusterAction group CREATE_HOST 
> with role type ADMIN
>
>
> On Mon, Mar 25, 2019 at 1:35 PM Dafna Ron <d...@redhat.com> wrote:
>
>> I am also adding Didi.
>> Should this not be logged in host-deploy?
>>
>>
>> On Mon, Mar 25, 2019 at 1:33 PM Marcin Sobczyk <msobc...@redhat.com>
>> wrote:
>>
>>> Indeed, in my failed job I can see that 'vdsm-python' is installed
>>> *after* the engine's attempt to run 'vdsm-tool' (a quick timeline check
>>> follows the log excerpts below):
>>>
>>> lago-basic-suite-4-3-host-1/_var_log/yum.log
>>> 403:Mar 22 17:54:12 Installed: vdsm-python-4.30.11-23.git276b602.el7.noarch
>>>
>>> lago-basic-suite-4-3-engine/_var_log/ovirt-engine/engine.log
>>> 1580:2019-03-22 17:53:03,793-04 DEBUG 
>>> [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) 
>>> [d86d9b56-2564-44a9-a8d4-01a4d7f5a529] Executing: '/usr/bin/vdsm-tool 
>>> vdsm-id'
>>> 1581:2019-03-22 17:53:04,043-04 WARN  
>>> [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) 
>>> [d86d9b56-2564-44a9-a8d4-01a4d7f5a529] Failed to initiate vdsm-id request 
>>> on host: Command returned failure code 127 during SSH session 
>>> 'root@lago-basic-suite-4-3-host-1'
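>>>
>>> To spell out the timeline, a small worked example (illustrative only; it
>>> uses nothing but the two timestamps quoted above, and the strptime format
>>> strings are mine):
>>>
>>>     from datetime import datetime
>>>
>>>     # vdsm-python install time from yum.log (the year is implicit there)
>>>     installed = datetime.strptime('2019 Mar 22 17:54:12', '%Y %b %d %H:%M:%S')
>>>     # engine's '/usr/bin/vdsm-tool vdsm-id' attempt from engine.log
>>>     probed = datetime.strptime('2019-03-22 17:53:03', '%Y-%m-%d %H:%M:%S')
>>>
>>>     # The engine probed the host ~69 seconds before vdsm-python was
>>>     # installed, which matches the exit code 127 (command not found).
>>>     print((installed - probed).total_seconds())  # 69.0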
>>>
>>>
>>> On 3/25/19 2:22 PM, Dafna Ron wrote:
>>>
>>> and it's only failing on one host as well.
>>>
>>>
>>> On Mon, Mar 25, 2019 at 1:17 PM Marcin Sobczyk <msobc...@redhat.com>
>>> wrote:
>>>
>>>> Hmm, that would suggest that the 'vdsm-tool' command is not found on the
>>>> host (exit code 127 means 'command not found'). There was no recent work
>>>> around 'vdsm-tool's 'vdsm-id'.
>>>> On 3/25/19 2:08 PM, Dafna Ron wrote:
>>>>
>>>> Actually, I can see in the engine log that it is trying to start the vdsm
>>>> installation on the host:
>>>>
>>>> 2019-03-25 07:48:33,123-04 DEBUG 
>>>> [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) 
>>>> [] Entered SsoRestApiAuthFilter
>>>> 2019-03-25 07:48:33,123-04 DEBUG 
>>>> [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) 
>>>> [] SsoRestApiAuthFilter authenticating with sso
>>>> 2019-03-25 07:48:33,123-04 DEBUG 
>>>> [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) 
>>>> [] SsoRestApiAuthFilter authenticating using BEARER header
>>>> 2019-03-25 07:48:33,123-04 DEBUG 
>>>> [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) 
>>>> [] SsoRestApiAuthFilter successfully authenticated using BEARER header
>>>> 2019-03-25 07:48:33,123-04 DEBUG 
>>>> [org.ovirt.engine.core.aaa.filters.SsoRestApiNegotiationFilter] (default 
>>>> task-1) [] Entered SsoRestApiNegotiationFilter
>>>> 2019-03-25 07:48:33,124-04 DEBUG 
>>>> [org.ovirt.engine.core.aaa.filters.SsoRestApiNegotiationFilter] (default 
>>>> task-1) [] SsoRestApiNegotiationFilter Not performing Negotiate Auth
>>>> 2019-03-25 07:48:33,143-04 DEBUG 
>>>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>>>>  (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored 
>>>> procedure. Call string is [{call getclusterforuserbyclustername(?, ?, ?)}]
>>>> 2019-03-25 07:48:33,143-04 DEBUG 
>>>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>>>>  (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for 
>>>> procedure [GetClusterForUserByClusterName] compiled
>>>> 2019-03-25 07:48:33,145-04 DEBUG 
>>>> [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] 
>>>> (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: runQuery, 
>>>> params: [GetClusterByName, NameQueryParameters:{refresh='false', 
>>>> filtered='false'}], timeElapsed: 7ms
>>>> 2019-03-25 07:48:33,161-04 DEBUG 
>>>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>>>>  (default task-1) [] Compiled stored procedure. Call string is [{call 
>>>> getvmstaticbyname(?)}]
>>>> 2019-03-25 07:48:33,161-04 DEBUG 
>>>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>>>>  (default task-1) [] SqlCall for procedure [GetVmStaticByName] compiled
>>>> 2019-03-25 07:48:33,206-04 DEBUG 
>>>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>>>>  (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored 
>>>> procedure. Call string is [{call getvdsbyname(?)}]
>>>> 2019-03-25 07:48:33,206-04 DEBUG 
>>>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>>>>  (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for 
>>>> procedure [GetVdsByName] compiled
>>>> 2019-03-25 07:48:33,223-04 DEBUG 
>>>> [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] 
>>>> (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: getByName, 
>>>> params: [lago-basic-suite-4-3-host-1], timeElapsed: 20ms
>>>> 2019-03-25 07:48:33,225-04 DEBUG 
>>>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>>>>  (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored 
>>>> procedure. Call string is [{call getvdsbyhostname(?)}]
>>>> 2019-03-25 07:48:33,225-04 DEBUG 
>>>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>>>>  (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for 
>>>> procedure [GetVdsByHostName] compiled
>>>> 2019-03-25 07:48:33,252-04 DEBUG 
>>>> [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] 
>>>> (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: 
>>>> getAllForHostname, params: [lago-basic-suite-4-3-host-1], timeElapsed: 28ms
>>>> 2019-03-25 07:48:33,254-04 DEBUG 
>>>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>>>>  (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored 
>>>> procedure. Call string is [{call getstorage_poolsbyclusterid(?)}]
>>>> 2019-03-25 07:48:33,254-04 DEBUG 
>>>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>>>>  (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for 
>>>> procedure [Getstorage_poolsByClusterId] compiled
>>>> 2019-03-25 07:48:33,262-04 DEBUG 
>>>> [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) 
>>>> [6baa2a84-b322-42fd-855d-b25272a19a23] Connecting 
>>>> 'root@lago-basic-suite-4-3-host-1'
>>>> 2019-03-25 07:48:33,461-04 DEBUG 
>>>> [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) 
>>>> [6baa2a84-b322-42fd-855d-b25272a19a23] Connected: 
>>>> 'root@lago-basic-suite-4-3-host-1'
>>>> 2019-03-25 07:48:33,467-04 DEBUG 
>>>> [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) 
>>>> [6baa2a84-b322-42fd-855d-b25272a19a23] Authenticating: 
>>>> 'root@lago-basic-suite-4-3-host-1'
>>>> 2019-03-25 07:48:33,599-04 DEBUG 
>>>> [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) 
>>>> [6baa2a84-b322-42fd-855d-b25272a19a23] Authenticated: 
>>>> 'root@lago-basic-suite-4-3-host-1'
>>>> 2019-03-25 07:48:33,600-04 DEBUG 
>>>> [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) 
>>>> [6baa2a84-b322-42fd-855d-b25272a19a23] Executing: '/usr/bin/vdsm-tool 
>>>> vdsm-id'
>>>> 2019-03-25 07:48:33,902-04 WARN  
>>>> [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) 
>>>> [6baa2a84-b322-42fd-855d-b25272a19a23] Failed to initiate vdsm-id request 
>>>> on host: Command returned failure code 127 during SSH session 
>>>> 'root@lago-basic-suite-4-3-host-1'
>>>> 2019-03-25 07:48:33,920-04 INFO  
>>>> [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) 
>>>> [6baa2a84-b322-42fd-855d-b25272a19a23] Running command: AddVdsCommand 
>>>> internal: false. Entities affected :  ID: 
>>>> 319fcdf5-5941-46b1-a75c-d335bf500f82 Type: ClusterAction group CREATE_HOST 
>>>> with role type ADMIN
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Mar 25, 2019 at 12:53 PM Dafna Ron <d...@redhat.com> wrote:
>>>>
>>>>> We have another failing patch with the same test:
>>>>> http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/360/
>>>>>
>>>>> It's obviously not related to the patch, but something is causing these
>>>>> failures randomly.
>>>>> From what I can see in the current failed job, you are correct about
>>>>> host-1: the engine does not even try to deploy it.
>>>>>
>>>>> I can see host-1 is getting error 127 in lago for ntpdate, and I can see
>>>>> NetworkManager errors in the host's messages log.
>>>>>
>>>>> In the host's messages log I can see several messages that I suspect may
>>>>> cause issues in the communication between the engine and the host:
>>>>>
>>>>> Mar 25 07:50:09 lago-basic-suite-4-3-host-1 sasldblistusers2:
>>>>> _sasldb_getkeyhandle has failed
>>>>> Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error
>>>>> deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair
>>>>> found
>>>>> Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error
>>>>> deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair
>>>>> found
>>>>> Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error
>>>>> deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair
>>>>> found
>>>>>
>>>>>
>>>>> ces/12)
>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding:
>>>>> bondscan-UMJa2S is being created...
>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding:
>>>>> bondscan-UMJa2S is being deleted...
>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]:
>>>>> <info>  [1553514613.7774] manager: (bondscan-UMJa2S): new Bond device
>>>>> (/org/freedesktop/NetworkManager/Devices/13)
>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered
>>>>> net_device) (unregistering): Released all slaves
>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding:
>>>>> bondscan-liwvMR is being created...
>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR:
>>>>> option fail_over_mac: invalid value (3)
>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR:
>>>>> option ad_select: invalid value (3)
>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR:
>>>>> option lacp_rate: mode dependency failed, not supported in mode
>>>>> balance-rr(0)
>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR:
>>>>> option arp_validate: invalid value (7)
>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR:
>>>>> option xmit_hash_policy: invalid value (5)
>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]:
>>>>> <info>  [1553514613.8002] manager: (bondscan-liwvMR): new Bond device
>>>>> (/org/freedesktop/NetworkManager/Devices/14)
>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR:
>>>>> option primary_reselect: invalid value (3)
>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR:
>>>>> option arp_all_targets: invalid value (2)
>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding:
>>>>> bondscan-liwvMR is being deleted...
>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered
>>>>> net_device) (unregistering): Released all slaves
>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding:
>>>>> bondscan-liwvMR is being created...
>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR:
>>>>> option fail_over_mac: invalid value (3)
>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR:
>>>>> option ad_select: invalid value (3)
>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR:
>>>>> option lacp_rate: mode dependency failed, not supported in mode
>>>>> active-backup(1)
>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR:
>>>>> option arp_validate: invalid value (7)
>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]:
>>>>> <info>  [1553514613.8429] manager: (bondscan-liwvMR): new Bond device
>>>>> (/org/freedesktop/NetworkManager/Devices/15)
>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR:
>>>>> option xmit_hash_policy: invalid value (5)
>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR:
>>>>> option primary_reselect: invalid value (3)
>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR:
>>>>> option arp_all_targets: invalid value (2)
>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding:
>>>>> bondscan-liwvMR is being deleted...
>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered
>>>>> net_device) (unregistering): Released all slaves
>>>>>
>>>>>
>>>>> On Mon, Mar 25, 2019 at 11:24 AM Marcin Sobczyk <msobc...@redhat.com>
>>>>> wrote:
>>>>>
>>>>>> For the failed job, the engine didn't even try to deploy on host-1:
>>>>>>
>>>>>>
>>>>>> https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-4.3/post-002_bootstrap.py/lago-basic-suite-4-3-engine/_var_log/ovirt-engine/host-deploy/
>>>>>>
>>>>>> Martin, do you know what could be the reason for that?
>>>>>>
>>>>>> I can see in the logs for both successful and unsuccessful
>>>>>> basic-suite-4.3 runs that there is no 'ntpdate' on host-1:
>>>>>>
>>>>>> 2019-03-25 10:14:46,350::ssh.py::ssh::58::lago.ssh::DEBUG::Running 
>>>>>> d0c49b54 on lago-basic-suite-4-3-host-1: ntpdate -4 
>>>>>> lago-basic-suite-4-3-engine
>>>>>> 2019-03-25 10:14:46,383::ssh.py::ssh::81::lago.ssh::DEBUG::Command 
>>>>>> d0c49b54 on lago-basic-suite-4-3-host-1 returned with 127
>>>>>> 2019-03-25 10:14:46,384::ssh.py::ssh::96::lago.ssh::DEBUG::Command 
>>>>>> d0c49b54 on lago-basic-suite-4-3-host-1  errors:
>>>>>>  bash: ntpdate: command not found
>>>>>>
>>>>>> On host-0 everything is ok:
>>>>>>
>>>>>> 2019-03-25 10:14:46,917::ssh.py::ssh::58::lago.ssh::DEBUG::Running 
>>>>>> d11b2a64 on lago-basic-suite-4-3-host-0: ntpdate -4 
>>>>>> lago-basic-suite-4-3-engine
>>>>>> 2019-03-25 10:14:53,088::ssh.py::ssh::81::lago.ssh::DEBUG::Command 
>>>>>> d11b2a64 on lago-basic-suite-4-3-host-0 returned with 0
>>>>>> 2019-03-25 10:14:53,088::ssh.py::ssh::89::lago.ssh::DEBUG::Command 
>>>>>> d11b2a64 on lago-basic-suite-4-3-host-0 output:
>>>>>>  25 Mar 06:14:53 ntpdate[6646]: adjust time server 192.168.202.2 offset 
>>>>>> 0.017150 sec
>>>>>>
>>>>>> On 3/25/19 10:13 AM, Eyal Edri wrote:
>>>>>>
>>>>>> Still fails, now on a different component (ovirt-web-ui-extentions):
>>>>>>
>>>>>> https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/
>>>>>>
>>>>>> On Fri, Mar 22, 2019 at 3:59 PM Dan Kenigsberg <dan...@redhat.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Mar 22, 2019 at 3:21 PM Marcin Sobczyk <msobc...@redhat.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Dafna,
>>>>>>>>
>>>>>>>> in 'verify_add_hosts' we specifically wait for a single host to be up,
>>>>>>>> with a timeout:
>>>>>>>>
>>>>>>>>  144     up_hosts = hosts_service.list(search='datacenter={} AND 
>>>>>>>> status=up'.format(DC_NAME))
>>>>>>>>  145     if len(up_hosts):
>>>>>>>>  146         return True
>>>>>>>>
>>>>>>>> The log files say that it took ~50 seconds for one of the hosts to come
>>>>>>>> up (seems reasonable) and no timeout is reported.
>>>>>>>> Just after running 'verify_add_hosts', we run
>>>>>>>> 'add_master_storage_domain', which calls the '_hosts_in_dc' function.
>>>>>>>> That function does the exact same check, but it fails:
>>>>>>>>
>>>>>>>>  113     hosts = hosts_service.list(search='datacenter={} AND 
>>>>>>>> status=up'.format(dc_name))
>>>>>>>>  114     if hosts:
>>>>>>>>  115         if random_host:
>>>>>>>>  116             return random.choice(hosts)
>>>>>>>>
>>>>>>> I don't think it is relevant to our current failure, but I consider
>>>>>>> random_host=True a bad practice. As if we did not have enough moving
>>>>>>> parts, we are adding intentional randomness. Reproducibility is far more
>>>>>>> important than coverage, particularly for a shared system test like
>>>>>>> OST.
>>>>>>>
>>>>>>>>  117         else:
>>>>>>>>  118             return sorted(hosts, key=lambda host: host.name)
>>>>>>>>  119     raise RuntimeError('Could not find hosts that are up in DC 
>>>>>>>> %s' % dc_name)
>>>>>>>>
>>>>>>>>
>>>>>>>> I'm also not able to reproduce this issue locally on my server. The
>>>>>>>> investigation continues...
>>>>>>>>
>>>>>>>
>>>>>>> I think that it would be fair to take the filtering by host state out
>>>>>>> of Engine and into the test, where we can easily log the current status
>>>>>>> of each host. Then we'd have a better understanding of the next failure.
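>>>>>>>
>>>>>>> A rough sketch of what that could look like in 002_bootstrap.py (just an
>>>>>>> illustration, assuming the hosts_service API and the sdk4 HostStatus type
>>>>>>> already used by the suite; LOGGER is a placeholder name):
>>>>>>>
>>>>>>>     import logging
>>>>>>>     import ovirtsdk4.types as types
>>>>>>>
>>>>>>>     LOGGER = logging.getLogger(__name__)
>>>>>>>
>>>>>>>     def _hosts_in_dc(api, dc_name):
>>>>>>>         hosts_service = api.system_service().hosts_service()
>>>>>>>         # Fetch *all* hosts of the DC and filter by status here, in the
>>>>>>>         # test, so a failure tells us the state each host was actually in.
>>>>>>>         all_hosts = hosts_service.list(search='datacenter={}'.format(dc_name))
>>>>>>>         for host in all_hosts:
>>>>>>>             LOGGER.debug('host %s is %s', host.name, host.status)
>>>>>>>         up_hosts = [h for h in all_hosts if h.status == types.HostStatus.UP]
>>>>>>>         if up_hosts:
>>>>>>>             # Deterministic order instead of random.choice(), for
>>>>>>>             # reproducibility.
>>>>>>>             return sorted(up_hosts, key=lambda host: host.name)
>>>>>>>         raise RuntimeError(
>>>>>>>             'Could not find hosts that are up in DC %s' % dc_name)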
>>>>>>>
>>>>>>> On 3/22/19 1:17 PM, Marcin Sobczyk wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Sure, I'm on it. It's weird though: I did run the 4.3 basic suite for
>>>>>>>> this patch manually and everything was OK.
>>>>>>>> On 3/22/19 1:05 PM, Dafna Ron wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We are failing branch 4.3 for test:
>>>>>>>> 002_bootstrap.add_master_storage_domain
>>>>>>>>
>>>>>>>> It seems that on one of the hosts vdsm is not starting;
>>>>>>>> there is nothing in vdsm.log or in supervdsm.log.
>>>>>>>>
>>>>>>>> CQ identified this patch as the suspected root cause:
>>>>>>>>
>>>>>>>> https://gerrit.ovirt.org/#/c/98748/ - vdsm: client: Add support
>>>>>>>> for flow id
>>>>>>>>
>>>>>>>> Milan, Marcin, can you please have a look?
>>>>>>>>
>>>>>>>> full logs:
>>>>>>>>
>>>>>>>>
>>>>>>>> http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/326/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-4.3/post-002_bootstrap.py/
>>>>>>>>
>>>>>>>> The only error I can see is about the host not being up (which makes
>>>>>>>> sense, as vdsm is not running):
>>>>>>>>
>>>>>>>> Stacktrace
>>>>>>>>
>>>>>>>>   File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
>>>>>>>>     testMethod()
>>>>>>>>   File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in 
>>>>>>>> runTest
>>>>>>>>     self.test(*self.arg)
>>>>>>>>   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 
>>>>>>>> 142, in wrapped_test
>>>>>>>>     test()
>>>>>>>>   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 
>>>>>>>> 60, in wrapper
>>>>>>>>     return func(get_test_prefix(), *args, **kwargs)
>>>>>>>>   File 
>>>>>>>> "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py",
>>>>>>>>  line 417, in add_master_storage_domain
>>>>>>>>     add_iscsi_storage_domain(prefix)
>>>>>>>>   File 
>>>>>>>> "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py",
>>>>>>>>  line 561, in add_iscsi_storage_domain
>>>>>>>>     host=_random_host_from_dc(api, DC_NAME),
>>>>>>>>   File 
>>>>>>>> "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py",
>>>>>>>>  line 122, in _random_host_from_dc
>>>>>>>>     return _hosts_in_dc(api, dc_name, True)
>>>>>>>>   File 
>>>>>>>> "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py",
>>>>>>>>  line 119, in _hosts_in_dc
>>>>>>>>     raise RuntimeError('Could not find hosts that are up in DC %s' % 
>>>>>>>> dc_name)
>>>>>>>> 'Could not find hosts that are up in DC test-dc\n-------------------- 
>>>>>>>> >> begin captured logging << --------------------\nlago.ssh: DEBUG: 
>>>>>>>> start task:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get ssh client for 
>>>>>>>> lago-basic-suite-4-3-engine:\nlago.ssh: DEBUG: end 
>>>>>>>> task:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get ssh client for 
>>>>>>>> lago-basic-suite-4-3-engine:\nlago.ssh: DEBUG: Running c07b5ee2 on 
>>>>>>>> lago-basic-suite-4-3-engine: cat /root/multipath.txt\nlago.ssh: DEBUG: 
>>>>>>>> Command c07b5ee2 on lago-basic-suite-4-3-engine returned with 
>>>>>>>> 0\nlago.ssh: DEBUG: Command c07b5ee2 on lago-basic-suite-4-3-engine 
>>>>>>>> output:\n 
>>>>>>>> 3600140516f88cafa71243648ea218995\n360014053e28f60001764fed9978ec4b3\n360014059edc777770114a6484891dcf1\n36001405d93d8585a50d43a4ad0bd8d19\n36001405e31361631de14bcf87d43e55a\n\n-----------
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Eyal edri
>>>>>>
>>>>>>
>>>>>> MANAGER
>>>>>>
>>>>>> RHV/CNV DevOps
>>>>>>
>>>>>> EMEA VIRTUALIZATION R&D
>>>>>>
>>>>>>
>>>>>> Red Hat EMEA <https://www.redhat.com/>
>>>>>> <https://red.ht/sig> TRIED. TESTED. TRUSTED.
>>>>>> <https://redhat.com/trusted>
>>>>>> phone: +972-9-7692018
>>>>>> irc: eedri (on #tlv #rhev-dev #rhev-integ)
>>>>>>
>>>>>>
>>>>>

-- 

SANDRO BONAZZOLA

MANAGER, SOFTWARE ENGINEERING, EMEA R&D RHV

Red Hat EMEA <https://www.redhat.com/>

sbona...@redhat.com
<https://red.ht/sig>
_______________________________________________
Devel mailing list -- devel@ovirt.org
To unsubscribe send an email to devel-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/devel@ovirt.org/message/SR7YJIIZHM36KRKI2XJYBEEI6CU4PCKF/
