Thank you Marcin :) I am also adding Didi to see if he knows of any changes. Anton, who from the Engine/API side can help take it from here?
On Fri, Mar 29, 2019 at 9:37 AM Marcin Sobczyk <msobc...@redhat.com> wrote: > Hi, > > so the thing boils down to this: > > > https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/471/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-4.3/post-002_bootstrap.py/lago-basic-suite-4-3-engine/_etc_httpd/logs/ssl_access_log/*view*/ > This is the 'httpd' log file from the most recent job that failed > 'add_master_storage_domain'. > At the end of the log file, we can see: > > 192.168.202.1 - - [28/Mar/2019:11:05:38 -0400] "GET > /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dinstalling+or+status%3Dinitializing > HTTP/1.1" 200 11845 > > 192.168.202.1 - - [28/Mar/2019:11:05:41 -0400] "GET > /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dinstalling+or+status%3Dinitializing > HTTP/1.1" 200 11847 > > 192.168.202.1 - - [28/Mar/2019:11:05:44 -0400] "GET > /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dinstalling+or+status%3Dinitializing > HTTP/1.1" 200 11847 > > 192.168.202.1 - - [28/Mar/2019:11:05:47 -0400] "GET > /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dinstalling+or+status%3Dinitializing > HTTP/1.1" 200 5959 > > 192.168.202.1 - - [28/Mar/2019:11:05:47 -0400] "GET > /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dup HTTP/1.1" > 200 7826 > > 192.168.202.1 - - [28/Mar/2019:11:05:47 -0400] "GET /ovirt-engine/api > HTTP/1.1" 200 5112 > > 192.168.202.1 - - [28/Mar/2019:11:05:48 -0400] "GET > /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dup HTTP/1.1" > 200 65 > > 192.168.202.2 - - [28/Mar/2019:11:05:50 -0400] "POST > /ovirt-engine/sso/oauth/token-info HTTP/1.1" 200 223 > > The first batch of 'status=installing or status=initializing' requests and > the first 'status=up' request come from 'verify_add_hosts'. > The second 'status=up' request is the one that comes from the '_hosts_in_dc' > call in 'add_master_storage_domain'.
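The discrepancy Marcin describes (7826 bytes vs. 65 bytes for the same 'status=up' query, one second apart) can also be spotted mechanically by parsing the access log. A minimal sketch, not part of the suite's tooling; the two sample entries are copied from the excerpt above:

```python
import re

# Two consecutive responses to the same REST query, copied from the
# ssl_access_log excerpt above (the byte count is the last field).
log_lines = [
    '192.168.202.1 - - [28/Mar/2019:11:05:47 -0400] "GET /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dup HTTP/1.1" 200 7826',
    '192.168.202.1 - - [28/Mar/2019:11:05:48 -0400] "GET /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dup HTTP/1.1" 200 65',
]

# Common Log Format: the request line in quotes, then status and byte count.
pattern = re.compile(r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d+) (?P<size>\d+)')

sizes = {}
for line in log_lines:
    match = pattern.search(line)
    sizes.setdefault(match.group('path'), []).append(int(match.group('size')))

# A sudden order-of-magnitude drop for the same query is worth flagging.
for path, byte_counts in sizes.items():
    if max(byte_counts) > 10 * min(byte_counts):
        print('suspicious size change for %s: %s' % (path, byte_counts))
```

Running this over the full log would flag any query whose response size collapses between consecutive polls, which is exactly the symptom here.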
> They are executed with a 1 second time difference, yet the results are not > the same - 7826 bytes vs 65 bytes. > I went through every line of 'engine.log' from that 1 sec interval and > couldn't find anything that would indicate the host is going down. > I asked Ravi for help yesterday and found out that the results of > these requests are queries to our postgres db and they are somehow cached. > I see 2 options here - either something really changed the db contents > during that 1 second or there's a bug in our API/cache that returns the > wrong info. > I think we can eliminate the first option by adding collection of the postgres > db to our OST suites and looking at its history after a failure. > I don't really have a solution for the second possibility - any help from > someone who knows the engine's code better would be appreciated. > > > On 3/26/19 5:37 PM, Dafna Ron wrote: > > We never remove any packages from tested :) > I am actually suspecting a problem that is caused by something that is > shared by all projects, like the API. > We currently have 3 different tests that are randomly failing because of > engine/host communication, and I am starting to think they are all > connected. > > > On Tue, Mar 26, 2019 at 2:03 PM Eyal Edri <ee...@redhat.com> wrote: > >> >> >> On Tue, Mar 26, 2019 at 3:56 PM Sandro Bonazzola <sbona...@redhat.com> >> wrote: >> >>> >>> >>> On Tue, Mar 26, 2019 at 2:42 PM Marcin Sobczyk < >>> msobc...@redhat.com> wrote: >>> >>>> Ok, maybe I've found something. I was comparing two sets of log files - >>>> one for a basic-suite-4.3 run that succeeded and one that did not. >>>> >>>> The deployment procedure for host-1 is not complete in the unsuccessful >>>> case. The last log entry is: >>>> >>>> 2019-03-22 17:54:49,973-04 INFO >>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] >>>> (VdsDeploy) [6f2e097b] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing >>>> Host lago-basic-suite-4-3-host-1.
Starting vdsm. >>>> >>>> In the successful case we can see: >>>> >>>> 2019-03-25 06:16:31,789-04 INFO >>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] >>>> (VdsDeploy) [3add0cac] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing >>>> Host lago-basic-suite-4-3-host-1. Starting vdsm. >>>> 2019-03-25 06:16:38,257-04 DEBUG >>>> [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: >>>> **%EventEnd STAGE closeup METHOD >>>> otopi.plugins.ovirt_host_deploy.vdsm.packages.Plugin._start >>>> (odeploycons.packages.vdsm.started) >>>> 2019-03-25 06:16:38,257-04 DEBUG >>>> [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: >>>> **%EventStart STAGE closeup METHOD >>>> otopi.plugins.ovirt_host_deploy.vmconsole.packages.Plugin._start (None) >>>> 2019-03-25 06:16:38,257-04 DEBUG >>>> [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: >>>> ***L:INFO Starting ovirt-vmconsole-host-sshd >>>> 2019-03-25 06:16:38,257-04 DEBUG >>>> [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] >>>> nextEvent: Log INFO Starting ovirt-vmconsole-host-sshd >>>> 2019-03-25 06:16:38,261-04 INFO >>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] >>>> (VdsDeploy) [3add0cac] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing >>>> Host lago-basic-suite-4-3-host-1. Starting ovirt-vmconsole-host-sshd. >>>> 2019-03-25 06:16:39,479-04 DEBUG >>>> [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: >>>> **%EventEnd STAGE closeup METHOD >>>> otopi.plugins.ovirt_host_deploy.vmconsole.packages.Plugin._start (None) >>>> ... >>>> >>>> >>>> It looks like ovirt-host-deploy plugin did not respond? 
>>>> >>>> The timing of: >>>> >>>> https://gerrit.ovirt.org/#/c/98762/ >>>> >>>> kinda aligns with my patch, which was merged for vdsm-4.3 and was a >>>> suspect: >>>> >>>> https://gerrit.ovirt.org/#/c/98748/ >>>> >>>> Sandro, are you aware of any issues with the ovirt-host-deploy plugin that >>>> might have caused these? >>>> >>> >>> I don't really see how a patch touching the automation configuration for >>> STDCI v2 can be the cause of this, but please go ahead and blacklist the last >>> build. >>> It should differ from the previous one only by the timestamp of the rpm :-) >>> >> > Ehhhh, yeah, I'm so tired of looking at log files I'm starting to see > patterns everywhere... > > > >> If you want to verify that a specific version of an RPM works, you can use >> the manual job for it. >> >> >>> >>> >>>> Dafna, can we temporarily blacklist the latest version of >>>> ovirt-host-deploy to see if it fixes CQ? >>>> >>>> Marcin >>>> On 3/25/19 2:45 PM, Dafna Ron wrote: >>>> >>>> Yes. I suspect this is the problem.
>>>> In the host I can see in the yum log the times the vdsm packages are >>>> installed: >>>> [dron@dron post-002_bootstrap.py]$ less >>>> lago-basic-suite-4-3-host-1/_var_log/yum.log |grep vdsm >>>> Mar 25 07:48:49 Installed: vdsm-api-4.30.11-23.git276b602.el7.noarch >>>> Mar 25 07:48:52 Installed: >>>> vdsm-yajsonrpc-4.30.11-23.git276b602.el7.noarch >>>> Mar 25 07:49:00 Installed: vdsm-common-4.30.11-23.git276b602.el7.noarch >>>> Mar 25 07:49:00 Installed: >>>> vdsm-hook-vhostmd-4.30.11-23.git276b602.el7.noarch >>>> Mar 25 07:49:38 Installed: vdsm-network-4.30.11-23.git276b602.el7.x86_64 >>>> Mar 25 07:49:38 Installed: vdsm-python-4.30.11-23.git276b602.el7.noarch >>>> Mar 25 07:49:39 Installed: vdsm-client-4.30.11-23.git276b602.el7.noarch >>>> Mar 25 07:49:39 Installed: vdsm-jsonrpc-4.30.11-23.git276b602.el7.noarch >>>> Mar 25 07:49:39 Installed: vdsm-http-4.30.11-23.git276b602.el7.noarch >>>> Mar 25 07:49:42 Installed: >>>> vdsm-hook-openstacknet-4.30.11-23.git276b602.el7.noarch >>>> Mar 25 07:49:57 Installed: >>>> vdsm-hook-vmfex-dev-4.30.11-23.git276b602.el7.noarch >>>> Mar 25 07:49:58 Installed: vdsm-4.30.11-23.git276b602.el7.x86_64 >>>> Mar 25 07:49:58 Installed: >>>> vdsm-hook-ethtool-options-4.30.11-23.git276b602.el7.noarch >>>> Mar 25 07:49:58 Installed: >>>> vdsm-hook-fcoe-4.30.11-23.git276b602.el7.noarch >>>> >>>> and in the engine you can see that the engine already failed >>>> installation: >>>> >>>> 2019-03-25 07:48:33,920-04 INFO >>>> [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) >>>> [6baa2a84-b322-42fd-855d-b25272a19a23] Running command: AddVdsCommand >>>> internal: false. Entities affected : ID: >>>> 319fcdf5-5941-46b1-a75c-d335bf500f82 Type: ClusterAction group CREATE_HOST >>>> with role type ADMIN >>>> >>>> >>>> On Mon, Mar 25, 2019 at 1:35 PM Dafna Ron <d...@redhat.com> wrote: >>>> >>>>> I am also adding didi >>>>> should this not be logged in host-deploy? 
>>>>> >>>>> >>>>> On Mon, Mar 25, 2019 at 1:33 PM Marcin Sobczyk <msobc...@redhat.com> >>>>> wrote: >>>>> >>>>>> Indeed, in my failed job I can see that 'vdsm-python' is installed >>>>>> *after* the engine's attempt to run 'vdsm-tool': >>>>>> >>>>>> lago-basic-suite-4-3-host-1/_var_log/yum.log >>>>>> 403:Mar 22 17:54:12 Installed: >>>>>> vdsm-python-4.30.11-23.git276b602.el7.noarch >>>>>> >>>>>> lago-basic-suite-4-3-engine/_var_log/ovirt-engine/engine.log >>>>>> 1580:2019-03-22 17:53:03,793-04 DEBUG >>>>>> [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) >>>>>> [d86d9b56-2564-44a9-a8d4-01a4d7f5a529] Executing: '/usr/bin/vdsm-tool >>>>>> vdsm-id' >>>>>> 1581:2019-03-22 17:53:04,043-04 WARN >>>>>> [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) >>>>>> [d86d9b56-2564-44a9-a8d4-01a4d7f5a529] Failed to initiate vdsm-id >>>>>> request on host: Command returned failure code 127 during SSH session >>>>>> 'root@lago-basic-suite-4-3-host-1' >>>>>> >>>>>> >>>>>> On 3/25/19 2:22 PM, Dafna Ron wrote: >>>>>> >>>>>> And it's only failing on one host as well. >>>>>> >>>>>> >>>>>> On Mon, Mar 25, 2019 at 1:17 PM Marcin Sobczyk <msobc...@redhat.com> >>>>>> wrote: >>>>>> >>>>>>> Hmmm, that would suggest that the 'vdsm-tool' command is not found on >>>>>>> the host. There was no work done around 'vdsm-tool's 'vdsm-id' recently.
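Exit code 127 is the shell's 'command not found' status, which matches the theory that 'vdsm-python' (the package that provides 'vdsm-tool') was not yet installed when the engine ran the command over SSH. A quick local sanity check of that interpretation (a sketch assuming a POSIX 'sh'; the command name is deliberately bogus):

```python
import subprocess

# A POSIX shell exits with 127 when the command to execute cannot be
# found - the same status the engine saw for '/usr/bin/vdsm-tool vdsm-id'.
result = subprocess.run(
    ['sh', '-c', 'no-such-command-vdsm-tool vdsm-id'],
    capture_output=True,  # swallow the "command not found" stderr noise
)
print(result.returncode)  # 127: the command was not found
```

So a 127 from the SSH session points at a missing binary on the host rather than at vdsm itself failing.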
>>>>>>> On 3/25/19 2:08 PM, Dafna Ron wrote: >>>>>>> >>>>>>> Actually, I can see in engine that it is trying to start vdsm >>>>>>> installation on the host: >>>>>>> >>>>>>> 2019-03-25 07:48:33,123-04 DEBUG >>>>>>> [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default >>>>>>> task-1) [] Entered SsoRestApiAuthFilter >>>>>>> 2019-03-25 07:48:33,123-04 DEBUG >>>>>>> [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default >>>>>>> task-1) [] SsoRestApiAuthFilter authenticating with sso >>>>>>> 2019-03-25 07:48:33,123-04 DEBUG >>>>>>> [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default >>>>>>> task-1) [] SsoRestApiAuthFilter authenticating using BEARER header >>>>>>> 2019-03-25 07:48:33,123-04 DEBUG >>>>>>> [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default >>>>>>> task-1) [] SsoRestApiAuthFilter successfully authenticated using BEARER >>>>>>> header >>>>>>> 2019-03-25 07:48:33,123-04 DEBUG >>>>>>> [org.ovirt.engine.core.aaa.filters.SsoRestApiNegotiationFilter] >>>>>>> (default task-1) [] Entered SsoRestApiNegotiationFilter >>>>>>> 2019-03-25 07:48:33,124-04 DEBUG >>>>>>> [org.ovirt.engine.core.aaa.filters.SsoRestApiNegotiationFilter] >>>>>>> (default task-1) [] SsoRestApiNegotiationFilter Not performing >>>>>>> Negotiate Auth >>>>>>> 2019-03-25 07:48:33,143-04 DEBUG >>>>>>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] >>>>>>> (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled >>>>>>> stored procedure. 
Call string is [{call >>>>>>> getclusterforuserbyclustername(?, ?, ?)}] >>>>>>> 2019-03-25 07:48:33,143-04 DEBUG >>>>>>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] >>>>>>> (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for >>>>>>> procedure [GetClusterForUserByClusterName] compiled >>>>>>> 2019-03-25 07:48:33,145-04 DEBUG >>>>>>> [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] >>>>>>> (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: >>>>>>> runQuery, params: [GetClusterByName, >>>>>>> NameQueryParameters:{refresh='false', filtered='false'}], timeElapsed: >>>>>>> 7ms >>>>>>> 2019-03-25 07:48:33,161-04 DEBUG >>>>>>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] >>>>>>> (default task-1) [] Compiled stored procedure. Call string is [{call >>>>>>> getvmstaticbyname(?)}] >>>>>>> 2019-03-25 07:48:33,161-04 DEBUG >>>>>>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] >>>>>>> (default task-1) [] SqlCall for procedure [GetVmStaticByName] compiled >>>>>>> 2019-03-25 07:48:33,206-04 DEBUG >>>>>>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] >>>>>>> (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled >>>>>>> stored procedure. 
Call string is [{call getvdsbyname(?)}] >>>>>>> 2019-03-25 07:48:33,206-04 DEBUG >>>>>>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] >>>>>>> (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for >>>>>>> procedure [GetVdsByName] compiled >>>>>>> 2019-03-25 07:48:33,223-04 DEBUG >>>>>>> [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] >>>>>>> (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: >>>>>>> getByName, params: [lago-basic-suite-4-3-host-1], timeElapsed: 20ms >>>>>>> 2019-03-25 07:48:33,225-04 DEBUG >>>>>>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] >>>>>>> (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled >>>>>>> stored procedure. Call string is [{call getvdsbyhostname(?)}] >>>>>>> 2019-03-25 07:48:33,225-04 DEBUG >>>>>>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] >>>>>>> (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for >>>>>>> procedure [GetVdsByHostName] compiled >>>>>>> 2019-03-25 07:48:33,252-04 DEBUG >>>>>>> [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] >>>>>>> (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: >>>>>>> getAllForHostname, params: [lago-basic-suite-4-3-host-1], timeElapsed: >>>>>>> 28ms >>>>>>> 2019-03-25 07:48:33,254-04 DEBUG >>>>>>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] >>>>>>> (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled >>>>>>> stored procedure. 
Call string is [{call getstorage_poolsbyclusterid(?)}] >>>>>>> 2019-03-25 07:48:33,254-04 DEBUG >>>>>>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] >>>>>>> (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for >>>>>>> procedure [Getstorage_poolsByClusterId] compiled >>>>>>> 2019-03-25 07:48:33,262-04 DEBUG >>>>>>> [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) >>>>>>> [6baa2a84-b322-42fd-855d-b25272a19a23] Connecting >>>>>>> 'root@lago-basic-suite-4-3-host-1' >>>>>>> 2019-03-25 07:48:33,461-04 DEBUG >>>>>>> [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) >>>>>>> [6baa2a84-b322-42fd-855d-b25272a19a23] Connected: >>>>>>> 'root@lago-basic-suite-4-3-host-1' >>>>>>> 2019-03-25 07:48:33,467-04 DEBUG >>>>>>> [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) >>>>>>> [6baa2a84-b322-42fd-855d-b25272a19a23] Authenticating: >>>>>>> 'root@lago-basic-suite-4-3-host-1' >>>>>>> 2019-03-25 07:48:33,599-04 DEBUG >>>>>>> [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) >>>>>>> [6baa2a84-b322-42fd-855d-b25272a19a23] Authenticated: >>>>>>> 'root@lago-basic-suite-4-3-host-1' >>>>>>> 2019-03-25 07:48:33,600-04 DEBUG >>>>>>> [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) >>>>>>> [6baa2a84-b322-42fd-855d-b25272a19a23] Executing: '/usr/bin/vdsm-tool >>>>>>> vdsm-id' >>>>>>> 2019-03-25 07:48:33,902-04 WARN >>>>>>> [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) >>>>>>> [6baa2a84-b322-42fd-855d-b25272a19a23] Failed to initiate vdsm-id >>>>>>> request on host: Command returned failure code 127 during SSH session >>>>>>> 'root@lago-basic-suite-4-3-host-1' >>>>>>> 2019-03-25 07:48:33,920-04 INFO >>>>>>> [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) >>>>>>> [6baa2a84-b322-42fd-855d-b25272a19a23] Running command: AddVdsCommand >>>>>>> internal: false. 
Entities affected : ID: >>>>>>> 319fcdf5-5941-46b1-a75c-d335bf500f82 Type: ClusterAction group >>>>>>> CREATE_HOST with role type ADMIN >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Mon, Mar 25, 2019 at 12:53 PM Dafna Ron <d...@redhat.com> wrote: >>>>>>> >>>>>>>> We have another failing patch with the same test: >>>>>>>> http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/360/ >>>>>>>> >>>>>>>> It's obviously not related to the patch, but something is causing >>>>>>>> these failures randomly. >>>>>>>> From what I can see in the current failed job, you are correct about >>>>>>>> host-1: the engine does not even try to deploy it. >>>>>>>> >>>>>>>> I can see host-1 is getting error 127 in lago for ntpd, and I can >>>>>>>> see NetworkManager errors in the host's messages log. >>>>>>>> >>>>>>>> In the host messages log I can see several messages that I suspect >>>>>>>> may cause issues in communication between the engine and the host: >>>>>>>> >>>>>>>> Mar 25 07:50:09 lago-basic-suite-4-3-host-1 sasldblistusers2: >>>>>>>> _sasldb_getkeyhandle has failed >>>>>>>> Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error >>>>>>>> deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data >>>>>>>> pair >>>>>>>> found >>>>>>>> Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error >>>>>>>> deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data >>>>>>>> pair >>>>>>>> found >>>>>>>> Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error >>>>>>>> deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data >>>>>>>> pair >>>>>>>> found >>>>>>>> >>>>>>>> >>>>>>>> ces/12) >>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: >>>>>>>> bondscan-UMJa2S is being created... >>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: >>>>>>>> bondscan-UMJa2S is being deleted...
>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: >>>>>>>> <info> [1553514613.7774] manager: (bondscan-UMJa2S): new Bond device >>>>>>>> (/org/freedesktop/NetworkManager/Devices/13) >>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered >>>>>>>> net_device) (unregistering): Released all slaves >>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: >>>>>>>> bondscan-liwvMR is being created... >>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: >>>>>>>> bondscan-liwvMR: option fail_over_mac: invalid value (3) >>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: >>>>>>>> bondscan-liwvMR: option ad_select: invalid value (3) >>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: >>>>>>>> bondscan-liwvMR: option lacp_rate: mode dependency failed, not >>>>>>>> supported in >>>>>>>> mode balance-rr(0) >>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: >>>>>>>> bondscan-liwvMR: option arp_validate: invalid value (7) >>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: >>>>>>>> bondscan-liwvMR: option xmit_hash_policy: invalid value (5) >>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: >>>>>>>> <info> [1553514613.8002] manager: (bondscan-liwvMR): new Bond device >>>>>>>> (/org/freedesktop/NetworkManager/Devices/14) >>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: >>>>>>>> bondscan-liwvMR: option primary_reselect: invalid value (3) >>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: >>>>>>>> bondscan-liwvMR: option arp_all_targets: invalid value (2) >>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: >>>>>>>> bondscan-liwvMR is being deleted... >>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered >>>>>>>> net_device) (unregistering): Released all slaves >>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: >>>>>>>> bondscan-liwvMR is being created... 
>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: >>>>>>>> bondscan-liwvMR: option fail_over_mac: invalid value (3) >>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: >>>>>>>> bondscan-liwvMR: option ad_select: invalid value (3) >>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: >>>>>>>> bondscan-liwvMR: option lacp_rate: mode dependency failed, not >>>>>>>> supported in >>>>>>>> mode active-backup(1) >>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: >>>>>>>> bondscan-liwvMR: option arp_validate: invalid value (7) >>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: >>>>>>>> <info> [1553514613.8429] manager: (bondscan-liwvMR): new Bond device >>>>>>>> (/org/freedesktop/NetworkManager/Devices/15) >>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: >>>>>>>> bondscan-liwvMR: option xmit_hash_policy: invalid value (5) >>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: >>>>>>>> bondscan-liwvMR: option primary_reselect: invalid value (3) >>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: >>>>>>>> bondscan-liwvMR: option arp_all_targets: invalid value (2) >>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: >>>>>>>> bondscan-liwvMR is being deleted... >>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered >>>>>>>> net_device) (unregistering): Released all slaves >>>>>>>> >>>>>>>> >>>>>>>> On Mon, Mar 25, 2019 at 11:24 AM Marcin Sobczyk < >>>>>>>> msobc...@redhat.com> wrote: >>>>>>>> >>>>>>>>> For the failed job, the engine didn't even try to deploy on host-1: >>>>>>>>> >>>>>>>>> >>>>>>>>> https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-4.3/post-002_bootstrap.py/lago-basic-suite-4-3-engine/_var_log/ovirt-engine/host-deploy/ >>>>>>>>> >>>>>>>>> Martin, do you know what could be the reason for that? 
>>>>>>>>> >>>>>>>>> I can see in the logs for both successful and unsuccessful >>>>>>>>> basic-suite-4.3 runs that there is no 'ntpdate' on host-1: >>>>>>>>> >>>>>>>>> 2019-03-25 10:14:46,350::ssh.py::ssh::58::lago.ssh::DEBUG::Running >>>>>>>>> d0c49b54 on lago-basic-suite-4-3-host-1: ntpdate -4 >>>>>>>>> lago-basic-suite-4-3-engine >>>>>>>>> 2019-03-25 10:14:46,383::ssh.py::ssh::81::lago.ssh::DEBUG::Command >>>>>>>>> d0c49b54 on lago-basic-suite-4-3-host-1 returned with 127 >>>>>>>>> 2019-03-25 10:14:46,384::ssh.py::ssh::96::lago.ssh::DEBUG::Command >>>>>>>>> d0c49b54 on lago-basic-suite-4-3-host-1 errors: >>>>>>>>> bash: ntpdate: command not found >>>>>>>>> >>>>>>>>> On host-0 everything is OK: >>>>>>>>> >>>>>>>>> 2019-03-25 10:14:46,917::ssh.py::ssh::58::lago.ssh::DEBUG::Running >>>>>>>>> d11b2a64 on lago-basic-suite-4-3-host-0: ntpdate -4 >>>>>>>>> lago-basic-suite-4-3-engine >>>>>>>>> 2019-03-25 10:14:53,088::ssh.py::ssh::81::lago.ssh::DEBUG::Command >>>>>>>>> d11b2a64 on lago-basic-suite-4-3-host-0 returned with 0 >>>>>>>>> 2019-03-25 10:14:53,088::ssh.py::ssh::89::lago.ssh::DEBUG::Command >>>>>>>>> d11b2a64 on lago-basic-suite-4-3-host-0 output: >>>>>>>>> 25 Mar 06:14:53 ntpdate[6646]: adjust time server 192.168.202.2 >>>>>>>>> offset 0.017150 sec >>>>>>>>> >>>>>>>>> On 3/25/19 10:13 AM, Eyal Edri wrote: >>>>>>>>> >>>>>>>>> Still fails, now on a different component.
( >>>>>>>>> ovirt-web-ui-extentions ) >>>>>>>>> >>>>>>>>> https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/ >>>>>>>>> >>>>>>>>> On Fri, Mar 22, 2019 at 3:59 PM Dan Kenigsberg <dan...@redhat.com> >>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Fri, Mar 22, 2019 at 3:21 PM Marcin Sobczyk < >>>>>>>>>> msobc...@redhat.com> wrote: >>>>>>>>>> >>>>>>>>>>> Dafna, >>>>>>>>>>> >>>>>>>>>>> in 'verify_add_hosts' we specifically wait for a single host to be >>>>>>>>>>> up with a timeout: >>>>>>>>>>> >>>>>>>>>>> 144 up_hosts = hosts_service.list(search='datacenter={} AND >>>>>>>>>>> status=up'.format(DC_NAME)) >>>>>>>>>>> 145 if len(up_hosts): >>>>>>>>>>> 146 return True >>>>>>>>>>> >>>>>>>>>>> The log files say that it took ~50 secs for one of the hosts to >>>>>>>>>>> be up (seems reasonable) and no timeout is reported. >>>>>>>>>>> Just after running 'verify_add_hosts', we run >>>>>>>>>>> 'add_master_storage_domain', which calls the '_hosts_in_dc' function. >>>>>>>>>>> That function does the exact same check, but it fails: >>>>>>>>>>> >>>>>>>>>>> 113 hosts = hosts_service.list(search='datacenter={} AND >>>>>>>>>>> status=up'.format(dc_name)) >>>>>>>>>>> 114 if hosts: >>>>>>>>>>> 115 if random_host: >>>>>>>>>>> 116 return random.choice(hosts) >>>>>>>>>>> >>>>>>>>>>> I don't think it is relevant to our current failure, but I >>>>>>>>>> consider random_host=True a bad practice. As if we do not have >>>>>>>>>> enough >>>>>>>>>> moving parts, we are adding intentional randomness. Reproducibility >>>>>>>>>> is far >>>>>>>>>> more important than coverage - particularly for a shared system test >>>>>>>>>> like >>>>>>>>>> OST. >>>>>>>>>> >>>>>>>>>>> 117 else: >>>>>>>>>>> 118 return sorted(hosts, key=lambda host: host.name) >>>>>>>>>>> 119 raise RuntimeError('Could not find hosts that are up in DC >>>>>>>>>>> %s' % dc_name) >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> I'm also not able to reproduce this issue locally on my server.
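Dan's suggestion above (do the 'up' filtering in the test itself so each host's status gets logged, and drop the randomness) could look roughly like this. This is only a sketch: the 'Host' records and names below are hypothetical stand-ins for what 'hosts_service.list()' would return without the server-side 'status=up' filter.

```python
from collections import namedtuple

# Hypothetical stand-in for the SDK host objects; in the real suite these
# would come from hosts_service.list() *without* a server-side 'status=up'
# filter, so every host's state is visible to the test.
Host = namedtuple('Host', ['name', 'status'])

def hosts_in_dc(all_hosts, log=print):
    """Filter for 'up' hosts client-side, logging each host's status so a
    failure leaves a record of what every host was doing at the time."""
    for host in all_hosts:
        log('host %s is %s' % (host.name, host.status))
    up_hosts = [h for h in all_hosts if h.status == 'up']
    if not up_hosts:
        raise RuntimeError('Could not find hosts that are up')
    # Deterministic order instead of random.choice(), for reproducibility.
    return sorted(up_hosts, key=lambda h: h.name)

hosts = [
    Host('lago-basic-suite-4-3-host-1', 'installing'),
    Host('lago-basic-suite-4-3-host-0', 'up'),
]
print([h.name for h in hosts_in_dc(hosts)])  # ['lago-basic-suite-4-3-host-0']
```

With this shape, the next failure would at least show what state host-1 was in when '_hosts_in_dc' came up empty.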
> >>>>>>>>>>> The investigation continues... >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> I think that it would be fair to take the filtering by host state >>>>>>>>>> out of Engine and into the test, where we can easily log the current >>>>>>>>>> status >>>>>>>>>> of each host. Then we'd have a better understanding of the next >>>>>>>>>> failure. >>>>>>>>>> >>>>>>>>>> On 3/22/19 1:17 PM, Marcin Sobczyk wrote: >>>>>>>>>>> >>>>>>>>>>> Hi, >>>>>>>>>>> >>>>>>>>>>> sure, I'm on it - it's weird though, I did run the 4.3 basic suite >>>>>>>>>>> for this patch manually and everything was ok. >>>>>>>>>>> On 3/22/19 1:05 PM, Dafna Ron wrote: >>>>>>>>>>> >>>>>>>>>>> Hi, >>>>>>>>>>> >>>>>>>>>>> We are failing branch 4.3 for test: >>>>>>>>>>> 002_bootstrap.add_master_storage_domain >>>>>>>>>>> >>>>>>>>>>> It seems that on one of the hosts vdsm is not starting; >>>>>>>>>>> there is nothing in vdsm.log or in supervdsm.log. >>>>>>>>>>> >>>>>>>>>>> CQ identified this patch as the suspected root cause: >>>>>>>>>>> >>>>>>>>>>> https://gerrit.ovirt.org/#/c/98748/ - vdsm: client: Add support >>>>>>>>>>> for flow id >>>>>>>>>>> >>>>>>>>>>> Milan, Marcin, can you please have a look?
>>>>>>>>>>> >>>>>>>>>>> full logs: >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/326/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-4.3/post-002_bootstrap.py/ >>>>>>>>>>> >>>>>>>>>>> the only error I can see is about host not being up (makes sense >>>>>>>>>>> as vdsm is not running) >>>>>>>>>>> >>>>>>>>>>> Stacktrace >>>>>>>>>>> >>>>>>>>>>> File "/usr/lib64/python2.7/unittest/case.py", line 369, in run >>>>>>>>>>> testMethod() >>>>>>>>>>> File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, >>>>>>>>>>> in runTest >>>>>>>>>>> self.test(*self.arg) >>>>>>>>>>> File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", >>>>>>>>>>> line 142, in wrapped_test >>>>>>>>>>> test() >>>>>>>>>>> File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", >>>>>>>>>>> line 60, in wrapper >>>>>>>>>>> return func(get_test_prefix(), *args, **kwargs) >>>>>>>>>>> File >>>>>>>>>>> "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", >>>>>>>>>>> line 417, in add_master_storage_domain >>>>>>>>>>> add_iscsi_storage_domain(prefix) >>>>>>>>>>> File >>>>>>>>>>> "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", >>>>>>>>>>> line 561, in add_iscsi_storage_domain >>>>>>>>>>> host=_random_host_from_dc(api, DC_NAME), >>>>>>>>>>> File >>>>>>>>>>> "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", >>>>>>>>>>> line 122, in _random_host_from_dc >>>>>>>>>>> return _hosts_in_dc(api, dc_name, True) >>>>>>>>>>> File >>>>>>>>>>> "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", >>>>>>>>>>> line 119, in _hosts_in_dc >>>>>>>>>>> raise RuntimeError('Could not find hosts that are up in DC %s' >>>>>>>>>>> % dc_name) >>>>>>>>>>> 'Could not find hosts that 
are up in DC >>>>>>>>>>> test-dc\n-------------------- >> begin captured logging << >>>>>>>>>>> --------------------\nlago.ssh: DEBUG: start >>>>>>>>>>> task:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get ssh client for >>>>>>>>>>> lago-basic-suite-4-3-engine:\nlago.ssh: DEBUG: end >>>>>>>>>>> task:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get ssh client for >>>>>>>>>>> lago-basic-suite-4-3-engine:\nlago.ssh: DEBUG: Running c07b5ee2 on >>>>>>>>>>> lago-basic-suite-4-3-engine: cat /root/multipath.txt\nlago.ssh: >>>>>>>>>>> DEBUG: Command c07b5ee2 on lago-basic-suite-4-3-engine returned >>>>>>>>>>> with 0\nlago.ssh: DEBUG: Command c07b5ee2 on >>>>>>>>>>> lago-basic-suite-4-3-engine output:\n >>>>>>>>>>> 3600140516f88cafa71243648ea218995\n360014053e28f60001764fed9978ec4b3\n360014059edc777770114a6484891dcf1\n36001405d93d8585a50d43a4ad0bd8d19\n36001405e31361631de14bcf87d43e55a\n\n----------- >>>>>>>>>>> >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> Devel mailing list -- devel@ovirt.org >>>>>>>>>>> To unsubscribe send an email to devel-le...@ovirt.org >>>>>>>>>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/ >>>>>>>>>>> oVirt Code of Conduct: >>>>>>>>>>> https://www.ovirt.org/community/about/community-guidelines/ >>>>>>>>>>> List Archives: >>>>>>>>>>> https://lists.ovirt.org/archives/list/devel@ovirt.org/message/J4NCHXTK5ZYLXWW36DZKAUL5DN7WBNW4/ >>>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>>> Devel mailing list -- devel@ovirt.org >>>>>>>>>> To unsubscribe send an email to devel-le...@ovirt.org >>>>>>>>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/ >>>>>>>>>> oVirt Code of Conduct: >>>>>>>>>> https://www.ovirt.org/community/about/community-guidelines/ >>>>>>>>>> List Archives: >>>>>>>>>> https://lists.ovirt.org/archives/list/devel@ovirt.org/message/ULS4OKU2YZFDQT5EDFYKLW5GFA52YZ7U/ >>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> >>>>>>>>> Eyal edri >>>>>>>>> >>>>>>>>> >>>>>>>>> 
MANAGER >>>>>>>>> >>>>>>>>> RHV/CNV DevOps >>>>>>>>> >>>>>>>>> EMEA VIRTUALIZATION R&D >>>>>>>>> >>>>>>>>> >>>>>>>>> Red Hat EMEA <https://www.redhat.com/> >>>>>>>>> <https://red.ht/sig> TRIED. TESTED. TRUSTED. >>>>>>>>> <https://redhat.com/trusted> >>>>>>>>> phone: +972-9-7692018 >>>>>>>>> irc: eedri (on #tlv #rhev-dev #rhev-integ) >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> Devel mailing list -- devel@ovirt.org >>>>>>>>> To unsubscribe send an email to devel-le...@ovirt.org >>>>>>>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/ >>>>>>>>> oVirt Code of Conduct: >>>>>>>>> https://www.ovirt.org/community/about/community-guidelines/ >>>>>>>>> List Archives: >>>>>>>>> https://lists.ovirt.org/archives/list/devel@ovirt.org/message/XOP6ZCUIDUVC2XNVBS2X7OAHGOXJZROL/ >>>>>>>>> >>>>>>>> >>> >>> -- >>> >>> SANDRO BONAZZOLA >>> >>> MANAGER, SOFTWARE ENGINEERING, EMEA R&D RHV >>> >>> Red Hat EMEA <https://www.redhat.com/> >>> >>> sbona...@redhat.com >>> <https://red.ht/sig> >>> _______________________________________________ >>> Devel mailing list -- devel@ovirt.org >>> To unsubscribe send an email to devel-le...@ovirt.org >>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/ >>> oVirt Code of Conduct: >>> https://www.ovirt.org/community/about/community-guidelines/ >>> List Archives: >>> https://lists.ovirt.org/archives/list/devel@ovirt.org/message/SR7YJIIZHM36KRKI2XJYBEEI6CU4PCKF/ >>> >> >> >> -- >> >> Eyal edri >> >> >> MANAGER >> >> RHV/CNV DevOps >> >> EMEA VIRTUALIZATION R&D >> >> >> Red Hat EMEA <https://www.redhat.com/> >> <https://red.ht/sig> TRIED. TESTED. TRUSTED. 
<https://redhat.com/trusted> >> phone: +972-9-7692018 >> irc: eedri (on #tlv #rhev-dev #rhev-integ) >> _______________________________________________ >> Devel mailing list -- devel@ovirt.org >> To unsubscribe send an email to devel-le...@ovirt.org >> Privacy Statement: https://www.ovirt.org/site/privacy-policy/ >> oVirt Code of Conduct: >> https://www.ovirt.org/community/about/community-guidelines/ >> List Archives: >> https://lists.ovirt.org/archives/list/devel@ovirt.org/message/ABB6UAM37JOSSDIJLHQC3RUER7ADT3GZ/ >> >
_______________________________________________ Devel mailing list -- devel@ovirt.org To unsubscribe send an email to devel-le...@ovirt.org Privacy Statement: https://www.ovirt.org/site/privacy-policy/ oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/ List Archives: https://lists.ovirt.org/archives/list/devel@ovirt.org/message/DMBBLXJFPBIFEM2M7FVPMXAKXISYPNMB/