Actually, I can see in the engine log that it is trying to start the vdsm installation on the host:
2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] Entered SsoRestApiAuthFilter
2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] SsoRestApiAuthFilter authenticating with sso
2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] SsoRestApiAuthFilter authenticating using BEARER header
2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] SsoRestApiAuthFilter successfully authenticated using BEARER header
2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiNegotiationFilter] (default task-1) [] Entered SsoRestApiNegotiationFilter
2019-03-25 07:48:33,124-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiNegotiationFilter] (default task-1) [] SsoRestApiNegotiationFilter Not performing Negotiate Auth
2019-03-25 07:48:33,143-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getclusterforuserbyclustername(?, ?, ?)}]
2019-03-25 07:48:33,143-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [GetClusterForUserByClusterName] compiled
2019-03-25 07:48:33,145-04 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: runQuery, params: [GetClusterByName, NameQueryParameters:{refresh='false', filtered='false'}], timeElapsed: 7ms
2019-03-25 07:48:33,161-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [] Compiled stored procedure. Call string is [{call getvmstaticbyname(?)}]
2019-03-25 07:48:33,161-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [] SqlCall for procedure [GetVmStaticByName] compiled
2019-03-25 07:48:33,206-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getvdsbyname(?)}]
2019-03-25 07:48:33,206-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [GetVdsByName] compiled
2019-03-25 07:48:33,223-04 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: getByName, params: [lago-basic-suite-4-3-host-1], timeElapsed: 20ms
2019-03-25 07:48:33,225-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getvdsbyhostname(?)}]
2019-03-25 07:48:33,225-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [GetVdsByHostName] compiled
2019-03-25 07:48:33,252-04 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: getAllForHostname, params: [lago-basic-suite-4-3-host-1], timeElapsed: 28ms
2019-03-25 07:48:33,254-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getstorage_poolsbyclusterid(?)}]
2019-03-25 07:48:33,254-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [Getstorage_poolsByClusterId] compiled
2019-03-25 07:48:33,262-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Connecting 'root@lago-basic-suite-4-3-host-1'
2019-03-25 07:48:33,461-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Connected: 'root@lago-basic-suite-4-3-host-1'
2019-03-25 07:48:33,467-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Authenticating: 'root@lago-basic-suite-4-3-host-1'
2019-03-25 07:48:33,599-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Authenticated: 'root@lago-basic-suite-4-3-host-1'
2019-03-25 07:48:33,600-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Executing: '/usr/bin/vdsm-tool vdsm-id'
2019-03-25 07:48:33,902-04 WARN [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Failed to initiate vdsm-id request on host: Command returned failure code 127 during SSH session 'root@lago-basic-suite-4-3-host-1'
2019-03-25 07:48:33,920-04 INFO [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Running command: AddVdsCommand internal: false. Entities affected : ID: 319fcdf5-5941-46b1-a75c-d335bf500f82 Type: ClusterAction group CREATE_HOST with role type ADMIN
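A note on that WARN: exit status 127 is the shell's "command not found" code, so the probe above most likely failed because /usr/bin/vdsm-tool is not present on host-1 yet. The engine treats this as non-fatal and proceeds with AddVdsCommand, as the next log line shows. For reference, a minimal sketch of the same probe done by hand (an illustration only, not the engine's code; it assumes passwordless root SSH to the lago host):

import subprocess

HOST = 'lago-basic-suite-4-3-host-1'  # assumed reachable as root over SSH

def probe_vdsm_id(host):
    # Run the same command the engine runs while adding a host.
    proc = subprocess.Popen(
        ['ssh', 'root@%s' % host, '/usr/bin/vdsm-tool', 'vdsm-id'],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    # ssh propagates the remote exit status; 127 means the remote
    # shell could not find vdsm-tool, i.e. vdsm is not installed.
    return proc.returncode, out, err

if __name__ == '__main__':
    rc, out, err = probe_vdsm_id(HOST)
    print('rc=%d, out=%r, err=%r' % (rc, out, err))

On a host that deploys fine, the same probe should return 0 and print the host's vdsm id.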
On Mon, Mar 25, 2019 at 12:53 PM Dafna Ron <d...@redhat.com> wrote:
We have another failing patch with the same test:
http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/360/
It's obviously not related to the patch, but something is causing these failures randomly.
From what I can see in the current failed job, you are correct: the engine does not even try to deploy host-1.
I can see host-1 is getting error 127 in lago for ntpdate, and I can see NetworkManager errors in the host's messages log.
In the host's messages log I can see several messages that I suspect may cause issues in communication between the engine and the host:
Mar 25 07:50:09 lago-basic-suite-4-3-host-1 sasldblistusers2: _sasldb_getkeyhandle has failed
Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-UMJa2S is being created...
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-UMJa2S is being deleted...
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: <info> [1553514613.7774] manager: (bondscan-UMJa2S): new Bond device (/org/freedesktop/NetworkManager/Devices/13)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered net_device) (unregistering): Released all slaves
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being created...
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option fail_over_mac: invalid value (3)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option ad_select: invalid value (3)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option lacp_rate: mode dependency failed, not supported in mode balance-rr(0)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_validate: invalid value (7)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option xmit_hash_policy: invalid value (5)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: <info> [1553514613.8002] manager: (bondscan-liwvMR): new Bond device (/org/freedesktop/NetworkManager/Devices/14)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option primary_reselect: invalid value (3)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_all_targets: invalid value (2)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being deleted...
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered net_device) (unregistering): Released all slaves
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being created...
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option fail_over_mac: invalid value (3)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option ad_select: invalid value (3)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option lacp_rate: mode dependency failed, not supported in mode active-backup(1)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_validate: invalid value (7)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: <info> [1553514613.8429] manager: (bondscan-liwvMR): new Bond device (/org/freedesktop/NetworkManager/Devices/15)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option xmit_hash_policy: invalid value (5)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option primary_reselect: invalid value (3)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_all_targets: invalid value (2)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being deleted...
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered net_device) (unregistering): Released all slaves
On Mon, Mar 25, 2019 at 11:24 AM Marcin Sobczyk <msobc...@redhat.com> wrote:
For the failed job, the engine didn't even try
to deploy on host-1:
https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-4.3/post-002_bootstrap.py/lago-basic-suite-4-3-engine/_var_log/ovirt-engine/host-deploy/
Martin, do you know what could be the reason for
that?
I can see in the logs for both successful and unsuccessful basic-suite-4.3 runs that there is no 'ntpdate' on host-1:
2019-03-25 10:14:46,350::ssh.py::ssh::58::lago.ssh::DEBUG::Running d0c49b54 on lago-basic-suite-4-3-host-1: ntpdate -4 lago-basic-suite-4-3-engine
2019-03-25 10:14:46,383::ssh.py::ssh::81::lago.ssh::DEBUG::Command d0c49b54 on lago-basic-suite-4-3-host-1 returned with 127
2019-03-25 10:14:46,384::ssh.py::ssh::96::lago.ssh::DEBUG::Command d0c49b54 on lago-basic-suite-4-3-host-1 errors:
bash: ntpdate: command not found
On host-0 everything is OK:
2019-03-25 10:14:46,917::ssh.py::ssh::58::lago.ssh::DEBUG::Running d11b2a64 on lago-basic-suite-4-3-host-0: ntpdate -4 lago-basic-suite-4-3-engine
2019-03-25 10:14:53,088::ssh.py::ssh::81::lago.ssh::DEBUG::Command d11b2a64 on lago-basic-suite-4-3-host-0 returned with 0
2019-03-25 10:14:53,088::ssh.py::ssh::89::lago.ssh::DEBUG::Command d11b2a64 on lago-basic-suite-4-3-host-0 output:
25 Mar 06:14:53 ntpdate[6646]: adjust time server 192.168.202.2 offset 0.017150 sec
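Return code 127 is again bash's "command not found", matching the error output above, so this looks like a provisioning difference between the two host images rather than something the suite does. A minimal sketch of a check one could drop into a scenario to make the difference explicit (report_ntpdate is a hypothetical helper; it assumes lago's vm.ssh() returns a result with .code and .out, as used elsewhere in OST):

def report_ntpdate(prefix):
    # Log explicitly which hosts have the ntpdate package, so a
    # missing package shows up by name instead of as a bare 127.
    for vm in prefix.virt_env.get_vms().values():
        if 'host' not in vm.name():
            continue
        result = vm.ssh(['rpm', '-q', 'ntpdate'])
        print('%s: rpm -q ntpdate -> rc=%d, out=%s' % (
            vm.name(), result.code, result.out.strip()))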
On 3/25/19 10:13 AM, Eyal Edri wrote:
Still fails, now on a different component (ovirt-web-ui-extentions):
https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/
On Fri, Mar 22, 2019 at 3:59 PM Dan Kenigsberg <dan...@redhat.com> wrote:
On Fri, Mar 22, 2019 at 3:21 PM Marcin Sobczyk <msobc...@redhat.com> wrote:
Dafna,
in 'verify_add_hosts' we specifically wait for a single host to be up, with a timeout:

up_hosts = hosts_service.list(
    search='datacenter={} AND status=up'.format(DC_NAME))
if len(up_hosts):
    return True
The log files say that it took ~50 seconds for one of the hosts to come up (which seems reasonable), and no timeout is reported.
Just after running 'verify_add_hosts', we run 'add_master_storage_domain', which calls the '_hosts_in_dc' function. That function does the exact same check, but it fails:

hosts = hosts_service.list(
    search='datacenter={} AND status=up'.format(dc_name))
if hosts:
    if random_host:
        return random.choice(hosts)
I don't think it is relevant to our current failure, but I consider random_host=True bad practice. As if we did not have enough moving parts, we are adding intentional randomness. Reproducibility is far more important than coverage - particularly for a shared system test like OST. (A deterministic variant is sketched just after this snippet.)
    else:
        return sorted(hosts, key=lambda host: host.name)
raise RuntimeError('Could not find hosts that are up in DC %s' % dc_name)
I'm also not able to reproduce this
issue locally on my server. The
investigation continues...
I think it would be fair to take the filtering by host state out of the engine and into the test, where we can easily log the current status of each host. Then we'd have a better understanding of the next failure.
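A minimal sketch of that idea (_hosts_in_dc_logged is a hypothetical helper; it assumes ovirtsdk4's types module and the same hosts service as above): search only by datacenter, log every host's status, and do the "up" filtering in the test itself:

from ovirtsdk4 import types

def _hosts_in_dc_logged(api, dc_name):
    hosts_service = api.system_service().hosts_service()
    all_hosts = hosts_service.list(search='datacenter={}'.format(dc_name))
    # Log each host's state, so the next failure tells us whether a
    # host was still installing, non-operational, rebooting, etc.
    for host in all_hosts:
        print('host %s is %s' % (host.name, host.status))
    up_hosts = [h for h in all_hosts if h.status == types.HostStatus.UP]
    if not up_hosts:
        raise RuntimeError(
            'Could not find hosts that are up in DC %s' % dc_name)
    return sorted(up_hosts, key=lambda host: host.name)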
On 3/22/19 1:17 PM, Marcin Sobczyk wrote:
Hi,
sure, I'm on it - it's weird though, I ran the 4.3 basic suite for this patch manually and everything was OK.
On 3/22/19 1:05 PM, Dafna Ron wrote:
Hi,
We are failing branch 4.3 on the test 002_bootstrap.add_master_storage_domain.
It seems that on one of the hosts vdsm is not starting; there is nothing in vdsm.log or in supervdsm.log.
CQ identified this patch as the
suspected root cause:
https://gerrit.ovirt.org/#/c/98748/ -
vdsm: client: Add support for flow id
Milan, Marcin, can you please have a
look?
full logs:
http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/326/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-4.3/post-002_bootstrap.py/
The only error I can see is about the host not being up (which makes sense, as vdsm is not running):
Stacktrace
File "/usr/lib64/python2.7/unittest/case.py",
line 369, in run
testMethod()
File
"/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
self.test(*self.arg)
File
"/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 142, in
wrapped_test
test()
File
"/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 60, in wrapper
return func(get_test_prefix(), *args, **kwargs)
File
"/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py",
line 417, in add_master_storage_domain
add_iscsi_storage_domain(prefix)
File
"/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py",
line 561, in add_iscsi_storage_domain
host=_random_host_from_dc(api, DC_NAME),
File
"/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py",
line 122, in _random_host_from_dc
return _hosts_in_dc(api, dc_name, True)
File
"/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py",
line 119, in _hosts_in_dc
raise RuntimeError('Could not find hosts that
are up in DC %s' % dc_name)
'Could not find hosts that are up in DC test-dc
-------------------- >> begin captured logging << --------------------
lago.ssh: DEBUG: starttask:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get ssh client for lago-basic-suite-4-3-engine:
lago.ssh: DEBUG: endtask:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get ssh client for lago-basic-suite-4-3-engine:
lago.ssh: DEBUG: Running c07b5ee2 on lago-basic-suite-4-3-engine: cat /root/multipath.txt
lago.ssh: DEBUG: Command c07b5ee2 on lago-basic-suite-4-3-engine returned with 0
lago.ssh: DEBUG: Command c07b5ee2 on lago-basic-suite-4-3-engine output:
3600140516f88cafa71243648ea218995
360014053e28f60001764fed9978ec4b3
360014059edc777770114a6484891dcf1
36001405d93d8585a50d43a4ad0bd8d19
36001405e31361631de14bcf87d43e55a
-----------