Hello everyone,

You may have noticed there is a large Zuul job backlog and changes are not 
getting CI reports as quickly as you might expect. Several factors are 
interacting to cause this. The short version is that one of our clouds is 
performing upgrades and has been removed from service, and we have a large 
number of gate failures which cause gate queues to reset and start jobs over. 
In other words, we have fewer resources than normal and are using them 
inefficiently. Zuul itself is operating as expected.

Continue reading if you'd like to understand the technical details and find out 
how you can help make this better.

Zuul gates related projects in shared queues. Changes enter these queues and 
are ordered into a speculative future state that Zuul assumes will pass, 
because multiple humans have reviewed the changes and approved them (and they 
had to pass check testing first). Problems arise when tests fail, forcing Zuul 
to evict the failing changes from the speculative future state, build a new 
state, and then start jobs over again for this new future.
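
To make the mechanics concrete, here is a toy model of a dependent (gate) 
pipeline. This is purely an illustration of the idea, not Zuul's actual code 
or configuration, and the change names and failure below are made up:

    def gate_round(queue, fails):
        """One pass over the queue; `fails` is the set of changes whose jobs fail."""
        for i, change in enumerate(queue):
            ahead = queue[:i]              # speculative state this change is tested on
            print(f"{change}: testing on top of {ahead or 'master'}")
            if change in fails:
                print(f"{change}: failed, evicting; {queue[i + 1:]} restart their jobs")
                return queue[:i], queue[i + 1:]   # (changes that merge, changes to retest)
        return queue, []                   # no failures: everything merges

    queue = ["A", "B", "C", "D"]
    merged, retest = gate_round(queue, fails={"B"})
    # A merges; B is evicted; C and D rebuild their future state and start over.
    merged += gate_round(retest, fails=set())[0]
    print("merged:", merged)               # ['A', 'C', 'D']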

Typically this doesn't happen often, and we merge many changes at a time, 
quickly pushing code into our repos. Unfortunately, the results are painful 
when we fail often, because we end up rebuilding future states and restarting 
jobs over and over. On top of that, we currently have the gate and release 
pipelines set to the highest priority, so their jobs run before other queues'. 
This means a flaky gate can starve other work. We've configured things this 
way because the gate is not supposed to be flaky: changes there have already 
been reviewed and have passed check testing. One of the tools we have in place 
to make this less painful is that each gate queue operates on a window that 
grows and shrinks, similar to TCP slow start. As changes merge we increase the 
size of the window, and when they fail to merge we decrease it. This reduces 
the size of the future state that must be rebuilt and retested on failure when 
things are persistently flaky.
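
As a rough sketch of how that windowing behaves (the numbers and policy here 
are illustrative, not Zuul's actual defaults):

    class GateWindow:
        """How many changes in a gate queue actively run jobs at once."""

        def __init__(self, floor=3, start=20):   # made-up values
            self.floor = floor
            self.size = start

        def on_merge(self):
            self.size += 1                        # success: widen the window a little

        def on_failed_merge(self):
            self.size = max(self.floor, self.size // 2)   # failure: shrink it quickly

    w = GateWindow()
    w.on_failed_merge()
    w.on_failed_merge()
    print(w.size)   # 5 -- persistent flakiness limits how much work is in flight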

The best way to make this better is to fix the bugs in our software, whether 
that is in the CI system itself or the software being tested. The first step in 
doing that is to identify and track the bugs that we are dealing with. We have 
a tool called elastic-recheck that does this using indexed logs from the jobs. 
The idea there is to go through the list of unclassified failures [0] and 
fingerprint them so that we can track them [1]. With that data available we can 
then prioritize fixing the bugs that have the biggest impact.
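
To give a sense of what "fingerprinting" means here: each tracked bug gets a 
query describing its failure signature, and unclassified failures are matched 
against those queries. The sketch below is a simplification (the real tool 
stores elasticsearch queries over the indexed logs rather than regexes), and 
the bug id and pattern are invented placeholders:

    import re

    # Hypothetical fingerprints: bug id -> pattern matching its failure signature.
    fingerprints = {
        "bug/0000000": re.compile(r"DB migration .* timed out"),   # made-up example
    }

    def classify(console_log):
        """Return the tracked bug a failure matches, or None if unclassified."""
        for bug, pattern in fingerprints.items():
            if pattern.search(console_log):
                return bug
        return None

    print(classify("... DB migration 042 timed out after 600s ..."))   # bug/0000000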

Unfortunately, right now our classification rate is very poor (only 15%), 
which makes it difficult to know exactly what is causing these failures. 
mriedem and I have quickly scanned the unclassified list, and it appears there 
is a db migration testing issue causing those tests to time out across several 
projects. mriedem is working to get this classified and tracked, which should 
help, but we will also need to fix the bug. On top of that, it appears that 
Glance has flaky functional tests (under both python2 and python3) which are 
causing resets and should be looked into.

If you'd like to help, let mriedem or myself know and we'll gladly work with 
you to get elasticsearch queries added to elastic-recheck. We are likely to be 
less help when it comes to fixing Glance's functional tests, but I'm happy to 
point people in the right direction for that as much as I can. If you can take 
a few minutes to do this before/after you issue a recheck, it helps quite a 
bit.

One general thing I've found would be helpful is for projects to clean up the 
deprecation warnings in their log output. The persistent "WARNING you used the 
old name for a thing" messages make the logs large and make it much harder to 
find the actual failures.
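
As a miniature illustration of the kind of noise I mean (the names here are 
invented for the example): code that keeps calling something by its old name 
emits a warning on every use, and at the scale of a full job log those lines 
add up fast.

    import warnings
    warnings.simplefilter("always", DeprecationWarning)   # show every occurrence, as job logs do

    def new_thing():
        return "ok"

    def old_thing():   # deprecated alias kept around for compatibility
        warnings.warn("old_thing is deprecated, use new_thing instead",
                      DeprecationWarning, stacklevel=2)
        return new_thing()

    for _ in range(3):
        old_thing()    # three warnings in the log; switching to new_thing() removes them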

As a final note, this is largely targeted at the OpenStack Integrated gate 
(Nova, Glance, Cinder, Keystone, Swift, Neutron) since that appears to be 
particularly flaky at the moment. The Zuul behavior described above applies to 
other gate pipelines (OSA, TripleO, Airship, etc.) as well, as does 
elastic-recheck and the related tooling. If you find your particular pipeline 
is flaky, I'm more than happy to help in that context too.

[0] http://status.openstack.org/elastic-recheck/data/integrated_gate.html
[1] http://status.openstack.org/elastic-recheck/gate.html

Thank you,
Clark
