On Wed, 2014-07-23 at 14:39 -0700, James E. Blair wrote:
> OpenStack has a substantial CI system that is core to its development
> process. The goals of the system are to facilitate merging good code,
> prevent regressions, and ensure that there is at least one
> configuration of upstream OpenStack that we know works as a whole. The
> "project gating" technique that we use is effective at preventing many
> kinds of regressions from landing, but more subtle, non-deterministic
> bugs can still get through, and these are the bugs that are currently
> plaguing developers with seemingly random test failures.
>
> Most of these bugs are not failures of the test system; they are real
> bugs. Many of them have even been in OpenStack for a long time, but
> are only becoming visible now due to improvements in our tests. That
> is not much help to developers whose patches are being hit with
> negative test results from unrelated failures. We need to find a way
> to address the non-deterministic bugs that are lurking in OpenStack
> without making it easier for new bugs to creep in.
>
> The CI system and project infrastructure are not static. They have
> evolved with the project to get to where they are today, and the
> challenge now is to continue to evolve them to address the problems
> we're seeing. The QA and Infrastructure teams recently hosted a sprint
> where we discussed some of these issues in depth. This post from Sean
> Dague goes into a bit of the background: [1]. The rest of this email
> outlines the medium- and long-term changes we would like to make to
> address these problems.
>
> [1] https://dague.net/2014/07/22/openstack-failures/
>
> ==Things we're already doing==
>
> The elastic-recheck tool [2] is used to identify "random" failures in
> test runs. It tries to match failures to known bugs using signatures
> created from log messages. It helps developers prioritize bugs by how
> frequently they manifest as test failures. It also collects
> information on unclassified errors -- we can see how many (and which)
> test runs failed for an unknown reason, and our overall progress on
> finding fingerprints for random failures.
>
> [2] http://status.openstack.org/elastic-recheck/
>
> We added a feature to Zuul that lets us manually "promote" changes to
> the top of the gate pipeline. When the QA team identifies a change
> that fixes a bug affecting overall gate stability, we can move that
> change to the top of the queue so that it may merge more quickly.
>
> We added the clean check facility in reaction to the January gate
> breakdown. While it does mean that any individual patch might see more
> test runs, it has largely kept the gate queue at a countable number of
> hours, instead of letting it regularly grow to more than a work day in
> length. It also means that a developer can Approve a code merge before
> tests have returned, without ruining it for everyone else if the tests
> turn out to catch a bug.
>
> ==Future changes==
>
> ===Communication===
>
> We used to be better at communicating about the CI system. As it and
> the project grew, we incrementally added to our institutional
> knowledge, but we haven't been good about maintaining that information
> in a form that new or existing contributors can consume to understand
> what's going on and why.
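As an aside on the elastic-recheck fingerprinting mentioned above: the
matching idea boils down to roughly the sketch below. This is
hypothetical code, not elastic-recheck's implementation (its real
signatures are, roughly speaking, Elasticsearch queries kept one per
bug), and the bug numbers and log patterns here are invented.

    import re

    # Invented fingerprints mapping a log-message regex to a bug id.
    FINGERPRINTS = {
        "bug 1000001": re.compile(
            r"Timed out waiting for server .* to become ACTIVE"),
        "bug 1000002": re.compile(r"Lock wait timeout exceeded"),
    }

    def classify_failure(log_lines):
        """Return ids of known bugs whose signatures appear in the
        logs of a failed run; an empty result means 'unclassified'."""
        hits = set()
        for line in log_lines:
            for bug, pattern in FINGERPRINTS.items():
                if pattern.search(line):
                    hits.add(bug)
        return hits

    sample = ["Timed out waiting for server abc123 to become ACTIVE"]
    print(classify_failure(sample) or "unclassified")

The "unclassified" bucket is exactly what the mail describes: it
measures how much of the failure rate hasn't yet been pinned on a
known bug.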
>
> We have started on a major effort in that direction that we call the
> "infra-manual" project -- it's designed to be a comprehensive "user
> manual" for the project infrastructure, including the CI process. Even
> before that project is complete, we will write a document that
> summarizes the CI system and ensure it is included in new developer
> documentation and linked to from test results.
>
> There are also a number of ways for people to get involved in the CI
> system, whether focused on Infrastructure or QA, but it is not always
> clear how to do so. We will improve our documentation to highlight how
> to contribute.
>
> ===Fixing Faster===
>
> We introduce bugs to OpenStack at a roughly constant rate, and they
> pile up over time. Our systems currently treat all changes as equally
> risky and equally important to the health of the system, which makes
> landing code changes that fix key bugs slow when we're at a high reset
> rate. We have a manual process for promoting changes today to get
> around this, but it is quite costly in people's time and requires
> getting all the right people together at once. You can see a number of
> the changes we promoted during the gate storm in June [3]; it took no
> small number of fixes to get us back to a reasonably passing gate. We
> think that optimizing this system will help us land fixes to critical
> bugs faster.
>
> [3] https://etherpad.openstack.org/p/gatetriage-june2014
>
> The basic idea is to use the data from elastic-recheck to identify
> that a patch fixes a critical gate-related bug. When one of these is
> found in the queues, it will be given higher priority, including
> bubbling up to the top of the gate queue automatically. The manual
> promote process should no longer be needed; instead, patches fixing
> elastic-recheck-tracked bugs will be promoted automatically.
>
> At the same time, we'll also promote review of critical gate bugs by
> making them visible in a number of different channels (such as the
> elastic-recheck pages, review day, and the Gerrit dashboards). The
> idea here, again, is to make the reviews that fix key bugs pop to the
> top of everyone's views.
>
> ===Testing more tactically===
>
> One of the challenges that exists today is that we have basically two
> levels of testing in most of OpenStack: unit tests, and running a
> whole OpenStack cloud. Over time we've focused on adding more and more
> configurations and tests to the latter, but as we've seen, when things
> fail in a whole OpenStack cloud, getting to the root cause is often
> quite hard -- so hard, in fact, that most people throw up their hands
> and just run 'recheck'. If a test run fails, and no one looks at why,
> does it provide any value?
>
> We need to get to a balance where we are testing that OpenStack works
> as a whole in some configuration, but as we've seen, even our best and
> brightest can't seem to make OpenStack reliably boot a compute
> instance with working networking 100% of the time if we happen to be
> running more than one API request at once.
>
> Getting there is a multi-party process:
>
> * Reduce the gating configurations down to some gold standard
>   configuration(s). This will be a small number of configurations
>   that we all agree everything will gate on. This means things like
>   postgresql, cells, and different environments will all get dropped
>   from the gate as we know it.
>
> * Put the burden for a bunch of these tests back on the projects as
>   "functional" tests.
>   Basically, a custom devstack environment that a project can create
>   with a set of services that they minimally need to do their job.
>   These functional tests will live in the project tree, not in
>   Tempest, so they can be landed atomically as part of the project's
>   normal development process.
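To make the in-tree functional test idea concrete, here is a sketch of
what one could look like, assuming a minimal devstack is already
running. Every specific in it -- the endpoint, the database DSN, and
the table and column names -- is made up for illustration.

    import unittest

    import requests
    import sqlalchemy

    API = "http://127.0.0.1:8774/v2"                   # made up
    DB = "mysql+pymysql://root:secret@127.0.0.1/nova"  # made up

    class ServerCreateFunctionalTest(unittest.TestCase):
        """Runs against a minimal devstack with only the services this
        project needs. Because the test lives in the project tree, it
        can peek at internal state (the DB) as well as drive the public
        API -- something a black-box suite like Tempest cannot do."""

        def test_create_inserts_instance_row(self):
            # Drive the public API the way a client would ...
            resp = requests.post(
                API + "/servers",
                json={"server": {"name": "t1", "flavorRef": "1",
                                 "imageRef": "cirros"}})
            self.assertEqual(202, resp.status_code)
            server_id = resp.json()["server"]["id"]

            # ... then assert on internal (white-box) state: the
            # instance row must exist once the API accepts the request.
            engine = sqlalchemy.create_engine(DB)
            with engine.connect() as conn:
                row = conn.execute(
                    sqlalchemy.text(
                        "SELECT vm_state FROM instances WHERE uuid = :u"),
                    {"u": server_id}).fetchone()
            self.assertIsNotNone(row)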
We do this in Solum and I really like it. It's nice for the same
reviewers to see the functional tests and the code that implements a
feature. One downside is that we have had failures due to Tempest
reworking their client code. This hasn't happened for a while, but it
would be good for Tempest to recognize that people are using Tempest
as a library and to maintain a stable API.

-Angus

> * For all non-gold-standard configurations, we'll dedicate a part of
>   our infrastructure to running them in a continuous background loop,
>   as well as making these configs available as experimental jobs. The
>   idea here is that we'll actually be able to provide more
>   configurations that are operating in a more traditional CI
>   (post-merge) context. People who are interested in keeping these
>   bits functional can monitor those jobs and help with fixes when
>   needed. The experimental jobs mean that if developers are concerned
>   about the effect of a particular change on one of these configs,
>   it's easy to request a pre-merge test run. In the near term we
>   might imagine this would allow for things like ceph, mongodb,
>   docker, and possibly very new libvirt to be validated in some way
>   upstream.
>
> * Provide some kind of easy-to-view dashboards for these jobs, as
>   well as a policy that if some job is failing for more than some
>   period of time, it's removed from the system. We want to provide
>   whatever feedback we can to engaged parties, but people do need to
>   realize that engagement is key. The biggest part of putting tests
>   into OpenStack isn't landing the tests, but dealing with their
>   failures.
>
> * Encourage projects to specifically land interface tests in other
>   projects when they depend on certain behavior.
>
> Let's imagine an example of how this works in the real world.
>
> * The heat-slow job is deleted.
>
> * The Heat team creates a specific functional job which tests some of
>   Heat's deeper functionality. All the tests live in Heat, and
>   because of this they can include white/grey-box testing of the DB
>   and queues while things are progressing.
>
> * Nova lands a change which neither Tempest nor our configs exercise,
>   but which breaks Heat.
>
> * The Heat project can now decide whether it's more important to keep
>   the test in place (preventing them from landing code), or to skip
>   it to get back to work.
>
> * The Heat team then works on the right fix for Nova, or communicates
>   with the Nova team on the issue at hand. The fix to Nova should
>   *also* include tests which lock down that interface so that Nova
>   won't break it again in the future (the Ironic team did this with
>   their test_ironic_contract patch). These tests could be unit tests,
>   if the interface is testable that way, or functional tests in the
>   Nova tree.
>
> * The Heat team is then back in business.
>
> This approach puts more of the control over when a project is blocked
> back into that project's own hands. Tempest remains a final
> integration test to ensure that the basics of the whole stack work
> together, but each project also has a vertical testing stack which is
> specific to it.
>
> ==Final thoughts==
>
> The current rate of test failures and subsequent rechecks is not
> sustainable in the long term. It's not good for contributors,
> reviewers, or the overall project quality. While these bugs do need
> to be addressed, it's unlikely that the current process will cause
> that to happen.
> Instead, we want to push more substantial testing into the projects
> themselves with functional and interface testing, and depend less on
> devstack-gate integration tests to catch all bugs. This should help
> us catch bugs closer to the source and in an environment where
> debugging is easier. We also want to reduce the scope of the
> devstack-gate tests to a gold standard, while running tests of other
> configurations in a traditional CI process so that people interested
> in those configurations can focus on ensuring they work.
>
> Thanks,
>
> Jim and Sean
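One more thought on the interface "contract" tests mentioned above:
the shape of such a test might be something like the sketch below.
This is purely illustrative -- it is not the actual
test_ironic_contract patch, and every name in it is made up. The
function is defined inline here as a stand-in; in the producing
project's tree the test would import the real code instead.

    import inspect
    import unittest

    def create_stack(name, template, parameters=None, timeout_mins=60):
        """Stand-in for a public entry point another project calls."""

    class InterfaceContractTest(unittest.TestCase):
        """Lock down the parts of an interface that another project
        depends on, so a refactor fails here -- with a pointer to the
        consumer -- rather than breaking that project's gate."""

        def test_create_stack_signature_is_stable(self):
            self.assertEqual(
                ["name", "template", "parameters", "timeout_mins"],
                list(inspect.signature(create_stack).parameters),
                "Another project depends on this signature; "
                "coordinate with its team before changing it.")

    if __name__ == "__main__":
        unittest.main()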
_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev