On Wed, 2014-07-23 at 14:39 -0700, James E. Blair wrote:
> OpenStack has a substantial CI system that is core to its development
> process. The goals of the system are to facilitate merging good code,
> prevent regressions, and ensure that there is at least one
> configuration of upstream OpenStack that we know works as a whole. The
> "project gating" technique that we use is effective at preventing many
> kinds of regressions from landing, but more subtle, non-deterministic
> bugs can still get through, and these are the bugs that are currently
> plaguing developers with seemingly random test failures.
>
> Most of these bugs are not failures of the test system; they are real
> bugs. Many of them have even been in OpenStack for a long time, but
> are only becoming visible now due to improvements in our tests. That
> is not much help to developers whose patches are being hit with
> negative test results from unrelated failures. We need to find a way
> to address the non-deterministic bugs that are lurking in OpenStack
> without making it easier for new bugs to creep in.
>
> The CI system and project infrastructure are not static. They have
> evolved with the project to get to where they are today, and the
> challenge now is to continue to evolve them to address the problems
> we're seeing. The QA and Infrastructure teams recently hosted a sprint
> where we discussed some of these issues in depth. This post from Sean
> Dague goes into a bit of the background: [1]. The rest of this email
> outlines the medium- and long-term changes we would like to make to
> address these problems.
>
> [1] https://dague.net/2014/07/22/openstack-failures/
>
> ==Things we're already doing==
>
> The elastic-recheck tool [2] is used to identify "random" failures in
> test runs. It tries to match failures to known bugs using signatures
> created from log messages. It helps developers prioritize bugs by how
> frequently they manifest as test failures. It also collects
> information on unclassified errors -- we can see how many (and which)
> test runs failed for an unknown reason, and our overall progress on
> finding fingerprints for random failures.
>
> [2] http://status.openstack.org/elastic-recheck/
>
> We added a feature to Zuul that lets us manually "promote" changes to
> the top of the gate pipeline. When the QA team identifies a change
> that fixes a bug affecting overall gate stability, we can move that
> change to the top of the queue so that it may merge more quickly.
>
> We added the clean check facility in reaction to the January gate
> breakdown. While it does mean that any individual patch might see more
> test runs, it has largely kept the gate queue at a countable number of
> hours, instead of letting it regularly grow to more than a work day in
> length. It also means that a developer can Approve a code merge before
> tests have returned, without ruining it for everyone else if the tests
> turn out to catch a bug.
>
> ==Future changes==
>
> ===Communication===
>
> We used to be better at communicating about the CI system. As it and
> the project grew, we incrementally added to our institutional
> knowledge, but we haven't been good about maintaining that information
> in a form that new or existing contributors can consume to understand
> what's going on and why.
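As an aside on the elastic-recheck fingerprinting mentioned above: the
matching idea boils down to roughly the sketch below. This is
hypothetical code, not elastic-recheck's implementation (its real
signatures are, roughly speaking, Elasticsearch queries kept one per
bug), and the bug numbers and log patterns here are invented.

    import re

    # Invented fingerprints mapping a log-message regex to a bug id.
    FINGERPRINTS = {
        "bug 1000001": re.compile(
            r"Timed out waiting for server .* to become ACTIVE"),
        "bug 1000002": re.compile(r"Lock wait timeout exceeded"),
    }

    def classify_failure(log_lines):
        """Return ids of known bugs whose signatures appear in the
        logs of a failed run; an empty result means 'unclassified'."""
        hits = set()
        for line in log_lines:
            for bug, pattern in FINGERPRINTS.items():
                if pattern.search(line):
                    hits.add(bug)
        return hits

    sample = ["Timed out waiting for server abc123 to become ACTIVE"]
    print(classify_failure(sample) or "unclassified")

The "unclassified" bucket is exactly what the mail describes: it
measures how much of the failure rate hasn't yet been pinned on a
known bug.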
>
> We have started on a major effort in that direction that we call the
> "infra-manual" project -- it's designed to be a comprehensive "user
> manual" for the project infrastructure, including the CI process. Even
> before that project is complete, we will write a document that
> summarizes the CI system and ensure it is included in new developer
> documentation and linked to from test results.
>
> There are also a number of ways for people to get involved in the CI
> system, whether focused on Infrastructure or QA, but it is not always
> clear how to do so. We will improve our documentation to highlight how
> to contribute.
>
> ===Fixing Faster===
>
> We introduce bugs to OpenStack at a roughly constant rate, and they
> pile up over time. Our systems currently treat all changes as equally
> risky and equally important to the health of the system, which makes
> landing code changes that fix key bugs slow when we're at a high reset
> rate. We have a manual process for promoting changes today to get
> around this, but it is quite costly in people's time and requires
> getting all the right people together at once. You can see a number of
> the changes we promoted during the gate storm in June [3]; it took no
> small number of fixes to get us back to a reasonably passing gate. We
> think that optimizing this system will help us land fixes to critical
> bugs faster.
>
> [3] https://etherpad.openstack.org/p/gatetriage-june2014
>
> The basic idea is to use the data from elastic-recheck to identify
> that a patch fixes a critical gate-related bug. When one of these is
> found in the queues, it will be given higher priority, including
> bubbling up to the top of the gate queue automatically. The manual
> promote process should no longer be needed; instead, patches fixing
> elastic-recheck-tracked bugs will be promoted automatically.
>
> At the same time, we'll also promote review of critical gate bugs by
> making them visible in a number of different channels (such as the
> elastic-recheck pages, review day, and the Gerrit dashboards). The
> idea here, again, is to make the reviews that fix key bugs pop to the
> top of everyone's views.
>
> ===Testing more tactically===
>
> One of the challenges that exists today is that we have basically two
> levels of testing in most of OpenStack: unit tests, and running a
> whole OpenStack cloud. Over time we've focused on adding more and more
> configurations and tests to the latter, but as we've seen, when things
> fail in a whole OpenStack cloud, getting to the root cause is often
> quite hard -- so hard, in fact, that most people throw up their hands
> and just run 'recheck'. If a test run fails, and no one looks at why,
> does it provide any value?
>
> We need to get to a balance where we are testing that OpenStack works
> as a whole in some configuration, but as we've seen, even our best and
> brightest can't seem to make OpenStack reliably boot a compute
> instance with working networking 100% of the time if we happen to be
> running more than one API request at once.
>
> Getting there is a multi-party process:
>
> * Reduce the gating configurations down to some gold standard
>   configuration(s). This will be a small number of configurations
>   that we all agree everything will gate on. This means things like
>   postgresql, cells, and different environments will all get dropped
>   from the gate as we know it.
>
> * Put the burden for a bunch of these tests back on the projects as
>   "functional" tests.
>   Basically, a custom devstack environment that a project can create
>   with a set of services that they minimally need to do their job.
>   These functional tests will live in the project tree, not in
>   Tempest, so they can be landed atomically as part of the project's
>   normal development process.
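To make the in-tree functional test idea concrete, here is a sketch of
what one could look like, assuming a minimal devstack is already
running. Every specific in it -- the endpoint, the database DSN, and
the table and column names -- is made up for illustration.

    import unittest

    import requests
    import sqlalchemy

    API = "http://127.0.0.1:8774/v2"                   # made up
    DB = "mysql+pymysql://root:secret@127.0.0.1/nova"  # made up

    class ServerCreateFunctionalTest(unittest.TestCase):
        """Runs against a minimal devstack with only the services this
        project needs. Because the test lives in the project tree, it
        can peek at internal state (the DB) as well as drive the public
        API -- something a black-box suite like Tempest cannot do."""

        def test_create_inserts_instance_row(self):
            # Drive the public API the way a client would ...
            resp = requests.post(
                API + "/servers",
                json={"server": {"name": "t1", "flavorRef": "1",
                                 "imageRef": "cirros"}})
            self.assertEqual(202, resp.status_code)
            server_id = resp.json()["server"]["id"]

            # ... then assert on internal (white-box) state: the
            # instance row must exist once the API accepts the request.
            engine = sqlalchemy.create_engine(DB)
            with engine.connect() as conn:
                row = conn.execute(
                    sqlalchemy.text(
                        "SELECT vm_state FROM instances WHERE uuid = :u"),
                    {"u": server_id}).fetchone()
            self.assertIsNotNone(row)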
We do this in Solum and I really like it. It's nice for the same
reviewers to see the functional tests and the code that implements a
feature. One downside is that we have had failures due to Tempest
reworking their client code. This hasn't happened for a while, but it
would be good for Tempest to recognize that people are using Tempest
as a library and to maintain a stable API.

-Angus

> * For all non-gold-standard configurations, we'll dedicate a part of
>   our infrastructure to running them in a continuous background loop,
>   as well as making these configs available as experimental jobs. The
>   idea here is that we'll actually be able to provide more
>   configurations that are operating in a more traditional CI
>   (post-merge) context. People who are interested in keeping these
>   bits functional can monitor those jobs and help with fixes when
>   needed. The experimental jobs mean that if developers are concerned
>   about the effect of a particular change on one of these configs,
>   it's easy to request a pre-merge test run. In the near term we
>   might imagine this would allow for things like ceph, mongodb,
>   docker, and possibly very new libvirt to be validated in some way
>   upstream.
>
> * Provide some kind of easy-to-view dashboards for these jobs, as
>   well as a policy that if some job is failing for more than some
>   period of time, it's removed from the system. We want to provide
>   whatever feedback we can to engaged parties, but people do need to
>   realize that engagement is key. The biggest part of putting tests
>   into OpenStack isn't landing the tests, but dealing with their
>   failures.
>
> * Encourage projects to specifically land interface tests in other
>   projects when they depend on certain behavior.
>
> Let's imagine an example of how this works in the real world.
>
> * The heat-slow job is deleted.
>
> * The Heat team creates a specific functional job which tests some of
>   Heat's deeper functionality. All the tests live in Heat, and
>   because of this they can include white/grey-box testing of the DB
>   and queues while things are progressing.
>
> * Nova lands a change which neither Tempest nor our configs exercise,
>   but which breaks Heat.
>
> * The Heat project can now decide whether it's more important to keep
>   the test in place (preventing them from landing code), or to skip
>   it to get back to work.
>
> * The Heat team then works on the right fix for Nova, or communicates
>   with the Nova team on the issue at hand. The fix to Nova should
>   *also* include tests which lock down that interface so that Nova
>   won't break it again in the future (the Ironic team did this with
>   their test_ironic_contract patch). These tests could be unit tests,
>   if the interface is testable that way, or functional tests in the
>   Nova tree.
>
> * The Heat team is then back in business.
>
> This approach puts more of the control over when a project is blocked
> back into that project's own hands. Tempest remains a final
> integration test to ensure that the basics of the whole stack work
> together, but each project also has a vertical testing stack which is
> specific to it.
>
> ==Final thoughts==
>
> The current rate of test failures and subsequent rechecks is not
> sustainable in the long term. It's not good for contributors,
> reviewers, or the overall project quality. While these bugs do need
> to be addressed, it's unlikely that the current process will cause
> that to happen.
> Instead, we want to push more substantial testing into the projects
> themselves with functional and interface testing, and depend less on
> devstack-gate integration tests to catch all bugs. This should help
> us catch bugs closer to the source and in an environment where
> debugging is easier. We also want to reduce the scope of the
> devstack-gate tests to a gold standard, while running tests of other
> configurations in a traditional CI process so that people interested
> in those configurations can focus on ensuring they work.
>
> Thanks,
>
> Jim and Sean
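One more thought on the interface "contract" tests mentioned above:
the shape of such a test might be something like the sketch below.
This is purely illustrative -- it is not the actual
test_ironic_contract patch, and every name in it is made up. The
function is defined inline here as a stand-in; in the producing
project's tree the test would import the real code instead.

    import inspect
    import unittest

    def create_stack(name, template, parameters=None, timeout_mins=60):
        """Stand-in for a public entry point another project calls."""

    class InterfaceContractTest(unittest.TestCase):
        """Lock down the parts of an interface that another project
        depends on, so a refactor fails here -- with a pointer to the
        consumer -- rather than breaking that project's gate."""

        def test_create_stack_signature_is_stable(self):
            self.assertEqual(
                ["name", "template", "parameters", "timeout_mins"],
                list(inspect.signature(create_stack).parameters),
                "Another project depends on this signature; "
                "coordinate with its team before changing it.")

    if __name__ == "__main__":
        unittest.main()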
_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev