On 07/24/2014 06:15 PM, Angus Salkeld wrote:
> On Wed, 2014-07-23 at 14:39 -0700, James E. Blair wrote:
>> OpenStack has a substantial CI system that is core to its development process. The goals of the system are to facilitate merging good code, prevent regressions, and ensure that there is at least one configuration of upstream OpenStack that we know works as a whole. The "project gating" technique that we use is effective at preventing many kinds of regressions from landing; however, more subtle, non-deterministic bugs can still get through, and these are the bugs that are currently plaguing developers with seemingly random test failures.
>>
>> Most of these bugs are not failures of the test system; they are real bugs. Many of them have even been in OpenStack for a long time, but are only becoming visible now due to improvements in our tests. That's not much help to developers whose patches are being hit with negative test results from unrelated failures. We need to find a way to address the non-deterministic bugs that are lurking in OpenStack without making it easier for new bugs to creep in.
>>
>> The CI system and project infrastructure are not static. They have evolved with the project to get to where they are today, and the challenge now is to continue to evolve them to address the problems we're seeing now. The QA and Infrastructure teams recently hosted a sprint where we discussed some of these issues in depth. This post from Sean Dague goes into a bit of the background: [1]. The rest of this email outlines the medium and long-term changes we would like to make to address these problems.
>>
>> [1] https://dague.net/2014/07/22/openstack-failures/
>>
>> ==Things we're already doing==
>>
>> The elastic-recheck tool [2] is used to identify "random" failures in test runs. It tries to match failures to known bugs using signatures created from log messages. It helps developers prioritize bugs by how frequently they manifest as test failures. It also collects information on unclassified errors -- we can see how many (and which) test runs failed for an unknown reason, and our overall progress on finding fingerprints for random failures.
>>
>> [2] http://status.openstack.org/elastic-recheck/
>>
>> We added a feature to Zuul that lets us manually "promote" changes to the top of the gate pipeline. When the QA team identifies a change that fixes a bug that is affecting overall gate stability, we can move that change to the top of the queue so that it may merge more quickly.
>>
>> We added the clean check facility in reaction to the January gate breakdown. While it does mean that any individual patch might see more tests run on it, it has largely kept the gate queue at a countable number of hours, instead of regularly growing to more than a work day in length. It also means that a developer can Approve a code merge before tests have returned, and not ruin it for everyone else if there turns out to be a bug that the tests could catch.
>>
>> ==Future changes==
>>
>> ===Communication===
>>
>> We used to be better at communicating about the CI system. As it and the project grew, we incrementally added to our institutional knowledge, but we haven't been good about maintaining that information in a form that new or existing contributors can consume to understand what's going on and why.
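A minimal sketch of the fingerprinting idea behind elastic-recheck, described above: known bugs are paired with signatures derived from distinctive log messages, and a failed run is classified by checking its logs against each signature. The real tool matches Elasticsearch queries against indexed job logs; the regex patterns and bug numbers below are hypothetical stand-ins for illustration only.

    # Simplified illustration of elastic-recheck-style classification.
    # The real tool runs Elasticsearch queries against indexed logs;
    # these regexes and bug numbers are hypothetical.
    import re

    # Each signature pairs a bug number with a pattern built from a
    # distinctive log message.
    SIGNATURES = {
        "1234567": re.compile(r"Timed out waiting for .* to become ACTIVE"),
        "7654321": re.compile(r"Lock wait timeout exceeded"),
    }

    def classify_failure(console_log):
        """Return bug numbers whose signatures appear in the log;
        an empty list means the failure is unclassified."""
        return [bug for bug, pattern in SIGNATURES.items()
                if any(pattern.search(line)
                       for line in console_log.splitlines())]

    if __name__ == "__main__":
        sample = "ERROR: Timed out waiting for server 42 to become ACTIVE\n"
        print(classify_failure(sample))  # ['1234567']

Runs that come back unclassified are exactly the ones the project tracks to measure progress on finding new fingerprints.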
>>
>> We have started on a major effort in that direction that we call the "infra-manual" project -- it's designed to be a comprehensive "user manual" for the project infrastructure, including the CI process. Even before that project is complete, we will write a document that summarizes the CI system and ensure it is included in new developer documentation and linked to from test results.
>>
>> There are also a number of ways for people to get involved in the CI system, whether focused on Infrastructure or QA, but it is not always clear how to do so. We will improve our documentation to highlight how to contribute.
>>
>> ===Fixing Faster===
>>
>> We introduce bugs to OpenStack at some constant rate, and they pile up over time. Our systems currently treat all changes as equally risky and important to the health of the system, which makes landing code changes to fix key bugs slow when we're at a high reset rate. We've got a manual process of promoting changes today to get around this, but that's actually quite costly in people time, and takes getting all the right people together at once to promote changes. You can see a number of the changes we promoted during the gate storm in June [3], and it was no small number of fixes to get us back to a reasonably passing gate. We think that optimizing this system will help us land fixes to critical bugs faster.
>>
>> [3] https://etherpad.openstack.org/p/gatetriage-june2014
>>
>> The basic idea is to use the data from elastic-recheck to identify that a patch is fixing a critical gate-related bug. When one of these is found in the queues, it will be given higher priority, including bubbling up to the top of the gate queue automatically. The manual promote process should no longer be needed; instead, patches fixing elastic-recheck-tracked issues will be promoted automatically.
>>
>> At the same time, we'll also promote review of critical gate bugs by making them visible in a number of different channels (like on elastic-recheck pages, review day, and in the Gerrit dashboards). The idea here again is to make the reviews that fix key bugs pop to the top of everyone's views.
>>
>> ===Testing more tactically===
>>
>> One of the challenges that exists today is that we've got basically two levels of testing in most of OpenStack: unit tests, and running a whole OpenStack cloud. Over time we've focused on adding more and more configurations and tests to the latter, but as we've seen, when things fail in a whole OpenStack cloud, getting to the root cause is often quite hard. So hard, in fact, that most people throw up their hands and just run 'recheck'. If a test run fails, and no one looks at why, does it provide any value?
>>
>> We need to get to a balance where we are testing that OpenStack works as a whole in some configuration, but as we've seen, even our best and brightest can't seem to make OpenStack reliably boot a compute instance with working networking 100% of the time if we happen to be running more than one API request at once.
>>
>> Getting there is a multi-part process:
>>
>> * Reduce the gating configurations down to some gold standard configuration(s). This will be a small number of configurations that we all agree everything will gate on. This means things like postgresql, cells, and different environments will all get dropped from the gate as we know it.
>>
>> * Put the burden for a bunch of these tests back on the projects as "functional" tests. Basically, a custom devstack environment that a project can create with the set of services that they minimally need to do their job. These functional tests will live in the project tree, not in Tempest, so they can be landed atomically as part of the project's normal development process.
>
> We do this in Solum and I really like it. It's nice for the same reviewers to see the functional tests and the code that implements a feature.
>
> One downside is that we have had failures due to Tempest reworking their client code. This hasn't happened for a while, but it would be good for Tempest to recognize that people are using Tempest as a library and to maintain a stable API.
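A minimal sketch of the kind of in-tree functional test being discussed here: it assumes a devstack is already running and providing the service, and it exercises the project's API directly rather than going through Tempest. The endpoint variable, URL path, and expected status code are assumptions for illustration; a real project would use its own client library.

    # Sketch of an in-tree functional test run against a devstack.
    # The SERVICE_ENDPOINT variable and the bare HTTP check are
    # hypothetical; substitute the project's own client and API.
    import os
    import unittest
    import urllib.request  # stdlib stand-in for the project's client

    SERVICE_ENDPOINT = os.environ.get(
        "SERVICE_ENDPOINT", "http://127.0.0.1:8004")

    class TestApiIsAlive(unittest.TestCase):
        def test_version_document(self):
            # With devstack providing a live service, the test can hit
            # the real API instead of mocking it.
            with urllib.request.urlopen(SERVICE_ENDPOINT + "/") as resp:
                self.assertEqual(200, resp.getcode())

    if __name__ == "__main__":
        unittest.main()

Because such a test lives in the project tree, it can land atomically with the change it covers, which is the property Angus describes liking in Solum.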
To be clear, the functional tests will not be Tempest tests. This is a different class of testing; it's really another tox target that needs a devstack to run. A really good initial transition would be things like the CLI testing.

Also, the Tempest team has gone out of its way to tell people that it's not a stable interface, and not to do that. Contributions to help make parts of Tempest into a stable library would be appreciated.

	-Sean

-- 
Sean Dague
http://dague.net
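Following Sean's point that CLI testing is a good initial transition, here is a minimal sketch of a functional CLI test that could sit behind a dedicated tox target (for example, "tox -e functional") invoked only where a devstack is available. The client command name is hypothetical, and credentials are assumed to come from the environment, as with a sourced devstack openrc.

    # Sketch of an in-tree functional CLI test. "exampleclient" is a
    # hypothetical command; substitute the project's own CLI.
    import subprocess
    import unittest

    class TestClientCli(unittest.TestCase):
        def test_list_runs_cleanly(self):
            # Credentials come from the environment, so the test only
            # asserts that the command exits successfully and produces
            # some output.
            result = subprocess.run(
                ["exampleclient", "list"],
                capture_output=True, text=True, check=False)
            self.assertEqual(0, result.returncode, result.stderr)
            self.assertTrue(result.stdout)

    if __name__ == "__main__":
        unittest.main()

Keeping this behind its own tox target means unit test runs stay fast and devstack is only required for the jobs that actually need it.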