Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward
> Sean Dague wrote:
>> To be clear, the functional tests will not be Tempest tests. This is a different class of testing; it's really another tox target that needs a devstack to run. A really good initial transition would be things like the CLI testing.
>>
>> Also, the Tempest team has gone out of its way to tell people it's not a stable interface, and don't do that. Contributions to help make parts of Tempest into a stable library would be appreciated.

Well, in a perfect world, this "libification" of the re-usable bits from Tempest would be nicely advanced *before* the projects all rush in to implement their own in-tree functional testing mechanisms. But as we know, we all live in a highly imperfect world ...

So do we expect the tempest-lib to be fleshed out in an emergent fashion, as the projects dig into implementing their own in-tree func tests? Or is it seen as an upfront seeding process, that the QA team members with Tempest domain knowledge are expecting to drive?

Cheers,
Eoghan

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward
On Mon, Jul 28, 2014 at 02:28:56PM +0200, Thierry Carrez wrote:
> James E. Blair wrote:
>> [...]
>> Most of these bugs are not failures of the test system; they are real bugs. Many of them have even been in OpenStack for a long time, but are only becoming visible now due to improvements in our tests. That's not much help to developers whose patches are being hit with negative test results from unrelated failures. We need to find a way to address the non-deterministic bugs that are lurking in OpenStack without making it easier for new bugs to creep in.
>
> I think that's a critical point. As a community, we need to move from a perspective where we see the gate as a process step and failure there being described as "the gate is broken".
>
> Although in some cases the failures are indeed coming from a gate bug, in most cases the failures are coming from a pileup of race conditions and other rare errors in OpenStack itself. In other words, the gate is not broken, *OpenStack* is broken. If you can't get the tests to pass on a proposed change due to test failures, that means OpenStack itself has reached a level where it just doesn't work. The gate is just a thermometer.
>
> Those types of problems need to be solved, even if changes can be introduced in the CI/gate system to mitigate some of their most painful side-effects. However, currently, only a handful of developers actually work on fixing such issues -- and today those developers are completely overwhelmed and burnt out.
>
> We need to have more people working on those bugs. We need to communicate this key type of strategic contribution to our corporate sponsors. We need to make it practical to work on those bugs, by providing all the tools we can to help in the debugging. We need to make it rewarding to work on those bugs: some of those bugs will be the most complex bugs you can find in OpenStack -- they should be viewed as an intellectual challenge for our best minds, rather than as cleaning up a sewer that other people continuously contribute to fill.

I recall it was suggested elsewhere recently, but I think that perhaps we should consider having much more regular bug squashing days. E.g. we could say we have "bug squash Wednesdays" every 2 weeks or so, where we explicitly encourage people to focus their attention exclusively on bug fixes and ignore all feature related stuff. Core reviewers could set the tone by not reviewing any patches which were not tagged with a bug on those days, and encouraging discussions around the bugs in IRC. The bug triage and gate teams could help prime it by providing a couple of lists of bugs, each list targeted to suit some skill level, to make it easy for people to pick off bugs to attack on those days.

Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward
James E. Blair wrote:
> [...]
> Most of these bugs are not failures of the test system; they are real bugs. Many of them have even been in OpenStack for a long time, but are only becoming visible now due to improvements in our tests. That's not much help to developers whose patches are being hit with negative test results from unrelated failures. We need to find a way to address the non-deterministic bugs that are lurking in OpenStack without making it easier for new bugs to creep in.

I think that's a critical point. As a community, we need to move from a perspective where we see the gate as a process step and failure there being described as "the gate is broken".

Although in some cases the failures are indeed coming from a gate bug, in most cases the failures are coming from a pileup of race conditions and other rare errors in OpenStack itself. In other words, the gate is not broken, *OpenStack* is broken. If you can't get the tests to pass on a proposed change due to test failures, that means OpenStack itself has reached a level where it just doesn't work. The gate is just a thermometer.

Those types of problems need to be solved, even if changes can be introduced in the CI/gate system to mitigate some of their most painful side-effects. However, currently, only a handful of developers actually work on fixing such issues -- and today those developers are completely overwhelmed and burnt out.

We need to have more people working on those bugs. We need to communicate this key type of strategic contribution to our corporate sponsors. We need to make it practical to work on those bugs, by providing all the tools we can to help in the debugging. We need to make it rewarding to work on those bugs: some of those bugs will be the most complex bugs you can find in OpenStack -- they should be viewed as an intellectual challenge for our best minds, rather than as cleaning up a sewer that other people continuously contribute to fill.

> The CI system and project infrastructure are not static. They have evolved with the project to get to where they are today, and the challenge now is to continue to evolve them to address the problems we're seeing now. The QA and Infrastructure teams recently hosted a sprint where we discussed some of these issues in depth. This post from Sean Dague goes into a bit of the background: [1]. The rest of this email outlines the medium and long-term changes we would like to make to address these problems.
> [...]

I like all the options suggested there, and I enjoyed the discussion that followed.

--
Thierry Carrez (ttx)
Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward
On 07/24/2014 06:36 PM, John Dickinson wrote:
> On Jul 24, 2014, at 3:25 PM, Sean Dague wrote:
>> On 07/24/2014 06:15 PM, Angus Salkeld wrote:
>>> We do this in Solum and I really like it. It's nice for the same reviewers to see the functional tests and the code that implements a feature. One downside is we have had failures due to tempest reworking their client code. This hasn't happened for a while, but it would be good for tempest to recognize that people are using tempest as a library and will maintain the API.
>>
>> To be clear, the functional tests will not be Tempest tests. This is a different class of testing; it's really another tox target that needs a devstack to run. A really good initial transition would be things like the CLI testing.
>
> I too love this idea. In addition to the current Tempest tests that are run against every patch, Swift has in-tree unit, functional[1], and probe[2] tests. This makes it quite easy to test locally before submitting patches and makes keeping test coverage high much easier too. I'm really happy to hear that this will be the future direction of testing in OpenStack.

And Glance has had functional tests in-tree for 3 years:

http://git.openstack.org/cgit/openstack/glance/tree/glance/tests/functional

-jay
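[Editor's illustration] The "CLI testing" that Sean suggests as a first in-tree functional target tends to need small project-owned helpers. As a rough sketch (not from any project's actual tree), here is the kind of utility such a suite would carry: a parser for the bordered ASCII tables the OpenStack command-line clients print, so tests can assert on fields rather than on raw output:

```python
# Hypothetical helper for in-tree CLI functional tests: parse the
# prettytable-style output that OpenStack CLIs print (e.g. `nova list`)
# into a list of dicts keyed by the column headers.

def parse_cli_table(output):
    """Parse '+---+---+' bordered CLI table output into row dicts."""
    # Border rows start with '+'; only content rows start with '|'.
    lines = [l for l in output.strip().splitlines() if l.startswith('|')]
    rows = [[cell.strip() for cell in l.strip('|').split('|')]
            for l in lines]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


if __name__ == '__main__':
    sample = """
+--------------------------------------+--------+--------+
| ID                                   | Name   | Status |
+--------------------------------------+--------+--------+
| 9f1c...                              | vm-one | ACTIVE |
+--------------------------------------+--------+--------+
"""
    servers = parse_cli_table(sample)
    print(servers[0]['Name'], servers[0]['Status'])
```

A functional test would capture the client's stdout against a devstack endpoint and feed it through a helper like this; the point is only that the helper lives in, and is reviewed by, the project's own tree.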
Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward
On 26 July 2014 08:20, Matthew Treinish wrote:
>> This is also more of a pragmatic organic approach to figuring out the interfaces we need to lock down. When one project breaks depending on an interface in another project, that should trigger this kind of contract growth, which hopefully formally turns into a document later for a stable interface.
>
> So notifications are a good example of this, but I think how we handled this is also an example of what not to do. The order was backwards: there should have been a stability guarantee upfront, with a versioning mechanism on notifications when another project started relying on using them. The fact that there are at least 2 ML threads on how to fix and test this at this point in ceilometer's life seems like a poor way to handle it. I don't want to see us repeat this by allowing cross-project interactions to depend on unstable interfaces.

+1

> I agree that there is a scaling issue; our variable testing quality and coverage between all the projects in the tempest tree is proof enough of this. I just don't want to see us lose the protection we have against inadvertent changes. Having the friction of something like the tempest two-step is important; we've blocked a lot of breaking api changes because of this.
>
> The other thing to consider is that when we adopted branchless tempest, part of the goal there was to ensure consistency between release boundaries. If we're really advocating dropping most of the API coverage out of tempest, part of the story needs to be around how we prevent things from slipping between release boundaries too.

I'm also worried about the impact on TripleO - we run everything together functionally, and we've been aiming at the gate since forever: we need more stability, and I'm worried that this may lead to less. I don't think more lock-down and a bigger matrix is needed - and I support doing an experiment to see if we end up in a better place. Still worried :).

> But, having worked on this stuff for ~2 years, I can say from personal experience that every project slips when it comes to API stability, despite the best intentions, unless there is test coverage for it. I don't want to see us open the flood gates on this just because we've gotten ourselves into a bad situation with the state of the gate.

+1

>> Our current model leans far too much on the idea that the only time we ever try to test things for real is when we throw all 1 million lines of source code into one pot and stir. It really shouldn't be surprising how many bugs shake out there. And this is the wrong layer to debug from, so I firmly believe we need to change this back to something we can actually manage to shake the bugs out with. Because right now we're finding them, but our infrastructure isn't optimized for fixing them, and we need to change that.
>
> I agree a layered approach is best; I'm not disagreeing on that point. I'm just not sure how much we really should be decreasing the scope of Tempest as the top layer around the api tests. I don't think we should too much, just because we're beefing up the middle with improved functional testing. In my view, having some duplication between the layers is fine and desirable, actually.
>
> Anyway, I feel like I'm diverging this thread off into a different area, so I'll shoot off a separate thread on the topic of scale and scope of Tempest and the new in-tree project specific functional tests. But to summarize, what I think we should be clear about at the high level for this thread is that for the short term we aren't changing the scope of Tempest. Instead we should just be vigilant in managing Tempest's growth (which we've been trying to do already). We can revisit the discussion of decreasing Tempest's size once everyone's figured out the per-project functional testing. This will also give us time to collect longer term data about test stability in the gate so we can figure out which things are actually valuable to have in tempest. I think this is what probably got lost in the noise here but has been discussed elsewhere.

I'm pretty interested in having contract tests within each project - I think that's the right responsibility for them; my specific concern is the recovery process / time to recovery when a regression does get through.

-Rob

--
Robert Collins
Distinguished Technologist
HP Converged Cloud
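[Editor's illustration] The "stability guarantee upfront, with a versioning mechanism on notifications" that Matthew argues for can be sketched minimally. The field names and the 1.x compatibility convention below are invented for the example - OpenStack's actual notification format and any later versioned-payload work differ in detail:

```python
# Sketch of a version-checked notification contract: the emitter stamps
# each payload with a version; a consumer (e.g. a metering service)
# accepts any payload whose major version it understands. By the assumed
# convention, minor bumps are additive-only and safe to consume, while a
# major bump signals an incompatible change.

EMITTER_VERSION = (1, 2)  # (major, minor)

def emit_notification(event_type, payload):
    return {
        'event_type': event_type,
        'version': '%d.%d' % EMITTER_VERSION,
        'payload': payload,
    }

def consume_notification(notification, supported_major=1):
    """Return the payload, rejecting incompatible major versions."""
    major = int(notification['version'].split('.')[0])
    if major != supported_major:
        raise ValueError('unsupported notification version %s'
                         % notification['version'])
    return notification['payload']


if __name__ == '__main__':
    n = emit_notification('compute.instance.create.end', {'uuid': 'abc'})
    print(consume_notification(n))
```

The point of doing this upfront is that a consuming project depends on a declared contract rather than on whatever fields the emitter happens to send today.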
Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward
On Thu, Jul 24, 2014 at 06:54:38PM -0400, Sean Dague wrote:
> On 07/24/2014 05:57 PM, Matthew Treinish wrote:
>> On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:
>>> OpenStack has a substantial CI system that is core to its development process. The goals of the system are to facilitate merging good code, prevent regressions, and ensure that there is at least one configuration of upstream OpenStack that we know works as a whole. The "project gating" technique that we use is effective at preventing many kinds of regressions from landing; however, more subtle, non-deterministic bugs can still get through, and these are the bugs that are currently plaguing developers with seemingly random test failures.
>>>
>>> Most of these bugs are not failures of the test system; they are real bugs. Many of them have even been in OpenStack for a long time, but are only becoming visible now due to improvements in our tests. That's not much help to developers whose patches are being hit with negative test results from unrelated failures. We need to find a way to address the non-deterministic bugs that are lurking in OpenStack without making it easier for new bugs to creep in.
>>>
>>> The CI system and project infrastructure are not static. They have evolved with the project to get to where they are today, and the challenge now is to continue to evolve them to address the problems we're seeing now. The QA and Infrastructure teams recently hosted a sprint where we discussed some of these issues in depth. This post from Sean Dague goes into a bit of the background: [1]. The rest of this email outlines the medium and long-term changes we would like to make to address these problems.
>>>
>>> [1] https://dague.net/2014/07/22/openstack-failures/
>>>
>>> ==Things we're already doing==
>>>
>>> The elastic-recheck tool[2] is used to identify "random" failures in test runs. It tries to match failures to known bugs using signatures created from log messages. It helps developers prioritize bugs by how frequently they manifest as test failures. It also collects information on unclassified errors -- we can see how many (and which) test runs failed for an unknown reason and our overall progress on finding fingerprints for random failures.
>>>
>>> [2] http://status.openstack.org/elastic-recheck/
>>>
>>> We added a feature to Zuul that lets us manually "promote" changes to the top of the Gate pipeline. When the QA team identifies a change that fixes a bug that is affecting overall gate stability, we can move that change to the top of the queue so that it may merge more quickly.
>>>
>>> We added the clean check facility in reaction to the January gate breakdown. While it does mean that any individual patch might see more tests run on it, it's now largely kept the gate queue at a countable number of hours, instead of regularly growing to more than a work day in length. It also means that a developer can Approve a code merge before tests have returned, and not ruin it for everyone else if there turned out to be a bug that the tests could catch.
>>>
>>> ==Future changes==
>>>
>>> ===Communication===
>>> We used to be better at communicating about the CI system. As it and the project grew, we incrementally added to our institutional knowledge, but we haven't been good about maintaining that information in a form that new or existing contributors can consume to understand what's going on and why.
>>>
>>> We have started on a major effort in that direction that we call the "infra-manual" project -- it's designed to be a comprehensive "user manual" for the project infrastructure, including the CI process. Even before that project is complete, we will write a document that summarizes the CI system and ensure it is included in new developer documentation and linked to from test results.
>>>
>>> There are also a number of ways for people to get involved in the CI system, whether focused on Infrastructure or QA, but it is not always clear how to do so. We will improve our documentation to highlight how to contribute.
>>>
>>> ===Fixing Faster===
>>>
>>> We introduce bugs to OpenStack at some constant rate, which piles up over time. Our systems currently treat all changes as equally risky and important to the health of the system, which makes landing code changes to fix key bugs slow when we're at a high reset rate. We've got a manual process of promoting changes today to get around this, but that's actually quite costly in people time, and takes getting all the right people together at once to promote changes. You can see a number of the changes we promoted during the gate storm in June [3], and it was no small number of fixes to get us back to a reasonably passing gate.
>>>
>>> [3] https://etherpad.openstack.org/p/gatetriage-june2014
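[Editor's illustration] The elastic-recheck mechanism quoted above - matching failure logs against "fingerprints" of known bugs - can be sketched roughly as follows. The real tool runs its signatures as elasticsearch queries over indexed gate logs; the regex-over-text version and the bug numbers here are a simplification for illustration:

```python
import re

# Simplified sketch of elastic-recheck's classification step: each known
# gate bug gets a log-message "fingerprint". A failed run whose logs
# match a fingerprint is attributed to that bug; runs matching nothing
# are counted as unclassified. Bug numbers/patterns here are invented.
FINGERPRINTS = {
    '1234567': re.compile(r'Timed out waiting for thing .* to become ACTIVE'),
    '7654321': re.compile(r'libvirtError: Unable to read from monitor'),
}

def classify_failure(log_text):
    """Return the bug numbers whose fingerprint matches this failure log."""
    return sorted(bug for bug, pattern in FINGERPRINTS.items()
                  if pattern.search(log_text))

if __name__ == '__main__':
    log = "ERROR: Timed out waiting for thing 9f1c to become ACTIVE"
    # A match lets the bot comment with the known bug instead of leaving
    # the failure unclassified, and feeds the per-bug frequency counts.
    print(classify_failure(log))
```

This is also the hook for the "Fixing Faster" idea later in the thread: a patch that references a fingerprinted bug is identifiable mechanically, so it can be promoted in the gate queue automatically.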
Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward
On Thu, Jul 24, 2014 at 3:54 PM, Sean Dague wrote:
> On 07/24/2014 05:57 PM, Matthew Treinish wrote:
>> On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:
>>> OpenStack has a substantial CI system that is core to its development process. The goals of the system are to facilitate merging good code, prevent regressions, and ensure that there is at least one configuration of upstream OpenStack that we know works as a whole. The "project gating" technique that we use is effective at preventing many kinds of regressions from landing; however, more subtle, non-deterministic bugs can still get through, and these are the bugs that are currently plaguing developers with seemingly random test failures.
>>>
>>> Most of these bugs are not failures of the test system; they are real bugs. Many of them have even been in OpenStack for a long time, but are only becoming visible now due to improvements in our tests. That's not much help to developers whose patches are being hit with negative test results from unrelated failures. We need to find a way to address the non-deterministic bugs that are lurking in OpenStack without making it easier for new bugs to creep in.
>>>
>>> The CI system and project infrastructure are not static. They have evolved with the project to get to where they are today, and the challenge now is to continue to evolve them to address the problems we're seeing now. The QA and Infrastructure teams recently hosted a sprint where we discussed some of these issues in depth. This post from Sean Dague goes into a bit of the background: [1]. The rest of this email outlines the medium and long-term changes we would like to make to address these problems.
>>>
>>> [1] https://dague.net/2014/07/22/openstack-failures/
>>>
>>> ==Things we're already doing==
>>>
>>> The elastic-recheck tool[2] is used to identify "random" failures in test runs. It tries to match failures to known bugs using signatures created from log messages. It helps developers prioritize bugs by how frequently they manifest as test failures. It also collects information on unclassified errors -- we can see how many (and which) test runs failed for an unknown reason and our overall progress on finding fingerprints for random failures.
>>>
>>> [2] http://status.openstack.org/elastic-recheck/
>>>
>>> We added a feature to Zuul that lets us manually "promote" changes to the top of the Gate pipeline. When the QA team identifies a change that fixes a bug that is affecting overall gate stability, we can move that change to the top of the queue so that it may merge more quickly.
>>>
>>> We added the clean check facility in reaction to the January gate breakdown. While it does mean that any individual patch might see more tests run on it, it's now largely kept the gate queue at a countable number of hours, instead of regularly growing to more than a work day in length. It also means that a developer can Approve a code merge before tests have returned, and not ruin it for everyone else if there turned out to be a bug that the tests could catch.
>>>
>>> ==Future changes==
>>>
>>> ===Communication===
>>> We used to be better at communicating about the CI system. As it and the project grew, we incrementally added to our institutional knowledge, but we haven't been good about maintaining that information in a form that new or existing contributors can consume to understand what's going on and why.
>>>
>>> We have started on a major effort in that direction that we call the "infra-manual" project -- it's designed to be a comprehensive "user manual" for the project infrastructure, including the CI process. Even before that project is complete, we will write a document that summarizes the CI system and ensure it is included in new developer documentation and linked to from test results.
>>>
>>> There are also a number of ways for people to get involved in the CI system, whether focused on Infrastructure or QA, but it is not always clear how to do so. We will improve our documentation to highlight how to contribute.
>>>
>>> ===Fixing Faster===
>>>
>>> We introduce bugs to OpenStack at some constant rate, which piles up over time. Our systems currently treat all changes as equally risky and important to the health of the system, which makes landing code changes to fix key bugs slow when we're at a high reset rate. We've got a manual process of promoting changes today to get around this, but that's actually quite costly in people time, and takes getting all the right people together at once to promote changes. You can see a number of the changes we promoted during the gate storm in June [3], and it was no small number of fixes to get us back to a reasonably passing gate.
>>>
>>> [3] https://etherpad.openstack.org/p/gatetriage-june2014
Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward
On 25/07/14 11:18, Sean Dague wrote:
> On 07/25/2014 10:01 AM, Steven Hardy wrote:
>> On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:
>>> * Put the burden for a bunch of these tests back on the projects as "functional" tests. Basically a custom devstack environment that a project can create with a set of services that they minimally need to do their job. These functional tests will live in the project tree, not in Tempest, so can be atomically landed as part of the project normal development process.
>>
>> +1 - FWIW I don't think the current process where we require tempest cores to review our project test cases is working well, so allowing projects to own their own tests will be a major improvement.
>>
>> In terms of how this works in practice, will the in-tree tests still be run via tempest, e.g. will there be a (relatively) stable tempest api we can develop the tests against, as Angus has already mentioned?
>
> No, not run by tempest, not using tempest code.
>
> The vision is that you'd have:
>
> heat/tests/functional/
>
> And tox -e functional would run them. It would require some config for endpoints. But the point being it would be fully owned by the project team. That it could do both blackbox/whitebox testing (and because it's in the project tree would know things like the data model and could poke behind the scenes).
>
> The tight coupling of everything is part of what's gotten us into these deadlocks; decoupling here is really required in order to reduce the fragility of the system.

Since the tempest scenario orchestration tests use heatclient, hopefully it wouldn't be too much effort to forklift them into heat/tests/functional without any tempest dependencies. We can leave the orchestration api tests where they are until the tempest-lib process results in something ready to use.
Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward
On 07/25/2014 10:01 AM, Steven Hardy wrote:
> On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:
>> * Put the burden for a bunch of these tests back on the projects as "functional" tests. Basically a custom devstack environment that a project can create with a set of services that they minimally need to do their job. These functional tests will live in the project tree, not in Tempest, so can be atomically landed as part of the project normal development process.
>
> +1 - FWIW I don't think the current process where we require tempest cores to review our project test cases is working well, so allowing projects to own their own tests will be a major improvement.
>
> In terms of how this works in practice, will the in-tree tests still be run via tempest, e.g. will there be a (relatively) stable tempest api we can develop the tests against, as Angus has already mentioned?

No, not run by tempest, not using tempest code.

The vision is that you'd have:

heat/tests/functional/

And tox -e functional would run them. It would require some config for endpoints. But the point being it would be fully owned by the project team. That it could do both blackbox/whitebox testing (and because it's in the project tree would know things like the data model and could poke behind the scenes).

The tight coupling of everything is part of what's gotten us into these deadlocks; decoupling here is really required in order to reduce the fragility of the system.

-Sean

--
Sean Dague
http://dague.net
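[Editor's illustration] A `tox -e functional` target of the kind Sean describes might be wired up along these lines. This is a hypothetical tox.ini fragment, not any project's actual configuration - the test path and the choice of credential variables passed through from devstack are assumptions:

```
# Hypothetical tox.ini fragment for a project-owned functional target.
# The suite expects a running devstack; endpoint config arrives via the
# environment instead of being hardcoded in the tests.
[testenv:functional]
setenv =
    OS_TEST_PATH=./heat/tests/functional
passenv =
    OS_AUTH_URL
    OS_USERNAME
    OS_PASSWORD
    OS_TENANT_NAME
commands =
    python setup.py testr --slowest --testr-args='{posargs}'
```

With something like this in place, `tox -e functional` runs only the in-tree suite, and the project team can land a feature and its functional tests atomically in one review.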
Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward
On 07/25/2014 10:01 AM, Steven Hardy wrote:
> On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:
>> * Put the burden for a bunch of these tests back on the projects as "functional" tests. Basically a custom devstack environment that a project can create with a set of services that they minimally need to do their job. These functional tests will live in the project tree, not in Tempest, so can be atomically landed as part of the project normal development process.
>
> +1 - FWIW I don't think the current process where we require tempest cores to review our project test cases is working well, so allowing projects to own their own tests will be a major improvement.

++ We will still need some way to make sure it is difficult to break api compatibility by submitting a change to both code and its tests, which currently requires a "tempest two-step". Also, tempest will still need to retain integration testing of apis that use apis from other projects.

> In terms of how this works in practice, will the in-tree tests still be run via tempest, e.g. will there be a (relatively) stable tempest api we can develop the tests against, as Angus has already mentioned?

That is a really good question. I hope the answer is that they can still be run by tempest, but don't have to be. I tried to address this in a message within the last hour:

http://lists.openstack.org/pipermail/openstack-dev/2014-July/041244.html

-David
Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward
On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:
> * Put the burden for a bunch of these tests back on the projects as "functional" tests. Basically a custom devstack environment that a project can create with a set of services that they minimally need to do their job. These functional tests will live in the project tree, not in Tempest, so can be atomically landed as part of the project normal development process.

+1 - FWIW I don't think the current process where we require tempest cores to review our project test cases is working well, so allowing projects to own their own tests will be a major improvement.

In terms of how this works in practice, will the in-tree tests still be run via tempest, e.g. will there be a (relatively) stable tempest api we can develop the tests against, as Angus has already mentioned?

Steve
Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward
On Thu, Jul 24, 2014 at 04:01:39PM -0400, Sean Dague wrote: > On 07/24/2014 12:40 PM, Daniel P. Berrange wrote: > > On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote: > > > >> ==Future changes== > > > >> ===Fixing Faster=== > >> > >> We introduce bugs to OpenStack at some constant rate, which piles up > >> over time. Our systems currently treat all changes as equally risky and > >> important to the health of the system, which makes landing code changes > >> to fix key bugs slow when we're at a high reset rate. We've got a manual > >> process of promoting changes today to get around this, but that's > >> actually quite costly in people time, and takes getting all the right > >> people together at once to promote changes. You can see a number of the > >> changes we promoted during the gate storm in June [3], and it was no > >> small number of fixes to get us back to a reasonably passing gate. We > >> think that optimizing this system will help us land fixes to critical > >> bugs faster. > >> > >> [3] https://etherpad.openstack.org/p/gatetriage-june2014 > >> > >> The basic idea is to use the data from elastic recheck to identify that > >> a patch is fixing a critical gate related bug. When one of these is > >> found in the queues it will be given higher priority, including bubbling > >> up to the top of the gate queue automatically. The manual promote > >> process should no longer be needed, and instead bugs fixing elastic > >> recheck tracked issues will be promoted automatically. > >> > >> At the same time we'll also promote review on critical gate bugs through > >> making them visible in a number of different channels (like on elastic > >> recheck pages, review day, and in the gerrit dashboards). The idea here > >> again is to make the reviews that fix key bugs pop to the top of > >> everyone's views. 
> > > > In some of the harder gate bugs I've looked at (especially the infamous > > 'live snapshot' timeout bug), it has been damn hard to actually figure > > out what's wrong. AFAIK, no one has ever been able to reproduce it > > outside of the gate infrastructure. I've even gone as far as setting up > > identical Ubuntu VMs to the ones used in the gate on a local cloud, and > > running the tempest tests multiple times, but still can't reproduce what > > happens on the gate machines themselves :-( As such we're relying on > > code inspection and the collected log messages to try and figure out > > what might be wrong. > > > > The gate collects a lot of info and publishes it, but in this case I > > have found the published logs to be insufficient - I needed to get > > the more verbose libvirtd.log file. devstack has the ability to turn > > this on via an environment variable, but it is disabled by default > > because it would add 3% to the total size of logs collected per gate > > job. > > Right now we're at 95% full on 14 TB (which is the max # of volumes you > can attach to a single system in RAX), so every gig is sacred. There has > been a big push, which included the sprint last week in Darmstadt, to > get log data into swift, at which point our available storage goes way up. > > So for right now, we're a little squashed. Hopefully within a month > we'll have the full solution. > > As soon as we get those kinks out, I'd say we're in a position to flip > on that logging in devstack by default. I don't particularly mind if we don't have libvirtd.log verbose debugging enabled by default, if there were a way to turn it on for individual reviews we're debugging with. > > There's no way for me to get that environment variable for devstack > > turned on for a specific review I want to test with. 
In the end I > > uploaded a change to nova which abused rootwrap to elevate privileges, > > install extra deb packages, reconfigure libvirtd logging and restart > > the libvirtd daemon. > > > > > > https://review.openstack.org/#/c/103066/11/etc/nova/rootwrap.d/compute.filters > > https://review.openstack.org/#/c/103066/11/nova/virt/libvirt/driver.py > > > > This let me get further, but still not resolve it. My next attack is > > to build a custom QEMU binary and hack nova further so that it can > > download my custom QEMU binary from a website onto the gate machine > > and run the test with it. Failing that I'm going to be hacking things > > to try to attach to QEMU in the gate with GDB and get stack traces. > > Anything is doable thanks to rootwrap giving us a way to elevate > > privileges from Nova, but it is a somewhat tedious approach. > > > > I'd like us to think about whether there is anything we can do to make > > life easier in these kinds of hard debugging scenarios where the regular > > logs are not sufficient. > > Agreed. Honestly, though we do also need to figure out first fail > detection on our logs as well. Because realistically if we can't debug > failures from those, then I really don't understand how we're ever going > to expect large users to. Ultimately there's always going
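For readers unfamiliar with the rootwrap mechanism Daniel is abusing here: the linked compute.filters file is an INI-style list mapping command names to filter classes, and adding entries is what lets code running as the nova user execute things as root. A minimal sketch, with debug-only entries invented for illustration (the real ones are in the linked review):

```ini
# Sketch of a nova rootwrap filter file (etc/nova/rootwrap.d/*.filters).
# Each entry is: name: FilterClass, executable, user-to-run-as.
[Filters]
# Debug-only entries of the kind described above (illustrative, not
# copied from the linked change):
apt-get: CommandFilter, apt-get, root
service: CommandFilter, service, root
```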
Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward
On 07/24/2014 05:57 PM, Matthew Treinish wrote: > On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote: >> OpenStack has a substantial CI system that is core to its development >> process. The goals of the system are to facilitate merging good code, >> prevent regressions, and ensure that there is at least one configuration >> of upstream OpenStack that we know works as a whole. The "project >> gating" technique that we use is effective at preventing many kinds of >> regressions from landing, however more subtle, non-deterministic bugs >> can still get through, and these are the bugs that are currently >> plaguing developers with seemingly random test failures. >> >> Most of these bugs are not failures of the test system; they are real >> bugs. Many of them have even been in OpenStack for a long time, but are >> only becoming visible now due to improvements in our tests. That's not >> much help to developers whose patches are being hit with negative test >> results from unrelated failures. We need to find a way to address the >> non-deterministic bugs that are lurking in OpenStack without making it >> easier for new bugs to creep in. >> >> The CI system and project infrastructure are not static. They have >> evolved with the project to get to where they are today, and the >> challenge now is to continue to evolve them to address the problems >> we're seeing now. The QA and Infrastructure teams recently hosted a >> sprint where we discussed some of these issues in depth. This post from >> Sean Dague goes into a bit of the background: [1]. The rest of this >> email outlines the medium and long-term changes we would like to make to >> address these problems. >> >> [1] https://dague.net/2014/07/22/openstack-failures/ >> >> ==Things we're already doing== >> >> The elastic-recheck tool[2] is used to identify "random" failures in >> test runs. It tries to match failures to known bugs using signatures >> created from log messages. 
It helps developers prioritize bugs by how >> frequently they manifest as test failures. It also collects information >> on unclassified errors -- we can see how many (and which) test runs >> failed for an unknown reason and our overall progress on finding >> fingerprints for random failures. >> >> [2] http://status.openstack.org/elastic-recheck/ >> >> We added a feature to Zuul that lets us manually "promote" changes to >> the top of the Gate pipeline. When the QA team identifies a change that >> fixes a bug that is affecting overall gate stability, we can move that >> change to the top of the queue so that it may merge more quickly. >> >> We added the clean check facility in reaction to the January gate break >> down. While it does mean that any individual patch might see more tests >> run on it, it's now largely kept the gate queue at a countable number of >> hours, instead of regularly growing to more than a work day in >> length. It also means that a developer can Approve a code merge before >> tests have returned, and not ruin it for everyone else if there turned >> out to be a bug that the tests could catch. >> >> ==Future changes== >> >> ===Communication=== >> We used to be better at communicating about the CI system. As it and >> the project grew, we incrementally added to our institutional knowledge, >> but we haven't been good about maintaining that information in a form >> that new or existing contributors can consume to understand what's going >> on and why. >> >> We have started on a major effort in that direction that we call the >> "infra-manual" project -- it's designed to be a comprehensive "user >> manual" for the project infrastructure, including the CI process. Even >> before that project is complete, we will write a document that >> summarizes the CI system and ensure it is included in new developer >> documentation and linked to from test results. 
>> >> There are also a number of ways for people to get involved in the CI >> system, whether focused on Infrastructure or QA, but it is not always >> clear how to do so. We will improve our documentation to highlight how >> to contribute. >> >> ===Fixing Faster=== >> >> We introduce bugs to OpenStack at some constant rate, which piles up >> over time. Our systems currently treat all changes as equally risky and >> important to the health of the system, which makes landing code changes >> to fix key bugs slow when we're at a high reset rate. We've got a manual >> process of promoting changes today to get around this, but that's >> actually quite costly in people time, and takes getting all the right >> people together at once to promote changes. You can see a number of the >> changes we promoted during the gate storm in June [3], and it was no >> small number of fixes to get us back to a reasonably passing gate. We >> think that optimizing this system will help us land fixes to critical >> bugs faster. >> >> [3] https://etherpad.openstack.org/p/gatetriage-june2014 >> >> The basic idea is to use t
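The fingerprint idea in the quoted text is easy to sketch. The real elastic-recheck stores one Elasticsearch query per tracked bug; the toy version below just matches regex signatures against a console log (the bug numbers and log lines are invented):

```python
import re

# Known-bug "fingerprints": a regex that identifies each failure mode
# in a console log. Bug numbers here are purely illustrative.
SIGNATURES = {
    "bug/1234567": re.compile(r"libvirtError: Timed out during operation"),
    "bug/7654321": re.compile(r"SSHTimeout: Connection to .* timed out"),
}

def classify_failure(log_text):
    """Return the tracked bugs whose signature appears in the log."""
    return [bug for bug, sig in SIGNATURES.items() if sig.search(log_text)]

log = "2014-07-24 ERROR nova.virt: libvirtError: Timed out during operation"
print(classify_failure(log))  # → ['bug/1234567']
```

A log that matches nothing comes back as an empty list, which is exactly the "unclassified errors" bucket the quoted text describes tracking.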
Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward
On Jul 24, 2014, at 3:25 PM, Sean Dague wrote: > On 07/24/2014 06:15 PM, Angus Salkeld wrote: >> On Wed, 2014-07-23 at 14:39 -0700, James E. Blair wrote: >>> OpenStack has a substantial CI system that is core to its development >>> process. The goals of the system are to facilitate merging good code, >>> prevent regressions, and ensure that there is at least one configuration >>> of upstream OpenStack that we know works as a whole. The "project >>> gating" technique that we use is effective at preventing many kinds of >>> regressions from landing, however more subtle, non-deterministic bugs >>> can still get through, and these are the bugs that are currently >>> plaguing developers with seemingly random test failures. >>> >>> Most of these bugs are not failures of the test system; they are real >>> bugs. Many of them have even been in OpenStack for a long time, but are >>> only becoming visible now due to improvements in our tests. That's not >>> much help to developers whose patches are being hit with negative test >>> results from unrelated failures. We need to find a way to address the >>> non-deterministic bugs that are lurking in OpenStack without making it >>> easier for new bugs to creep in. >>> >>> The CI system and project infrastructure are not static. They have >>> evolved with the project to get to where they are today, and the >>> challenge now is to continue to evolve them to address the problems >>> we're seeing now. The QA and Infrastructure teams recently hosted a >>> sprint where we discussed some of these issues in depth. This post from >>> Sean Dague goes into a bit of the background: [1]. The rest of this >>> email outlines the medium and long-term changes we would like to make to >>> address these problems. >>> >>> [1] https://dague.net/2014/07/22/openstack-failures/ >>> >>> ==Things we're already doing== >>> >>> The elastic-recheck tool[2] is used to identify "random" failures in >>> test runs. 
It tries to match failures to known bugs using signatures >>> created from log messages. It helps developers prioritize bugs by how >>> frequently they manifest as test failures. It also collects information >>> on unclassified errors -- we can see how many (and which) test runs >>> failed for an unknown reason and our overall progress on finding >>> fingerprints for random failures. >>> >>> [2] http://status.openstack.org/elastic-recheck/ >>> >>> We added a feature to Zuul that lets us manually "promote" changes to >>> the top of the Gate pipeline. When the QA team identifies a change that >>> fixes a bug that is affecting overall gate stability, we can move that >>> change to the top of the queue so that it may merge more quickly. >>> >>> We added the clean check facility in reaction to the January gate break >>> down. While it does mean that any individual patch might see more tests >>> run on it, it's now largely kept the gate queue at a countable number of >>> hours, instead of regularly growing to more than a work day in >>> length. It also means that a developer can Approve a code merge before >>> tests have returned, and not ruin it for everyone else if there turned >>> out to be a bug that the tests could catch. >>> >>> ==Future changes== >>> >>> ===Communication=== >>> We used to be better at communicating about the CI system. As it and >>> the project grew, we incrementally added to our institutional knowledge, >>> but we haven't been good about maintaining that information in a form >>> that new or existing contributors can consume to understand what's going >>> on and why. >>> >>> We have started on a major effort in that direction that we call the >>> "infra-manual" project -- it's designed to be a comprehensive "user >>> manual" for the project infrastructure, including the CI process. 
Even >>> before that project is complete, we will write a document that >>> summarizes the CI system and ensure it is included in new developer >>> documentation and linked to from test results. >>> >>> There are also a number of ways for people to get involved in the CI >>> system, whether focused on Infrastructure or QA, but it is not always >>> clear how to do so. We will improve our documentation to highlight how >>> to contribute. >>> >>> ===Fixing Faster=== >>> >>> We introduce bugs to OpenStack at some constant rate, which piles up >>> over time. Our systems currently treat all changes as equally risky and >>> important to the health of the system, which makes landing code changes >>> to fix key bugs slow when we're at a high reset rate. We've got a manual >>> process of promoting changes today to get around this, but that's >>> actually quite costly in people time, and takes getting all the right >>> people together at once to promote changes. You can see a number of the >>> changes we promoted during the gate storm in June [3], and it was no >>> small number of fixes to get us back to a reasonably passing gate. We >>> think that optimizing this system will
Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward
On 07/24/2014 06:15 PM, Angus Salkeld wrote: > On Wed, 2014-07-23 at 14:39 -0700, James E. Blair wrote: >> OpenStack has a substantial CI system that is core to its development >> process. The goals of the system are to facilitate merging good code, >> prevent regressions, and ensure that there is at least one configuration >> of upstream OpenStack that we know works as a whole. The "project >> gating" technique that we use is effective at preventing many kinds of >> regressions from landing, however more subtle, non-deterministic bugs >> can still get through, and these are the bugs that are currently >> plaguing developers with seemingly random test failures. >> >> Most of these bugs are not failures of the test system; they are real >> bugs. Many of them have even been in OpenStack for a long time, but are >> only becoming visible now due to improvements in our tests. That's not >> much help to developers whose patches are being hit with negative test >> results from unrelated failures. We need to find a way to address the >> non-deterministic bugs that are lurking in OpenStack without making it >> easier for new bugs to creep in. >> >> The CI system and project infrastructure are not static. They have >> evolved with the project to get to where they are today, and the >> challenge now is to continue to evolve them to address the problems >> we're seeing now. The QA and Infrastructure teams recently hosted a >> sprint where we discussed some of these issues in depth. This post from >> Sean Dague goes into a bit of the background: [1]. The rest of this >> email outlines the medium and long-term changes we would like to make to >> address these problems. >> >> [1] https://dague.net/2014/07/22/openstack-failures/ >> >> ==Things we're already doing== >> >> The elastic-recheck tool[2] is used to identify "random" failures in >> test runs. It tries to match failures to known bugs using signatures >> created from log messages. 
It helps developers prioritize bugs by how >> frequently they manifest as test failures. It also collects information >> on unclassified errors -- we can see how many (and which) test runs >> failed for an unknown reason and our overall progress on finding >> fingerprints for random failures. >> >> [2] http://status.openstack.org/elastic-recheck/ >> >> We added a feature to Zuul that lets us manually "promote" changes to >> the top of the Gate pipeline. When the QA team identifies a change that >> fixes a bug that is affecting overall gate stability, we can move that >> change to the top of the queue so that it may merge more quickly. >> >> We added the clean check facility in reaction to the January gate break >> down. While it does mean that any individual patch might see more tests >> run on it, it's now largely kept the gate queue at a countable number of >> hours, instead of regularly growing to more than a work day in >> length. It also means that a developer can Approve a code merge before >> tests have returned, and not ruin it for everyone else if there turned >> out to be a bug that the tests could catch. >> >> ==Future changes== >> >> ===Communication=== >> We used to be better at communicating about the CI system. As it and >> the project grew, we incrementally added to our institutional knowledge, >> but we haven't been good about maintaining that information in a form >> that new or existing contributors can consume to understand what's going >> on and why. >> >> We have started on a major effort in that direction that we call the >> "infra-manual" project -- it's designed to be a comprehensive "user >> manual" for the project infrastructure, including the CI process. Even >> before that project is complete, we will write a document that >> summarizes the CI system and ensure it is included in new developer >> documentation and linked to from test results. 
>> >> There are also a number of ways for people to get involved in the CI >> system, whether focused on Infrastructure or QA, but it is not always >> clear how to do so. We will improve our documentation to highlight how >> to contribute. >> >> ===Fixing Faster=== >> >> We introduce bugs to OpenStack at some constant rate, which piles up >> over time. Our systems currently treat all changes as equally risky and >> important to the health of the system, which makes landing code changes >> to fix key bugs slow when we're at a high reset rate. We've got a manual >> process of promoting changes today to get around this, but that's >> actually quite costly in people time, and takes getting all the right >> people together at once to promote changes. You can see a number of the >> changes we promoted during the gate storm in June [3], and it was no >> small number of fixes to get us back to a reasonably passing gate. We >> think that optimizing this system will help us land fixes to critical >> bugs faster. >> >> [3] https://etherpad.openstack.org/p/gatetriage-june2014 >> >> The basic idea is to use the data fr
Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward
On Wed, 2014-07-23 at 14:39 -0700, James E. Blair wrote: > OpenStack has a substantial CI system that is core to its development > process. The goals of the system are to facilitate merging good code, > prevent regressions, and ensure that there is at least one configuration > of upstream OpenStack that we know works as a whole. The "project > gating" technique that we use is effective at preventing many kinds of > regressions from landing, however more subtle, non-deterministic bugs > can still get through, and these are the bugs that are currently > plaguing developers with seemingly random test failures. > > Most of these bugs are not failures of the test system; they are real > bugs. Many of them have even been in OpenStack for a long time, but are > only becoming visible now due to improvements in our tests. That's not > much help to developers whose patches are being hit with negative test > results from unrelated failures. We need to find a way to address the > non-deterministic bugs that are lurking in OpenStack without making it > easier for new bugs to creep in. > > The CI system and project infrastructure are not static. They have > evolved with the project to get to where they are today, and the > challenge now is to continue to evolve them to address the problems > we're seeing now. The QA and Infrastructure teams recently hosted a > sprint where we discussed some of these issues in depth. This post from > Sean Dague goes into a bit of the background: [1]. The rest of this > email outlines the medium and long-term changes we would like to make to > address these problems. > > [1] https://dague.net/2014/07/22/openstack-failures/ > > ==Things we're already doing== > > The elastic-recheck tool[2] is used to identify "random" failures in > test runs. It tries to match failures to known bugs using signatures > created from log messages. It helps developers prioritize bugs by how > frequently they manifest as test failures. 
It also collects information > on unclassified errors -- we can see how many (and which) test runs > failed for an unknown reason and our overall progress on finding > fingerprints for random failures. > > [2] http://status.openstack.org/elastic-recheck/ > > We added a feature to Zuul that lets us manually "promote" changes to > the top of the Gate pipeline. When the QA team identifies a change that > fixes a bug that is affecting overall gate stability, we can move that > change to the top of the queue so that it may merge more quickly. > > We added the clean check facility in reaction to the January gate break > down. While it does mean that any individual patch might see more tests > run on it, it's now largely kept the gate queue at a countable number of > hours, instead of regularly growing to more than a work day in > length. It also means that a developer can Approve a code merge before > tests have returned, and not ruin it for everyone else if there turned > out to be a bug that the tests could catch. > > ==Future changes== > > ===Communication=== > We used to be better at communicating about the CI system. As it and > the project grew, we incrementally added to our institutional knowledge, > but we haven't been good about maintaining that information in a form > that new or existing contributors can consume to understand what's going > on and why. > > We have started on a major effort in that direction that we call the > "infra-manual" project -- it's designed to be a comprehensive "user > manual" for the project infrastructure, including the CI process. Even > before that project is complete, we will write a document that > summarizes the CI system and ensure it is included in new developer > documentation and linked to from test results. > > There are also a number of ways for people to get involved in the CI > system, whether focused on Infrastructure or QA, but it is not always > clear how to do so. 
We will improve our documentation to highlight how > to contribute. > > ===Fixing Faster=== > > We introduce bugs to OpenStack at some constant rate, which piles up > over time. Our systems currently treat all changes as equally risky and > important to the health of the system, which makes landing code changes > to fix key bugs slow when we're at a high reset rate. We've got a manual > process of promoting changes today to get around this, but that's > actually quite costly in people time, and takes getting all the right > people together at once to promote changes. You can see a number of the > changes we promoted during the gate storm in June [3], and it was no > small number of fixes to get us back to a reasonably passing gate. We > think that optimizing this system will help us land fixes to critical > bugs faster. > > [3] https://etherpad.openstack.org/p/gatetriage-june2014 > > The basic idea is to use the data from elastic recheck to identify that > a patch is fixing a critical gate related bug. When one of these is > found in the q
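The automatic promotion described above boils down to a stable re-ordering of the gate queue. A toy sketch (Zuul's real data model is far richer; the dict shape and bug IDs are invented):

```python
# Changes that fix a bug tracked by elastic recheck sort ahead of
# everything else; Python's sort is stable, so relative order among
# ordinary changes is preserved. Bug IDs are illustrative only.
TRACKED_GATE_BUGS = {"1331913", "1254890"}

def promote(queue):
    """Gate-bug fixes bubble to the top; original order otherwise."""
    return sorted(queue,
                  key=lambda change: change["closes_bug"] not in TRACKED_GATE_BUGS)

queue = [
    {"id": "Iaaa", "closes_bug": None},
    {"id": "Ibbb", "closes_bug": "1331913"},
    {"id": "Iccc", "closes_bug": None},
]
print([c["id"] for c in promote(queue)])  # → ['Ibbb', 'Iaaa', 'Iccc']
```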
Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward
On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote: > OpenStack has a substantial CI system that is core to its development > process. The goals of the system are to facilitate merging good code, > prevent regressions, and ensure that there is at least one configuration > of upstream OpenStack that we know works as a whole. The "project > gating" technique that we use is effective at preventing many kinds of > regressions from landing, however more subtle, non-deterministic bugs > can still get through, and these are the bugs that are currently > plaguing developers with seemingly random test failures. > > Most of these bugs are not failures of the test system; they are real > bugs. Many of them have even been in OpenStack for a long time, but are > only becoming visible now due to improvements in our tests. That's not > much help to developers whose patches are being hit with negative test > results from unrelated failures. We need to find a way to address the > non-deterministic bugs that are lurking in OpenStack without making it > easier for new bugs to creep in. > > The CI system and project infrastructure are not static. They have > evolved with the project to get to where they are today, and the > challenge now is to continue to evolve them to address the problems > we're seeing now. The QA and Infrastructure teams recently hosted a > sprint where we discussed some of these issues in depth. This post from > Sean Dague goes into a bit of the background: [1]. The rest of this > email outlines the medium and long-term changes we would like to make to > address these problems. > > [1] https://dague.net/2014/07/22/openstack-failures/ > > ==Things we're already doing== > > The elastic-recheck tool[2] is used to identify "random" failures in > test runs. It tries to match failures to known bugs using signatures > created from log messages. It helps developers prioritize bugs by how > frequently they manifest as test failures. 
It also collects information > on unclassified errors -- we can see how many (and which) test runs > failed for an unknown reason and our overall progress on finding > fingerprints for random failures. > > [2] http://status.openstack.org/elastic-recheck/ > > We added a feature to Zuul that lets us manually "promote" changes to > the top of the Gate pipeline. When the QA team identifies a change that > fixes a bug that is affecting overall gate stability, we can move that > change to the top of the queue so that it may merge more quickly. > > We added the clean check facility in reaction to the January gate break > down. While it does mean that any individual patch might see more tests > run on it, it's now largely kept the gate queue at a countable number of > hours, instead of regularly growing to more than a work day in > length. It also means that a developer can Approve a code merge before > tests have returned, and not ruin it for everyone else if there turned > out to be a bug that the tests could catch. > > ==Future changes== > > ===Communication=== > We used to be better at communicating about the CI system. As it and > the project grew, we incrementally added to our institutional knowledge, > but we haven't been good about maintaining that information in a form > that new or existing contributors can consume to understand what's going > on and why. > > We have started on a major effort in that direction that we call the > "infra-manual" project -- it's designed to be a comprehensive "user > manual" for the project infrastructure, including the CI process. Even > before that project is complete, we will write a document that > summarizes the CI system and ensure it is included in new developer > documentation and linked to from test results. > > There are also a number of ways for people to get involved in the CI > system, whether focused on Infrastructure or QA, but it is not always > clear how to do so. 
We will improve our documentation to highlight how > to contribute. > > ===Fixing Faster=== > > We introduce bugs to OpenStack at some constant rate, which piles up > over time. Our systems currently treat all changes as equally risky and > important to the health of the system, which makes landing code changes > to fix key bugs slow when we're at a high reset rate. We've got a manual > process of promoting changes today to get around this, but that's > actually quite costly in people time, and takes getting all the right > people together at once to promote changes. You can see a number of the > changes we promoted during the gate storm in June [3], and it was no > small number of fixes to get us back to a reasonably passing gate. We > think that optimizing this system will help us land fixes to critical > bugs faster. > > [3] https://etherpad.openstack.org/p/gatetriage-june2014 > > The basic idea is to use the data from elastic recheck to identify that > a patch is fixing a critical gate related bug. When one of these is > found i
Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward
On Jul 24, 2014, at 12:54 PM, Sean Dague wrote: > On 07/24/2014 02:51 PM, Joshua Harlow wrote: >> A potentially brilliant idea ;-) >> >> Aren't all the machines the gate runs tests on VMs running via OpenStack >> APIs? >> >> OpenStack supports snapshotting (last time I checked). So instead of >> providing back a whole bunch of log files, provide back a snapshot of the >> machine/s that ran the tests; let the person who wants that snapshot >> download it (and then they can boot it up into virtualbox, qemu, their own >> OpenStack cloud...) and investigate all the log files they desire. >> >> Are we really being so conservative on space that we couldn't do this? I >> find it hard to believe that space is a concern for anything anymore (if it >> really matters store the snapshots in ceph, or glusterfs, swift, or >> something else... which should dedup the blocks). This is pretty common with >> how people use snapshots and what they back them with anyway so it would be >> nice if infra exposed the same thing... >> >> Would something like that be possible? I'm not so familiar with all the >> inner workings of the infra project; but if it eventually boots VMs using an >> OpenStack cloud, it would seem reasonable that it could provide the same >> mechanisms we are all already used to using... >> >> Thoughts? > > There are actual space concerns. Especially when we're talking about 20k > runs / week. At which point snapshots are probably in the neighborhood > of 10G, so we're talking about 200 TB / week of storage. Plus there are > actual technical details of the fact that glance endpoints are really > quite beta in the clouds we use. Remember our test runs aren't pets, > they are cattle, we need to figure out the right distillation of data > and move on, as there isn't enough space or time to keep everything around. Sure, not pets... save only the failing ones then (the broken cattle)? 
Is 200 TB/week really how much would actually be stored when ceph or another data-deduping backend is used? Does rackspace or HP (the VM providers for infra afaik) do this, or use a similar deduping technology for storing snapshots? I agree with right distillation and maybe it's not always needed, but it would/could be nice to have a button on gerrit that you could activate within a certain amount of time after the run to get all the images that the VMs used during the tests (yes the download would likely be huge) if you really want to set up the exact same environment that the test failed with. Maybe have that button expire after a week (then you only need 200 TB of *expiring* space). > Also portability of system images is... limited between hypervisors. > > If this is something you'd like to see if you could figure out the hard > parts of, I invite you to dive in on the infra side. It's very easy to > say it's easy. :) Actually coming up with a workable solution requires a > ton more time and energy. Of course, that goes without saying, I guess I thought this is a ML for discussions and thoughts (in part the 'thought' part of this subject) and need not be a solution off the bat. Just an idea anyway... > > -Sean > > -- > Sean Dague > http://dague.net
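Sean's 200 TB figure is straightforward arithmetic from the numbers in his reply, and Joshua's expiring, failures-only variant can be estimated the same way (the failure rate below is an assumption for illustration, not a number from the thread):

```python
# Numbers from the thread: ~20k gate runs per week, snapshots
# "in the neighborhood of 10G" each.
runs_per_week = 20_000
snapshot_gb = 10
total_tb_per_week = runs_per_week * snapshot_gb / 1000
print(total_tb_per_week)  # → 200.0 (TB/week, matching Sean's estimate)

# Failures-only, expiring after a week, at an *assumed* 10% failure rate:
failing_tb = total_tb_per_week * 0.10
print(failing_tb)  # → 20.0 (TB of expiring space)
```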
Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward
On 07/24/2014 12:40 PM, Daniel P. Berrange wrote: > On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote: > >> ==Future changes== > >> ===Fixing Faster=== >> >> We introduce bugs to OpenStack at some constant rate, which piles up >> over time. Our systems currently treat all changes as equally risky and >> important to the health of the system, which makes landing code changes >> to fix key bugs slow when we're at a high reset rate. We've got a manual >> process of promoting changes today to get around this, but that's >> actually quite costly in people time, and takes getting all the right >> people together at once to promote changes. You can see a number of the >> changes we promoted during the gate storm in June [3], and it was no >> small number of fixes to get us back to a reasonably passing gate. We >> think that optimizing this system will help us land fixes to critical >> bugs faster. >> >> [3] https://etherpad.openstack.org/p/gatetriage-june2014 >> >> The basic idea is to use the data from elastic recheck to identify that >> a patch is fixing a critical gate related bug. When one of these is >> found in the queues it will be given higher priority, including bubbling >> up to the top of the gate queue automatically. The manual promote >> process should no longer be needed, and instead bugs fixing elastic >> recheck tracked issues will be promoted automatically. >> >> At the same time we'll also promote review on critical gate bugs through >> making them visible in a number of different channels (like on elastic >> recheck pages, review day, and in the gerrit dashboards). The idea here >> again is to make the reviews that fix key bugs pop to the top of >> everyone's views. > > In some of the harder gate bugs I've looked at (especially the infamous > 'live snapshot' timeout bug), it has been damn hard to actually figure > out what's wrong. AFAIK, no one has ever been able to reproduce it > outside of the gate infrastructure. 
I've even gone as far as setting up > identical Ubuntu VMs to the ones used in the gate on a local cloud, and > running the tempest tests multiple times, but still can't reproduce what > happens on the gate machines themselves :-( As such we're relying on > code inspection and the collected log messages to try and figure out > what might be wrong. > > The gate collects alot of info and publishes it, but in this case I > have found the published logs to be insufficient - I needed to get > the more verbose libvirtd.log file. devstack has the ability to turn > this on via an environment variable, but it is disabled by default > because it would add 3% to the total size of logs collected per gate > job. Right now we're at 95% full on 14 TB (which is the max # of volumes you can attach to a single system in RAX), so every gig is sacred. There has been a big push, which included the sprint last week in Darmstadt, to get log data into swift, at which point our available storage goes way up. So for right now, we're a little squashed. Hopefully within a month we'll have the full solution. As soon as we get those kinks out, I'd say we're in a position to flip on that logging in devstack by default. > There's no way for me to get that environment variable for devstack > turned on for a specific review I want to test with. In the end I > uploaded a change to nova which abused rootwrap to elevate privileges, > install extra deb packages, reconfigure libvirtd logging and restart > the libvirtd daemon. > > > https://review.openstack.org/#/c/103066/11/etc/nova/rootwrap.d/compute.filters > https://review.openstack.org/#/c/103066/11/nova/virt/libvirt/driver.py > > This let me get further, but still not resolve it. My next attack is > to build a custom QEMU binary and hack nova further so that it can > download my custom QEMU binary from a website onto the gate machine > and run the test with it. 
Failing that I'm going to be hacking things > to try to attach to QEMU in the gate with GDB and get stack traces. > Anything is doable thanks to rootwrap giving us a way to elevate > privileges from Nova, but it is a somewhat tedious approach. > > I'd like us to think about whether they is anything we can do to make > life easier in these kind of hard debugging scenarios where the regular > logs are not sufficient. Agreed. Honestly, though we do also need to figure out first fail detection on our logs as well. Because realistically if we can't debug failures from those, then I really don't understand how we're ever going to expect large users to. -Sean -- Sean Dague http://dague.net ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
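Sean's "first fail detection" could start as something very simple: scan the collected logs and surface the earliest ERROR-level line. A rough sketch, assuming an oslo-style `timestamp [pid] LEVEL message` layout (real gate logs vary by service, so the regex is illustrative):

```python
import re
from datetime import datetime

# Matches an oslo-style prefix: timestamp, optional pid, level, message.
LINE = re.compile(
    r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.?\d* (?:\d+ )?([A-Z]+) (.*)$')

def first_failure(lines):
    """Return (timestamp, message) for the earliest ERROR line, or None."""
    earliest = None
    for line in lines:
        m = LINE.match(line)
        if not m or m.group(2) != 'ERROR':
            continue
        stamp = datetime.strptime(m.group(1), '%Y-%m-%d %H:%M:%S')
        if earliest is None or stamp < earliest[0]:
            earliest = (stamp, m.group(3))
    return earliest

# Hypothetical log fragment for illustration:
log = [
    '2014-07-24 12:00:03.612 4242 INFO nova.compute starting instance',
    '2014-07-24 12:00:05.101 4242 ERROR nova.virt.libvirt live snapshot timed out',
    '2014-07-24 12:00:09.330 4242 ERROR nova.compute instance failed to spawn',
]
print(first_failure(log)[1])   # -> nova.virt.libvirt live snapshot timed out
```

The hard part, of course, is not finding the first ERROR but deciding which errors are the root cause rather than downstream noise; that is what elastic-recheck's fingerprinting already attempts.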
Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward
On 07/24/2014 02:51 PM, Joshua Harlow wrote: > A potentially brilliant idea ;-) > > Aren't all the machines the gate runs tests on VMs running via OpenStack APIs? > > OpenStack supports snapshotting (last time I checked). So instead of > providing back a whole bunch of log files, provide back a snapshot of the > machine/s that ran the tests; let person who wants to download that snapshot > download it (and then they can boot it up into virtualbox, qemu, there own > OpenStack cloud...) and investigate all the log files they desire. > > Are we really being so conservative on space that we couldn't do this? I find > it hard to believe that space is a concern for anything anymore (if it really > matters store the snapshots in ceph, or glusterfs, swift, or something > else... which should dedup the blocks). This is pretty common with how people > use snapshots and what they back them with anyway so it would be nice if > infra exposed the same thing... > > Would something like that be possible? I'm not so familiar with all the inner > workings of the infra project; but if it eventually boots VMs using an > OpenStack cloud, it would seem reasonable that it could provide the same > mechanisms we are all already used to using... > > Thoughts? There are actual space concerns, especially when we're talking about 20k runs / week. At that point snapshots are probably in the neighborhood of 10G, so we're talking about 200 TB / week of storage. Plus there are the actual technical details of the fact that glance endpoints are really quite beta in the clouds we use. Remember our test runs aren't pets, they are cattle; we need to figure out the right distillation of data and move on, as there isn't enough space or time to keep everything around. Also portability of system images is... limited between hypervisors. If this is something you'd like to see if you could figure out the hard parts of, I invite you to dive in on the infra side. It's very easy to say it's easy. 
:) Actually coming up with a workable solution requires a ton more time and energy. -Sean -- Sean Dague http://dague.net ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
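Whether dedup would actually eat into the 200 TB figure depends on how much block-level overlap the snapshots share, which could be measured rather than argued about. A toy sketch of content-hash dedup (the block size and sample images are purely illustrative, not how infra's backends actually chunk data):

```python
import hashlib

BLOCK = 4096  # 4 KiB chunks for the sketch; real backends use larger extents

def dedup_stats(images):
    """Return (unique_blocks, total_blocks) across a set of image byte strings."""
    seen = set()
    total = 0
    for data in images:
        for i in range(0, len(data), BLOCK):
            total += 1
            seen.add(hashlib.sha256(data[i:i + BLOCK]).hexdigest())
    return len(seen), total

# Two images from near-identical devstack runs share most of their blocks:
base = bytes(8 * BLOCK)                    # stand-in for a common base image
variant = bytes(7 * BLOCK) + b'x' * BLOCK  # one block of run-specific data
unique, total = dedup_stats([base, variant])
print(unique, total)   # 2 unique blocks stored out of 16 written
```

If gate snapshots really are mostly-identical devstack images plus a thin layer of run-specific logs and state, the stored-bytes figure would be far below the raw 200 TB; if each hypervisor scrambles the image layout, it wouldn't. Nobody in the thread has the measurement.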
Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward
On Jul 24, 2014, at 12:08 PM, Anita Kuno wrote: > On 07/24/2014 12:40 PM, Daniel P. Berrange wrote: >> On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote: >> >>> ==Future changes== >> >>> ===Fixing Faster=== >>> >>> We introduce bugs to OpenStack at some constant rate, which piles up >>> over time. Our systems currently treat all changes as equally risky and >>> important to the health of the system, which makes landing code changes >>> to fix key bugs slow when we're at a high reset rate. We've got a manual >>> process of promoting changes today to get around this, but that's >>> actually quite costly in people time, and takes getting all the right >>> people together at once to promote changes. You can see a number of the >>> changes we promoted during the gate storm in June [3], and it was no >>> small number of fixes to get us back to a reasonably passing gate. We >>> think that optimizing this system will help us land fixes to critical >>> bugs faster. >>> >>> [3] https://etherpad.openstack.org/p/gatetriage-june2014 >>> >>> The basic idea is to use the data from elastic recheck to identify that >>> a patch is fixing a critical gate related bug. When one of these is >>> found in the queues it will be given higher priority, including bubbling >>> up to the top of the gate queue automatically. The manual promote >>> process should no longer be needed, and instead bugs fixing elastic >>> recheck tracked issues will be promoted automatically. >>> >>> At the same time we'll also promote review on critical gate bugs through >>> making them visible in a number of different channels (like on elastic >>> recheck pages, review day, and in the gerrit dashboards). The idea here >>> again is to make the reviews that fix key bugs pop to the top of >>> everyone's views. >> >> In some of the harder gate bugs I've looked at (especially the infamous >> 'live snapshot' timeout bug), it has been damn hard to actually figure >> out what's wrong. 
AFAIK, no one has ever been able to reproduce it >> outside of the gate infrastructure. I've even gone as far as setting up >> identical Ubuntu VMs to the ones used in the gate on a local cloud, and >> running the tempest tests multiple times, but still can't reproduce what >> happens on the gate machines themselves :-( As such we're relying on >> code inspection and the collected log messages to try and figure out >> what might be wrong. >> >> The gate collects alot of info and publishes it, but in this case I >> have found the published logs to be insufficient - I needed to get >> the more verbose libvirtd.log file. devstack has the ability to turn >> this on via an environment variable, but it is disabled by default >> because it would add 3% to the total size of logs collected per gate >> job. >> >> There's no way for me to get that environment variable for devstack >> turned on for a specific review I want to test with. In the end I >> uploaded a change to nova which abused rootwrap to elevate privileges, >> install extra deb packages, reconfigure libvirtd logging and restart >> the libvirtd daemon. >> >> >> https://review.openstack.org/#/c/103066/11/etc/nova/rootwrap.d/compute.filters >> https://review.openstack.org/#/c/103066/11/nova/virt/libvirt/driver.py >> >> This let me get further, but still not resolve it. My next attack is >> to build a custom QEMU binary and hack nova further so that it can >> download my custom QEMU binary from a website onto the gate machine >> and run the test with it. Failing that I'm going to be hacking things >> to try to attach to QEMU in the gate with GDB and get stack traces. >> Anything is doable thanks to rootwrap giving us a way to elevate >> privileges from Nova, but it is a somewhat tedious approach. >> >> I'd like us to think about whether they is anything we can do to make >> life easier in these kind of hard debugging scenarios where the regular >> logs are not sufficient. 
>> >> Regards, >> Daniel >> > For really really difficult bugs that can't be reproduced outside the > gate, we do have the ability to hold vms if we know they are > displaying the bug, if they are caught before the vm in question is > scheduled for deletion. In this case, make your intentions known in a > discussion with a member of infra-root. A conversation will ensue > involving what to do to get you what you need to continue debugging. > Why? Is space really that expensive? It boggles my mind a little that we have a well-financed foundation (afaik, correct me if I am wrong...) yet can't save 'all' the things in a smart manner (saving all the VM snapshots doesn't mean saving hundreds/thousands of gigabytes when you are using de-duping cinder/glance... backends). Expire those VMs after a week if that helps, but it feels like we shouldn't be so conservative about developers' needs to have access to all the VMs that the gate used/created..., it's not like developers are trying to 'harm' OpenStack by investigating root
Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward
On 07/24/2014 12:40 PM, Daniel P. Berrange wrote: > On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote: > >> ==Future changes== > >> ===Fixing Faster=== >> >> We introduce bugs to OpenStack at some constant rate, which piles up >> over time. Our systems currently treat all changes as equally risky and >> important to the health of the system, which makes landing code changes >> to fix key bugs slow when we're at a high reset rate. We've got a manual >> process of promoting changes today to get around this, but that's >> actually quite costly in people time, and takes getting all the right >> people together at once to promote changes. You can see a number of the >> changes we promoted during the gate storm in June [3], and it was no >> small number of fixes to get us back to a reasonably passing gate. We >> think that optimizing this system will help us land fixes to critical >> bugs faster. >> >> [3] https://etherpad.openstack.org/p/gatetriage-june2014 >> >> The basic idea is to use the data from elastic recheck to identify that >> a patch is fixing a critical gate related bug. When one of these is >> found in the queues it will be given higher priority, including bubbling >> up to the top of the gate queue automatically. The manual promote >> process should no longer be needed, and instead bugs fixing elastic >> recheck tracked issues will be promoted automatically. >> >> At the same time we'll also promote review on critical gate bugs through >> making them visible in a number of different channels (like on elastic >> recheck pages, review day, and in the gerrit dashboards). The idea here >> again is to make the reviews that fix key bugs pop to the top of >> everyone's views. > > In some of the harder gate bugs I've looked at (especially the infamous > 'live snapshot' timeout bug), it has been damn hard to actually figure > out what's wrong. AFAIK, no one has ever been able to reproduce it > outside of the gate infrastructure. 
I've even gone as far as setting up > identical Ubuntu VMs to the ones used in the gate on a local cloud, and > running the tempest tests multiple times, but still can't reproduce what > happens on the gate machines themselves :-( As such we're relying on > code inspection and the collected log messages to try and figure out > what might be wrong. > > The gate collects alot of info and publishes it, but in this case I > have found the published logs to be insufficient - I needed to get > the more verbose libvirtd.log file. devstack has the ability to turn > this on via an environment variable, but it is disabled by default > because it would add 3% to the total size of logs collected per gate > job. > > There's no way for me to get that environment variable for devstack > turned on for a specific review I want to test with. In the end I > uploaded a change to nova which abused rootwrap to elevate privileges, > install extra deb packages, reconfigure libvirtd logging and restart > the libvirtd daemon. > > > https://review.openstack.org/#/c/103066/11/etc/nova/rootwrap.d/compute.filters > https://review.openstack.org/#/c/103066/11/nova/virt/libvirt/driver.py > > This let me get further, but still not resolve it. My next attack is > to build a custom QEMU binary and hack nova further so that it can > download my custom QEMU binary from a website onto the gate machine > and run the test with it. Failing that I'm going to be hacking things > to try to attach to QEMU in the gate with GDB and get stack traces. > Anything is doable thanks to rootwrap giving us a way to elevate > privileges from Nova, but it is a somewhat tedious approach. > > I'd like us to think about whether they is anything we can do to make > life easier in these kind of hard debugging scenarios where the regular > logs are not sufficient. 
> > Regards, > Daniel > For really really difficult bugs that can't be reproduced outside the gate, we do have the ability to hold VMs if we know they are displaying the bug, provided they are caught before the VM in question is scheduled for deletion. In this case, make your intentions known in a discussion with a member of infra-root. A conversation will ensue involving what to do to get you what you need to continue debugging. It doesn't work in all cases, but some have found it helpful. Keep in mind you will be asked to demonstrate you have tried all other avenues before this one is exercised. Thanks, Anita. ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward
A potentially brilliant idea ;-) Aren't all the machines the gate runs tests on VMs running via OpenStack APIs? OpenStack supports snapshotting (last time I checked). So instead of providing back a whole bunch of log files, provide back a snapshot of the machine/s that ran the tests; let the person who wants to download that snapshot download it (and then they can boot it up into VirtualBox, qemu, their own OpenStack cloud...) and investigate all the log files they desire. Are we really being so conservative on space that we couldn't do this? I find it hard to believe that space is a concern for anything anymore (if it really matters, store the snapshots in ceph, or glusterfs, swift, or something else... which should dedup the blocks). This is pretty common with how people use snapshots and what they back them with anyway, so it would be nice if infra exposed the same thing... Would something like that be possible? I'm not so familiar with all the inner workings of the infra project; but if it eventually boots VMs using an OpenStack cloud, it would seem reasonable that it could provide the same mechanisms we are all already used to using... Thoughts? On Jul 24, 2014, at 9:40 AM, Daniel P. Berrange wrote: > On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote: > >> ==Future changes== > >> ===Fixing Faster=== >> >> We introduce bugs to OpenStack at some constant rate, which piles up >> over time. Our systems currently treat all changes as equally risky and >> important to the health of the system, which makes landing code changes >> to fix key bugs slow when we're at a high reset rate. We've got a manual >> process of promoting changes today to get around this, but that's >> actually quite costly in people time, and takes getting all the right >> people together at once to promote changes. You can see a number of the >> changes we promoted during the gate storm in June [3], and it was no >> small number of fixes to get us back to a reasonably passing gate. 
We >> think that optimizing this system will help us land fixes to critical >> bugs faster. >> >> [3] https://etherpad.openstack.org/p/gatetriage-june2014 >> >> The basic idea is to use the data from elastic recheck to identify that >> a patch is fixing a critical gate related bug. When one of these is >> found in the queues it will be given higher priority, including bubbling >> up to the top of the gate queue automatically. The manual promote >> process should no longer be needed, and instead bugs fixing elastic >> recheck tracked issues will be promoted automatically. >> >> At the same time we'll also promote review on critical gate bugs through >> making them visible in a number of different channels (like on elastic >> recheck pages, review day, and in the gerrit dashboards). The idea here >> again is to make the reviews that fix key bugs pop to the top of >> everyone's views. > > In some of the harder gate bugs I've looked at (especially the infamous > 'live snapshot' timeout bug), it has been damn hard to actually figure > out what's wrong. AFAIK, no one has ever been able to reproduce it > outside of the gate infrastructure. I've even gone as far as setting up > identical Ubuntu VMs to the ones used in the gate on a local cloud, and > running the tempest tests multiple times, but still can't reproduce what > happens on the gate machines themselves :-( As such we're relying on > code inspection and the collected log messages to try and figure out > what might be wrong. > > The gate collects alot of info and publishes it, but in this case I > have found the published logs to be insufficient - I needed to get > the more verbose libvirtd.log file. devstack has the ability to turn > this on via an environment variable, but it is disabled by default > because it would add 3% to the total size of logs collected per gate > job. > > There's no way for me to get that environment variable for devstack > turned on for a specific review I want to test with. 
In the end I > uploaded a change to nova which abused rootwrap to elevate privileges, > install extra deb packages, reconfigure libvirtd logging and restart > the libvirtd daemon. > > > https://review.openstack.org/#/c/103066/11/etc/nova/rootwrap.d/compute.filters > https://review.openstack.org/#/c/103066/11/nova/virt/libvirt/driver.py > > This let me get further, but still not resolve it. My next attack is > to build a custom QEMU binary and hack nova further so that it can > download my custom QEMU binary from a website onto the gate machine > and run the test with it. Failing that I'm going to be hacking things > to try to attach to QEMU in the gate with GDB and get stack traces. > Anything is doable thanks to rootwrap giving us a way to elevate > privileges from Nova, but it is a somewhat tedious approach. > > I'd like us to think about whether they is anything we can do to make > life easier in these kind of hard debugging scenarios where the regular > logs are no
Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward
On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote: > ==Future changes== > ===Fixing Faster=== > > We introduce bugs to OpenStack at some constant rate, which piles up > over time. Our systems currently treat all changes as equally risky and > important to the health of the system, which makes landing code changes > to fix key bugs slow when we're at a high reset rate. We've got a manual > process of promoting changes today to get around this, but that's > actually quite costly in people time, and takes getting all the right > people together at once to promote changes. You can see a number of the > changes we promoted during the gate storm in June [3], and it was no > small number of fixes to get us back to a reasonably passing gate. We > think that optimizing this system will help us land fixes to critical > bugs faster. > > [3] https://etherpad.openstack.org/p/gatetriage-june2014 > > The basic idea is to use the data from elastic recheck to identify that > a patch is fixing a critical gate related bug. When one of these is > found in the queues it will be given higher priority, including bubbling > up to the top of the gate queue automatically. The manual promote > process should no longer be needed, and instead bugs fixing elastic > recheck tracked issues will be promoted automatically. > > At the same time we'll also promote review on critical gate bugs through > making them visible in a number of different channels (like on elastic > recheck pages, review day, and in the gerrit dashboards). The idea here > again is to make the reviews that fix key bugs pop to the top of > everyone's views. In some of the harder gate bugs I've looked at (especially the infamous 'live snapshot' timeout bug), it has been damn hard to actually figure out what's wrong. AFAIK, no one has ever been able to reproduce it outside of the gate infrastructure. 
I've even gone as far as setting up identical Ubuntu VMs to the ones used in the gate on a local cloud, and running the tempest tests multiple times, but still can't reproduce what happens on the gate machines themselves :-( As such we're relying on code inspection and the collected log messages to try and figure out what might be wrong. The gate collects a lot of info and publishes it, but in this case I have found the published logs to be insufficient; I needed to get the more verbose libvirtd.log file. devstack has the ability to turn this on via an environment variable, but it is disabled by default because it would add 3% to the total size of logs collected per gate job. There's no way for me to get that environment variable for devstack turned on for a specific review I want to test with. In the end I uploaded a change to nova which abused rootwrap to elevate privileges, install extra deb packages, reconfigure libvirtd logging and restart the libvirtd daemon. https://review.openstack.org/#/c/103066/11/etc/nova/rootwrap.d/compute.filters https://review.openstack.org/#/c/103066/11/nova/virt/libvirt/driver.py This let me get further, but still not resolve it. My next attack is to build a custom QEMU binary and hack nova further so that it can download my custom QEMU binary from a website onto the gate machine and run the test with it. Failing that I'm going to be hacking things to try to attach to QEMU in the gate with GDB and get stack traces. Anything is doable thanks to rootwrap giving us a way to elevate privileges from Nova, but it is a somewhat tedious approach. I'd like us to think about whether there is anything we can do to make life easier in these kinds of hard debugging scenarios where the regular logs are not sufficient. 
Regards, Daniel -- |: http://berrange.com -o-http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :| ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
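Daniel's GDB fallback can be scripted: gdb's batch mode will dump every thread's stack from a running process without an interactive session. A sketch of the invocation (the helper names are mine; actually running it on a gate node of course requires gdb installed, root privileges, and a live qemu process):

```python
import subprocess

def gdb_bt_cmd(pid):
    """Build the argv for a non-interactive all-thread backtrace of one process."""
    return ['gdb', '-p', str(pid), '-batch', '-ex', 'thread apply all bt']

def qemu_backtraces():
    """Grab stack traces from every running qemu process (needs root + gdb)."""
    pids = subprocess.run(['pgrep', 'qemu'],
                          capture_output=True, text=True).stdout.split()
    return {pid: subprocess.run(gdb_bt_cmd(pid),
                                capture_output=True, text=True).stdout
            for pid in pids}
```

Something like this, triggered from the rootwrap escape hatch Daniel already built, would capture the hang state at the moment the timeout fires rather than after the VM is recycled.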
Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward
On 07/23/2014 05:39 PM, James E. Blair wrote: > ==Final thoughts== > > The current rate of test failures and subsequent rechecks is not > sustainable in the long term. It's not good for contributors, > reviewers, or the overall project quality. While these bugs do need to > be addressed, it's unlikely that the current process will cause that to > happen. Instead, we want to push more substantial testing into the > projects themselves with functional and interface testing, and depend > less on devstack-gate integration tests to catch all bugs. This should > help us catch bugs closer to the source and in an environment where > debugging is easier. We also want to reduce the scope of devstack gate > tests to a gold standard while running tests of other configurations in > a traditional CI process so that people interested in those > configurations can focus on ensuring they work. Very nice writeup. I think these steps sound like a positive way forward. Thanks! -- Russell Bryant ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward
On 07/24/2014 06:06 AM, Chmouel Boudjnah wrote: > Hello, > > Thanks for writing this summary, I like all those ideas and thanks working > hard on fixing this. > >> * For all non gold standard configurations, we'll dedicate a part of >> our infrastructure to running them in a continuous background loop, >> as well as making these configs available as experimental jobs. The >> idea here is that we'll actually be able to provide more >> configurations that are operating in a more traditional CI (post >> merge) context. People that are interested in keeping these bits >> functional can monitor those jobs and help with fixes when needed. >> The experimental jobs mean that if developers are concerned about >> the effect of a particular change on one of these configs, it's easy >> to request a pre-merge test run. In the near term we might imagine >> this would allow for things like ceph, mongodb, docker, and possibly >> very new libvirt to be validated in some way upstream. > > What about external CI ? is external CI would need to be post merge or > still stay as is ? what would be the difference between external CI > plugging on review changes and post CI merges? External CI is *really* supposed to be for things that Infrastructure can't or won't run (for technical or policy reasons). VMWare isn't open source, so that would always need to be outside of infra. Xen is something that there remains technical challenges on to get working in infra, but I think everyone would like to see it there eventually. Overall capacity and randomness issues means we can't do all these configs in a pre-merge context. But moving to a fixed capacity post merge world means we could create a ton of test data for these configurations. -Sean -- Sean Dague http://dague.net ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward
Hello, Thanks for writing this summary. I like all those ideas, and thanks for working hard on fixing this. > * For all non gold standard configurations, we'll dedicate a part of > our infrastructure to running them in a continuous background loop, > as well as making these configs available as experimental jobs. The > idea here is that we'll actually be able to provide more > configurations that are operating in a more traditional CI (post > merge) context. People that are interested in keeping these bits > functional can monitor those jobs and help with fixes when needed. > The experimental jobs mean that if developers are concerned about > the effect of a particular change on one of these configs, it's easy > to request a pre-merge test run. In the near term we might imagine > this would allow for things like ceph, mongodb, docker, and possibly > very new libvirt to be validated in some way upstream. What about external CI? Would external CI need to be post-merge, or stay as is? What would be the difference between external CI plugging into review changes and post-merge CI? Chmouel ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev