On 06/13/2014 08:13 AM, Mark McLoughlin wrote:
> On Fri, 2014-06-13 at 07:31 -0400, Sean Dague wrote:
>> On 06/13/2014 02:36 AM, Mark McLoughlin wrote:
>>> On Thu, 2014-06-12 at 22:10 -0400, Dan Prince wrote:
>>>> On Thu, 2014-06-12 at 08:06 -0400, Sean Dague wrote:
>>>>> We're definitely deep into capacity issues, so it's going to be
>>>>> time to start making tougher decisions about things we decide
>>>>> aren't different enough to bother testing on every commit.
>>>>
>>>> In order to save resources, why not combine some of the jobs in
>>>> different ways? For example, instead of:
>>>>
>>>> check-tempest-dsvm-full
>>>> check-tempest-dsvm-postgres-full
>>>>
>>>> couldn't we just drop the postgres-full job and run one of the
>>>> Neutron jobs w/ postgres instead? Or something similar. So long as
>>>> at least one of the jobs that runs most of Tempest is using
>>>> PostgreSQL, I think we'd be mostly fine. Not shooting for 100%
>>>> coverage of everything with our limited resource pool is fine;
>>>> let's just do the best we can.
>>>>
>>>> Ditto for gate jobs (not check).
>>>
>>> I think that's what Clark was suggesting in:
>>>
>>> https://etherpad.openstack.org/p/juno-test-maxtrices
>>>
>>>>> Previously we've been testing PostgreSQL in the gate because it
>>>>> has a stricter interpretation of SQL than MySQL, and when we
>>>>> didn't test PostgreSQL it regressed. I know; I chased it for
>>>>> about 4 weeks in Grizzly.
>>>>>
>>>>> However, Monty brought up a good point at Summit: MySQL has a
>>>>> strict mode, which should actually enforce the same strictness.
>>>>>
>>>>> My proposal is that we land this change to devstack -
>>>>> https://review.openstack.org/#/c/97442/ - and backport it to past
>>>>> devstack branches.
>>>>>
>>>>> Then we drop the pg jobs, as the differences between the 2
>>>>> configs should then be very minimal. All the *actual* failures
>>>>> we've seen between the 2 were completely about this strict SQL
>>>>> mode interpretation.
>>>>
>>>> I suppose I would like to see us keep it in the mix. Running
>>>> SmokeStack for almost 3 years, I found many an issue dealing w/
>>>> PostgreSQL. I ran it concurrently with many of the other jobs, and
>>>> I too had limited resources (much less than what we have in infra
>>>> today).
>>>>
>>>> Would MySQL's strict SQL mode catch stuff like this (old bugs, but
>>>> still valid for this topic I think):
>>>>
>>>> https://bugs.launchpad.net/nova/+bug/948066
>>>>
>>>> https://bugs.launchpad.net/nova/+bug/1003756
>>>>
>>>> Having support for, and testing against, at least 2 databases
>>>> helps keep our SQL queries and migrations cleaner... and is
>>>> generally a good practice given we have abstractions which are
>>>> meant to support this sort of thing anyway (so by all means let us
>>>> test them!).
>>>>
>>>> Also, having compacted the Nova migrations 3 times now, I found
>>>> many issues by testing on multiple databases (MySQL and
>>>> PostgreSQL). I'm quite certain our migrations would be worse off
>>>> if we just tested against a single database.
>>>
>>> Certainly sounds like this testing is far beyond the "might one day
>>> be useful" level Sean talks about.
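To make the strict-mode point concrete, here is a minimal sketch of
what the proposal amounts to at the session level, assuming a
SQLAlchemy setup. The actual devstack patch configures this on the
server side; the DSN and the table/column in the closing comment are
made up for illustration:

    # A minimal sketch, not the actual devstack change: it enforces
    # strictness per-session via SQLAlchemy's connect event.
    from sqlalchemy import create_engine, event

    # Hypothetical DSN, for illustration only.
    engine = create_engine("mysql://nova:secret@127.0.0.1/nova")

    @event.listens_for(engine, "connect")
    def enable_strict_mode(dbapi_conn, connection_record):
        # TRADITIONAL bundles STRICT_ALL_TABLES and friends: oversize
        # inserts, bad dates, and out-of-range values become hard
        # errors instead of silent warnings -- roughly the behavior
        # PostgreSQL has by default, which is the coverage the pg
        # jobs were providing.
        cursor = dbapi_conn.cursor()
        cursor.execute("SET SESSION sql_mode = 'TRADITIONAL'")
        cursor.close()

    # Example of the bug class this catches (made-up table/column):
    # without strict mode MySQL silently truncates the value; with it,
    # the INSERT fails loudly, as it always has on PostgreSQL.
    #   INSERT INTO instances (hostname) VALUES (REPEAT('x', 300));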
>>
>> The migration compaction is a good point. And I'm happy to see there
>> were some bugs exposed as well.
>>
>> Here is where I remain stuck....
>>
>> We are now at a failure rate at which it takes 3 days (minimum) to
>> land a fix that decreases our failure rate at all.
>>
>> The way we are currently solving this is by effectively building
>> "manual zuul" and taking smart humans in coordination to end-run
>> around our system. We've merged 18 fixes this way so far -
>> https://etherpad.openstack.org/p/gatetriage-june2014. Merging a fix
>> this way is at least an order of magnitude more expensive in people
>> time because of the analysis and coordination we need to go through
>> to make sure these things are the right things to jump the queue.
>>
>> That effort, over 8 days, has gotten us down to *only* a 24hr merge
>> delay. And there are no more smoking guns. What's left is a ton of
>> subtle things. I've got ~30 patches outstanding right now (a bunch
>> are things to clarify what's going on in the build runs, especially
>> in the fail scenarios). Every single one of them has been failed by
>> Jenkins at least once, and almost every one was failed by a
>> different unique issue.
>>
>> So I'd say at best we're 25% of the way towards solving this. That
>> being said, because of the deep queues, people are just recheck
>> grinding (or hitting the jackpot and landing something that then
>> fails a lot after landing). That leads to bugs like this:
>>
>> https://bugs.launchpad.net/heat/+bug/1306029
>>
>> which was seen early in the patch -
>> https://review.openstack.org/#/c/97569/ - and then kind of destroyed
>> us completely for a day -
>> http://status.openstack.org/elastic-recheck/ (it's the top graph).
>>
>> And, predictably, a week into a long gate queue, everyone is now
>> grumpy. The sniping between projects, and within projects, over
>> assigning blame starts to spike at about day 4 of these events.
>> Everyone assumes someone else is to blame for these things.
>>
>> So there is real community impact when we get to these states.
>>
>> ....
>>
>> So, I'm kind of burnt out trying to figure out how to get us out of
>> this, as I do take it personally when we as a project can't merge
>> code. That's a terrible state to be in.
>>
>> Pleading to get more people to dive in is mostly not helping.
>>
>> So my only thinking at this point is that we prune back our test
>> jobs to a small enough number of configurations that the fixed
>> number of people actually trying to debug this actually can.
>>
>> If there are other ideas, that's great.
>>
>> But "you aren't allowed to do less" isn't really sustainable. That
>> just leads to people giving up on helping.
>
> Totally understand, and I agree with the severity of the situation.
>
> Retreating is one thing, but let's not label the job as useless in
> the process. We can disable the job because of capacity issues even
> if we feel its coverage is as important as the day we added the job.
>
> How about explicitly priority-ordering the jobs such that when we
> retreat, the lowest-priority job is dropped first, but is also the
> first one to be added back (assuming its pass rate is sufficiently
> high) when we feel we have capacity again?
>
> Debating the priority order of the jobs, brainstorming ways of mixing
> configurations in the jobs to get the best coverage, etc. would then
> be something we'd do with cool heads in calmer times, rather than
> when the gate is on fire.
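To make that concrete, here is a toy sketch of the retreat/restore
scheme Mark describes; the job names, pass-rate threshold, and
capacity signal below are all invented for illustration:

    # Toy model: jobs listed highest priority first; under capacity
    # pressure the lowest-priority jobs drop first, and they come back
    # highest-priority-first once their pass rate is healthy again.
    PRIORITY_ORDER = [
        "gate-tempest-dsvm-full",
        "gate-grenade-dsvm",
        "gate-tempest-dsvm-neutron",
        "gate-tempest-dsvm-postgres-full",
    ]
    RESTORE_PASS_RATE = 0.9  # only re-add a job that passes this often

    def active_jobs(capacity, pass_rates):
        """Pick which jobs to run for the capacity we actually have.

        capacity   -- how many of the prioritized jobs we can afford
        pass_rates -- job name -> recent pass rate (e.g. as tracked by
                      elastic-recheck)
        """
        jobs = []
        for job in PRIORITY_ORDER:
            if len(jobs) == capacity:
                break  # retreat: everything lower-priority sits out
            if pass_rates.get(job, 0.0) >= RESTORE_PASS_RATE:
                jobs.append(job)
        return jobs

    # With capacity for 3 jobs, postgres-full is the one that waits:
    print(active_jobs(3, {
        "gate-tempest-dsvm-full": 0.95,
        "gate-grenade-dsvm": 0.93,
        "gate-tempest-dsvm-neutron": 0.91,
        "gate-tempest-dsvm-postgres-full": 0.97,
    }))

The design choice worth noting is that restore order mirrors drop
order, so the highest-priority idle job is always the next one back.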
Yeh, doing that prioritization is on my todo list. It was about 7th
before we got into this state.

I really think it's going to be important to have sponsors as well.
Anything beyond mysql full and the base grenade job needs sponsors. If
they fall behind on addressing / categorizing bugs, we start degrading
the jobs. Because it's way too easy to set priorities for "someone
else", and not realize that there is no someone else.

	-Sean

-- 
Sean Dague
http://dague.net