It's worth noting that elastic-recheck is signalling bug 1253896 and bug 1224001, but they actually have the same signature. I also find it interesting that neutron is triggering bug 1254890 a lot; that one appears to be a hang on /dev/nbdX during key injection, and so far I have no explanation for it.
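For context on why two different bug numbers can show up with one signature: as far as I understand it, an elastic-recheck fingerprint is just an Elasticsearch query stored in the project's queries/<bug-number>.yaml files, so nothing stops two bug files from carrying the same query. A rough sketch of how one could spot such duplicates from a local checkout (the layout and the top-level "query" key are my assumptions about the repo, not something from Sean's mail):

    # Quick-and-dirty check for elastic-recheck fingerprints that are identical,
    # assuming a local checkout laid out as queries/<bug-number>.yaml with a
    # top-level "query" key (my understanding of the repo layout).
    import glob
    import os
    from collections import defaultdict

    import yaml

    bugs_by_query = defaultdict(list)
    for path in glob.glob('queries/*.yaml'):
        with open(path) as f:
            data = yaml.safe_load(f)
        # Collapse whitespace so the same query wrapped differently still matches.
        query = ' '.join(data.get('query', '').split())
        bug = os.path.splitext(os.path.basename(path))[0]
        bugs_by_query[query].append(bug)

    for query, bugs in bugs_by_query.items():
        if len(bugs) > 1:
            print('bugs %s share the signature: %s' % ('/'.join(bugs), query))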
As suggested on IRC, the neutron isolated job had a failure rate of about 5-7% last week (until Thursday, I think). It might therefore also be worth looking at tempest/devstack patches which might be triggering failures or uncovering issues in neutron.

I shared a few findings on the mailing list yesterday ([1]). I hope people actively looking at failures will find them helpful.

Salvatore

[1] http://lists.openstack.org/pipermail/openstack-dev/2014-January/025013.html

On 22 January 2014 14:57, Sean Dague <[email protected]> wrote:
> On 01/22/2014 09:38 AM, Sean Dague wrote:
> > Things aren't great, but they are actually better than yesterday.
> >
> > Vital Stats:
> > Gate queue length: 107
> > Check queue length: 107
> > Head of gate entered: 45hrs ago
> > Changes merged in last 24hrs: 58
> >
> > The 58 changes merged is actually a good number, not a great number, but
> > best we've seen in a number of days. I saw at least a 6-change merge streak
> > yesterday, so zuul is starting to behave like we expect it should.
> >
> > = Previous Top Bugs =
> >
> > Our previous top 2 issues - 1270680 and 1270608 (not confusing at all)
> > are under control.
> >
> > Bug 1270680 - v3 extensions api inherently racey wrt instances
> >
> > Russell managed the second part of the fix for this, we've not seen it
> > come back since that was ninja merged.
> >
> > Bug 1270608 - n-cpu 'iSCSI device not found' log causes
> > gate-tempest-dsvm-*-full to fail
> >
> > Turning off the test that was triggering this made it completely go
> > away. We'll have to revisit if that's because there is a cinder bug or a
> > tempest bug, but we'll do that once the dust has settled.
> >
> > = New Top Bugs =
> >
> > Note: all fail numbers are across all queues
> >
> > Bug 1253896 - Attempts to verify guests are running via SSH fails. SSH
> > connection to guest does not work.
> >
> > 83 fails in 24hrs
> >
> > Bug 1224001 - test_network_basic_ops fails waiting for network to become
> > available
> >
> > 51 fails in 24hrs
> >
> > Bug 1254890 - "Timed out waiting for thing" causes tempest-dsvm-*
> > failures
> >
> > 30 fails in 24hrs
> >
> > We are now sorting - http://status.openstack.org/elastic-recheck/ by
> > failures in the last 24hrs, so we can use it more as a hit list. The top
> > 3 issues are fingerprinted against infra, but are mostly related to
> > normal restart operations at this point.
> >
> > = Starvation Update =
> >
> > with 214 jobs across queues, and averaging 7 devstack nodes per job, our
> > working set is 1498 nodes (i.e. if we had that number we'd be able to be
> > running all the jobs right now in parallel).
> >
> > Our current quota of nodes gives us ~ 480. Which is < 1/3 our working
> > set, and part of the reasons for delays. Rackspace has generously
> > increased our quota in 2 of their availability zones, and Monty is going
> > to prioritize getting those online.
> >
> > Because of Jenkins scaling issues (it starts generating failures when
> > talking to too many build slaves), that means spinning up more Jenkins
> > masters. We've found a 1 / 100 ratio makes Jenkins basically stable,
> > pushing beyond that means new fails. Jenkins is not inherently elastic,
> > so this is a somewhat manual process. Monty is diving on that.
> >
> > There is also a TCP slow start algorithm for zuul that Clark was working
> > on yesterday, which we'll put into production as soon as it is good.
> > This will prevent us from speculating all the way down the gate queue,
> > just to throw it all away on a reset.
> > It acts just like TCP: on every
> > success we grow our speculation length, on every fail we reduce it, with
> > a sane minimum so we don't over throttle ourselves.
> >
> > Thanks to everyone that's been pitching in digging on reset bugs. More
> > help is needed. Many core reviewers are at this point completely
> > ignoring normal reviews until the gate is back, so if you are waiting
> > for a review on some code, the best way to get it is to help us fix the
> > bugs resetting the gate.
>
> One last thing, Anita has also gotten on top of pruning out all the
> neutron changes from the gate. Something is very wrong in the neutron
> isolated jobs right now, so their chance of passing is close enough to
> 0 that we need to keep them out of the gate. This is a new regression
> in the last couple of days.
>
> This is a contributing factor in the gates moving again.
>
> She and Mark are rallying the Neutron folks to sort this one out.
>
> -Sean
>
> --
> Sean Dague
> Samsung Research America
> [email protected] / [email protected]
> http://dague.net
>
> _______________________________________________
> OpenStack-dev mailing list
> [email protected]
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
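To make the starvation numbers above concrete, here is the arithmetic spelled out; the job count, nodes-per-job average, quota and the 1/100 Jenkins slave-per-master ratio are straight from Sean's mail, while the "masters needed" figure at the end is only my own extrapolation from those numbers:

    import math

    # Back-of-envelope version of the starvation numbers in Sean's mail.
    jobs_in_flight = 214      # jobs across the check and gate queues
    nodes_per_job = 7         # average devstack nodes per job
    working_set = jobs_in_flight * nodes_per_job          # 1498 nodes

    current_quota = 480
    coverage = 100.0 * current_quota / working_set        # ~32%, i.e. < 1/3

    # At the ~1 master per 100 slaves ratio Sean mentions, serving the full
    # working set would take on the order of:
    masters_needed = int(math.ceil(working_set / 100.0))  # 15

    print(working_set, "%.0f%%" % coverage, masters_needed)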
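And since the zuul change Clark is working on was described but not shown, here is roughly how I picture the speculation window behaving based on Sean's description alone; I have not seen the actual patch, so the class and method names (and the halving-on-failure policy) are purely illustrative:

    class SpeculationWindow(object):
        """Grow-on-success / shrink-on-failure window, as described in
        Sean's mail. Names, constants and the back-off policy are my own
        guesses, not the real zuul implementation."""

        def __init__(self, minimum=3, start=10):
            self.minimum = minimum
            self.size = start

        def on_success(self):
            # A change merged cleanly: speculate a little deeper next time.
            self.size += 1

        def on_failure(self):
            # A gate reset: back off, but never below the sane minimum.
            self.size = max(self.minimum, self.size // 2)


    window = SpeculationWindow()
    for merged in [True, True, False, True]:
        if merged:
            window.on_success()
        else:
            window.on_failure()
    print(window.size)  # 7 with the toy sequence above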
_______________________________________________
OpenStack-dev mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
