Hi Salvatore,

I did notice the issue and I flagged this bug report:

https://bugs.launchpad.net/nova/+bug/1352141

I'll follow up.

Cheers,
Armando

On 7 August 2014 01:34, Salvatore Orlando <sorla...@nicira.com> wrote:

> I had to put the patch back on WIP because yesterday a bug causing a 100%
> failure rate slipped in.
> It should be an easy fix, and I'm already working on it.
> Situations like this, exemplified by [1], are a bit frustrating for all the
> people working on improving neutron quality.
> Now, if you allow me a little rant, as Neutron is receiving a lot of
> attention for all the ongoing discussion regarding this group policy stuff,
> would it be possible for us to receive a bit of attention to ensure both
> the full job and the grenade one are switched to voting before the juno-3
> review crunch?
>
> We've already had the attention of the QA team; it would probably be good if
> we could get the attention of the infra core team to ensure:
> 1) the jobs are also deemed by them stable enough to be switched to voting
> 2) the relevant patches for openstack-infra/config are reviewed
>
> Regards,
> Salvatore
>
> [1]
> http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwie3UnbWVzc2FnZSc6IHUnRmxvYXRpbmcgaXAgcG9vbCBub3QgZm91bmQuJywgdSdjb2RlJzogNDAwfVwiIEFORCBidWlsZF9uYW1lOlwiY2hlY2stdGVtcGVzdC1kc3ZtLW5ldXRyb24tZnVsbFwiIEFORCBidWlsZF9icmFuY2g6XCJtYXN0ZXJcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiMTcyODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTQwNzQwMDExMDIwNywibW9kZSI6IiIsImFuYWx5emVfZmllbGQiOiIifQ==
>
>
> On 23 July 2014 14:59, Matthew Treinish <mtrein...@kortar.org> wrote:
>
>> On Wed, Jul 23, 2014 at 02:40:02PM +0200, Salvatore Orlando wrote:
>> > Here I am again bothering you with the state of the full job for Neutron.
>> >
>> > The patch for fixing an issue in nova's server external events extension
>> > merged yesterday [1].
>> > We do not yet have enough data points to make a reliable assessment, but
>> > out of 37 runs since the patch merged, we had "only" 5 failures, which puts
>> > the failure rate at about 13%.
>> >
>> > This is ugly compared with the current failure rate of the smoketest (3%).
>> > However, I think it is good enough to start making the full job voting at
>> > least for neutron patches.
>> > Once we are able to bring the failure rate down to anything around 5%, we
>> > can then enable the job everywhere.
>>
>> I think that sounds like a good plan. I'm also curious how the failure rates
>> compare to the other non-neutron jobs, that might be a useful comparison too
>> for deciding when to flip the switch everywhere.
>>
>> >
>> > As much as I hate asymmetric gating, I think this is a good compromise for
>> > keeping developers working on other projects from being badly affected by
>> > the higher failure rate in the neutron full job.
>>
>> So we discussed this during the project meeting a couple of weeks ago [3] and
>> there was a general agreement that doing it asymmetrically at first would be
>> better. Everyone should be wary of the potential harms with doing it
>> asymmetrically and I think priority will be given to fixing issues that block
>> the neutron gate should they arise.
>>
>> > I will therefore resume work on [2] and remove the WIP status as soon as I
>> > can confirm a failure rate below 15% with more data points.
>> >
>>
>> Thanks for keeping on top of this Salvatore. It'll be good to finally be at
>> least partially gating with a parallel job.
>>
>> -Matt Treinish
>>
>> >
>> > [1] https://review.openstack.org/#/c/103865/
>> > [2] https://review.openstack.org/#/c/88289/
>> [3]
>> http://eavesdrop.openstack.org/meetings/project/2014/project.2014-07-08-21.03.log.html#l-28
>>
>> >
>> >
>> > On 10 July 2014 11:49, Salvatore Orlando <sorla...@nicira.com> wrote:
>> >
>> > >
>> > > On 10 July 2014 11:27, Ihar Hrachyshka <ihrac...@redhat.com> wrote:
>> > >
>> > >> -----BEGIN PGP SIGNED MESSAGE-----
>> > >> Hash: SHA512
>> > >>
>> > >> On 10/07/14 11:07, Salvatore Orlando wrote:
>> > >> > The patch for bug 1329564 [1] merged about 11 hours ago. From [2]
>> > >> > it seems there has been an improvement on the failure rate, which
>> > >> > seems to have dropped to 25% from over 40%. Still, since the patch
>> > >> > merged there have been 11 failures already in the full job out of
>> > >> > 42 jobs executed in total. Of these 11 failures:
>> > >> > - 3 were due to problems in the patches being tested
>> > >> > - 1 had the same root cause as bug 1329564. Indeed the related job
>> > >> >   started before the patch merged but finished after. So this
>> > >> >   failure "doesn't count".
>> > >> > - 1 was for an issue introduced about a week ago which is actually
>> > >> >   causing a lot of failures in the full job [3]. Fix should be easy
>> > >> >   for it; however given the nature of the test we might even skip
>> > >> >   it while it's fixed.
>> > >> > - 3 were for bug 1333654 [4]; for this bug discussion is going on
>> > >> >   on gerrit regarding the most suitable approach.
>> > >> > - 3 were for lock wait timeout errors. Several people in the
>> > >> >   community are already working on them. I hope this will raise the
>> > >> >   profile of this issue (maybe some might think it's just a corner
>> > >> >   case as it rarely causes failures in smoke jobs, whereas the truth
>> > >> >   is that the error occurs but it does not cause job failure because
>> > >> >   the job isn't parallel).
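
As a rough check of the arithmetic in the 10 July breakdown above, here is a minimal Python sketch. The counts (42 runs, 11 failures and their split) come from the message; the short labels, and the choice to drop the two "not counted" causes from both the numerator and the denominator, are assumptions made for illustration only.

```python
# Counts quoted in the 10 July message: 42 full-job runs, 11 failures.
TOTAL_RUNS = 42

# Labels are shorthand chosen for this sketch, not names used in the thread.
FAILURES = {
    "patch under test": 3,
    "bug 1329564 (job started before the fix merged)": 1,
    "issue introduced about a week earlier": 1,
    "bug 1333654": 3,
    "lock wait timeout": 3,
}

# Causes the message treats as not attributable to the job itself.
NOT_COUNTED = (
    "patch under test",
    "bug 1329564 (job started before the fix merged)",
)


def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


def raw_failure_rate():
    """Failure rate over all runs, with no failures excluded."""
    return percent(sum(FAILURES.values()), TOTAL_RUNS)


def adjusted_failure_rate():
    """Failure rate after dropping the excluded runs.

    Assumption: excluded runs leave both the numerator and the denominator.
    """
    excluded = sum(FAILURES[name] for name in NOT_COUNTED)
    counted = sum(FAILURES.values()) - excluded
    return percent(counted, TOTAL_RUNS - excluded)


if __name__ == "__main__":
    print("raw: %.1f%%" % raw_failure_rate())
    print("adjusted: %.1f%%" % adjusted_failure_rate())
```

The raw figure comes out at roughly 26%, in line with the "dropped to 25%" remark; the adjusted figure depends entirely on the exclusion assumption stated above.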
>> > >>
>> > >> Can you give directions on where to find those lock timeout failures?
>> > >> I'd like to check logs to see whether they have the same nature as
>> > >> most other failures (e.g. improper yield under transaction).
>> > >>
>> > >
>> > > This logstash query will give you all occurrences of lock wait timeout
>> > > issues:
>> > > message:"(OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')" AND tags:"screen-q-svc.txt"
>> > >
>> > > The fact that in most cases the build succeeds anyway is misleading,
>> > > because in many cases these errors occur in RPC handling between agents
>> > > and servers, and therefore are not detected by tempest. The neutron full
>> > > job, which is parallel, increases their occurrence because of
>> > > parallelism - and since API requests also occur concurrently, it also
>> > > yields a higher tempest build failure rate.
>> > >
>> > > However, as I argued in the past the "lock wait timeout" error should
>> > > always be treated as an error condition.
>> > > Eugene has already classified lock wait timeout failures and filed bugs
>> > > for them a few weeks ago.
>> > >
>> > >
>> > >> >
>> > >> > Summarizing, I think time is not yet ripe to enable the full job;
>> > >> > once bug 1333654 is fixed, we should go for it. AFAIK there is no
>> > >> > way for working around it in gate tests other than disabling
>> > >> > nova/neutron event reporting, which I guess we don't want to do.
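
The point made above, that a lock wait timeout should count as an error even when the build passes, could be checked locally along these lines. This is only a sketch: it assumes a copy of the screen-q-svc.txt log has been downloaded and read into lines, the function names are hypothetical, and the sample lines are made up. Only the error text itself is taken from the logstash query in the thread.

```python
# Error text taken from the logstash query quoted in the thread.
LOCK_WAIT_MARKER = "Lock wait timeout exceeded; try restarting transaction"


def count_lock_wait_timeouts(lines):
    """Count log lines that contain the lock wait timeout error."""
    return sum(1 for line in lines if LOCK_WAIT_MARKER in line)


def run_is_clean(lines):
    """Treat any lock wait timeout as an error, whatever the build result."""
    return count_lock_wait_timeouts(lines) == 0


if __name__ == "__main__":
    # Hypothetical log excerpt, for illustration only.
    sample = [
        "INFO neutron.wsgi request handled",
        "ERROR (OperationalError) (1205, 'Lock wait timeout exceeded; "
        "try restarting transaction')",
    ]
    print(count_lock_wait_timeouts(sample), run_is_clean(sample))
```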
>> > >> >
>> > >> > Salvatore
>> > >> >
>> > >> > [1] https://review.openstack.org/#/c/105239
>> > >> > [2]
>> > >> > http://logstash.openstack.org/#eyJzZWFyY2giOiJidWlsZF9zdGF0dXM6RkFJTFVSRSBBTkQgbWVzc2FnZTpcIkZpbmlzaGVkOiBGQUlMVVJFXCIgQU5EIGJ1aWxkX25hbWU6XCJjaGVjay10ZW1wZXN0LWRzdm0tbmV1dHJvbi1mdWxsXCIgQU5EIGJ1aWxkX2JyYW5jaDpcIm1hc3RlclwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiIxNzI4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsiZnJvbSI6IjIwMTQtMDctMTBUMDA6MjQ6NTcrMDA6MDAiLCJ0byI6IjIwMTQtMDctMTBUMDg6MjQ6NTMrMDA6MDAiLCJ1c2VyX2ludGVydmFsIjoiMCJ9LCJzdGFtcCI6MTQwNDk4MjU2MjM2OCwibW9kZSI6IiIsImFuYWx5emVfZmllbGQiOiIifQ==
>> > >> > [3]
>> > >> > http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiSFRUUEJhZFJlcXVlc3Q6IFVucmVjb2duaXplZCBhdHRyaWJ1dGUocykgJ21lbWJlciwgdmlwLCBwb29sLCBoZWFsdGhfbW9uaXRvcidcIiBBTkQgdGFnczpcInNjcmVlbi1xLXN2Yy50eHRcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiY3VzdG9tIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7ImZyb20iOiIyMDE0LTA3LTAxVDA4OjU5OjAxKzAwOjAwIiwidG8iOiIyMDE0LTA3LTEwVDA4OjU5OjAxKzAwOjAwIiwidXNlcl9pbnRlcnZhbCI6IjAifSwic3RhbXAiOjE0MDQ5ODI3OTc3ODAsIm1vZGUiOiIiLCJhbmFseXplX2ZpZWxkIjoiIn0=
>> > >> > [4] https://bugs.launchpad.net/nova/+bug/1333654
>> > >> >
>> > >> >
>> > >> > On 2 July 2014 17:57, Salvatore Orlando <sorla...@nicira.com>
>> > >> > wrote:
>> > >> >
>> > >> >> Hi again,
>> > >> >>
>> > >> >> From my analysis most of the failures affecting the neutron full
>> > >> >> job are because of bugs [1] and [2] for which patches [3] and [4]
>> > >> >> have been proposed. Both patches address the nova side of the
>> > >> >> neutron/nova notification system for vif plugging. It is worth
>> > >> >> noting that these bugs did manifest only in the neutron full job
>> > >> >> not because of its "full" nature, but because of its "parallel"
>> > >> >> nature.
>> > >> >>
>> > >> >> Openstackers with a good memory will probably remember we fixed
>> > >> >> the parallel job back in January, before the massive "kernel bug"
>> > >> >> gate outage [5]. However, since parallel testing was
>> > >> >> unfortunately never enabled on the smoke job we run on the gate,
>> > >> >> we allowed new bugs to slip in. For this reason I would recommend
>> > >> >> the following:
>> > >> >> - once patches [3] and [4] have been reviewed and merged,
>> > >> >>   re-assess neutron full job failure rate over a period of
>> > >> >>   48 hours (72 if the period includes at least 24 hours within a
>> > >> >>   weekend - GMT time)
>> > >> >> - turn neutron full job to voting if the previous step reveals a
>> > >> >>   failure rate below 10%, otherwise go back to the drawing board
>> > >> >>
>> > >> >> In my opinion whether the full job should be enabled in an
>> > >> >> asymmetric fashion or not should be a decision for the QA and
>> > >> >> Infra teams. Once the full job is made voting there will
>> > >> >> inevitably be a higher failure rate. An asymmetric gate will not
>> > >> >> cause backlogs on other projects, so less angry people, but as
>> > >> >> Matt said it will still allow other bugs to slip in. Personally
>> > >> >> I'm ok either way.
>> > >> >>
>> > >> >> The reason why we're expecting a higher failure rate on the full
>> > >> >> job is that we have already observed that some "known" bugs, such
>> > >> >> as the various lock timeout issues affecting neutron, tend to show
>> > >> >> with a higher frequency on the full job because of its parallel
>> > >> >> nature.
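
The two-step recommendation in this message (re-assess over 48 or 72 hours, then switch to voting below 10%) can be written down as a small Python sketch. The function names are hypothetical, and reading "at least 24 hours within a weekend" as "at least 24 of the first 48 hours fall on a Saturday or Sunday, UTC" is an assumption; the message does not define it more precisely.

```python
from datetime import datetime, timedelta

VOTING_THRESHOLD_PCT = 10.0  # "a failure rate below 10%"


def weekend_hours(start_utc, hours=48):
    """Count the hours in [start, start + hours) that fall on Sat/Sun (UTC)."""
    return sum(
        1
        for h in range(hours)
        if (start_utc + timedelta(hours=h)).weekday() >= 5
    )


def assessment_window_hours(start_utc):
    """48 hours, or 72 if at least 24 of those 48 hours are weekend hours."""
    return 72 if weekend_hours(start_utc) >= 24 else 48


def should_switch_to_voting(failed_runs, total_runs):
    """True if the observed failure rate is below the voting threshold."""
    if total_runs <= 0:
        raise ValueError("no runs in the assessment window")
    return 100.0 * failed_runs / total_runs < VOTING_THRESHOLD_PCT


if __name__ == "__main__":
    start = datetime(2014, 7, 4, 12)  # a Friday, chosen only as an example
    print(assessment_window_hours(start))
    print(should_switch_to_voting(5, 37))
```

With the 5 failures out of 37 runs reported later in the thread, this rule would not yet switch the job to voting, which matches the later move to a looser 15% target for neutron-only voting.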
>> > >> >>
>> > >> >> Salvatore
>> > >> >>
>> > >> >> [1] https://launchpad.net/bugs/1329546
>> > >> >> [2] https://launchpad.net/bugs/1333654
>> > >> >> [3] https://review.openstack.org/#/c/99182/
>> > >> >> [4] https://review.openstack.org/#/c/103865/
>> > >> >> [5] https://bugs.launchpad.net/neutron/+bug/1273386
>> > >> >>
>> > >> >>
>> > >> >> On 25 June 2014 23:38, Matthew Treinish <mtrein...@kortar.org>
>> > >> >> wrote:
>> > >> >>
>> > >> >>> On Tue, Jun 24, 2014 at 02:14:16PM +0200, Salvatore Orlando
>> > >> >>> wrote:
>> > >> >>>> There is a long standing patch [1] for enabling the neutron
>> > >> >>>> full job. Shortly before the Icehouse release date, when we
>> > >> >>>> first pushed this, the neutron full job had a failure rate of
>> > >> >>>> less than 10%. However, since time has gone by, and perceived
>> > >> >>>> failure rates were higher, we ran this analysis again.
>> > >> >>>
>> > >> >>> So I'm not exactly a fan of having the gates be asymmetrical.
>> > >> >>> It's very easy for breaks to slip in blocking the neutron gate
>> > >> >>> if it's not voting everywhere. Especially because I think most
>> > >> >>> people have been trained to ignore the full job because it's
>> > >> >>> been nonvoting for so long. Is there a particular reason we
>> > >> >>> just don't switch everything all at once? I think having a
>> > >> >>> little bit of friction everywhere during the migration is fine.
>> > >> >>> Especially if we do it way before a milestone. (as opposed to
>> > >> >>> the original parallel switch which was right before H-3)
>> > >> >>>
>> > >> >>>>
>> > >> >>>> Here are the findings in a nutshell.
>> > >> >>>> 1) If we were to enable the job today we might expect about a
>> > >> >>>> 3-fold increase in neutron job failures when compared with the
>> > >> >>>> smoke test. This is unfortunately not acceptable and we therefore
>> > >> >>>> need to identify and fix the issues causing the additional
>> > >> >>>> failure rate.
>> > >> >>>> 2) However this also puts us in a position where if we wait until
>> > >> >>>> the failure rate drops under a given threshold we might end up
>> > >> >>>> chasing a moving target as new issues might be introduced at any
>> > >> >>>> time since the job is not voting.
>> > >> >>>> 3) When it comes to evaluating failure rates for a non voting
>> > >> >>>> job, taking the rough numbers does not mean anything, as that
>> > >> >>>> will take into account patches 'in progress' which end up failing
>> > >> >>>> the tests because of problems in the patches themselves.
>> > >> >>>>
>> > >> >>>> Well, that was pretty much a lot for a "nutshell"; however if
>> > >> >>>> you're not yet bored to death please go on reading.
>> > >> >>>>
>> > >> >>>> The data in this post are a bit skewed because of a rise in
>> > >> >>>> neutron job failures in the past 36 hours. However, this rise
>> > >> >>>> affects both the full and the smoke job so it does not invalidate
>> > >> >>>> what we say here. The results shown below are representative of
>> > >> >>>> the gate status 12 hours ago.
>> > >> >>>>
>> > >> >>>> - Neutron smoke job failure rates (all queues)
>> > >> >>>>   24 hours: 22.4%   48 hours: 19.3%   7 days: 8.96%
>> > >> >>>> - Neutron smoke job failure rates (gate queue only):
>> > >> >>>>   24 hours: 10.41%  48 hours: 10.20%  7 days: 3.53%
>> > >> >>>> - Neutron full job failure rate (check queue only as it's non
>> > >> >>>>   voting):
>> > >> >>>>   24 hours: 31.54%  48 hours: 28.87%  7 days: 25.73%
>> > >> >>>>
>> > >> >>>> Check/Gate Ratio between neutron smoke failures
>> > >> >>>> 24 hours: 2.15  48 hours: 1.89  7 days: 2.53
>> > >> >>>>
>> > >> >>>> Estimated job failure rate for neutron full job if it were to
>> > >> >>>> run in the gate:
>> > >> >>>> 24 hours: 14.67%  48 hours: 15.27%  7 days: 10.16%
>> > >> >>>>
>> > >> >>>> The numbers are therefore not terrible, but definitely not
>> > >> >>>> good enough; looking at the last 7 days the full job will
>> > >> >>>> have a failure rate about 3 times higher than the smoke job.
>> > >> >>>>
>> > >> >>>> We then took, as is usual for us when we do this kind of
>> > >> >>>> evaluation, a window with a reasonable number of failures (41
>> > >> >>>> in our case), and analysed them in detail.
>> > >> >>>>
>> > >> >>>> Of these 41 failures 17 were excluded because of infra
>> > >> >>>> problems, patches 'in progress', or other transient failures;
>> > >> >>>> considering that over the same period of time 160 full job runs
>> > >> >>>> succeeded this would leave us with 24 failures on 184 runs, and
>> > >> >>>> therefore a failure rate of 13.04%, which is not far from the
>> > >> >>>> estimate.
>> > >> >>>>
>> > >> >>>> Let's consider now these 24 'real' failures:
>> > >> >>>> A) 2 were for the SSH timeout (8.33% of failures, 1.08% of total
>> > >> >>>> full job runs). This specific failure is being analyzed to see if
>> > >> >>>> a specific fingerprint can be found
>> > >> >>>> B) 2 (8.33% of failures, 1.08% of total full job runs) were for a
>> > >> >>>> failure in test load balancer basic, which is actually a test
>> > >> >>>> design issue and is already being addressed [2]
>> > >> >>>> C) 7 (29.16% of failures, 3.81% of total full job runs) were for
>> > >> >>>> an issue while resizing a server, which has been already spotted
>> > >> >>>> and has a bug in progress [3]
>> > >> >>>> D) 5 (20.83% of failures, 2.72% of total full job runs)
>> > >> >>>> manifested as a failure in test_server_address; however the
>> > >> >>>> actual root cause was being masked by [4]. A bug has been filed
>> > >> >>>> [5]; this is the most worrying one in my opinion as there are
>> > >> >>>> many cases where the fault happens but does not trigger a failure
>> > >> >>>> because of the way tempest tests are designed.
>> > >> >>>> E) 6 are because of our friend lock wait timeout. This was
>> > >> >>>> initially filed as [6] but since then we've closed it to file
>> > >> >>>> more detailed bug reports as the lock wait timeout can manifest
>> > >> >>>> in various places; Eugene is leading the effort on this problem
>> > >> >>>> with Kevin B.
>> > >> >>>>
>> > >> >>>>
>> > >> >>>> Summarizing, the only failure modes specific to the full job
>> > >> >>>> seem to be C & D. If we were able to fix those we should
>> > >> >>>> reasonably expect a failure rate of about 6.5%. That's still
>> > >> >>>> almost twice that of the smoke job, but I deem it acceptable for
>> > >> >>>> two reasons:
>> > >> >>>> 1- by voting, we will avoid new bugs affecting the full job from
>> > >> >>>> being introduced. It is worth reminding people that any bug
>> > >> >>>> affecting the full job is likely to affect production
>> > >> >>>> environments
>> > >> >>>
>> > >> >>> +1, this is a very good point.
>> > >> >>>
>> > >> >>>> 2- patches failing in the gate will spur neutron developers
>> > >> >>>> to quickly find a fix. Patches failing a non voting job will
>> > >> >>>> cause some neutron core team members to write long and boring
>> > >> >>>> posts to the mailing list.
>> > >> >>>>
>> > >> >>>
>> > >> >>> Well, you can always hope. :) But, in my experience the error
>> > >> >>> is often fixed quickly but the lesson isn't learned, so it will
>> > >> >>> just happen again. That's why I think we should just grit our
>> > >> >>> teeth and turn it on everywhere.
>> > >> >>>
>> > >> >>>> Salvatore
>> > >> >>>>
>> > >> >>>>
>> > >> >>>> [1] https://review.openstack.org/#/c/88289/
>> > >> >>>> [2] https://review.openstack.org/#/c/98065/
>> > >> >>>> [3] https://bugs.launchpad.net/nova/+bug/1329546
>> > >> >>>> [4] https://bugs.launchpad.net/tempest/+bug/1332414
>> > >> >>>> [5] https://bugs.launchpad.net/nova/+bug/1333654
>> > >> >>>> [6] https://bugs.launchpad.net/nova/+bug/1283522
>> > >> >>>
>> > >> >>> Very cool, thanks for the update Salvatore. I'm very excited to
>> > >> >>> get this voting.
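
The figures in the 24 June analysis can be reproduced with a short Python sketch. It assumes the gate estimate was obtained by dividing the full job's check-queue failure rate by the smoke job's check/gate ratio; the message does not spell this out, but the quoted numbers agree with it to within rounding. It also assumes that fixing C and D would turn those 12 failing runs into passing ones. All names are hypothetical.

```python
# Percentages and ratios exactly as quoted in the 24 June message.
FULL_JOB_CHECK_RATE = {"24h": 31.54, "48h": 28.87, "7d": 25.73}
SMOKE_CHECK_GATE_RATIO = {"24h": 2.15, "48h": 1.89, "7d": 2.53}

OBSERVED_FAILURES = 41   # failures in the analysed window
EXCLUDED_FAILURES = 17   # infra problems, patches in progress, transient
SUCCESSFUL_RUNS = 160    # full job runs that succeeded in the same window
FAILURES_C_AND_D = 7 + 5  # the two failure modes specific to the full job


def estimated_gate_rate(window):
    """Assumed estimate: check-queue rate scaled down by the check/gate ratio."""
    return FULL_JOB_CHECK_RATE[window] / SMOKE_CHECK_GATE_RATIO[window]


def rate(failures, successes):
    """Failure rate, in percent, over failures + successes runs."""
    return 100.0 * failures / (failures + successes)


def current_rate():
    real = OBSERVED_FAILURES - EXCLUDED_FAILURES  # the 24 'real' failures
    return rate(real, SUCCESSFUL_RUNS)


def rate_after_fixing_c_and_d():
    # Assumption: the 12 runs hit by C and D would have passed instead.
    real = OBSERVED_FAILURES - EXCLUDED_FAILURES
    return rate(real - FAILURES_C_AND_D, SUCCESSFUL_RUNS + FAILURES_C_AND_D)


if __name__ == "__main__":
    for window in ("24h", "48h", "7d"):
        print(window, "%.2f%%" % estimated_gate_rate(window))
    print("current: %.2f%%" % current_rate())
    print("after C and D: %.2f%%" % rate_after_fixing_c_and_d())
```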
>> > >> >>>
>> > >> >>>
>> > >> >>> -Matt Treinish
>> > >> >>>
>> > >> >>> _______________________________________________
>> > >> >>> OpenStack-dev mailing list
>> > >> >>> OpenStack-dev@lists.openstack.org
>> > >> >>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>> > >> >>>
>> > >> -----BEGIN PGP SIGNATURE-----
>> > >> Version: GnuPG/MacGPG2 v2.0.22 (Darwin)
>> > >> Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
>> > >>
>> > >> iQEcBAEBCgAGBQJTvlxzAAoJEC5aWaUY1u57FJ8H/i+gPR/VZuWFvkOu7pNTHuSj
>> > >> 8iSA1LJRGe7I9185Gbh22fVzGlahqDpB2hCJjKtWIcL/ml/pgSNGzafB/DhqUUlL
>> > >> 4GT1UUHptqlKaNX9GLl9I/bknUBEtpwg3hSBivVdCkRYiVwfX86a2ZeeHaCAONwY
>> > >> ykhiNgoXhR6mr8oEJEIvtjnTDlodR+1dcEq+Nchf/6Fzd8J29dI2Qu38JkweK/qP
>> > >> m6koPdKSJFzrneOWMCW0Dta6yBKjb3bMCNJUVO/KSGg+MRuSmrufOmLCW5JFu95S
>> > >> DWIQSTWs3A+dSy9+xuByClQP9kDpG3aUXxW6uRu5UshHMAF5vLATmdCdK4kBiBY=
>> > >> =K9qm
>> > >> -----END PGP SIGNATURE-----