The patch for bug 1329564 [1] merged about 11 hours ago. From [2] it seems there has been an improvement in the failure rate, which appears to have dropped from over 40% to 25%. Still, since the patch merged there have already been 11 failures in the full job, out of 42 jobs executed in total (a quick back-of-the-envelope on what this means follows after the list). Of these 11 failures:
- 3 were due to problems in the patches being tested.
- 1 had the same root cause as bug 1329564; the related job started before the patch merged but finished after it, so this failure "doesn't count".
- 1 was for an issue introduced about a week ago which is actually causing a lot of failures in the full job [3]. The fix for it should be easy; however, given the nature of the test, we might even skip it while it's being fixed.
- 3 were for bug 1333654 [4]; for this bug, discussion is going on in gerrit regarding the most suitable approach.
- 3 were for lock wait timeout errors. Several people in the community are already working on them. I hope this will raise the profile of this issue (some might think it's just a corner case, as it rarely causes failures in smoke jobs, whereas the truth is that the error occurs there too but does not cause a job failure because the job isn't parallel).
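For the record, here is the back-of-the-envelope behind those numbers (a plain Python sketch of my own reading; in particular, removing the "non-counting" failures from both the numerator and the denominator is my choice, not an official gate metric):

    # Full job since the fix for bug 1329564 merged: 11 failures in 42 runs.
    total_runs = 42
    total_failures = 11

    # 4 of the 11 don't reflect the health of the job itself: 3 were caused
    # by the patches under test, and 1 ran against code predating the fix.
    not_counting = 3 + 1

    raw = 100.0 * total_failures / total_runs
    adjusted = 100.0 * (total_failures - not_counting) / (total_runs - not_counting)

    print("raw failure rate:      %.1f%%" % raw)       # 26.2%, i.e. the ~25% above
    print("adjusted failure rate: %.1f%%" % adjusted)  # 18.4%

Even with the non-counting failures removed we are still well above the smoke job's rate, which is why I don't think we should flip the switch yet.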
Summarizing, I think the time is not yet ripe to enable the full job; once bug 1333654 is fixed, we should go for it. AFAIK there is no way to work around it in gate tests other than disabling nova/neutron event reporting, which I guess we don't want to do.

Salvatore

[1] https://review.openstack.org/#/c/105239
[2] http://logstash.openstack.org/#eyJzZWFyY2giOiJidWlsZF9zdGF0dXM6RkFJTFVSRSBBTkQgbWVzc2FnZTpcIkZpbmlzaGVkOiBGQUlMVVJFXCIgQU5EIGJ1aWxkX25hbWU6XCJjaGVjay10ZW1wZXN0LWRzdm0tbmV1dHJvbi1mdWxsXCIgQU5EIGJ1aWxkX2JyYW5jaDpcIm1hc3RlclwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiIxNzI4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsiZnJvbSI6IjIwMTQtMDctMTBUMDA6MjQ6NTcrMDA6MDAiLCJ0byI6IjIwMTQtMDctMTBUMDg6MjQ6NTMrMDA6MDAiLCJ1c2VyX2ludGVydmFsIjoiMCJ9LCJzdGFtcCI6MTQwNDk4MjU2MjM2OCwibW9kZSI6IiIsImFuYWx5emVfZmllbGQiOiIifQ==
[3] http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiSFRUUEJhZFJlcXVlc3Q6IFVucmVjb2duaXplZCBhdHRyaWJ1dGUocykgJ21lbWJlciwgdmlwLCBwb29sLCBoZWFsdGhfbW9uaXRvcidcIiBBTkQgdGFnczpcInNjcmVlbi1xLXN2Yy50eHRcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiY3VzdG9tIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7ImZyb20iOiIyMDE0LTA3LTAxVDA4OjU5OjAxKzAwOjAwIiwidG8iOiIyMDE0LTA3LTEwVDA4OjU5OjAxKzAwOjAwIiwidXNlcl9pbnRlcnZhbCI6IjAifSwic3RhbXAiOjE0MDQ5ODI3OTc3ODAsIm1vZGUiOiIiLCJhbmFseXplX2ZpZWxkIjoiIn0=
[4] https://bugs.launchpad.net/nova/+bug/1333654
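Side note, for anyone who doesn't want to open Kibana: the part of the logstash URLs above after '#' is just base64-encoded JSON, so the underlying query can be read offline. A minimal sketch (plain Python; the padding and alphabet fix-ups are defensive guesses on my part, not anything logstash documents):

    import base64
    import json
    import sys

    # Usage: python decode_logstash.py '<logstash URL or bare fragment>'
    frag = sys.argv[1].split("#", 1)[-1]
    # Restore any stripped '=' padding and map a URL-safe base64 alphabet
    # back to the standard one, just in case.
    frag = frag.replace("-", "+").replace("_", "/")
    frag += "=" * (-len(frag) % 4)
    query = json.loads(base64.b64decode(frag).decode("utf-8"))
    print(query["search"])

Run against [2], this prints: build_status:FAILURE AND message:"Finished: FAILURE" AND build_name:"check-tempest-dsvm-neutron-full" AND build_branch:"master" - i.e. exactly the full job failure search discussed at the top of this mail.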
On 2 July 2014 17:57, Salvatore Orlando <sorla...@nicira.com> wrote:

> Hi again,
>
> From my analysis, most of the failures affecting the neutron full job are
> because of bugs [1] and [2], for which patches [3] and [4] have been
> proposed. Both patches address the nova side of the neutron/nova
> notification system for vif plugging.
> It is worth noting that these bugs manifested only in the neutron full
> job not because of its "full" nature, but because of its "parallel" nature.
>
> Openstackers with a good memory will probably remember we fixed the
> parallel job back in January, before the massive "kernel bug" gate outage
> [5]. However, since parallel testing was unfortunately never enabled on the
> smoke job we run on the gate, we allowed new bugs to slip in.
> For this reason I would recommend the following:
> - once patches [3] and [4] have been reviewed and merged, re-assess the
> neutron full job failure rate over a period of 48 hours (72 if the period
> includes at least 24 hours within a weekend - GMT time)
> - turn the neutron full job voting if the previous step reveals a failure
> rate below 10%, otherwise go back to the drawing board
>
> In my opinion, whether the full job should be enabled in an asymmetric
> fashion or not should be a decision for the QA and Infra teams. Once the
> full job is made voting there will inevitably be a higher failure rate. An
> asymmetric gate will not cause backlogs on other projects, so fewer angry
> people, but as Matt said it will still allow other bugs to slip in.
> Personally I'm OK either way.
>
> The reason we're expecting a higher failure rate on the full job is that
> we have already observed that some "known" bugs, such as the various lock
> timeout issues affecting neutron, tend to show up with higher frequency on
> the full job because of its parallel nature.
>
> Salvatore
>
> [1] https://launchpad.net/bugs/1329546
> [2] https://launchpad.net/bugs/1333654
> [3] https://review.openstack.org/#/c/99182/
> [4] https://review.openstack.org/#/c/103865/
> [5] https://bugs.launchpad.net/neutron/+bug/1273386
>
>
> On 25 June 2014 23:38, Matthew Treinish <mtrein...@kortar.org> wrote:
>
>> On Tue, Jun 24, 2014 at 02:14:16PM +0200, Salvatore Orlando wrote:
>> > There is a long-standing patch [1] for enabling the neutron full job.
>> > A little before the Icehouse release date, when we first pushed this,
>> > the neutron full job had a failure rate of less than 10%. However, time
>> > has since gone by, and perceived failure rates were higher, so we ran
>> > this analysis again.
>>
>> So I'm not exactly a fan of having the gates be asymmetrical. It's very
>> easy for breaks that block the neutron gate to slip in if it's not voting
>> everywhere, especially because I think most people have been trained to
>> ignore the full job since it's been non-voting for so long. Is there a
>> particular reason we don't just switch everything all at once? I think
>> having a little bit of friction everywhere during the migration is fine,
>> especially if we do it way before a milestone (as opposed to the original
>> parallel switch, which was right before H-3).
>>
>> > Here are the findings in a nutshell.
>> > 1) If we were to enable the job today we might expect about a 3-fold
>> > increase in neutron job failures when compared with the smoke test.
>> > This is unfortunately not acceptable, and we therefore need to identify
>> > and fix the issues causing the additional failure rate.
>> > 2) However, this also puts us in a position where, if we wait until the
>> > failure rate drops under a given threshold, we might end up chasing a
>> > moving target, as new issues might be introduced at any time since the
>> > job is not voting.
>> > 3) When it comes to evaluating failure rates for a non-voting job,
>> > taking the rough numbers does not mean anything, as that will take into
>> > account patches 'in progress' which end up failing the tests because of
>> > problems in the patches themselves.
>> >
>> > Well, that was pretty much a lot for a "nutshell"; however, if you're
>> > not yet bored to death please go on reading.
>> >
>> > The data in this post are a bit skewed because of a rise in neutron job
>> > failures in the past 36 hours. However, this rise affects both the full
>> > and the smoke job, so it does not invalidate what we say here. The
>> > results shown below are representative of the gate status 12 hours ago.
>> >
>> > - Neutron smoke job failure rates (all queues):
>> >   24 hours: 22.4%   48 hours: 19.3%   7 days: 8.96%
>> > - Neutron smoke job failure rates (gate queue only):
>> >   24 hours: 10.41%   48 hours: 10.20%   7 days: 3.53%
>> > - Neutron full job failure rates (check queue only, as it's non-voting):
>> >   24 hours: 31.54%   48 hours: 28.87%   7 days: 25.73%
>> >
>> > Check/gate ratio between neutron smoke failures:
>> >   24 hours: 2.15   48 hours: 1.89   7 days: 2.53
>> >
>> > Estimated failure rate for the neutron full job if it were to run in
>> > the gate:
>> >   24 hours: 14.67%   48 hours: 15.27%   7 days: 10.16%
>> >
>> > The numbers are therefore not terrible, but definitely not good enough;
>> > looking at the last 7 days, the full job would have a failure rate about
>> > 3 times higher than the smoke job.
>> >
>> > We then took, as is usual for us when we do this kind of evaluation, a
>> > window with a reasonable number of failures (41 in our case), and
>> > analysed them in detail.
>> >
>> > Of these 41 failures, 17 were excluded because of infra problems,
>> > patches 'in progress', or other transient failures; considering that
>> > over the same period of time 160 full job runs succeeded, this leaves
>> > us with 24 failures over 184 runs, and therefore a failure rate of
>> > 13.04%, which is not far from the estimate.
>> >
>> > Let's now consider these 24 'real' failures:
>> > A) 2 were for the SSH timeout (8.33% of failures, 1.08% of total full
>> > job runs). This specific failure is being analyzed to see if a specific
>> > fingerprint can be found.
>> > B) 2 (8.33% of failures, 1.08% of total full job runs) were for a
>> > failure in the load balancer basic test, which is actually a test
>> > design issue and is already being addressed [2].
>> > C) 7 (29.16% of failures, 3.81% of total full job runs) were for an
>> > issue while resizing a server, which has already been spotted and has a
>> > bug in progress [3].
>> > D) 5 (20.83% of failures, 2.72% of total full job runs) manifested as a
>> > failure in test_server_address; however, the actual root cause was
>> > being masked by [4]. A bug has been filed [5]; this is the most
>> > worrying one in my opinion, as there are many cases where the fault
>> > happens but does not trigger a failure because of the way tempest tests
>> > are designed.
>> > E) 6 were because of our friend the lock wait timeout. This was
>> > initially filed as [6], but since then we've closed it in order to file
>> > more detailed bug reports, as the lock wait timeout can manifest in
>> > various places; Eugene is leading the effort on this problem with
>> > Kevin B.
>> >
>> > Summarizing, the only failure modes specific to the full job seem to be
>> > C & D. If we were able to fix those, we should reasonably expect a
>> > failure rate of about 6.5%. That's still almost twice that of the smoke
>> > job, but I deem it acceptable for two reasons:
>> > 1- by voting, we will prevent new bugs affecting the full job from
>> > being introduced. It is worth reminding people that any bug affecting
>> > the full job is likely to affect production environments.
>>
>> +1, this is a very good point.
>>
>> > 2- patches failing in the gate will spur neutron developers to quickly
>> > find a fix. Patches failing a non-voting job will cause some neutron
>> > core team members to write long and boring posts to the mailing list.
>>
>> Well, you can always hope. :) But, in my experience the error is often
>> fixed quickly but the lesson isn't learned, so it will just happen again.
>> That's why I think we should just grit our teeth and turn it on
>> everywhere.
>>
>> > Salvatore
>> >
>> > [1] https://review.openstack.org/#/c/88289/
>> > [2] https://review.openstack.org/#/c/98065/
>> > [3] https://bugs.launchpad.net/nova/+bug/1329546
>> > [4] https://bugs.launchpad.net/tempest/+bug/1332414
>> > [5] https://bugs.launchpad.net/nova/+bug/1333654
>> > [6] https://bugs.launchpad.net/nova/+bug/1283522
>>
>> Very cool, thanks for the update Salvatore. I'm very excited to get this
>> voting.
>>
>> -Matt Treinish
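P.S. Since there is a fair amount of arithmetic in the analysis quoted above, here is how I read its two key numbers (again a sketch of my own reconstruction, not a script anyone actually ran): the gate estimate comes from scaling the full job's check-queue rate down by the smoke job's check/gate ratio, and the ~6.5% projection from removing failure modes C and D.

    # 7-day figures from the 24 June analysis quoted above.
    smoke_check = 8.96    # smoke job failure rate, check queue (%)
    smoke_gate = 3.53     # smoke job failure rate, gate queue (%)
    full_check = 25.73    # full job failure rate, check queue (%)

    # Check-queue runs also fail because of bugs in the patches themselves,
    # so scale the full job's check rate by the smoke job's check/gate ratio.
    ratio = smoke_check / smoke_gate                               # ~2.54
    print("estimated gate rate: %.2f%%" % (full_check / ratio))    # ~10.1%

    # 24 'real' failures over 184 runs in the sampled window; modes C (7)
    # and D (5) are the only ones specific to the full job.
    print("current real rate:   %.2f%%" % (100.0 * 24 / 184))            # 13.04%
    print("if C and D fixed:    %.2f%%" % (100.0 * (24 - 7 - 5) / 184))  # 6.52%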
_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev