[openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate
We're at an 18hr backup in the gate, which is really unusual given the amount of decoupling. Even under our current load that means we're seeing huge failure rates causing resets.

It appears one of the major culprits is the python34 tests in neutron, which have been running at over a 40% failure rate recently - http://goo.gl/9wCerK

That tends to lead to things like - http://dl.dropbox.com/u/6514884/screenshot_249.png - which means a huge amount of work has been reset. Right now 3 of 7 neutron patches in the gate that are within the sliding window are in a failure state (they are also the only currently visible fails in the window).

Looking at one of the patches in question - https://review.openstack.org/#/c/202207/ - shows it's been rechecked 3 times, and these failures were seen in earlier runs. I do understand that people want to get their code merged, but rechecking patches that are failing this much without going after the root causes means everyone pays for it. This is blocking a lot of other projects from landing code in a timely manner.

From spot checking, the functional tests seem to have quite a high failure rate as well. If the results of these tests are mostly going to be ignored and rechecked, can we remove them from the gate definition on neutron so they aren't damaging the overall flow of the gate?

Thanks,

    -Sean

--
Sean Dague
http://dague.net
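For a rough sense of scale of why a 40% per-change failure rate is so painful in a dependent gate pipeline (every failure restarts testing for everything queued behind it) - the window size below is just an assumed number, only the 40% comes from the observed rate:

# Back-of-the-envelope sketch; numbers other than the 40% are assumptions.
fail_rate = 0.40   # observed py34 job failure rate
window = 20        # assumed number of changes in the active gate window

# Probability the whole window merges without a single failure-induced reset.
p_clean = (1 - fail_rate) ** window
print("chance of a reset-free window: {:.4%}".format(p_clean))

# Expected number of changes tested until the first failure (capped at the
# window size), i.e. how little progress is made before work is redone.
expected_until_failure = (1 - (1 - fail_rate) ** window) / fail_rate
print("expected changes tested before a reset: {:.1f}".format(expected_until_failure))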
Re: [openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate
On 28/08/15 13:39, Kevin Benton wrote:
> For the py34 failures, they seem to have started around the same time as a change was merged that adjusted the way they were run, so I proposed a revert for that patch here: https://review.openstack.org/218244

Which leads on to https://review.openstack.org/#/c/217379/6.

Which is itself failing to merge for various dsvm-functional reasons, including failure of test_restart_wsgi_on_sighup_multiple_workers [1]. There's a bug for that at https://bugs.launchpad.net/neutron/+bug/1478190, but that doesn't show any activity for the last few days.

[1] http://logs.openstack.org/79/217379/6/gate/gate-neutron-dsvm-functional/2991b11/testr_results.html.gz

Regards,
    Neil
Re: [openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate
On 08/28/2015 09:22 AM, Assaf Muller wrote:
> On Fri, Aug 28, 2015 at 9:12 AM, Neil Jerram <neil.jer...@metaswitch.com> wrote:
>> On 28/08/15 13:39, Kevin Benton wrote:
>>> For the py34 failures, they seem to have started around the same time as a change was merged that adjusted the way they were run, so I proposed a revert for that patch here: https://review.openstack.org/218244
>>
>> Which leads on to https://review.openstack.org/#/c/217379/6.
>
> Armando reported the py34 Neutron gate issues a few hours after they started, and I pushed that fix a few hours after that. Sadly it's taking time to get that through the gate.

When issues like these arise, please bring them to the infra team in #openstack-infra. They can promote fixes that unbreak things.

    -Sean

--
Sean Dague
http://dague.net
Re: [openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate
On 08/28/2015 08:34 AM, Kevin Benton wrote:
> One of the patches that fixes one of the functional failures that has been hitting is here: https://review.openstack.org/#/c/217927/
> However, it failed in the DVR job on the 'test_router_rescheduling' test.[1] This failure is because the logic to skip when DVR is enabled is based on a check that will always return False.[2] I pushed a patch to tempest to fix that [3], so once that gets merged we should be able to get the one above merged.
>
> For the py34 failures, they seem to have started around the same time as a change was merged that adjusted the way they were run, so I proposed a revert for that patch here: https://review.openstack.org/218244

That would be indicative of the fact that the tests aren't isolated, and running them in parallel breaks things, because the tests implicitly depend both on order and on everything before them actually having run.

    -Sean

--
Sean Dague
http://dague.net
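For illustration, the kind of implicit coupling meant here looks roughly like this (a made-up example, not actual neutron tests): one test quietly leaves behind shared state that a later test silently relies on, so the later test only passes if the first one ran before it in the same worker.

# Hypothetical example of order-dependent tests; names are invented.
import unittest

_REGISTRY = {}  # module-level state shared between tests


class TestSetupSideEffect(unittest.TestCase):
    def test_registers_driver(self):
        # Passes on its own, but also mutates shared state as a side effect.
        _REGISTRY['driver'] = object()
        self.assertIn('driver', _REGISTRY)


class TestUsesSideEffect(unittest.TestCase):
    def test_driver_lookup(self):
        # Implicitly assumes test_registers_driver already ran in this
        # process: fails when run alone, in a different order, or when the
        # other test lands on a different parallel worker.
        self.assertIn('driver', _REGISTRY)


if __name__ == '__main__':
    unittest.main()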
Re: [openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate
Why would that only impact py34 and not py27? Aren't the py27 tests run with testtools?

On Fri, Aug 28, 2015 at 5:41 AM, Sean Dague <s...@dague.net> wrote:
> On 08/28/2015 08:34 AM, Kevin Benton wrote:
>> One of the patches that fixes one of the functional failures that has been hitting is here: https://review.openstack.org/#/c/217927/
>> However, it failed in the DVR job on the 'test_router_rescheduling' test.[1] This failure is because the logic to skip when DVR is enabled is based on a check that will always return False.[2] I pushed a patch to tempest to fix that [3], so once that gets merged we should be able to get the one above merged.
>>
>> For the py34 failures, they seem to have started around the same time as a change was merged that adjusted the way they were run, so I proposed a revert for that patch here: https://review.openstack.org/218244
>
> That would be indicative of the fact that the tests aren't isolated, and running them in parallel breaks things, because the tests implicitly depend both on order and on everything before them actually having run.
>
>     -Sean
>
> --
> Sean Dague
> http://dague.net

--
Kevin Benton
Re: [openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate
On 08/28/2015 08:50 AM, Kevin Benton wrote:
> Why would that only impact py34 and not py27? Aren't the py27 tests run with testtools?

py34 is only running some subset of tests, so there are a lot of ways this can go weird. It may be that the db tests that are failing assume that some other tests, which do the db setup, run before them. In the py27 case there are enough tests doing that setup that statistically one nearly always runs before the ones that are problematic.

There are a couple of modes you can run testr in, like --isolated, which will expose tests that are coupled to other tests running before them. If you generate a local fail you can also use --analyze-isolation to figure out which tests are coupled.

testr also reorders tests to attempt to be faster in aggregate, so the run order is different than it would be in the testtools.run case. In the testtools.run case all the tests are just run in discovery order.

    -Sean

--
Sean Dague
http://dague.net
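A minimal sketch of the usual remedy - a generic pattern only, not the actual neutron fix in review 217379: each test case builds the state it needs in setUp(), so it passes regardless of which subset of tests runs or in what order testr schedules them.

# Generic pattern only; names are invented, not from neutron.
import unittest

_REGISTRY = {}


class TestDriverLookup(unittest.TestCase):
    def setUp(self):
        super(TestDriverLookup, self).setUp()
        # Create the state this test needs instead of relying on another
        # test having created it earlier in the run.
        _REGISTRY['driver'] = object()
        self.addCleanup(_REGISTRY.pop, 'driver', None)

    def test_driver_lookup(self):
        self.assertIn('driver', _REGISTRY)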
Re: [openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate
One of the patches that fixes one of the functional failures that has been hitting is here: https://review.openstack.org/#/c/217927/
However, it failed in the DVR job on the 'test_router_rescheduling' test.[1] This failure is because the logic to skip when DVR is enabled is based on a check that will always return False.[2] I pushed a patch to tempest to fix that [3], so once that gets merged we should be able to get the one above merged.

For the py34 failures, they seem to have started around the same time as a change was merged that adjusted the way they were run, so I proposed a revert for that patch here: https://review.openstack.org/218244

1. http://logs.openstack.org/27/217927/1/check/gate-tempest-dsvm-neutron-dvr/3361f9f/logs/testr_results.html.gz
2. https://github.com/openstack/tempest/blob/8d827589e6589814e01089eb56b4d109274c781a/tempest/scenario/test_network_basic_ops.py#L662-L663
3. https://review.openstack.org/#/c/218242/

On Fri, Aug 28, 2015 at 4:09 AM, Sean Dague <s...@dague.net> wrote:
> We're at an 18hr backup in the gate, which is really unusual given the amount of decoupling. Even under our current load that means we're seeing huge failure rates causing resets.
>
> It appears one of the major culprits is the python34 tests in neutron, which have been running at over a 40% failure rate recently - http://goo.gl/9wCerK
>
> That tends to lead to things like - http://dl.dropbox.com/u/6514884/screenshot_249.png - which means a huge amount of work has been reset. Right now 3 of 7 neutron patches in the gate that are within the sliding window are in a failure state (they are also the only currently visible fails in the window).
>
> Looking at one of the patches in question - https://review.openstack.org/#/c/202207/ - shows it's been rechecked 3 times, and these failures were seen in earlier runs. I do understand that people want to get their code merged, but rechecking patches that are failing this much without going after the root causes means everyone pays for it. This is blocking a lot of other projects from landing code in a timely manner.
>
> From spot checking, the functional tests seem to have quite a high failure rate as well. If the results of these tests are mostly going to be ignored and rechecked, can we remove them from the gate definition on neutron so they aren't damaging the overall flow of the gate?
>
> Thanks,
>
>     -Sean
>
> --
> Sean Dague
> http://dague.net

--
Kevin Benton
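To illustrate the failure mode in [2] - a skip guard whose condition can never be true, so the test always runs even where it cannot pass - here is a made-up sketch; the attribute and values are placeholders, not the real tempest config options.

# Placeholder illustration only; option names and values are invented.
import unittest


class RouterReschedulingTest(unittest.TestCase):
    # Pretend this reflects the deployment's router type from tempest.conf.
    router_type = 'dvr'

    def test_rescheduling_with_broken_guard(self):
        # BROKEN: compares against a value the option never actually takes,
        # so the condition is always False and the skip never triggers.
        if self.router_type == 'distributed':
            self.skipTest("rescheduling does not apply to DVR routers")
        # ... exercise legacy-router rescheduling here ...

    def test_rescheduling_with_fixed_guard(self):
        # FIXED: key the skip off the value the option really holds.
        if self.router_type == 'dvr':
            self.skipTest("rescheduling does not apply to DVR routers")
        # ... exercise legacy-router rescheduling here ...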
Re: [openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate
On Fri, Aug 28, 2015 at 9:12 AM, Neil Jerram <neil.jer...@metaswitch.com> wrote:
> On 28/08/15 13:39, Kevin Benton wrote:
>> For the py34 failures, they seem to have started around the same time as a change was merged that adjusted the way they were run, so I proposed a revert for that patch here: https://review.openstack.org/218244
>
> Which leads on to https://review.openstack.org/#/c/217379/6.

Armando reported the py34 Neutron gate issues a few hours after they started, and I pushed that fix a few hours after that. Sadly it's taking time to get that through the gate.

> Which is itself failing to merge for various dsvm-functional reasons, including failure of test_restart_wsgi_on_sighup_multiple_workers [1]. There's a bug for that at https://bugs.launchpad.net/neutron/+bug/1478190, but that doesn't show any activity for the last few days.
>
> [1] http://logs.openstack.org/79/217379/6/gate/gate-neutron-dsvm-functional/2991b11/testr_results.html.gz
>
> Regards,
>     Neil
Re: [openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate
To recap, we had three issues impacting the gate queue:

1) The neutron functional job has had a high failure rate for a while now. Since it's impacting the gate, I've removed it from the gate queue but kept it in the Neutron check queue: https://review.openstack.org/#/c/218302/
If you'd like to help, the list of bugs impacting the Neutron functional job is linked in that patch.

2) A new Tempest scenario test was added that caused the DVR job failure rate to skyrocket to over 50%. It actually highlighted a legit bug with DVR and legacy routers. Kevin proposed a patch that skips that test entirely until we can resolve the bug in Neutron: https://review.openstack.org/#/c/218242/ (Currently it tries to skip the test conditionally; the next PS will skip the test entirely.)

3) The Neutron py34 job has been made unstable by a recent change (by me, yay) that made the tests run with multiple workers. This highlighted an issue with the Neutron unit testing infrastructure, which is fixed here: https://review.openstack.org/#/c/217379/

With all three patches merged we should be good to go.

On Fri, Aug 28, 2015 at 9:37 AM, Sean Dague <s...@dague.net> wrote:
> On 08/28/2015 09:22 AM, Assaf Muller wrote:
>> On Fri, Aug 28, 2015 at 9:12 AM, Neil Jerram <neil.jer...@metaswitch.com> wrote:
>>> On 28/08/15 13:39, Kevin Benton wrote:
>>>> For the py34 failures, they seem to have started around the same time as a change was merged that adjusted the way they were run, so I proposed a revert for that patch here: https://review.openstack.org/218244
>>>
>>> Which leads on to https://review.openstack.org/#/c/217379/6.
>>
>> Armando reported the py34 Neutron gate issues a few hours after they started, and I pushed that fix a few hours after that. Sadly it's taking time to get that through the gate.
>
> When issues like these arise, please bring them to the infra team in #openstack-infra. They can promote fixes that unbreak things.
>
>     -Sean
>
> --
> Sean Dague
> http://dague.net
Re: [openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate
On 08/28/2015 11:20 AM, Assaf Muller wrote:
> To recap, we had three issues impacting the gate queue:
>
> 1) The neutron functional job has had a high failure rate for a while now. Since it's impacting the gate, I've removed it from the gate queue but kept it in the Neutron check queue: https://review.openstack.org/#/c/218302/
> If you'd like to help, the list of bugs impacting the Neutron functional job is linked in that patch.
>
> 2) A new Tempest scenario test was added that caused the DVR job failure rate to skyrocket to over 50%. It actually highlighted a legit bug with DVR and legacy routers. Kevin proposed a patch that skips that test entirely until we can resolve the bug in Neutron: https://review.openstack.org/#/c/218242/ (Currently it tries to skip the test conditionally; the next PS will skip the test entirely.)
>
> 3) The Neutron py34 job has been made unstable by a recent change (by me, yay) that made the tests run with multiple workers. This highlighted an issue with the Neutron unit testing infrastructure, which is fixed here: https://review.openstack.org/#/c/217379/
>
> With all three patches merged we should be good to go.

Well, with all 3 of these we should be much better for sure. There are probably additional issues causing intermittent failures which should be looked at; these 3 are definitely masking anything else.

https://etherpad.openstack.org/p/gate-fire-2015-08-28 is a set of patches to promote for things causing races in the gate (we've got a cinder one as well). If other issues are known with fixes posted, please feel free to add them with comments.

    -Sean

--
Sean Dague
http://dague.net
Re: [openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate
On 28 August 2015 at 16:57, Sean Dague <s...@dague.net> wrote:
> On 08/28/2015 11:20 AM, Assaf Muller wrote:
>> To recap, we had three issues impacting the gate queue:
>>
>> 1) The neutron functional job has had a high failure rate for a while now. Since it's impacting the gate, I've removed it from the gate queue but kept it in the Neutron check queue: https://review.openstack.org/#/c/218302/
>> If you'd like to help, the list of bugs impacting the Neutron functional job is linked in that patch.
>>
>> 2) A new Tempest scenario test was added that caused the DVR job failure rate to skyrocket to over 50%. It actually highlighted a legit bug with DVR and legacy routers. Kevin proposed a patch that skips that test entirely until we can resolve the bug in Neutron: https://review.openstack.org/#/c/218242/ (Currently it tries to skip the test conditionally; the next PS will skip the test entirely.)
>>
>> 3) The Neutron py34 job has been made unstable by a recent change (by me, yay) that made the tests run with multiple workers. This highlighted an issue with the Neutron unit testing infrastructure, which is fixed here: https://review.openstack.org/#/c/217379/
>>
>> With all three patches merged we should be good to go.
>
> Well, with all 3 of these we should be much better for sure. There are probably additional issues causing intermittent failures which should be looked at; these 3 are definitely masking anything else.

Sadly, since the issues are independent, it is very likely for one of the patches to fail jenkins tests because of one of the other two issues. If the situation persists, is it crazy to consider switching neutron-py34 and neutron-functional to non-voting until these patches merge? Neutron cores might abstain from approving patches (unless trivial or documentation) while these jobs are non-voting.

> https://etherpad.openstack.org/p/gate-fire-2015-08-28 is a set of patches to promote for things causing races in the gate (we've got a cinder one as well). If other issues are known with fixes posted, please feel free to add them with comments.
>
>     -Sean
>
> --
> Sean Dague
> http://dague.net
Re: [openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate
On Fri, Aug 28, 2015 at 1:50 PM, Salvatore Orlando <salv.orla...@gmail.com> wrote:
> On 28 August 2015 at 16:57, Sean Dague <s...@dague.net> wrote:
>> On 08/28/2015 11:20 AM, Assaf Muller wrote:
>>> To recap, we had three issues impacting the gate queue:
>>>
>>> 1) The neutron functional job has had a high failure rate for a while now. Since it's impacting the gate, I've removed it from the gate queue but kept it in the Neutron check queue: https://review.openstack.org/#/c/218302/
>>> If you'd like to help, the list of bugs impacting the Neutron functional job is linked in that patch.
>>>
>>> 2) A new Tempest scenario test was added that caused the DVR job failure rate to skyrocket to over 50%. It actually highlighted a legit bug with DVR and legacy routers. Kevin proposed a patch that skips that test entirely until we can resolve the bug in Neutron: https://review.openstack.org/#/c/218242/ (Currently it tries to skip the test conditionally; the next PS will skip the test entirely.)
>>>
>>> 3) The Neutron py34 job has been made unstable by a recent change (by me, yay) that made the tests run with multiple workers. This highlighted an issue with the Neutron unit testing infrastructure, which is fixed here: https://review.openstack.org/#/c/217379/
>>>
>>> With all three patches merged we should be good to go.
>>
>> Well, with all 3 of these we should be much better for sure. There are probably additional issues causing intermittent failures which should be looked at; these 3 are definitely masking anything else.
>
> Sadly, since the issues are independent, it is very likely for one of the patches to fail jenkins tests because of one of the other two issues. If the situation persists, is it crazy to consider switching neutron-py34 and neutron-functional to non-voting until these patches merge? Neutron cores might abstain from approving patches (unless trivial or documentation) while these jobs are non-voting.

We have two of the three merged. The Neutron functional tests are no longer part of the gate queue, only the check queue, and the Tempest router_rescheduling test will no longer fail as part of the DVR job. This means that the py34 patch now has a better chance of going in.

>> https://etherpad.openstack.org/p/gate-fire-2015-08-28 is a set of patches to promote for things causing races in the gate (we've got a cinder one as well). If other issues are known with fixes posted, please feel free to add them with comments.
>>
>>     -Sean
>>
>> --
>> Sean Dague
>> http://dague.net