[openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate

2015-08-28 Thread Sean Dague
We're at an 18hr backup in the gate, which is really unusual given the
amount of decoupling. Even under our current load that means we're
seeing huge failure rates causing resets.

It appears one of the major culprits is the python34 tests in neutron,
which have been running at over a 40% failure rate recently - http://goo.gl/9wCerK

That tends to lead to things like -
http://dl.dropbox.com/u/6514884/screenshot_249.png - which means a huge
amount of work has been reset. Right now 3 of 7 neutron patches in the
gate that are within the sliding window are in a failure state (they are
also the only current visible fails in the window).

Looking at one of the patches in question -
https://review.openstack.org/#/c/202207/ - shows it's been rechecked 3
times, and these failures were seen in earlier runs.

I do understand that people want to get their code merged, but
rechecking patches that are failing this much without going after the
root causes means everyone pays for it. This is blocking a lot of other
projects from landing code in a timely manner.

The functional tests seem to have quite a high failure rate as well, from
spot checking. If the results of these tests are mostly going to be
ignored and rechecked, can we remove them from the gate definition on
neutron so they aren't damaging the overall flow of the gate?

Thanks,

-Sean

-- 
Sean Dague
http://dague.net

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate

2015-08-28 Thread Neil Jerram
On 28/08/15 13:39, Kevin Benton wrote:
 For the py34 failures, they seem to have started around the same time
 as a change was merged that adjusted the way they were run, so I
 proposed a revert for that patch
 here: https://review.openstack.org/218244



Which leads on to https://review.openstack.org/#/c/217379/6.

Which is itself failing to merge for various dsvm-functional reasons,
including failure of test_restart_wsgi_on_sighup_multiple_workers [1]. 
There's a bug for that at
https://bugs.launchpad.net/neutron/+bug/1478190, but that doesn't show
any activity for the last few days.

[1]
http://logs.openstack.org/79/217379/6/gate/gate-neutron-dsvm-functional/2991b11/testr_results.html.gz

Regards,
Neil





Re: [openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate

2015-08-28 Thread Sean Dague
On 08/28/2015 09:22 AM, Assaf Muller wrote:
 
 
 On Fri, Aug 28, 2015 at 9:12 AM, Neil Jerram neil.jer...@metaswitch.com wrote:
 
 On 28/08/15 13:39, Kevin Benton wrote:
  For the py34 failures, they seem to have started around the same time
  as a change was merged that adjusted the way they were run, so I
  proposed a revert for that patch
  here: https://review.openstack.org/218244
 
 
 
 Which leads on to https://review.openstack.org/#/c/217379/6.
 
 
 Armando reported the py34 Neutron gate issues a few hours after they
 started, and I pushed that fix a few hours after that. Sadly it's taking
 time to get that through the gate.

When issues like these arise, please bring them to the infra team in
#openstack-infra. They can promote fixes that unbreak things.

-Sean

-- 
Sean Dague
http://dague.net



Re: [openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate

2015-08-28 Thread Sean Dague
On 08/28/2015 08:34 AM, Kevin Benton wrote:
 One of the patches that fixes one of the functional failures that has
 been hitting is here: https://review.openstack.org/#/c/217927/
 
 However, it failed in the DVR job on the 'test_router_rescheduling'
 test.[1] This failure is because the logic to skip when DVR is enabled
 is based on a check that will always return False.[2] I pushed a patch
 to tempest to fix that [3] so once that gets merged we should be able to
 get the one above merged.
 
 For the py34 failures, they seem to have started around the same time as
 a change was merged that adjusted the way they were run, so I proposed a
 revert for that patch here: https://review.openstack.org/218244

That would be indicative of the fact that the tests aren't isolated, and
running them in parallel breaks things because the tests implicitly depend
both on order and on everything before them actually having run.
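
As a purely hypothetical illustration (the names below are invented, not
taken from Neutron's tree), the shape of that problem looks like this: the
second test only passes if the first one happened to run earlier in the same
process, which plain discovery order guarantees but parallel partitioning
does not.

import unittest

# Module-level state that one test populates and another silently relies on.
_REGISTRY = {}


class TestAAPluginSetup(unittest.TestCase):
    def test_register_core_plugin(self):
        # Happens to run first under plain alphabetical discovery order.
        _REGISTRY['core_plugin'] = 'ml2'
        self.assertEqual('ml2', _REGISTRY['core_plugin'])


class TestPluginLookup(unittest.TestCase):
    def test_lookup_core_plugin(self):
        # Implicitly assumes the test above already ran in this process.
        # Under parallel partitioning this can land in a worker where the
        # registry was never populated, and it errors out with a KeyError.
        self.assertEqual('ml2', _REGISTRY['core_plugin'])


if __name__ == '__main__':
    unittest.main()

Run the two classes in one serial process and the pair passes; split them
across workers (or run the lookup test on its own) and it fails.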

-Sean

-- 
Sean Dague
http://dague.net



Re: [openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate

2015-08-28 Thread Kevin Benton
Why would that only impact py34 and not py27? Aren't the py27 tests run
with testtools?


On Fri, Aug 28, 2015 at 5:41 AM, Sean Dague s...@dague.net wrote:

 On 08/28/2015 08:34 AM, Kevin Benton wrote:
  One of the patches that fixes one of the functional failures that has
  been hitting is here: https://review.openstack.org/#/c/217927/
 
  However, it failed in the DVR job on the 'test_router_rescheduling'
  test.[1] This failure is because the logic to skip when DVR is enabled
  is based on a check that will always return False.[2] I pushed a patch
  to tempest to fix that [3] so once that gets merged we should be able to
  get the one above merged.
 
  For the py34 failures, they seem to have started around the same time as
  a change was merged that adjusted the way they were run, so I proposed a
  revert for that patch here: https://review.openstack.org/218244

 That would be indicative of the fact that the tests aren't isolated, and
 running them in parallel breaks things because the tests implicitly depend
 both on order and on everything before them actually having run.

 -Sean

 --
 Sean Dague
 http://dague.net





-- 
Kevin Benton


Re: [openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate

2015-08-28 Thread Sean Dague
On 08/28/2015 08:50 AM, Kevin Benton wrote:
 Why would that only impact py34 and not py27? Aren't the py27 tests run
 with testtools?

py34 is only running some subset of tests, so there are a lot of ways
this can go weird.

It may be that the db tests that are failing assume that some other tests,
which do some db setup, run before them. In the py27 case there are enough
tests that do that setup that statistically one nearly always runs before
the ones that are problematic.

There are a couple of modes you can run testr in, like --isolated, which
will expose tests that are coupled to other tests running before them.
If you can generate a local failure, you can also use --analyze-isolation
to figure out which tests are coupled.

testr also reorders tests to attempt to be faster in aggregate, so run
order is different from what it would be in the testtools.run case.

In the testtools.run case all the tests are just run in discovery order.
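
A hedged sketch of that db setup scenario (the table and test names here
are invented for illustration, they're not Neutron's), with the testr modes
from above noted at the bottom:

import sqlite3
import unittest

# Module-level connection standing in for whatever shared db fixture the
# real tests lean on; everything here is illustrative only.
_CONN = sqlite3.connect(':memory:')


class TestAgentDbSetup(unittest.TestCase):
    def test_create_agents_table(self):
        # In a big py27 run some test like this statistically runs early in
        # every worker, so the query test below never notices the coupling.
        _CONN.execute('CREATE TABLE IF NOT EXISTS agents (id INTEGER, host TEXT)')
        _CONN.execute("INSERT INTO agents VALUES (1, 'compute-1')")
        _CONN.commit()


class TestAgentQueries(unittest.TestCase):
    def test_list_agent_hosts(self):
        # Errors with "no such table: agents" if the setup test did not run
        # first in this process -- the failure mode a smaller, reordered
        # py34 subset running with multiple workers can expose.
        rows = _CONN.execute('SELECT host FROM agents').fetchall()
        self.assertEqual([('compute-1',)], rows)


# To surface this kind of coupling locally with the modes mentioned above:
#   testr run --isolated             (run each test in its own process)
#   testr run --analyze-isolation    (bisect to find the coupled tests)
if __name__ == '__main__':
    unittest.main()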

-Sean

-- 
Sean Dague
http://dague.net



Re: [openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate

2015-08-28 Thread Kevin Benton
One of the patches that fixes one of the functional failures that has been
hitting is here: https://review.openstack.org/#/c/217927/

However, it failed in the DVR job on the 'test_router_rescheduling'
test.[1] This failure is because the logic to skip when DVR is enabled is
based on a check that will always return False.[2] I pushed a patch to
tempest to fix that [3] so once that gets merged we should be able to get
the one above merged.

For the py34 failures, they seem to have started around the same time as a
change was merged that adjusted the way they were run, so I proposed a
revert for that patch here: https://review.openstack.org/218244

1.
http://logs.openstack.org/27/217927/1/check/gate-tempest-dsvm-neutron-dvr/3361f9f/logs/testr_results.html.gz
2.
https://github.com/openstack/tempest/blob/8d827589e6589814e01089eb56b4d109274c781a/tempest/scenario/test_network_basic_ops.py#L662-L663
3. https://review.openstack.org/#/c/218242/
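
Not the actual tempest check from [2], just a made-up sketch of the general
failure mode: a skip guard whose comparison can never be true, so it always
returns False and the DVR skip never fires.

# Hypothetical illustration only -- the option name is invented.

class FakeNetworkOptions(object):
    # Boolean option, the way a parsed config would hand it to you.
    dvr_enabled = True


CONF_NETWORK = FakeNetworkOptions()


def should_skip_for_dvr():
    # Buggy guard: a bool is never equal to the string 'True', so this
    # returns False even when DVR is enabled and the skip never happens.
    return CONF_NETWORK.dvr_enabled == 'True'


def should_skip_for_dvr_fixed():
    # Check the boolean directly instead.
    return bool(CONF_NETWORK.dvr_enabled)


assert should_skip_for_dvr() is False        # the bug: never skips
assert should_skip_for_dvr_fixed() is True   # skips when DVR is enabled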

On Fri, Aug 28, 2015 at 4:09 AM, Sean Dague s...@dague.net wrote:

 We're at an 18hr backup in the gate, which is really unusual given the
 amount of decoupling. Even under our current load that means we're
 seeing huge failure rates causing resets.

 It appears one of the major culprits is the python34 tests in neutron,
 which have been running at over a 40% failure rate recently - http://goo.gl/9wCerK

 That tends to lead to things like -
 http://dl.dropbox.com/u/6514884/screenshot_249.png - which means a huge
 amount of work has been reset. Right now 3 of 7 neutron patches in the
 gate that are within the sliding window are in a failure state (they are
 also the only current visible fails in the window).

 Looking at one of the patches in question -
 https://review.openstack.org/#/c/202207/ - shows it's been rechecked 3
 times, and these failures were seen in earlier runs.

 I do understand that people want to get their code merged, but
 rechecking patches that are failing this much without going after the
 root causes means everyone pays for it. This is blocking a lot of other
 projects from landing code in a timely manner.

 The functional tests seem to have quite a high failure rate as well, from
 spot checking. If the results of these tests are mostly going to be
 ignored and rechecked, can we remove them from the gate definition on
 neutron so they aren't damaging the overall flow of the gate?

 Thanks,

 -Sean

 --
 Sean Dague
 http://dague.net





-- 
Kevin Benton


Re: [openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate

2015-08-28 Thread Assaf Muller
On Fri, Aug 28, 2015 at 9:12 AM, Neil Jerram neil.jer...@metaswitch.com
wrote:

 On 28/08/15 13:39, Kevin Benton wrote:
  For the py34 failures, they seem to have started around the same time
  as a change was merged that adjusted the way they were run, so I
  proposed a revert for that patch
  here: https://review.openstack.org/218244
 
 

 Which leads on to https://review.openstack.org/#/c/217379/6.


Armando reported the py34 Neutron gate issues a few hours after they
started, and I pushed that fix a few hours after that. Sadly it's taking
time to get that through the gate.



 Which is itself failing to merge for various dsvm-functional reasons,
 including failure of test_restart_wsgi_on_sighup_multiple_workers [1].
 There's a bug for that at
 https://bugs.launchpad.net/neutron/+bug/1478190, but that doesn't show
 any activity for the last few days.

 [1]

 http://logs.openstack.org/79/217379/6/gate/gate-neutron-dsvm-functional/2991b11/testr_results.html.gz

 Regards,
 Neil






Re: [openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate

2015-08-28 Thread Assaf Muller
To recap, we had three issues impacting the gate queue:

1) The neutron functional job has had a high failure rate for a while now.
Since it's impacting the gate,
I've removed it from the gate queue but kept it in the Neutron check queue:
https://review.openstack.org/#/c/218302/

If you'd like to help, the list of bugs impacting the Neutron
functional job is linked in that patch.

2) A new Tempest scenario test was added that caused the DVR job failure
rate to skyrocket to over 50%.
It actually highlighted a legit bug with DVR and legacy routers. Kevin
proposed a patch that skips that test
entirely until we can resolve the bug in Neutron:
https://review.openstack.org/#/c/218242/ (Currently it tries to skip the
test conditionally, the next PS will skip the test entirely).

3) The Neutron py34 job has been made unstable due to a recent change (By
me, yay) that made the tests
run with multiple workers. This highlighted an issue with the Neutron unit
testing infrastructure, which is fixed here:
https://review.openstack.org/#/c/217379/

With all three patches merged we should be good to go.

On Fri, Aug 28, 2015 at 9:37 AM, Sean Dague s...@dague.net wrote:

 On 08/28/2015 09:22 AM, Assaf Muller wrote:
 
 
  On Fri, Aug 28, 2015 at 9:12 AM, Neil Jerram neil.jer...@metaswitch.com wrote:
 
  On 28/08/15 13:39, Kevin Benton wrote:
   For the py34 failures, they seem to have started around the same
 time
   as a change was merged that adjusted the way they were run, so I
   proposed a revert for that patch
   here: https://review.openstack.org/218244
  
  
 
  Which leads on to https://review.openstack.org/#/c/217379/6.
 
 
  Armando reported the py34 Neutron gate issues a few hours after they
  started, and I pushed that fix a few hours after that. Sadly it's taking
  time to get that through the gate.

 When issues like these arise, please bring them to the infra team in
 #openstack-infra. They can promote fixes that unbreak things.

 -Sean

 --
 Sean Dague
 http://dague.net




Re: [openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate

2015-08-28 Thread Sean Dague
On 08/28/2015 11:20 AM, Assaf Muller wrote:
 To recap, we had three issues impacting the gate queue:
 
 1) The neutron functional job has had a high failure rate for a while
 now. Since it's impacting the gate,
 I've removed it from the gate queue but kept it in the Neutron check queue:
 https://review.openstack.org/#/c/218302/
 
 If you'd like to help, the list of bugs impacting the Neutron
 functional job is linked in that patch.
 
 2) A new Tempest scenario test was added that caused the DVR job failure
 rate to skyrocket to over 50%.
 It actually highlighted a legit bug with DVR and legacy routers. Kevin
 proposed a patch that skips that test
 entirely until we can resolve the bug in Neutron:
 https://review.openstack.org/#/c/218242/ (Currently it tries to skip the
 test conditionally, the next PS will skip the test entirely).
 
 3) The Neutron py34 job has been made unstable due to a recent change
 (By me, yay) that made the tests
 run with multiple workers. This highlighted an issue with the Neutron
 unit testing infrastructure, which is fixed here:
 https://review.openstack.org/#/c/217379/
 
 With all three patches merged we should be good to go.

Well, with all 3 of these we should be much better for sure. There are
probably additional issues causing intermittent failures which should be
looked at. These 3 are definitely masking anything else.

https://etherpad.openstack.org/p/gate-fire-2015-08-28 is a set of
patches to promote for things causing races in the gate (we've got a
cinder one as well). If other issues are known with fixes posted,
please feel free to add them with comments.

-Sean

-- 
Sean Dague
http://dague.net



Re: [openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate

2015-08-28 Thread Salvatore Orlando
On 28 August 2015 at 16:57, Sean Dague s...@dague.net wrote:

 On 08/28/2015 11:20 AM, Assaf Muller wrote:
  To recap, we had three issues impacting the gate queue:
 
  1) The neutron functional job has had a high failure rate for a while
  now. Since it's impacting the gate,
  I've removed it from the gate queue but kept it in the Neutron check
 queue:
  https://review.openstack.org/#/c/218302/
 
  If you'd like to help, the list of bugs impacting the Neutron
  functional job is linked in that patch.
 
  2) A new Tempest scenario test was added that caused the DVR job failure
  rate to skyrocket to over 50%.
  It actually highlighted a legit bug with DVR and legacy routers. Kevin
  proposed a patch that skips that test
  entirely until we can resolve the bug in Neutron:
  https://review.openstack.org/#/c/218242/ (Currently it tries to skip the
  test conditionally, the next PS will skip the test entirely).
 
  3) The Neutron py34 job has been made unstable due to a recent change
  (By me, yay) that made the tests
  run with multiple workers. This highlighted an issue with the Neutron
  unit testing infrastructure, which is fixed here:
  https://review.openstack.org/#/c/217379/
 
  With all three patches merged we should be good to go.

 Well, with all 3 of these we should be much better for sure. There are
 probably additional issues causing intermittent failures which should be
 looked at. These 3 are definitely masking anything else.


Sadly, since the issues are independent, it is very likely for one of the
patches to fail jenkins tests because of one of the other two issues.
If the situation persists, is it crazy to consider switching neutron-py34 and
neutron-functional to non-voting until these patches merge?
Neutron cores might abstain from approving patches (unless trivial or
documentation-only) while these jobs are non-voting.



 https://etherpad.openstack.org/p/gate-fire-2015-08-28 is a set of
 patches to promote for things causing races in the gate (we've got a
  cinder one as well). If other issues are known with fixes posted,
 please feel free to add them with comments.





 -Sean

 --
 Sean Dague
 http://dague.net




Re: [openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate

2015-08-28 Thread Assaf Muller
On Fri, Aug 28, 2015 at 1:50 PM, Salvatore Orlando salv.orla...@gmail.com
wrote:



 On 28 August 2015 at 16:57, Sean Dague s...@dague.net wrote:

 On 08/28/2015 11:20 AM, Assaf Muller wrote:
  To recap, we had three issues impacting the gate queue:
 
  1) The neutron functional job has had a high failure rate for a while
  now. Since it's impacting the gate,
  I've removed it from the gate queue but kept it in the Neutron check
 queue:
  https://review.openstack.org/#/c/218302/
 
   If you'd like to help, the list of bugs impacting the Neutron
  functional job is linked in that patch.
 
  2) A new Tempest scenario test was added that caused the DVR job failure
   rate to skyrocket to over 50%.
  It actually highlighted a legit bug with DVR and legacy routers. Kevin
  proposed a patch that skips that test
  entirely until we can resolve the bug in Neutron:
  https://review.openstack.org/#/c/218242/ (Currently it tries to skip
 the
  test conditionally, the next PS will skip the test entirely).
 
  3) The Neutron py34 job has been made unstable due to a recent change
  (By me, yay) that made the tests
  run with multiple workers. This highlighted an issue with the Neutron
  unit testing infrastructure, which is fixed here:
  https://review.openstack.org/#/c/217379/
 
  With all three patches merged we should be good to go.

 Well, with all 3 of these we should be much better for sure. There are
 probably additional issues causing intermittent failures which should be
 looked at. These 3 are definitely masking anything else.


 Sadly, since the issues are independent, it is very likely for one of the
 patches to fail jenkins tests because of one of the other two issues.
 If the situation persists, is it crazy to consider switching neutron-py34
 and neutron-functional to non-voting until these patches merge?
 Neutron cores might abstain from approving patches (unless trivial or
 documentation-only) while these jobs are non-voting.


We have two of the three merged. The Neutron functional tests are no longer
part of the gate queue, only the check queue,
and the Tempest router_reschedule test will no longer fail as part of the
DVR job. This means that the py34 patch now has
a better chance of going in.





 https://etherpad.openstack.org/p/gate-fire-2015-08-28 is a set of
 patches to promote for things causing races in the gate (we've got a
  cinder one as well). If other issues are known with fixes posted,
 please feel free to add them with comments.





 -Sean

 --
 Sean Dague
 http://dague.net





