Re: [openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate

2015-08-28 Thread Assaf Muller
On Fri, Aug 28, 2015 at 1:50 PM, Salvatore Orlando 
wrote:

>
>
> On 28 August 2015 at 16:57, Sean Dague  wrote:
>
>> On 08/28/2015 11:20 AM, Assaf Muller wrote:
>> > To recap, we had three issues impacting the gate queue:
>> >
>> > 1) The neutron functional job has had a high failure rate for a while
>> > now. Since it's impacting the gate,
>> > I've removed it from the gate queue but kept it in the Neutron check
>> > queue:
>> > https://review.openstack.org/#/c/218302/
>> >
>> > If you'd like to help, the list of bugs impacting the Neutron
>> > functional job is linked in that patch.
>> >
>> > 2) A new Tempest scenario test was added that caused the DVR job failure
>> > rate to skyrocket to over 50%.
>> > It actually highlighted a legit bug with DVR and legacy routers. Kevin
>> > proposed a patch that skips that test
>> > entirely until we can resolve the bug in Neutron:
>> > https://review.openstack.org/#/c/218242/ (Currently it tries to skip the
>> > test conditionally; the next PS will skip the test entirely).
>> >
>> > 3) The Neutron py34 job has been made unstable due to a recent change
>> > (By me, yay) that made the tests
>> > run with multiple workers. This highlighted an issue with the Neutron
>> > unit testing infrastructure, which is fixed here:
>> > https://review.openstack.org/#/c/217379/
>> >
>> > With all three patches merged we should be good to go.
>>
>> Well, with all 3 of these we should be much better for sure. There are
>> probably additional issues causing intermittent failures which should be
>> looked at. These 3 are definitely masking anything else.
>>
>
> Sadly, since the issues are independent, it is very likely for one of the
> patches to fail Jenkins tests because of one of the other two issues.
> If the situation persists, is it crazy to consider switching neutron-py34
> and neutron-functional to non-voting until these patches merge?
> Neutron cores might abstain from approving patches (unless trivial or
> documentation) while these jobs are non-voting.
>

We have two of the three merged. The Neutron functional tests are no longer
part of the gate queue, only the check queue, and the Tempest
router_rescheduling test will no longer fail as part of the DVR job. This
means that the py34 patch now has a better chance of going in.


>
>
>>
>> https://etherpad.openstack.org/p/gate-fire-2015-08-28 is a set of
>> patches to promote for things causing races in the gate (we've got a
>> cinder one as well). If other issues are known with fixes posted,
>> please feel free to add them with comments.
>>
>
>
>
>>
>> -Sean
>>
>> --
>> Sean Dague
>> http://dague.net
>>
>> __
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe:
>> openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate

2015-08-28 Thread Salvatore Orlando
On 28 August 2015 at 16:57, Sean Dague  wrote:

> On 08/28/2015 11:20 AM, Assaf Muller wrote:
> > To recap, we had three issues impacting the gate queue:
> >
> > 1) The neutron functional job has had a high failure rate for a while
> > now. Since it's impacting the gate,
> > I've removed it from the gate queue but kept it in the Neutron check
> > queue:
> > https://review.openstack.org/#/c/218302/
> >
> > If you'd like to help, the list of bugs impacting the Neutron
> > functional job is linked in that patch.
> >
> > 2) A new Tempest scenario test was added that caused the DVR job failure
> > rate to skyrocket to over 50%.
> > It actually highlighted a legit bug with DVR and legacy routers. Kevin
> > proposed a patch that skips that test
> > entirely until we can resolve the bug in Neutron:
> > https://review.openstack.org/#/c/218242/ (Currently it tries to skip the
> > test conditionally; the next PS will skip the test entirely).
> >
> > 3) The Neutron py34 job has been made unstable due to a recent change
> > (By me, yay) that made the tests
> > run with multiple workers. This highlighted an issue with the Neutron
> > unit testing infrastructure, which is fixed here:
> > https://review.openstack.org/#/c/217379/
> >
> > With all three patches merged we should be good to go.
>
> Well, with all 3 of these we should be much better for sure. There are
> probably additional issues causing intermittent failures which should be
> looked at. These 3 are definitely masking anything else.
>

Sadly, since the issues are independent, it is very likely for one of the
patches to fail Jenkins tests because of one of the other two issues.
If the situation persists, is it crazy to consider switching neutron-py34 and
neutron-functional to non-voting until these patches merge?
Neutron cores might abstain from approving patches (unless trivial or
documentation) while these jobs are non-voting.


>
> https://etherpad.openstack.org/p/gate-fire-2015-08-28 is a set of
> patches to promote for things causing races in the gate (we've got a
> cinder one as well). If other issues are known with fixes posted,
> please feel free to add them with comments.
>



>
> -Sean
>
> --
> Sean Dague
> http://dague.net
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate

2015-08-28 Thread Sean Dague
On 08/28/2015 11:20 AM, Assaf Muller wrote:
> To recap, we had three issues impacting the gate queue:
> 
> 1) The neutron functional job has had a high failure rate for a while
> now. Since it's impacting the gate,
> I've removed it from the gate queue but kept it in the Neutron check queue:
> https://review.openstack.org/#/c/218302/
> 
> If you'd like to help, the list of bugs impacting the Neutron
> functional job is linked in that patch.
> 
> 2) A new Tempest scenario test was added that caused the DVR job failure
> rate to skyrocket to over 50%.
> It actually highlighted a legit bug with DVR and legacy routers. Kevin
> proposed a patch that skips that test
> entirely until we can resolve the bug in Neutron:
> https://review.openstack.org/#/c/218242/ (Currently it tries to skip the
> test conditionally; the next PS will skip the test entirely).
> 
> 3) The Neutron py34 job has been made unstable due to a recent change
> (By me, yay) that made the tests
> run with multiple workers. This highlighted an issue with the Neutron
> unit testing infrastructure, which is fixed here:
> https://review.openstack.org/#/c/217379/
> 
> With all three patches merged we should be good to go.

Well, with all 3 of these we should be much better for sure. There are
probably additional issues causing intermittent failures which should be
looked at. These 3 are definitely masking anything else.

https://etherpad.openstack.org/p/gate-fire-2015-08-28 is a set of
patches to promote for things causing races in the gate (we've got a
cinder one as well). If other issues are known with fixes posted,
please feel free to add them with comments.

-Sean

-- 
Sean Dague
http://dague.net

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate

2015-08-28 Thread Assaf Muller
To recap, we had three issues impacting the gate queue:

1) The neutron functional job has had a high failure rate for a while now.
Since it's impacting the gate,
I've removed it from the gate queue but kept it in the Neutron check queue:
https://review.openstack.org/#/c/218302/

If you'd like to help, the list of bugs impacting the Neutron
functional job is linked in that patch.

2) A new Tempest scenario test was added that caused the DVR job failure
rate to skyrocket to over 50%.
It actually highlighted a legit bug with DVR and legacy routers. Kevin
proposed a patch that skips that test
entirely until we can resolve the bug in Neutron:
https://review.openstack.org/#/c/218242/ (Currently it tries to skip the
test conditionally; the next PS will skip the test entirely).

3) The Neutron py34 job has been made unstable due to a recent change (By
me, yay) that made the tests
run with multiple workers. This highlighted an issue with the Neutron unit
testing infrastructure, which is fixed here:
https://review.openstack.org/#/c/217379/

With all three patches merged we should be good to go.

On Fri, Aug 28, 2015 at 9:37 AM, Sean Dague  wrote:

> On 08/28/2015 09:22 AM, Assaf Muller wrote:
> >
> >
> > On Fri, Aug 28, 2015 at 9:12 AM, Neil Jerram wrote:
> >
> > On 28/08/15 13:39, Kevin Benton wrote:
> > > For the py34 failures, they seem to have started around the same time
> > > as a change was merged that adjusted the way they were run, so I
> > > proposed a revert for that patch
> > > here: https://review.openstack.org/218244
> > >
> > >
> >
> > Which leads on to https://review.openstack.org/#/c/217379/6.
> >
> >
> > Armando reported the py34 Neutron gate issues a few hours after they
> > started, and I pushed that fix a few hours after that. Sadly it's taking
> > time to get that through the gate.
>
> When issues like these arise, please bring them to the infra team in
> #openstack-infra. They can promote fixes that unbreak things.
>
> -Sean
>
> --
> Sean Dague
> http://dague.net
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate

2015-08-28 Thread Sean Dague
On 08/28/2015 08:50 AM, Kevin Benton wrote:
> Why would that only impact py34 and not py27? Aren't the py27 tests run with
> testtools?

py34 is only running some subset of tests, so there are a lot of ways
this can go weird.

It may be that the db tests that are failing assume that some other test
which does a db setup step has run before them. In the py27 case there are
enough tests that do that setup that statistically one nearly always runs
before the ones that are problematic.

There are a couple of modes you can run testr in, like --isolated, which
will expose tests that are coupled to other tests running before them.
If you can reproduce a failure locally, you can also use --analyze-isolation
to figure out which tests are coupled.

testr also reorders tests to try to be faster in aggregate, so the run
order is different from what it would be in the testtools.run case.

In the testtools.run case all the tests are just run in discovery order.
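
To make that concrete, here is a hypothetical sketch (not taken from the
Neutron tree) of the kind of implicit coupling involved: the tests pass in
plain discovery order, but fail as soon as testr partitions them across
workers or you run them with --isolated:

    # Hypothetical example of order-coupled tests; not actual Neutron code.
    import unittest

    _DB = {}  # module-level state shared between test classes


    class TestDatabaseSetup(unittest.TestCase):
        def test_create_tables(self):
            # Alphabetically first, so in discovery order this runs before
            # TestNetworkList and leaves the "schema" behind for it.
            _DB['networks'] = []
            self.assertIn('networks', _DB)


    class TestNetworkList(unittest.TestCase):
        def test_list_networks(self):
            # Implicitly assumes test_create_tables already ran; with multiple
            # workers or `testr run --isolated` this raises KeyError instead.
            self.assertEqual([], _DB['networks'])


    if __name__ == '__main__':
        unittest.main()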

-Sean

-- 
Sean Dague
http://dague.net

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate

2015-08-28 Thread Sean Dague
On 08/28/2015 09:22 AM, Assaf Muller wrote:
> 
> 
> On Fri, Aug 28, 2015 at 9:12 AM, Neil Jerram wrote:
> 
> On 28/08/15 13:39, Kevin Benton wrote:
> > For the py34 failures, they seem to have started around the same time
> > as a change was merged that adjusted the way they were run, so I
> > proposed a revert for that patch
> > here: https://review.openstack.org/218244
> >
> >
> 
> Which leads on to https://review.openstack.org/#/c/217379/6.
> 
> 
> Armando reported the py34 Neutron gate issues a few hours after they started,
> and I pushed that fix a few hours after that. Sadly it's taking time to get
> that through the gate.

When issues like these arise, please bring them to the infra team in
#openstack-infra. They can promote fixes that unbreak things.

-Sean

-- 
Sean Dague
http://dague.net

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate

2015-08-28 Thread Assaf Muller
On Fri, Aug 28, 2015 at 9:12 AM, Neil Jerram 
wrote:

> On 28/08/15 13:39, Kevin Benton wrote:
> > For the py34 failures, they seem to have started around the same time
> > as a change was merged that adjusted the way they were run, so I
> > proposed a revert for that patch
> > here: https://review.openstack.org/218244
> >
> >
>
> Which leads on to https://review.openstack.org/#/c/217379/6.
>

Armando reported the py34 Neutron gate issues a few hours after they started,
and I pushed that fix a few hours after that. Sadly it's taking time to get
that through the gate.


>
> Which is itself failing to merge for various dsvm-functional reasons,
> including failure of test_restart_wsgi_on_sighup_multiple_workers [1].
> There's a bug for that at
> https://bugs.launchpad.net/neutron/+bug/1478190, but that doesn't show
> any activity for the last few days.
>
> [1]
>
> http://logs.openstack.org/79/217379/6/gate/gate-neutron-dsvm-functional/2991b11/testr_results.html.gz
>
> Regards,
> Neil
>
>
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate

2015-08-28 Thread Neil Jerram
On 28/08/15 13:39, Kevin Benton wrote:
> For the py34 failures, they seem to have started around the same time
> as a change was merged that adjusted the way they were run, so I
> proposed a revert for that patch
> here: https://review.openstack.org/218244
>
>

Which leads on to https://review.openstack.org/#/c/217379/6.

Which is itself failing to merge for various dsvm-functional reasons,
including failure of test_restart_wsgi_on_sighup_multiple_workers [1]. 
There's a bug for that at
https://bugs.launchpad.net/neutron/+bug/1478190, but that doesn't show
any activity for the last few days.

[1]
http://logs.openstack.org/79/217379/6/gate/gate-neutron-dsvm-functional/2991b11/testr_results.html.gz

Regards,
Neil



__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate

2015-08-28 Thread Kevin Benton
Why would that only impact py34 and not py27? Aren't the py27 tests run with
testtools?


On Fri, Aug 28, 2015 at 5:41 AM, Sean Dague  wrote:

> On 08/28/2015 08:34 AM, Kevin Benton wrote:
> > One of the patches that fixes one of the functional failures that has
> > been hitting is here: https://review.openstack.org/#/c/217927/
> >
> > However, it failed in the DVR job on the 'test_router_rescheduling'
> > test.[1] This failure is because the logic to skip when DVR is enabled
> > is based on a check that will always return False.[2] I pushed a patch
> > to tempest to fix that [3] so once that gets merged we should be able to
> > get the one above merged.
> >
> > For the py34 failures, they seem to have started around the same time as
> a change was merged that adjusted the way they were run, so I proposed a
> > revert for that patch here: https://review.openstack.org/218244
>
> That would indicate that the tests aren't isolated: running them in
> parallel breaks things because the tests implicitly depend both on run
> order and on everything before them actually having run.
>
> -Sean
>
> --
> Sean Dague
> http://dague.net
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>



-- 
Kevin Benton
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate

2015-08-28 Thread Sean Dague
On 08/28/2015 08:34 AM, Kevin Benton wrote:
> One of the patches that fixes one of the functional failures that has
> been hitting is here: https://review.openstack.org/#/c/217927/
> 
> However, it failed in the DVR job on the 'test_router_rescheduling'
> test.[1] This failure is because the logic to skip when DVR is enabled
> is based on a check that will always return False.[2] I pushed a patch
> to tempest to fix that [3] so once that gets merged we should be able to
> get the one above merged.
> 
> For the py34 failures, they seem to have started around the same time as
> a change was merged that adjusted the way they were run, so I proposed a
> revert for that patch here: https://review.openstack.org/218244

That would indicate that the tests aren't isolated: running them in
parallel breaks things because the tests implicitly depend both on run
order and on everything before them actually having run.

-Sean

-- 
Sean Dague
http://dague.net

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [neutron] 40% failure on neutron python3.4 tests in the gate

2015-08-28 Thread Kevin Benton
One of the patches that fixes one of the functional failures that has been
hitting is here: https://review.openstack.org/#/c/217927/

However, it failed in the DVR job on the 'test_router_rescheduling'
test.[1] This failure is because the logic to skip when DVR is enabled is
based on a check that will always return False.[2] I pushed a patch to
tempest to fix that [3] so once that gets merged we should be able to get
the one above merged.
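
For illustration only (assumed names, not the actual tempest code; the
real check is at link [2] below), a conditional skip of that shape looks
something like this. The guard inspects a value that the gate never sets,
so it always evaluates to False and the test runs against DVR anyway:

    # Purely illustrative sketch of an always-False skip guard; the real
    # check lives in tempest's test_network_basic_ops.py (see [2] below).
    import os
    import unittest


    class TestNetworkBasicOps(unittest.TestCase):
        def test_router_rescheduling(self):
            # Meant to skip on DVR jobs, but this variable is never exported
            # in the gate, so the condition never evaluates to True and the
            # test runs (and fails) against DVR routers anyway.
            if os.environ.get('NEUTRON_ROUTER_DISTRIBUTED') == 'True':
                raise unittest.SkipTest('Rescheduling does not apply to DVR')
            # ... the actual rescheduling checks would follow here ...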

For the py34 failures, they seem to have started around the same time as a
change was merged that adjusted the way they were run, so I proposed a
revert for that patch here: https://review.openstack.org/218244

1.
http://logs.openstack.org/27/217927/1/check/gate-tempest-dsvm-neutron-dvr/3361f9f/logs/testr_results.html.gz
2.
https://github.com/openstack/tempest/blob/8d827589e6589814e01089eb56b4d109274c781a/tempest/scenario/test_network_basic_ops.py#L662-L663
3. https://review.openstack.org/#/c/218242/

On Fri, Aug 28, 2015 at 4:09 AM, Sean Dague  wrote:

> We're at an 18hr backup in the gate, which is really unusual given the
> amount of decoupling. Even under our current load that means we're
> seeing huge failure rates causing resets.
>
> It appears one of the major culprits is the python34 tests in neutron,
> which recently hit a failure rate of over 40% - http://goo.gl/9wCerK
>
> That tends to lead to things like -
> http://dl.dropbox.com/u/6514884/screenshot_249.png - which means a huge
> amount of work has been reset. Right now 3 of 7 neutron patches in the
> gate that are within the sliding window are in a failure state (they are
> also the only current visible fails in the window).
>
> Looking at one of the patches in question -
> https://review.openstack.org/#/c/202207/ - shows it's been rechecked 3
> times, and these failures were seen in earlier runs.
>
> I do understand that people want to get their code merged, but
> rechecking patches that are failing this much without going after the
> root causes means everyone pays for it. This is blocking a lot of other
> projects from landing code in a timely manner.
>
> The functional tests seem to have quite a high failure rate as well, based on
> spot checking. If the results of these tests are mostly going to be
> ignored and rechecked, can we remove them from the gate definition on
> neutron so they aren't damaging the overall flow of the gate?
>
> Thanks,
>
> -Sean
>
> --
> Sean Dague
> http://dague.net
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>



-- 
Kevin Benton
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev