[openstack-dev] [gate] gate-grenade-dsvm-multinode intermittent failures

2016-01-21 Thread Davanum Srinivas
Hi,

Failures for this job have been trending up and are contributing to the
large gate queue backup as well. I've logged a bug:
https://bugs.launchpad.net/openstack-gate/+bug/1536622

and am requesting that voting be switched off for this job:
https://review.openstack.org/#/c/270788/

We need to find and fix the underlying issue, which will help us
determine when to switch this job back to voting, or else clean it up
from all the gate queues and move it to the check queues (I have a TODO
for this in the review).
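
For reference, marking the job non-voting in project-config is roughly a
one-line change to the zuul layout, along these lines (a sketch from memory
of the zuul v2 layout format; the actual change is in the review above):

    jobs:
      - name: gate-grenade-dsvm-multinode
        voting: false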

Thanks,
Dims

-- 
Davanum Srinivas :: https://twitter.com/dims



Re: [openstack-dev] [gate] gate-grenade-dsvm-multinode intermittent failures

2016-01-21 Thread Sean Dague
On 01/21/2016 08:18 AM, Davanum Srinivas wrote:
> Hi,
> 
> Failures for this job have been trending up and are contributing to the
> large gate queue backup as well. I've logged a bug:
> https://bugs.launchpad.net/openstack-gate/+bug/1536622
> 
> and am requesting that voting be switched off for this job:
> https://review.openstack.org/#/c/270788/
> 
> We need to find and fix the underlying issue, which will help us
> determine when to switch this job back to voting, or else clean it up
> from all the gate queues and move it to the check queues (I have a TODO
> for this in the review).

By trending up we mean above a 75% failure rate - http://tinyurl.com/zrq35e8

All the spot checking of failed jobs I've done shows the job dying during the
liberty-side validation with test_volume_boot_pattern, which means we never
even get to any of the real grenade logic.
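
If we keep it non-voting for a while, we should also add an elastic-recheck
query for bug 1536622 so the hits stay visible. Something along these lines
(a sketch only; the exact message signature still needs to be pulled from the
logstash results for a failed run):

    # queries/1536622.yaml
    query: >-
      build_name:"gate-grenade-dsvm-multinode" AND
      build_status:"FAILURE" AND
      message:"test_volume_boot_pattern"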

+2 on non-voting.

-Sean

-- 
Sean Dague
http://dague.net



Re: [openstack-dev] [gate] gate-grenade-dsvm-multinode intermittent failures

2016-01-21 Thread Matt Riedemann



On 1/21/2016 7:33 AM, Sean Dague wrote:

> On 01/21/2016 08:18 AM, Davanum Srinivas wrote:
>> Hi,
>>
>> Failures for this job have been trending up and are contributing to the
>> large gate queue backup as well. I've logged a bug:
>> https://bugs.launchpad.net/openstack-gate/+bug/1536622
>>
>> and am requesting that voting be switched off for this job:
>> https://review.openstack.org/#/c/270788/
>>
>> We need to find and fix the underlying issue, which will help us
>> determine when to switch this job back to voting, or else clean it up
>> from all the gate queues and move it to the check queues (I have a TODO
>> for this in the review).
>
> By trending up we mean above a 75% failure rate - http://tinyurl.com/zrq35e8
>
> All the spot checking of failed jobs I've done shows the job dying during the
> liberty-side validation with test_volume_boot_pattern, which means we never
> even get to any of the real grenade logic.
>
> +2 on non-voting.
>
> -Sean



clarkb was looking into this yesterday; see the IRC logs starting here:

http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2016-01-20.log.html#t2016-01-20T22:44:24

--

Thanks,

Matt Riedemann




Re: [openstack-dev] [gate] gate-grenade-dsvm-multinode intermittent failures

2016-01-21 Thread Matt Riedemann



On 1/21/2016 7:33 AM, Sean Dague wrote:

> On 01/21/2016 08:18 AM, Davanum Srinivas wrote:
>> Hi,
>>
>> Failures for this job have been trending up and are contributing to the
>> large gate queue backup as well. I've logged a bug:
>> https://bugs.launchpad.net/openstack-gate/+bug/1536622
>>
>> and am requesting that voting be switched off for this job:
>> https://review.openstack.org/#/c/270788/
>>
>> We need to find and fix the underlying issue, which will help us
>> determine when to switch this job back to voting, or else clean it up
>> from all the gate queues and move it to the check queues (I have a TODO
>> for this in the review).
>
> By trending up we mean above a 75% failure rate - http://tinyurl.com/zrq35e8
>
> All the spot checking of failed jobs I've done shows the job dying during the
> liberty-side validation with test_volume_boot_pattern, which means we never
> even get to any of the real grenade logic.
>
> +2 on non-voting.
>
> -Sean



Potential fix here:

https://review.openstack.org/#/c/270857/

--

Thanks,

Matt Riedemann




Re: [openstack-dev] [gate] gate-grenade-dsvm-multinode intermittent failures

2016-01-21 Thread Matthew Treinish
On Thu, Jan 21, 2016 at 08:18:14AM -0500, Davanum Srinivas wrote:
> Hi,
> 
> Failures for this job have been trending up and are contributing to the
> large gate queue backup as well. I've logged a bug:
> https://bugs.launchpad.net/openstack-gate/+bug/1536622
> 
> and am requesting that voting be switched off for this job:
> https://review.openstack.org/#/c/270788/

I think this was premature; we were actually looking at the problem last
night. If you look at:

http://status.openstack.org/openstack-health/#/g/node_provider/internap-nyj01

and

http://status.openstack.org/openstack-health/#/g/node_provider/bluebox-sjc1

grenade-multinode is at a 100% failure rate on both providers. The working
hypothesis is that it's because tempest is trying to log in to the guest over
the "private" network, which isn't set up to be accessible from outside. You
can see the discussion on this starting here:

http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2016-01-20.log.html#t2016-01-20T22:44:24
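
If that hypothesis holds, the eventual fix is likely a tempest/devstack
configuration change on the old side, something along these lines (a sketch
only; the option names are from memory and the right knob for the
liberty-side tempest may differ):

    # tempest.conf on the liberty (old) side
    [validation]
    run_validation = True
    # ssh to guests over a floating IP instead of the fixed "private" address
    connect_method = floating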

> 
> We need to find and fix the underlying issue, which will help us
> determine when to switch this job back to voting, or else clean it up
> from all the gate queues and move it to the check queues (I have a TODO
> for this in the review).

TBH, there is always this push to remove jobs or testing whenever there is
release pressure and a gate backup. No one seems to notice whenever anything
isn't working and recheck grinds patches through. (Well, maybe not you, Dims,
because you're more on top of it than almost everyone.) I know that I get
complacent when there isn't a gate backup. The problem is that when things
like our categorization rate on:

http://status.openstack.org/elastic-recheck/data/uncategorized.html

have routinely been at or below 50% this cycle, it's not really a surprise
that we have gate backups like this. More people need to be actively debugging
these problems as they come up; it can't just be the same handful of us. I
don't think making things non-voting is the trend we want to set, because then
what's going to be the motivation for others to help with this?

-Matt Treinish




Re: [openstack-dev] [gate] gate-grenade-dsvm-multinode intermittent failures

2016-01-21 Thread Sean Dague
On 01/21/2016 11:00 AM, Matthew Treinish wrote:
> On Thu, Jan 21, 2016 at 08:18:14AM -0500, Davanum Srinivas wrote:
>> Hi,
>>
>> Failures for this job have been trending up and are contributing to the
>> large gate queue backup as well. I've logged a bug:
>> https://bugs.launchpad.net/openstack-gate/+bug/1536622
>>
>> and am requesting that voting be switched off for this job:
>> https://review.openstack.org/#/c/270788/
> 
> I think this was premature; we were actually looking at the problem last
> night. If you look at:
> 
> http://status.openstack.org/openstack-health/#/g/node_provider/internap-nyj01
> 
> and
> 
> http://status.openstack.org/openstack-health/#/g/node_provider/bluebox-sjc1
> 
> grenade-multinode is at a 100% failure rate on both providers. The working
> hypothesis is that it's because tempest is trying to log in to the guest over
> the "private" network, which isn't set up to be accessible from outside. You
> can see the discussion on this starting here:
> 
> http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2016-01-20.log.html#t2016-01-20T22:44:24
> 
>>
>> We need to find and fix the underlying issue, which will help us
>> determine when to switch this job back to voting, or else clean it up
>> from all the gate queues and move it to the check queues (I have a TODO
>> for this in the review).
> 
> TBH, there is always this push to remove jobs or testing whenever there is
> release pressure and a gate backup. No one seems to notice whenever anything
> isn't working and recheck grinds patches through. (Well, maybe not you, Dims,
> because you're more on top of it than almost everyone.) I know that I get
> complacent when there isn't a gate backup. The problem is that when things
> like our categorization rate on:
> 
> http://status.openstack.org/elastic-recheck/data/uncategorized.html
> 
> have routinely been at or below 50% this cycle, it's not really a surprise
> that we have gate backups like this. More people need to be actively debugging
> these problems as they come up; it can't just be the same handful of us. I
> don't think making things non-voting is the trend we want to set, because then
> what's going to be the motivation for others to help with this?

Deciding to stop everyone else's work while a key infrastructure / test
setup bug is being sorted isn't really an option.

It's an OpenStack global lock on all productivity.

Making jobs non-voting means that it's a local lock instead of a global
one. That *has* to be the model for fixing things like this. We need to
get some agreement on that fact; otherwise there will never be more
volunteers to help fix things. Not everyone in the community can drop
all the work and context they have for solving hard problems because a
new cloud was added / upgraded / acts differently.

When your bus catches fire, you don't just keep driving with a bus full
of passengers. You pull over, let them get off, and deal with the fire
separately from the passengers.

If there is in-flight work by a set of people who are all going to bed,
handing that off with an email needs to happen, especially if we expect
whoever picks it up not to just start over from scratch.

-Sean

-- 
Sean Dague
http://dague.net
