[openstack-dev] [gate] gate-grenade-dsvm-multinode intermittent failures
Hi,

Failures for this job have been trending up and are contributing to the large gate queue as well. I've logged a bug:

https://bugs.launchpad.net/openstack-gate/+bug/1536622

and am requesting switching voting off for this job:

https://review.openstack.org/#/c/270788/

We need to find and fix the underlying issue, which can help us determine when to switch this back to voting, or we clean this job out of all the gate queues and move it to the check queues (I have a TODO for this in the review).

Thanks,
Dims

--
Davanum Srinivas :: https://twitter.com/dims

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
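[Editor's note: for readers unfamiliar with how a job was switched to non-voting at the time, this was a per-job flag in the Zuul v2 layout kept in project-config. A minimal sketch, assuming the layout.yaml conventions of that era; the job name is from this thread, the surrounding structure is illustrative:]

```yaml
# zuul/layout.yaml (project-config) -- illustrative sketch
jobs:
  - name: gate-grenade-dsvm-multinode
    # Job still runs and reports its result, but can no longer
    # block changes from merging in the gate pipeline.
    voting: false
```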
Re: [openstack-dev] [gate] gate-grenade-dsvm-multinode intermittent failures
On 01/21/2016 08:18 AM, Davanum Srinivas wrote:
> Hi,
>
> Failures for this job have been trending up and are contributing to the
> large gate queue as well. I've logged a bug:
> https://bugs.launchpad.net/openstack-gate/+bug/1536622
>
> and am requesting switching voting off for this job:
> https://review.openstack.org/#/c/270788/
>
> We need to find and fix the underlying issue, which can help us
> determine when to switch this back to voting, or we clean this job out
> of all the gate queues and move it to the check queues (I have a TODO
> for this in the review)

By trending up we mean above a 75% failure rate - http://tinyurl.com/zrq35e8

All the spot checking of jobs I've done shows the job dying on the
liberty-side validation with test_volume_boot_pattern, which means we've
never even gotten to any of the real grenade logic.

+2 on non-voting.

-Sean

--
Sean Dague
http://dague.net
Re: [openstack-dev] [gate] gate-grenade-dsvm-multinode intermittent failures
On 1/21/2016 7:33 AM, Sean Dague wrote:
> On 01/21/2016 08:18 AM, Davanum Srinivas wrote:
>> Hi,
>>
>> Failures for this job have been trending up and are contributing to the
>> large gate queue as well. I've logged a bug:
>> https://bugs.launchpad.net/openstack-gate/+bug/1536622
>>
>> and am requesting switching voting off for this job:
>> https://review.openstack.org/#/c/270788/
>>
>> We need to find and fix the underlying issue, which can help us
>> determine when to switch this back to voting, or we clean this job out
>> of all the gate queues and move it to the check queues (I have a TODO
>> for this in the review)
>
> By trending up we mean above a 75% failure rate - http://tinyurl.com/zrq35e8
>
> All the spot checking of jobs I've done shows the job dying on the
> liberty-side validation with test_volume_boot_pattern, which means we've
> never even gotten to any of the real grenade logic.
>
> +2 on non-voting.
>
> -Sean

clarkb was looking into this yesterday; see the IRC logs starting here:

http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2016-01-20.log.html#t2016-01-20T22:44:24

--
Thanks,

Matt Riedemann
Re: [openstack-dev] [gate] gate-grenade-dsvm-multinode intermittent failures
On 1/21/2016 7:33 AM, Sean Dague wrote:
> On 01/21/2016 08:18 AM, Davanum Srinivas wrote:
>> Hi,
>>
>> Failures for this job have been trending up and are contributing to the
>> large gate queue as well. I've logged a bug:
>> https://bugs.launchpad.net/openstack-gate/+bug/1536622
>>
>> and am requesting switching voting off for this job:
>> https://review.openstack.org/#/c/270788/
>>
>> We need to find and fix the underlying issue, which can help us
>> determine when to switch this back to voting, or we clean this job out
>> of all the gate queues and move it to the check queues (I have a TODO
>> for this in the review)
>
> By trending up we mean above a 75% failure rate - http://tinyurl.com/zrq35e8
>
> All the spot checking of jobs I've done shows the job dying on the
> liberty-side validation with test_volume_boot_pattern, which means we've
> never even gotten to any of the real grenade logic.
>
> +2 on non-voting.
>
> -Sean

Potential fix here: https://review.openstack.org/#/c/270857/

--
Thanks,

Matt Riedemann
Re: [openstack-dev] [gate] gate-grenade-dsvm-multinode intermittent failures
On Thu, Jan 21, 2016 at 08:18:14AM -0500, Davanum Srinivas wrote:
> Hi,
>
> Failures for this job have been trending up and are contributing to the
> large gate queue as well. I've logged a bug:
> https://bugs.launchpad.net/openstack-gate/+bug/1536622
>
> and am requesting switching voting off for this job:
> https://review.openstack.org/#/c/270788/

I think this was premature; we were actually looking at the problem last
night. If you look at:

http://status.openstack.org/openstack-health/#/g/node_provider/internap-nyj01

and

http://status.openstack.org/openstack-health/#/g/node_provider/bluebox-sjc1

grenade-multinode is at a 100% failure rate on both providers. The working
hypothesis is that it's failing because tempest is trying to log in to the
guest over the "private" network, which isn't set up to be accessible from
outside. You can see the discussion on this starting here:

http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2016-01-20.log.html#t2016-01-20T22:44:24

> We need to find and fix the underlying issue, which can help us
> determine when to switch this back to voting, or we clean this job out
> of all the gate queues and move it to the check queues (I have a TODO
> for this in the review)

TBH, there is always a push to remove jobs or testing whenever there is
release pressure and a gate backup. No one seems to notice when anything
isn't working and recheck grinds patches through. (Well, maybe not you,
Dims, because you're more on top of it than almost everyone.) I know that
I get complacent when there isn't a gate backup. The problem is that when
our categorization rate on:

http://status.openstack.org/elastic-recheck/data/uncategorized.html

has routinely been at or below 50% this cycle, it's not really a surprise
we have gate backups like this. More people need to be actively debugging
these problems as they come up; it can't just be the same handful of us.
I don't think making things non-voting is the trend we want to set,
because then what's going to be the motivation to get others to help on
this?

-Matt Treinish
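[Editor's note: the working hypothesis above — tempest SSH-ing to the guest over a network that isn't reachable from the host running tempest — maps onto tempest's guest-validation settings. A hedged sketch of the relevant tempest.conf options; the option names are from tempest's `[validation]` section, the values are illustrative, not taken from the failing jobs:]

```ini
# tempest.conf -- illustrative values only
[validation]
# 'floating' attaches a floating IP for the SSH validation check;
# 'fixed' uses the instance's fixed address, which fails when the
# "private" tenant network is not routable from the test node.
connect_method = floating

# Network whose address is used when connect_method = fixed.
network_for_ssh = private
```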
Re: [openstack-dev] [gate] gate-grenade-dsvm-multinode intermittent failures
On 01/21/2016 11:00 AM, Matthew Treinish wrote:
> On Thu, Jan 21, 2016 at 08:18:14AM -0500, Davanum Srinivas wrote:
>> Hi,
>>
>> Failures for this job have been trending up and are contributing to the
>> large gate queue as well. I've logged a bug:
>> https://bugs.launchpad.net/openstack-gate/+bug/1536622
>>
>> and am requesting switching voting off for this job:
>> https://review.openstack.org/#/c/270788/
>
> I think this was premature; we were actually looking at the problem last
> night. If you look at:
>
> http://status.openstack.org/openstack-health/#/g/node_provider/internap-nyj01
>
> and
>
> http://status.openstack.org/openstack-health/#/g/node_provider/bluebox-sjc1
>
> grenade-multinode is at a 100% failure rate on both providers. The
> working hypothesis is that it's failing because tempest is trying to log
> in to the guest over the "private" network, which isn't set up to be
> accessible from outside. You can see the discussion on this starting
> here:
>
> http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2016-01-20.log.html#t2016-01-20T22:44:24
>
>> We need to find and fix the underlying issue, which can help us
>> determine when to switch this back to voting, or we clean this job out
>> of all the gate queues and move it to the check queues (I have a TODO
>> for this in the review)
>
> TBH, there is always a push to remove jobs or testing whenever there is
> release pressure and a gate backup. No one seems to notice when anything
> isn't working and recheck grinds patches through. (Well, maybe not you,
> Dims, because you're more on top of it than almost everyone.) I know
> that I get complacent when there isn't a gate backup. The problem is
> that when our categorization rate on:
>
> http://status.openstack.org/elastic-recheck/data/uncategorized.html
>
> has routinely been at or below 50% this cycle, it's not really a
> surprise we have gate backups like this. More people need to be actively
> debugging these problems as they come up; it can't just be the same
> handful of us. I don't think making things non-voting is the trend we
> want to set, because then what's going to be the motivation to get
> others to help on this?

Deciding to stop everyone else's work while a key infrastructure / test
setup bug is being sorted out isn't really an option. It's an OpenStack
global lock on all productivity. Making jobs non-voting means it's a
local lock instead of a global one. That *has* to be the model for fixing
things like this. We need to get agreement on that fact, otherwise there
will never be more volunteers to help fix things. Not everyone in the
community can drop all the work and context they have for solving hard
problems because a new cloud was added / upgraded / acts differently.

When your bus catches fire, you don't just keep driving with the bus full
of passengers. You pull over, let them get off, and deal with the fire
separately from the passengers.

If there is in-flight work by a set of people who are all going to bed,
handing that off with an email needs to happen, especially if we are
expecting them not to just start over from scratch.

-Sean

--
Sean Dague
http://dague.net