Re: [openstack-dev] Gate breakage process - Let's fix! (related but not specific to neutron)

2013-09-03 Thread Sean Dague

On 08/17/2013 12:26 AM, Clint Byrum wrote:

Excerpts from Maru Newby's message of 2013-08-16 16:42:23 -0700:


On Aug 16, 2013, at 11:44 AM, Clint Byrum  wrote:


Excerpts from Maru Newby's message of 2013-08-16 11:25:07 -0700:

Neutron has been in and out of the gate for the better part of the past month, 
and it didn't slow the pace of development one bit.  Most Neutron developers 
kept on working as if nothing was wrong, blithely merging changes with no 
guarantees that they weren't introducing new breakage.  New bugs were indeed 
merged, greatly increasing the time and effort required to get Neutron back in 
the gate.  I don't think this is sustainable, and I'd like to make a suggestion 
for how to minimize the impact of gate breakage.

For the record, I don't think consistent gate breakage in one project should be 
allowed to hold up the development of other projects.  The current approach of 
skipping tests or otherwise making a given job non-voting for innocent projects 
should continue.  It is arguably worth taking the risk of relaxing gating for 
those innocent projects rather than halting development unnecessarily.

However, I don't think it is a good idea to relax a broken gate for the 
offending project.  So if a broken job/test is clearly Neutron related, it 
should continue to gate Neutron, effectively preventing merges until the 
problem is fixed.  This would both raise the visibility of breakage beyond the 
person responsible for fixing it, and prevent additional breakage from slipping 
past were the gating to be relaxed.

Thoughts?



I think this is a cultural problem related to the code review discussion
from earlier in the week.

We are not looking at finding a defect and reverting as a good thing where
high fives should be shared all around. Instead, "you broke the gate"
seems to mean "you are a bad developer". I have been a bad actor here too,
getting frustrated with the gate-breaker and saying the wrong thing.

The problem really is "you _broke_ the gate". It should be "the gate has
found a defect, hooray!". It doesn't matter what causes the gate to stop,
it is _always_ a defect. Now, it is possible the defect is in tempest,
or jenkins, or HP/Rackspace's clouds where the tests run. But it is
always a defect that what worked before, does not work now.

Defects are to be expected. None of us can write perfect code. We should
be happy to revert commits and go forward with an enabled gate while
the team responsible for the commit gathers information and works to
correct the issue.


You're preaching to the choir, and I suspect that anyone with an interest in 
software quality is likely to prefer problem solving to finger pointing.  
However, my intent with this thread was not to promote more constructive 
thinking about defect detection.  Rather, I was hoping to communicate a flaw in 
the existing process and seek consensus on how that process could best be 
modified to minimize the cost of resolving gate breakage.



I believe that the process is a symptom of the culture. If we were
more eager to revert/discover/fix/re-submit on failure, we wouldn't
be turning off the gate for things. Instead we cling to whatever has
had the requisite "+2/approval" as if passing the stringent review has
imparted our code with magical powers which will eventually morph into
a passing gate.

In a perfect world we could make our CI infrastructure bisect the failures
to try and isolate the commits that caused them, so that anybody can at least
see the commit that did the damage and revert it quickly. Realistically, most
of the time we remove from the gate because the failures are intermittent
and take _forever_ to discover, so that may not even be possible.

I am suggesting that we all change our perspective and embrace "revert
this immediately" as "thank you for finding that defect" not "you jerk
why did you revert my code". It may still be hard to find which commit
to revert, but at least one can spend that time with the idea that they
will be rewarded, rather than punished, for their efforts.


Late on the thread (was out), but an important clarification here is to 
realize that most gate breaks aren't 100% fails, they are 5% or 2% or 1% 
(or less) fails.


For a patch to land in Nova it's got to pass tempest 3 times 
in a gate run (and it probably won't have been pushed there until it 
passed in the check run). Which means it's got to work at least 90% of 
the time.


Because Neutron only runs 1 configuration of tempest, and only in 
smoke mode, a patch that only works 50% of the time can 
easily land.
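
To put rough numbers on that, the sketch below treats each tempest run as an independent trial (a simplifying assumption on my part; real runs share infrastructure and are not truly independent) and computes how likely a patch is to get through a given number of required passes. The function name is made up for illustration; the counts 3 and 1 come from the paragraphs above.

```python
def chance_to_land(per_run_pass_rate: float, required_passes: int) -> float:
    """Probability that a patch passes every required run.

    Assumes runs are independent, which is only an approximation.
    """
    if not 0.0 <= per_run_pass_rate <= 1.0:
        raise ValueError("per_run_pass_rate must be between 0 and 1")
    if required_passes < 0:
        raise ValueError("required_passes must not be negative")
    return per_run_pass_rate ** required_passes


if __name__ == "__main__":
    # 3 required passes (the Nova case) versus 1 (the Neutron case).
    for rate in (0.99, 0.9, 0.5):
        for passes in (3, 1):
            print(f"works {rate:.0%} of the time, {passes} run(s): "
                  f"lands {chance_to_land(rate, passes):.1%} of the time")
```

Under that assumption, a patch that works 90% of the time clears three runs about 73% of the time, while a patch that works half the time clears a single run half the time and three runs only one time in eight.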


Bisection of patches to the failure point only works if you have a 
binary test for success and failure. In the Tempest gate, with a real 
devstack environment, running 20 services asynchronously on 
variable-performance guests in a cloud, if we had a consistent binary test we'd 
never have landed the code in the first place.


So there isn't an automatic bisection solution.
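
A rough way to see why: with an intermittent failure, one green run tells you very little about a commit. The sketch below, again assuming independent runs, computes how many runs of the same commit are needed before a failure with a given per-run rate would have shown up at least once with a chosen confidence. The function name and the 95% default are illustrative assumptions, not anything the gate does.

```python
import math


def runs_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Smallest number of runs n such that a bug failing with probability
    `failure_rate` per run is seen at least once with probability
    >= `confidence`, i.e. 1 - (1 - failure_rate) ** n >= confidence.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    for rate in (0.5, 0.05, 0.01):
        print(f"{rate:.0%} failure rate: {runs_needed(rate)} clean runs "
              f"before calling a commit good")
```

At a 5% failure rate that is 59 runs per bisection step, and at 1% it is 299, which is why a naive pass/fail bisection is impractical here.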

 

Re: [openstack-dev] Gate breakage process - Let's fix! (related but not specific to neutron)

2013-08-18 Thread Joe Gordon
On Sat, Aug 17, 2013 at 9:46 AM, Robert Collins
wrote:

> On 17 August 2013 23:49, Salvatore Orlando  wrote:
> > I tend to agree that when the gate for a project is broken, nothing should
> > be merged for that project until the gate jobs are green again.
> > In the case of Neutron, making the job non-voting only caused more bugs to
> > slip through, and that meant more work for the developers themselves, and
> > more headaches for developers of other projects relying on it.
>
>
>
> > When dealing with intermittent failures, like the bug which probably started
> > the issues we've been witnessing in the past 3 weeks, I think it might be a
> > sensible idea to make the job non-voting only for projects which surely
> > can't be the cause of the gate failure; or perhaps skip the offending test
> > only.
> >
> > This means however asymmetrical gating, and from Monty's post it seems
> > there's something quite wrong with it. However, due to my lack of expertise
> > on the subject, I am unable to see the issue with it.
> >
> > Salvatore
>
> The asymmetry we should fear is when project A can land something
> which will break project B. In this case the proposal is to
> say 'B is broken already, permit A to land things without remorse
> until B is unbroken'.
>
> The problem is, if A makes the breakage of B worse, B ends up in
> catchup mode, which is most unfun.
>
> Concretely, take heat for A and neutron for B. Tempest d-g jobs start
> failing in neutron, so they are made skips. Now heat could make
> neutron tests in tempest worse, and we won't know - or if we do know,
> they'll still land.
>
> Previous discussion here has endorsed 'revert problematic commits,
> it's not blame on the developer, just do it', so I'm not going to
> mention that.
>
> What I will suggest we do is start running some number - let's say 20 -
> of midnight state jobs, all identical. Ignoring datetime-sensitive
> tests, which are fortunately rare, this should identify tests that
> fail 5% of the time, independent of incoming commits. We can use this
> to generate a baseline reference for which tests fail intermittently
> in trunk, and when something breaks intermittently outside of that
> set, we can be pretty *sure* it's in the last day's commits.
>

+1, although we already have a manual, vaguely similar version of this
(http://status.openstack.org/rechecks/)


>
> Secondly, in principle it should be straightforward to do this for
> any point in time, so when a new problem shows its head, we can start
> a bisection up programmatically - independent of the dev analysis - to
> find where it was introduced. If we have resources we could even do
> N-section rather than bisection.


+1


>


> Killing all intermittent issues in test suites is /hard/, so I think we
> need to have a belt-and-braces approach and engineer a rapid response
> system to spikes in intermittent failures, in addition to working on
> the failures themselves.


> -Rob
> --
> Robert Collins 
> Distinguished Technologist
> HP Converged Cloud
>
> ___
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>


Re: [openstack-dev] Gate breakage process - Let's fix! (related but not specific to neutron)

2013-08-17 Thread Robert Collins
On 17 August 2013 23:49, Salvatore Orlando  wrote:
> I tend to agree that when the gate for a project is broken, nothing should
> be merged for that project until the gate jobs are green again.
> In the case of Neutron, making the job non-voting only caused more bugs to
> slip through, and that meant more work for the developers themselves, and
> more headaches for developers of other projects relying on it.



> When dealing with intermittent failures, like the bug which probably started
> the issues we've been witnessing in the past 3 weeks, I think it might be a
> sensible idea to make the job non-voting only for projects which surely
> can't be the cause of the gate failure; or perhaps skip the offending test
> only.
>
> This means however asymmetrical gating, and from Monty's post it seems
> there's something quite wrong with it. However, due to my lack of expertise
> on the subject, I am unable to see the issue with it.
>
> Salvatore

The asymmetry we should fear is when project A can land something
which will break project B. In this case the proposal is to
say 'B is broken already, permit A to land things without remorse
until B is unbroken'.

The problem is, if A makes the breakage of B worse, B ends up in
catchup mode, which is most unfun.

Concretely, take heat for A and neutron for B. Tempest d-g jobs start
failing in neutron, so they are made skips. Now heat could make
neutron tests in tempest worse, and we won't know - or if we do know,
they'll still land.

Previous discussion here has endorsed 'revert problematic commits,
it's not blame on the developer, just do it', so I'm not going to
mention that.

What I will suggest we do is start running some number - let's say 20 -
of midnight state jobs, all identical. Ignoring datetime-sensitive
tests, which are fortunately rare, this should identify tests that
fail 5% of the time, independent of incoming commits. We can use this
to generate a baseline reference for which tests fail intermittently
in trunk, and when something breaks intermittently outside of that
set, we can be pretty *sure* it's in the last day's commits.
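
As a minimal sketch of what such a baseline could look like (all function names and the input format are hypothetical, and the probability helper assumes independent runs): each identical nightly job is represented as the set of test names that failed in it, the baseline records how often each test failed, and anything failing outside the baseline becomes a suspect for the last day's commits.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def detection_probability(failure_rate: float, runs: int) -> float:
    """Chance that a test failing `failure_rate` of the time fails at least
    once in `runs` identical jobs, assuming the jobs are independent."""
    return 1.0 - (1.0 - failure_rate) ** runs


def flaky_baseline(nightly_failures: Iterable[Set[str]]) -> Dict[str, float]:
    """Map every test that failed in at least one job to the fraction of
    jobs it failed in. Each element of the input is one job's failed tests."""
    jobs = list(nightly_failures)
    if not jobs:
        raise ValueError("need at least one job result")
    counts = Counter(name for failed in jobs for name in failed)
    return {name: hits / len(jobs) for name, hits in counts.items()}


def new_suspects(baseline: Dict[str, float], failures: Set[str]) -> Set[str]:
    """Failures that are not already known to be intermittent in trunk."""
    return failures - set(baseline)


if __name__ == "__main__":
    # Made-up results from four identical jobs, purely for illustration.
    jobs = [{"test_a"}, set(), {"test_a", "test_b"}, set()]
    baseline = flaky_baseline(jobs)
    print(baseline)
    print(new_suspects(baseline, {"test_a", "test_c"}))
    print(f"{detection_probability(0.05, 20):.1%}")
```

One caveat the arithmetic surfaces: under the independence assumption, a single night of 20 jobs catches a 5% failure only about 64% of the time, so the baseline would probably need to accumulate over several nights before it is trustworthy.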

Secondly, in principle it should be straightforward to do this for
any point in time, so when a new problem shows its head, we can start
a bisection up programmatically - independent of the dev analysis - to
find where it was introduced. If we have resources we could even do
N-section rather than bisection.
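
A hedged sketch of how that could work for intermittent failures: instead of one run per commit, each bisection step runs the job several times and treats the commit as bad if any run fails. The `run_passes` callback is a hypothetical stand-in for launching a job against a commit (it is not an existing API), and the sketch assumes the last commit in the list is known bad and that a commit before the culprit never fails this way.

```python
from typing import Callable, Sequence


def first_bad_commit(
    commits: Sequence[str],
    run_passes: Callable[[str], bool],
    runs_per_commit: int,
) -> str:
    """Bisect an ordered commit list (oldest first) for an intermittent bug.

    A commit is judged bad if any of `runs_per_commit` runs fails. With too
    few runs a bad commit can look good, and the result drifts later.
    """
    if not commits:
        raise ValueError("commits must not be empty")
    if runs_per_commit < 1:
        raise ValueError("runs_per_commit must be at least 1")

    def looks_bad(commit: str) -> bool:
        # Stops at the first failing run; a clean commit costs all runs.
        return any(not run_passes(commit) for _ in range(runs_per_commit))

    lo, hi = 0, len(commits) - 1  # commits[hi] is assumed to be bad
    while lo < hi:
        mid = (lo + hi) // 2
        if looks_bad(commits[mid]):
            hi = mid          # culprit is at mid or earlier
        else:
            lo = mid + 1      # culprit is after mid
    return commits[lo]
```

N-section would follow the same shape, testing several split points concurrently per step instead of one midpoint, trading more machines for fewer rounds.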

Killing all intermittent issues in test suites is /hard/, so I think we
need to have a belt-and-braces approach and engineer a rapid response
system to spikes in intermittent failures, in addition to working on
the failures themselves.

-Rob
-- 
Robert Collins 
Distinguished Technologist
HP Converged Cloud



Re: [openstack-dev] Gate breakage process - Let's fix! (related but not specific to neutron)

2013-08-17 Thread Joe Gordon
On Aug 17, 2013 7:52 AM, "Salvatore Orlando"  wrote:
>
> I tend to agree that when the gate for a project is broken, nothing
> should be merged for that project until the gate jobs are green again.
> In the case of Neutron, making the job non-voting only caused more bugs
> to slip through, and that meant more work for the developers themselves,
> and more headaches for developers of other projects relying on it.
>
> When dealing with intermittent failures, like the bug which probably
> started the issues we've been witnessing in the past 3 weeks, I think it
> might be a sensible idea to make the job non-voting only for projects which
> surely can't be the cause of the gate failure; or perhaps skip the
> offending test only.
>
> This means however asymmetrical gating, and from Monty's post it seems
> there's something quite wrong with it. However, due to my lack of expertise
> on the subject, I am unable to see the issue with it.

Although not as simple, this can also be done by telling neutron-core to
only merge fixes that will move things closer to gating tests staying green.

We all use a similar process for feature freeze already.

>
> Salvatore
>
>
>
>
> On 17 August 2013 01:42, Maru Newby  wrote:
>>
>>
>> On Aug 16, 2013, at 11:44 AM, Clint Byrum  wrote:
>>
>> > Excerpts from Maru Newby's message of 2013-08-16 11:25:07 -0700:
>> >> Neutron has been in and out of the gate for the better part of the
>> >> past month, and it didn't slow the pace of development one bit.  Most
>> >> Neutron developers kept on working as if nothing was wrong, blithely
>> >> merging changes with no guarantees that they weren't introducing new
>> >> breakage.  New bugs were indeed merged, greatly increasing the time and
>> >> effort required to get Neutron back in the gate.  I don't think this is
>> >> sustainable, and I'd like to make a suggestion for how to minimize the
>> >> impact of gate breakage.
>> >>
>> >> For the record, I don't think consistent gate breakage in one project
>> >> should be allowed to hold up the development of other projects.  The
>> >> current approach of skipping tests or otherwise making a given job
>> >> non-voting for innocent projects should continue.  It is arguably worth
>> >> taking the risk of relaxing gating for those innocent projects rather than
>> >> halting development unnecessarily.
>> >>
>> >> However, I don't think it is a good idea to relax a broken gate for
>> >> the offending project.  So if a broken job/test is clearly Neutron related,
>> >> it should continue to gate Neutron, effectively preventing merges until the
>> >> problem is fixed.  This would both raise the visibility of breakage beyond
>> >> the person responsible for fixing it, and prevent additional breakage from
>> >> slipping past were the gating to be relaxed.
>> >>
>> >> Thoughts?
>> >>
>> >
>> > I think this is a cultural problem related to the code review discussion
>> > from earlier in the week.
>> >
>> > We are not looking at finding a defect and reverting as a good thing where
>> > high fives should be shared all around. Instead, "you broke the gate"
>> > seems to mean "you are a bad developer". I have been a bad actor here too,
>> > getting frustrated with the gate-breaker and saying the wrong thing.
>> >
>> > The problem really is "you _broke_ the gate". It should be "the gate has
>> > found a defect, hooray!". It doesn't matter what causes the gate to stop,
>> > it is _always_ a defect. Now, it is possible the defect is in tempest,
>> > or jenkins, or HP/Rackspace's clouds where the tests run. But it is
>> > always a defect that what worked before, does not work now.
>> >
>> > Defects are to be expected. None of us can write perfect code. We should
>> > be happy to revert commits and go forward with an enabled gate while
>> > the team responsible for the commit gathers information and works to
>> > correct the issue.
>>
>> You're preaching to the choir, and I suspect that anyone with an
>> interest in software quality is likely to prefer problem solving to finger
>> pointing.  However, my intent with this thread was not to promote more
>> constructive thinking about defect detection.  Rather, I was hoping to
>> communicate a flaw in the existing process and seek consensus on how that
>> process could best be modified to minimize the cost of resolving gate
>> breakage.
>>
>>


Re: [openstack-dev] Gate breakage process - Let's fix! (related but not specific to neutron)

2013-08-17 Thread Salvatore Orlando
I tend to agree that when the gate for a project is broken, nothing should
be merged for that project until the gate jobs are green again.
In the case of Neutron, making the job non-voting only caused more bugs to
slip through, and that meant more work for the developers themselves, and
more headaches for developers of other projects relying on it.

When dealing with intermittent failures, like the bug which probably
started the issues we've been witnessing in the past 3 weeks, I think it
might be a sensible idea to make the job non-voting only for projects which
surely can't be the cause of the gate failure; or perhaps skip the
offending test only.

This means however asymmetrical gating, and from Monty's post it seems
there's something quite wrong with it. However, due to my lack of expertise
on the subject, I am unable to see the issue with it.

Salvatore




On 17 August 2013 01:42, Maru Newby  wrote:

>
> On Aug 16, 2013, at 11:44 AM, Clint Byrum  wrote:
>
> > Excerpts from Maru Newby's message of 2013-08-16 11:25:07 -0700:
> >> Neutron has been in and out of the gate for the better part of the past
> >> month, and it didn't slow the pace of development one bit.  Most Neutron
> >> developers kept on working as if nothing was wrong, blithely merging
> >> changes with no guarantees that they weren't introducing new breakage.  New
> >> bugs were indeed merged, greatly increasing the time and effort required to
> >> get Neutron back in the gate.  I don't think this is sustainable, and I'd
> >> like to make a suggestion for how to minimize the impact of gate breakage.
> >>
> >> For the record, I don't think consistent gate breakage in one project
> >> should be allowed to hold up the development of other projects.  The
> >> current approach of skipping tests or otherwise making a given job
> >> non-voting for innocent projects should continue.  It is arguably worth
> >> taking the risk of relaxing gating for those innocent projects rather than
> >> halting development unnecessarily.
> >>
> >> However, I don't think it is a good idea to relax a broken gate for the
> >> offending project.  So if a broken job/test is clearly Neutron related, it
> >> should continue to gate Neutron, effectively preventing merges until the
> >> problem is fixed.  This would both raise the visibility of breakage beyond
> >> the person responsible for fixing it, and prevent additional breakage from
> >> slipping past were the gating to be relaxed.
> >>
> >> Thoughts?
> >>
> >
> > I think this is a cultural problem related to the code review discussion
> > from earlier in the week.
> >
> > We are not looking at finding a defect and reverting as a good thing where
> > high fives should be shared all around. Instead, "you broke the gate"
> > seems to mean "you are a bad developer". I have been a bad actor here too,
> > getting frustrated with the gate-breaker and saying the wrong thing.
> >
> > The problem really is "you _broke_ the gate". It should be "the gate has
> > found a defect, hooray!". It doesn't matter what causes the gate to stop,
> > it is _always_ a defect. Now, it is possible the defect is in tempest,
> > or jenkins, or HP/Rackspace's clouds where the tests run. But it is
> > always a defect that what worked before, does not work now.
> >
> > Defects are to be expected. None of us can write perfect code. We should
> > be happy to revert commits and go forward with an enabled gate while
> > the team responsible for the commit gathers information and works to
> > correct the issue.
>
> You're preaching to the choir, and I suspect that anyone with an interest
> in software quality is likely to prefer problem solving to finger pointing.
>  However, my intent with this thread was not to promote more constructive
> thinking about defect detection.  Rather, I was hoping to communicate a
> flaw in the existing process and seek consensus on how that process could
> best be modified to minimize the cost of resolving gate breakage.
>
>


Re: [openstack-dev] Gate breakage process - Let's fix! (related but not specific to neutron)

2013-08-16 Thread Clint Byrum
Excerpts from Maru Newby's message of 2013-08-16 16:42:23 -0700:
> 
> On Aug 16, 2013, at 11:44 AM, Clint Byrum  wrote:
> 
> > Excerpts from Maru Newby's message of 2013-08-16 11:25:07 -0700:
> >> Neutron has been in and out of the gate for the better part of the past 
> >> month, and it didn't slow the pace of development one bit.  Most Neutron 
> >> developers kept on working as if nothing was wrong, blithely merging 
> >> changes with no guarantees that they weren't introducing new breakage.  
> >> New bugs were indeed merged, greatly increasing the time and effort 
> >> required to get Neutron back in the gate.  I don't think this is 
> >> sustainable, and I'd like to make a suggestion for how to minimize the 
> >> impact of gate breakage.
> >> 
> >> For the record, I don't think consistent gate breakage in one project 
> >> should be allowed to hold up the development of other projects.  The 
> >> current approach of skipping tests or otherwise making a given job 
> >> non-voting for innocent projects should continue.  It is arguably worth 
> >> taking the risk of relaxing gating for those innocent projects rather than 
> >> halting development unnecessarily.
> >> 
> >> However, I don't think it is a good idea to relax a broken gate for the 
> >> offending project.  So if a broken job/test is clearly Neutron related, it 
> >> should continue to gate Neutron, effectively preventing merges until the 
> >> problem is fixed.  This would both raise the visibility of breakage beyond 
> >> the person responsible for fixing it, and prevent additional breakage from 
> >> slipping past were the gating to be relaxed.
> >> 
> >> Thoughts?
> >> 
> > 
> > I think this is a cultural problem related to the code review discussion
> > from earlier in the week.
> > 
> > We are not looking at finding a defect and reverting as a good thing where
> > high fives should be shared all around. Instead, "you broke the gate"
> > seems to mean "you are a bad developer". I have been a bad actor here too,
> > getting frustrated with the gate-breaker and saying the wrong thing.
> > 
> > The problem really is "you _broke_ the gate". It should be "the gate has
> > found a defect, hooray!". It doesn't matter what causes the gate to stop,
> > it is _always_ a defect. Now, it is possible the defect is in tempest,
> > or jenkins, or HP/Rackspace's clouds where the tests run. But it is
> > always a defect that what worked before, does not work now.
> > 
> > Defects are to be expected. None of us can write perfect code. We should
> > be happy to revert commits and go forward with an enabled gate while
> > the team responsible for the commit gathers information and works to
> > correct the issue.
> 
> You're preaching to the choir, and I suspect that anyone with an interest in 
> software quality is likely to prefer problem solving to finger pointing.  
> However, my intent with this thread was not to promote more constructive 
> thinking about defect detection.  Rather, I was hoping to communicate a flaw 
> in the existing process and seek consensus on how that process could best be 
> modified to minimize the cost of resolving gate breakage.
> 

I believe that the process is a symptom of the culture. If we were
more eager to revert/discover/fix/re-submit on failure, we wouldn't
be turning off the gate for things. Instead we cling to whatever has
had the requisite "+2/approval" as if passing the stringent review has
imparted our code with magical powers which will eventually morph into
a passing gate.

In a perfect world we could make our CI infrastructure bisect the failures
to try and isolate the commits that caused them, so that anybody can at least
see the commit that did the damage and revert it quickly. Realistically, most
of the time we remove from the gate because the failures are intermittent
and take _forever_ to discover, so that may not even be possible.

I am suggesting that we all change our perspective and embrace "revert
this immediately" as "thank you for finding that defect" not "you jerk
why did you revert my code". It may still be hard to find which commit
to revert, but at least one can spend that time with the idea that they
will be rewarded, rather than punished, for their efforts.



Re: [openstack-dev] Gate breakage process - Let's fix! (related but not specific to neutron)

2013-08-16 Thread Maru Newby

On Aug 16, 2013, at 11:44 AM, Clint Byrum  wrote:

> Excerpts from Maru Newby's message of 2013-08-16 11:25:07 -0700:
>> Neutron has been in and out of the gate for the better part of the past 
>> month, and it didn't slow the pace of development one bit.  Most Neutron 
>> developers kept on working as if nothing was wrong, blithely merging changes 
>> with no guarantees that they weren't introducing new breakage.  New bugs 
>> were indeed merged, greatly increasing the time and effort required to get 
>> Neutron back in the gate.  I don't think this is sustainable, and I'd like 
>> to make a suggestion for how to minimize the impact of gate breakage.
>> 
>> For the record, I don't think consistent gate breakage in one project should 
>> be allowed to hold up the development of other projects.  The current 
>> approach of skipping tests or otherwise making a given job non-voting for 
>> innocent projects should continue.  It is arguably worth taking the risk of 
>> relaxing gating for those innocent projects rather than halting development 
>> unnecessarily.
>> 
>> However, I don't think it is a good idea to relax a broken gate for the 
>> offending project.  So if a broken job/test is clearly Neutron related, it 
>> should continue to gate Neutron, effectively preventing merges until the 
>> problem is fixed.  This would both raise the visibility of breakage beyond 
>> the person responsible for fixing it, and prevent additional breakage from 
>> slipping past were the gating to be relaxed.
>> 
>> Thoughts?
>> 
> 
> I think this is a cultural problem related to the code review discussion
> from earlier in the week.
> 
> We are not looking at finding a defect and reverting as a good thing where
> high fives should be shared all around. Instead, "you broke the gate"
> seems to mean "you are a bad developer". I have been a bad actor here too,
> getting frustrated with the gate-breaker and saying the wrong thing.
> 
> The problem really is "you _broke_ the gate". It should be "the gate has
> found a defect, hooray!". It doesn't matter what causes the gate to stop,
> it is _always_ a defect. Now, it is possible the defect is in tempest,
> or jenkins, or HP/Rackspace's clouds where the tests run. But it is
> always a defect that what worked before, does not work now.
> 
> Defects are to be expected. None of us can write perfect code. We should
> be happy to revert commits and go forward with an enabled gate while
> the team responsible for the commit gathers information and works to
> correct the issue.

You're preaching to the choir, and I suspect that anyone with an interest in 
software quality is likely to prefer problem solving to finger pointing.  
However, my intent with this thread was not to promote more constructive 
thinking about defect detection.  Rather, I was hoping to communicate a flaw in 
the existing process and seek consensus on how that process could best be 
modified to minimize the cost of resolving gate breakage.




Re: [openstack-dev] Gate breakage process - Let's fix! (related but not specific to neutron)

2013-08-16 Thread Maru Newby

On Aug 16, 2013, at 11:44 AM, Monty Taylor  wrote:

> 
> 
> On 08/16/2013 02:25 PM, Maru Newby wrote:
>> Neutron has been in and out of the gate for the better part of the
>> past month, and it didn't slow the pace of development one bit.  Most
>> Neutron developers kept on working as if nothing was wrong, blithely
>> merging changes with no guarantees that they weren't introducing new
>> breakage.  New bugs were indeed merged, greatly increasing the time
>> and effort required to get Neutron back in the gate.  I don't think
>> this is sustainable, and I'd like to make a suggestion for how to
>> minimize the impact of gate breakage.
>> 
>> For the record, I don't think consistent gate breakage in one project
>> should be allowed to hold up the development of other projects.  The
>> current approach of skipping tests or otherwise making a given job
>> non-voting for innocent projects should continue.  It is arguably
>> worth taking the risk of relaxing gating for those innocent projects
>> rather than halting development unnecessarily.
>> 
>> However, I don't think it is a good idea to relax a broken gate for
>> the offending project.  So if a broken job/test is clearly Neutron
>> related, it should continue to gate Neutron, effectively preventing
>> merges until the problem is fixed.  This would both raise the
>> visibility of breakage beyond the person responsible for fixing it,
>> and prevent additional breakage from slipping past were the gating to
>> be relaxed.
> 
> I do not know the exact implementation that would work here, but I do
> think it's worth discussing further. Essentially, a neutron bug killing
> the gate for a nova dev isn't necessarily going to help - because the
> nova dev doesn't necessarily have the background to fix it.
> 
> I want to be very careful that we don't wind up with an asymmetrical
> gate though…

What are your concerns regarding an 'asymmetrical gate'?  By halting neutron 
development until neutron-caused breakage is fixed, there would presumably be 
sufficient motivation to ensure timely resolution.



Re: [openstack-dev] Gate breakage process - Let's fix! (related but not specific to neutron)

2013-08-16 Thread Clint Byrum
Excerpts from Maru Newby's message of 2013-08-16 11:25:07 -0700:
> Neutron has been in and out of the gate for the better part of the past 
> month, and it didn't slow the pace of development one bit.  Most Neutron 
> developers kept on working as if nothing was wrong, blithely merging changes 
> with no guarantees that they weren't introducing new breakage.  New bugs were 
> indeed merged, greatly increasing the time and effort required to get Neutron 
> back in the gate.  I don't think this is sustainable, and I'd like to make a 
> suggestion for how to minimize the impact of gate breakage.
> 
> For the record, I don't think consistent gate breakage in one project should 
> be allowed to hold up the development of other projects.  The current 
> approach of skipping tests or otherwise making a given job non-voting for 
> innocent projects should continue.  It is arguably worth taking the risk of 
> relaxing gating for those innocent projects rather than halting development 
> unnecessarily.
> 
> However, I don't think it is a good idea to relax a broken gate for the 
> offending project.  So if a broken job/test is clearly Neutron related, it 
> should continue to gate Neutron, effectively preventing merges until the 
> problem is fixed.  This would both raise the visibility of breakage beyond 
> the person responsible for fixing it, and prevent additional breakage from 
> slipping past were the gating to be relaxed.
> 
> Thoughts?
> 

I think this is a cultural problem related to the code review discussion
from earlier in the week.

We do not treat finding a defect and reverting it as a good thing where
high fives should be shared all around. Instead, "you broke the gate"
seems to mean "you are a bad developer". I have been a bad actor here too,
getting frustrated with the gate-breaker and saying the wrong thing.

The problem really is "you _broke_ the gate". It should be "the gate has
found a defect, hooray!". It doesn't matter what causes the gate to stop,
it is _always_ a defect. Now, it is possible the defect is in tempest,
or jenkins, or HP/Rackspace's clouds where the tests run. But it is
always a defect when what worked before does not work now.

Defects are to be expected. None of us can write perfect code. We should
be happy to revert commits and go forward with an enabled gate while
the team responsible for the commit gathers information and works to
correct the issue.



Re: [openstack-dev] Gate breakage process - Let's fix! (related but not specific to neutron)

2013-08-16 Thread Monty Taylor


On 08/16/2013 02:25 PM, Maru Newby wrote:
> Neutron has been in and out of the gate for the better part of the
> past month, and it didn't slow the pace of development one bit.  Most
> Neutron developers kept on working as if nothing was wrong, blithely
> merging changes with no guarantees that they weren't introducing new
> breakage.  New bugs were indeed merged, greatly increasing the time
> and effort required to get Neutron back in the gate.  I don't think
> this is sustainable, and I'd like to make a suggestion for how to
> minimize the impact of gate breakage.
> 
> For the record, I don't think consistent gate breakage in one project
> should be allowed to hold up the development of other projects.  The
> current approach of skipping tests or otherwise making a given job
> non-voting for innocent projects should continue.  It is arguably
> worth taking the risk of relaxing gating for those innocent projects
> rather than halting development unnecessarily.
> 
> However, I don't think it is a good idea to relax a broken gate for
> the offending project.  So if a broken job/test is clearly Neutron
> related, it should continue to gate Neutron, effectively preventing
> merges until the problem is fixed.  This would both raise the
> visibility of breakage beyond the person responsible for fixing it,
> and prevent additional breakage from slipping past were the gating to
> be relaxed.

I do not know the exact implementation that would work here, but I do
think it's worth discussing further. Essentially, a neutron bug killing
the gate for a nova dev isn't necessarily going to help - because the
nova dev doesn't necessarily have the background to fix it.

I want to be very careful that we don't wind up with an asymmetrical
gate though...



Re: [openstack-dev] Gate breakage process - Let's fix! (related but not specific to neutron)

2013-08-16 Thread Alex Gaynor
I'd strongly agree with that: a project must always be gated by any tests
for it, even if they don't gate for other projects. I'd also argue that any
time there's a non-gating test (for any project), it needs a formal
explanation of why it's not gating yet, what the plan to get it gating
is, and the timeframe in which that is expected to happen.
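
As a rough illustration of what such a formal explanation could look like,
here is a minimal Python sketch. Everything in it is an assumption made for
illustration: the NonVotingJob record, its field names, and the problems()
helper are hypothetical and do not correspond to any existing OpenStack
tooling.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class NonVotingJob:
    """Hypothetical record justifying why a job is not gating a project."""
    job: str
    project: str
    reason: str = ""                    # why the job is not gating yet
    plan: str = ""                      # how it will get to gating
    target_date: Optional[date] = None  # when it is expected to gate


def problems(entry: NonVotingJob, today: date) -> List[str]:
    """Return what is missing or stale in the justification."""
    issues = []
    if not entry.reason.strip():
        issues.append("missing reason")
    if not entry.plan.strip():
        issues.append("missing plan")
    if entry.target_date is None:
        issues.append("missing target date")
    elif entry.target_date < today:
        issues.append("target date has passed")
    return issues


# Illustrative entries only; the job name is made up.
documented = NonVotingJob(
    job="example-neutron-job",
    project="nova",
    reason="job fails due to a breakage outside this project",
    plan="make the job voting again once the breakage is fixed",
    target_date=date(2013, 9, 15),
)
undocumented = NonVotingJob(job="example-neutron-job", project="nova")
```

A periodic check could call problems() on every entry and flag any
non-gating job whose explanation is incomplete or whose timeframe has
lapsed.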

Alex


On Fri, Aug 16, 2013 at 11:25 AM, Maru Newby  wrote:

> Neutron has been in and out of the gate for the better part of the past
> month, and it didn't slow the pace of development one bit.  Most Neutron
> developers kept on working as if nothing was wrong, blithely merging
> changes with no guarantees that they weren't introducing new breakage.  New
> bugs were indeed merged, greatly increasing the time and effort required to
> get Neutron back in the gate.  I don't think this is sustainable, and I'd
> like to make a suggestion for how to minimize the impact of gate breakage.
>
> For the record, I don't think consistent gate breakage in one project
> should be allowed to hold up the development of other projects.  The
> current approach of skipping tests or otherwise making a given job
> non-voting for innocent projects should continue.  It is arguably worth
> taking the risk of relaxing gating for those innocent projects rather than
> halting development unnecessarily.
>
> However, I don't think it is a good idea to relax a broken gate for the
> offending project.  So if a broken job/test is clearly Neutron related, it
> should continue to gate Neutron, effectively preventing merges until the
> problem is fixed.  This would both raise the visibility of breakage beyond
> the person responsible for fixing it, and prevent additional breakage from
> slipping past were the gating to be relaxed.
>
> Thoughts?
>
>
> m.



-- 
"I disapprove of what you say, but I will defend to the death your right to
say it." -- Evelyn Beatrice Hall (summarizing Voltaire)
"The people's good is the highest law." -- Cicero
GPG Key fingerprint: 125F 5C67 DFE9 4084


[openstack-dev] Gate breakage process - Let's fix! (related but not specific to neutron)

2013-08-16 Thread Maru Newby
Neutron has been in and out of the gate for the better part of the past month, 
and it didn't slow the pace of development one bit.  Most Neutron developers 
kept on working as if nothing was wrong, blithely merging changes with no 
guarantees that they weren't introducing new breakage.  New bugs were indeed 
merged, greatly increasing the time and effort required to get Neutron back in 
the gate.  I don't think this is sustainable, and I'd like to make a suggestion 
for how to minimize the impact of gate breakage.

For the record, I don't think consistent gate breakage in one project should be 
allowed to hold up the development of other projects.  The current approach of 
skipping tests or otherwise making a given job non-voting for innocent projects 
should continue.  It is arguably worth taking the risk of relaxing gating for 
those innocent projects rather than halting development unnecessarily.

However, I don't think it is a good idea to relax a broken gate for the 
offending project.  So if a broken job/test is clearly Neutron related, it 
should continue to gate Neutron, effectively preventing merges until the 
problem is fixed.  This would both raise the visibility of breakage beyond the 
person responsible for fixing it, and prevent additional breakage from slipping 
past were the gating to be relaxed.
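
To make the proposed rule concrete, here is a minimal Python sketch,
assuming that each broken job can be attributed to a single offending
project. The BrokenJob type, the is_voting() function, and the job name are
all hypothetical; this is not how the gate is actually configured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BrokenJob:
    """Hypothetical record of a broken job and the project that broke it."""
    name: str
    offending_project: str


def is_voting(job: BrokenJob, project: str) -> bool:
    """A broken job keeps gating only the project responsible for it.

    Innocent projects get the job relaxed (non-voting) so that their
    development is not halted; the offending project stays blocked
    until the breakage is fixed.
    """
    return project == job.offending_project


# Illustrative values only; the job name is made up.
job = BrokenJob(name="example-neutron-job", offending_project="neutron")

for project in ("neutron", "nova"):
    state = "voting" if is_voting(job, project) else "non-voting"
    print(f"{job.name} is {state} for {project}")
```

Under this rule the job would keep blocking neutron merges while being
relaxed for nova.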

Thoughts?


m.




