Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward

2014-07-28 Thread Eoghan Glynn


> Sean Dague wrote:
> 
> To be clear, the functional tests will not be Tempest tests. This is a
> different class of testing, it's really another tox target that needs a
> devstack to run. A really good initial transition would be things like
> the CLI testing.
> 
> Also, the Tempest team has gone out of its way to tell people it's not
> a stable interface, and not to do that. Contributions to help make parts
> of Tempest into a stable library would be appreciated.

Well, in a perfect world, this "libification" of the re-usable bits from
Tempest would be nicely advanced *before* the projects all rush in to
implement their own in-tree functional testing mechanisms.

But as we know, we all live in a highly imperfect world ...

So do we expect the tempest-lib to be fleshed out in an emergent fashion,
as the projects dig into implementing their own in-tree func tests?

Or is it seen as an upfront seeding process that the QA team members with
Tempest domain knowledge are expecting to drive?

Cheers,
Eoghan


Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward

2014-07-28 Thread Daniel P. Berrange
On Mon, Jul 28, 2014 at 02:28:56PM +0200, Thierry Carrez wrote:
> James E. Blair wrote:
> > [...]
> > Most of these bugs are not failures of the test system; they are real
> > bugs.  Many of them have even been in OpenStack for a long time, but are
> > only becoming visible now due to improvements in our tests.  That's not
> > much help to developers whose patches are being hit with negative test
> > results from unrelated failures.  We need to find a way to address the
> > non-deterministic bugs that are lurking in OpenStack without making it
> > easier for new bugs to creep in.
> 
> I think that's a critical point. As a community, we need to move away from
> a perspective where we see the gate as a process step and describe failures
> there as "the gate is broken".
> 
> Although in some cases the failures are indeed coming from a gate bug,
> in most cases the failures are coming from a pileup of race conditions
> and other rare errors in OpenStack itself. In other words, the gate is
> not broken, *OpenStack* is broken. If you can't get the tests to pass on
> a proposed change due to test failures, that means OpenStack itself has
> reached a level where it just doesn't work. The gate is just a thermometer.
> 
> Those types of problems need to be solved, even if changes can be
> introduced in the CI/gate system to mitigate some of their most painful
> side-effects. However, currently, only a handful of developers actually
> work on fixing such issues -- and today those developers are completely
> overwhelmed and burnt out.
> 
> We need to have more people working on those bugs. We need to
> communicate this key type of strategic contribution to our corporate
> sponsors. We need to make it practical to work on those bugs, by
> providing all the tools we can to help in the debugging. We need to make
> it rewarding to work on those bugs: some of those bugs will be the most
> complex bugs you can find in OpenStack -- they should be viewed as an
> intellectual challenge for our best minds, rather than as cleaning up a
> sewer that other people continuously contribute to fill.

I recall it was suggested elsewhere recently, but I think we should perhaps
consider having much more regular bug squashing days, e.g. a "bug squash
Wednesday" every 2 weeks or so where we explicitly encourage people to focus
their attention exclusively on bug fixes and ignore all feature-related work.
Core reviewers could set the tone by not reviewing any patches which were not
tagged with a bug on those days, and by encouraging discussion around the
bugs in IRC. The bug triage and gate teams could help prime it by providing
a couple of lists of bugs, each list targeted to suit a particular skill
level, to make it easy for people to pick off bugs to attack on those days.

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|


Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward

2014-07-28 Thread Thierry Carrez
James E. Blair wrote:
> [...]
> Most of these bugs are not failures of the test system; they are real
> bugs.  Many of them have even been in OpenStack for a long time, but are
> only becoming visible now due to improvements in our tests.  That's not
> much help to developers whose patches are being hit with negative test
> results from unrelated failures.  We need to find a way to address the
> non-deterministic bugs that are lurking in OpenStack without making it
> easier for new bugs to creep in.

I think that's a critical point. As a community, we need to move away from
a perspective where we see the gate as a process step and describe failures
there as "the gate is broken".

Although in some cases the failures are indeed coming from a gate bug,
in most cases the failures are coming from a pileup of race conditions
and other rare errors in OpenStack itself. In other words, the gate is
not broken, *OpenStack* is broken. If you can't get the tests to pass on
a proposed change due to test failures, that means OpenStack itself has
reached a level where it just doesn't work. The gate is just a thermometer.

Those types of problems need to be solved, even if changes can be
introduced in the CI/gate system to mitigate some of their most painful
side-effects. However, currently, only a handful of developers actually
work on fixing such issues -- and today those developers are completely
overwhelmed and burnt out.

We need to have more people working on those bugs. We need to
communicate this key type of strategic contribution to our corporate
sponsors. We need to make it practical to work on those bugs, by
providing all the tools we can to help in the debugging. We need to make
it rewarding to work on those bugs: some of those bugs will be the most
complex bugs you can find in OpenStack -- they should be viewed as an
intellectual challenge for our best minds, rather than as cleaning up a
sewer that other people continuously contribute to fill.
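
To make the tooling point a little more concrete: the elastic-recheck
approach referenced elsewhere in this thread essentially boils down to
matching failure logs against signatures of known bugs. A highly simplified,
hypothetical sketch of that matching idea (the signatures and messages below
are made up, and real elastic-recheck works against indexed logs rather than
raw files):

    import re

    # Made-up signatures: each known gate bug gets a pattern that identifies
    # its failure mode in the logs. Only the core matching idea is shown.
    SIGNATURES = {
        "example-bug-1": re.compile(r"Timed out waiting for .* to become ACTIVE"),
        "example-bug-2": re.compile(r"Connection to .* refused during teardown"),
    }

    def classify_failure(log_lines):
        """Return the ids of known bugs whose signatures match this job's logs."""
        hits = set()
        for line in log_lines:
            for bug, pattern in SIGNATURES.items():
                if pattern.search(line):
                    hits.add(bug)
        return hits  # an empty set means an unclassified failure

Anything that comes back unclassified is exactly the kind of bug that needs
a human to dig in, which is where the extra hands would help most.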

> The CI system and project infrastructure are not static.  They have
> evolved with the project to get to where they are today, and the
> challenge now is to continue to evolve them to address the problems
> we're seeing now.  The QA and Infrastructure teams recently hosted a
> sprint where we discussed some of these issues in depth.  This post from
> Sean Dague goes into a bit of the background: [1].  The rest of this
> email outlines the medium and long-term changes we would like to make to
> address these problems.
> [...]

I like all the options suggested there, and I enjoyed the discussion
that followed.

-- 
Thierry Carrez (ttx)


Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward

2014-07-26 Thread Jay Pipes

On 07/24/2014 06:36 PM, John Dickinson wrote:

On Jul 24, 2014, at 3:25 PM, Sean Dague  wrote:

On 07/24/2014 06:15 PM, Angus Salkeld wrote:

We do this in Solum and I really like it. It's nice for the same
reviewers to see the functional tests and the code that implements
a feature.

One downside is we have had failures due to tempest reworking
their client code. This hasn't happened for a while, but it would
be good for tempest to recognize that people are using tempest as
a library and to maintain a stable API.


To be clear, the functional tests will not be Tempest tests. This
is a different class of testing, it's really another tox target
that needs a devstack to run. A really good initial transition
would be things like the CLI testing.


I too love this idea. In addition to the current Tempest tests that
are run against every patch, Swift has in-tree unit, functional[1],
and probe[2] tests. This makes it quite easy to test locally before
submitting patches and makes keeping test coverage high much easier
too. I'm really happy to hear that this will be the future direction
of testing in OpenStack.


And Glance has had functional tests in-tree for 3 years:

http://git.openstack.org/cgit/openstack/glance/tree/glance/tests/functional

-jay


Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward

2014-07-25 Thread Robert Collins
On 26 July 2014 08:20, Matthew Treinish  wrote:

>> This is also more of a pragmatic organic approach to figuring out the
>> interfaces we need to lock down. When one projects breaks depending on
>> an interface in another project, that should trigger this kind of
>> contract growth, which hopefully formally turns into a document later
>> for a stable interface.
>
> So notifications are a good example of this, but I think how we handled this
> is also an example of what not to do. The order was backwards: there should
> have been a stability guarantee upfront, with a versioning mechanism on
> notifications, when another project started relying on using them. The fact
> that there are at least 2 ML threads on how to fix and test this at this
> point in ceilometer's life seems like a poor way to handle it. I don't want
> to see us repeat this by allowing cross-project interactions to depend on
> unstable interfaces.

+1
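
Purely for illustration, a rough sketch of the kind of versioned-payload
contract being argued for above; the field names here are hypothetical, not
any project's actual notification format:

    # Hypothetical versioned notification payload; illustrative only.
    NOTIFICATION = {
        "event_type": "compute.instance.create.end",
        "payload_version": "1.1",  # bumped on any change to the payload schema
        "payload": {
            "instance_id": "abc123",
            "state": "active",
        },
    }

    def consume(notification):
        # A consumer checks the major version before relying on any field,
        # instead of silently depending on an unversioned, unstable payload.
        major = int(notification["payload_version"].split(".")[0])
        if major != 1:
            return None  # unknown major version: skip rather than misread it
        return notification["payload"]["state"]

With something like this in place, a project consuming another project's
notifications has a contract to test against, rather than an implementation
detail to chase.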

> I agree that there is a scaling issue; our variable testing quality and
> coverage between all the projects in the tempest tree is proof enough of
> this. I just don't want to see us lose the protection we have against
> inadvertent changes. Having the friction of something like the tempest
> two-step is important; we've blocked a lot of breaking api changes because
> of it.
>
> The other thing to consider is that when we adopted branchless tempest,
> part of the goal there was to ensure consistency between release
> boundaries. If we're really advocating dropping most of the API coverage
> out of tempest, part of the story needs to be around how we prevent things
> from slipping between release boundaries too.

I'm also worried about the impact on TripleO - we run everything
together functionally, and we've been aiming at the gate since
forever: we need more stability, and I'm worried that this may lead to
less. I don't think more lock-down and a bigger matrix is needed - and
I support doing an experiment to see if we end up in a better place.
Still worried :).


> But, having worked on this stuff for ~2 years, I can say from personal
> experience that every project slips when it comes to API stability,
> despite the best intentions, unless there is test coverage for it. I don't
> want to see us open the flood gates on this just because we've gotten
> ourselves into a bad situation with the state of the gate.

+1


>> Our current model leans far too much on the idea that the only time we
>> ever try to test things for real is when we throw all 1 million lines of
>> source code into one pot and stir. It really shouldn't be surprising how
>> many bugs shake out there. And this is the wrong layer to debug from, so
>> I firmly believe we need to change this back to something we can
>> actually manage to shake the bugs out with. Because right now we're
>> finding them, but our infrastructure isn't optimized for fixing them,
>> and we need to change that.
>>
>
> I agree a layered approach is best; I'm not disagreeing on that point. I
> just am not sure how much we really should be decreasing the scope of
> Tempest as the top layer around the api tests. I don't think we should
> decrease it too much just because we're beefing up the middle with improved
> functional testing. In my view, having some duplication between the layers
> is actually fine and desirable.
>
> Anyway, I feel like I'm diverging this thread off into a different area,
> so I'll shoot off a separate thread on the topic of the scale and scope of
> Tempest and the new in-tree project-specific functional tests. But to
> summarize, what I think we should be clear about at the high level for
> this thread is that, for the short term, we aren't changing the scope of
> Tempest. Instead we should just be vigilant in managing tempest's growth
> (which we've been trying to do already). We can revisit the discussion of
> decreasing Tempest's size once everyone's figured out the per-project
> functional testing. This will also give us time to collect longer-term
> data about test stability in the gate, so we can figure out which things
> are actually valuable to have in tempest. I think this is what probably
> got lost in the noise here, but it has been discussed elsewhere.

I'm pretty interested in having contract tests within each project; I
think that's the right responsibility for them. My specific concern is
the recovery process / time to recovery when a regression does get
through.

-Rob

-- 
Robert Collins 
Distinguished Technologist
HP Converged Cloud


Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward

2014-07-25 Thread Matthew Treinish
On Thu, Jul 24, 2014 at 06:54:38PM -0400, Sean Dague wrote:
> On 07/24/2014 05:57 PM, Matthew Treinish wrote:
> > On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:
> >> OpenStack has a substantial CI system that is core to its development
> >> process.  The goals of the system are to facilitate merging good code,
> >> prevent regressions, and ensure that there is at least one configuration
> >> of upstream OpenStack that we know works as a whole.  The "project
> >> gating" technique that we use is effective at preventing many kinds of
> >> regressions from landing, however more subtle, non-deterministic bugs
> >> can still get through, and these are the bugs that are currently
> >> plaguing developers with seemingly random test failures.
> >>
> >> Most of these bugs are not failures of the test system; they are real
> >> bugs.  Many of them have even been in OpenStack for a long time, but are
> >> only becoming visible now due to improvements in our tests.  That's not
> >> much help to developers whose patches are being hit with negative test
> >> results from unrelated failures.  We need to find a way to address the
> >> non-deterministic bugs that are lurking in OpenStack without making it
> >> easier for new bugs to creep in.
> >>
> >> The CI system and project infrastructure are not static.  They have
> >> evolved with the project to get to where they are today, and the
> >> challenge now is to continue to evolve them to address the problems
> >> we're seeing now.  The QA and Infrastructure teams recently hosted a
> >> sprint where we discussed some of these issues in depth.  This post from
> >> Sean Dague goes into a bit of the background: [1].  The rest of this
> >> email outlines the medium and long-term changes we would like to make to
> >> address these problems.
> >>
> >> [1] https://dague.net/2014/07/22/openstack-failures/
> >>
> >> ==Things we're already doing==
> >>
> >> The elastic-recheck tool[2] is used to identify "random" failures in
> >> test runs.  It tries to match failures to known bugs using signatures
> >> created from log messages.  It helps developers prioritize bugs by how
> >> frequently they manifest as test failures.  It also collects information
> >> on unclassified errors -- we can see how many (and which) test runs
> >> failed for an unknown reason and our overall progress on finding
> >> fingerprints for random failures.
> >>
> >> [2] http://status.openstack.org/elastic-recheck/
> >>
> >> We added a feature to Zuul that lets us manually "promote" changes to
> >> the top of the Gate pipeline.  When the QA team identifies a change that
> >> fixes a bug that is affecting overall gate stability, we can move that
> >> change to the top of the queue so that it may merge more quickly.
> >>
> >> We added the clean check facility in reaction to the January gate break
> >> down. While it does mean that any individual patch might see more tests
> >> run on it, it's now largely kept the gate queue at a countable number of
> >> hours, instead of regularly growing to more than a work day in
> >> length. It also means that a developer can Approve a code merge before
> >> tests have returned, and not ruin it for everyone else if there turned
> >> out to be a bug that the tests could catch.
> >>
> >> ==Future changes==
> >>
> >> ===Communication===
> >> We used to be better at communicating about the CI system.  As it and
> >> the project grew, we incrementally added to our institutional knowledge,
> >> but we haven't been good about maintaining that information in a form
> >> that new or existing contributors can consume to understand what's going
> >> on and why.
> >>
> >> We have started on a major effort in that direction that we call the
> >> "infra-manual" project -- it's designed to be a comprehensive "user
> >> manual" for the project infrastructure, including the CI process.  Even
> >> before that project is complete, we will write a document that
> >> summarizes the CI system and ensure it is included in new developer
> >> documentation and linked to from test results.
> >>
> >> There are also a number of ways for people to get involved in the CI
> >> system, whether focused on Infrastructure or QA, but it is not always
> >> clear how to do so.  We will improve our documentation to highlight how
> >> to contribute.
> >>
> >> ===Fixing Faster===
> >>
> >> We introduce bugs to OpenStack at some constant rate, which piles up
> >> over time. Our systems currently treat all changes as equally risky and
> >> important to the health of the system, which makes landing code changes
> >> to fix key bugs slow when we're at a high reset rate. We've got a manual
> >> process of promoting changes today to get around this, but that's
> >> actually quite costly in people time, and takes getting all the right
> >> people together at once to promote changes. You can see a number of the
> >> changes we promoted during the gate storm in June [3], and it was no
> >> small number of fixe

Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward

2014-07-25 Thread Joe Gordon
On Thu, Jul 24, 2014 at 3:54 PM, Sean Dague  wrote:

> On 07/24/2014 05:57 PM, Matthew Treinish wrote:
> > On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:
> >> OpenStack has a substantial CI system that is core to its development
> >> process.  The goals of the system are to facilitate merging good code,
> >> prevent regressions, and ensure that there is at least one configuration
> >> of upstream OpenStack that we know works as a whole.  The "project
> >> gating" technique that we use is effective at preventing many kinds of
> >> regressions from landing, however more subtle, non-deterministic bugs
> >> can still get through, and these are the bugs that are currently
> >> plaguing developers with seemingly random test failures.
> >>
> >> Most of these bugs are not failures of the test system; they are real
> >> bugs.  Many of them have even been in OpenStack for a long time, but are
> >> only becoming visible now due to improvements in our tests.  That's not
> >> much help to developers whose patches are being hit with negative test
> >> results from unrelated failures.  We need to find a way to address the
> >> non-deterministic bugs that are lurking in OpenStack without making it
> >> easier for new bugs to creep in.
> >>
> >> The CI system and project infrastructure are not static.  They have
> >> evolved with the project to get to where they are today, and the
> >> challenge now is to continue to evolve them to address the problems
> >> we're seeing now.  The QA and Infrastructure teams recently hosted a
> >> sprint where we discussed some of these issues in depth.  This post from
> >> Sean Dague goes into a bit of the background: [1].  The rest of this
> >> email outlines the medium and long-term changes we would like to make to
> >> address these problems.
> >>
> >> [1] https://dague.net/2014/07/22/openstack-failures/
> >>
> >> ==Things we're already doing==
> >>
> >> The elastic-recheck tool[2] is used to identify "random" failures in
> >> test runs.  It tries to match failures to known bugs using signatures
> >> created from log messages.  It helps developers prioritize bugs by how
> >> frequently they manifest as test failures.  It also collects information
> >> on unclassified errors -- we can see how many (and which) test runs
> >> failed for an unknown reason and our overall progress on finding
> >> fingerprints for random failures.
> >>
> >> [2] http://status.openstack.org/elastic-recheck/
> >>
> >> We added a feature to Zuul that lets us manually "promote" changes to
> >> the top of the Gate pipeline.  When the QA team identifies a change that
> >> fixes a bug that is affecting overall gate stability, we can move that
> >> change to the top of the queue so that it may merge more quickly.
> >>
> >> We added the clean check facility in reaction to the January gate break
> >> down. While it does mean that any individual patch might see more tests
> >> run on it, it's now largely kept the gate queue at a countable number of
> >> hours, instead of regularly growing to more than a work day in
> >> length. It also means that a developer can Approve a code merge before
> >> tests have returned, and not ruin it for everyone else if there turned
> >> out to be a bug that the tests could catch.
> >>
> >> ==Future changes==
> >>
> >> ===Communication===
> >> We used to be better at communicating about the CI system.  As it and
> >> the project grew, we incrementally added to our institutional knowledge,
> >> but we haven't been good about maintaining that information in a form
> >> that new or existing contributors can consume to understand what's going
> >> on and why.
> >>
> >> We have started on a major effort in that direction that we call the
> >> "infra-manual" project -- it's designed to be a comprehensive "user
> >> manual" for the project infrastructure, including the CI process.  Even
> >> before that project is complete, we will write a document that
> >> summarizes the CI system and ensure it is included in new developer
> >> documentation and linked to from test results.
> >>
> >> There are also a number of ways for people to get involved in the CI
> >> system, whether focused on Infrastructure or QA, but it is not always
> >> clear how to do so.  We will improve our documentation to highlight how
> >> to contribute.
> >>
> >> ===Fixing Faster===
> >>
> >> We introduce bugs to OpenStack at some constant rate, which piles up
> >> over time. Our systems currently treat all changes as equally risky and
> >> important to the health of the system, which makes landing code changes
> >> to fix key bugs slow when we're at a high reset rate. We've got a manual
> >> process of promoting changes today to get around this, but that's
> >> actually quite costly in people time, and takes getting all the right
> >> people together at once to promote changes. You can see a number of the
> >> changes we promoted during the gate storm in June [3], and it was no
> >> small number of fixes to ge

Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward

2014-07-25 Thread Steve Baker
On 25/07/14 11:18, Sean Dague wrote:
> On 07/25/2014 10:01 AM, Steven Hardy wrote:
>> On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:
>> 
>>>   * Put the burden for a bunch of these tests back on the projects as
>>> "functional" tests. Basically a custom devstack environment that a
>>> project can create with a set of services that they minimally need
>>> to do their job. These functional tests will live in the project
>>> tree, not in Tempest, so can be atomically landed as part of the
>>> project normal development process.
>> +1 - FWIW I don't think the current process where we require tempest
>> cores to review our project test cases is working well, so allowing
>> projects to own their own tests will be a major improvement.
>>
>> In terms of how this works in practice, will the in-tree tests still be run
>> via tempest, e.g will there be a (relatively) stable tempest api we can
>> develop the tests against, as Angus has already mentioned?
> No, not run by tempest, not using tempest code.
>
> The vision is that you'd have:
>
> heat/tests/functional/
>
> And tox -e functional would run them. It would require some config for
> endpoints. But the point is that it would be fully owned by the project
> team, and that it could do both blackbox/whitebox testing (and because it's
> in the project tree, it would know things like the data model and could
> poke behind the scenes).
>
> The tight coupling of everything is part of what's gotten us into these
> deadlocks, decoupling here is really required in order to reduce the
> fragility of the system.
>
>
Since the tempest scenario orchestration tests use heatclient, hopefully it
wouldn't be too much effort to forklift them into heat/tests/functional
without any tempest dependencies.

We can leave the orchestration api tests where they are until the
tempest-lib process results in something ready to use.



Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward

2014-07-25 Thread Sean Dague
On 07/25/2014 10:01 AM, Steven Hardy wrote:
> On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:
> 
>>   * Put the burden for a bunch of these tests back on the projects as
>> "functional" tests. Basically a custom devstack environment that a
>> project can create with a set of services that they minimally need
>> to do their job. These functional tests will live in the project
>> tree, not in Tempest, so can be atomically landed as part of the
>> project normal development process.
> 
> +1 - FWIW I don't think the current process where we require tempest
> cores to review our project test cases is working well, so allowing
> projects to own their own tests will be a major improvement.
> 
> In terms of how this works in practice, will the in-tree tests still be run
> via tempest, e.g will there be a (relatively) stable tempest api we can
> develop the tests against, as Angus has already mentioned?

No, not run by tempest, not using tempest code.

The vision is that you'd have:

heat/tests/functional/

And tox -e functional would run them. It would require some config for
endpoints. But the point is that it would be fully owned by the project
team, and that it could do both blackbox/whitebox testing (and because it's
in the project tree, it would know things like the data model and could
poke behind the scenes).
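
As a rough sketch of what such a test could look like in practice (the file
path, environment variables and API route below are purely illustrative, not
an agreed convention):

    # heat/tests/functional/test_stacks.py -- hypothetical example, not real
    # Heat code. Run by a "functional" tox target against a running devstack;
    # endpoints and credentials come from the environment, with no tempest
    # dependency.
    import os
    import unittest

    import requests

    class StackListTest(unittest.TestCase):
        def setUp(self):
            self.endpoint = os.environ["HEAT_FUNC_ENDPOINT"]
            self.token = os.environ["HEAT_FUNC_TOKEN"]

        def test_list_stacks_returns_200(self):
            # Blackbox check against the deployed service.
            resp = requests.get(self.endpoint + "/stacks",
                                headers={"X-Auth-Token": self.token})
            self.assertEqual(200, resp.status_code)

Because the test lives in the project tree, a whitebox variant could just as
easily import the project's own models and poke at internal state.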

The tight coupling of everything is part of what's gotten us into these
deadlocks, decoupling here is really required in order to reduce the
fragility of the system.

-Sean

-- 
Sean Dague
http://dague.net


Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward

2014-07-25 Thread David Kranz

On 07/25/2014 10:01 AM, Steven Hardy wrote:

On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:


   * Put the burden for a bunch of these tests back on the projects as
 "functional" tests. Basically a custom devstack environment that a
 project can create with a set of services that they minimally need
 to do their job. These functional tests will live in the project
 tree, not in Tempest, so can be atomically landed as part of the
 project normal development process.

+1 - FWIW I don't think the current process where we require tempest
cores to review our project test cases is working well, so allowing
projects to own their own tests will be a major improvement.

++
We will still need some way to make it difficult to break api
compatibility by changing both the code and its tests in a single patch,
a protection that currently comes from the "tempest two-step". Also,
tempest will still need to retain integration testing of apis that use
apis from other projects.
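
One way an in-tree suite could keep some of that protection is to pin the
response shapes explicitly, so that loosening the contract is itself a
visible, reviewable change. A hedged sketch, with an invented endpoint and
schema:

    # Hypothetical in-tree API contract check, not an actual project test.
    # The schema acts as the contract: changing it requires an explicit,
    # reviewable diff, which restores some of the friction the tempest
    # two-step provides today.
    import jsonschema
    import requests

    STACK_LIST_SCHEMA = {
        "type": "object",
        "required": ["stacks"],
        "properties": {
            "stacks": {
                "type": "array",
                "items": {
                    "type": "object",
                    "required": ["id", "stack_name", "stack_status"],
                },
            },
        },
    }

    def check_stack_list_contract(endpoint, token):
        resp = requests.get(endpoint + "/stacks",
                            headers={"X-Auth-Token": token})
        assert resp.status_code == 200
        jsonschema.validate(resp.json(), STACK_LIST_SCHEMA)

It doesn't replace cross-project integration testing, but it does make an
api-breaking change something that has to be made deliberately.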


In terms of how this works in practice, will the in-tree tests still be run
via tempest, e.g will there be a (relatively) stable tempest api we can
develop the tests against, as Angus has already mentioned?

That is a really good question. I hope the answer is that they can still
be run by tempest, but don't have to be. I tried to address this in a
message within the last hour:
http://lists.openstack.org/pipermail/openstack-dev/2014-July/041244.html


 -David


Steve


Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward

2014-07-25 Thread Steven Hardy
On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:

>   * Put the burden for a bunch of these tests back on the projects as
> "functional" tests. Basically a custom devstack environment that a
> project can create with a set of services that they minimally need
> to do their job. These functional tests will live in the project
> tree, not in Tempest, so can be atomically landed as part of the
> project normal development process.

+1 - FWIW I don't think the current process where we require tempest
cores to review our project test cases is working well, so allowing
projects to own their own tests will be a major improvement.

In terms of how this works in practice, will the in-tree tests still be run
via tempest, e.g will there be a (relatively) stable tempest api we can
develop the tests against, as Angus has already mentioned?

Steve


Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward

2014-07-25 Thread Daniel P. Berrange
On Thu, Jul 24, 2014 at 04:01:39PM -0400, Sean Dague wrote:
> On 07/24/2014 12:40 PM, Daniel P. Berrange wrote:
> > On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:
> > 
> >> ==Future changes==
> > 
> >> ===Fixing Faster===
> >>
> >> We introduce bugs to OpenStack at some constant rate, which piles up
> >> over time. Our systems currently treat all changes as equally risky and
> >> important to the health of the system, which makes landing code changes
> >> to fix key bugs slow when we're at a high reset rate. We've got a manual
> >> process of promoting changes today to get around this, but that's
> >> actually quite costly in people time, and takes getting all the right
> >> people together at once to promote changes. You can see a number of the
> >> changes we promoted during the gate storm in June [3], and it was no
> >> small number of fixes to get us back to a reasonably passing gate. We
> >> think that optimizing this system will help us land fixes to critical
> >> bugs faster.
> >>
> >> [3] https://etherpad.openstack.org/p/gatetriage-june2014
> >>
> >> The basic idea is to use the data from elastic recheck to identify that
> >> a patch is fixing a critical gate related bug. When one of these is
> >> found in the queues it will be given higher priority, including bubbling
> >> up to the top of the gate queue automatically. The manual promote
> >> process should no longer be needed, and instead bugs fixing elastic
> >> recheck tracked issues will be promoted automatically.
> >>
> >> At the same time we'll also promote review on critical gate bugs through
> >> making them visible in a number of different channels (like on elastic
> >> recheck pages, review day, and in the gerrit dashboards). The idea here
> >> again is to make the reviews that fix key bugs pop to the top of
> >> everyone's views.
> > 
> > In some of the harder gate bugs I've looked at (especially the infamous
> > 'live snapshot' timeout bug), it has been damn hard to actually figure
> > out what's wrong. AFAIK, no one has ever been able to reproduce it
> > outside of the gate infrastructure. I've even gone as far as setting up
> > identical Ubuntu VMs to the ones used in the gate on a local cloud, and
> > running the tempest tests multiple times, but still can't reproduce what
> > happens on the gate machines themselves :-( As such we're relying on
> > code inspection and the collected log messages to try and figure out
> > what might be wrong.
> > 
> > The gate collects a lot of info and publishes it, but in this case I
> > have found the published logs to be insufficient - I needed to get
> > the more verbose libvirtd.log file. devstack has the ability to turn
> > this on via an environment variable, but it is disabled by default
> > because it would add 3% to the total size of logs collected per gate
> > job.
> 
> Right now we're at 95% full on 14 TB (which is the max # of volumes you
> can attach to a single system in RAX), so every gig is sacred. There has
> been a big push, which included the sprint last week in Darmstadt, to
> get log data into swift, at which point our available storage goes way up.
> 
> So for right now, we're a little squashed. Hopefully within a month
> we'll have the full solution.
>
> As soon as we get those kinks out, I'd say we're in a position to flip
> on that logging in devstack by default.

I don't particularly mind us not having verbose libvirtd.log debugging
enabled by default, if there were a way to turn it on for individual
reviews we're debugging.

> > There's no way for me to get that environment variable for devstack
> > turned on for a specific review I want to test with. In the end I
> > uploaded a change to nova which abused rootwrap to elevate privileges,
> > install extra deb packages, reconfigure libvirtd logging and restart
> > the libvirtd daemon.
> > 
> >   
> > https://review.openstack.org/#/c/103066/11/etc/nova/rootwrap.d/compute.filters
> >   https://review.openstack.org/#/c/103066/11/nova/virt/libvirt/driver.py
> > 
> > This let me get further, but still not resolve it. My next attack is
> > to build a custom QEMU binary and hack nova further so that it can
> > download my custom QEMU binary from a website onto the gate machine
> > and run the test with it. Failing that I'm going to be hacking things
> > to try to attach to QEMU in the gate with GDB and get stack traces.
> > Anything is doable thanks to rootwrap giving us a way to elevate
> > privileges from Nova, but it is a somewhat tedious approach.
> > 
> > I'd like us to think about whether there is anything we can do to make
> > life easier in these kinds of hard debugging scenarios where the regular
> > logs are not sufficient.
> 
> Agreed. Honestly, though, we also need to figure out first-fail
> detection from our logs as well. Because realistically, if we can't debug
> failures from those, then I really don't understand how we're ever going
> to expect large users to.

Ultimately there's always going

Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward

2014-07-24 Thread Sean Dague
On 07/24/2014 05:57 PM, Matthew Treinish wrote:
> On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:
>> OpenStack has a substantial CI system that is core to its development
>> process.  The goals of the system are to facilitate merging good code,
>> prevent regressions, and ensure that there is at least one configuration
>> of upstream OpenStack that we know works as a whole.  The "project
>> gating" technique that we use is effective at preventing many kinds of
>> regressions from landing, however more subtle, non-deterministic bugs
>> can still get through, and these are the bugs that are currently
>> plaguing developers with seemingly random test failures.
>>
>> Most of these bugs are not failures of the test system; they are real
>> bugs.  Many of them have even been in OpenStack for a long time, but are
>> only becoming visible now due to improvements in our tests.  That's not
>> much help to developers whose patches are being hit with negative test
>> results from unrelated failures.  We need to find a way to address the
>> non-deterministic bugs that are lurking in OpenStack without making it
>> easier for new bugs to creep in.
>>
>> The CI system and project infrastructure are not static.  They have
>> evolved with the project to get to where they are today, and the
>> challenge now is to continue to evolve them to address the problems
>> we're seeing now.  The QA and Infrastructure teams recently hosted a
>> sprint where we discussed some of these issues in depth.  This post from
>> Sean Dague goes into a bit of the background: [1].  The rest of this
>> email outlines the medium and long-term changes we would like to make to
>> address these problems.
>>
>> [1] https://dague.net/2014/07/22/openstack-failures/
>>
>> ==Things we're already doing==
>>
>> The elastic-recheck tool[2] is used to identify "random" failures in
>> test runs.  It tries to match failures to known bugs using signatures
>> created from log messages.  It helps developers prioritize bugs by how
>> frequently they manifest as test failures.  It also collects information
>> on unclassified errors -- we can see how many (and which) test runs
>> failed for an unknown reason and our overall progress on finding
>> fingerprints for random failures.
>>
>> [2] http://status.openstack.org/elastic-recheck/
>>
>> We added a feature to Zuul that lets us manually "promote" changes to
>> the top of the Gate pipeline.  When the QA team identifies a change that
>> fixes a bug that is affecting overall gate stability, we can move that
>> change to the top of the queue so that it may merge more quickly.
>>
>> We added the clean check facility in reaction to the January gate break
>> down. While it does mean that any individual patch might see more tests
>> run on it, it's now largely kept the gate queue at a countable number of
>> hours, instead of regularly growing to more than a work day in
>> length. It also means that a developer can Approve a code merge before
>> tests have returned, and not ruin it for everyone else if there turned
>> out to be a bug that the tests could catch.
>>
>> ==Future changes==
>>
>> ===Communication===
>> We used to be better at communicating about the CI system.  As it and
>> the project grew, we incrementally added to our institutional knowledge,
>> but we haven't been good about maintaining that information in a form
>> that new or existing contributors can consume to understand what's going
>> on and why.
>>
>> We have started on a major effort in that direction that we call the
>> "infra-manual" project -- it's designed to be a comprehensive "user
>> manual" for the project infrastructure, including the CI process.  Even
>> before that project is complete, we will write a document that
>> summarizes the CI system and ensure it is included in new developer
>> documentation and linked to from test results.
>>
>> There are also a number of ways for people to get involved in the CI
>> system, whether focused on Infrastructure or QA, but it is not always
>> clear how to do so.  We will improve our documentation to highlight how
>> to contribute.
>>
>> ===Fixing Faster===
>>
>> We introduce bugs to OpenStack at some constant rate, which piles up
>> over time. Our systems currently treat all changes as equally risky and
>> important to the health of the system, which makes landing code changes
>> to fix key bugs slow when we're at a high reset rate. We've got a manual
>> process of promoting changes today to get around this, but that's
>> actually quite costly in people time, and takes getting all the right
>> people together at once to promote changes. You can see a number of the
>> changes we promoted during the gate storm in June [3], and it was no
>> small number of fixes to get us back to a reasonably passing gate. We
>> think that optimizing this system will help us land fixes to critical
>> bugs faster.
>>
>> [3] https://etherpad.openstack.org/p/gatetriage-june2014
>>
>> The basic idea is to use t

Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward

2014-07-24 Thread John Dickinson

On Jul 24, 2014, at 3:25 PM, Sean Dague  wrote:

> On 07/24/2014 06:15 PM, Angus Salkeld wrote:
>> On Wed, 2014-07-23 at 14:39 -0700, James E. Blair wrote:
>>> OpenStack has a substantial CI system that is core to its development
>>> process.  The goals of the system are to facilitate merging good code,
>>> prevent regressions, and ensure that there is at least one configuration
>>> of upstream OpenStack that we know works as a whole.  The "project
>>> gating" technique that we use is effective at preventing many kinds of
>>> regressions from landing, however more subtle, non-deterministic bugs
>>> can still get through, and these are the bugs that are currently
>>> plaguing developers with seemingly random test failures.
>>> 
>>> Most of these bugs are not failures of the test system; they are real
>>> bugs.  Many of them have even been in OpenStack for a long time, but are
>>> only becoming visible now due to improvements in our tests.  That's not
>>> much help to developers whose patches are being hit with negative test
>>> results from unrelated failures.  We need to find a way to address the
>>> non-deterministic bugs that are lurking in OpenStack without making it
>>> easier for new bugs to creep in.
>>> 
>>> The CI system and project infrastructure are not static.  They have
>>> evolved with the project to get to where they are today, and the
>>> challenge now is to continue to evolve them to address the problems
>>> we're seeing now.  The QA and Infrastructure teams recently hosted a
>>> sprint where we discussed some of these issues in depth.  This post from
>>> Sean Dague goes into a bit of the background: [1].  The rest of this
>>> email outlines the medium and long-term changes we would like to make to
>>> address these problems.
>>> 
>>> [1] https://dague.net/2014/07/22/openstack-failures/
>>> 
>>> ==Things we're already doing==
>>> 
>>> The elastic-recheck tool[2] is used to identify "random" failures in
>>> test runs.  It tries to match failures to known bugs using signatures
>>> created from log messages.  It helps developers prioritize bugs by how
>>> frequently they manifest as test failures.  It also collects information
>>> on unclassified errors -- we can see how many (and which) test runs
>>> failed for an unknown reason and our overall progress on finding
>>> fingerprints for random failures.
>>> 
>>> [2] http://status.openstack.org/elastic-recheck/
>>> 
>>> We added a feature to Zuul that lets us manually "promote" changes to
>>> the top of the Gate pipeline.  When the QA team identifies a change that
>>> fixes a bug that is affecting overall gate stability, we can move that
>>> change to the top of the queue so that it may merge more quickly.
>>> 
>>> We added the clean check facility in reaction to the January gate break
>>> down. While it does mean that any individual patch might see more tests
>>> run on it, it's now largely kept the gate queue at a countable number of
>>> hours, instead of regularly growing to more than a work day in
>>> length. It also means that a developer can Approve a code merge before
>>> tests have returned, and not ruin it for everyone else if there turned
>>> out to be a bug that the tests could catch.
>>> 
>>> ==Future changes==
>>> 
>>> ===Communication===
>>> We used to be better at communicating about the CI system.  As it and
>>> the project grew, we incrementally added to our institutional knowledge,
>>> but we haven't been good about maintaining that information in a form
>>> that new or existing contributors can consume to understand what's going
>>> on and why.
>>> 
>>> We have started on a major effort in that direction that we call the
>>> "infra-manual" project -- it's designed to be a comprehensive "user
>>> manual" for the project infrastructure, including the CI process.  Even
>>> before that project is complete, we will write a document that
>>> summarizes the CI system and ensure it is included in new developer
>>> documentation and linked to from test results.
>>> 
>>> There are also a number of ways for people to get involved in the CI
>>> system, whether focused on Infrastructure or QA, but it is not always
>>> clear how to do so.  We will improve our documentation to highlight how
>>> to contribute.
>>> 
>>> ===Fixing Faster===
>>> 
>>> We introduce bugs to OpenStack at some constant rate, which piles up
>>> over time. Our systems currently treat all changes as equally risky and
>>> important to the health of the system, which makes landing code changes
>>> to fix key bugs slow when we're at a high reset rate. We've got a manual
>>> process of promoting changes today to get around this, but that's
>>> actually quite costly in people time, and takes getting all the right
>>> people together at once to promote changes. You can see a number of the
>>> changes we promoted during the gate storm in June [3], and it was no
>>> small number of fixes to get us back to a reasonably passing gate. We
>>> think that optimizing this system will

Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward

2014-07-24 Thread Sean Dague
On 07/24/2014 06:15 PM, Angus Salkeld wrote:
> On Wed, 2014-07-23 at 14:39 -0700, James E. Blair wrote:
>> OpenStack has a substantial CI system that is core to its development
>> process.  The goals of the system are to facilitate merging good code,
>> prevent regressions, and ensure that there is at least one configuration
>> of upstream OpenStack that we know works as a whole.  The "project
>> gating" technique that we use is effective at preventing many kinds of
>> regressions from landing, however more subtle, non-deterministic bugs
>> can still get through, and these are the bugs that are currently
>> plaguing developers with seemingly random test failures.
>>
>> Most of these bugs are not failures of the test system; they are real
>> bugs.  Many of them have even been in OpenStack for a long time, but are
>> only becoming visible now due to improvements in our tests.  That's not
>> much help to developers whose patches are being hit with negative test
>> results from unrelated failures.  We need to find a way to address the
>> non-deterministic bugs that are lurking in OpenStack without making it
>> easier for new bugs to creep in.
>>
>> The CI system and project infrastructure are not static.  They have
>> evolved with the project to get to where they are today, and the
>> challenge now is to continue to evolve them to address the problems
>> we're seeing now.  The QA and Infrastructure teams recently hosted a
>> sprint where we discussed some of these issues in depth.  This post from
>> Sean Dague goes into a bit of the background: [1].  The rest of this
>> email outlines the medium and long-term changes we would like to make to
>> address these problems.
>>
>> [1] https://dague.net/2014/07/22/openstack-failures/
>>
>> ==Things we're already doing==
>>
>> The elastic-recheck tool[2] is used to identify "random" failures in
>> test runs.  It tries to match failures to known bugs using signatures
>> created from log messages.  It helps developers prioritize bugs by how
>> frequently they manifest as test failures.  It also collects information
>> on unclassified errors -- we can see how many (and which) test runs
>> failed for an unknown reason and our overall progress on finding
>> fingerprints for random failures.
>>
>> [2] http://status.openstack.org/elastic-recheck/
>>
>> We added a feature to Zuul that lets us manually "promote" changes to
>> the top of the Gate pipeline.  When the QA team identifies a change that
>> fixes a bug that is affecting overall gate stability, we can move that
>> change to the top of the queue so that it may merge more quickly.
>>
>> We added the clean check facility in reaction to the January gate break
>> down. While it does mean that any individual patch might see more tests
>> run on it, it's now largely kept the gate queue at a countable number of
>> hours, instead of regularly growing to more than a work day in
>> length. It also means that a developer can Approve a code merge before
>> tests have returned, and not ruin it for everyone else if there turned
>> out to be a bug that the tests could catch.
>>
>> ==Future changes==
>>
>> ===Communication===
>> We used to be better at communicating about the CI system.  As it and
>> the project grew, we incrementally added to our institutional knowledge,
>> but we haven't been good about maintaining that information in a form
>> that new or existing contributors can consume to understand what's going
>> on and why.
>>
>> We have started on a major effort in that direction that we call the
>> "infra-manual" project -- it's designed to be a comprehensive "user
>> manual" for the project infrastructure, including the CI process.  Even
>> before that project is complete, we will write a document that
>> summarizes the CI system and ensure it is included in new developer
>> documentation and linked to from test results.
>>
>> There are also a number of ways for people to get involved in the CI
>> system, whether focused on Infrastructure or QA, but it is not always
>> clear how to do so.  We will improve our documentation to highlight how
>> to contribute.
>>
>> ===Fixing Faster===
>>
>> We introduce bugs to OpenStack at some constant rate, which piles up
>> over time. Our systems currently treat all changes as equally risky and
>> important to the health of the system, which makes landing code changes
>> to fix key bugs slow when we're at a high reset rate. We've got a manual
>> process of promoting changes today to get around this, but that's
>> actually quite costly in people time, and takes getting all the right
>> people together at once to promote changes. You can see a number of the
>> changes we promoted during the gate storm in June [3], and it was no
>> small number of fixes to get us back to a reasonably passing gate. We
>> think that optimizing this system will help us land fixes to critical
>> bugs faster.
>>
>> [3] https://etherpad.openstack.org/p/gatetriage-june2014
>>
>> The basic idea is to use the data fr

Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward

2014-07-24 Thread Angus Salkeld
On Wed, 2014-07-23 at 14:39 -0700, James E. Blair wrote:
> OpenStack has a substantial CI system that is core to its development
> process.  The goals of the system are to facilitate merging good code,
> prevent regressions, and ensure that there is at least one configuration
> of upstream OpenStack that we know works as a whole.  The "project
> gating" technique that we use is effective at preventing many kinds of
> regressions from landing, however more subtle, non-deterministic bugs
> can still get through, and these are the bugs that are currently
> plaguing developers with seemingly random test failures.
> 
> Most of these bugs are not failures of the test system; they are real
> bugs.  Many of them have even been in OpenStack for a long time, but are
> only becoming visible now due to improvements in our tests.  That's not
> much help to developers whose patches are being hit with negative test
> results from unrelated failures.  We need to find a way to address the
> non-deterministic bugs that are lurking in OpenStack without making it
> easier for new bugs to creep in.
> 
> The CI system and project infrastructure are not static.  They have
> evolved with the project to get to where they are today, and the
> challenge now is to continue to evolve them to address the problems
> we're seeing now.  The QA and Infrastructure teams recently hosted a
> sprint where we discussed some of these issues in depth.  This post from
> Sean Dague goes into a bit of the background: [1].  The rest of this
> email outlines the medium and long-term changes we would like to make to
> address these problems.
> 
> [1] https://dague.net/2014/07/22/openstack-failures/
> 
> ==Things we're already doing==
> 
> The elastic-recheck tool[2] is used to identify "random" failures in
> test runs.  It tries to match failures to known bugs using signatures
> created from log messages.  It helps developers prioritize bugs by how
> frequently they manifest as test failures.  It also collects information
> on unclassified errors -- we can see how many (and which) test runs
> failed for an unknown reason and our overall progress on finding
> fingerprints for random failures.
> 
> [2] http://status.openstack.org/elastic-recheck/
> 
> We added a feature to Zuul that lets us manually "promote" changes to
> the top of the Gate pipeline.  When the QA team identifies a change that
> fixes a bug that is affecting overall gate stability, we can move that
> change to the top of the queue so that it may merge more quickly.
> 
> We added the clean check facility in reaction to the January gate break
> down. While it does mean that any individual patch might see more tests
> run on it, it's now largely kept the gate queue at a countable number of
> hours, instead of regularly growing to more than a work day in
> length. It also means that a developer can Approve a code merge before
> tests have returned, and not ruin it for everyone else if there turned
> out to be a bug that the tests could catch.
> 
> ==Future changes==
> 
> ===Communication===
> We used to be better at communicating about the CI system.  As it and
> the project grew, we incrementally added to our institutional knowledge,
> but we haven't been good about maintaining that information in a form
> that new or existing contributors can consume to understand what's going
> on and why.
> 
> We have started on a major effort in that direction that we call the
> "infra-manual" project -- it's designed to be a comprehensive "user
> manual" for the project infrastructure, including the CI process.  Even
> before that project is complete, we will write a document that
> summarizes the CI system and ensure it is included in new developer
> documentation and linked to from test results.
> 
> There are also a number of ways for people to get involved in the CI
> system, whether focused on Infrastructure or QA, but it is not always
> clear how to do so.  We will improve our documentation to highlight how
> to contribute.
> 
> ===Fixing Faster===
> 
> We introduce bugs to OpenStack at some constant rate, which piles up
> over time. Our systems currently treat all changes as equally risky and
> important to the health of the system, which makes landing code changes
> to fix key bugs slow when we're at a high reset rate. We've got a manual
> process of promoting changes today to get around this, but that's
> actually quite costly in people time, and takes getting all the right
> people together at once to promote changes. You can see a number of the
> changes we promoted during the gate storm in June [3], and it was no
> small number of fixes to get us back to a reasonably passing gate. We
> think that optimizing this system will help us land fixes to critical
> bugs faster.
> 
> [3] https://etherpad.openstack.org/p/gatetriage-june2014
> 
> The basic idea is to use the data from elastic recheck to identify that
> a patch is fixing a critical gate related bug. When one of these is
> found in the q

Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward

2014-07-24 Thread Matthew Treinish
On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:
> OpenStack has a substantial CI system that is core to its development
> process.  The goals of the system are to facilitate merging good code,
> prevent regressions, and ensure that there is at least one configuration
> of upstream OpenStack that we know works as a whole.  The "project
> gating" technique that we use is effective at preventing many kinds of
> regressions from landing, however more subtle, non-deterministic bugs
> can still get through, and these are the bugs that are currently
> plaguing developers with seemingly random test failures.
> 
> Most of these bugs are not failures of the test system; they are real
> bugs.  Many of them have even been in OpenStack for a long time, but are
> only becoming visible now due to improvements in our tests.  That's not
> much help to developers whose patches are being hit with negative test
> results from unrelated failures.  We need to find a way to address the
> non-deterministic bugs that are lurking in OpenStack without making it
> easier for new bugs to creep in.
> 
> The CI system and project infrastructure are not static.  They have
> evolved with the project to get to where they are today, and the
> challenge now is to continue to evolve them to address the problems
> we're seeing now.  The QA and Infrastructure teams recently hosted a
> sprint where we discussed some of these issues in depth.  This post from
> Sean Dague goes into a bit of the background: [1].  The rest of this
> email outlines the medium and long-term changes we would like to make to
> address these problems.
> 
> [1] https://dague.net/2014/07/22/openstack-failures/
> 
> ==Things we're already doing==
> 
> The elastic-recheck tool[2] is used to identify "random" failures in
> test runs.  It tries to match failures to known bugs using signatures
> created from log messages.  It helps developers prioritize bugs by how
> frequently they manifest as test failures.  It also collects information
> on unclassified errors -- we can see how many (and which) test runs
> failed for an unknown reason and our overall progress on finding
> fingerprints for random failures.
> 
> [2] http://status.openstack.org/elastic-recheck/
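
To make the "fingerprint" idea above concrete: a failed job's console log is
checked against a list of known-bug signatures. A minimal sketch of that
matching step follows -- the bug numbers are placeholders, the patterns are
only illustrative, and the real elastic-recheck signatures are Elasticsearch
queries over indexed logs rather than regexes:

    import re

    # Placeholder examples; real signatures live in elastic-recheck's query files.
    FINGERPRINTS = {
        1000001: re.compile(r"SSHTimeout: Connection to the .* via SSH timed out"),
        1000002: re.compile(r"Timed out waiting for thing .* to become ACTIVE"),
    }

    def classify_failure(console_log):
        """Return the bug numbers whose signature matches this failed run's log."""
        hits = [bug for bug, pattern in FINGERPRINTS.items()
                if pattern.search(console_log)]
        return hits or None  # None == unclassified failure needing a new fingerprint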
> 
> We added a feature to Zuul that lets us manually "promote" changes to
> the top of the Gate pipeline.  When the QA team identifies a change that
> fixes a bug that is affecting overall gate stability, we can move that
> change to the top of the queue so that it may merge more quickly.
> 
> We added the clean check facility in reaction to the January gate
> breakdown. While it does mean that any individual patch might see more tests
> run on it, it's now largely kept the gate queue at a countable number of
> hours, instead of regularly growing to more than a work day in
> length. It also means that a developer can Approve a code merge before
> tests have returned, and not ruin it for everyone else if there turned
> out to be a bug that the tests could catch.
> 
> ==Future changes==
> 
> ===Communication===
> We used to be better at communicating about the CI system.  As it and
> the project grew, we incrementally added to our institutional knowledge,
> but we haven't been good about maintaining that information in a form
> that new or existing contributors can consume to understand what's going
> on and why.
> 
> We have started on a major effort in that direction that we call the
> "infra-manual" project -- it's designed to be a comprehensive "user
> manual" for the project infrastructure, including the CI process.  Even
> before that project is complete, we will write a document that
> summarizes the CI system and ensure it is included in new developer
> documentation and linked to from test results.
> 
> There are also a number of ways for people to get involved in the CI
> system, whether focused on Infrastructure or QA, but it is not always
> clear how to do so.  We will improve our documentation to highlight how
> to contribute.
> 
> ===Fixing Faster===
> 
> We introduce bugs to OpenStack at some constant rate, which piles up
> over time. Our systems currently treat all changes as equally risky and
> important to the health of the system, which makes landing code changes
> to fix key bugs slow when we're at a high reset rate. We've got a manual
> process of promoting changes today to get around this, but that's
> actually quite costly in people time, and takes getting all the right
> people together at once to promote changes. You can see a number of the
> changes we promoted during the gate storm in June [3], and it was no
> small number of fixes to get us back to a reasonably passing gate. We
> think that optimizing this system will help us land fixes to critical
> bugs faster.
> 
> [3] https://etherpad.openstack.org/p/gatetriage-june2014
> 
> The basic idea is to use the data from elastic recheck to identify that
> a patch is fixing a critical gate related bug. When one of these is
> found i

Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward

2014-07-24 Thread Joshua Harlow

On Jul 24, 2014, at 12:54 PM, Sean Dague  wrote:

> On 07/24/2014 02:51 PM, Joshua Harlow wrote:
>> A potentially brilliant idea ;-)
>> 
>> Aren't all the machines the gate runs tests on VMs running via OpenStack 
>> APIs?
>> 
>> OpenStack supports snapshotting (last time I checked). So instead of 
>> providing back a whole bunch of log files, provide back a snapshot of the 
>> machine/s that ran the tests; let person who wants to download that snapshot 
>> download it (and then they can boot it up into virtualbox, qemu, there own 
>> OpenStack cloud...) and investigate all the log files they desire. 
>> 
>> Are we really being so conservative on space that we couldn't do this? I 
>> find it hard to believe that space is a concern for anything anymore (if it 
>> really matters store the snapshots in ceph, or glusterfs, swift, or 
>> something else... which should dedup the blocks). This is pretty common with 
>> how people use snapshots and what they back them with anyway so it would be 
>> nice if infra exposed the same thing...
>> 
>> Would something like that be possible? I'm not so familiar with all the 
>> inner workings of the infra project; but if it eventually boots VMs using an 
>> OpenStack cloud, it would seem reasonable that it could provide the same 
>> mechanisms we are all already used to using...
>> 
>> Thoughts?
> 
> There are actual space concerns, especially when we're talking about 20k
> runs/week. At that rate, with snapshots probably in the neighborhood of 10G
> each, we're talking about 200 TB/week of storage. Plus there are actual
> technical details, like the fact that the glance endpoints are really quite
> beta in the clouds we use. Remember, our test runs aren't pets, they are
> cattle; we need to figure out the right distillation of data and move on, as
> there isn't enough space or time to keep everything around.

Sure not pets..., save only the failing ones then (the broken cattle)?

Is 200 TB/week really how much would actually be stored when Ceph or another 
data-deduping backend is used? Do Rackspace or HP (the VM providers for infra, 
afaik) use a similar deduping technology for storing snapshots?

I agree about the right distillation, and maybe it's not always needed, but it 
could be nice to have a button on Gerrit that you could activate within a 
certain amount of time after the run to get all the images that the VMs used 
during the tests (yes, the download would likely be huge) if you really want 
to set up the exact same environment that the test failed with. Maybe have that 
button expire after a week (then you only need 200 TB of *expiring* space).
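
To make that retention idea concrete, the rule being suggested is roughly the
following sketch (names are made up; the 7-day window is just the number from
this thread):

    from datetime import datetime, timedelta

    RETENTION = timedelta(days=7)  # expire saved snapshots after a week

    def should_keep_snapshot(job_result, snapshot_created_at, now=None):
        """Keep snapshots only for failed runs (the 'broken cattle'), and only
        for a week, so the storage required stays bounded and *expiring*."""
        now = now or datetime.utcnow()
        if job_result != "FAILURE":
            return False
        return now - snapshot_created_at < RETENTION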

> Also portability of system images is... limited between hypervisors.
> 
> If this is something you'd like to see if you could figure out the hard
> parts of, I invite you to dive in on the infra side. It's very easy to
> say it's easy. :) Actually coming up with a workable solution requires a
> ton more time and energy.

Of course, that goes without saying.

I guess I thought this is a ML for discussions and thoughts (hence the 
'thoughts' part of this subject) and need not be a solution off the bat.

Just an idea anyway...

> 
>   -Sean
> 
> -- 
> Sean Dague
> http://dague.net
> 
> ___
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward

2014-07-24 Thread Sean Dague
On 07/24/2014 12:40 PM, Daniel P. Berrange wrote:
> On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:
> 
>> ==Future changes==
> 
>> ===Fixing Faster===
>>
>> We introduce bugs to OpenStack at some constant rate, which piles up
>> over time. Our systems currently treat all changes as equally risky and
>> important to the health of the system, which makes landing code changes
>> to fix key bugs slow when we're at a high reset rate. We've got a manual
>> process of promoting changes today to get around this, but that's
>> actually quite costly in people time, and takes getting all the right
>> people together at once to promote changes. You can see a number of the
>> changes we promoted during the gate storm in June [3], and it was no
>> small number of fixes to get us back to a reasonably passing gate. We
>> think that optimizing this system will help us land fixes to critical
>> bugs faster.
>>
>> [3] https://etherpad.openstack.org/p/gatetriage-june2014
>>
>> The basic idea is to use the data from elastic recheck to identify that
>> a patch is fixing a critical gate related bug. When one of these is
>> found in the queues it will be given higher priority, including bubbling
>> up to the top of the gate queue automatically. The manual promote
>> process should no longer be needed, and instead bugs fixing elastic
>> recheck tracked issues will be promoted automatically.
>>
>> At the same time we'll also promote review on critical gate bugs through
>> making them visible in a number of different channels (like on elastic
>> recheck pages, review day, and in the gerrit dashboards). The idea here
>> again is to make the reviews that fix key bugs pop to the top of
>> everyone's views.
> 
> In some of the harder gate bugs I've looked at (especially the infamous
> 'live snapshot' timeout bug), it has been damn hard to actually figure
> out what's wrong. AFAIK, no one has ever been able to reproduce it
> outside of the gate infrastructure. I've even gone as far as setting up
> identical Ubuntu VMs to the ones used in the gate on a local cloud, and
> running the tempest tests multiple times, but still can't reproduce what
> happens on the gate machines themselves :-( As such we're relying on
> code inspection and the collected log messages to try and figure out
> what might be wrong.
> 
> The gate collects a lot of info and publishes it, but in this case I
> have found the published logs to be insufficient - I needed to get
> the more verbose libvirtd.log file. devstack has the ability to turn
> this on via an environment variable, but it is disabled by default
> because it would add 3% to the total size of logs collected per gate
> job.

Right now we're at 95% full on 14 TB (which is the max # of volumes you
can attach to a single system in RAX), so every gig is sacred. There has
been a big push, which included the sprint last week in Darmstadt, to
get log data into swift, at which point our available storage goes way up.

So for right now, we're a little squashed. Hopefully within a month
we'll have the full solution.

As soon as we get those kinks out, I'd say we're in a position to flip
on that logging in devstack by default.
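
Back-of-the-envelope, assuming (my assumption, only roughly true) that the
14 TB volume is essentially all log data:

    total_tb = 14.0
    used_tb = 0.95 * total_tb     # ~13.3 TB already used
    free_tb = total_tb - used_tb  # ~0.7 TB of headroom
    extra_tb = 0.03 * used_tb     # +3% more log data ~= 0.4 TB
    # That's well over half the remaining headroom, which is why the
    # libvirtd.log flag stays off until the logs move to swift.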

> There's no way for me to get that environment variable for devstack
> turned on for a specific review I want to test with. In the end I
> uploaded a change to nova which abused rootwrap to elevate privileges,
> install extra deb packages, reconfigure libvirtd logging and restart
> the libvirtd daemon.
> 
>   
> https://review.openstack.org/#/c/103066/11/etc/nova/rootwrap.d/compute.filters
>   https://review.openstack.org/#/c/103066/11/nova/virt/libvirt/driver.py
> 
> This let me get further, but still not resolve it. My next attack is
> to build a custom QEMU binary and hack nova further so that it can
> download my custom QEMU binary from a website onto the gate machine
> and run the test with it. Failing that I'm going to be hacking things
> to try to attach to QEMU in the gate with GDB and get stack traces.
> Anything is doable thanks to rootwrap giving us a way to elevate
> privileges from Nova, but it is a somewhat tedious approach.
> 
> I'd like us to think about whether there is anything we can do to make
> life easier in these kinds of hard debugging scenarios where the regular
> logs are not sufficient.

Agreed. Honestly, though, we also need to figure out first-fail detection
on our logs. Because realistically, if we can't debug failures from those,
then I really don't understand how we're ever going to expect large users to.
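
Even a naive "first fail" pass over a job's console log would be a start;
a grossly simplified sketch of what that could look like (the marker strings
are just examples):

    def first_failure(log_lines):
        """Return (line_number, line) for the first ERROR/Traceback in a log."""
        markers = ("ERROR", "Traceback (most recent call last):")
        for n, line in enumerate(log_lines, start=1):
            if any(marker in line for marker in markers):
                return n, line.rstrip()
        return None  # no obvious failure marker found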

-Sean

-- 
Sean Dague
http://dague.net

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward

2014-07-24 Thread Sean Dague
On 07/24/2014 02:51 PM, Joshua Harlow wrote:
> A potentially brilliant idea ;-)
> 
> Aren't all the machines the gate runs tests on VMs running via OpenStack APIs?
> 
> OpenStack supports snapshotting (last time I checked). So instead of 
> providing back a whole bunch of log files, provide back a snapshot of the 
> machine/s that ran the tests; let the person who wants to download that snapshot 
> download it (and then they can boot it up into virtualbox, qemu, their own 
> OpenStack cloud...) and investigate all the log files they desire. 
> 
> Are we really being so conservative on space that we couldn't do this? I find 
> it hard to believe that space is a concern for anything anymore (if it really 
> matters store the snapshots in ceph, or glusterfs, swift, or something 
> else... which should dedup the blocks). This is pretty common with how people 
> use snapshots and what they back them with anyway so it would be nice if 
> infra exposed the same thing...
> 
> Would something like that be possible? I'm not so familiar with all the inner 
> workings of the infra project; but if it eventually boots VMs using an 
> OpenStack cloud, it would seem reasonable that it could provide the same 
> mechanisms we are all already used to using...
> 
> Thoughts?

There are actual space concerns, especially when we're talking about 20k
runs/week. At that rate, with snapshots probably in the neighborhood of 10G
each, we're talking about 200 TB/week of storage. Plus there are actual
technical details, like the fact that the glance endpoints are really quite
beta in the clouds we use. Remember, our test runs aren't pets, they are
cattle; we need to figure out the right distillation of data and move on, as
there isn't enough space or time to keep everything around.
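
To put numbers on that, and on the dedup question raised elsewhere in the
thread, a quick estimate (the dedup ratios are pure guesses):

    runs_per_week = 20000
    snapshot_gb = 10.0  # rough per-run snapshot size
    raw_tb_per_week = runs_per_week * snapshot_gb / 1000  # ~200 TB/week

    # Guessed dedup ratios; even an optimistic 10:1 still leaves ~20 TB/week.
    deduped_tb = {ratio: raw_tb_per_week / ratio for ratio in (1, 5, 10)}
    # {1: 200.0, 5: 40.0, 10: 20.0}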

Also portability of system images is... limited between hypervisors.

If this is something you'd like to see if you could figure out the hard
parts of, I invite you to dive in on the infra side. It's very easy to
say it's easy. :) Actually coming up with a workable solution requires a
ton more time and energy.

-Sean

-- 
Sean Dague
http://dague.net

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward

2014-07-24 Thread Joshua Harlow

On Jul 24, 2014, at 12:08 PM, Anita Kuno  wrote:

> On 07/24/2014 12:40 PM, Daniel P. Berrange wrote:
>> On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:
>> 
>>> ==Future changes==
>> 
>>> ===Fixing Faster===
>>> 
>>> We introduce bugs to OpenStack at some constant rate, which piles up
>>> over time. Our systems currently treat all changes as equally risky and
>>> important to the health of the system, which makes landing code changes
>>> to fix key bugs slow when we're at a high reset rate. We've got a manual
>>> process of promoting changes today to get around this, but that's
>>> actually quite costly in people time, and takes getting all the right
>>> people together at once to promote changes. You can see a number of the
>>> changes we promoted during the gate storm in June [3], and it was no
>>> small number of fixes to get us back to a reasonably passing gate. We
>>> think that optimizing this system will help us land fixes to critical
>>> bugs faster.
>>> 
>>> [3] https://etherpad.openstack.org/p/gatetriage-june2014
>>> 
>>> The basic idea is to use the data from elastic recheck to identify that
>>> a patch is fixing a critical gate related bug. When one of these is
>>> found in the queues it will be given higher priority, including bubbling
>>> up to the top of the gate queue automatically. The manual promote
>>> process should no longer be needed, and instead bugs fixing elastic
>>> recheck tracked issues will be promoted automatically.
>>> 
>>> At the same time we'll also promote review on critical gate bugs through
>>> making them visible in a number of different channels (like on elastic
>>> recheck pages, review day, and in the gerrit dashboards). The idea here
>>> again is to make the reviews that fix key bugs pop to the top of
>>> everyone's views.
>> 
>> In some of the harder gate bugs I've looked at (especially the infamous
>> 'live snapshot' timeout bug), it has been damn hard to actually figure
>> out what's wrong. AFAIK, no one has ever been able to reproduce it
>> outside of the gate infrastructure. I've even gone as far as setting up
>> identical Ubuntu VMs to the ones used in the gate on a local cloud, and
>> running the tempest tests multiple times, but still can't reproduce what
>> happens on the gate machines themselves :-( As such we're relying on
>> code inspection and the collected log messages to try and figure out
>> what might be wrong.
>> 
>> The gate collects a lot of info and publishes it, but in this case I
>> have found the published logs to be insufficient - I needed to get
>> the more verbose libvirtd.log file. devstack has the ability to turn
>> this on via an environment variable, but it is disabled by default
>> because it would add 3% to the total size of logs collected per gate
>> job.
>> 
>> There's no way for me to get that environment variable for devstack
>> turned on for a specific review I want to test with. In the end I
>> uploaded a change to nova which abused rootwrap to elevate privileges,
>> install extra deb packages, reconfigure libvirtd logging and restart
>> the libvirtd daemon.
>> 
>>  
>> https://review.openstack.org/#/c/103066/11/etc/nova/rootwrap.d/compute.filters
>>  https://review.openstack.org/#/c/103066/11/nova/virt/libvirt/driver.py
>> 
>> This let me get further, but still not resolve it. My next attack is
>> to build a custom QEMU binary and hack nova further so that it can
>> download my custom QEMU binary from a website onto the gate machine
>> and run the test with it. Failing that I'm going to be hacking things
>> to try to attach to QEMU in the gate with GDB and get stack traces.
>> Anything is doable thanks to rootwrap giving us a way to elevate
>> privileges from Nova, but it is a somewhat tedious approach.
>> 
>> I'd like us to think about whether there is anything we can do to make
>> life easier in these kinds of hard debugging scenarios where the regular
>> logs are not sufficient.
>> 
>> Regards,
>> Daniel
>> 
> For really really difficult bugs that can't be reproduced outside the
> gate, we do have the ability to hold VMs if we know they are displaying
> the bug, provided they are caught before the VM in question is
> scheduled for deletion. In this case, make your intentions known in a
> discussion with a member of infra-root. A conversation will ensue
> involving what to do to get you what you need to continue debugging.
> 

Why? Is space really that expensive? It boggles my mind a little that we have a 
well-financed foundation (afaik, correct me if I am wrong...) yet can't 
save 'all' the things in a smart manner (saving all the VM snapshots doesn't 
mean saving hundreds/thousands of gigabytes when you are using de-duping 
cinder/glance... backends). Expire those VMs after a week if that helps, but it 
feels like we shouldn't be so conservative about developers' needs to have 
access to all the VMs that the gate used/created..., it's not like developers 
are trying to 'harm' OpenStack by investigating root

Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward

2014-07-24 Thread Anita Kuno
On 07/24/2014 12:40 PM, Daniel P. Berrange wrote:
> On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:
> 
>> ==Future changes==
> 
>> ===Fixing Faster===
>>
>> We introduce bugs to OpenStack at some constant rate, which piles up
>> over time. Our systems currently treat all changes as equally risky and
>> important to the health of the system, which makes landing code changes
>> to fix key bugs slow when we're at a high reset rate. We've got a manual
>> process of promoting changes today to get around this, but that's
>> actually quite costly in people time, and takes getting all the right
>> people together at once to promote changes. You can see a number of the
>> changes we promoted during the gate storm in June [3], and it was no
>> small number of fixes to get us back to a reasonably passing gate. We
>> think that optimizing this system will help us land fixes to critical
>> bugs faster.
>>
>> [3] https://etherpad.openstack.org/p/gatetriage-june2014
>>
>> The basic idea is to use the data from elastic recheck to identify that
>> a patch is fixing a critical gate related bug. When one of these is
>> found in the queues it will be given higher priority, including bubbling
>> up to the top of the gate queue automatically. The manual promote
>> process should no longer be needed, and instead bugs fixing elastic
>> recheck tracked issues will be promoted automatically.
>>
>> At the same time we'll also promote review on critical gate bugs through
>> making them visible in a number of different channels (like on elastic
>> recheck pages, review day, and in the gerrit dashboards). The idea here
>> again is to make the reviews that fix key bugs pop to the top of
>> everyone's views.
> 
> In some of the harder gate bugs I've looked at (especially the infamous
> 'live snapshot' timeout bug), it has been damn hard to actually figure
> out what's wrong. AFAIK, no one has ever been able to reproduce it
> outside of the gate infrastructure. I've even gone as far as setting up
> identical Ubuntu VMs to the ones used in the gate on a local cloud, and
> running the tempest tests multiple times, but still can't reproduce what
> happens on the gate machines themselves :-( As such we're relying on
> code inspection and the collected log messages to try and figure out
> what might be wrong.
> 
> The gate collects a lot of info and publishes it, but in this case I
> have found the published logs to be insufficient - I needed to get
> the more verbose libvirtd.log file. devstack has the ability to turn
> this on via an environment variable, but it is disabled by default
> because it would add 3% to the total size of logs collected per gate
> job.
> 
> There's no way for me to get that environment variable for devstack
> turned on for a specific review I want to test with. In the end I
> uploaded a change to nova which abused rootwrap to elevate privileges,
> install extra deb packages, reconfigure libvirtd logging and restart
> the libvirtd daemon.
> 
>   
> https://review.openstack.org/#/c/103066/11/etc/nova/rootwrap.d/compute.filters
>   https://review.openstack.org/#/c/103066/11/nova/virt/libvirt/driver.py
> 
> This let me get further, but still not resolve it. My next attack is
> to build a custom QEMU binary and hack nova further so that it can
> download my custom QEMU binary from a website onto the gate machine
> and run the test with it. Failing that I'm going to be hacking things
> to try to attach to QEMU in the gate with GDB and get stack traces.
> Anything is doable thanks to rootwrap giving us a way to elevate
> privileges from Nova, but it is a somewhat tedious approach.
> 
> I'd like us to think about whether there is anything we can do to make
> life easier in these kinds of hard debugging scenarios where the regular
> logs are not sufficient.
> 
> Regards,
> Daniel
> 
For really really difficult bugs that can't be reproduced outside the
gate, we do have the ability to hold VMs if we know they are displaying
the bug, provided they are caught before the VM in question is
scheduled for deletion. In this case, make your intentions known in a
discussion with a member of infra-root. A conversation will ensue
involving what to do to get you what you need to continue debugging.

It doesn't work in all cases, but some have found it helpful. Keep in
mind you will be asked to demonstrate you have tried all other avenues
before this one is exercised.

Thanks,
Anita.


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward

2014-07-24 Thread Joshua Harlow
A potentially brilliant idea ;-)

Aren't all the machines the gate runs tests on VMs running via OpenStack APIs?

OpenStack supports snapshotting (last time I checked). So instead of providing 
back a whole bunch of log files, provide back a snapshot of the machine/s that 
ran the tests; let the person who wants to download that snapshot download it (and 
then they can boot it up into virtualbox, qemu, their own OpenStack cloud...) 
and investigate all the log files they desire. 

Are we really being so conservative on space that we couldn't do this? I find 
it hard to believe that space is a concern for anything anymore (if it really 
matters store the snapshots in ceph, or glusterfs, swift, or something else... 
which should dedup the blocks). This is pretty common with how people use 
snapshots and what they back them with anyway so it would be nice if infra 
exposed the same thing...

Would something like that be possible? I'm not so familiar with all the inner 
workings of the infra project; but if it eventually boots VMs using an 
OpenStack cloud, it would seem reasonable that it could provide the same 
mechanisms we are all already used to using...

Thoughts?
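
For what it's worth, the snapshot step itself is mechanically simple; with
python-novaclient it is roughly the following (client setup elided, the naming
scheme is invented, and this is a sketch rather than anything infra actually
runs):

    # Assumes 'nova' is an already-authenticated python-novaclient v2 client
    # and server_id is the test VM that just finished a failing run.
    def snapshot_failed_run(nova, server_id, change, patchset):
        image_name = "gate-debug-%s-%s" % (change, patchset)
        image_id = nova.servers.create_image(server_id, image_name)
        return image_id  # publish this somewhere the developer can fetch it from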

On Jul 24, 2014, at 9:40 AM, Daniel P. Berrange  wrote:

> On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:
> 
>> ==Future changes==
> 
>> ===Fixing Faster===
>> 
>> We introduce bugs to OpenStack at some constant rate, which piles up
>> over time. Our systems currently treat all changes as equally risky and
>> important to the health of the system, which makes landing code changes
>> to fix key bugs slow when we're at a high reset rate. We've got a manual
>> process of promoting changes today to get around this, but that's
>> actually quite costly in people time, and takes getting all the right
>> people together at once to promote changes. You can see a number of the
>> changes we promoted during the gate storm in June [3], and it was no
>> small number of fixes to get us back to a reasonably passing gate. We
>> think that optimizing this system will help us land fixes to critical
>> bugs faster.
>> 
>> [3] https://etherpad.openstack.org/p/gatetriage-june2014
>> 
>> The basic idea is to use the data from elastic recheck to identify that
>> a patch is fixing a critical gate related bug. When one of these is
>> found in the queues it will be given higher priority, including bubbling
>> up to the top of the gate queue automatically. The manual promote
>> process should no longer be needed, and instead bugs fixing elastic
>> recheck tracked issues will be promoted automatically.
>> 
>> At the same time we'll also promote review on critical gate bugs through
>> making them visible in a number of different channels (like on elastic
>> recheck pages, review day, and in the gerrit dashboards). The idea here
>> again is to make the reviews that fix key bugs pop to the top of
>> everyone's views.
> 
> In some of the harder gate bugs I've looked at (especially the infamous
> 'live snapshot' timeout bug), it has been damn hard to actually figure
> out what's wrong. AFAIK, no one has ever been able to reproduce it
> outside of the gate infrastructure. I've even gone as far as setting up
> identical Ubuntu VMs to the ones used in the gate on a local cloud, and
> running the tempest tests multiple times, but still can't reproduce what
> happens on the gate machines themselves :-( As such we're relying on
> code inspection and the collected log messages to try and figure out
> what might be wrong.
> 
> The gate collects a lot of info and publishes it, but in this case I
> have found the published logs to be insufficient - I needed to get
> the more verbose libvirtd.log file. devstack has the ability to turn
> this on via an environment variable, but it is disabled by default
> because it would add 3% to the total size of logs collected per gate
> job.
> 
> There's no way for me to get that environment variable for devstack
> turned on for a specific review I want to test with. In the end I
> uploaded a change to nova which abused rootwrap to elevate privileges,
> install extra deb packages, reconfigure libvirtd logging and restart
> the libvirtd daemon.
> 
>  
> https://review.openstack.org/#/c/103066/11/etc/nova/rootwrap.d/compute.filters
>  https://review.openstack.org/#/c/103066/11/nova/virt/libvirt/driver.py
> 
> This let me get further, but still not resolve it. My next attack is
> to build a custom QEMU binary and hack nova further so that it can
> download my custom QEMU binary from a website onto the gate machine
> and run the test with it. Failing that I'm going to be hacking things
> to try to attach to QEMU in the gate with GDB and get stack traces.
> Anything is doable thanks to rootwrap giving us a way to elevate
> privileges from Nova, but it is a somewhat tedious approach.
> 
> I'd like us to think about whether there is anything we can do to make
> life easier in these kinds of hard debugging scenarios where the regular
> logs are no

Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward

2014-07-24 Thread Daniel P. Berrange
On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:

> ==Future changes==

> ===Fixing Faster===
> 
> We introduce bugs to OpenStack at some constant rate, which piles up
> over time. Our systems currently treat all changes as equally risky and
> important to the health of the system, which makes landing code changes
> to fix key bugs slow when we're at a high reset rate. We've got a manual
> process of promoting changes today to get around this, but that's
> actually quite costly in people time, and takes getting all the right
> people together at once to promote changes. You can see a number of the
> changes we promoted during the gate storm in June [3], and it was no
> small number of fixes to get us back to a reasonably passing gate. We
> think that optimizing this system will help us land fixes to critical
> bugs faster.
> 
> [3] https://etherpad.openstack.org/p/gatetriage-june2014
> 
> The basic idea is to use the data from elastic recheck to identify that
> a patch is fixing a critical gate related bug. When one of these is
> found in the queues it will be given higher priority, including bubbling
> up to the top of the gate queue automatically. The manual promote
> process should no longer be needed, and instead bugs fixing elastic
> recheck tracked issues will be promoted automatically.
> 
> At the same time we'll also promote review on critical gate bugs through
> making them visible in a number of different channels (like on elastic
> recheck pages, review day, and in the gerrit dashboards). The idea here
> again is to make the reviews that fix key bugs pop to the top of
> everyone's views.
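
To make the proposed mechanism concrete, the promotion decision amounts to
something like the sketch below (field names, threshold, and data source are
invented for illustration; the real logic would live in Zuul and
elastic-recheck):

    import re

    BUG_RE = re.compile(r"(?:Closes|Partial|Related)-Bug:\s*#?(\d+)", re.IGNORECASE)

    def should_promote(commit_message, tracked_gate_bugs, min_fail_rate=0.05):
        """True if the change claims to fix a bug that elastic-recheck is
        tracking as a frequent gate failure."""
        referenced = {int(b) for b in BUG_RE.findall(commit_message)}
        return any(tracked_gate_bugs.get(bug, 0.0) >= min_fail_rate
                   for bug in referenced)

    # tracked_gate_bugs would be derived from elastic-recheck data, e.g. a
    # mapping of bug number -> observed failure rate in recent gate runs.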

In some of the harder gate bugs I've looked at (especially the infamous
'live snapshot' timeout bug), it has been damn hard to actually figure
out what's wrong. AFAIK, no one has ever been able to reproduce it
outside of the gate infrastructure. I've even gone as far as setting up
identical Ubuntu VMs to the ones used in the gate on a local cloud, and
running the tempest tests multiple times, but still can't reproduce what
happens on the gate machines themselves :-( As such we're relying on
code inspection and the collected log messages to try and figure out
what might be wrong.

The gate collects a lot of info and publishes it, but in this case I
have found the published logs to be insufficient - I needed to get
the more verbose libvirtd.log file. devstack has the ability to turn
this on via an environment variable, but it is disabled by default
because it would add 3% to the total size of logs collected per gate
job.

There's no way for me to get that environment variable for devstack
turned on for a specific review I want to test with. In the end I
uploaded a change to nova which abused rootwrap to elevate privileges,
install extra deb packages, reconfigure libvirtd logging and restart
the libvirtd daemon.

  https://review.openstack.org/#/c/103066/11/etc/nova/rootwrap.d/compute.filters
  https://review.openstack.org/#/c/103066/11/nova/virt/libvirt/driver.py

This let me get further, but still not resolve it. My next attack is
to build a custom QEMU binary and hack nova further so that it can
download my custom QEMU binary from a website onto the gate machine
and run the test with it. Failing that I'm going to be hacking things
to try to attach to QEMU in the gate with GDB and get stack traces.
Anything is doable thanks to rootwrap giving us a way to elevate
privileges from Nova, but it is a somewhat tedious approach.
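
Schematically, the rootwrap "abuse" described above is just whitelisting a few
commands in the rootwrap filters and then calling out through rootwrap from the
driver; something like the sketch below (commands and paths are illustrative
only -- see the two reviews linked earlier for the real changes):

    # Rough sketch only, not the actual patch.  Each command used with
    # run_as_root=True must be whitelisted by a matching CommandFilter entry
    # in etc/nova/rootwrap.d/compute.filters.
    from nova import utils

    def enable_verbose_libvirtd_logging():
        utils.execute('apt-get', 'install', '-y', 'gdb', run_as_root=True)
        utils.execute('sed', '-i', 's/#log_level = .*/log_level = 1/',
                      '/etc/libvirt/libvirtd.conf', run_as_root=True)
        utils.execute('service', 'libvirt-bin', 'restart', run_as_root=True)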

I'd like us to think about whether there is anything we can do to make
life easier in these kinds of hard debugging scenarios where the regular
logs are not sufficient.

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward

2014-07-24 Thread Russell Bryant
On 07/23/2014 05:39 PM, James E. Blair wrote:
> ==Final thoughts==
> 
> The current rate of test failures and subsequent rechecks is not
> sustainable in the long term.  It's not good for contributors,
> reviewers, or the overall project quality.  While these bugs do need to
> be addressed, it's unlikely that the current process will cause that to
> happen.  Instead, we want to push more substantial testing into the
> projects themselves with functional and interface testing, and depend
> less on devstack-gate integration tests to catch all bugs.  This should
> help us catch bugs closer to the source and in an environment where
> debugging is easier.  We also want to reduce the scope of devstack gate
> tests to a gold standard while running tests of other configurations in
> a traditional CI process so that people interested in those
> configurations can focus on ensuring they work.

Very nice writeup.  I think these steps sound like a positive way forward.

Thanks!

-- 
Russell Bryant

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward

2014-07-24 Thread Sean Dague
On 07/24/2014 06:06 AM, Chmouel Boudjnah wrote:
> Hello,
> 
> Thanks for writing this summary, I like all those ideas, and thanks for
> working hard on fixing this.
> 
>>   * For all non gold standard configurations, we'll dedicate a part of
>> our infrastructure to running them in a continuous background loop,
>> as well as making these configs available as experimental jobs. The
>> idea here is that we'll actually be able to provide more
>> configurations that are operating in a more traditional CI (post
>> merge) context. People that are interested in keeping these bits
>> functional can monitor those jobs and help with fixes when needed.
>> The experimental jobs mean that if developers are concerned about
>> the effect of a particular change on one of these configs, it's easy
>> to request a pre-merge test run.  In the near term we might imagine
>> this would allow for things like ceph, mongodb, docker, and possibly
>> very new libvirt to be validated in some way upstream.
> 
> What about external CI? Would external CI need to be post-merge, or would it
> stay as it is? What would be the difference between external CI plugging into
> review changes and post-merge CI?

External CI is *really* supposed to be for things that Infrastructure
can't or won't run (for technical or policy reasons). VMware isn't open
source, so that would always need to be outside of infra. Xen still has
technical challenges to get working in infra, but I think everyone would
like to see it there eventually.

Overall capacity and randomness issues mean we can't do all these
configs in a pre-merge context. But moving to a fixed-capacity,
post-merge world means we could create a ton of test data for these
configurations.

-Sean

-- 
Sean Dague
http://dague.net

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward

2014-07-24 Thread Chmouel Boudjnah
Hello,

Thanks for writing this summary, I like all those ideas, and thanks for
working hard on fixing this.

>   * For all non gold standard configurations, we'll dedicate a part of
> our infrastructure to running them in a continuous background loop,
> as well as making these configs available as experimental jobs. The
> idea here is that we'll actually be able to provide more
> configurations that are operating in a more traditional CI (post
> merge) context. People that are interested in keeping these bits
> functional can monitor those jobs and help with fixes when needed.
> The experimental jobs mean that if developers are concerned about
> the effect of a particular change on one of these configs, it's easy
> to request a pre-merge test run.  In the near term we might imagine
> this would allow for things like ceph, mongodb, docker, and possibly
> very new libvirt to be validated in some way upstream.

What about external CI? Would external CI need to be post-merge, or would it
stay as it is? What would be the difference between external CI plugging into
review changes and post-merge CI?

Chmouel

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev