Re: [openstack-dev] [kolla] Stability and reliability of gate jobs

2016-07-07 Thread Steven Dake (stdake)


On 7/6/16, 5:50 PM, "Paul Belanger"  wrote:

>On Thu, Jun 16, 2016 at 12:20:06PM +, Steven Dake (stdake) wrote:
>> David,
>> 
>> The gates are unreliable for a variety of reasons - some we can fix -
>> some we can't directly.
>> 
>> RDO rabbitmq introduced IPv6 support to erlang, which caused our gate
>> reliability to drop dramatically.  Prior to this change, our gate was
>> running at 95% reliability or better - assuming the code wasn't busted.
>> The gate gear is different - meaning different setup.  We have been
>> working on debugging all these various gate provider issues with the
>> infra team and I think that is mostly concluded.
>> The gate changed to something called bindep, which has been less
>> reliable for us.
>
>I would be curious to hear your issues with bindep. A quick look at kolla
>shows you are not using other-requirements.txt yet, so you are using our
>default fallback.txt file. I am unsure how that could be impacting you.
>
>> We do not have mirrors of CentOS repos - although it is in the works.
>> Mirrors will ensure that images always get built.  At the moment many of
>> the gate failures are triggered by build failures (the mirrors are too
>> busy).
>
>This is no longer the case: openstack-infra is now mirroring both
>centos-7 [1] and epel-7 [2]. And just this week we brought the Ubuntu
>Cloud Archive [3] online. It would be pretty trivial to update kolla to
>start using them.
>
>[1] http://mirror.dfw.rax.openstack.org/centos/7/
>[2] http://mirror.dfw.rax.openstack.org/epel/7/
>[3] http://mirror.dfw.rax.openstack.org/ubuntu-cloud-archive/

Thanks.  I was aware that infra made mirrors available; I have not had a
chance to personally modify the gate to make use of them.

I am not sure if there is an issue with bindep or not.  A whole lot of
things changed at once and our gate went from pretty stable to super
unstable.  One of those things was bindep, but there were a bunch of other
changes.  I wouldn't pin it all on bindep.
 
>
>> We do not have mirrors of the other 5-10 repos and files we use.  This
>> causes more build failures.
>> 
>We do have the infrastructure in AFS to do this; it would require you to
>write the patch and submit it to openstack-infra so we can bring it
>online.  In fact, the OpenStack Ansible team was responsible for the UCA
>mirror above; I simply did the last 5% to bring it into production.

Wow, that's huge!  I was not aware of this.  Do you have an example patch
which brings a mirror into service?

Thanks
-steve


Re: [openstack-dev] [kolla] Stability and reliability of gate jobs

2016-07-06 Thread Paul Belanger
On Thu, Jun 16, 2016 at 12:20:06PM +, Steven Dake (stdake) wrote:
> David,
> 
> The gates are unreliable for a variety of reasons - some we can fix -
> some we can't directly.
> 
> RDO rabbitmq introduced IPv6 support to erlang, which caused our gate
> reliability to drop dramatically.  Prior to this change, our gate was
> running at 95% reliability or better - assuming the code wasn't busted.
> The gate gear is different - meaning different setup.  We have been
> working on debugging all these various gate provider issues with the
> infra team and I think that is mostly concluded.
> The gate changed to something called bindep, which has been less
> reliable for us.

I would be curious to hear your issues with bindep. A quick look at kolla
shows you are not using other-requirements.txt yet, so you are using our
default fallback.txt file. I am unsure how that could be impacting you.
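
For anyone unfamiliar with bindep: other-requirements.txt is just a
newline-separated list of distro packages, with optional selectors that
scope a package to a platform. A minimal sketch - the package names here
are purely illustrative, not what kolla actually needs - looks like:

  # illustrative bindep entries only, not kolla's real dependency list
  gcc
  libffi-dev [platform:dpkg]
  libffi-devel [platform:rpm]
  libssl-dev [platform:dpkg]
  openssl-devel [platform:rpm]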

> We do not have mirrors of CentOS repos - although it is in the works.
> Mirrors will ensure that images always get built.  At the moment many of
> the gate failures are triggered by build failures (the mirrors are too
> busy).

This is no longer the case: openstack-infra is now mirroring both
centos-7 [1] and epel-7 [2]. And just this week we brought the Ubuntu
Cloud Archive [3] online. It would be pretty trivial to update kolla to
start using them.

[1] http://mirror.dfw.rax.openstack.org/centos/7/
[2] http://mirror.dfw.rax.openstack.org/epel/7/
[3] http://mirror.dfw.rax.openstack.org/ubuntu-cloud-archive/
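
As a rough sketch of what "using them" could look like - assuming kolla
can inject a yum repo file into its image builds, and noting that a job
should really use the region-local mirror host it runs next to rather
than hard-coding the DFW one - a .repo override might be:

  # illustrative only; the path layout is assumed to follow the
  # standard CentOS tree under the mirror root in [1]
  [base]
  name=CentOS-7 - Base - openstack-infra mirror
  baseurl=http://mirror.dfw.rax.openstack.org/centos/7/os/$basearch/
  gpgcheck=1
  gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7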

> We do not have mirrors of the other 5-10 repos and files we use.  This
> causes more build failures.
> 
We do have the infrastructure in AFS to do this; it would require you to
write the patch and submit it to openstack-infra so we can bring it
online.  In fact, the OpenStack Ansible team was responsible for the UCA
mirror above; I simply did the last 5% to bring it into production.


Re: [openstack-dev] [kolla] Stability and reliability of gate jobs

2016-07-06 Thread Steven Dake (stdake)
David,

Thanks for the feedback.  We know we have more work to do on our
integration gate.  It is a matter of finding people trained in gate
development to do that work.

Regards
-steve



Re: [openstack-dev] [kolla] Stability and reliability of gate jobs

2016-07-04 Thread David Moreau Simard
I mentioned this on IRC to some extent but I'm going to post it here
for posterity.

I think we can all agree that integration tests are pretty darn
important, and I'm convinced I don't need to remind you why.
I'm going to reiterate that I am very concerned about the state of
the jobs but also about their coverage.

Kolla provides an implementation for a lot of the big tent projects,
but they are not properly (if at all) tested in the gate.
Only the core services are tested in an "all-in-one" fashion, and if a
commit happens to break a project that isn't tested in that all-in-one
test, no one will know about it.

This is very dangerous territory -- you can't guarantee that what
Kolla supports really works on every commit.
Both Packstack [1] and Puppet-OpenStack [2] have an extensive matrix
of test coverage across different jobs and different operating systems
to work around the memory constraints of the gate virtual machines.
They test their project implementations in different ways (e.g., glance
with file, glance with swift, cinder with lvm, cinder with ceph, neutron
with ovs, neutron with linuxbridge, etc.) and do so successfully.
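
As an illustration - the variable names below are assumptions about what
kolla's globals.yml exposes, not a tested configuration - the same split
could be expressed as small per-job override files:

  # scenario A (illustrative): file-backed glance, lvm cinder, ovs
  glance_backend_ceph: "no"
  enable_cinder_backend_lvm: "yes"
  neutron_plugin_agent: "openvswitch"

  # scenario B (illustrative): ceph-backed glance/cinder, linuxbridge
  glance_backend_ceph: "yes"
  cinder_backend_ceph: "yes"
  neutron_plugin_agent: "linuxbridge"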

I don't see why Kolla should be different if it is to be taken seriously.
My apologies if it feels I am being harsh - I am being open and honest
about Kolla's loss of credibility from my perspective.

I've put my attempts to put Kolla in RDO's testing pipeline on hold
for the Newton cycle.
I hope we can straighten out all of this -- I care about Kolla and I
want it to succeed, which is why I started this thread in the first
place.

While I don't really have the bandwidth to contribute to Kolla, I hope
you can at least consider my feedback; you can also find me on IRC
if you have questions.

[1]: https://github.com/openstack/packstack#packstack-integration-tests
[2]: https://github.com/openstack/puppet-openstack-integration#description

David Moreau Simard
Senior Software Engineer | Openstack RDO

dmsimard = [irc, github, twitter]



Re: [openstack-dev] [kolla] Stability and reliability of gate jobs

2016-06-16 Thread Steven Dake (stdake)
David,

The gates are unreliable for a variety of reasons - some we can fix - some
we can't directly.

RDO rabbitmq introduced IPv6 support to erlang, which caused our gate
reliability to drop dramatically.  Prior to this change, our gate was
running at 95% reliability or better - assuming the code wasn't busted.
The gate gear is different - meaning different setup.  We have been
working on debugging all these various gate provider issues with the
infra team and I think that is mostly concluded.
The gate changed to something called bindep, which has been less reliable
for us.
We do not have mirrors of CentOS repos - although it is in the works.
Mirrors will ensure that images always get built.  At the moment many of
the gate failures are triggered by build failures (the mirrors are too
busy).
We do not have mirrors of the other 5-10 repos and files we use.  This
causes more build failures.

Complicating matters, any of these five things above can crater one gate
job, and we run about 15 jobs, so a single failure would fail the entire
gate if the jobs were voting.  I really want a voting gate for kolla's
jobs.  I super want it.  The reason we can't make the gates voting at this
time is the sheer unreliability of the gate.
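
To put a number on how per-job failures compound (illustrative arithmetic
assuming independent failures, not measured data): even if every job
passed 95% of the time, a change would clear all 15 voting jobs only

  0.95^15 ~= 0.46

or about 46% of the time - nowhere near reliable enough to vote.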

If anyone is up for a thorough analysis of *why* the gates are failing,
that would help us fix them.

Regards
-steve


Re: [openstack-dev] [kolla] Stability and reliability of gate jobs

2016-06-15 Thread Paul Bourke

Hi David,

I agree with this completely. Gates continue to be a problem for Kolla;
the reasons why have been discussed in the past, but at least for me it's
not clear what the key issues are.


I've added this item to the agenda for today's IRC meeting (16:00 UTC -
https://wiki.openstack.org/wiki/Meetings/Kolla). It may help if we can
brainstorm a list of the most common problems here beforehand.


To kick things off, rabbitmq seems to cause a disproportionate amount of 
issues, and the problems are difficult to diagnose, particularly when 
the only way to debug is to submit "DO NOT MERGE" patch sets over and 
over. Here's an example of a failed centos binary gate from a simple 
patch set I was reviewing this morning: 
http://logs.openstack.org/06/329506/1/check/gate-kolla-dsvm-deploy-centos-binary/3486d03/console.html#_2016-06-14_15_36_19_425413


Cheers,
-Paul



[openstack-dev] [kolla] Stability and reliability of gate jobs

2016-06-14 Thread David Moreau Simard
Hi Kolla o/

I'm writing to you because I'm concerned.

In case you didn't already know, the RDO community collaborates with
upstream deployment and installation projects to test its packaging.

This relationship is beneficial in a lot of ways for both parties, in summary:
- RDO has improved test coverage (because it's otherwise hard to test
different ways of installing, configuring and deploying OpenStack by
ourselves)
- The RDO community works with upstream projects (deployment or core
projects) to fix issues that we find
- In return, the collaborating deployment project can feel more
confident that the RDO packages it consumes have already been tested
on its platform and should work

To make a long story short, we do this with a project called WeIRDO
[1] which essentially runs gate jobs outside of the gate.

I tried to get Kolla in our testing pipeline during the Mitaka cycle.
I really did.
I contributed the features I needed in Kolla to make this work, like
the configurable Yum repositories, for example.

However, in the end, I had to put off the initiative because the gate
jobs were very flappy and unreliable.
We cannot afford to have a job that is *expected* to flap in our
testing pipeline; it leads to a lot of wasted time, effort and
resources.

I think there have been a lot of improvements since my last attempt, but
to get a sample of data, I looked at ~30 recently merged reviews.
Of 260 total build/deploy jobs, 55 (or over 20%) failed -- and I
didn't account for rechecks, just the last known status of the check
jobs.
I put up the results of those jobs here [2].

In the case that interests me most, CentOS binary jobs, it's 5
failures out of 50 jobs, so 10%. Not as bad but still a concern for
me.

Other deployment projects like Puppet-OpenStack, OpenStack Ansible,
Packstack and TripleO have quite a few *voting* integration test
jobs.
Why are Kolla's jobs non-voting and so unreliable?

Thanks,

[1]: https://github.com/rdo-infra/weirdo
[2]: https://docs.google.com/spreadsheets/d/1NYyMIDaUnlOD2wWuioAEOhjeVmZe7Q8_zdFfuLjquG4/edit#gid=0

David Moreau Simard
Senior Software Engineer | Openstack RDO

dmsimard = [irc, github, twitter]
