Re: [openstack-dev] [TripleO][CI] Bridging the production/CI workflow gap with large periodic CI jobs

2017-04-20 Thread Ben Nemec



On 04/19/2017 12:17 PM, Justin Kilpatrick wrote:

More nodes is always better, but I don't think we need to push the host
cloud to its absolute limits right away. I have a list of several
pain points I expect to find with just 30-ish nodes that should keep us
busy for a while.

I think the optimizations are a good idea though, especially if we
want to pave the way for the next level of this sort of effort: devs
being able to ask for a 'scale ci' run on Gerrit and schedule a
decent-sized job for whenever it's convenient. The closer we can get
devs to large environments on demand, the faster and easier these
issues can be solved.

But for now, baby steps.


https://review.openstack.org/458651 should enable the heterogeneous
environments I mentioned.  We need to be a little careful with a patch
like that since it isn't actually tested in the gate, but something
along those lines should work.  It may also need a larger controller
flavor so the controller(s) don't OOM with that many computes, but if
we've got a custom env for this use case anyway, that should be doable too.
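
For a rough sense of what that custom env would cost the host cloud, here is a
back-of-envelope sketch in Python.  The 30-compute count and the 3 GB vs 8 GB
compute flavors come from the figures discussed further down this thread; the
controller and undercloud sizes are illustrative assumptions, not actual rh1
flavor definitions.

    # Back-of-envelope memory math for a heterogeneous scale testenv.
    # The 30 computes and the 3 GB vs 8 GB compute flavors come from this
    # thread; the 8 GB controller/undercloud figures are assumptions only.

    def env_memory_gb(computes, compute_gb, controllers=1, controller_gb=8,
                      undercloud_gb=8):
        """Total VM memory (GB) one OVB testenv would reserve."""
        return computes * compute_gb + controllers * controller_gb + undercloud_gb

    uniform = env_memory_gb(computes=30, compute_gb=8)  # one-size flavor today
    hetero = env_memory_gb(computes=30, compute_gb=3)   # heterogeneous computes

    print("uniform 8 GB computes:       %d GB" % uniform)             # 256 GB
    print("heterogeneous 3 GB computes: %d GB" % hetero)              # 106 GB
    print("saved per scale job:         %d GB" % (uniform - hetero))  # 150 GB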




On Wed, Apr 19, 2017 at 12:30 PM, Ben Nemec  wrote:

TLDR: We have the capacity to do this.  One scale job can be absorbed into
our existing test infrastructure with minimal impact.


On 04/19/2017 07:50 AM, Flavio Percoco wrote:


On 18/04/17 14:28 -0400, Emilien Macchi wrote:


On Mon, Apr 17, 2017 at 3:52 PM, Justin Kilpatrick
 wrote:


Because CI jobs tend to max out at about 5 nodes, there's a whole class
of minor bugs that make it into releases.

What happens is that they never show up in small clouds, then when
they do show up in larger testing clouds the people deploying those
simply work around the issue and get on with what they were supposed
to be testing. These workarounds do get documented/BZ'd, but since they
don't block anyone and only show up in large environments, they become
hard for developers to fix.

So the issue gets stuck in limbo, with nowhere to test a patchset and
no one owning the issue.

These issues pile up, and pretty soon there is a significant difference
between the default documented workflow and the 'scale' workflow, which
is filled with workarounds that may or may not be documented
upstream.

I'd like to propose getting these issues more visibility by having a
periodic upstream job that uses 20-30 OVB instances to do a larger
deployment. Maybe at 3am on a Sunday, or some other time when there's
idle execution capacity to exploit. The goal is to make these
sorts of issues more visible and hopefully get better at fixing them.



Wait no, I know some folks at 3am on a Saturday night who use TripleO
CI (ok that was a joke).



Jokes aside, it really depends on the TZ and when you schedule it. 3:00 UTC
on a Sunday is Monday 13:00 in Sydney :) Saturdays might work better, but
remember that some countries work on Sundays.



With the exception of the brief period when the OVB jobs were running at
full capacity 24 hours a day, there has always been a lull in activity
during early morning UTC.  Yes, there are people working during that time,
but generally far fewer, and the load on TripleO CI is at its lowest point.
Honestly I'd be okay running this scale job every night, not just on the
weekend.  A week of changes is a lot to sift through if a scaling issue
creeps into one of the many, many projects that affect such things in
TripleO.

Also, I should note that we're not currently constrained by absolute
hardware limits in rh1.  The reason I haven't scaled our concurrent jobs
higher is that there is already performance degradation when we have a full
70 jobs running at once.  This type of scale job would require a lot of
resources on paper, but those 30 compute nodes are mostly going to be
sitting there idle while the controller(s) get deployed, so in reality their
impact on the infrastructure is going to be less than if we just added more
concurrent jobs that used 30 additional nodes.  And we do have the
memory/cpu/disk to spare in rh1 to spin up more VMs.

We could also take advantage of heterogeneous OVB environments now, so that
the compute nodes are only 3 GB VMs instead of the 8 GB they are today. That
would further reduce the impact of this sort of job.  It would require some
tweaks to how the testenvs are created, but that shouldn't be a problem.




To be honest I'm not sure this is the best solution, but I'm seeing
this anti-pattern across several issues and I think we should try to
come up with a solution.



Yes, this proposal is really cool. An alternative would be to run this
periodic scenario outside TripleO CI and maybe send results via email.
But that is something we need to discuss with the RDO Cloud people, to
see whether we would have the resources to run it on a weekly frequency.

Thanks for bringing this up, it's crucial for us to have this kind of
feedback; now let's take action.



+1

Flavio




Re: [openstack-dev] [TripleO][CI] Bridging the production/CI workflow gap with large periodic CI jobs

2017-04-19 Thread Justin Kilpatrick
More nodes is always better, but I don't think we need to push the host
cloud to its absolute limits right away. I have a list of several
pain points I expect to find with just 30-ish nodes that should keep us
busy for a while.

I think the optimizations are a good idea though, especially if we
want to pave the way for the next level of this sort of effort: devs
being able to ask for a 'scale ci' run on Gerrit and schedule a
decent-sized job for whenever it's convenient. The closer we can get
devs to large environments on demand, the faster and easier these
issues can be solved.

But for now, baby steps.

On Wed, Apr 19, 2017 at 12:30 PM, Ben Nemec  wrote:
> TLDR: We have the capacity to do this.  One scale job can be absorbed into
> our existing test infrastructure with minimal impact.
>
>
> On 04/19/2017 07:50 AM, Flavio Percoco wrote:
>>
>> On 18/04/17 14:28 -0400, Emilien Macchi wrote:
>>>
>>> On Mon, Apr 17, 2017 at 3:52 PM, Justin Kilpatrick
>>>  wrote:

 Because CI jobs tend to max out at about 5 nodes, there's a whole class
 of minor bugs that make it into releases.

 What happens is that they never show up in small clouds, then when
 they do show up in larger testing clouds the people deploying those
 simply work around the issue and get on with what they were supposed
 to be testing. These workarounds do get documented/BZ'd, but since they
 don't block anyone and only show up in large environments, they become
 hard for developers to fix.

 So the issue gets stuck in limbo, with nowhere to test a patchset and
 no one owning the issue.

 These issues pile up, and pretty soon there is a significant difference
 between the default documented workflow and the 'scale' workflow, which
 is filled with workarounds that may or may not be documented
 upstream.

 I'd like to propose getting these issues more visibility by having a
 periodic upstream job that uses 20-30 OVB instances to do a larger
 deployment. Maybe at 3am on a Sunday, or some other time when there's
 idle execution capacity to exploit. The goal is to make these
 sorts of issues more visible and hopefully get better at fixing them.
>>>
>>>
>>> Wait no, I know some folks at 3am on a Saturday night who use TripleO
>>> CI (ok that was a joke).
>>
>>
>> Jokes aside, it really depends on the TZ and when you schedule it. 3:00 UTC
>> on a Sunday is Monday 13:00 in Sydney :) Saturdays might work better, but
>> remember that some countries work on Sundays.
>
>
> With the exception of the brief period when the OVB jobs were running at
> full capacity 24 hours a day, there has always been a lull in activity
> during early morning UTC.  Yes, there are people working during that time,
> but generally far fewer, and the load on TripleO CI is at its lowest point.
> Honestly I'd be okay running this scale job every night, not just on the
> weekend.  A week of changes is a lot to sift through if a scaling issue
> creeps into one of the many, many projects that affect such things in
> TripleO.
>
> Also, I should note that we're not currently constrained by absolute
> hardware limits in rh1.  The reason I haven't scaled our concurrent jobs
> higher is that there is already performance degradation when we have a full
> 70 jobs running at once.  This type of scale job would require a lot of
> resources on paper, but those 30 compute nodes are mostly going to be
> sitting there idle while the controller(s) get deployed, so in reality their
> impact on the infrastructure is going to be less than if we just added more
> concurrent jobs that used 30 additional nodes.  And we do have the
> memory/cpu/disk to spare in rh1 to spin up more VMs.
>
> We could also take advantage of heterogeneous OVB environments now, so that
> the compute nodes are only 3 GB VMs instead of the 8 GB they are today. That
> would further reduce the impact of this sort of job.  It would require some
> tweaks to how the testenvs are created, but that shouldn't be a problem.
>
>>
 To be honest I'm not sure this is the best solution, but I'm seeing
 this anti-pattern across several issues and I think we should try to
 come up with a solution.

>>>
>>> Yes, this proposal is really cool. An alternative would be to run this
>>> periodic scenario outside TripleO CI and maybe send results via email.
>>> But that is something we need to discuss with the RDO Cloud people, to
>>> see whether we would have the resources to run it on a weekly frequency.
>>>
>>> Thanks for bringing this up, it's crucial for us to have this kind of
>>> feedback; now let's take action.
>>
>>
>> +1
>>
>> Flavio
>>
>>
>>

Re: [openstack-dev] [TripleO][CI] Bridging the production/CI workflow gap with large periodic CI jobs

2017-04-19 Thread Ben Nemec
TLDR: We have the capacity to do this.  One scale job can be absorbed 
into our existing test infrastructure with minimal impact.


On 04/19/2017 07:50 AM, Flavio Percoco wrote:

On 18/04/17 14:28 -0400, Emilien Macchi wrote:

On Mon, Apr 17, 2017 at 3:52 PM, Justin Kilpatrick
 wrote:

Because CI jobs tend to max out at about 5 nodes, there's a whole class
of minor bugs that make it into releases.

What happens is that they never show up in small clouds, then when
they do show up in larger testing clouds the people deploying those
simply work around the issue and get on with what they were supposed
to be testing. These workarounds do get documented/BZ'd, but since they
don't block anyone and only show up in large environments, they become
hard for developers to fix.

So the issue gets stuck in limbo, with nowhere to test a patchset and
no one owning the issue.

These issues pile up, and pretty soon there is a significant difference
between the default documented workflow and the 'scale' workflow, which
is filled with workarounds that may or may not be documented
upstream.

I'd like to propose getting these issues more visibility by having a
periodic upstream job that uses 20-30 OVB instances to do a larger
deployment. Maybe at 3am on a Sunday, or some other time when there's
idle execution capacity to exploit. The goal is to make these
sorts of issues more visible and hopefully get better at fixing them.


Wait no, I know some folks at 3am on a Saturday night who use TripleO
CI (ok that was a joke).


Jokes aside, it really depends on the TZ and when you schedule it. 3:00 UTC
on a Sunday is Monday 13:00 in Sydney :) Saturdays might work better, but
remember that some countries work on Sundays.


With the exception of the brief period when the OVB jobs were running
at full capacity 24 hours a day, there has always been a lull in
activity during early morning UTC.  Yes, there are people working during
that time, but generally far fewer, and the load on TripleO CI is at its
lowest point.  Honestly I'd be okay running this scale job every night,
not just on the weekend.  A week of changes is a lot to sift through if
a scaling issue creeps into one of the many, many projects that affect
such things in TripleO.

Also, I should note that we're not currently constrained by absolute
hardware limits in rh1.  The reason I haven't scaled our concurrent jobs
higher is that there is already performance degradation when we have a
full 70 jobs running at once.  This type of scale job would require a
lot of resources on paper, but those 30 compute nodes are mostly going
to be sitting there idle while the controller(s) get deployed, so in
reality their impact on the infrastructure is going to be less than if
we just added more concurrent jobs that used 30 additional nodes.  And
we do have the memory/cpu/disk to spare in rh1 to spin up more VMs.

We could also take advantage of heterogeneous OVB environments now, so
that the compute nodes are only 3 GB VMs instead of the 8 GB they are
today.  That would further reduce the impact of this sort of job.  It
would require some tweaks to how the testenvs are created, but that
shouldn't be a problem.





To be honest I'm not sure this is the best solution, but I'm seeing
this anti-pattern across several issues and I think we should try to
come up with a solution.



Yes, this proposal is really cool. An alternative would be to run this
periodic scenario outside TripleO CI and maybe send results via email.
But that is something we need to discuss with the RDO Cloud people, to
see whether we would have the resources to run it on a weekly frequency.

Thanks for bringing this up, it's crucial for us to have this kind of
feedback; now let's take action.


+1

Flavio





Re: [openstack-dev] [TripleO][CI] Bridging the production/CI workflow gap with large periodic CI jobs

2017-04-19 Thread Flavio Percoco

On 18/04/17 14:28 -0400, Emilien Macchi wrote:

On Mon, Apr 17, 2017 at 3:52 PM, Justin Kilpatrick  wrote:

Because CI jobs tend to max out at about 5 nodes, there's a whole class
of minor bugs that make it into releases.

What happens is that they never show up in small clouds, then when
they do show up in larger testing clouds the people deploying those
simply work around the issue and get on with what they were supposed
to be testing. These workarounds do get documented/BZ'd, but since they
don't block anyone and only show up in large environments, they become
hard for developers to fix.

So the issue gets stuck in limbo, with nowhere to test a patchset and
no one owning the issue.

These issues pile up, and pretty soon there is a significant difference
between the default documented workflow and the 'scale' workflow, which
is filled with workarounds that may or may not be documented
upstream.

I'd like to propose getting these issues more visibility by having a
periodic upstream job that uses 20-30 OVB instances to do a larger
deployment. Maybe at 3am on a Sunday, or some other time when there's
idle execution capacity to exploit. The goal is to make these
sorts of issues more visible and hopefully get better at fixing them.


Wait no, I know some folks at 3am on a Saturday night who use TripleO
CI (ok that was a joke).


Jokes aside, it really depends on the TZ and when you schedule it. 3:00 UTC
on a Sunday is Monday 13:00 in Sydney :) Saturdays might work better, but
remember that some countries work on Sundays.


To be honest I'm not sure this is the best solution, but I'm seeing
this anti-pattern across several issues and I think we should try to
come up with a solution.



Yes, this proposal is really cool. An alternative would be to run this
periodic scenario outside TripleO CI and maybe send results via email.
But that is something we need to discuss with the RDO Cloud people, to
see whether we would have the resources to run it on a weekly frequency.

Thanks for bringing this up, it's crucial for us to have this kind of
feedback; now let's take action.


+1

Flavio

--
@flaper87
Flavio Percoco




Re: [openstack-dev] [TripleO][CI] Bridging the production/CI workflow gap with large periodic CI jobs

2017-04-18 Thread Wesley Hayutin
On Tue, Apr 18, 2017 at 2:28 PM, Emilien Macchi  wrote:

> On Mon, Apr 17, 2017 at 3:52 PM, Justin Kilpatrick 
> wrote:
> > Because CI jobs tend to max out at about 5 nodes, there's a whole class
> > of minor bugs that make it into releases.
> >
> > What happens is that they never show up in small clouds, then when
> > they do show up in larger testing clouds the people deploying those
> > simply work around the issue and get on with what they were supposed
> > to be testing. These workarounds do get documented/BZ'd, but since they
> > don't block anyone and only show up in large environments, they become
> > hard for developers to fix.
> >
> > So the issue gets stuck in limbo, with nowhere to test a patchset and
> > no one owning the issue.
> >
> > These issues pile up, and pretty soon there is a significant difference
> > between the default documented workflow and the 'scale' workflow, which
> > is filled with workarounds that may or may not be documented
> > upstream.
> >
> > I'd like to propose getting these issues more visibility by having a
> > periodic upstream job that uses 20-30 OVB instances to do a larger
> > deployment. Maybe at 3am on a Sunday, or some other time when there's
> > idle execution capacity to exploit. The goal is to make these
> > sorts of issues more visible and hopefully get better at fixing them.
>
> Wait no, I know some folks at 3am on a Saturday night who use TripleO
> CI (ok that was a joke).
>
> > To be honest I'm not sure this is the best solution, but I'm seeing
> > this anti-pattern across several issues and I think we should try to
> > come up with a solution.
> >
>
> Yes, this proposal is really cool. An alternative would be to run this
> periodic scenario outside TripleO CI and maybe send results via email.
> But that is something we need to discuss with the RDO Cloud people, to
> see whether we would have the resources to run it on a weekly frequency.
>

+1
I think with RDO Cloud it's possible to run a test of that scale either
in the TripleO system or just report results; either would be great.
Until RDO Cloud is in full production we might as well begin by running
a job internally with the master-tripleo-ci release config file.  The
browbeat jobs are logging here [1]; it will be a fairly simple step to
run them with the upstream content.

Adding Arx Cruz, as he is on point for a tool that distributes test
results from the TripleO periodic jobs, which may come in handy for
this scale test.  I'll probably put you two in touch tomorrow.

I'm still looking for opportunities to run browbeat in upstream TripleO
as well.  Could be a productive sync-up :)

[1] https://thirdparty-logs.rdoproject.org/

Thanks!



>
> Thanks for bringing this up, it's crucial for us to have this kind of
> feedback; now let's take action.
> --
> Emilien Macchi
>


Re: [openstack-dev] [TripleO][CI] Bridging the production/CI workflow gap with large periodic CI jobs

2017-04-18 Thread Emilien Macchi
On Mon, Apr 17, 2017 at 3:52 PM, Justin Kilpatrick  wrote:
> Because CI jobs tend to max out at about 5 nodes, there's a whole class
> of minor bugs that make it into releases.
>
> What happens is that they never show up in small clouds, then when
> they do show up in larger testing clouds the people deploying those
> simply work around the issue and get on with what they were supposed
> to be testing. These workarounds do get documented/BZ'd, but since they
> don't block anyone and only show up in large environments, they become
> hard for developers to fix.
>
> So the issue gets stuck in limbo, with nowhere to test a patchset and
> no one owning the issue.
>
> These issues pile up, and pretty soon there is a significant difference
> between the default documented workflow and the 'scale' workflow, which
> is filled with workarounds that may or may not be documented
> upstream.
>
> I'd like to propose getting these issues more visibility by having a
> periodic upstream job that uses 20-30 OVB instances to do a larger
> deployment. Maybe at 3am on a Sunday, or some other time when there's
> idle execution capacity to exploit. The goal is to make these
> sorts of issues more visible and hopefully get better at fixing them.

Wait no, I know some folks at 3am on a Saturday night who use TripleO
CI (ok that was a joke).

> To be honest I'm not sure this is the best solution, but I'm seeing
> this anti-pattern across several issues and I think we should try to
> come up with a solution.
>

Yes, this proposal is really cool. An alternative would be to run this
periodic scenario outside TripleO CI and maybe send results via email.
But that is something we need to discuss with the RDO Cloud people, to
see whether we would have the resources to run it on a weekly frequency.

Thanks for bringing this up, it's crucial for us to have this kind of
feedback; now let's take action.
-- 
Emilien Macchi



Re: [openstack-dev] [TripleO][CI] Bridging the production/CI workflow gap with large periodic CI jobs

2017-04-18 Thread Ben Nemec



On 04/17/2017 02:52 PM, Justin Kilpatrick wrote:

Because CI jobs tend to max out at about 5 nodes, there's a whole class
of minor bugs that make it into releases.

What happens is that they never show up in small clouds, then when
they do show up in larger testing clouds the people deploying those
simply work around the issue and get on with what they were supposed
to be testing. These workarounds do get documented/BZ'd, but since they
don't block anyone and only show up in large environments, they become
hard for developers to fix.

So the issue gets stuck in limbo, with nowhere to test a patchset and
no one owning the issue.

These issues pile up, and pretty soon there is a significant difference
between the default documented workflow and the 'scale' workflow, which
is filled with workarounds that may or may not be documented
upstream.

I'd like to propose getting these issues more visibility by having a
periodic upstream job that uses 20-30 OVB instances to do a larger
deployment. Maybe at 3am on a Sunday, or some other time when there's
idle execution capacity to exploit. The goal is to make these
sorts of issues more visible and hopefully get better at fixing them.

To be honest I'm not sure this is the best solution, but I'm seeing
this anti-pattern across several issues and I think we should try to
come up with a solution.


I like this idea a lot, and I think we discussed it previously on IRC
and worked through some potential issues with setting up such a job.
One other thing that occurred to me since then is that deployments at
scale generally require a larger undercloud than we have in CI.
Unfortunately I'm not sure whether we can change that just for a
periodic job.  There are a couple of potential workarounds for that, but
they would add some complication, so we'll need to keep that in mind.


Overall +1 to the idea though.  Larger-scale deployments are clearly
something we won't be able to run on every patch set, so a periodic job
seems like the right fit here.
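
One low-tech way to handle the undercloud sizing concern would be for the
scale job itself to assert a minimum undercloud size up front and fail fast
with a clear message.  Purely a sketch: the 16 GB / 8 vCPU thresholds below
are illustrative assumptions, not documented TripleO requirements.

    # Hypothetical pre-flight check a scale job could run on the undercloud
    # before starting a ~30-node deployment (Linux only).  The thresholds
    # are assumptions for illustration, not documented requirements.
    import os
    import sys

    MIN_MEM_GB = 16   # assumed minimum undercloud RAM for a scale run
    MIN_CPUS = 8      # assumed minimum vCPUs

    def undercloud_mem_gb():
        """Total memory in GB, read from /proc/meminfo."""
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemTotal:"):
                    return int(line.split()[1]) / (1024 * 1024)  # kB -> GB
        raise RuntimeError("MemTotal not found in /proc/meminfo")

    def main():
        mem = undercloud_mem_gb()
        cpus = os.cpu_count() or 0
        if mem < MIN_MEM_GB or cpus < MIN_CPUS:
            sys.exit("undercloud too small for a scale run: "
                     "%.1f GB RAM / %d vCPUs (want >= %d GB / %d vCPUs)"
                     % (mem, cpus, MIN_MEM_GB, MIN_CPUS))
        print("undercloud ok: %.1f GB RAM, %d vCPUs" % (mem, cpus))

    if __name__ == "__main__":
        main()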




[openstack-dev] [TripleO][CI] Bridging the production/CI workflow gap with large periodic CI jobs

2017-04-17 Thread Justin Kilpatrick
Because CI jobs tend to max out at about 5 nodes, there's a whole class
of minor bugs that make it into releases.

What happens is that they never show up in small clouds, then when
they do show up in larger testing clouds the people deploying those
simply work around the issue and get on with what they were supposed
to be testing. These workarounds do get documented/BZ'd, but since they
don't block anyone and only show up in large environments, they become
hard for developers to fix.

So the issue gets stuck in limbo, with nowhere to test a patchset and
no one owning the issue.

These issues pile up, and pretty soon there is a significant difference
between the default documented workflow and the 'scale' workflow, which
is filled with workarounds that may or may not be documented
upstream.

I'd like to propose getting these issues more visibility by having a
periodic upstream job that uses 20-30 OVB instances to do a larger
deployment. Maybe at 3am on a Sunday, or some other time when there's
idle execution capacity to exploit. The goal is to make these
sorts of issues more visible and hopefully get better at fixing them.

To be honest I'm not sure this is the best solution, but I'm seeing
this anti-pattern across several issues and I think we should try to
come up with a solution.
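
As a concrete illustration of the scheduling idea above (03:00 UTC on a
Sunday, or nightly during the early-morning UTC lull), here is a small
stdlib-only Python sketch that computes the next such window.  It is
illustrative only; in practice the schedule would live in whatever timer or
trigger drives the periodic pipeline, not in a hand-rolled script.

    # Illustrative only: find the next "Sunday 03:00 UTC" window.
    from datetime import datetime, timedelta, timezone

    def next_scale_window(now=None):
        """Return the next Sunday 03:00 UTC strictly after `now`."""
        now = now or datetime.now(timezone.utc)
        days_ahead = (6 - now.weekday()) % 7          # Monday=0 ... Sunday=6
        candidate = (now + timedelta(days=days_ahead)).replace(
            hour=3, minute=0, second=0, microsecond=0)
        if candidate <= now:                          # this week's slot already passed
            candidate += timedelta(days=7)
        return candidate

    if __name__ == "__main__":
        print("next scale-job window:", next_scale_window().isoformat())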
