Re: [openstack-dev] [tripleo] CI jobs failures

2016-03-09 Thread Dan Prince
On Tue, 2016-03-08 at 17:58 +, Derek Higgins wrote:
> On 7 March 2016 at 18:22, Ben Nemec  wrote:
> > 
> > On 03/07/2016 11:33 AM, Derek Higgins wrote:
> > > 
> > > On 7 March 2016 at 15:24, Derek Higgins 
> > > wrote:
> > > > 
> > > > On 6 March 2016 at 16:58, James Slagle 
> > > > wrote:
> > > > > 
> > > > > > On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi wrote:
> > > > > > 
> > > > > > I'm kind of hijacking Dan's e-mail but I would like to
> > > > > > propose some
> > > > > > technical improvements to stop having so many CI failures.
> > > > > > 
> > > > > > 
> > > > > > 1/ Stop creating swap files. We don't have SSDs; it is
> > > > > > IMHO a terrible
> > > > > > mistake to swap on files because we don't have enough RAM.
> > > > > > In my
> > > > > > experience, swapping on non-SSD disks is even worse than
> > > > > > not having
> > > > > > enough RAM. We should stop doing that, I think.
> > > > > We have been relying on swap in tripleo-ci for a little
> > > > > while. While
> > > > > not ideal, it has been an effective way to at least be able
> > > > > to test
> > > > > what we've been testing given the amount of physical RAM that
> > > > > is
> > > > > available.
> > > > Ok, so I have a few points here, in places where I'm making
> > > > assumptions I'll try to point it out
> > > > 
> > > > o Yes I agree using swap should be avoided if at all possible
> > > > 
> > > > o We are currently looking into adding more RAM to our testenv
> > > > hosts,
> > > > at which point we can afford to be a little more liberal with
> > > > memory
> > > > and this problem should become less of an issue. Having said
> > > > that:
> > > > 
> > > > o Even though using swap is bad, if we have some processes with
> > > > a
> > > > large Mem footprint that don't require constant access to a
> > > > portion of
> > > > that footprint, swapping it out over the duration of the CI test
> > > > isn't as
> > > > expensive as it sounds (assuming it doesn't need to be
> > > > swapped
> > > > back in and the kernel has selected good candidates to swap
> > > > out)
> > > > 
> > > > o The test envs that host the undercloud and overcloud nodes
> > > > have 64G
> > > > of RAM each, they each host 4 testenvs and each test env if
> > > > running a
> > > > HA job can use up to 21G of RAM, so we have over-committed
> > > > there; this is only a problem if a test env host gets 4 HA jobs
> > > > that are started around the same time (and as a result each has
> > > > 4 overcloud nodes running at the same time). To allow this to
> > > > happen without VMs being killed by the OOM killer we've also
> > > > enabled swap there. The majority of the time this swap isn't in
> > > > use; it's only needed if all 4 testenvs are being
> > > > simultaneously used and they are all running the second half of
> > > > a CI
> > > > test at the same time.
> > > > 
> > > > o The overcloud nodes are VMs running with an "unsafe" disk
> > > > caching
> > > > mechanism; this causes sync requests from the guest to be
> > > > ignored and, as a
> > > > result, if the instances being hosted on these nodes are going
> > > > into
> > > > swap this swap will be cached on the host as long as RAM is
> > > > available.
> > > > i.e. swap being used in the undercloud or overcloud isn't being
> > > > synced
> > > > to the disk on the host unless it has to be.
> > > > 
> > > > o What I'd like us to avoid is simply bumping up the memory
> > > > every time
> > > > we hit an OOM error without at least
> > > >   1. Explaining why we need more memory all of a sudden
> > > >   2. Looking into a way we may be able to avoid simply bumping
> > > > the RAM
> > > > (at peak times we are memory constrained)
> > > > 
> > > > As an example, let's take a look at the swap usage on the
> > > > undercloud of
> > > > a recent CI nonha job[1][2]. These instances have 5G of RAM with
> > > > 2G of
> > > > swap enabled via a swapfile;
> > > > the overcloud deploy started at 22:07:46 and finished at
> > > > 22:28:06.
> > > > 
> > > > In the graph you'll see a spike in memory being swapped out
> > > > around
> > > > 22:09, this corresponds almost exactly to when the overcloud
> > > > image is
> > > > being downloaded from swift[3]; looking at the top output at the
> > > > end of
> > > > the test you'll see that swift-proxy is using over 500M of
> > > > Mem[4].
> > > > 
> > > > I'd much prefer we spend time looking into why the swift proxy
> > > > is
> > > > using this much memory rather than blindly bumping the memory
> > > > allocated
> > > > to the VM, perhaps we have something configured incorrectly or
> > > > we've
> > > > hit a bug in swift.
> > > > 
> > > > Having said all that, we can bump the memory allocated to each
> > > > node, but
> > > > we have to accept one of two possible consequences:
> > > > 1. We'll end up using the swap on the testenv hosts more than
> > > > we
> > > > currently are, or
> > > > 2. We'll have to reduce the number of test envs per host from 4
> > > > down to 3, wiping 25% of our capacity

Re: [openstack-dev] [tripleo] CI jobs failures

2016-03-09 Thread Derek Higgins
On 9 March 2016 at 07:08, Richard Su  wrote:
>
>
> On 03/08/2016 09:58 AM, Derek Higgins wrote:
>>
>> On 7 March 2016 at 18:22, Ben Nemec  wrote:
>>>
>>> On 03/07/2016 11:33 AM, Derek Higgins wrote:

 On 7 March 2016 at 15:24, Derek Higgins  wrote:
>
> On 6 March 2016 at 16:58, James Slagle  wrote:
>>
>> On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi 
>> wrote:
>>>
>>> I'm kind of hijacking Dan's e-mail but I would like to propose some
>>> technical improvements to stop having so many CI failures.
>>>
>>>
>>> 1/ Stop creating swap files. We don't have SSDs; it is IMHO a
>>> terrible
>>> mistake to swap on files because we don't have enough RAM. In my
>>> experience, swapping on non-SSD disks is even worse than not having
>>> enough RAM. We should stop doing that, I think.
>>
>> We have been relying on swap in tripleo-ci for a little while. While
>> not ideal, it has been an effective way to at least be able to test
>> what we've been testing given the amount of physical RAM that is
>> available.
>
> Ok, so I have a few points here, in places where I'm making
> assumptions I'll try to point it out
>
> o Yes I agree using swap should be avoided if at all possible
>
> o We are currently looking into adding more RAM to our testenv hosts,
> at which point we can afford to be a little more liberal with memory
> and this problem should become less of an issue. Having said that:
>
> o Even though using swap is bad, if we have some processes with a
> large Mem footprint that don't require constant access to a portion of
> that footprint, swapping it out over the duration of the CI test isn't as
> expensive as it sounds (assuming it doesn't need to be swapped
> back in and the kernel has selected good candidates to swap out)
>
> o The test envs that host the undercloud and overcloud nodes have 64G
> of RAM each, they each host 4 testenvs and each test env if running a
> HA job can use up to 21G of RAM, so we have over-committed there;
> this is only a problem if a test env host gets 4 HA jobs that are
> started around the same time (and as a result each has 4 overcloud
> nodes running at the same time). To allow this to happen without VMs
> being killed by the OOM killer we've also enabled swap there. The majority of
> the time this swap isn't in use; it's only needed if all 4 testenvs are being
> simultaneously used and they are all running the second half of a CI
> test at the same time.
>
> o The overcloud nodes are VMs running with an "unsafe" disk caching
> mechanism; this causes sync requests from the guest to be ignored and, as a
> result if the instances being hosted on these nodes are going into
> swap this swap will be cached on the host as long as RAM is available.
> i.e. swap being used in the undercloud or overcloud isn't being synced
> to the disk on the host unless it has to be.
>
> o What I'd like us to avoid is simply bumping up the memory every time
> we hit an OOM error without at least
>1. Explaining why we need more memory all of a sudden
>2. Looking into a way we may be able to avoid simply bumping the RAM
> (at peak times we are memory constrained)
>
> As an example, let's take a look at the swap usage on the undercloud of
> a recent CI nonha job[1][2]. These instances have 5G of RAM with 2G of
> swap enabled via a swapfile;
> the overcloud deploy started at 22:07:46 and finished at 22:28:06.
>
> In the graph you'll see a spike in memory being swapped out around
> 22:09, this corresponds almost exactly to when the overcloud image is
> being downloaded from swift[3]; looking at the top output at the end of
> the test you'll see that swift-proxy is using over 500M of Mem[4].
>
> I'd much prefer we spend time looking into why the swift proxy is
> using this much memory rather than blindly bumping the memory allocated
> to the VM, perhaps we have something configured incorrectly or we've
> hit a bug in swift.
>
> Having said all that, we can bump the memory allocated to each node, but
> we have to accept one of two possible consequences:
> 1. We'll end up using the swap on the testenv hosts more than we
> currently are, or
> 2. We'll have to reduce the number of test envs per host from 4 down
> to 3, wiping 25% of our capacity

 Thinking about this a little more, we could do a radical experiment
 for a week and just do this, i.e. bump up the RAM on each env and
accept we lose 25% of our capacity. Maybe it doesn't matter; if our
success rate goes up then we'd be running fewer rechecks anyway.
The downside is that we'd probably hit fewer timing errors (assuming
the tight resources are what's exposing them); I say downside because
this just means downstream users might hit them more often if CI
isn't. Anyway, maybe worth discussing at tomorrow's meeting.

Re: [openstack-dev] [tripleo] CI jobs failures

2016-03-08 Thread Richard Su



On 03/08/2016 09:58 AM, Derek Higgins wrote:

On 7 March 2016 at 18:22, Ben Nemec  wrote:

On 03/07/2016 11:33 AM, Derek Higgins wrote:

On 7 March 2016 at 15:24, Derek Higgins  wrote:

On 6 March 2016 at 16:58, James Slagle  wrote:

On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi  wrote:

I'm kind of hijacking Dan's e-mail but I would like to propose some
technical improvements to stop having so many CI failures.


1/ Stop creating swap files. We don't have SSDs; it is IMHO a terrible
mistake to swap on files because we don't have enough RAM. In my
experience, swapping on non-SSD disks is even worse than not having
enough RAM. We should stop doing that, I think.

We have been relying on swap in tripleo-ci for a little while. While
not ideal, it has been an effective way to at least be able to test
what we've been testing given the amount of physical RAM that is
available.

Ok, so I have a few points here, in places where I'm making
assumptions I'll try to point it out

o Yes I agree using swap should be avoided if at all possible

o We are currently looking into adding more RAM to our testenv hosts,
at which point we can afford to be a little more liberal with memory
and this problem should become less of an issue. Having said that:

o Even though using swap is bad, if we have some processes with a
large Mem footprint that don't require constant access to a portion of
that footprint, swapping it out over the duration of the CI test isn't as
expensive as it sounds (assuming it doesn't need to be swapped
back in and the kernel has selected good candidates to swap out)

o The test envs that host the undercloud and overcloud nodes have 64G
of RAM each, they each host 4 testenvs and each test env if running a
HA job can use up to 21G of RAM, so we have over-committed there;
this is only a problem if a test env host gets 4 HA jobs that are
started around the same time (and as a result each has 4 overcloud
nodes running at the same time). To allow this to happen without VMs
being killed by the OOM killer we've also enabled swap there. The majority of
the time this swap isn't in use; it's only needed if all 4 testenvs are being
simultaneously used and they are all running the second half of a CI
test at the same time.

o The overcloud nodes are VMs running with an "unsafe" disk caching
mechanism; this causes sync requests from the guest to be ignored and, as a
result if the instances being hosted on these nodes are going into
swap this swap will be cached on the host as long as RAM is available.
i.e. swap being used in the undercloud or overcloud isn't being synced
to the disk on the host unless it has to be.

o What I'd like us to avoid is simply bumping up the memory every time
we hit an OOM error without at least
   1. Explaining why we need more memory all of a sudden
   2. Looking into a way we may be able to avoid simply bumping the RAM
(at peak times we are memory constrained)

As an example, let's take a look at the swap usage on the undercloud of
a recent CI nonha job[1][2]. These instances have 5G of RAM with 2G of
swap enabled via a swapfile;
the overcloud deploy started at 22:07:46 and finished at 22:28:06.

In the graph you'll see a spike in memory being swapped out around
22:09, this corresponds almost exactly to when the overcloud image is
being downloaded from swift[3]; looking at the top output at the end of
the test you'll see that swift-proxy is using over 500M of Mem[4].

I'd much prefer we spend time looking into why the swift proxy is
using this much memory rather than blindly bumping the memory allocated
to the VM, perhaps we have something configured incorrectly or we've
hit a bug in swift.

Having said all that, we can bump the memory allocated to each node, but
we have to accept one of two possible consequences:
1. We'll end up using the swap on the testenv hosts more than we
currently are, or
2. We'll have to reduce the number of test envs per host from 4 down
to 3, wiping 25% of our capacity

Thinking about this a little more, we could do a radical experiment
for a week and just do this, i.e. bump up the RAM on each env and
accept we lose 25% of our capacity. Maybe it doesn't matter; if our
success rate goes up then we'd be running fewer rechecks anyway.
The downside is that we'd probably hit fewer timing errors (assuming
the tight resources are what's exposing them); I say downside because
this just means downstream users might hit them more often if CI
isn't. Anyway, maybe worth discussing at tomorrow's meeting.

+1 to reducing the number of testenvs and allocating more memory to
each.  The huge number of rechecks we're having to do is definitely
contributing to our CI load in a big way, so if we could cut those down
by 50% I bet it would offset the lost testenvs.  And it would reduce
developer aggravation by about a million percent. :-)

Also, on some level I'm not too concerned about the absolute minimum
memory use case.  Nobody deploying OpenStack in the real world is doing
so on 4 GB nodes.  I doubt 99% of them ar

Re: [openstack-dev] [tripleo] CI jobs failures

2016-03-08 Thread James Slagle
On Tue, Mar 8, 2016 at 12:58 PM, Derek Higgins  wrote:
> We discussed this at today's meeting but never really came to a
> conclusion except to say most people wanted to try it. The main
> objection brought up was that we shouldn't go dropping the nonha job;
> that isn't what I was proposing, so let me rephrase here and see if we
> can gather +/-1's
>
> I'm proposing we redeploy our testenvs with more RAM allocated per
> env; specifically, we would go from
> 5G undercloud and 4G overcloud nodes to
> 6G undercloud and 5G overcloud nodes.

+1

>
> In addition, to accommodate this we would reduce the number of envs
> available from 48 (the actual number varies from time to time) to 36
> (3 envs per host)
>
> No changes would be happening on the jobs we actually run

+1

>
> The assumption is that with the increased resources we would hit fewer
> false negative test results and as a result recheck jobs less (so the
> 25% reduction in capacity wouldn't hit us as hard as it might seem).
> We also may not be able to easily undo this if it doesn't work out, as
> once we start merging things that use the extra RAM it will be hard to
> go back.

The CPU load is also very high. When I have been looking at jobs that
appear stuck, it takes almost 3 minutes just to do a `nova list`
sometimes. So I think one fewer testenv on each host will help that as
well.
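
(For rough context, the per-host arithmetic behind the proposal looks
something like the sketch below, assuming an HA env is 1 undercloud
plus 4 overcloud nodes as described earlier in the thread; the exact
testenv layout may differ.)

  # current:  4 envs x (5G undercloud + 4 x 4G overcloud) per 64G host
  # proposed: 3 envs x (6G undercloud + 4 x 5G overcloud) per 64G host
  echo "current:  $((4 * (5 + 4*4)))G"   # 84G committed against 64G of RAM
  echo "proposed: $((3 * (6 + 4*5)))G"   # 78G committed against 64G of RAM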

-- 
-- James Slagle
--

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo] CI jobs failures

2016-03-08 Thread Ben Nemec
On 03/08/2016 11:58 AM, Derek Higgins wrote:
> On 7 March 2016 at 18:22, Ben Nemec  wrote:
>> On 03/07/2016 11:33 AM, Derek Higgins wrote:
>>> On 7 March 2016 at 15:24, Derek Higgins  wrote:
 On 6 March 2016 at 16:58, James Slagle  wrote:
> On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi  
> wrote:
>> I'm kind of hijacking Dan's e-mail but I would like to propose some
>> technical improvements to stop having so many CI failures.
>>
>>
>> 1/ Stop creating swap files. We don't have SSDs; it is IMHO a terrible
>> mistake to swap on files because we don't have enough RAM. In my
>> experience, swapping on non-SSD disks is even worse than not having
>> enough RAM. We should stop doing that, I think.
>
> We have been relying on swap in tripleo-ci for a little while. While
> not ideal, it has been an effective way to at least be able to test
> what we've been testing given the amount of physical RAM that is
> available.

 Ok, so I have a few points here, in places where I'm making
 assumptions I'll try to point it out

 o Yes I agree using swap should be avoided if at all possible

 o We are currently looking into adding more RAM to our testenv hosts,
at which point we can afford to be a little more liberal with memory
and this problem should become less of an issue. Having said that:

 o Even though using swap is bad, if we have some processes with a
 large Mem footprint that don't require constant access to a portion of
that footprint, swapping it out over the duration of the CI test isn't as
expensive as it sounds (assuming it doesn't need to be swapped
 back in and the kernel has selected good candidates to swap out)

 o The test envs that host the undercloud and overcloud nodes have 64G
 of RAM each, they each host 4 testenvs and each test env if running a
HA job can use up to 21G of RAM, so we have over-committed there;
this is only a problem if a test env host gets 4 HA jobs that are
started around the same time (and as a result each has 4 overcloud
nodes running at the same time). To allow this to happen without VMs
being killed by the OOM killer we've also enabled swap there. The majority of
the time this swap isn't in use; it's only needed if all 4 testenvs are being
 simultaneously used and they are all running the second half of a CI
 test at the same time.

o The overcloud nodes are VMs running with an "unsafe" disk caching
mechanism; this causes sync requests from the guest to be ignored and, as a
 result if the instances being hosted on these nodes are going into
 swap this swap will be cached on the host as long as RAM is available.
 i.e. swap being used in the undercloud or overcloud isn't being synced
 to the disk on the host unless it has to be.

 o What I'd like us to avoid is simply bumping up the memory every time
we hit an OOM error without at least
   1. Explaining why we need more memory all of a sudden
   2. Looking into a way we may be able to avoid simply bumping the RAM
 (at peak times we are memory constrained)

As an example, let's take a look at the swap usage on the undercloud of
a recent CI nonha job[1][2]. These instances have 5G of RAM with 2G of
swap enabled via a swapfile;
the overcloud deploy started at 22:07:46 and finished at 22:28:06.

 In the graph you'll see a spike in memory being swapped out around
 22:09, this corresponds almost exactly to when the overcloud image is
being downloaded from swift[3]; looking at the top output at the end of
 the test you'll see that swift-proxy is using over 500M of Mem[4].

 I'd much prefer we spend time looking into why the swift proxy is
using this much memory rather than blindly bumping the memory allocated
 to the VM, perhaps we have something configured incorrectly or we've
 hit a bug in swift.

Having said all that, we can bump the memory allocated to each node, but
we have to accept one of two possible consequences:
1. We'll end up using the swap on the testenv hosts more than we
currently are, or
 2. We'll have to reduce the number of test envs per host from 4 down
 to 3, wiping 25% of our capacity
>>>
>>> Thinking about this a little more, we could do a radical experiment
>>> for a week and just do this, i.e. bump up the RAM on each env and
>>> accept we lose 25% of our capacity. Maybe it doesn't matter; if our
>>> success rate goes up then we'd be running fewer rechecks anyway.
>>> The downside is that we'd probably hit fewer timing errors (assuming
>>> the tight resources are what's exposing them); I say downside because
>>> this just means downstream users might hit them more often if CI
>>> isn't. Anyway, maybe worth discussing at tomorrow's meeting.
>>
>> +1 to reducing the number of testenvs and allocating more memory to
>> each.  The

Re: [openstack-dev] [tripleo] CI jobs failures

2016-03-08 Thread Derek Higgins
On 7 March 2016 at 18:22, Ben Nemec  wrote:
> On 03/07/2016 11:33 AM, Derek Higgins wrote:
>> On 7 March 2016 at 15:24, Derek Higgins  wrote:
>>> On 6 March 2016 at 16:58, James Slagle  wrote:
 On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi  wrote:
> I'm kind of hijacking Dan's e-mail but I would like to propose some
> technical improvements to stop having so many CI failures.
>
>
> 1/ Stop creating swap files. We don't have SSDs; it is IMHO a terrible
> mistake to swap on files because we don't have enough RAM. In my
> experience, swapping on non-SSD disks is even worse than not having
> enough RAM. We should stop doing that, I think.

 We have been relying on swap in tripleo-ci for a little while. While
 not ideal, it has been an effective way to at least be able to test
 what we've been testing given the amount of physical RAM that is
 available.
>>>
>>> Ok, so I have a few points here, in places where I'm making
>>> assumptions I'll try to point it out
>>>
>>> o Yes I agree using swap should be avoided if at all possible
>>>
>>> o We are currently looking into adding more RAM to our testenv hosts,
>>> at which point we can afford to be a little more liberal with memory
>>> and this problem should become less of an issue. Having said that:
>>>
>>> o Even though using swap is bad, if we have some processes with a
>>> large Mem footprint that don't require constant access to a portion of
>>> that footprint, swapping it out over the duration of the CI test isn't as
>>> expensive as it sounds (assuming it doesn't need to be swapped
>>> back in and the kernel has selected good candidates to swap out)
>>>
>>> o The test envs that host the undercloud and overcloud nodes have 64G
>>> of RAM each, they each host 4 testenvs and each test env if running a
>>> HA job can use up to 21G of RAM, so we have over-committed there;
>>> this is only a problem if a test env host gets 4 HA jobs that are
>>> started around the same time (and as a result each has 4 overcloud
>>> nodes running at the same time). To allow this to happen without VMs
>>> being killed by the OOM killer we've also enabled swap there. The majority of
>>> the time this swap isn't in use; it's only needed if all 4 testenvs are being
>>> simultaneously used and they are all running the second half of a CI
>>> test at the same time.
>>>
>>> o The overcloud nodes are VMs running with an "unsafe" disk caching
>>> mechanism; this causes sync requests from the guest to be ignored and, as a
>>> result if the instances being hosted on these nodes are going into
>>> swap this swap will be cached on the host as long as RAM is available.
>>> i.e. swap being used in the undercloud or overcloud isn't being synced
>>> to the disk on the host unless it has to be.
>>>
>>> o What I'd like us to avoid is simply bumping up the memory every time
>>> we hit an OOM error without at least
>>>   1. Explaining why we need more memory all of a sudden
>>>   2. Looking into a way we may be able to avoid simply bumping the RAM
>>> (at peak times we are memory constrained)
>>>
>>> As an example, let's take a look at the swap usage on the undercloud of
>>> a recent CI nonha job[1][2]. These instances have 5G of RAM with 2G of
>>> swap enabled via a swapfile;
>>> the overcloud deploy started at 22:07:46 and finished at 22:28:06.
>>>
>>> In the graph you'll see a spike in memory being swapped out around
>>> 22:09, this corresponds almost exactly to when the overcloud image is
>>> being downloaded from swift[3]; looking at the top output at the end of
>>> the test you'll see that swift-proxy is using over 500M of Mem[4].
>>>
>>> I'd much prefer we spend time looking into why the swift proxy is
>>> using this much memory rather than blindly bumping the memory allocated
>>> to the VM, perhaps we have something configured incorrectly or we've
>>> hit a bug in swift.
>>>
>>> Having said all that, we can bump the memory allocated to each node, but
>>> we have to accept one of two possible consequences:
>>> 1. We'll end up using the swap on the testenv hosts more than we
>>> currently are, or
>>> 2. We'll have to reduce the number of test envs per host from 4 down
>>> to 3, wiping 25% of our capacity
>>
>> Thinking about this a little more, we could do a radical experiment
>> for a week and just do this, i.e. bump up the RAM on each env and
>> accept we lose 25% of our capacity. Maybe it doesn't matter; if our
>> success rate goes up then we'd be running fewer rechecks anyway.
>> The downside is that we'd probably hit fewer timing errors (assuming
>> the tight resources are what's exposing them); I say downside because
>> this just means downstream users might hit them more often if CI
>> isn't. Anyway, maybe worth discussing at tomorrow's meeting.
>
> +1 to reducing the number of testenvs and allocating more memory to
> each.  The huge number of rechecks we're having to do is definitely
> contributing to our CI load in a big way, so if we could cut those down
> by 50% I bet it would offset the lost testenvs.

Re: [openstack-dev] [tripleo] CI jobs failures

2016-03-07 Thread Ben Nemec
On 03/07/2016 11:33 AM, Derek Higgins wrote:
> On 7 March 2016 at 15:24, Derek Higgins  wrote:
>> On 6 March 2016 at 16:58, James Slagle  wrote:
>>> On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi  wrote:
 I'm kind of hijacking Dan's e-mail but I would like to propose some
technical improvements to stop having so many CI failures.


1/ Stop creating swap files. We don't have SSDs; it is IMHO a terrible
mistake to swap on files because we don't have enough RAM. In my
experience, swapping on non-SSD disks is even worse than not having
enough RAM. We should stop doing that, I think.
>>>
>>> We have been relying on swap in tripleo-ci for a little while. While
>>> not ideal, it has been an effective way to at least be able to test
>>> what we've been testing given the amount of physical RAM that is
>>> available.
>>
>> Ok, so I have a few points here, in places where I'm making
>> assumptions I'll try to point it out
>>
>> o Yes I agree using swap should be avoided if at all possible
>>
>> o We are currently looking into adding more RAM to our testenv hosts,
>> at which point we can afford to be a little more liberal with memory
>> and this problem should become less of an issue. Having said that:
>>
>> o Even though using swap is bad, if we have some processes with a
>> large Mem footprint that don't require constant access to a portion of
>> that footprint, swapping it out over the duration of the CI test isn't as
>> expensive as it sounds (assuming it doesn't need to be swapped
>> back in and the kernel has selected good candidates to swap out)
>>
>> o The test envs that host the undercloud and overcloud nodes have 64G
>> of RAM each, they each host 4 testenvs and each test env if running a
>> HA job can use up to 21G of RAM, so we have over-committed there;
>> this is only a problem if a test env host gets 4 HA jobs that are
>> started around the same time (and as a result each has 4 overcloud
>> nodes running at the same time). To allow this to happen without VMs
>> being killed by the OOM killer we've also enabled swap there. The majority of
>> the time this swap isn't in use; it's only needed if all 4 testenvs are being
>> simultaneously used and they are all running the second half of a CI
>> test at the same time.
>>
>> o The overcloud nodes are VMs running with an "unsafe" disk caching
>> mechanism; this causes sync requests from the guest to be ignored and, as a
>> result if the instances being hosted on these nodes are going into
>> swap this swap will be cached on the host as long as RAM is available.
>> i.e. swap being used in the undercloud or overcloud isn't being synced
>> to the disk on the host unless it has to be.
>>
>> o What I'd like us to avoid is simply bumping up the memory every time
>> we hit an OOM error without at least
>>   1. Explaining why we need more memory all of a sudden
>>   2. Looking into a way we may be able to avoid simply bumping the RAM
>> (at peak times we are memory constrained)
>>
>> As an example, let's take a look at the swap usage on the undercloud of
>> a recent CI nonha job[1][2]. These instances have 5G of RAM with 2G of
>> swap enabled via a swapfile;
>> the overcloud deploy started at 22:07:46 and finished at 22:28:06.
>>
>> In the graph you'll see a spike in memory being swapped out around
>> 22:09, this corresponds almost exactly to when the overcloud image is
>> being downloaded from swift[3]; looking at the top output at the end of
>> the test you'll see that swift-proxy is using over 500M of Mem[4].
>>
>> I'd much prefer we spend time looking into why the swift proxy is
>> using this much memory rather than blindly bumping the memory allocated
>> to the VM, perhaps we have something configured incorrectly or we've
>> hit a bug in swift.
>>
>> Having said all that, we can bump the memory allocated to each node, but
>> we have to accept one of two possible consequences:
>> 1. We'll end up using the swap on the testenv hosts more than we
>> currently are, or
>> 2. We'll have to reduce the number of test envs per host from 4 down
>> to 3, wiping 25% of our capacity
> 
> Thinking about this a little more, we could do a radical experiment
> for a week and just do this, i.e. bump up the RAM on each env and
> accept we lose 25% of our capacity. Maybe it doesn't matter; if our
> success rate goes up then we'd be running fewer rechecks anyway.
> The downside is that we'd probably hit fewer timing errors (assuming
> the tight resources are what's exposing them); I say downside because
> this just means downstream users might hit them more often if CI
> isn't. Anyway, maybe worth discussing at tomorrow's meeting.

+1 to reducing the number of testenvs and allocating more memory to
each.  The huge number of rechecks we're having to do is definitely
contributing to our CI load in a big way, so if we could cut those down
by 50% I bet it would offset the lost testenvs.  And it would reduce
developer aggravation by about a million percent. :-)

Also, on some level I'm not too concerned about the absolute minimum
memory use case.

Re: [openstack-dev] [tripleo] CI jobs failures

2016-03-07 Thread Ben Nemec
On 03/07/2016 12:00 PM, Derek Higgins wrote:
> On 7 March 2016 at 12:11, John Trowbridge  wrote:
>>
>>
>> On 03/06/2016 11:58 AM, James Slagle wrote:
>>> On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi  wrote:
 I'm kind of hijacking Dan's e-mail but I would like to propose some
technical improvements to stop having so many CI failures.


1/ Stop creating swap files. We don't have SSDs; it is IMHO a terrible
mistake to swap on files because we don't have enough RAM. In my
experience, swapping on non-SSD disks is even worse than not having
enough RAM. We should stop doing that, I think.
>>>
>>> We have been relying on swap in tripleo-ci for a little while. While
>>> not ideal, it has been an effective way to at least be able to test
>>> what we've been testing given the amount of physical RAM that is
>>> available.
>>>
>>> The recent change to add swap to the overcloud nodes has proved to be
>>> unstable. But that has more to do with it being racey with the
>>> validation deployment afaict. There are some patches currently up to
>>> address those issues.
>>>


 2/ Split CI jobs in scenarios.

 Currently we have CI jobs for ceph, HA, non-ha, containers and the
current situation is that jobs fail randomly, due to performance issues.

 Puppet OpenStack CI had the same issue where we had one integration job
and we never stopped adding more services until everything became *very*
 unstable. We solved that issue by splitting the jobs and creating 
 scenarios:

 https://github.com/openstack/puppet-openstack-integration#description

What I propose is to split TripleO jobs into more jobs, but with fewer
services.

 The benefit of that:

* more service coverage
* jobs will run faster
* fewer random issues due to bad performance

 The cost is of course it will consume more resources.
 That's why I suggest 3/.

 We could have:

 * HA job with ceph and a full compute scenario (glance, nova, cinder,
 ceilometer, aodh & gnocchi).
 * Same with IPv6 & SSL.
 * HA job without ceph and full compute scenario too
 * HA job without ceph and basic compute (glance and nova), with extra
 services like Trove, Sahara, etc.
 * ...
 (note: all jobs would have network isolation, which is to me a
 requirement when testing an installer like TripleO).
>>>
>>> Each of those jobs would at least require as much memory as our
>>> current HA job. I don't see how this gets us to using less memory. The
>>> HA job we have now already deploys the minimal amount of services that
>>> is possible given our current architecture. Without the composable
service roles work, we can't deploy fewer services than we already are.
>>>
>>>
>>>

 3/ Drop non-ha job.
I'm not sure why we have it, and what the benefit of testing it is
compared to HA.
>>>
>>> In my opinion, I actually think that we could drop the ceph and non-ha
>>> job from the check-tripleo queue.
>>>
>>> non-ha doesn't test anything realistic, and it doesn't really provide
>>> any faster feedback on patches. It seems at most it might run 15-20
>>> minutes faster than the HA job on average. Sometimes it even runs
>>> slower than the HA job.
>>>
>>> The ceph job we could move to the experimental queue to run on demand
>>> on patches that might affect ceph, and it could also be a daily
>>> periodic job.
>>>
>>> The same could be done for the containers job, an IPv6 job, and an
>>> upgrades job. Ideally with a way to run an individual job as needed.
>>> Would we need different experimental queues to do that?
>>>
>>> That would leave only the HA job in the check queue, which we should
run with SSL and network isolation. We could deploy fewer testenvs
since we'd have fewer jobs running, but give the ones we do deploy more
>>> RAM. I think this would really alleviate a lot of the transient
>>> intermittent failures we get in CI currently. It would also likely run
>>> faster.
>>>
>>> It's probably worth seeking out some exact evidence from the RDO
>>> centos-ci, because I think they are testing with virtual environments
>>> that have a lot more RAM than tripleo-ci does. It'd be good to
>>> understand if they have some of the transient failures that tripleo-ci
>>> does as well.
>>>
>>
>> The HA job in RDO CI is also more unstable than nonHA, although this is
>> usually not due to memory contention. Most of the time that I see
>> the HA job fail spuriously in RDO CI, it is because of the Nova
>> scheduler race. I would bet that this race is the cause for the
>> fluctuating amount of time jobs take as well, because the recovery
>> mechanism for this is just to retry. Those retries can add 15 min. per
>> retry to the deploy. In RDO CI there is a 60min. timeout for deploy as
>> well. If we can't deploy to virtual machines in under an hour, to me
>> that is a bug. (Note, I am speaking of `openstack overcloud deploy` when
>> I say deploy, though start to finish can take less than an hour with
>> decent CPUs)

Re: [openstack-dev] [tripleo] CI jobs failures

2016-03-07 Thread Derek Higgins
On 7 March 2016 at 12:11, John Trowbridge  wrote:
>
>
> On 03/06/2016 11:58 AM, James Slagle wrote:
>> On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi  wrote:
>>> I'm kind of hijacking Dan's e-mail but I would like to propose some
>>> technical improvements to stop having so many CI failures.
>>>
>>>
>>> 1/ Stop creating swap files. We don't have SSDs; it is IMHO a terrible
>>> mistake to swap on files because we don't have enough RAM. In my
>>> experience, swapping on non-SSD disks is even worse than not having
>>> enough RAM. We should stop doing that, I think.
>>
>> We have been relying on swap in tripleo-ci for a little while. While
>> not ideal, it has been an effective way to at least be able to test
>> what we've been testing given the amount of physical RAM that is
>> available.
>>
>> The recent change to add swap to the overcloud nodes has proved to be
>> unstable. But that has more to do with it being racey with the
>> validation deployment afaict. There are some patches currently up to
>> address those issues.
>>
>>>
>>>
>>> 2/ Split CI jobs in scenarios.
>>>
>>> Currently we have CI jobs for ceph, HA, non-ha, containers and the
>>> current situation is that jobs fail randomly, due to performance issues.
>>>
>>> Puppet OpenStack CI had the same issue where we had one integration job
>>> and we never stopped adding more services until everything became *very*
>>> unstable. We solved that issue by splitting the jobs and creating scenarios:
>>>
>>> https://github.com/openstack/puppet-openstack-integration#description
>>>
>>> What I propose is to split TripleO jobs into more jobs, but with fewer
>>> services.
>>>
>>> The benefit of that:
>>>
>>> * more service coverage
>>> * jobs will run faster
>>> * fewer random issues due to bad performance
>>>
>>> The cost is of course it will consume more resources.
>>> That's why I suggest 3/.
>>>
>>> We could have:
>>>
>>> * HA job with ceph and a full compute scenario (glance, nova, cinder,
>>> ceilometer, aodh & gnocchi).
>>> * Same with IPv6 & SSL.
>>> * HA job without ceph and full compute scenario too
>>> * HA job without ceph and basic compute (glance and nova), with extra
>>> services like Trove, Sahara, etc.
>>> * ...
>>> (note: all jobs would have network isolation, which is to me a
>>> requirement when testing an installer like TripleO).
>>
>> Each of those jobs would at least require as much memory as our
>> current HA job. I don't see how this gets us to using less memory. The
>> HA job we have now already deploys the minimal amount of services that
>> is possible given our current architecture. Without the composable
>> service roles work, we can't deploy fewer services than we already are.
>>
>>
>>
>>>
>>> 3/ Drop non-ha job.
>>> I'm not sure why we have it, and what the benefit of testing it is
>>> compared to HA.
>>
>> In my opinion, I actually think that we could drop the ceph and non-ha
>> job from the check-tripleo queue.
>>
>> non-ha doesn't test anything realistic, and it doesn't really provide
>> any faster feedback on patches. It seems at most it might run 15-20
>> minutes faster than the HA job on average. Sometimes it even runs
>> slower than the HA job.
>>
>> The ceph job we could move to the experimental queue to run on demand
>> on patches that might affect ceph, and it could also be a daily
>> periodic job.
>>
>> The same could be done for the containers job, an IPv6 job, and an
>> upgrades job. Ideally with a way to run an individual job as needed.
>> Would we need different experimental queues to do that?
>>
>> That would leave only the HA job in the check queue, which we should
>> run with SSL and network isolation. We could deploy fewer testenvs
>> since we'd have fewer jobs running, but give the ones we do deploy more
>> RAM. I think this would really alleviate a lot of the transient
>> intermittent failures we get in CI currently. It would also likely run
>> faster.
>>
>> It's probably worth seeking out some exact evidence from the RDO
>> centos-ci, because I think they are testing with virtual environments
>> that have a lot more RAM than tripleo-ci does. It'd be good to
>> understand if they have some of the transient failures that tripleo-ci
>> does as well.
>>
>
> The HA job in RDO CI is also more unstable than nonHA, although this is
> usually not due to memory contention. Most of the time that I see
> the HA job fail spuriously in RDO CI, it is because of the Nova
> scheduler race. I would bet that this race is the cause for the
> fluctuating amount of time jobs take as well, because the recovery
> mechanism for this is just to retry. Those retries can add 15 min. per
> retry to the deploy. In RDO CI there is a 60min. timeout for deploy as
> well. If we can't deploy to virtual machines in under an hour, to me
> that is a bug. (Note, I am speaking of `openstack overcloud deploy` when
> I say deploy, though start to finish can take less than an hour with
> decent CPUs)
>
> RDO CI uses the following layout:
> Undercloud: 12G RAM, 4 CPUs

Re: [openstack-dev] [tripleo] CI jobs failures

2016-03-07 Thread Derek Higgins
On 7 March 2016 at 15:24, Derek Higgins  wrote:
> On 6 March 2016 at 16:58, James Slagle  wrote:
>> On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi  wrote:
>>> I'm kind of hijacking Dan's e-mail but I would like to propose some
>>> technical improvements to stop having so many CI failures.
>>>
>>>
>>> 1/ Stop creating swap files. We don't have SSDs; it is IMHO a terrible
>>> mistake to swap on files because we don't have enough RAM. In my
>>> experience, swapping on non-SSD disks is even worse than not having
>>> enough RAM. We should stop doing that, I think.
>>
>> We have been relying on swap in tripleo-ci for a little while. While
>> not ideal, it has been an effective way to at least be able to test
>> what we've been testing given the amount of physical RAM that is
>> available.
>
> Ok, so I have a few points here, in places where I'm making
> assumptions I'll try to point it out
>
> o Yes I agree using swap should be avoided if at all possible
>
> o We are currently looking into adding more RAM to our testenv hosts,
> at which point we can afford to be a little more liberal with memory
> and this problem should become less of an issue. Having said that:
>
> o Even though using swap is bad, if we have some processes with a
> large Mem footprint that don't require constant access to a portion of
> that footprint, swapping it out over the duration of the CI test isn't as
> expensive as it sounds (assuming it doesn't need to be swapped
> back in and the kernel has selected good candidates to swap out)
>
> o The test envs that host the undercloud and overcloud nodes have 64G
> of RAM each, they each host 4 testenvs and each test env if running a
> HA job can use up to 21G of RAM, so we have over-committed there;
> this is only a problem if a test env host gets 4 HA jobs that are
> started around the same time (and as a result each has 4 overcloud
> nodes running at the same time). To allow this to happen without VMs
> being killed by the OOM killer we've also enabled swap there. The majority of
> the time this swap isn't in use; it's only needed if all 4 testenvs are being
> simultaneously used and they are all running the second half of a CI
> test at the same time.
>
> o The overcloud nodes are VMs running with an "unsafe" disk caching
> mechanism; this causes sync requests from the guest to be ignored and, as a
> result if the instances being hosted on these nodes are going into
> swap this swap will be cached on the host as long as RAM is available.
> i.e. swap being used in the undercloud or overcloud isn't being synced
> to the disk on the host unless it has to be.
>
> o What I'd like us to avoid is simply bumping up the memory every time
> we hit an OOM error without at least
>   1. Explaining why we need more memory all of a sudden
>   2. Looking into a way we may be able to avoid simply bumping the RAM
> (at peak times we are memory constrained)
>
> As an example, let's take a look at the swap usage on the undercloud of
> a recent CI nonha job[1][2]. These instances have 5G of RAM with 2G of
> swap enabled via a swapfile;
> the overcloud deploy started at 22:07:46 and finished at 22:28:06.
>
> In the graph you'll see a spike in memory being swapped out around
> 22:09, this corresponds almost exactly to when the overcloud image is
> being downloaded from swift[3]; looking at the top output at the end of
> the test you'll see that swift-proxy is using over 500M of Mem[4].
>
> I'd much prefer we spend time looking into why the swift proxy is
> using this much memory rather than blindly bumping the memory allocated
> to the VM, perhaps we have something configured incorrectly or we've
> hit a bug in swift.
>
> Having said all that, we can bump the memory allocated to each node, but
> we have to accept one of two possible consequences:
> 1. We'll end up using the swap on the testenv hosts more than we
> currently are, or
> 2. We'll have to reduce the number of test envs per host from 4 down
> to 3, wiping 25% of our capacity

Thinking about this a little more, we could do a radical experiment
for a week and just do this, i.e. bump up the RAM on each env and
accept we lose 25% of our capacity. Maybe it doesn't matter; if our
success rate goes up then we'd be running fewer rechecks anyway.
The downside is that we'd probably hit fewer timing errors (assuming
the tight resources are what's exposing them); I say downside because
this just means downstream users might hit them more often if CI
isn't. Anyway, maybe worth discussing at tomorrow's meeting.


>
> [1] - 
> http://logs.openstack.org/85/289085/2/check-tripleo/gate-tripleo-ci-f22-nonha/6fda33c/
> [2] - http://goodsquishy.com/downloads/20160307/swap.png
> [3] - 22:09:03 21678 INFO [-] Master cache miss for image
> b6a96213-7955-4c4d-829e-871350939e03, starting download
>   22:09:41 21678 DEBUG [-] Running cmd (subprocess): qemu-img info
> /var/lib/ironic/master_images/tmpvjAlCU/b6a96213-7955-4c4d-829e-871350939e03.part
> [4] - 17690 swift 20   0  804824 547724   1780 S   0.0 10.8
> 0:04.82 swift-prox+

Re: [openstack-dev] [tripleo] CI jobs failures

2016-03-07 Thread Derek Higgins
On 6 March 2016 at 16:58, James Slagle  wrote:
> On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi  wrote:
>> I'm kind of hijacking Dan's e-mail but I would like to propose some
>> technical improvements to stop having so many CI failures.
>>
>>
>> 1/ Stop creating swap files. We don't have SSDs; it is IMHO a terrible
>> mistake to swap on files because we don't have enough RAM. In my
>> experience, swapping on non-SSD disks is even worse than not having
>> enough RAM. We should stop doing that, I think.
>
> We have been relying on swap in tripleo-ci for a little while. While
> not ideal, it has been an effective way to at least be able to test
> what we've been testing given the amount of physical RAM that is
> available.

Ok, so I have a few points here, in places where I'm making
assumptions I'll try to point it out

o Yes I agree using swap should be avoided if at all possible

o We are currently looking into adding more RAM to our testenv hosts,
at which point we can afford to be a little more liberal with memory
and this problem should become less of an issue. Having said that:

o Even though using swap is bad, if we have some processes with a
large Mem footprint that don't require constant access to a portion of
that footprint, swapping it out over the duration of the CI test isn't as
expensive as it sounds (assuming it doesn't need to be swapped
back in and the kernel has selected good candidates to swap out)

o The test envs that host the undercloud and overcloud nodes have 64G
of RAM each, they each host 4 testenvs and each test env if running a
HA job can use up to 21G of RAM, so we have over-committed there;
this is only a problem if a test env host gets 4 HA jobs that are
started around the same time (and as a result each has 4 overcloud
nodes running at the same time). To allow this to happen without VMs
being killed by the OOM killer we've also enabled swap there. The majority of
the time this swap isn't in use; it's only needed if all 4 testenvs are being
simultaneously used and they are all running the second half of a CI
test at the same time.

o The overcloud nodes are VMs running with an "unsafe" disk caching
mechanism; this causes sync requests from the guest to be ignored and, as a
result if the instances being hosted on these nodes are going into
swap this swap will be cached on the host as long as RAM is available.
i.e. swap being used in the undercloud or overcloud isn't being synced
to the disk on the host unless it has to be.
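
(For reference, the cache mode in use can be confirmed from the testenv
host with something like the sketch below; the domain name shown is
hypothetical.)

  # inspect the disk cache mode of one of the hosted VMs (sketch)
  virsh dumpxml overcloud-node-0 | grep -A1 '<disk'
  # an "unsafe"-cached disk shows a driver line along the lines of:
  #   <driver name='qemu' type='qcow2' cache='unsafe'/>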

o What I'd like us to avoid is simply bumping up the memory every time
we hit an OOM error without at least
  1. Explaining why we need more memory all of a sudden
  2. Looking into a way we may be able to avoid simply bumping the RAM
(at peak times we are memory constrained)
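
(A quick way to confirm whether the OOM killer actually fired, rather
than guessing from job symptoms, is something like:)

  # kernel log entries from the OOM killer, if any (sketch)
  dmesg -T | grep -iE 'out of memory|oom-killer'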

As an example, let's take a look at the swap usage on the undercloud of
a recent CI nonha job[1][2]. These instances have 5G of RAM with 2G of
swap enabled via a swapfile;
the overcloud deploy started at 22:07:46 and finished at 22:28:06.
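
(For reference, a 2G swapfile of the sort used here is typically set up
along these lines; the exact tripleo-ci commands and paths may differ.)

  # create and enable a 2G swapfile (sketch)
  sudo dd if=/dev/zero of=/swapfile bs=1M count=2048
  sudo chmod 600 /swapfile
  sudo mkswap /swapfile
  sudo swapon /swapfile
  swapon --show   # confirm it is active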

In the graph you'll see a spike in memory being swapped out around
22:09, this corresponds almost exactly to when the overcloud image is
being downloaded from swift[3]; looking at the top output at the end of
the test you'll see that swift-proxy is using over 500M of Mem[4].
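
(The same data can be pulled ad hoc on the undercloud while a deploy is
running, e.g.:)

  # swap-in/swap-out rates over time (si/so columns)
  vmstat 5
  # biggest swap consumers, straight from /proc
  grep VmSwap /proc/[0-9]*/status | sort -t: -k3 -nr | head
  # resident memory per process, e.g. to spot swift-proxy at ~500M
  ps aux --sort=-rss | head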

I'd much prefer we spend time looking into why the swift proxy is
using this much memory rather than blindly bumping the memory allocated
to the VM, perhaps we have something configured incorrectly or we've
hit a bug in swift.
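
(If it is a configuration issue, the first things worth checking on the
undercloud are probably the proxy worker count and per-worker memory; a
sketch, the exact config layout may differ:)

  # how many proxy workers are configured
  grep -E '^\s*workers' /etc/swift/proxy-server.conf
  # memory used by each swift-proxy worker process
  ps aux | grep '[s]wift-proxy'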

Having said all that, we can bump the memory allocated to each node, but
we have to accept one of two possible consequences:
1. We'll end up using the swap on the testenv hosts more than we
currently are, or
2. We'll have to reduce the number of test envs per host from 4 down
to 3, wiping 25% of our capacity

[1] - 
http://logs.openstack.org/85/289085/2/check-tripleo/gate-tripleo-ci-f22-nonha/6fda33c/
[2] - http://goodsquishy.com/downloads/20160307/swap.png
[3] - 22:09:03 21678 INFO [-] Master cache miss for image
b6a96213-7955-4c4d-829e-871350939e03, starting download
  22:09:41 21678 DEBUG [-] Running cmd (subprocess): qemu-img info
/var/lib/ironic/master_images/tmpvjAlCU/b6a96213-7955-4c4d-829e-871350939e03.part
[4] - 17690 swift 20   0  804824 547724   1780 S   0.0 10.8
0:04.82 swift-prox+


>
> The recent change to add swap to the overcloud nodes has proved to be
> unstable. But that has more to do with it being racey with the
> validation deployment afaict. There are some patches currently up to
> address those issues.
>
>>
>>
>> 2/ Split CI jobs in scenarios.
>>
>> Currently we have CI jobs for ceph, HA, non-ha, containers and the
>> current situation is that jobs fail randomly, due to performance issues.

We don't know that it's due to performance issues. You're probably correct that
we wouldn't see them if we were allocating more resources to the CI
tests, but this just means we have timing issues that are more
prevalent when resource constrained. I think the answer here is for
some

Re: [openstack-dev] [tripleo] CI jobs failures

2016-03-07 Thread Dan Prince
On Sat, 2016-03-05 at 11:15 -0500, Emilien Macchi wrote:
> I'm kind of hijacking Dan's e-mail but I would like to propose some
> technical improvements to stop having so many CI failures.
> 
> 
> 1/ Stop creating swap files. We don't have SSDs; it is IMHO a
> terrible
> mistake to swap on files because we don't have enough RAM. In my
> experience, swapping on non-SSD disks is even worse than not having
> enough RAM. We should stop doing that, I think.
> 
> 
> 2/ Split CI jobs in scenarios.
> 
> Currently we have CI jobs for ceph, HA, non-ha, containers and the
> current situation is that jobs fail randomly, due to performance
> issues.
> 
> Puppet OpenStack CI had the same issue where we had one integration
> job
> and we never stopped adding more services until everything became *very*
> unstable. We solved that issue by splitting the jobs and creating
> scenarios:
> 
> https://github.com/openstack/puppet-openstack-integration#description
> 
> What I propose is to split TripleO jobs into more jobs, but with fewer
> services.
> 
> The benefit of that:
> 
> * more service coverage
> * jobs will run faster
> * fewer random issues due to bad performance
> 
> The cost is of course it will consume more resources.
> That's why I suggest 3/.
> 
> We could have:
> 
> * HA job with ceph and a full compute scenario (glance, nova, cinder,
> ceilometer, aodh & gnocchi).
> * Same with IPv6 & SSL.
> * HA job without ceph and full compute scenario too
> * HA job without ceph and basic compute (glance and nova), with extra
> services like Trove, Sahara, etc.
> * ...
> (note: all jobs would have network isolation, which is to me a
> requirement when testing an installer like TripleO).

I'm not sure we have enough resources to entertain this option. I would
like to see us split the jobs up but not in exactly the way you
describe above. I would rather see us put the effort into architecture
changes like "split stack" which cloud allow us to test the
configuration side of our Heat stack on normal Cloud instances. Once we
have this in place I think we would have more potential resources and
could entertain running more jobs and thus could split things out to
run in parallel if we choose to do so.

> 
> 3/ Drop non-ha job.
> I'm not sure why we have it, and what the benefit of testing it is
> compared
> to HA.

There are a couple of reasons we have the nonha job, I think. First is that not
everyone wants to use HA. We run our own TripleO CI cloud without HA at
this point and I think there is interest in maintaining this as a less
complex installation alternative where HA isn't needed.

Second is the need to support functional testing of TripleO where developers
don't have enough resources for 3 controller nodes. At the very least
we'd need a second single node HA job (which wouldn't really be doing
HA) but would allow us to continue supporting the compressed
installation for developer testing, etc.

Dan

> 
> 
> Any comment / feedback is welcome,
> _
> _
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubs
> cribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo] CI jobs failures

2016-03-07 Thread John Trowbridge


On 03/06/2016 11:58 AM, James Slagle wrote:
> On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi  wrote:
>> I'm kind of hijacking Dan's e-mail but I would like to propose some
>> technical improvements to stop having so many CI failures.
>>
>>
>> 1/ Stop creating swap files. We don't have SSDs; it is IMHO a terrible
>> mistake to swap on files because we don't have enough RAM. In my
>> experience, swapping on non-SSD disks is even worse than not having
>> enough RAM. We should stop doing that, I think.
> 
> We have been relying on swap in tripleo-ci for a little while. While
> not ideal, it has been an effective way to at least be able to test
> what we've been testing given the amount of physical RAM that is
> available.
> 
> The recent change to add swap to the overcloud nodes has proved to be
> unstable. But that has more to do with it being racey with the
> validation deployment afaict. There are some patches currently up to
> address those issues.
> 
>>
>>
>> 2/ Split CI jobs in scenarios.
>>
>> Currently we have CI jobs for ceph, HA, non-ha, containers and the
>> current situation is that jobs fail randomly, due to performance issues.
>>
>> Puppet OpenStack CI had the same issue where we had one integration job
>> and we never stopped adding more services until everything became *very*
>> unstable. We solved that issue by splitting the jobs and creating scenarios:
>>
>> https://github.com/openstack/puppet-openstack-integration#description
>>
>> What I propose is to split TripleO jobs into more jobs, but with fewer
>> services.
>>
>> The benefit of that:
>>
>> * more service coverage
>> * jobs will run faster
>> * fewer random issues due to bad performance
>>
>> The cost is of course it will consume more resources.
>> That's why I suggest 3/.
>>
>> We could have:
>>
>> * HA job with ceph and a full compute scenario (glance, nova, cinder,
>> ceilometer, aodh & gnocchi).
>> * Same with IPv6 & SSL.
>> * HA job without ceph and full compute scenario too
>> * HA job without ceph and basic compute (glance and nova), with extra
>> services like Trove, Sahara, etc.
>> * ...
>> (note: all jobs would have network isolation, which is to me a
>> requirement when testing an installer like TripleO).
> 
> Each of those jobs would at least require as much memory as our
> current HA job. I don't see how this gets us to using less memory. The
> HA job we have now already deploys the minimal set of services
> possible given our current architecture. Without the composable
> service roles work, we can't deploy fewer services than we already do.
> 
> 
> 
>>
>> 3/ Drop non-ha job.
>> I'm not sure why we have it, and what the benefit of testing it is
>> compared to HA.
> 
> In my opinion, I actually think that we could drop the ceph and non-ha
> job from the check-tripleo queue.
> 
> non-ha doesn't test anything realistic, and it doesn't really provide
> any faster feedback on patches. It seems at most it might run 15-20
> minutes faster than the HA job on average. Sometimes it even runs
> slower than the HA job.
> 
> The ceph job we could move to the experimental queue to run on demand
> on patches that might affect ceph, and it could also be a daily
> periodic job.
> 
> The same could be done for the containers job, an IPv6 job, and an
> upgrades job. Ideally with a way to run an individual job as needed.
> Would we need different experimental queues to do that?
> 
> That would leave only the HA job in the check queue, which we should
> run with SSL and network isolation. We could deploy fewer testenvs
> since we'd have fewer jobs running, but give the ones we do deploy more
> RAM. I think this would really alleviate a lot of the transient
> intermittent failures we get in CI currently. It would also likely run
> faster.
> 
> It's probably worth seeking out some exact evidence from the RDO
> centos-ci, because I think they are testing with virtual environments
> that have a lot more RAM than tripleo-ci does. It'd be good to
> understand if they have some of the transient failures that tripleo-ci
> does as well.
> 

The HA job in RDO CI is also more unstable than the non-HA job, although
this is usually not due to memory contention. Most of the time that I see
the HA job fail spuriously in RDO CI, it is because of the Nova
scheduler race. I would bet that this race is also the cause of the
fluctuating job times, because the recovery mechanism is simply to
retry, and each retry can add 15 minutes to the deploy. In RDO CI there
is also a 60-minute timeout for the deploy. If we can't deploy to
virtual machines in under an hour, to me that is a bug. (Note, I am
speaking of `openstack overcloud deploy` when I say deploy, though start
to finish can take less than an hour with decent CPUs.)
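
Concretely, that 60-minute bound can be thought of as a wrapper like the
following (a minimal sketch; the real RDO CI tooling is more involved,
and the environment file is a placeholder):

    # Bound `openstack overcloud deploy` at 60 minutes so a deploy stuck
    # in scheduler retries fails fast; the environment file is a
    # placeholder, not the real CI arguments.
    source ~/stackrc
    if ! timeout 60m openstack overcloud deploy --templates \
            -e network-isolation.yaml; then
        echo "overcloud deploy exceeded the 60 minute budget" >&2
        exit 1
    fi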

RDO CI uses the following layout:
Undercloud: 12G RAM, 4 CPUs
3x Control Nodes: 4G RAM, 1 CPU
Compute Node: 4G RAM, 1 CPU
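
For illustration only, that layout corresponds to VM sizing along these
lines; RDO CI builds the environment with its own tooling, so the
virt-install options below are assumptions, not the actual setup:

    # Illustrative sizing only; names, disks, and boot options are
    # assumptions.
    virt-install --name undercloud --ram 12288 --vcpus 4 \
        --disk size=40 --pxe --noautoconsole
    for node in overcloud-ctrl-{0,1,2} overcloud-compute-0; do
        virt-install --name "$node" --ram 4096 --vcpus 1 \
            --disk size=40 --pxe --noautoconsole
    done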

Is there any ability in our current CI setup to auto-identify the cause
of a failure? The nova schedu
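
As a rough idea of what such auto-identification could look like, a
post-run check could grep the collected logs for known signatures (a
sketch; the log location and the signature list are assumptions, not
existing CI code):

    # Classify a failed run by known error signatures in collected logs;
    # the log directory and patterns here are assumptions.
    LOGDIR=${1:-logs}
    if grep -rq "No valid host was found" "$LOGDIR"; then
        echo "likely cause: nova scheduler race (no valid host)"
    elif grep -rq "Out of memory: Kill process" "$LOGDIR"; then
        echo "likely cause: OOM killer on a node"
    else
        echo "failure not classified"
    fi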

Re: [openstack-dev] [tripleo] CI jobs failures

2016-03-07 Thread Dmitry Tantsur

On 03/06/2016 05:58 PM, James Slagle wrote:

On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi  wrote:

I'm kind of hijacking Dan's e-mail but I would like to propose some
technical improvements to stop having so many CI failures.


1/ Stop creating swap files. We don't have SSDs, so it is IMHO a terrible
mistake to swap on files when we don't have enough RAM. In my
experience, swapping on non-SSD disks is even worse than not having
enough RAM. We should stop doing that, I think.


We have been relying on swap in tripleo-ci for a little while. While
not ideal, it has been an effective way to at least be able to test
what we've been testing given the amount of physical RAM that is
available.

The recent change to add swap to the overcloud nodes has proved to be
unstable. But that has more to do with it being racy with the
validation deployment afaict. There are some patches currently up to
address those issues.




2/ Split CI jobs in scenarios.

Currently we have CI jobs for ceph, HA, non-ha, containers, and the
current situation is that jobs fail randomly, due to performance issues.

Puppet OpenStack CI had the same issue: we had one integration job
and we never stopped adding more services until everything became *very*
unstable. We solved that issue by splitting the jobs and creating scenarios:

https://github.com/openstack/puppet-openstack-integration#description

What I propose is to split TripleO jobs in more jobs, but with less
services.

The benefit of that:

* more services coverage
* jobs will run faster
* less random issues due to bad performances

The cost is of course it will consume more resources.
That's why I suggest 3/.

We could have:

* HA job with ceph and a full compute scenario (glance, nova, cinder,
ceilometer, aodh & gnocchi).
* Same with IPv6 & SSL.
* HA job without ceph and full compute scenario too
* HA job without ceph and basic compute (glance and nova), with extra
services like Trove, Sahara, etc.
* ...
(note: all jobs would have network isolation, which is to me a
requirement when testing an installer like TripleO).


Each of those jobs would at least require as much memory as our
current HA job. I don't see how this gets us to using less memory. The
HA job we have now already deploys the minimal set of services
possible given our current architecture. Without the composable
service roles work, we can't deploy fewer services than we already do.





3/ Drop non-ha job.
I'm not sure why we have it, and what the benefit of testing it is
compared to HA.


In my opinion, I actually think that we could drop the ceph and non-ha
job from the check-tripleo queue.

non-ha doesn't test anything realistic, and it doesn't really provide
any faster feedback on patches. It seems at most it might run 15-20
minutes faster than the HA job on average. Sometimes it even runs
slower than the HA job.


The non-HA job is the only job that runs introspection, so you'd have to
enable introspection in the HA job, which would bump its run time.
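
For reference, the introspection step that would have to move into the
HA job is roughly the following (a sketch; command names follow the
client of this era and may differ on other releases):

    # Roughly the introspection step the HA job would gain; command
    # names follow the client of this era, and instackenv.json is the
    # usual node inventory.
    source ~/stackrc
    openstack baremetal import --json instackenv.json
    openstack baremetal introspection bulk start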




The ceph job we could move to the experimental queue to run on demand
on patches that might affect ceph, and it could also be a daily
periodic job.

The same could be done for the containers job, an IPv6 job, and an
upgrades job. Ideally with a way to run an individual job as needed.
Would we need different experimental queues to do that?

That would leave only the HA job in the check queue, which we should
run with SSL and network isolation. We could deploy fewer testenvs
since we'd have fewer jobs running, but give the ones we do deploy more
RAM. I think this would really alleviate a lot of the transient
intermittent failures we get in CI currently. It would also likely run
faster.

It's probably worth seeking out some exact evidence from the RDO
centos-ci, because I think they are testing with virtual environments
that have a lot more RAM than tripleo-ci does. It'd be good to
understand if they have some of the transient failures that tripleo-ci
does as well.

We really are deploying on the absolute minimum CPU/RAM that is even
possible. I think it's unrealistic to expect a lot of
stability in that scenario. And I think that's a big reason why we get
so many transient failures.

In summary: give the testenvs more RAM, have one job in the
check-tripleo queue, as many jobs as needed in the experimental queue,
and as many periodic jobs as necessary.





Any comment / feedback is welcome,
--
Emilien Macchi



Re: [openstack-dev] [tripleo] CI jobs failures

2016-03-06 Thread James Slagle
On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi  wrote:
> I'm kind of hijacking Dan's e-mail but I would like to propose some
> technical improvements to stop having so many CI failures.
>
>
> 1/ Stop creating swap files. We don't have SSDs, so it is IMHO a terrible
> mistake to swap on files when we don't have enough RAM. In my
> experience, swapping on non-SSD disks is even worse than not having
> enough RAM. We should stop doing that, I think.

We have been relying on swap in tripleo-ci for a little while. While
not ideal, it has been an effective way to at least be able to test
what we've been testing given the amount of physical RAM that is
available.

The recent change to add swap to the overcloud nodes has proved to be
unstable. But that has more to do with it being racy with the
validation deployment afaict. There are some patches currently up to
address those issues.
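
For context, the swap being relied on is just a swap file created on the
nodes, roughly like this (a sketch; the actual tripleo-ci element
differs in details such as size and path):

    # Rough sketch of the swap file setup tripleo-ci relies on; the size
    # and path are assumptions, the real setup lives in a CI element.
    sudo dd if=/dev/zero of=/swapfile bs=1M count=4096
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile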

>
>
> 2/ Split CI jobs in scenarios.
>
> Currently we have CI jobs for ceph, HA, non-ha, containers, and the
> current situation is that jobs fail randomly, due to performance issues.
>
> Puppet OpenStack CI had the same issue: we had one integration job
> and we never stopped adding more services until everything became *very*
> unstable. We solved that issue by splitting the jobs and creating scenarios:
>
> https://github.com/openstack/puppet-openstack-integration#description
>
> What I propose is to split TripleO jobs in more jobs, but with less
> services.
>
> The benefit of that:
>
> * more services coverage
> * jobs will run faster
> * less random issues due to bad performances
>
> The cost is of course it will consume more resources.
> That's why I suggest 3/.
>
> We could have:
>
> * HA job with ceph and a full compute scenario (glance, nova, cinder,
> ceilometer, aodh & gnocchi).
> * Same with IPv6 & SSL.
> * HA job without ceph and full compute scenario too
> * HA job without ceph and basic compute (glance and nova), with extra
> services like Trove, Sahara, etc.
> * ...
> (note: all jobs would have network isolation, which is to me a
> requirement when testing an installer like TripleO).

Each of those jobs would at least require as much memory as our
current HA job. I don't see how this gets us to using less memory. The
HA job we have now already deploys the minimal set of services
possible given our current architecture. Without the composable
service roles work, we can't deploy fewer services than we already do.



>
> 3/ Drop non-ha job.
> I'm not sure why we have it, and what the benefit of testing it is
> compared to HA.

In my opinion, I actually think that we could drop the ceph and non-ha
job from the check-tripleo queue.

non-ha doesn't test anything realistic, and it doesn't really provide
any faster feedback on patches. It seems at most it might run 15-20
minutes faster than the HA job on average. Sometimes it even runs
slower than the HA job.

The ceph job we could move to the experimental queue to run on demand
on patches that might affect ceph, and it could also be a daily
periodic job.

The same could be done for the containers job, an IPv6 job, and an
upgrades job. Ideally with a way to run an individual job as needed.
Would we need different experimental queues to do that?

That would leave only the HA job in the check queue, which we should
run with SSL and network isolation. We could deploy fewer testenvs
since we'd have fewer jobs running, but give the ones we do deploy more
RAM. I think this would really alleviate a lot of the transient
intermittent failures we get in CI currently. It would also likely run
faster.

It's probably worth seeking out some exact evidence from the RDO
centos-ci, because I think they are testing with virtual environments
that have a lot more RAM than tripleo-ci does. It'd be good to
understand if they have some of the transient failures that tripleo-ci
does as well.

We really are deploying on the absolute minimum CPU/RAM that is even
possible. I think it's unrealistic to expect a lot of
stability in that scenario. And I think that's a big reason why we get
so many transient failures.

In summary: give the testenvs more RAM, have one job in the
check-tripleo queue, as many jobs as needed in the experimental queue,
and as many periodic jobs as necessary.
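
Jobs in the experimental queue can still be run on demand per patch by
leaving a "check experimental" comment on the review, for example (a
sketch; the Gerrit username and the change,patchset number below are
placeholders):

    # Trigger the experimental queue on one patch via Gerrit's SSH
    # interface; GERRIT_USER and 12345,6 are placeholders.
    ssh -p 29418 "$GERRIT_USER"@review.openstack.org \
        gerrit review -m '"check experimental"' 12345,6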


>
>
> Any comment / feedback is welcome,
> --
> Emilien Macchi
>
>



-- 
-- James Slagle
--
