On 9 March 2016 at 07:08, Richard Su <r...@redhat.com> wrote:
>
>
> On 03/08/2016 09:58 AM, Derek Higgins wrote:
>>
>> On 7 March 2016 at 18:22, Ben Nemec <openst...@nemebean.com> wrote:
>>>
>>> On 03/07/2016 11:33 AM, Derek Higgins wrote:
>>>>
>>>> On 7 March 2016 at 15:24, Derek Higgins <der...@redhat.com> wrote:
>>>>>
>>>>> On 6 March 2016 at 16:58, James Slagle <james.sla...@gmail.com> wrote:
>>>>>>
>>>>>> On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi <emil...@redhat.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> I'm kind of hijacking Dan's e-mail, but I would like to propose some
>>>>>>> technical improvements to stop having so many CI failures.
>>>>>>>
>>>>>>>
>>>>>>> 1/ Stop creating swap files. We don't have SSDs; IMHO it is a
>>>>>>> terrible mistake to swap to files because we don't have enough RAM.
>>>>>>> In my experience, swapping on non-SSD disks is even worse than not
>>>>>>> having enough RAM. We should stop doing that, I think.
>>>>>>
>>>>>> We have been relying on swap in tripleo-ci for a little while. While
>>>>>> not ideal, it has been an effective way to at least be able to test
>>>>>> what we've been testing given the amount of physical RAM that is
>>>>>> available.
>>>>>
>>>>> Ok, so I have a few points here; in places where I'm making
>>>>> assumptions I'll try to point that out.
>>>>>
>>>>> o Yes I agree using swap should be avoided if at all possible
>>>>>
>>>>> o We are currently looking into adding more RAM to our testenv hosts,
>>>>> at which point we can afford to be a little more liberal with memory
>>>>> and this problem should become less of an issue. Having said that:
>>>>>
>>>>> o Even though using swap is bad, if we have some processes with a
>>>>> large memory footprint that don't require constant access to a portion
>>>>> of that footprint, swapping it out for the duration of the CI test
>>>>> isn't as expensive as it sounds (assuming it doesn't need to be swapped
>>>>> back in and the kernel has selected good candidates to swap out)
>>>>>
>>>>> o The testenv hosts that run the undercloud and overcloud nodes have
>>>>> 64G of RAM each; they each host 4 testenvs, and each testenv running
>>>>> an HA job can use up to 21G of RAM, so we have overcommitted there (a
>>>>> quick arithmetic sketch follows these points). This is only a problem
>>>>> if a testenv host gets 4 HA jobs that are started around the same time
>>>>> (and as a result each has 4 overcloud nodes running at the same time);
>>>>> to allow this to happen without VMs being killed by the OOM killer
>>>>> we've also enabled swap there. The majority of the time this swap isn't
>>>>> in use; it only comes into play if all 4 testenvs are in use
>>>>> simultaneously and they are all running the second half of a CI test
>>>>> at the same time.
>>>>>
>>>>> o The overcloud nodes are VMs running with an "unsafe" disk caching
>>>>> mechanism; this causes sync requests from the guest to be ignored, and
>>>>> as a result, if the instances hosted on these nodes are going into
>>>>> swap, that swap will be cached on the host as long as RAM is available,
>>>>> i.e. swap being used in the undercloud or overcloud isn't being synced
>>>>> to the disk on the host unless it has to be.
>>>>>
>>>>> o What I'd like us to avoid is simply bumping up the memory every time
>>>>> we hit an OOM error without at least
>>>>>    1. Explaining why we need more memory all of a sudden
>>>>>    2. Looking into a way we may be able to avoid simply bumping the RAM
>>>>> (at peak times we are memory constrained)
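(To put numbers on the overcommit mentioned in the points above, here is
the back-of-the-envelope arithmetic using the figures quoted in this
thread; it is nothing more than a sketch.)

    # Worst-case memory demand on a testenv host (figures from this thread).
    HOST_RAM_GB = 64      # physical RAM per testenv host
    ENVS_PER_HOST = 4     # testenvs hosted on each machine
    HA_JOB_GB = 21        # i.e. a 5G undercloud plus 4 x 4G overcloud nodes

    worst_case = ENVS_PER_HOST * HA_JOB_GB   # 84G if all 4 HA jobs peak at once
    print('worst case: %dG needed vs %dG physical -> %dG has to come from swap'
          % (worst_case, HOST_RAM_GB, worst_case - HOST_RAM_GB))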
>>>>>
>>>>> As an example, let's take a look at the swap usage on the undercloud
>>>>> of a recent CI nonha job[1][2]. These instances have 5G of RAM with 2G
>>>>> of swap enabled via a swapfile.
>>>>> The overcloud deploy started at 22:07:46 and finished at 22:28:06.
>>>>>
>>>>> In the graph you'll see a spike in memory being swapped out around
>>>>> 22:09; this corresponds almost exactly to when the overcloud image is
>>>>> being downloaded from swift[3]. Looking at the top output at the end of
>>>>> the test you'll see that swift-proxy is using over 500M of memory[4].
>>>>>
>>>>> I'd much prefer we spend time looking into why the swift proxy is
>>>>> using this much memory rather than blindly bumping the memory allocated
>>>>> to the VM; perhaps we have something configured incorrectly or we've
>>>>> hit a bug in swift.
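(As a starting point for that investigation, something along these lines
could be run periodically on the undercloud to watch the swift-proxy
footprint over the course of a job. This is only a rough sketch, not
something that exists in tripleo-ci today; the process-name match and the
output format are purely illustrative.)

    #!/usr/bin/env python
    # Rough sketch (not part of tripleo-ci): report RSS and swap usage for
    # processes whose command line matches a pattern, e.g. swift-proxy.
    import os
    import sys

    def memory_for(pattern):
        for pid in filter(str.isdigit, os.listdir('/proc')):
            try:
                with open('/proc/%s/cmdline' % pid) as f:
                    cmdline = f.read().replace('\0', ' ').strip()
                if pattern not in cmdline:
                    continue
                fields = {}
                with open('/proc/%s/status' % pid) as f:
                    for line in f:
                        key, _, value = line.partition(':')
                        fields[key] = value.strip()
                yield pid, cmdline, fields.get('VmRSS', '?'), fields.get('VmSwap', '?')
            except (IOError, OSError):
                continue  # the process exited while we were reading /proc

    if __name__ == '__main__':
        pattern = sys.argv[1] if len(sys.argv) > 1 else 'swift-proxy'
        for pid, cmd, rss, swap in memory_for(pattern):
            print('%s rss=%s swap=%s %s' % (pid, rss, swap, cmd[:60]))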
>>>>>
>>>>> Having said all that, we can bump the memory allocated to each node,
>>>>> but we have to accept 1 of 2 possible consequences:
>>>>> 1. We'll end up using the swap on the testenv hosts more than we
>>>>> currently are, or
>>>>> 2. We'll have to reduce the number of testenvs per host from 4 down
>>>>> to 3, wiping out 25% of our capacity
>>>>
>>>> Thinking about this a little more, we could do a radical experiment
>>>> for a week and just do this, i.e. bump up the RAM on each env and
>>>> accept that we lose 25% of our capacity. Maybe it doesn't matter; if
>>>> our success rate goes up then we'd be running fewer rechecks anyway.
>>>> The downside is that we'd probably hit fewer timing errors (assuming
>>>> the tight resources are what's exposing them). I say downside because
>>>> this just means downstream users might hit them more often if CI
>>>> isn't. Anyway, maybe worth discussing at tomorrow's meeting.
>>>
>>> +1 to reducing the number of testenvs and allocating more memory to
>>> each.  The huge number of rechecks we're having to do is definitely
>>> contributing to our CI load in a big way, so if we could cut those down
>>> by 50% I bet it would offset the lost testenvs.  And it would reduce
>>> developer aggravation by about a million percent. :-)
>>>
>>> Also, on some level I'm not too concerned about the absolute minimum
>>> memory use case.  Nobody deploying OpenStack in the real world is doing
>>> so on 4 GB nodes.  I doubt 99% of them are doing so on less than 32 GB
>>> nodes.  Until we have composable services, I don't know that we can
>>> support the 4 GB use case anymore.  We've just added too many services
>>> to the overcloud.
>>
>> We discussed this at today's meeting but never really came to a
>> conclusion, except to say most people wanted to try it. The main
>> objection brought up was that we shouldn't go dropping the nonha job;
>> that isn't what I was proposing, so let me rephrase here and see if we
>> can gather +/-1's.
>>
>> I'm proposing we redeploy our testenvs with more RAM allocated per
>> env; specifically, we would go from
>> 5G undercloud and 4G overcloud nodes to
>> 6G undercloud and 5G overcloud nodes.
>>
>> In addition, to accommodate this we would reduce the number of envs
>> available from 48 (the actual number varies from time to time) to 36
>> (3 envs per host).
>>
>> No changes would be made to the jobs we actually run.
>>
>> The assumption is that with the increased resources we would hit fewer
>> false negative test results and as a result recheck jobs less (so the
>> 25% reduction in capacity wouldn't hit us as hard as it might seem).
>> We also may not be able to easily undo this if it doesn't work out, as
>> once we start merging things that use the extra RAM it will be hard to
>> go back.
>
> +1, so many false negatives today. I think it would also be useful to add a
> memory usage/free report at the end of the CI run. This can be used to gauge
> how new changes are affecting consumption.

We have top reporting memory usage on the undercloud and overcloud
nodes in the /var/log/host_info.txt file for each host, so the
information is at least available (although you have to go looking for
it). dstat is also run on the undercloud nodes, giving more detailed
information over the duration of the test.
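If we wanted the at-a-glance report Richard describes, something along
these lines could be run at the end of the job and appended to the logs.
This is only a rough sketch of the idea, not something that exists in
tripleo-ci today:

    #!/usr/bin/env python
    # Rough sketch of an end-of-run memory summary: read /proc/meminfo and
    # print the totals that matter for spotting swap usage during a job.
    def meminfo():
        info = {}
        with open('/proc/meminfo') as f:
            for line in f:
                key, _, rest = line.partition(':')
                info[key] = int(rest.split()[0])  # values are reported in kB
        return info

    if __name__ == '__main__':
        m = meminfo()
        for key in ('MemTotal', 'MemFree', 'SwapTotal', 'SwapFree'):
            print('%-9s %6d MB' % (key, m.get(key, 0) // 1024))
        swap_used = (m.get('SwapTotal', 0) - m.get('SwapFree', 0)) // 1024
        print('%-9s %6d MB' % ('SwapUsed', swap_used))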

>>>
>>> That said though, keeping service memory usage under control is still
>>> valuable and we should figure out why Swift is using so much memory when
>>> it's not under much load at all.  That's actually the undercloud, so
>>> it's sort of tangential to this discussion.
>>
>> <snip/>
>>
