On 03/08/2016 11:58 AM, Derek Higgins wrote:
> On 7 March 2016 at 18:22, Ben Nemec <openst...@nemebean.com> wrote:
>> On 03/07/2016 11:33 AM, Derek Higgins wrote:
>>> On 7 March 2016 at 15:24, Derek Higgins <der...@redhat.com> wrote:
>>>> On 6 March 2016 at 16:58, James Slagle <james.sla...@gmail.com> wrote:
>>>>> On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi <emil...@redhat.com> wrote:
>>>>>> I'm kind of hijacking Dan's e-mail, but I would like to propose some
>>>>>> technical improvements to stop having so many CI failures.
>>>>>>
>>>>>> 1/ Stop creating swap files. We don't have SSDs, and swapping to
>>>>>> files because we don't have enough RAM is IMHO a terrible mistake.
>>>>>> In my experience, swapping on non-SSD disks is even worse than not
>>>>>> having enough RAM. We should stop doing that, I think.
>>>>>
>>>>> We have been relying on swap in tripleo-ci for a little while. While
>>>>> not ideal, it has been an effective way to at least be able to test
>>>>> what we've been testing, given the amount of physical RAM that is
>>>>> available.
>>>>
>>>> Ok, so I have a few points here; in places where I'm making
>>>> assumptions I'll try to point it out.
>>>>
>>>> o Yes, I agree using swap should be avoided if at all possible.
>>>>
>>>> o We are currently looking into adding more RAM to our testenv hosts,
>>>> at which point we can afford to be a little more liberal with memory
>>>> and this problem should become less of an issue. Having said that:
>>>>
>>>> o Even though using swap is bad, if we have some processes with a
>>>> large memory footprint that don't require constant access to a
>>>> portion of that footprint, swapping it out over the duration of the
>>>> CI test isn't as expensive as it might sound (assuming it doesn't
>>>> need to be swapped back in, and the kernel has selected good
>>>> candidates to swap out); a quick way to check this is sketched below.
>>>>
>>>> o The testenv hosts that run the undercloud and overcloud nodes have
>>>> 64G of RAM each. Each hosts 4 testenvs, and each testenv, when
>>>> running an HA job, can use up to 21G of RAM, so we have overcommitted
>>>> there (4 x 21G = 84G against 64G of physical RAM). This is only a
>>>> problem if a testenv host gets 4 HA jobs that are started around the
>>>> same time (and as a result each has 4 overcloud nodes running at the
>>>> same time); to allow this to happen without VMs being killed by the
>>>> OOM killer we've also enabled swap there. The majority of the time
>>>> this swap isn't in use; it's only needed if all 4 testenvs are in use
>>>> simultaneously and all running the second half of a CI test at the
>>>> same time.
>>>>
>>>> o The overcloud nodes are VMs running with an "unsafe" disk caching
>>>> mode. This causes sync requests from the guest to be ignored, and as
>>>> a result, if the instances hosted on these nodes are going into swap,
>>>> that swap will be cached on the host for as long as RAM is available;
>>>> i.e. swap used in the undercloud or overcloud isn't synced to the
>>>> disk on the host unless it has to be.
>>>>
>>>> o What I'd like us to avoid is simply bumping up the memory every
>>>> time we hit an OOM error without at least
>>>> 1. Explaining why we need more memory all of a sudden
>>>> 2. Looking into a way we may be able to avoid simply bumping the RAM
>>>> (at peak times we are memory constrained)
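As an aside, answering that kind of "why" question usually starts with
watching whether swapped-out pages ever come back in. A minimal
stand-alone sketch, assuming a Linux /proc filesystem (illustrative
only, not part of tripleo-ci): sample the kernel's cumulative
pswpin/pswpout counters while a job runs. Heavy swap-out with
near-zero swap-in means the kernel picked good (idle) candidates, as
described above; sustained swap-in means the job is genuinely
thrashing.

import time

def swap_counters():
    # pswpin/pswpout are cumulative counts of pages swapped in/out.
    counters = {}
    with open('/proc/vmstat') as f:
        for line in f:
            name, value = line.split()
            if name in ('pswpin', 'pswpout'):
                counters[name] = int(value)
    return counters

def watch_swap(interval=10, samples=6):
    prev = swap_counters()
    for _ in range(samples):
        time.sleep(interval)
        cur = swap_counters()
        print('pages out: %6d  pages back in: %6d' % (
            cur['pswpout'] - prev['pswpout'],
            cur['pswpin'] - prev['pswpin']))
        prev = cur

if __name__ == '__main__':
    watch_swap()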
>>>> As an example, let's take a look at the swap usage on the undercloud
>>>> of a recent CI nonha job[1][2]. These instances have 5G of RAM with
>>>> 2G of swap enabled via a swapfile. The overcloud deploy started
>>>> @22:07:46 and finished at @22:28:06.
>>>>
>>>> In the graph you'll see a spike in memory being swapped out around
>>>> 22:09; this corresponds almost exactly to when the overcloud image
>>>> is being downloaded from swift[3]. Looking at the top output at the
>>>> end of the test, you'll see that swift-proxy is using over 500M of
>>>> memory[4].
>>>>
>>>> I'd much prefer we spend time looking into why the swift proxy is
>>>> using this much memory rather than blindly bumping the memory
>>>> allocated to the VM; perhaps we have something configured
>>>> incorrectly or we've hit a bug in swift.
>>>>
>>>> Having said all that, we can bump the memory allocated to each node,
>>>> but we have to accept one of two possible consequences:
>>>> 1. We'll end up using the swap on the testenv hosts more than we
>>>> currently are, or
>>>> 2. We'll have to reduce the number of testenvs per host from 4 down
>>>> to 3, wiping out 25% of our capacity.
>>>
>>> Thinking about this a little more, we could do a radical experiment
>>> for a week and just do this, i.e. bump up the RAM on each env and
>>> accept we lose 25% of our capacity. Maybe it doesn't matter; if our
>>> success rate goes up then we'd be running fewer rechecks anyway.
>>> The downside is that we'd probably hit fewer timing errors (assuming
>>> the tight resources are what's exposing them). I say downside because
>>> this just means downstream users might hit them more often if CI
>>> isn't. Anyway, maybe worth discussing at tomorrow's meeting.
>>
>> +1 to reducing the number of testenvs and allocating more memory to
>> each. The huge number of rechecks we're having to do is definitely
>> contributing to our CI load in a big way, so if we could cut those
>> down by 50% I bet it would offset the lost testenvs. And it would
>> reduce developer aggravation by about a million percent. :-)
>>
>> Also, on some level I'm not too concerned about the absolute minimum
>> memory use case. Nobody deploying OpenStack in the real world is
>> doing so on 4 GB nodes. I doubt 99% of them are doing so on less than
>> 32 GB nodes. Until we have composable services, I don't know that we
>> can support the 4 GB use case anymore. We've just added too many
>> services to the overcloud.
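For memory hogs like the swift-proxy case above, a quick per-process
resident-set snapshot (essentially what the quoted top output[4]
shows) can be grabbed with nothing but the stdlib. A rough sketch,
again assuming a Linux /proc filesystem; the helper name top_rss is
made up for illustration:

import os

def top_rss(count=10):
    """Print the `count` largest processes by resident set size."""
    procs = []
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            with open('/proc/%s/status' % pid) as f:
                fields = dict(line.split(':', 1) for line in f if ':' in line)
            # VmRSS is the resident set size in kB; kernel threads lack it.
            procs.append((int(fields['VmRSS'].split()[0]),
                          fields['Name'].strip(), pid))
        except (EnvironmentError, KeyError, ValueError):
            continue  # process exited mid-scan, or no VmRSS field
    for rss_kb, name, pid in sorted(procs, reverse=True)[:count]:
        print('%8d kB  %-20s (pid %s)' % (rss_kb, name, pid))

if __name__ == '__main__':
    top_rss()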
> We discussed this at today's meeting but never really came to a
> conclusion, except to say most people wanted to try it. The main
> objection brought up was that we shouldn't go dropping the nonha job;
> that isn't what I was proposing, so let me rephrase here and see if
> we can gather +/-1's.
>
> I'm proposing we redeploy our testenvs with more RAM allocated per
> env; specifically we would go from
> 5G undercloud and 4G overcloud nodes to
> 6G undercloud and 5G overcloud nodes.
>
> In addition, to accommodate this we would reduce the number of envs
> available from 48 (the actual number varies from time to time) to 36
> (3 envs per host).
>
> No changes would be happening to the jobs we actually run.
>
> The assumption is that with the increased resources we would hit
> fewer false-negative test results and as a result recheck jobs less
> often (so the 25% reduction in capacity wouldn't hit us as hard as it
> might seem). We also may not be able to easily undo this if it
> doesn't work out, as once we start merging things that use the extra
> RAM it will be hard to go back.
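For reference, the arithmetic behind the two layouts, as a small
sketch (the helper is hypothetical; it assumes an HA testenv is 1
undercloud plus 4 overcloud VMs, which matches the 21G peak figure
quoted earlier in the thread):

def host_budget(envs, undercloud_gb, overcloud_gb,
                overcloud_nodes=4, host_ram_gb=64):
    # Worst case: every env on the host peaks at the same time.
    per_env = undercloud_gb + overcloud_nodes * overcloud_gb
    print('%dG/env x %d envs = %dG worst case vs %dG physical'
          % (per_env, envs, per_env * envs, host_ram_gb))

host_budget(4, 5, 4)  # current:  21G/env x 4 envs = 84G vs 64G physical
host_budget(3, 6, 5)  # proposed: 26G/env x 3 envs = 78G vs 64G physical

Note that even the proposed layout is still overcommitted if all three
envs peak at once, so swap on the testenv hosts remains the safety
net, just with more headroom than today.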
I think the problem is we already merged things that use the extra
RAM, but the RAM isn't actually there. :-)  So +1 from me.