On Wed, Aug 24, 2016 at 2:11 PM, James Slagle <james.sla...@gmail.com> wrote:
> The latest recurring problem that is failing a lot of the nonha ssl
> jobs in tripleo-ci is:
>
> https://bugs.launchpad.net/tripleo/+bug/1616144
> tripleo-ci: nonha jobs failing with Unable to establish connection to
> https://192.0.2.2:13004/v1/a90407df1e7f4f80a38a1b1671ced2ff/stacks/overcloud/f9f6f712-8e89-4ea9-a34b-6084dc74b5c1
>
> This error happens while polling for events from the overcloud stack
> by tripleoclient.
>
> I can reproduce this error very easily locally by deploying with an
> ssl undercloud with 6GB ram and 2 vcpus. If I don't enable swap,
> something gets OOM killed. If I do enable swap, swap gets used (< 1GB)
> and then I hit this error almost every time.
>
> The stack keeps deploying but the client has died, so the job fails.
> My investigation so far has only pointed out that it's the swap
> allocation that is delaying things enough to cause the failure.
>
> We do not see this error in the ha job even though it deploys more
> nodes. As of now, my only suspect is that it's the overhead of the
> initial SSL connections causing the error.
>
> If I test with 6GB ram and 4 vcpus I can't reproduce the error,
> although much more swap is used due to the increased number of default
> workers for each API service.
>
> However, I suggest we just raise the undercloud specs in our jobs to
> 8GB ram and 4 vcpus. These seem reasonable to me because those are the
> default specs used by infra in all of their devstack single and
> multinode jobs spawned on all their other cloud providers. Our own
> multinode job for the undercloud/overcloud and undercloud only job are
> running on instances of these sizes.
>
> Yes, this is just sidestepping the problem by throwing more resources
> at it. The reality is that we do not prioritize working on optimizing
> for speed/performance/resources. We prioritize feature work that
> indirectly (or maybe it's directly?) makes everything slower,
> especially at this point in the development cycle.
>
> We should therefore expect to have to continue to provide more and
> more resources to our CI jobs until we prioritize optimizing them to
> run with less.
>
> Let me know if there is any disagreement on making these changes. If
> there isn't, I'll apply them in the next day or so. If there are any
> other ideas on how to address this particular bug for some immediate
> short term relief, please let me know.

For the short term, +1 to extending the flavor and adding the required RAM.
For the long term, I'm working on extending our CI jobs to cover multiple
scenarios with fewer services installed on each. I hope that will help each
job consume fewer resources. Any help is welcome.

> --
> -- James Slagle
> --
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

--
Emilien Macchi