TL;DR: tripleo CI is back up and running, see below for more

On 21/03/15 01:41, Dan Prince wrote:
Short version:

The RH1 CI region has been down since yesterday afternoon.

We have a misbehaving switch and have filed a support ticket with the
vendor to troubleshoot things further. We hope to know more this
weekend, or Monday at the latest.

Long version:

Yesterday afternoon we started seeing issues in scheduling jobs on the
RH1 CI cloud. We haven't made any OpenStack configuration changes
recently, and things have been quite stable for some time now (our
uptime was 365 days on the controller).

Initially we found a misconfigured Keystone URL which was preventing
some diagnostic queries via OS clients external to the rack. This
setting hadn't been changed recently, however, and didn't seem to bother
nodepool before, so I don't think it is the cause of the outage...

MySQL also got a bounce. It seemed happy enough after a restart as well.

After fixing the keystone setting and bouncing MySQL, instances appeared
to go ACTIVE, but we were still having connectivity issues getting
floating IPs and DHCP working on overcloud instances. After a good bit
of debugging we started looking at the switches. It turns out one of them
has high CPU usage (above the warning threshold) and MAC addresses are
also unstable (they are moving between ports).

Until this is resolved RH1 is unavailable to host CI jobs. Will
post back here with an update once we have more information.

RH1 has been running as expected since last Thursday afternoon, which means the cloud was down for almost a week. I'm left not entirely sure what some of the problems were; at various times during the week we tried a number of different interventions which may have caused (or exposed) some of our problems, e.g.

At one stage we restarted openvswitch in an attempt to ensure nothing had gone wrong with our OVS tunnels. Around the same time (and possibly caused by the restart), we started getting progressively worse connections to some of our servers, with lots of entries like this on our bastion server:

    Mar 20 13:22:49 host01-rack01 kernel: bond0.5: received packet with own address as source address
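
For reference, the restart and the check for the looping-packet symptom were roughly along these lines (the service name and log location are assumptions for our Fedora-based hosts):

    # restart openvswitch on a host (the step that may have triggered the loop)
    sudo systemctl restart openvswitch

    # look for the looping-packet symptom in the kernel log on the bastion
    sudo grep 'received packet with own address as source' /var/log/messages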

Not linking the restart with the looping-packets message, and instead thinking we may have had a problem with the switch, we put in a call with our switch vendor.

Continuing to chase down a problem on our own servers, we noticed that tcpdump was at times reporting about 100,000 ARP packets per second (sometimes more).
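
Roughly how we measured that, for anyone curious (the interface name is an assumption; divide the count by the capture window to get packets per second):

    # count ARP packets seen on the bond interface over a 10 second window
    sudo timeout 10 tcpdump -i bond0.5 -nn arp 2>/dev/null | wc -l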

Various interventions stopped the excess broadcast traffic, e.g.

o Shutting down most of the compute nodes stopped the excess traffic, but the problem wasn't linked to any one particular compute node
o Running the tripleo os-refresh-config script on each compute node also stopped the excess traffic (rough sketch below)

But restarting the controller node caused the excess traffic to return
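
The os-refresh-config intervention amounted to something like the following; COMPUTE_IPS and the ssh user are placeholders:

    # re-run the tripleo os-refresh-config scripts on each compute node
    for host in $COMPUTE_IPS; do
        ssh heat-admin@$host sudo os-refresh-config
    done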

Eventually we got the cloud running without the flood of broadcast traffic, with a small number of compute nodes, but instances still weren't getting IP addresses. With nova and neutron in debug mode we saw an error where nova was failing to mount the qcow image (IIRC it was attempting to resize the image).
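
Switching nova and neutron into debug mode was done roughly like this (crudini is just one way to flip the option, and the service names shown are for a compute node; adjust to taste):

    # enable debug logging for nova and neutron, then restart the services
    sudo crudini --set /etc/nova/nova.conf DEFAULT debug True
    sudo crudini --set /etc/neutron/neutron.conf DEFAULT debug True
    sudo systemctl restart openstack-nova-compute neutron-openvswitch-agent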

Unable to figure out why this had worked in the past but now didn't, we redeployed this single compute node using the original image that was used (over a year ago). Instances on this compute node were booting but failing to get an IP address; we noticed this was because of a difference between the time on the controller and the time on the compute node. After resetting the time, instances were booting and networking was working as expected (this was now Wednesday evening).
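
The time check and reset were roughly the following (the ssh user, compute address and NTP server are placeholders):

    # compare the clock on the controller with a compute node
    date -u; ssh heat-admin@$COMPUTE_IP date -u

    # one-shot resync of the compute node's clock
    ssh heat-admin@$COMPUTE_IP sudo ntpdate pool.ntp.org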

Looking back at the error while mounting the qcow image, I believe this was a red herring; it looks like this problem was always present on our system, but we didn't have scary-looking tracebacks in the logs until we switched to debug mode.

Now pretty confident that we could get back to a running system by starting up all the compute nodes again, ensuring the os-refresh-config scripts were run, and then ensuring the time was set properly on each host, we decided to remove any entropy that may have built up while debugging problems on each compute node, so we redeployed all of our compute nodes from scratch. This all went as expected but was a little time consuming, as we spent time verifying each step as we went along. The steps went something like this (a rough command sketch follows the list):

o With the exception of the overcloud controller, "nova delete" all of the hosts on the undercloud (31 hosts)

o We now have a problem: in tripleo the controller and compute nodes are tied together in a single heat template, so we need the heat template that was used a year ago to deploy the whole overcloud, along with the parameters that were passed into it. We had actually done this before when adding new compute nodes to the cloud, so it wasn't new territory.

o Use "heat template-show ci-overcloud" to get the original heat template (a json version) that was used, and remove anything referring to the controller.

o Edit a version of devtest overcloud to use this template and to skip various other steps.

o Our overcloud had 3 VM instances used by CI that now needed to be replaced; each needed specific IP addresses and to be on both the default and test networks
   o squid - nova boot the image we had for this
   o bandersnatch - nova boot the image we had for this
   o te-broker - we didn't have an image for this, so we booted a vanilla Fedora image and manually installed geard

o testenv hosts - we also redeployed the test env hosts

o Set the times on all of the hosts so that they are in sync
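
A rough sketch of the commands behind those steps; the instance names, image names, network IDs and addresses below are all placeholders:

    # delete everything on the undercloud except the overcloud controller
    nova delete overcloud-novacompute0-xyz    # repeated for each of the 31 hosts

    # recover the heat template originally used to deploy the overcloud,
    # then strip out anything referring to the controller by hand
    heat template-show ci-overcloud > ci-overcloud-template.json

    # re-create one of the CI support VMs with fixed IPs on both networks
    nova boot --image squid-image --flavor m1.medium \
        --nic net-id=$DEFAULT_NET_ID,v4-fixed-ip=10.0.0.5 \
        --nic net-id=$TEST_NET_ID,v4-fixed-ip=192.168.1.5 \
        squid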

It's Thursday lunchtime and our cloud is now back up and running. At this point we removed the iptables rule preventing nodepool from talking to our cloud.

We still had problems: regressions had been committed to various non-tripleo repositories, causing our tests to fail in various ways (4 regressions in total).
1. Fedora instances started by nodepool were failing to boot. We eventually tracked this down to an update to the scripts nodepool uses to build this image; this update was only needed for a specific cloud, so our fix here was to only run it on that cloud [1]
2. A neutron regression [2]
3. A keystone regression [3]
4. A horizon regression [4]


[1] https://review.openstack.org/#/c/168196/
[2] https://bugs.launchpad.net/tripleo/+bug/1437116
[3] https://bugs.launchpad.net/tripleo/+bug/1437032
[4] https://bugs.launchpad.net/horizon/+bug/1438133


With these four problems now fixed, reverted or temporarily worked around, tripleo CI is back running and jobs are passing.

I'm pretty confident we'll never be sure what the initial problem was, but here is what I believe:

At the start of all this our cloud had only a single symptomatic problem, caused by time drifting on the compute nodes. This was causing a high percentage of our instances to fail to boot (depending on when various agents last reported back to the controller). This problem was probably getting progressively worse over the last while, but nodepool was handling it by deleting instances and starting new ones until it got an instance that worked (infra may be able to verify this, and whether the situation has improved).

The change to the nodepool script which broke our Fedora image caused us to start looking at the cloud for problems. While debugging the cloud, our test instances were not getting IP addresses (due to time drift), which set us down the path of a real issue, but not the same issue that prompted us to start looking for problems in the first place.

All of the other problems were either caused by us trying to debug or fix the cloud or were red herrings.

This cloud was deployed using tripleo over a year ago; at the time NTP wasn't part of the default install, but it is now installed by default. At some stage in the future when we redeploy it using ironic (it's currently nova-bm), we should ensure NTP is used to avoid hitting this again. Also, when redeploying, we should take advantage of some of the monitoring that is now part of tripleo, or add some in places where it is lacking.
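
Until then, a simple periodic check that a time service is actually running on every node would catch this class of problem early; something like the following (the host list, ssh user and service name are assumptions):

    # confirm ntpd is active and the clock is synchronised on each node
    for host in $CONTROLLER_IP $COMPUTE_IPS; do
        echo "== $host =="
        ssh heat-admin@$host 'systemctl is-active ntpd && ntpstat'
    done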

The rh1 cloud has been up and running for just over a year without any major problems/outages (at least that we know of). Given that it's run by developers in the tripleo community, I think this is a reasonable time between outages, although I would hope we could debug and fix any future problems, if they arise, in a shorter time frame.

Feel free to ask questions if you want me to elaborate on anything; I hope I didn't ramble too much,

Derek.



Dan

