To follow up on this: we've continued to hit this issue on other compute nodes. Not surprising, of course; they've all been up for about the same period of time and have had largely even workloads.

It has caused problems, though, because it's cropping up faster than I can respond (it takes a few hours to cycle all the instances off a compute node, and I need to sleep sometime :-), so I've started pre-emptively rebooting compute nodes to get ahead of it. Hopefully I'll be able to get all of the potentially broken nodes at least disabled by the end of the day, so we'll have another three months before we have to worry about this again.
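
In case anyone else ends up having to do this, draining a node boils down to disabling its nova-compute service and then migrating (or recycling) whatever is still running there. A rough sketch with python-novaclient, where the host name, credentials, and auth URL are all placeholders:

  # Take a suspect compute node out of scheduling and drain it before
  # rebooting.  Host name, credentials, and auth URL are placeholders.
  from keystoneauth1 import loading, session
  from novaclient import client

  loader = loading.get_plugin_loader('password')
  auth = loader.load_from_options(auth_url='http://controller:5000/v3',
                                  username='admin', password='secret',
                                  project_name='admin',
                                  user_domain_name='Default',
                                  project_domain_name='Default')
  nova = client.Client('2.1', session=session.Session(auth=auth))

  host = 'compute-3.example.com'

  # Stop the scheduler from placing new instances on the node.
  nova.services.disable(host, 'nova-compute')

  # Live-migrate anything still running there; passing None for the
  # destination lets the scheduler pick one.  Assumes shared storage
  # (block_migration=False).
  for server in nova.servers.list(search_opts={'host': host,
                                               'all_tenants': 1}):
      nova.servers.live_migrate(server, None, False, False)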

On 03/24/2017 11:47 AM, Derek Higgins wrote:
On 22 March 2017 at 22:36, Ben Nemec <openst...@nemebean.com> wrote:
Hi all (owl?),

You may have missed it in all the CI excitement the past couple of days, but
we had a partial outage of rh1 last night.  It turns out the OVS port issue
Derek discussed in
http://lists.openstack.org/pipermail/openstack-dev/2016-December/109182.html
reared its ugly head on a few of our compute nodes, which caused them to be
unable to spawn new instances.  They kept getting scheduled since it looked
like they were underutilized, which caused most of our testenvs to fail.

I've rebooted the affected nodes, as well as a few more that looked like
they might run into the same problem in the near future.  Everything looks
to be working well again since sometime this morning (when I disabled the
broken compute nodes), but there aren't many jobs passing due to the
plethora of other issues we're hitting in CI.  There have been some stable
job passes, though, so I believe things are working again.

As far as preventing this in the future, the right thing to do would
probably be to move to a later release of OpenStack (either a point or a major release)
where hopefully this problem would be fixed.  However, I'm hesitant to do
that for a few reasons.  First is "the devil you know". Outside of this
issue, we've gotten rh1 pretty rock solid lately.  It's been overworked, but
has been cranking away for months with no major cloud-related outages.
Second is that an upgrade would be a major process, probably involving some
amount of downtime.  Since the long-term plan is to move everything to RDO
cloud I'm not sure that's the best use of our time at this point.

+1 on keeping the status quo until moving to rdo-cloud.


Instead, my plan for the near term is to keep a closer eye on the error
notifications from the services.  We previously didn't have anything
consuming those, but I've dropped a little tool on the controller that will
dump out error notifications so we can watch for signs of this happening
again.  I suspect the signs were there long before the actual breakage
happened, but nobody was looking for them.  Now I will be.
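
For illustration, a minimal version of that kind of watcher is
basically just an oslo.messaging notification listener with an error
endpoint, something like the sketch below.  The transport URL is a
placeholder, and it assumes the services are emitting notifications
over the message bus (messagingv2 notification driver):

  # Minimal sketch: listen on the 'notifications' topic and print
  # anything sent at error priority.  Transport URL is a placeholder.
  from oslo_config import cfg
  import oslo_messaging

  transport = oslo_messaging.get_notification_transport(
      cfg.CONF, url='rabbit://user:password@controller:5672/')


  class ErrorEndpoint(object):
      # Only an 'error' method is defined, so the listener only
      # consumes notifications.error and leaves info/warn traffic alone.
      def error(self, ctxt, publisher_id, event_type, payload, metadata):
          print('%s %s %s' % (publisher_id, event_type, payload))


  # A pool= argument can be added if something else (e.g. ceilometer)
  # also needs to consume these notifications.
  listener = oslo_messaging.get_notification_listener(
      transport,
      [oslo_messaging.Target(topic='notifications')],
      [ErrorEndpoint()])
  listener.start()
  listener.wait()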

So that's where things stand with rh1.  Any comments or concerns welcome.

Thanks.

-Ben

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
