We recently went through a very difficult situation (both technically and
politically) as a result of poor infrastructure design and implementation
by our data services provider. On Saturday, a network issue caused our
entire environment to be offline and we are still dealing with straggler
issues while our customers verify their applications are online and
functioning correctly. All told, we were offline for well over a day and a
half.

Not only should this never have happened, but it was the third or fourth time
this year. After the last outage the provider was given a strict mandate
from our contracting agency to ensure it never happened again.

The impact, while obvious on the surface, goes even deeper. We have over
1400 servers, a vast majority of which are Red Hat. All but about 50 of
them are VMs. The VMs are stored and run from backend storage. The backend
storage is connected to the compute nodes via the aforementioned, poorly
designed and implemented infrastructure. When the network goes out, the
compute loses its storage and the servers are left in a very precarious
state.

We end up having to run reports against all of our VMs to determine which
have been affected and left in a read-only state. This is simple enough. My
colleague has written a script which is executed remotely against all of
our systems and provides this feedback.
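
A check of this kind can be sketched in a few lines of Python. This is an
illustration only, not the script mentioned above: the function names and the
list of filesystem types are assumptions. It reads /proc/mounts, which
reflects the kernel's current view, and reports disk-backed filesystems whose
mount options include "ro".

```python
"""Report disk-backed filesystems that are currently mounted read-only."""

# Filesystem types worth reporting; this set is an assumption for illustration.
DISK_FS_TYPES = {"ext2", "ext3", "ext4", "xfs"}


def read_only_mounts(mounts_text, fs_types=DISK_FS_TYPES):
    """Return (device, mount_point) pairs whose mount options include 'ro'."""
    affected = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Each line holds: device, mount point, type, options, dump, pass.
        if len(fields) < 4:
            continue
        device, mount_point, fs_type, options = fields[:4]
        # Split the options so "ro" is matched exactly, not inside "errors=remount-ro".
        if fs_type in fs_types and "ro" in options.split(","):
            affected.append((device, mount_point))
    return affected


def check_host(mounts_path="/proc/mounts"):
    """Print affected mounts on this host; return 1 if any were found, else 0."""
    with open(mounts_path) as handle:
        affected = read_only_mounts(handle.read())
    for device, mount_point in affected:
        print(f"{device} {mount_point}")
    return 1 if affected else 0
```

Using the value returned by check_host() as the exit status would let whatever
runs the check remotely sort hosts into affected and unaffected without
parsing any output.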

We are then left with the worst part of it. We must log in to each
system via the console, reboot to our provisioning network and load up the
rescue environment to perform manual filesystem checks and repairs. Doing
this for 1400 servers is, needless to say, a chore when there isn't a more
robust solution in place.

What makes this worse is the fact that we don't have access to vCenter,
vSphere, or any of the infrastructure/storage/etc.

I'm at a loss as to how to make recovery from such an outage more
expeditious so I'm hoping someone here can provide some guidance.

Has anyone else dealt with a similar situation, or does anyone at least have
insight into steps we can take and tools we can implement to make our lives
easier?

-Mathew

"When you do things right, people won't be sure you've done anything at
all." - God; Futurama

"We'll get along much better once you accept that you're wrong and neither
am I." - Me
_______________________________________________
Tech mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
 http://lopsa.org/
