We recently went through a very difficult situation (both technically and politically) as a result of poor infrastructure design and implementation by our data services provider. On Saturday, a network issue caused our entire environment to be offline and we are still dealing with straggler issues while our customers verify their applications are online and functioning correctly. All told, we were offline for well over a day and a half.
Not only should this never have happened, but it was the third or fourth time this year. After the last outage, the provider was given a strict mandate from our contracting agency to ensure it never happened again.

The impact, while obvious on the surface, goes even deeper. We have over 1400 servers, the vast majority of which run Red Hat. All but about 50 of them are VMs. The VMs are stored on and run from backend storage, which is connected to the compute nodes via the aforementioned poorly designed and implemented infrastructure. When the network goes out, the compute nodes lose their storage and the servers are left in a very precarious state.

We end up having to run reports against all of our VMs to determine which have been affected and left in a read-only state. This is simple enough: my colleague has written a script which is executed remotely against all of our systems and provides this feedback. We are then left with the worst part of it. We must log in to each system via the console, reboot to our provisioning network, and load up the rescue environment to perform manual filesystem checks and repairs. Doing this for 1400 servers is, needless to say, a chore when there isn't a more robust solution in place. What makes this worse is the fact that we don't have access to vCenter, vSphere, or any of the infrastructure, storage, etc.

I'm at a loss as to how to make recovery from such an outage more expeditious, so I'm hoping someone here can provide some guidance. Has anyone else dealt with a similar situation, or at least have insight into steps we can take and tools we can implement to make our lives easier?

-Mathew

"When you do things right, people won't be sure you've done anything at all." - God; Futurama
"We'll get along much better once you accept that you're wrong and neither am I." - Me
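
P.S. To make the read-only check mentioned above concrete, here is a minimal sketch of the idea in Python. It is not my colleague's actual script. It assumes the usual Linux /proc/mounts layout (device, mount point, type, options), and the names `read_only_mounts` and `FS_TYPES` are made up for illustration.

```python
# Sketch only: list filesystems whose mount options include "ro".
# Assumption: only these on-disk filesystem types are of interest.
FS_TYPES = {"ext2", "ext3", "ext4", "xfs"}


def read_only_mounts(mounts_text, fs_types=FS_TYPES):
    """Return (device, mountpoint) pairs that are mounted read-only.

    mounts_text is the content of /proc/mounts, one mount per line:
    device, mount point, filesystem type, comma-separated options, ...
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, options = fields[:4]
        # Match "ro" as a whole option, not as a substring of e.g. "errors=...".
        if fstype in fs_types and "ro" in options.split(","):
            hits.append((device, mountpoint))
    return hits
```

On a host this would be called as `read_only_mounts(open("/proc/mounts").read())`. A non-empty result marks the machine as a candidate for the rescue-environment repair. Filtering by filesystem type keeps pseudo-filesystems that are normally mounted read-only out of the report.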
_______________________________________________
Tech mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
http://lopsa.org/
