Definitely not on my shoulders, thanks. :) We've done as much as we can to
ensure we are able to recover as quickly as possible, but as I described,
this outage has exceeded the scale of our current method.

They are RHEL6 systems (with a few RHEL5), so we are primarily looking at
ext4 with a handful of ext3. ext4 does seem robust enough to handle this
type of outage; it seems we simply don't have the proper configuration in
place to take advantage of it.
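
For what it's worth, the first knobs I plan to look at are the ext4 error
policy and the mount options. A rough sketch of what I mean (the device
name here is hypothetical):

    # Check how the filesystem reacts when the kernel detects errors
    tune2fs -l /dev/vda1 | grep -i 'errors behavior'

    # Remount read-only on errors rather than continuing
    tune2fs -e remount-ro /dev/vda1

    # Or per-mount in /etc/fstab, with write barriers explicitly on:
    # /dev/vda1  /  ext4  defaults,errors=remount-ro,barrier=1  1 1

At least then a filesystem that loses its storage fails safe instead of
quietly taking damage.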

I guess the question now is: what have I done wrong that prevents this
from happening?


-Mathew

"When you do things right, people won't be sure you've done anything at
all." - God; Futurama

"We'll get along much better once you accept that you're wrong and neither
am I." - Me


On Tue, Oct 29, 2013 at 12:10 PM, matthewhall <[email protected]> wrote:

> Sorry to hear of your troubles ... I trust it doesn't fall squarely
> (politically) on you and your co-worker's shoulders.
>
> May I enquire as to the nature of the filesystems on these VMs?
> It surprises me that a sudden inability to write to the block device
> beneath is causing such hassle at the FS layer; ext3 and upward (as is
> standard under RH) have a pretty robust journal system.
>
> Maybe I've just been extremely lucky :)
>
>
>
>
> On 29.10.2013 18:32, Mathew Snyder wrote:
>
>> We recently went through a very difficult situation (both technically
>> and politically) as a result of poor infrastructure design and
>> implementation by our data services provider. On Saturday, a network
>> issue took our entire environment offline, and we are still
>> dealing with straggler issues while our customers verify their
>> applications are online and functioning correctly. All told, we were
>> offline for well over a day and a half.
>>
>> Not only should this never have happened, but it was the third or fourth
>> time this year. After the last outage, the provider was given a strict
>> mandate from our contracting agency to ensure it never happened again.
>>
>>
>> The impact, while obvious on the surface, goes even deeper. We have
>> over 1400 servers, the vast majority of which are Red Hat. All but about
>> 50 of them are VMs. The VMs are stored and run from backend storage.
>> The backend storage is connected to the compute nodes via the
>> aforementioned poorly designed and implemented infrastructure. When
>> the network goes out, the compute nodes lose their storage and the
>> servers are left in a very precarious state.
>>
>> We end up having to run reports against all of our VMs to determine
>> which have been affected and left in a read-only state. This is simple
>> enough. My colleague has written a script that is executed remotely
>> against all of our systems and provides this feedback.
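
(A rough sketch of that kind of check -- not the actual script; the host
list file is hypothetical, and it assumes key-based ssh and that
/proc/mounts lists the rw/ro flag first in the options field:

    for host in $(cat hostlist.txt); do
        if ssh -o ConnectTimeout=5 -o BatchMode=yes "$host" \
            "grep -E ' ext[34] ro[ ,]' /proc/mounts" >/dev/null 2>&1; then
            echo "$host: read-only ext3/ext4 filesystem found"
        fi
    done

Anything it prints is a VM that needs attention.)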
>>
>> We are then left with the worst part of it. We must then log in to
>> each system via the console, reboot to our provisioning network, and
>> load up the rescue environment to perform manual filesystem checks and
>> repairs. Doing this for 1400 servers is, needless to say, a chore when
>> there isn't a more robust solution in place.
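
(The manual pass per box boils down to a forced check of each affected
filesystem from the rescue shell; the device names below are hypothetical
and vary per host:

    # -f forces a full check even when the journal looks clean;
    # -y answers yes to every repair prompt
    e2fsck -f -y /dev/mapper/vg_root-lv_root
    e2fsck -f -y /dev/sda1

Then we reboot off the provisioning network and back into production.)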
>>
>> What makes this worse is the fact that we don't have access to
>> vCenter, vSphere, or any of the infrastructure/storage/etc.
>> I'm at a loss as to how to make recovery from such an outage more
>> expeditious, so I'm hoping someone here can provide some guidance.
>>
>>
>> Has anyone else dealt with a similar situation or at least have
>> insight into steps we can take and tools we can implement to make our
>> lives easier?
>>
>> -Mathew
>>
>> "When you do things right, people wont be sure youve done anything at
>> all." - God; Futurama
>>
>> "Well get along much better once you accept that youre wrong and
>>
>> neither am I." - Me
>
_______________________________________________
Tech mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
http://lopsa.org/
