Re: [lopsa-tech] Need ideas/suggestions for bringing several VMs back online after an outage

Jared Moore Tue, 29 Oct 2013 13:22:01 -0700

The fstab option (the sixth field, fs_passno) only specifies the order
in which the fsck will happen, if at all. tune2fs determines the check
interval (the maximum mount count, -c, and the time interval, -i), with
most systems having the mount count set to at least 5, meaning 5 mounts
(in practice, reboots) before an fsck is done. To force an fsck you can
create an empty file, /forcefsck, which should fsck all filesystems at
the next reboot. /forcefsck gets deleted after the checks, so a simple
rc.local script to recreate it would be fine.
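
As a rough sketch of how the cron-style check and the /forcefsck flag
could fit together: the script below reads /proc/mounts, looks for ext
filesystems mounted read-only, and creates the flag file if it finds
any. The function names are made up for illustration, and it assumes a
RHEL-style init that honors /forcefsck and that no ext filesystem is
mounted read-only on purpose. Untested on a real outage, so treat it as
a starting point only.

```python
"""Request a full fsck at next boot if an ext filesystem is read-only.

Illustrative sketch: assumes the init scripts honor /forcefsck and that
no ext2/3/4 filesystem is intentionally mounted read-only.
"""

EXT_TYPES = {"ext2", "ext3", "ext4"}


def read_only_mounts(mounts_text, fs_types=EXT_TYPES):
    """Return mount points of the given types whose options include 'ro'."""
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        # Exact match on "ro" so that "errors=remount-ro" is not counted.
        if fs_type in fs_types and "ro" in options.split(","):
            hits.append(mount_point)
    return hits


def request_fsck(flag_path="/forcefsck"):
    """Create the empty flag file if it does not already exist."""
    with open(flag_path, "a"):
        pass


def main(mounts_path="/proc/mounts", flag_path="/forcefsck"):
    """Check the mount table and set the flag when something is read-only."""
    with open(mounts_path) as handle:
        bad = read_only_mounts(handle.read())
    if bad:
        request_fsck(flag_path)
    return bad
```

Calling main() as root from cron would only set the flag; the fsck
itself still happens at the next reboot, so something else has to
trigger that.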


We had an outage at the beginning of the year that I think was
similar to yours, but in our case we actually lost an entire SAN box:
multiple drive failures across multiple arrays. Our fix was the same
as yours: identify the bad servers, boot a rescue disk, and fix by
either fsck or tape restore for the affected filesystem. Granted, we
only had around 200 servers and 5 guys to fix them, so not as big as
your scale, but it was still painful.

I've since switched jobs, but at the time I was planning on coming up
with a PowerCLI script to boot problem boxes to our rescue ISO to save
that step. Dealing with the console across our VPN was a pain as well,
so a rescue ISO that got an IP and started ssh was something I was
going to come up with too.

Jared

>> On 29.10.2013 19:51, Mathew Snyder wrote:
>>>
>>> I agree. Others have mentioned this as well.
>>>
>>> I just need to work out how to ensure that fsck is performed after
>>> EVERY reboot so we can ensure this is corrected when it happens rather
>>> than logging in and running tune2fs on each one. I suppose a cron'd
>>> script that checks the state of the filesystem and forces a fsck if
>>> any are in read-only mode when they shouldn't be would be a start.
>>>
>>>
>>> If there is a method to configure the OS to do this without a script
>>> that would be ideal. We'd prefer just flipping a setting that tells the
>>> OS to run a fsck on reboot whether the filesystem is clean or dirty.
>>>
>>> -Mathew
>>>
>>> "When you do things right, people won't be sure you've done anything at
>>> all." - God; Futurama
>>>
>>> "We'll get along much better once you accept that you're wrong and
>>> neither am I." - Me
>>>
>>> On Tue, Oct 29, 2013 at 12:45 PM, John Stoffel <[email protected]> wrote:
>>>
>>>> Hi Mathew,
>>>>
>>>> One question I have is why don't your 1400 servers just do filesystem
>>>> checks on reboot then?  Since you have to stop them and reboot them,
>>>> what's wrong with letting the OS do the work?  This should be more
>>>> scriptable than having to manually boot into a recovery setup.
>>>>
>>>> Do you have the data stored on the VMs?  It might be quicker to just
>>>> rebuild the VMs from known good configs and then get them running
>>>> again.
>>>>
>>>> Honestly, if you're going to reboot them anyway (probably by a hard
>>>> reset), try letting the Red Hat OS do the filesystem checks on reboot
>>>> instead.
>>>>
>>>> John
>>>
>>>
>>>
>>> _______________________________________________
>>> Tech mailing list
>>> [email protected]
>>> https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
>>> This list provided by the League of Professional System Administrators
>>>  http://lopsa.org/
