Mathew> It seems I have a few solutions to work with, albeit none that
Mathew> are ideal.

Mathew> I do understand that *ideally* ext4 will automatically force a
Mathew> fsck upon the next reboot after the return of drives that have
Mathew> been unexpectedly "yanked". Historically, though, this doesn't
Mathew> happen.

Mathew> When we've rebooted (and I'm pretty sure I left this part out
Mathew> in previous descriptions) we have been presented with the
Mathew> directive to manually run fsck. Perhaps this is because fsck
Mathew> is not being told to correct errors automatically with the -y
Mathew> option, which leaves us booting into rescue mode and running
Mathew> it by hand every time. Perhaps simply adding the /fsckoptions
Mathew> file containing -y will solve the problem. Perhaps not.

Mathew> We are going to evaluate the solutions that have been presented

Mathew>    - onerror=panic with the kernel.panic sysctl option set to 0.
Mathew>    - tune2fs to set the -c option to 1, which will force a fsck
Mathew>      after each mount (i.e., each reboot).
Mathew>    - creating the /forcefsck file each time the system boots,
Mathew>      either by way of a cron job which checks for it and creates
Mathew>      it if it doesn't exist, or simply by way of an entry in
Mathew>      /etc/rc.local which touches the file (this will include the
Mathew>      /fsckoptions file, which will contain the -y option).
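
(For concreteness, rough sketches of what each of those would look
like, assuming a RHEL-style boot that honors /forcefsck and
/fsckoptions -- the device names are just placeholders:)

    # 1. panic on fs errors, with kernel.panic=0 so the box stays down
    tune2fs -e panic /dev/sdXN
    sysctl -w kernel.panic=0    # or kernel.panic = 0 in /etc/sysctl.conf

    # 2. check the filesystem on every mount
    tune2fs -c 1 /dev/sdXN

    # 3. force a check on every boot (pair with the /fsckoptions above)
    echo 'touch /forcefsck' >> /etc/rc.local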

Mathew> Again, none are ideal, but the situation is such that we have
Mathew> to pick *something*. Hopefully, we aren't in a position any
Mathew> time soon to be required to test any one of them.

I would think that with 1400 systems, you have a bunch of test boxes
that are perfect for testing ahead of time, so you can try to catch
errors and issues like these earlier on.  I realize you're more of a
hosting organization, but can't you spin up 10 VMs that are
representative of your workloads and use them to test out solutions as
outlined above?  Then push those that work out via your configuration
system (puppet, cfengine, pssh, chef, etc) to the rest of the systems?  
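
Even a one-liner push tool is enough for the test group -- something
like this with pssh (the host file name here is just an example):

    # push the chosen setting out to a handful of test VMs
    pssh -h test-hosts.txt -l root -i "sysctl -w kernel.panic=0"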

Faking some problems should be easy enough to do, esp if you can get
the cooperation of the backend SAN/iSCSI group to have them just pull
some connections on a test box to simulate this type of failure.
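
Even without their help you can fake a yanked disk from the host side
on Linux -- one way to do it, with the device name as a placeholder:

    # make the kernel drop the device, as if the cable were pulled
    echo 1 > /sys/block/sdX/device/delete
    # or just take it offline without removing it
    echo offline > /sys/block/sdX/device/state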

When you have this many systems, doing some testing would seem to me
to be easy to justify.  Just keeping track of 1,400 systems is a big
job deserving automation, etc.  

I'm also not sure that you really want to do a tune2fs to force fsck
on each boot, since that will just slow things down when you reboot a
system for other reasons.  

But doing the:  

    tune2fs -e panic /dev/dsk/... 

should be easy to test out too.  I guess what I'm saying is that you
should take the time to stress test some options and then push them
out since you have such a large setup.
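
For verifying that the setting actually stuck before you start yanking
devices, something like this (device name again a placeholder):

    tune2fs -l /dev/sdXN | grep -i 'errors behavior'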

John
