Sorry for cross-posting, but maybe this will be interesting to both lists.

Does anyone know of any lists that discuss reliability and disaster
recovery?  I have some ideas below, but they are from my experience only.

George Kasica wrote:
[...]

> What I'm asking is: Is there ANYONE out there that know how to do this
> and would be willing to do the job FOR PAY in the very near future?
> REBUILDING THE BOX FROM SCRATCH IS NOT AN OPTION. Though I do do a
> full tape backup nightly, the time between the attempts and noticing
> they broke something is beyond the length of tapes I have here...

It would be cheaper for you, I think, to fix it yourself.  If you can't
have the box down, get another one and build a system on it (you can start
with backup tapes if you think that's easier).  Copy over all your
production stuff and test it out.  When it's ready, stick it into the
network and pull the old one.  The switch should be done during a low
traffic period, and if you're careful about how you do it (and how your
network is arranged), your customers won't notice.
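The copy step can look something like this (all names and paths made up; here both "boxes" are scratch directories so the sketch is safe to run, but on a real network you'd pipe the first tar through rsh/ssh to the other machine):

```shell
set -e
OLD=$(mktemp -d)   # stands in for the production box
NEW=$(mktemp -d)   # stands in for the replacement box
mkdir -p "$OLD/home/customer"
echo "hello" > "$OLD/home/customer/index.html"
# tar-over-pipe preserves permissions, symlinks, and timestamps,
# which a plain cp may not:
( cd "$OLD" && tar -cf - . ) | ( cd "$NEW" && tar -xf - )
cmp -s "$OLD/home/customer/index.html" "$NEW/home/customer/index.html" \
  && echo "copy OK"
```

The nice thing about the tar pipe is that the same command works locally or over the network, so you can rehearse the copy before the real cutover.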

If you don't have time to do this yourself, then I guess you're into paying
someone.  But I would not try to fix the problem in place.  gcc is one
thing, but when you mess with glibc you risk breaking what's running, and
your customers will notice.  I don't think I'd trust even an expert to do
this, since he's unlikely to know all the details of your system.

Here are some suggestions to avoid this problem in the future:

You should have a test box in addition to your production box.  They should
be identical hardware and software (you probably don't need gcc or other
development tools on the production box, though).  When you need to upgrade
a package, build it on the test box in such a way that you get a tar file -
there's usually some sort of distribution make target for this sort of
thing.
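One common way to get such a tar file is a DESTDIR-style staged install (the package name foo-1.2 is made up; with a real autoconf-style package the mkdir/printf lines below would instead be `./configure --prefix=/usr/local/foo-1.2 && make && make install DESTDIR="$STAGE"`):

```shell
set -e
STAGE=$(mktemp -d)
# Fake a staged install so the sketch runs on its own:
mkdir -p "$STAGE/usr/local/foo-1.2/bin"
printf '#!/bin/sh\necho foo 1.2\n' > "$STAGE/usr/local/foo-1.2/bin/foo"
chmod +x "$STAGE/usr/local/foo-1.2/bin/foo"
# Pack the staged tree; untarring this at / reproduces the install exactly.
tar -C "$STAGE" -czf foo-1.2.tar.gz .
tar -tzf foo-1.2.tar.gz | grep -q 'usr/local/foo-1.2/bin/foo' && echo "tarball OK"
```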

Install the package by untarring, both on the test box and the production
box.  To be even safer, get yourself a development box and do the building
there.  That way the process you use on the test box and the production box
will be identical, and you'll know what to expect.
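The install step is then the same one-liner on every box (the tarball here is a stand-in created inline so the sketch runs alone; ROOT would be / on the real machines):

```shell
set -e
# Create a stand-in build artifact:
STAGE=$(mktemp -d)
mkdir -p "$STAGE/usr/local/foo-1.2"
echo "config" > "$STAGE/usr/local/foo-1.2/foo.conf"
tar -C "$STAGE" -czf foo-1.2-install.tar.gz .
# The actual install step, identical on test and production:
ROOT=$(mktemp -d)
tar -C "$ROOT" -xzf foo-1.2-install.tar.gz
test -f "$ROOT/usr/local/foo-1.2/foo.conf" && echo "install OK"
```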

Test out your new package *thoroughly* before you put it on the production
box.  You have to make sure that config and data files get updated at the
same time.  Ideally, you'll be able to back out changes in your test system
and start over so that if there's a problem you can fix it and redo the
whole process.  Your upgrade procedure needs to be straightforward so you
don't mess it up on the production system.

Bigger packages, like glibc, will need more testing than something like
apache.  But you should only upgrade the base system when it's necessary to
support a new package you need.  This system should be stable, not bleeding
edge.

If your customers are changing their data frequently, then backing out a
package by restoring from tape becomes problematic.  You should think about
isolating all your application packages in their own directory trees so
they can be removed and reinstalled easily.  You should save each build you
do, and have a way to identify which configurations you used with which
build (a version control system can be very good for this).  You might also
consider backing up data separately from the applications, so you can
downgrade an application without losing customer data.
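One way to arrange this (version numbers made up): give each build its own tree and point a "current" symlink at the active one, so a downgrade is just re-pointing the link and never touches customer data.

```shell
set -e
APPS=$(mktemp -d)   # stands in for something like /opt or /usr/local
mkdir -p "$APPS/apache-1.3.9" "$APPS/apache-1.3.12"
ln -sfn "$APPS/apache-1.3.12" "$APPS/current"   # upgrade
ln -sfn "$APPS/apache-1.3.9"  "$APPS/current"   # back it out
readlink "$APPS/current"
```

Your startup scripts and configs refer only to .../current, so nothing else has to change when you switch versions.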

Finally, you should expect your production box to be blown into tiny pieces
some day.  Pretend that it happened and do what you'd need to recover from
it.  That will test your backups and documentation to make sure that it all
works.  It will also give you some idea of how long you'll be down, worst
case.  Maybe it's overly idealistic to expect you to know that you can
handle disasters, but none of the providers I know does this (and they
aren't small businesses).
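The core of the drill can be rehearsed cheaply: restore the backup into a scratch root and verify a known file survives the round trip (the backup name and contents below are made up; in the real drill you'd read last night's tape instead of the tarball):

```shell
set -e
WORK=$(mktemp -d); cd "$WORK"
# Stand-in for the data you back up nightly:
mkdir -p etc && echo "hostname=www" > etc/hosts.cfg
tar -czf backup.tar.gz etc
# The drill: restore into a scratch root and compare.
SCRATCH=$(mktemp -d)
tar -C "$SCRATCH" -xzf backup.tar.gz
cmp -s etc/hosts.cfg "$SCRATCH/etc/hosts.cfg" && echo "restore OK"
```

Time how long the full restore takes, too; that number is your worst-case downtime.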

HTH,
Dave


_______________________________________________
Linux-users mailing list
Archives, Digests, etc at http://linux.nf/mailman/listinfo/linux-users
