Sorry for cross-posting, but this may be of interest on both lists. Does anyone know of any lists that discuss reliability and disaster recovery? I have some ideas below, but they come only from my own experience.
George Kasica wrote:
[...]
> What I'm asking is: Is there ANYONE out there that know how to do this
> and would be willing to do the job FOR PAY in the very near future?
> REBUILDING THE BOX FROM SCRATCH IS NOT AN OPTION. Though I do do a
> full tape backup nightly, the time between the attempts and noticing
> they broke something is beyond the length of tapes I have here...

It would be cheaper for you, I think, to fix it yourself. If you can't have the box down, get another one and build a system on it (you can start with backup tapes if you think that's easier). Copy over all your production stuff and test it out. When it's ready, stick it into the network and pull the old one. The switch should be done during a low-traffic period, and if you're careful about how you do it (and how your network is arranged), your customers won't notice.

If you don't have time to do this yourself, then I guess you're into paying someone. But I would not try to fix the problem in place. gcc is one thing, but when you mess with glibc you risk breaking what's running, and your customers will notice. I don't think I'd trust even an expert to do this, since he's unlikely to know all the details of your system.

Here are some suggestions to avoid this problem in the future:

You should have a test box in addition to your production box. The two should have identical hardware and software (you probably don't need gcc or other development tools on the production box, though). When you need to upgrade a package, build it on the test box in such a way that you get a tar file; there's usually some sort of distribution make target for this sort of thing. Install the package by untarring it, both on the test box and on the production box.

To be even safer, get yourself a development box and do the building there. That way the process you use on the test box and the production box will be identical, and you'll know what to expect.
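The build-once, untar-everywhere idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not a finished tool: the function names `make_package` and `install_package`, the staging directory layout, and the SHA-256 check are all hypothetical additions. The checksum is just one way to confirm that the test box and the production box received the same file.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def make_package(staging_dir: Path, tarball: Path) -> str:
    """Pack everything under staging_dir into a gzip tarball; return its SHA-256."""
    with tarfile.open(tarball, "w:gz") as tar:
        # arcname="." stores member paths relative to the staging root.
        tar.add(staging_dir, arcname=".")
    return hashlib.sha256(tarball.read_bytes()).hexdigest()


def install_package(tarball: Path, expected_sha256: str, root: Path) -> None:
    """Check the tarball against the recorded checksum, then untar it under root."""
    actual = hashlib.sha256(tarball.read_bytes()).hexdigest()
    if actual != expected_sha256:
        raise ValueError("checksum mismatch: refusing to install " + str(tarball))
    root.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tarball, "r:gz") as tar:
        tar.extractall(root)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp)
        # Hypothetical staged install tree produced on the build box.
        staging = work / "staging"
        (staging / "opt" / "demo" / "etc").mkdir(parents=True)
        (staging / "opt" / "demo" / "etc" / "demo.conf").write_text("port=8080\n")

        tarball = work / "demo-1.0.tar.gz"
        digest = make_package(staging, tarball)

        # The same file and the same procedure are used on both boxes.
        for box in ("testbox", "prodbox"):
            install_package(tarball, digest, work / box)
            print(box, "installed:", (work / box / "opt/demo/etc/demo.conf").exists())
```

The point of the design is that nothing is compiled on the target machines: whatever you validated on the test box is byte-for-byte what lands on the production box.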
Test out your new package *thoroughly* before you put it on the production box. You have to make sure that config and data files get updated at the same time. Ideally, you'll be able to back out changes in your test system and start over, so that if there's a problem you can fix it and redo the whole process. Your upgrade procedure needs to be straightforward so you don't mess it up on the production system.

Bigger packages, like glibc, will need more testing than something like apache. But you should only upgrade the base system when it's necessary to support a new package you need. This system should be stable, not bleeding edge.

If your customers are changing their data frequently, then backing out a package by restoring from tape becomes problematic. You should think about isolating all your application packages in their own directory trees so they can be removed and reinstalled easily. You should save each build you do, and have a way to identify which configurations you used with which build (a version control system can be very good for this). You might also consider backing up data separately from the applications, so you can downgrade an application without losing customer data.

Finally, you should expect your production box to be blown into tiny pieces some day. Pretend that it happened and do what you'd need to do to recover from it. That will test your backups and documentation to make sure that it all works. It will also give you some idea of how long you'll be down, worst case.

Maybe it is overly idealistic to expect you to know that you can handle disasters. None of the providers I know does this (and they aren't small businesses).

HTH,
Dave
_______________________________________________
Linux-users mailing list
Archives, Digests, etc at http://linux.nf/mailman/listinfo/linux-users