Hello,

we would like to have a more reliable, faster and easier way to do a
disaster recovery with Amanda.
For this we have done two things, both of which you will find in the
document appended below:

- written a (so far little tested) disaster recovery guide
- started planning a specialized Amanda disaster recovery system

We would like to create such a specialized system or to participate in a
similar project. It would be nice if you read through the document and
sent us your comments and ideas. We would like to know:

- what we can do better
- whether someone is working on something similar
- whether there is anybody who would like to participate in our project.

We plan to publish everything under the GPL, but we are open to
discussing a similar license.

Any feedback is welcome,

Bernd Harmsen
ds-DATASYSTEME

PS: If you want, I can send you a PDF, PS or LyX version by private
mail, which is much nicer to read.

===============================================
Planning for an Amanda Disaster Recovery System
===============================================

Bernd Harmsen
[EMAIL PROTECTED]
www.datasysteme.de

--------
Contents
--------

1 Introduction
  1.1 Why we need a specialized Amanda Disaster Recovery System?
2 Goals
3 Disaster recovery with native tools and possible optimizations.
  3.1 Provide working Hardware and Emergency System
  3.2 Restore a Linux-Backup-Client
  3.3 Restore Linux-Backup-Server
  3.4 Make the System bootable
4 Starting points for optimization
  4.1 Essential Backup Tool
    4.1.1 Easy Amanda Database export / import
  4.2 Specialized Amanda Recovery System on CD
    4.2.1 Remote Access
    4.2.2 Full automatic partitioning, formatting and mounting
    4.2.3 Amrestore Scripts

--------------
1 Introduction
--------------

This document was written to describe how to do a disaster recovery
with Amanda and to plan a specialized disaster recovery system for
Amanda.

We (ds-DATASYSTEME) are a small company specialized in Linux networks,
and we provide Amanda backup systems to our customers. We think that
Amanda is a great backup tool: very fast, reliable and with low
hardware requirements.
But we also think that Amanda lacks some features for recovery.
Recovery is more complicated than backup. This is normal, because
during a recovery you have to deal with an undefined, unknown situation
(e.g. a customer who wants some files back usually knows only parts of
the filename). That much is acceptable. The real problem for us is
disaster recovery. When the hard disk of an important server is broken
(or the server is completely lost), costs are high, time is short and
customers are impatient. For this case we need a safer, more reliable
and faster way to get the system working again.

We would like to create a specialized Amanda Disaster Recovery System,
maybe together with other members of the Amanda community, or to
participate in an existing project. We would like to publish this
system under the GPL or a similar license.

1.1 Why we need a specialized Amanda Disaster Recovery System?
--------------------------------------------------------------

Because the disaster recovery process as described in chapter [Disater
Recovery naitive] is too complicated (and therefore prone to human
error) and too slow. A disaster recovery consists of many different
steps that all need time and care. On the other hand, there are
customers who want their server back. The following timesheet shows our
estimate of the maximum time we have for a disaster recovery:

0.0h  A server fails.
0.5h  The customer calls support. A member of the support team does a
      diagnostic talk with the customer and packs some replacement
      hardware.
1.5h  The support member is on the way to the customer.
2.0h  The support member has arrived at the customer, analyzes the
      problem and repairs the system.
3.0h  The hardware is working again. Now the support member starts to
      recover the data from the Amanda backups. For this we plan:
        1.5h  Active work with the recovery tools.
        2.0h  Data transport over the network.
6.5h  The system is mostly working again.
8.0h  The system is well tested.
      All small problems that came up are solved.

As you can see, it takes a whole working day to get the system up and
running again. This is very long, and we should try to save time
wherever we can. Moreover, this timesheet is optimistic. We think that
it is hard to meet its deadlines without a specialized disaster
recovery system. It assumes that the support worker makes no major
errors. With a less trained worker it can even take 16 hours.

-------
2 Goals
-------

What are the goals of a specialized Amanda Disaster Recovery System?

1. Make the disaster recovery easier and more reliable (less affected
   by human errors).
2. Make the disaster recovery faster.

-----------------------------------------------------------------
3 Disaster recovery with native tools and possible optimizations.
<Disater Recovery naitive>
-----------------------------------------------------------------

This section describes how a disaster recovery can be done without a
specialized system. It uses only the installation media of a Debian
GNU/Linux 3.0 system and the Amanda backup. The concept is to install a
separate minimal Debian system on its own partition and to use this
system to restore the original partitions.

This section has two intentions:

1. Provide a step-by-step guide for a disaster recovery. You can use it
   as a guide, but the procedure is not well tested, because I wrote it
   down after my last disaster recovery. Feel free to send me
   corrections and suggestions.
2. Show how complicated and time-consuming a disaster recovery can be
   and find some good points to start optimization. This is the main
   goal.

The described way is too time-consuming and too complicated for a
stressful situation with an impatient customer behind you. So we would
like to build, or participate in, a more optimized and automated
disaster recovery system.

3.1 Provide working Hardware and Emergency System
-------------------------------------------------

1. Provide working hardware.

2. Plan partition table.
   In addition to the partitions for the system you want to recover
   (destination-system), you must provide a partition for the
   emergency system. Put this partition at the beginning of the table
   and give it e.g. 300MB. You need a backup of all your partition
   tables for that.

   Possible optimization: Full automatic partitioning, initialization
   and mounting (see [Full-automatic-partitioning]).

3. Install a Debian-Base-System

   Use your normal Debian installation method/media to install a base
   system on the additional partition. We will use this as the
   emergency system. Create the partitions as planned above, but only
   initialize and mount the partition for the emergency system.
   Install the following additional packages:

   Amanda:         amanda-client, amanda-server, tar, dump
   Remote-Access:  ssh, isdnutils-base, ipppd

   Possible optimization: Use a specialized Amanda Recovery System on
   a bootable CD (see [Amanda Recovery System on CD]).

4. Boot the emergency system.

5. Configure the IP network manually using ifconfig and route.

6. If you need remote access, e.g. for assistance from your office,
   configure ipppd manually.

   Possible optimization: Provide good defaults for the isdn config
   files (see [Remote Access]).

7. Initialize and mount the destination partitions.

   Possible optimization: Full automatic partitioning, initialization
   and mounting (see [Full-automatic-partitioning]).

   (a) Initialize the swap partition.

       mkswap <DEVICE>

   (b) Initialize the destination filesystem partitions.

       Initialize ext2 filesystems with the following command:

       mke2fs /dev/<DEVICE>

   (c) Mount the destination partitions.

       Assemble the destination partitions under the mountpoint /mnt.
       Use the following steps for that:

       i.   Mount the destination-"/"-partition under /mnt.

            mount /dev/<DEVICE> /mnt

       ii.  Create mountpoints for the other partitions in the
            destination-"/"-filesystem, e.g.: /var, /home, /groups,
            /usr

            mkdir /mnt/<MOUNTPOINT>

       iii. Mount all other destination partitions.

            mount /dev/<DEVICE> /mnt/<MOUNTPOINT>

8. Set correct date and time.
   date <MMDDhhmmYYYY>

3.2 Restore a Linux-Backup-Client
---------------------------------

Use this step if you have a working Amanda-Backup-Server and want to
restore a Linux-Backup-Client.

Now we restore the data from our Backup-Server to the inactive
destination system. For each partition we first restore the last level
"0" backup and then the last backup of each higher level.

1. Get root permissions.

   su

2. Go to the top level directory of the selected destination
   partition.

   cd /mnt/<MOUNTPOINT>

3. Run Amrecover. <Disaster-Linux-Client-Amrecover-starten>

   amrecover <CONFIG> -s <BACKUP-SERVER> -t <BACKUP-SERVER>

4. Set the source partition.

   sethost <NAME>
   setdisk <MOUNTPOINT>

5. Select all files and directories:

   add *

6. Verify the list of files marked for extraction. Note which tapes
   are needed.

   list

7. Note the number of the archive you need on each tape.

   history

   You will see lines like:

   201- 2002-03-06 0 ds-daily4 8

   The last column shows the number of the archive and the second to
   last the name of the tape. You need all listed tapes since the last
   level "0" backup.

8. Start the restore.

   extract

9. Verify that the destination directory shown is correct.

10. Load the tape and wind to the beginning of the archive.
    <disaster-Amrecover-Linux-Band-laden>

    (a) Log in on the Amanda backup server.

    (b) Load the tape requested by amrecover. Wait until the streamer
        is quiet again.

    (c) Wind to filemark X.
        Attention: X = archive-number - 1

        mt --file=/dev/<DEVICE> rewind
        mt --file=/dev/<DEVICE> fsf <X>

    (d) Wait until you get the next prompt.

11. Confirm to Amrecover on the backup client that the correct tape is
    loaded.

    Load tape <NAME> now
    Continue? [Y/n]: Y

12. Wait until the restore finishes.

13. Confirm that the original permissions of the top level directory
    should be restored.

    set owner/mode for '.'? [yn] y

14. If Amrecover wants another tape, proceed with step
    [disaster-Amrecover-Linux-Band-laden].

15. Leave Amrecover.

    quit

16.
    Proceed with step [Disaster-Linux-Client-Amrecover-starten] to
    restore the next partition.

3.3 Restore Linux-Backup-Server
-------------------------------

Use this step if your Amanda-Backup-Server itself is defective.

Because the Backup-Server has failed, there is no Amanda database and
you cannot use "amrecover". So we restore each partition with the less
comfortable tool "amrestore". You must find out manually which tapes
and which archive-numbers you need for recovery.

1. Find out the tapes and archive-numbers.

   For each destination partition you need the last level "0" backup
   and the last backup of each higher backup level. You can find this
   information manually in the e-mails you have received from
   "amverify" in the past. Here is an example:

   Below you find extracts from different "amverify" e-mails. Each
   e-mail shows the content of one tape. The last number on each line
   shows the backup level, and the position of the "Checked ..." line
   (counted from the top) gives the number of the archive on the tape.

   In the example we want to restore the "/home" partition of our
   Backup-Server "amun". We start with the last level "0" backup in
   archive-number 11 on tape "ds-daily4". After that we have to
   restore the last level "1" backup in archive-number 10 on tape
   "ds-daily7". There is no level "2" backup, so we need only two
   tapes.

   Date: Wed, 5 Mar 2003 12:51:21 +0100
   Subject: ds-daily AMANDA VERIFY REPORT FOR ds-daily4
   [...]
   Using device /dev/nst0
   Volume ds-daily4, Date 20030305
   Checked upuaut.datasys._boot.20030305.0
   Checked inpu.datasys._boot.20030305.0
   Checked amun.datasys.__ra.datasys_E$.20030305.1
   Checked amun.datasys.__aset.datasys_E$.20030305.1
   Checked amun.datasys.__ra.datasys_D$.20030305.1
   Checked inpu.datasys._var_lib.20030305.0
   Checked amun.datasys._usr.20030305.0
   Checked amun.datasys.__djhuti.datasys_E$.20030305.0
   Checked amun.datasys.__djhuti.datasys_F$.20030305.1
   Checked inpu.datasys._var.20030305.3
   Checked amun.datasys._home.20030305.0
   [...]
   Date: Thu, 6 Mar 2003 12:59:49 +0100
   Subject: ds-daily AMANDA VERIFY REPORT FOR ds-daily6
   [...]
   Using device /dev/nst0
   Volume ds-daily6, Date 20030306
   Checked amun.datasys._usr.20030306.1
   Checked inpu.datasys._boot.20030306.1
   Checked upuaut.datasys._.20030306.1
   Checked upuaut.datasys._boot.20030306.1
   Checked amun.datasys._.20030306.1
   Checked inpu.datasys._var_lib.20030306.1
   Checked upuaut.datasys._var.20030306.1
   Checked amun.datasys.__aset.datasys_E$.20030306.1
   Checked inpu.datasys._.20030306.1
   Checked amun.datasys.__djhuti.datasys_F$.20030306.1
   Checked amun.datasys.__ra.datasys_E$.20030306.1
   Checked amun.datasys._var.20030306.1
   Checked amun.datasys.__ra.datasys_C$.20030306.1
   Checked amun.datasys.__aset.datasys_C$.20030306.1
   Checked amun.datasys.__djhuti.datasys_C$.20030306.1
   Checked amun.datasys.__aset.datasys_D$.20030306.1
   Checked amun.datasys.__djhuti.datasys_E$.20030306.1
   Checked inpu.datasys._var.20030306.0
   Checked amun.datasys.__ra.datasys_D$.20030306.0
   Checked amun.datasys.__djhuti.datasys_D$.20030306.0
   Checked amun.datasys._home.20030306.1
   [...]

   Date: Fri, 7 Mar 2003 13:41:35 +0100
   Subject: ds-daily AMANDA VERIFY REPORT FOR ds-daily7
   [...]
   Using device /dev/nst0
   Volume ds-daily7, Date 20030307
   Checked inpu.datasys._boot.20030307.1
   Checked amun.datasys._usr.20030307.1
   Checked upuaut.datasys._.20030307.1
   Checked upuaut.datasys._boot.20030307.1
   Checked amun.datasys._.20030307.1
   Checked inpu.datasys._var_lib.20030307.1
   Checked upuaut.datasys._var.20030307.2
   Checked amun.datasys.__ra.datasys_D$.20030307.1
   Checked inpu.datasys._.20030307.1
   Checked amun.datasys._home.20030307.1
   [...]

   Possible optimization: Provide an easy export/import mechanism for
   the Amanda database, so that "amrecover" can be used here (see
   [Easy Amanda Database export / import]).

2. TAR or DUMP?

   For each partition you must find out whether the backup was made
   using "tar" or "dump". You find this information in your Amanda
   disklist file (e.g.: /etc/amanda/<CONFIG>/disklist), if you have a
   separate backup of it.
   Possible optimization: Provide an "Essential Backup" tool that
   stores such information in a separate backup (see [Essential
   Backup]).

3. If you do not have root permissions in the emergency system, get
   them now.

   su

4. Restore the destination partitions.

   (a) Change to the top level directory of the destination partition.
       <Disaster-Linux-Backup-Server-CD>

       cd /mnt/<MOUNTPOINT>

   (b) Insert the correct tape.
       <Disaster-Linux-Backup-Server-Bandwechsel>

   (c) Wind to filemark X.
       Attention: X = archive-number - 1

       mt --file=/dev/<DEVICE> rewind
       mt --file=/dev/<DEVICE> fsf <X>

   (d) Run "amrestore".

       For DUMP-Backups:

       amrestore -p /dev/<DEVICE> "<HOSTNAME>" "<MPOINT>$" | restore -rv -b2 -f-

       For TAR-Backups:

       amrestore -p /dev/<DEVICE> "<HOSTNAME>" "<MPOINT>$" | tar -xvpmi -f- --ignore-failed-read --same-owner

       Possible optimization: Provide simple scripts that run these
       nasty commands (see [Amrestore Scripts]).

   (e) If there are more backup levels for this partition, proceed
       with step [Disaster-Linux-Backup-Server-Bandwechsel].

   (f) If there are more partitions, proceed with step
       [Disaster-Linux-Backup-Server-CD].

3.4 Make the System bootable
----------------------------

1. Change "/" to the destination system.

   With this command the destination system becomes the active system.
   You can mostly use it as if you had booted it.

   chroot /mnt

2. Make sure that /proc is an empty directory.

   /proc is a virtual file system provided by the kernel. During the
   restore process it may have been restored with its contents, but it
   should only be a mountpoint.

   rm -f /proc/*

3. Check /etc/fstab.

   Does the fstab match the new partition table?

4. Check /etc/lilo.conf.

   Do the parameters "root" and "boot" match the new partition table?

   root = Device that contains the "/"-partition (e.g. /dev/sda2).
   boot = Device that should contain the bootsector (e.g. /dev/sda).

5. Write a new bootsector.

   liloconfig

6. Exit "chroot".

   exit

7. Boot the restored destination system.

   shutdown -r now

8. That's all.
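
Looking back at step 1 of section 3.3: counting the "Checked ..." lines
by hand is easy to get wrong under pressure. The following is only a
sketch of how that lookup could be scripted. The function name
archive_number is hypothetical, and it assumes that a saved amverify
e-mail has one "Checked <host>.<disk>.<date>.<level>" entry per line,
as in the extracts shown in section 3.3.

```shell
#!/bin/sh
# Sketch only: archive_number is a hypothetical helper, not part of Amanda.
# Reads one amverify report on stdin and prints "<archive-number> <level>"
# for the requested host and disk.
#   $1 = host as written in the report (e.g. amun.datasys)
#   $2 = disk as written in the report (e.g. _home for /home)
archive_number() {
    awk -v prefix="$1.$2." '
        $1 == "Checked" {
            n++                              # position of this "Checked" line
            if (index($2, prefix) == 1) {    # entry starts with host.disk.
                k = split($2, part, ".")     # last dot-separated field = level
                print n, part[k]
            }
        }'
}
```

Called as "archive_number amun.datasys _home < report.txt", it prints
the archive-number and the backup level. The filemark for "mt fsf" is
then the archive-number minus 1, as described in the steps above.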
----------------------------------
4 Starting points for optimization
----------------------------------

This part shows the possible targets for optimization, extracted from
chapter [Disater Recovery naitive]. At the moment this is more a
brainstorming than a detailed concept. We would like to read your ideas
about it.

4.1 Essential Backup Tool <Essential Backup>
-------------------------

This little script should collect all the essential information that
is needed in case of a disaster recovery and store it in one or more
safe places apart from the normal backups. It can be installed on all
Linux hosts and started by (ana)cron, e.g. once a week.

The information we consider essential is:

* Configuration (/etc/*, incl. full Amanda config)
* Partition table
* Installed packages (dpkg --get-selections)
* Amanda database (only on the Backup-Server, amadmin <CONFIG> export)

We plan to provide the following ways to save this information:

* on a local floppy disk.
* by GPG encrypted e-mail.
* by sftp or ftp.

4.1.1 Easy Amanda Database export / import <Easy Amanda Database export / import>

Provide a way to use "amrecover" even if the Backup-Server has failed.
For this we need an easy import of the Amanda database from the last
essential backup.

If there are problems with that, we can provide a script that extracts
the information about tapes and archive-numbers from an Amanda database
and optionally calls amrestore (see [Amrestore Scripts]).

4.2 Specialized Amanda Recovery System on CD <Amanda Recovery System on CD>
--------------------------------------------

Provide a bootable emergency system on CD that contains:

* a base system
* all necessary tools
* some scripts to make disaster recovery easier.
* an "import" function for the essential backups.
* maybe a kind of GUI where you only select the name of the host you
  want to restore and everything else runs automatically. But we think
  this is a lot of work and should be postponed to a later step.
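
The "import" function of the CD depends on what the essential backup
from section 4.1 actually contains. As a basis for discussion, here is
a rough sketch of the collecting side. The function name
essential_backup, its arguments and the output file names are all
assumptions made for illustration; the use of sfdisk -d to dump the
partition table is likewise our assumption, while the dpkg and amadmin
calls are the ones listed in section 4.1.

```shell
#!/bin/sh
# Sketch only: essential_backup and its file layout are hypothetical.
#   $1 = output directory (use an absolute path)
#   $2 = configuration directory (default /etc)
#   $3 = disk device for the partition table, e.g. /dev/sda (optional)
#   $4 = Amanda config name, only on the Backup-Server (optional)
essential_backup() {
    out=$1; etc=${2:-/etc}; disk=${3:-}; config=${4:-}
    mkdir -p "$out" || return 1

    # Configuration, including the full Amanda config below /etc
    tar -czf "$out/etc.tar.gz" -C "$etc" . || return 1

    # Partition table (assumption: sfdisk -d, needs root)
    if [ -n "$disk" ] && command -v sfdisk >/dev/null 2>&1; then
        sfdisk -d "$disk" > "$out/partition-table.sfdisk" \
            || echo "warning: could not dump partition table" >&2
    fi

    # Installed packages (Debian)
    if command -v dpkg >/dev/null 2>&1; then
        dpkg --get-selections > "$out/packages.selections" \
            || echo "warning: could not list packages" >&2
    fi

    # Amanda database, only when a config name is given
    if [ -n "$config" ] && command -v amadmin >/dev/null 2>&1; then
        amadmin "$config" export > "$out/amanda-db.export" \
            || echo "warning: could not export Amanda database" >&2
    fi
}
```

A weekly (ana)cron job could call, for example,
"essential_backup /var/backups/essential /etc /dev/sda <CONFIG>" and
then copy the directory to floppy, e-mail or sftp. Transport and GPG
encryption are deliberately left out of this sketch.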
4.2.1 Remote Access <Remote Access>

Provide good call-in defaults for the isdn config files device.ippp0
and ipppd.ippp0. The support worker should only have to load the
correct kernel module and change the MSN. With this feature a less
trained worker can start the disaster recovery system, and someone in
the main office can continue or assist.

4.2.2 Full automatic partitioning, formatting and mounting <Full-automatic-partitioning>

For this we can write a script that reads all necessary information
from the essential backup of the selected host (see [Essential
Backup]) and automatically:

* partitions the hard disk(s).
* initializes the partitions with the correct filesystem or swap.
* mounts the partitions for disaster recovery.

4.2.3 Amrestore Scripts <Amrestore Scripts>

Provide little scripts (e.g. "amrestoredump" and "amrestoretar") that
run the following nasty "amrestore" commands on the backup server, in
cases where we cannot use "amrecover". But maybe this can/should be
more automatic.

* For DUMP-Backups:

  amrestore -p /dev/<DEVICE> "<HOSTNAME>" "<MPOINT>$" | restore -rv -b2 -f-

* For TAR-Backups:

  amrestore -p /dev/<DEVICE> "<HOSTNAME>" "<MPOINT>$" | tar -xvpmi -f- --ignore-failed-read --same-owner

E.g.:

  amrestoretar <DEVICE> <HOSTNAME> <MPOINT>
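
To make the idea more concrete, here is a minimal sketch of what such
helpers could look like. The function names filemark_for_archive and
amrestore_cmd are hypothetical. The sketch only prints the pipeline
instead of running it, so that the support worker can check it before
anything touches the tape; the printed commands are exactly the two
shown above.

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical.

# Print the filemark to pass to "mt fsf" for a given archive-number
# (X = archive-number - 1).
filemark_for_archive() {
    echo $(( $1 - 1 ))
}

# Print (do not run) the restore pipeline for one archive.
#   $1 = backup type: tar or dump
#   $2 = tape device name below /dev (e.g. nst0)
#   $3 = hostname
#   $4 = mountpoint
amrestore_cmd() {
    case "$1" in
        dump) unpack='restore -rv -b2 -f-' ;;
        tar)  unpack='tar -xvpmi -f- --ignore-failed-read --same-owner' ;;
        *)    echo "unknown backup type: $1" >&2; return 1 ;;
    esac
    printf 'amrestore -p /dev/%s "%s" "%s$" | %s\n' "$2" "$3" "$4" "$unpack"
}
```

For the /home example from section 3.3, "filemark_for_archive 11"
gives the value for "mt fsf", and "amrestore_cmd tar nst0 amun /home"
prints the pipeline to run from /mnt/home. An "amrestoretar" wrapper
would then simply execute the printed line.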