Hi Folks,
Here it is, Friday, and my birthday to boot, with one fire on my desk
already, when I discover that a critical server has crashed....
The server is running Sarge (I know, I was just about to upgrade, but if
it ain't broke, why fix it). It crashed this morning, and I'm having a
horrible time recovering it. Any help anyone can offer would be very
much appreciated.
The basic configuration:
- i686 motherboard, Pentium chip
- 2 SATA channels, 2 drives on each (total of 4)
- 4 partitions on each drive
- 4 md devices are built across the four drives (each with 3 active
members and 1 spare)
- two md devices are used for boot and swap
- the other two md devices have logical volumes on top of them (LVM) -
used for / and /backup (large archive)
- all MBRs set up to boot
The failure:
- looks like one of the two SATA channels has died, taking down the two
attached drives
-- the system should keep running, but doesn't, and won't come up
--- it gets pretty far in the boot process, then starts throwing errors
"devfs_mk_dir invalid argument, could not append to parent for /disc"
and freezes
- if I boot from a live CD, I get errors from the ATA driver (IO error,
and so forth) - very obviously hardware errors
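
To make the failure mode concrete, here is a rough Python sketch of the
layout described above. The drive names (sda..sdd), the channel
assignment, and the choice of which drive holds the spare partitions are
all assumptions made for illustration; none of them come from the actual
box. It only counts what is left of each array after one channel dies.
Whether that is enough to keep running depends on the RAID level.

```python
# Hypothetical names throughout: drive letters, channel mapping, and the
# spare's location are illustrative assumptions, not the real configuration.
CHANNEL_OF = {"sda": 0, "sdb": 0, "sdc": 1, "sdd": 1}


def build_arrays(spare_drive):
    """Four arrays, each using the same partition number on all four drives."""
    arrays = {}
    for part in range(1, 5):
        arrays[f"md{part - 1}"] = {
            "active": [f"{d}{part}" for d in CHANNEL_OF if d != spare_drive],
            "spare": f"{spare_drive}{part}",
        }
    return arrays


def survivors(arrays, dead_channel):
    """Per array: (active members left, spares left) after a channel dies."""
    result = {}
    for name, arr in arrays.items():
        alive = [p for p in arr["active"] if CHANNEL_OF[p[:3]] != dead_channel]
        spare_ok = CHANNEL_OF[arr["spare"][:3]] != dead_channel
        result[name] = (len(alive), 1 if spare_ok else 0)
    return result


if __name__ == "__main__":
    layout = build_arrays(spare_drive="sdd")
    for channel in (0, 1):
        print("channel", channel, "dead:", survivors(layout, channel))
```

Under these assumptions, losing the channel that carries the spare
leaves two active members per array, while losing the other channel
leaves one active member plus the spare.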
Luckily, I have an identical box available. So... I simply moved the
four disk drives from the failed machine to the new one. Silly me, I
figured it would just come up, the RAIDs would repair themselves, and
I'd be back on the air. Instead:
- I get the same devfs_mk_dir error (but if I boot from a live CD, I
DON'T get any hardware errors)
-- suggests that one of the drives is so badly corrupted that the RAID
can't rebuild
--- when I try looking at the disks (start up the Debian installer, go
into the partitioner), the partitioner freezes halfway through scanning
the drives
--- a little experimentation (pulling different drives) gets me to the
point where the partitioner will start, and sees the various partitions
---- of course, at this point, I abort - I don't want to trash any of
the data
- with the bad drive pulled, I try to boot, but all I get is a "boot
from CD" prompt
Where this leaves me:
- I don't want to trash the system (or the user data) on the drives, if
I can avoid it (obviously)
- I need to recover sufficiently to boot
- from there I'd like to try to rebuild the RAID devices and logical
volumes and see where I am
- I'm guessing that something very basic has been trashed - like the
MBR, or grub configuration
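
For the "see where I am" step, something like the following is what I
have in mind once I can get to a rescue shell with the md driver loaded.
It is a minimal Python sketch that reads the text of /proc/mdstat and
reports, per array, the active members, spares, failed members, and
whether the array is degraded. The sample text and its device names are
made up for illustration, and the parser only handles the common line
shapes.

```python
import re

HEADER = re.compile(r"^(md\d+)\s*:\s*(\S+)\s*(.*)$")
STATUS = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")
MEMBER = re.compile(r"^([^\[\s]+)\[\d+\](.*)$")

# Made-up sample; on a live system read open("/proc/mdstat").read() instead.
SAMPLE = """\
Personalities : [raid1] [raid5]
md1 : active raid5 sdd2[3](S) sdc2[2] sdb2[1] sda2[0]
      1000 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

md0 : active raid1 sdc1[2] sda1[0]
      104320 blocks [3/2] [U_U]

unused devices: <none>
"""


def parse_mdstat(text):
    """Return {array name: summary dict} parsed from /proc/mdstat text."""
    arrays = {}
    current = None
    for line in text.splitlines():
        head = HEADER.match(line)
        if head:
            name, state, rest = head.groups()
            current = {"state": state, "level": None, "active": [],
                       "spares": [], "failed": [], "degraded": None}
            for tok in rest.split():
                mem = MEMBER.match(tok)
                if mem is None:
                    # First plain token is the RAID level, e.g. "raid1".
                    if current["level"] is None and not tok.startswith("("):
                        current["level"] = tok
                    continue
                dev, flags = mem.groups()
                if "(F)" in flags:
                    current["failed"].append(dev)
                elif "(S)" in flags:
                    current["spares"].append(dev)
                else:
                    current["active"].append(dev)
            arrays[name] = current
            continue
        stat = STATUS.search(line)
        if stat and current is not None:
            wanted, working = int(stat.group(1)), int(stat.group(2))
            current["degraded"] = working < wanted or "_" in stat.group(3)
    return arrays


if __name__ == "__main__":
    for name, info in sorted(parse_mdstat(SAMPLE).items()):
        print(name, info)
```

An array whose status line shows fewer working devices than configured,
or an underscore in the [UU_] map, is flagged as degraded.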
So.... any suggestions would be very much appreciated as to:
1. rescue tools - particularly something that lets me try to mount the
existing md devices and LVMs, and then boot
2. generally restoring the system to a bootable state (mbr, grub, etc.)
3. thoughts on examining the one drive that might or might not be bad
-- diagnostic
-- if good: recovery or reformatting so I can add it back to the
RAID/LVM pool
-- if bad: how to configure a spare drive to stick it into the existing
RAID/LVM pool
Thanks VERY much.
Miles Fidelman