On Tue, 7 Sep 1999, David van der Spoel wrote:
> I run a 2.2.12 kernel + latest RAID (990824) + latest knfsd (1.4.7)
> OS is on an IDE disk, I have an 8 x 9 GB SCSI disk RAID-5 array with
> 2 SCSI controllers each having four disks (four internal, four external)
> After three days uptime (and reading 50 Gb data from tapes) the following
> crash occurred:
>
> Sep 7 01:17:06 zorn kernel: scsi0: MEDIUM ERROR on channel 0, id 4, lun 0, CDB:
could you also send the messages from the startup of the array?
> The strange thing is that after one disk failure (sdd1) another one is
> reported faulty (sda1) without there being an error from the SCSI layer.
This should not happen - but let's first see the startup messages; maybe
sda1 was already faulty when the array started up?
> With two disks down the RAID5 array is beyond automatic recovery.
> I would like to suggest the following:
>
> - When an array is not recoverable shut it down automatically, in my case
> that means after "skipping faulty sda1"
I followed the policy we use for normal disks - i.e. we report the I/O
error back to the higher levels and let them decide. For example, ext2fs
might ignore the error, panic, or remount read-only.
> - For enhanced recovery make a tool that can make a superblock backup on
> another disk (and restore it)
So which superblock would you restore? The problem was not a lack of
utilities - you can restore the wrong superblock just as easily as you can
misconfigure /etc/raidtab. But I agree that we should work on avoiding
situations like yours in the future.
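(For what it's worth, such a backup needs little more than dd with the
right offset: the 0.90 persistent superblock lives in the last 64 KB-aligned
64 KB of the component device. A sketch - the helper name, the example
device size and the backup path are made up for illustration, and the
offset formula assumes the 0.90 format with 512-byte sectors:)

```shell
#!/bin/bash
# sb_offset_sectors: start of the md 0.90 persistent superblock area on a
# component device, given the device size in 512-byte sectors.  Round the
# size down to a 64 KB (128-sector) boundary, then step back one 64 KB
# reservation.
sb_offset_sectors() {
    size=$1
    echo $(( (size & ~127) - 128 ))
}

# Example: back up the superblock area of a ~8.7 GB partition (size made
# up for illustration) to a file on another disk.  We only print the dd
# command here rather than touching a real device.
off=$(sb_offset_sectors 17942584)
echo "dd if=/dev/sdd1 of=/backup/sdd1.sb bs=512 skip=$off count=128"
```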
> Finally, I tried the suggestions on Jakob Østergaard's page. mkraid with one
> failed disk (sdd1) worked without error messages, but I could not mount
> the raid dev. Then I reset sdd1 to be a raid disk again. mkraid then
> happily made a raid array, and for some reason automatically started doing
> file system stuff, which screwed up everything.
You should have marked sda1 as faulty-disk. I think this is what happened:
sda1 failed and this went unnoticed for some time (with the array running
in degraded mode). Then sdd1 failed as well - this made the array
unrunnable. Then you recreated the array with the wrong raidtab info -
it's sda1 that should have been marked as faulty. The RAID code then
synced the wrong thing.
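(A sketch of what the raidtab entry should have looked like - device names
and the 8-disk layout are taken from your report, but the exact directive
set depends on your raidtools version, so double-check against the
Software-RAID HOWTO:)

```
raiddev /dev/md0
        raid-level              5
        nr-raid-disks           8
        nr-spare-disks          0
        persistent-superblock   1
        chunk-size              64

        device                  /dev/sda1
        failed-disk             0       # the disk that failed first
        device                  /dev/sdb1
        raid-disk               1
        # ... remaining six devices listed as raid-disk 2 through 7,
        # including sdd1, so mkraid rebuilds onto sda1 and nothing else
```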
> With the tool I suggested one can try to restore a superblock and then
> do a consistency check, to see if the array is recreated correctly.
> The tool should also be able to recreate the superblock in degraded
> mode.
There are so many failure scenarios that I think it's hard to do this
automatically. (But anyone is welcome to try it.)
I think further disk failures should not be written into the superblock
if the array is already in degraded mode.
-- mingo