On Tue, 7 Sep 1999, David van der Spoel wrote:

> I run a 2.2.12 kernel + latest RAID (990824) + latest knfsd (1.4.7)
> The OS is on an IDE disk; I have an 8 x 9 GB SCSI disk RAID 5 array with
> two SCSI controllers, each with four disks (four internal, four external).
> After three days of uptime (and reading 50 GB of data from tapes) the
> following crash occurred:
> 
> Sep  7 01:17:06 zorn kernel: scsi0: MEDIUM ERROR on channel 0, id 4, lun 0, CDB:

could you also send the messages from the startup of the array?

> The strange thing is that after one disk failure (sdd1) another one is 
> reported faulty (sda1) without there being an error from the SCSI layer.

this should not happen - but let's first see the startup messages; maybe
sda1 was already faulty when the array started up?

> With two disks down the RAID5 array is beyond automatic recovery.
> I would like to suggest the following:
> 
> - When an array is not recoverable, shut it down automatically; in my case
>   that means after "skipping faulty sda1"

I followed the policy we use for normal disks, i.e. we report the I/O error
back to higher levels and let them decide: ext2fs might ignore the error,
panic, or remount read-only.
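
(For illustration, ext2 already exposes the same choice to the admin via
its errors= mount option - /dev/md0 and /export below are just placeholder
names:

    mount -o errors=continue   /dev/md0 /export   # ignore and carry on
    mount -o errors=remount-ro /dev/md0 /export   # remount read-only
    mount -o errors=panic      /dev/md0 /export   # panic the box

the same spectrum of reactions could be offered for the array itself.)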

> - For enhanced recovery make a tool that can make a superblock backup on 
>   another disk (and restore it)

so which superblock would you restore? The problem was not the lack of
utilities - you can reload the wrong superblock just as easily as you can
misconfigure /etc/raidtab. But I agree that we should work on avoiding
situations like yours in the future.
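
(For what it's worth, a raw superblock copy can already be taken by hand -
a rough sketch, assuming the 0.90 superblock layout (4K of superblock data,
stored 64K-aligned at the end of the device); the size below is made up:

    # SIZE: partition size in 1K blocks, e.g. from /proc/partitions
    SIZE=8956896                     # hypothetical size of /dev/sdd1
    SB=$(( (SIZE & ~63) - 64 ))      # round down to 64K, back off 64K
    dd if=/dev/sdd1 of=sdd1.sb bs=1k skip=$SB count=4    # save
    dd if=sdd1.sb of=/dev/sdd1 bs=1k seek=$SB count=4    # restore

but as said above, this restores the wrong superblock just as easily as
the right one.)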

> Finally, I tried the suggestions on Jakob Østergaard's page. mkraid with one
> failed disk (sdd1) worked without error messages, but I could not mount 
> the raid dev. Then I reset sdd1 to be a raid disk again. mkraid then 
> happily made a raid array, and for some reason automatically started doing 
> file system stuff, which screwed up everything.

you should have marked sda1 as a failed disk in the raidtab. I think this
is what happened: sda1 failed, and this went unnoticed for some time (with
the array running in degraded mode). Then sdd1 failed as well, which made
the array unrunnable. You then recreated the array with the wrong raidtab
info - it's sda1 that should have been marked as failed - so the RAID code
synced the wrong disk.
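
To make that concrete, here is a sketch of the raidtab for such a
recreation, using the failed-disk keyword from the raidtools raidtab
format - the device names past sdd1, the chunk size etc. are guesses,
not your actual config:

    raiddev /dev/md0
        raid-level              5
        nr-raid-disks           8
        nr-spare-disks          0
        persistent-superblock   1
        chunk-size              32
        device          /dev/sda1
        failed-disk     0       # the disk that actually died first
        device          /dev/sdb1
        raid-disk       1
        device          /dev/sdc1
        raid-disk       2
        device          /dev/sdd1
        raid-disk       3       # back in as a normal raid-disk
        device          /dev/sde1
        raid-disk       4
        device          /dev/sdf1
        raid-disk       5
        device          /dev/sdg1
        raid-disk       6
        device          /dev/sdh1
        raid-disk       7

mkraid then starts the array in degraded mode without touching sda1, so
the data on the other seven disks is used as-is.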

> With the tool I suggested, one can try to restore a superblock and then
> do a consistency check to see if the array was recreated correctly.
> The tool should also be able to recreate the superblock in degraded
> mode.

there are so many failure scenarios that I think it's hard to do this
automatically. (But anyone is welcome to try it.)

I think further disk failures should not be written into the superblock
if the array is already in degraded mode - once the array is unrunnable
anyway, recording the failure only destroys state that could still be
recovered by hand.
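
Something like this, as a rough sketch in C - the types and names here are
made up, not the actual md driver interfaces:

    #include <errno.h>

    struct raid_disk  { int faulty; };
    struct raid_array {
        int nr_disks;               /* disks the array was built with */
        int working_disks;          /* disks currently operational    */
        int read_only;
        struct raid_disk disks[12];
    };

    void raid_update_superblocks(struct raid_array *array); /* writes SBs */

    int raid_mark_disk_bad(struct raid_array *array, int disk)
    {
        /* A second failure in a degraded array makes it unrunnable
         * anyway; refuse to write it into the on-disk superblocks so
         * that the pre-failure state stays recoverable by hand. */
        if (array->working_disks < array->nr_disks) {
            array->read_only = 1;   /* just stop further writes */
            return -EIO;
        }
        array->disks[disk].faulty = 1;
        array->working_disks--;
        raid_update_superblocks(array); /* persist the first failure only */
        return 0;
    }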

-- mingo
