Re: Recovering from a lost disk

Marc Haber Mon, 25 Oct 1999 10:06:55 -0700
On Mon, 25 Oct 1999 12:20:57 -0200, you wrote:
>Marc Haber wrote:
>> These tools don't seem to be included with the raidtools snapshots.
>> Where do I get them from?
> 
>They are symbolic links of raidstart (they wouldn't be in the raidtools
>source directory after a 'make', but 'make install' should create them
>in /sbin)

I see. Naturally, I didn't see them because this is a test system and
I never bothered to make install. The total absence of any docs
(neither raidhotadd, raidsetfaulty nor raidhotremove seems to have a
man page) didn't make finding my mistake any easier.

Is <commandname> --help really the only documentation besides the
source code?

After re-adding both "failed" disks, /proc/mdstat now shows:
|q:~# cat /proc/mdstat
|Personalities : [raid5]
|read_ahead 1024 sectors
|md0 : active raid5 sdc6[4] sdb5[2] sdd6[1] sda5[0] 3068672 blocks level 5, 32k chunk, 
|algorithm 2 [3/3] [UUU]
|unused devices: <none>
|q:~#

Is there a man page to explain mdstat's contents? Does sdc6 having a
[4] on a 3 disk array indicate that this is the new spare disk? Where
is [3]?
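
To make the question concrete, here is a minimal Python sketch that pulls
the member devices and their bracketed numbers out of mdstat-style text.
It assumes the md0 entry has been joined back onto one line (my mailer
wrapped it above), and the names parse_mdstat and unused_numbers are
made up for this example. It only extracts the numbers; it does not
claim to know what they mean.

```python
import re

# The output quoted above, with the wrapped md0 entry joined onto one line.
SAMPLE = """\
Personalities : [raid5]
read_ahead 1024 sectors
md0 : active raid5 sdc6[4] sdb5[2] sdd6[1] sda5[0] 3068672 blocks level 5, 32k chunk, algorithm 2 [3/3] [UUU]
unused devices: <none>
"""

# A member entry looks like "sdc6[4]": a device name, then a bracketed number.
MEMBER = re.compile(r"\b([a-z]+\d+)\[(\d+)\]")


def parse_mdstat(text):
    """Return {array: [(device, number), ...]}, sorted by bracketed number."""
    arrays = {}
    for line in text.splitlines():
        match = re.match(r"^(md\d+)\s*:\s*(.*)$", line)
        if not match:
            continue  # not an array line
        name, rest = match.groups()
        members = [(dev, int(num)) for dev, num in MEMBER.findall(rest)]
        arrays[name] = sorted(members, key=lambda pair: pair[1])
    return arrays


def unused_numbers(members):
    """List the bracketed numbers below the highest one that no device carries."""
    used = {num for _, num in members}
    return [n for n in range(max(used) + 1) if n not in used]


if __name__ == "__main__":
    md0 = parse_mdstat(SAMPLE)["md0"]
    print(md0)
    print(unused_numbers(md0))
```

For the output above this lists sda5, sdd6, sdb5 and sdc6 with the
numbers 0, 1, 2 and 4, and reports 3 as the one number nobody carries.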

>It seems the SCSI drivers try very hard to get whatever commands were
>passed to them done, and the md driver would know that one failed only when
>the low level driver gave up. One solution would be to make the low
>level driver not try so hard, but return an error sooner, but perhaps
>you wouldn't want that always: that could be fine if the disk was part
>of a raid1 or raid5, where you have some redundancy.

Yes, but trying for _two_ _hours_ is simply ridiculous. Additionally,
since the system was running off a different disk that wasn't affected
by the failure, I feel that the system should at least have remained
useable. But this is clearly a kernel issue.

>There was some talk
>of passing a parameter to the lower layers to tell them how hard to try,
>but that means changing a lot of things in the lower layers. From a bit
>of lurking on linux-kernel, it seems the SCSI layer is undergoing some
>code clean up now, and is expected to get a major rewrite in 2.5

... and by the time this makes its way into the stable 2.6 kernel, it
will be 2001.

>> This behavior can cause great additional problems because I might have
>> to reboot to get my system back up. 
>
>Yep. But people here say that most of disks would die much more
>gracefully than when you pull the power/data cable,

We might be talking about a failed power supply in an external SCSI
case, which would mean exactly the same thing to the system. From what
I currently see, it is almost impossible to have the system recover
gracefully from a failure. This ....... 

>But yes, you might have to raidhotremove the disk and tell
>the scsi layer to drop it:
>  echo "scsi remove-single-device a b c d" > /proc/scsi/scsi
>would do the trick (a=adapter ID, b=channel, c=device ID, d=LUN) if
>the disk isn't being used anymore, and worst case you would have
>to reboot.

Then I had this "worst case" because the system became entirely
unuseable.
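
For the record, the command quoted above could be wrapped roughly like
this. It is a sketch only: the function names are made up, the four
numbers are whatever adapter, channel, device ID and LUN the disk
actually has, and writing to /proc/scsi/scsi needs root and a system
that is still responsive, which mine was not.

```python
def scsi_remove_command(adapter, channel, device_id, lun):
    """Build the 'remove-single-device' line from the quoted echo command."""
    for value in (adapter, channel, device_id, lun):
        if not isinstance(value, int) or value < 0:
            raise ValueError("adapter, channel, ID and LUN must be non-negative integers")
    return f"scsi remove-single-device {adapter} {channel} {device_id} {lun}"


def send_scsi_command(command, path="/proc/scsi/scsi"):
    """Write the command to the SCSI proc file, as the echo redirection does."""
    with open(path, "w") as proc_file:
        proc_file.write(command + "\n")


if __name__ == "__main__":
    # Hypothetical address: adapter 0, channel 0, device ID 2, LUN 0.
    print(scsi_remove_command(0, 0, 2, 0))
```

Keeping the string building separate from the write means the command
can be checked on screen before anything is sent to the kernel.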

>If you create the arrays with persistent superblocks (default for
>raidtools 0.90), have autodetection of raid built into your kernel (or
>as a module in an initrd) and your raid partitions are of type fd
>(instead of 83), you will not need to worry about that too much: you
>will get the right partitions in the right order into your array even if
>the disks "change name".

This would cover the RAID disks, but not the unRAIDed disks that might
be present for booting or for holding transient data.

>You need an up to date raidtab if you run mkraid to write new
>superblocks into the raid partitions though.

Since the data is already in the superblocks, it could be possible to
create a raidtab from there. Is there a tool that does so?

However, mkraid is destructive, so I probably won't need that since
there is no data to save.
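
To illustrate what such a tool would have to produce, here is a rough
Python sketch that renders a raidtab entry from values that are already
known. It does not read any superblock; render_raidtab is a made-up
helper, the device names in the example are placeholders, and the order
of the device list is assumed to be the raid-disk order.

```python
def render_raidtab(md_device, level, chunk_kb, devices):
    """Render one raidtab entry; devices must be given in raid-disk order."""
    lines = [
        f"raiddev {md_device}",
        f"    raid-level {level}",
        f"    nr-raid-disks {len(devices)}",
        "    nr-spare-disks 0",
        "    persistent-superblock 1",
        f"    chunk-size {chunk_kb}",
    ]
    for index, device in enumerate(devices):
        # Each device line is followed by the slot it occupies in the array.
        lines.append(f"    device {device}")
        lines.append(f"    raid-disk {index}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder device names, not the partitions of the array above.
    print(render_raidtab("/dev/md0", 5, 32,
                         ["/dev/sdx1", "/dev/sdy1", "/dev/sdz1"]), end="")
```

The hard part, getting level, chunk size and disk order out of the
superblocks, is exactly what this sketch leaves out.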

Thanks, too, to everyone else who answered my original question
on the list and privately. I really appreciate that.

One more thing: In a spare minute, I straced scsidev and found out
that it has a buglet: It obviously doesn't check the return code of
open("/etc/scsi.alias") and proceeds to use the file even if it doesn't
exist. This was leading to the segfaults I had been experiencing.
Creating an empty /etc/scsi.alias solved the problem.
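
The missing check amounts to something like the following, sketched in
Python rather than in scsidev's own C. The function name is made up and
I have not looked at how scsidev actually parses the file; the point is
only that a missing file is handled the same way as an empty one.

```python
def read_alias_lines(path="/etc/scsi.alias"):
    """Return the non-blank lines of the alias file, or [] if it is missing."""
    try:
        with open(path) as alias_file:
            return [line.rstrip("\n") for line in alias_file if line.strip()]
    except FileNotFoundError:
        # No alias file: behave as if it existed but were empty.
        return []


if __name__ == "__main__":
    print(read_alias_lines())
```

With this behaviour the workaround of creating an empty file would not
be needed.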

Greetings
Marc

-- 
-------------------------------------- !! No courtesy copies, please !! -----
Marc Haber          |   " Questions are the         | Mail address in header
Karlsruhe, Germany  |     Beginning of Wisdom "     | Fon: *49 721 966 32 15
Nordisch by Nature  | Lt. Worf, TNG "Rightful Heir" | Fax: *49 721 966 31 29