Ed Wilts said:

> On Wed, Dec 18, 2002 at 02:52:28PM -0800, nate wrote:
>> have you tested your raid1 ? I have only had 1 failure with software
>> raid 1 on linux out of about 10 arrays. And when it failed the system
>> went with it. IMO, the point of software raid is to protect data, not
>> protect uptime. Data can still be lost due to unclean unmount and stuff,
>> but much of the data would be intact.
>
> I'll disagree with your opinion. I have been managing VMS systems for
> about 20 years, and the last 12 or so have been managing them with
> software-based RAID. I have *never* lost a system when a shadowed disk
> failed, and I have *never* lost data. I've never lost data unless I've
> had multiple simultaneous failures (yes, this has happened a few times).
I haven't used VMS myself, but from what I've heard it runs on much more
robust hardware than the typical x86 PC with IDE disks; I thought it ran on
Alpha-based systems? A disk failure by itself can take out a system simply
because the system is not fault tolerant: the controller freaks out because
of the failure and the OS cannot compensate for the hardware failure. SCSI
is different, since a SCSI device can disconnect itself from the bus
(usually done automatically, but it can be done manually as well[1]), which
vastly improves the situation.

Another example of this: many CDROM drives have trouble reading CD-R media.
I don't know if you've experienced this, but I sure have. Put the wrong
kind of media in, or an incompatible kind, try transferring some files
(easiest to reproduce with large 100MB+ files), and the system can easily
crash just from some I/O errors.

I am not a programmer, but I had a similar discussion on another linux list
recently, and a programmer stepped in with some more technical detail:

"I'm not a Linux kernel programmer, but I've worked on device drivers and
firmware for many systems. There are always some hardware errors you cannot
recover from, though the details will vary based on the situation. For
every strategy you can devise, you can find another class of hardware
errors you simply cannot recover from. For example, if I program a DMA
controller to transfer bytes from address x to address y but the controller
sends it somewhere else, I'm hosed. When I program a PCI bus master to do a
burst transfer, I *expect* it to obey the rules. I do not checksum all of
my memory and then verify that it did not change. If the hardware breaks
the rules, there is very little you can do to recover. For example,
depending on the OS and architecture, the I/O error might erase the very
code that is supposed to recover from the error!"
thread here ->
http://www.geocrawler.com/mail/thread.php3?subject=My+first+Linux+crash&list=199

> Then the RAID subsystem is broken. With hardware RAID, the OS does not
> have to know that there is any hardware protection underneath.

Exactly. There is really no raid 'subsystem', it's all in hardware, yet
the filesystem driver detects lots of bad sectors and dies. This should
not be possible in a hardware raid environment, especially since there is
another perfectly intact copy of the data on the 2nd disk. Another poster
mentioned he has seen similar results on several other hardware raid
controllers.

> I've had dozens of SCSI disks fail. It's the luck of the draw - the media
> is the same these days.

I keep seeing people say that, but I can't believe it. I used to believe
it. IDE disks get bad sectors so much faster than SCSI disks. I also see
people say that the only difference between IDE and SCSI is the
electronics on the drive; I don't believe that either. Perhaps at one
point that was true. And when I say I have had a lot of disks fail, I am
talking about in the range of 25-35% failures for IDE disks and less than
5% for SCSI. For IBM IDE disks specifically, the failure rates jump to
about 60-75%. Dozens of SCSI disks is a lot, but if your systems have
hundreds of disks then it may not be so much by comparison. Then there is
the recent fiasco with Fujitsu drives: the manufacturer is claiming a 3%
failure rate, but the evidence points to a 75-85% failure rate for
specific models, from what I've read. I've not had experience with their
drives myself. Faulty raid controller design? Maybe.

I can only remember having 1 SCSI disk fail on a SCSI raid controller in
the past 5 years under normal operation, and the system didn't go down
(3-disk raid5). Though my coworker had a compaq DL380 with hardware raid
overheat (the A/C failed), and 1 disk of a 2-disk mirror died. The system
no longer booted (it only had 2 disks).
It wasn't until the disk was replaced that the system booted again, and
even then they had to reformat it and reinstall everything; ever since,
it's worked flawlessly.

VMS I believe is more industrial, but much of that is probably due to the
design of the hardware, perhaps multiple independent data busses? Ideally
software raid should work; I think much of the blame goes to the hardware
underneath. PC hardware just doesn't cut it for the most part. Anyways, I
really wish I could get software raid that would work in such a fashion.
I've worked with tru64 on the alpha (none of the alphas had any raid on
them though), but haven't tried VMS.

I have heard worse stories though: some of the cheap highpoint and promise
raid controllers won't boot at all without a fully functional array, that
is, if a disk dies the other disk is useless until the first disk is
replaced, then the system can boot. I have also read stories about the
controller "forgetting" it has a raid array entirely, for no apparent
reason.

I just wanted to make that point. You may not agree, but that's my
experience with software raid, so the original poster shouldn't get their
hopes up. Or at the very least, test the configuration.

Another quick story.. one time about a year ago I was booting my system at
work after doing some work to the insides. My one IDE disk was in one of
those removable drive bays, but it was not plugged in all the way (about
75%). I didn't notice it at first. The system booted up and didn't detect
the IDE disk (the system had a SCSI disk for the OS). I saw it wasn't
plugged in all the way. I knew I couldn't get the system to re-detect the
disk without a reboot, but thought it wouldn't be too bad if I just
plugged it in; after all, it was a single connector, not like I was trying
to plug a raw disk in. So I pushed it in, didn't even "lock" the drive bay
to power the disk up, and *BAM* the system cold rebooted instantly.
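For anyone who does want to test a linux software raid1 before trusting
it, one rough sketch of a drill is below. It assumes the mdadm tool is
installed, and /dev/md0 and /dev/sdb1 are example names for your array
and one of its member disks, so adjust for your setup. RUN=echo makes it
a dry run that only prints the commands; only clear RUN on a scratch
array you can afford to break.

```shell
#!/bin/sh
# Drill for a raid1 mirror: fail one half, pull it, re-add it, and
# confirm the array keeps running and resyncs.
# /dev/md0 and /dev/sdb1 are EXAMPLE device names -- change them.
RUN=echo   # set RUN= (empty) to really execute, as root

$RUN mdadm --manage /dev/md0 --fail /dev/sdb1    # mark one half as failed
$RUN mdadm --manage /dev/md0 --remove /dev/sdb1  # pull it from the array
$RUN mdadm --manage /dev/md0 --add /dev/sdb1     # re-add it; resync starts
$RUN cat /proc/mdstat                            # watch the rebuild progress
```

If the box stays up through the --fail step and /proc/mdstat shows the
resync afterward, the array behaves the way the VMS folks expect; if the
system goes down with the disk, better to find out now than in production.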
I was planning to reboot the machine quickly anyways, but was not
expecting the system to hard reboot. The hardware just freaked.

Anyways, good discussion, nice to have a real discussion from time to
time. I get so bored of the 1-line replies to questions sometimes :)

nate

[1] I have a system which has 2 external SCSI CDROMs (1 CDROM, 1 CDRW),
and the enclosure they are in is VERY noisy. So I power it up when I want
to use it, tell the /proc filesystem to re-scan the SCSI bus on those
particular SCSI IDs (2 and 3), and it sees the drives. Then I use them,
and when I'm finished I tell /proc to remove those devices from the SCSI
chain; once that's done I can power them off. Same with the external SCSI
DAT, I can easily remove it and add it at will (can't have them both at
the same time since I only have 1 scsi cable). IDE would just go crazy if
someone tried that!

--
redhat-list mailing list
unsubscribe mailto:[EMAIL PROTECTED]?subject=unsubscribe
https://listman.redhat.com/mailman/listinfo/redhat-list
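P.S. For the curious, the /proc rescan trick in footnote [1] above looks
roughly like this on a 2.x kernel. The four numbers are host, channel,
id, and lun; host/channel/lun of 0 and IDs 2 and 3 are just this
example's values. The echo lines below only print the control strings;
to actually perform the add/remove, redirect each one into
/proc/scsi/scsi as root, as shown in the trailing comments.

```shell
#!/bin/sh
# Hot-add and hot-remove SCSI devices via the /proc/scsi/scsi interface.
# Command format: "scsi add-single-device <host> <channel> <id> <lun>".
ADD="scsi add-single-device"
REMOVE="scsi remove-single-device"

# After powering the enclosure on, probe IDs 2 and 3 on host 0:
echo "$ADD 0 0 2 0"      # > /proc/scsi/scsi  (as root)
echo "$ADD 0 0 3 0"      # > /proc/scsi/scsi

# ...use the drives...

# Before powering the enclosure off, detach them cleanly:
echo "$REMOVE 0 0 2 0"   # > /proc/scsi/scsi
echo "$REMOVE 0 0 3 0"   # > /proc/scsi/scsi
```

Removing a device that is still mounted or in use is a bad idea, so
unmount first; `cat /proc/scsi/scsi` shows what the kernel currently
knows about.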