Ed Wilts said:
> On Wed, Dec 18, 2002 at 02:52:28PM -0800, nate wrote:
>> have you tested your raid1 ? I have only had 1 failure with software
>> raid 1 on linux out of about 10 arrays. And when it failed the system
>> went with it. IMO, the point of software raid is to protect data, not
>> protect uptime. Data can still be lost due to unclean unmount and stuff,
>> but much of the data would be intact.
>
> I'll disagree with your opinion.  I have been managing VMS systems for
> about 20 years, and the last 12 or so have been managing them with
> software-based RAID.  I have *never* lost a system when a shadowed disk
> failed, and I have *never* lost data.  I've never lost data unless I've
> had multiple simultaneous failures (yes, this has happened a few times).

I haven't used VMS myself, but from what I've heard it runs on much
more robust hardware than the typical x86 PC with IDE disks - I
thought it ran on something Alpha-based? On a PC, a disk failure by
itself can take out the system simply because the hardware is not
fault tolerant: the controller freaks out over the failure and the OS
has no way to compensate. SCSI is different, since a SCSI device can
disconnect itself from the bus (usually done automatically, but it
can be done manually as well[1]), which vastly improves the situation.
Another example: many CDROM drives have trouble reading CD-R media. I
don't know if you've experienced this, but I sure have - put the wrong
or an incompatible kind of media in, try transferring some files
(works best on large 100MB+ files), and the system can easily crash
just from the I/O errors. I am not a programmer, but I had a similar
discussion on another Linux list recently and a programmer stepped in
with some more technical detail -

"I'm not a Linux kernel programmer, but I've worked on device drivers
and firmware for many systems. There are always some hardware errors
you cannot recover from, though the details will vary based on the
situation. For every strategy you can devise, you can find another
class of hardware errors you simply cannot recover from.

For example, if I program a DMA controller to transfer bytes from
address x to address y but the controller sends it somewhere else, I'm
hosed. When I program a PCI bus master to do a burst transfer, I
*expect* it to obey the rules. I do not checksum all of my memory and
then verify that it did not change.

If the hardware breaks the rules, there is very little you can do to
recover. For example, depending on the OS and architecture, the I/O
error might erase the very code that is supposed to recover from the
error!"

thread here ->
http://www.geocrawler.com/mail/thread.php3?subject=My+first+Linux+crash&list=199


> Then the RAID subsystem is broken.  With hardware RAID, the OS does not
> have to know that there is any hardware protection underneath.

Exactly. There is really no RAID 'subsystem' - it's all in hardware -
yet the filesystem driver detects lots of bad sectors and dies. That
should not be possible in a hardware RAID environment, especially when
there is a perfectly intact copy of the data on the 2nd disk. Another
poster mentioned he has seen similar results on several other hardware
RAID controllers.


> I've had dozens of SCSI disks fail. It's the luck of the draw - the media
> is the same these days.

I keep seeing people say that but I can't believe it. I used to
believe it. IDE disks get bad sectors so much faster than SCSI disks.
I also see people say that the only difference between IDE and SCSI
is the electronics on the drive; I don't believe that either. Perhaps
at one point it was true. And when I say I have had a lot of disks
fail, I am talking about failure rates in the range of 25-35% for IDE
disks and less than 5% for SCSI. For IBM IDE disks specifically, the
failure rate jumps to about 60-75%. Dozens of SCSI disks is a lot, but
if your systems have hundreds of disks it may not be so much by
comparison. Then there is the recent fiasco with Fujitsu drives: the
manufacturer claims a 3% failure rate, but from what I've read the
evidence points to 75-85% for specific models. I've had no experience
with their drives myself.

Faulty RAID controller design? Maybe. I can only remember having one
SCSI disk fail on a SCSI RAID controller in the past 5 years under
normal operation, and the system didn't go down (3-disk RAID5). My
coworker, though, had a Compaq DL380 with hardware RAID overheat (the
A/C failed) and one disk of the 2-disk mirror died. The system no
longer booted (it only had the 2 disks), and it wasn't until the disk
was replaced that it booted again - even then they had to reformat
and reinstall everything. It has worked flawlessly ever since.

VMS, I believe, is more industrial, but much of that is probably due
to the design of the hardware - perhaps multiple independent data
buses?

Ideally software RAID should work; I think much of the blame goes to
the hardware underneath. PC hardware just doesn't cut it for the most
part.

Anyway, I really wish I could get software RAID that worked in such a
fashion. I've worked with Tru64 on the Alpha (though none of those
Alphas had any RAID on them); I haven't tried VMS.

I have heard worse stories, though: some of the cheap Highpoint and
Promise RAID controllers won't boot at all without a fully functional
array - if a disk dies, the other disk is useless until the failed one
is replaced, and only then can the system boot. I have also read
stories about those controllers "forgetting" they have a RAID array
entirely, for no apparent reason.

I just wanted to make that point. You may not agree, but that's my
experience with software RAID, so the original poster shouldn't get
their hopes up. Or at the very least, test the configuration first.
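
For anyone who does want to test, the newer mdadm tool makes it easy
to simulate a failure. Something like the following (device names are
just examples for a 2-disk RAID1 on /dev/md0; the older raidtools
have equivalents like raidsetfaulty/raidhotremove/raidhotadd):

  # watch the array state before and after
  cat /proc/mdstat

  # mark one half of the mirror failed, then pull it out of the array
  mdadm /dev/md0 --fail /dev/hdc1
  mdadm /dev/md0 --remove /dev/hdc1

  # put it (or a replacement) back and let the mirror resync
  mdadm /dev/md0 --add /dev/hdc1

If the box survives that while you hammer the filesystem, you at
least know the md layer behaves; whether it survives a real drive
dropping dead on the IDE bus is another question.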

Another quick story: about a year ago I was booting my system at work
after doing some work to the insides. My one IDE disk was in one of
those removable drive bays, but it was only plugged in about 75% of
the way, and I didn't notice at first. The system booted up and didn't
detect the IDE disk (the system had a SCSI disk for the OS). Then I
saw it wasn't plugged in all the way. I knew I couldn't get the system
to re-detect the disk without a reboot, but figured it wouldn't be too
bad if I just plugged it in - after all, it was a single connector,
not like I was trying to plug in a raw disk. So I pushed it in, didn't
even "lock" the drive bay to power the disk up, and *BAM* - the system
cold rebooted instantly. I was planning to reboot the machine quickly
anyway, but I was not expecting a hard reboot. The hardware just
freaked.

Anyway, good discussion - nice to have one from time to time. I get
so bored of the one-line replies to questions sometimes :)

nate

[1] I have a system with 2 external SCSI CDROM drives (1 CDROM, 1
CDRW), and the enclosure they are in is VERY noisy. So I power it up
only when I want to use it, tell the kernel via /proc to re-scan the
SCSI bus for those particular SCSI IDs (2 and 3), and it sees the
drives. When I'm finished I tell /proc to remove those devices from
the SCSI chain, and once that's done I can power them off. Same with
the external SCSI DAT - I can remove and add it at will (can't have
both at the same time since I only have 1 SCSI cable). IDE would just
go crazy if someone tried that!
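
In case anyone wants to do the same, the commands are just echoes
into /proc/scsi/scsi. The host/channel/id/lun numbers below are from
my setup, so adjust for yours:

  # power the enclosure up, then tell the kernel to probe IDs 2 and 3
  echo "scsi add-single-device 0 0 2 0" > /proc/scsi/scsi
  echo "scsi add-single-device 0 0 3 0" > /proc/scsi/scsi

  # when finished (and nothing is mounted/using them), detach them
  # before powering the enclosure off
  echo "scsi remove-single-device 0 0 2 0" > /proc/scsi/scsi
  echo "scsi remove-single-device 0 0 3 0" > /proc/scsi/scsi

  # 'cat /proc/scsi/scsi' shows what the kernel currently knows about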




