Greg Oster wrote:
> "Peter Fraser" writes:
>> I had a disk drive fail while running RAIDframe.
>> The system did not survive the failure. Even worse
>> there was data loss.
> 
> Ow.  

Welcome to the REALITY of RAID.

If you rely on RAID to always work and never go down, you Just Don't
Understand.

...
> You haven't said what types of disks.  I've had IDE disks fail that 
> take down the entire system.  I've had IDE disks fail but the system 
> remains up and happy.  I've had SCSI disks fail that have made the 
> SCSI cards *very* unhappy (and had the system die shortly after).  
> None of these things can be solved by RAIDframe -- if the underlying 
> device drivers can't "deal" in the face of lossage, RAIDframe can't 
> do anything about that...

It doesn't matter what type of drive, and it doesn't really matter what
the device drivers do: there are PLENTY of things that CAN and WILL take
down every drive on the same channel as the failed drive.  There are even
plenty of failures on the drive itself that will jump across channels
(imagine a nice little despiking cap shorting out, slamming your 5V line
to ground for a moment, until it turns into a puff of smoke.  Yes, I've
seen this).  RAID can help you get back up faster, but it can't keep you
from ever going down.
...
>> And even more to my surprise, about two days
>> of my work disappeared.
> 
> Of course, you just went to your backups to get that back, right? :)
> 
>> I believe the disk drive died about 2 days before
>> the crash. I also believe that RAIDframe did
>> not handle the disk drive's failure correctly
> 
> Do you have a dmesg related to the drive failure?  e.g. something 
> that shows RAIDframe complaining that something was wrong, and 
> marking the drive as failed?  
> 
>> and as a result all file writes to the failed
>> drive queued up in memory,
> 
> I've never seen that behaviour...  I find it hard to believe that 
> you'd be able to queue up 2 days' worth of writes without a) any reads 
> being done or b) noticing that the filesystem was completely 
> unresponsive when a write of associated meta-data never returned...  
> (on the first write of meta-data that didn't return, pretty much all
> IO to that filesystem should grind to a halt.  Sorry.. I'm not buying 
> the "it queued up things for two days"... )

Agreed.
HOWEVER...
I have seen (and heard reports of) OpenBSD firewalls running for LONG
periods of time with a failed hard disk.  Don't ask me how...the thing
kept putting scary messages on the screen (so obviously it was
using..er..trying to use...the bad spots), and kept filtering packets
until the power went out, and of course, it didn't come back up.  Could
lead someone to think the disk wasn't that important. :)
...
>> I am now trying ccd for my web pages and 
>> ALTROOT in daily for root, I have not had a disk
>> fail with ccd yet, so I have not determined whether
>> ccd works better.
> 
> "Good luck."  (see a different thread for my thoughts on using ccd :)

More than that...he doesn't understand the nature of RAID.

If hardware breaks, don't expect everything else to keep working.  Hope,
sure.  Expect?  No.  I don't care if you are talking about ccd,
RAIDframe, or hardware RAID.  Your machine can still go down due to a
disk failure.  People who don't believe me have just been lucky.  So far.
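
(For what it's worth, the ALTROOT bit Peter mentions isn't magic either --
it's just the stock daily(8) nightly copy of the root filesystem to a
second disk.  A rough sketch, assuming that second disk shows up as wd1
with an 'a' partition at least as big as root; check daily(8) on your
release for the exact knobs:

    # /etc/fstab -- give the daily script a target for the root copy
    /dev/wd1a /altroot ffs xx 0 0

    # /etc/daily.local -- actually turn the nightly root backup on
    ROOTBACKUP=1

It's a once-a-night copy, not a mirror, so anything written since the
last run is still gone when the primary dies.)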

Further, if you wait until a disk fails to find out how things work, you
are a fool.  The worst down-time disasters I've seen involved RAID systems
where people expected magic to happen when something went wrong.

>> Neither RAIDframe nor ccd seems to be up to the
>> quality of nearly all the other software
>> in OpenBSD. This statement is also true of the documentation.

Well, one of the people who write documentation (me) can't mention
RAID without yelling at people who think it will haul their butts out of
the fire under all circumstances.  That gets a little off-topic for
official documentation, so the editing and re-editing process is pretty
painful. :)

Yes, some drivers might be more tolerant of a disk failure than others,
but disk failure is something that's almost impossible to
simulate...and disk failures rarely happen on cue [insert nailgun
comment here], so testing, debugging and improving hardware failure
handling is not a very easy task...and you go through a lot of hard
disks.  Non-destructive testing is just the best the real world can do;
it doesn't accurately simulate most types of hard disk failures.
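
(About the closest you get to a failure on cue, short of the nailgun, is
telling RAIDframe that a component died and watching what the box does.
A rough sketch, assuming raidctl(8) and a raid0 built from wd0a and wd1a --
device names picked purely for illustration:

    # mark one component as failed, without starting a reconstruction
    raidctl -f /dev/wd1a raid0

    # check component status -- wd1a should now show as failed
    raidctl -s raid0

    # later, rebuild back onto that (or a replacement) component
    raidctl -R /dev/wd1a raid0

That exercises RAIDframe's failure handling, but it tells you nothing
about how your disk driver copes with a drive that's wedging the whole
channel, which is usually the part that hurts.)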

Nick.
