Steve Shockley wrote:
> RedShift wrote:
>> Anyone got any similar experiences with hardware RAID cards? Hardware 
>> RAID has always been misery for me.
> 
> I've had two instances where older Adaptec RAID cards had a disk failure 
> and then reverted to a week-old copy of the data.  I'm not quite sure 
> how that's possible, but having it happen on two different machines, at 
> two different employers, in two different brands of servers (Dell, HP 
> Netserver) made me a real believer in Adaptec.
> 
> I've had generally good luck with Compaq/HP and LSI controllers.

I'll give you a fairly realistic possible explanation:

For whatever reason, at least some SCSI drives in at least some RAID
systems (I saw a lot of it on SCSI Dell PERC cards) will just hop
off-line.  Nothing is wrong with the drive: if you pull it out and
put it back in, either in SW or physically, it will go back on-line
and happily rebuild.  Curiously, these same PERC cards also lacked
any kind of beeper to let you know they were in any kind of degraded
mode.  They would turn the deactivated drive "orange" instead of green,
but even that was in dispute with one of our offices which managed to
have two drives fail in a RAID10 array and swore there were never any
orange lights on the machine ("I was checking!! Really!").

SO, I suspect that one of your drives hopped off-line and no one
noticed.  A week later, the OTHER drive failed.  Either the system
was then rebooted or maybe it just figured, "Hey, let's see if we
can revive that other drive" on its own, and ta-da, you are running
with week-old data.  And no, I can't prove it...
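
To make that sequence concrete, here is a toy replay of it in Python.
It is only a sketch of the scenario I'm guessing at: the Mirror class
and its "revive without checking for staleness" behavior are my
assumptions, not the documented behavior of any real controller.

```python
class Mirror:
    """Toy two-disk mirror; each 'disk' just holds the last value written."""

    def __init__(self):
        self.data = {"disk0": None, "disk1": None}
        self.online = {"disk0": True, "disk1": True}

    def write(self, value):
        # Writes only reach members that are currently online.
        for name in self.data:
            if self.online[name]:
                self.data[name] = value

    def drop(self, name):
        # The member goes offline but keeps whatever it last held.
        self.online[name] = False

    def revive(self, name):
        # Assumed controller behavior: bring a member back on-line
        # without checking whether its contents are stale.
        self.online[name] = True

    def read(self):
        # Read from the first member that is on-line.
        for name in self.data:
            if self.online[name]:
                return self.data[name]
        raise RuntimeError("no on-line members")


def replay():
    m = Mirror()
    m.write("day 0")
    m.drop("disk0")                # nobody notices the orange light
    for day in range(1, 8):
        m.write(f"day {day}")      # a week of writes lands on disk1 only
    m.drop("disk1")                # now the OTHER drive fails
    m.revive("disk0")              # reboot, or the card revives the old drive
    return m.read()


print(replay())  # prints "day 0": the array is back, a week out of date
```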


[rest of this isn't aimed at Steve...]

RAID is a complexity, and complexity is the enemy of security and
reliability.  It *may* help protect against data loss.  It *may*
keep you running.  It *may* also be the cause of the data loss or
downtime.

PROPERLY implemented, RAID can be a part of your event recovery
process.  It certainly can give you performance gains.  But if you
don't understand the system in your hands, it will most likely
bite you hard at some point.

Alternatives should be considered: many apps such as firewalls
and DNS servers don't need/want RAID at all, as you can "mirror"
entire MACHINES.  At that point, the disk failure becomes a special
case of "system failure" and you are ready for it.  RAID becomes
simply an unneeded complexity.

For many systems, L. V. Lammert's rsync system (or even dump/restore)
to a second disk in the system is wonderful.  Done properly, it can
be SUPERIOR to RAID for some apps, in that it gives you a roll-back
if you make an error on a change or upgrade...and a number of other
"failure modes" where you wish you could "roll back" to a previous
version.
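
As a rough illustration of the roll-back idea (this is NOT Lammert's
actual setup, which I'm not reproducing here), the sketch below builds
an rsync command that copies a tree into a dated directory on the
second disk.  The paths and the snapshot_command helper are
hypothetical; -a, --delete and --link-dest are standard rsync options.

```python
import datetime
import posixpath


def snapshot_command(source, backup_root, today, previous=None):
    """Build an rsync command that copies `source` into a dated directory."""
    dest = posixpath.join(backup_root, today.isoformat())
    cmd = ["rsync", "-a", "--delete"]
    if previous is not None:
        # Files unchanged since the previous snapshot are hard-linked to
        # it, so each dated copy only costs the space of what changed.
        link = posixpath.join(backup_root, previous.isoformat())
        cmd.append("--link-dest=" + link)
    # Trailing slashes: copy the contents of source into dest.
    cmd += [source.rstrip("/") + "/", dest + "/"]
    return cmd


# Hypothetical paths: /home is the live data, /backup is the second disk.
cmd = snapshot_command(
    "/home",
    "/backup",
    today=datetime.date(2024, 1, 2),
    previous=datetime.date(2024, 1, 1),
)
print(" ".join(cmd))
```

Hand the list to subprocess.run(cmd, check=True) from a nightly job.
Rolling back then means copying from an older dated directory, which
is exactly what a mirror cannot give you.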

The question of "HW vs SW RAID" is wrong.  The question is
"understood vs. not understood RAID solutions".  I understood very
well the old Netware 3/4 software mirroring, and had complete faith
in it, and had the experience to prove it in a number of cases.  On
the other hand, I saw a lot of systems that were completely hosed
because people DIDN'T understand the system and expected magic to
happen (or someone else to be on call) when the system failed.  Same
thing goes for HW RAID.  HW RAID is "easy" to get running, but that
usually means you have NO idea how it is really working, and that
makes it less likely you will know how to get it back to a fully
functional state AFTER an event.

In most (yes, really, I'm convinced it is the vast majority) cases,
people make the error of thinking "getting it running" is the
challenge.  NO!!  The point of RAID (and the rest of your system)
is to keep your system serviceable AFTER something goes horribly
wrong.  What happens when the system goes down hard?  How do you bring
the system back to a happy state after a drive failure?  What happens
if you try to stick too small a drive in?  (Yes, it won't work, but how
will it inform you that the new drive is one pseudo-cylinder smaller
than the old ones?  Knowing that will save you major headaches when
you can no longer get the exact model of drive you had in place
before...or when the mfg changes the drive specs without changing the
model number (yes, that happened to a friend of mine).)
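
One way to avoid finding this out from a cryptic controller message is
to compare sizes yourself before swapping anything.  A minimal sketch,
assuming you already know (from the drive label or your OS tools) how
many sectors the array needs per member and how many the replacement
has; check_replacement is a hypothetical helper, not part of any RAID
tool.

```python
def check_replacement(required_sectors, replacement_sectors, sector_size=512):
    """Say whether a replacement drive is big enough for the array."""
    shortfall = required_sectors - replacement_sectors
    if shortfall <= 0:
        return "ok"
    return (f"replacement is {shortfall} sectors "
            f"({shortfall * sector_size} bytes) too small")


# Made-up numbers: the "same model" drive came back 16 sectors short.
print(check_replacement(1000, 1000))  # prints "ok"
print(check_replacement(1000, 984))
```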

Moral: learn your RAID system.  Whatever it is, you have to understand
it.

Nick.
