Rogier Wolff said:

> No.
>
> Think RAID.

Think CPU fan fails, CPU overheats, CPU fails, system crashes.

Think memory goes bad (it does happen), system crashes.

Think power supply blows out, system crashes. I have also had
a power supply trip out a UPS. I happened to come into the office
on a Saturday to hear one of my UPSs blaring at me; it took a while
to troubleshoot, but it turned out a power supply had fried
(the system was running at the time) and tripped a breaker on the
UPS, preventing it from using AC power. I have also come home after
a power glitch to find one of my UPSs fried, and the systems
connected to it crashed (though no damage occurred to the connected
equipment other than a blown fuse).

Think motherboard dies (rare, but it can happen), system dies.

Think network controller dies; depending on how it dies (e.g. a
surge on the ethernet cable), it can take the system with it.

Think disk controller failure, which takes the system with it.

Think ice storm (like what happened on the east coast of the US
this past week) which knocks power out for a week.

Think HVAC failure: the server room turns into an oven, and systems
start overheating and frying themselves left and right. One of
my offices had their server room A/C fail; within minutes the room
was at about 100 degrees. They frantically tried to shut down
equipment. Luckily this was during business hours so they could
respond quickly, and only a couple of disks failed. One system was
down for 3 days (a Compaq with hardware RAID; when the 2nd disk of
the RAID 1 array failed, the system failed to boot, since it only
had 2 disks, and it took a few days to get a replacement from
Compaq, who weren't in a rush).

Think UPS failure. I have read reports that at least APC UPSs
(probably others too) can completely lose power during one of their
automated battery tests if the battery is no good (that is, the
battery cannot sustain the load for the 2 seconds that the test
runs). That can wipe out the systems right there.

I could try to think of more, but it's 2:15 AM :)

RAID only solves one of the many points of failure. If any of
the above fails, without redundancy your system can be toast.
Best case scenario, you can swap a CPU or RAM out and have
the system back up in 10-15 minutes (that's VERY optimistic), and
that probably kills your 5 9s of uptime right there. In my
experience even some of the most knowledgeable people can take a
long time to troubleshoot a problem; tracing hardware problems
isn't easy.

But even with RAID! Take the 3ware 6800 8-port hardware IDE RAID
card: EVERY SINGLE TIME a drive dies, the system CRASHES. Granted,
IDE RAID is cheap shit, but still. This is in a 6-disk RAID 10 array.

> In RAID it is acceptable that any one harddrive goes completely
> bad.

Yes, but it is short-sighted to think RAID alone can provide
5 9s of uptime.

> So, if you have a computer that allows it's network card to fail,  sends
> you an Email requesting a new network card, and that you can change the
> network card all while the computer is still completelly functional, then
> that's great. One way to solve the problem.

Care to explain how it can email you when the network card is
fried? :) Best case, perhaps you're resourceful enough to have
another system monitoring the network, and it notices the system is
not responding. How fast can you troubleshoot and repair? If it
takes longer than a few minutes you've lost that 5 9s of
availability. And those few minutes INCLUDE system shutdown and
system power-up. Most of my Linux servers can easily take 5 minutes
to hard reboot, not taking into account swapping hardware like a NIC
or memory or a CPU. Having my L440GX+ motherboards do a "full"
memory test on 1GB of RAM alone can take 3 minutes. That's excluding
the SCSI scan, the NIC boot ROM initialization (I can't seem to turn
the damn thing off), and starting to boot the OS. By the time it's
up it can mean about 7 minutes (from start of shutdown to being
back up).
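
The "another system monitoring" part is the easy bit. Something like
the following sketch would do the noticing (Python, with made-up
host names and addresses; the mail relay obviously has to be
reachable without going through the dead box's NIC):

#!/usr/bin/env python
# Minimal external watchdog sketch: ping a server from a second box
# and send an alert through an independent mail relay.  All names
# and addresses below are hypothetical.
import os, smtplib, time

SERVER = "192.168.1.10"            # the box being watched (example)
ALERT_FROM = "monitor@example.com"
ALERT_TO = "pager@example.com"
RELAY = "mail.example.com"         # must not depend on SERVER

def host_is_up(addr):
    # one ping, 2-second timeout; exit status 0 means it answered
    return os.system("ping -c 1 -W 2 %s >/dev/null 2>&1" % addr) == 0

while True:
    if not host_is_up(SERVER):
        msg = ("Subject: %s not responding\n\n"
               "%s failed a ping check at %s\n"
               % (SERVER, SERVER, time.ctime()))
        smtplib.SMTP(RELAY).sendmail(ALERT_FROM, ALERT_TO, msg)
    time.sleep(60)                 # check once a minute

Noticing the outage within a minute is the easy part; it's the
shutdown, the hardware swap, and the reboot that eat the downtime
budget.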


> However if your system remains "up" because you get an Email from the
> redundant computer who took over, then that's acceptable as well, as long
> as you reach the stated goal (99.999% uptime of the SYSTEM!!).

Yes, if you have multiple computers that helps, which is why I
tried to point that out in the original email. The more systems
you have, the more likely you are to achieve such reliability.
But even 2 systems is not enough for that kind of uptime, I would
think. I would feel more comfortable with at least 10 or 20.
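
On paper the redundancy math looks generous: if failures were
independent and failover were instant and perfect (neither of which
you actually get, thanks to shared UPSs, HVAC, switches and failover
bugs), N redundant nodes that are each available a fraction a of the
time would give a combined availability of 1 - (1 - a)^N. A quick
sketch of that arithmetic:

# Combined availability of N redundant nodes, ASSUMING independent
# failures and perfect, instant failover -- the shared power,
# cooling and network failures described above are exactly what
# breaks that assumption in real life.
def combined(per_node, n):
    return 1 - (1 - per_node) ** n

for n in (1, 2, 3, 10):
    print("%2d nodes at 99%% each -> %.6f%% combined"
          % (n, 100 * combined(0.99, n)))
# 1 -> 99%, 2 -> 99.99%, 3 -> 99.9999%, 10 -> ~100% (on paper only)

The gap between that paper math and reality (correlated failures,
failover time, human response time) is why more boxes feel a lot
safer.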


> If you have staff onsite who can diagnose a broken switch in less than a
> minute and they have the spare parts to replace it within two minutes,
> you can tolerate one or two of these single-point-of-failure  failures a
> year.

Yes, but who can diagnose something that fast? :) Switches are
easy to take care of, though: set up a spanning tree with redundant
ports across switches, not hard to do (though I've never done it).
Switches need extra power supplies as well. Extreme Networks recently
released a new switch (drool) which is 48 ports in 1U and supports
an internal redundant power supply.

Backup cooling, backup batteries, backup generators, and putting the
redundant power supplies on separate circuits, or even better on
separate power grids (if available), can all help achieve 5 9s of
uptime.

Doing some Google searching, from what I've found 99.999% uptime
equates to about 4 minutes of downtime per year, or, according to
rackspace.com's SLA, 24 seconds per month [1].
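
For reference, the raw arithmetic behind a five nines budget (SLAs
often round this or exclude scheduled maintenance, which is probably
where the slightly different figures come from):

# Downtime budget at 99.999% availability.
minutes_per_year = 365.25 * 24 * 60           # ~525,960 minutes
budget = minutes_per_year * (1 - 0.99999)
print("%.2f minutes per year" % budget)               # ~5.3 min/year
print("%.1f seconds per month" % (budget * 60 / 12))  # ~26 sec/month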
Someone may be able to achieve this level of reliability short term,
but to have consistent, ongoing reliability takes a lot more than
just a couple of servers in a clustered configuration; it takes
a lot of equipment, and it is not something to be taken lightly.
Which is why it seems that when most people realize it, they end
up not caring if their system is not available for 24 seconds
per month.

It would be fun to work in an environment that had the equipment,
the budget and the desire to have 5 9s of uptime...

nate

[1] http://www.rackspace.com/infrastructure/sla/sla.php


