Rogier Wolff said:

> No.
>
> Think RAID.
Think: CPU fan fails, CPU overheats, CPU fails, system crashes.

Think: memory goes bad (it does happen), system crashes.

Think: power supply blows out, system crashes. I have also had a power supply trip out a UPS. I happened to come into the office on a Saturday to hear one of my UPSs blaring at me; it took a while to troubleshoot, but it turned out a power supply had fried (the system was running at the time) and tripped a breaker on the UPS, preventing it from using AC power. I have also come home after a power glitch to find one of my UPSs fried and the systems connected to it crashed (though no damage occurred to the connected equipment other than a blown fuse).

Think: motherboard dies (rare, but it can happen), system dies.

Think: network controller dies. Depending on how it dies (e.g. a surge on the ethernet cord), it can take the system with it.

Think: disk controller failure, takes the system with it.

Think: ice storm (like what happened on the east coast of the US this past week) which knocks power out for a week.

Think: HVAC failure. The server room turns into an oven, and systems start overheating and frying themselves left and right. One of my offices had their server room A/C fail; within minutes the room was at about 100 degrees. They frantically tried to shut down equipment. Luckily this was during business hours so they could respond quickly, and only a couple of disks failed. One system was down for 3 days (a Compaq with hardware RAID: when the 2nd disk of the RAID 1 array failed, the system failed to boot. It only had 2 disks, and it took a few days to get a replacement from Compaq; they weren't in a rush).

Think: UPS failure. I have read reports that at least APC UPSs (and probably others) can completely lose power during one of their automated battery tests if the battery is no good (that is, the battery cannot sustain the load for the 2 seconds the test takes). That can wipe out the systems right there.

I could try to think of more, but it's 2:15 AM :)

RAID only solves one of the many points of failure.
If any of the above fails, without redundancy your system can be toast. Best case scenario, you can swap a CPU or RAM out and have the system back up in 10-15 minutes (that's VERY optimistic); that probably kills your 5 9s of uptime right there. In my experience even some of the most knowledgeable people can take a long time to troubleshoot a problem; tracing hardware problems isn't easy.

But even in RAID! Take the 3ware 6800 8-port hardware IDE RAID card: EVERY SINGLE TIME a drive dies, the system CRASHES. Granted, IDE RAID is cheap shit, but still. This is in a 6-disk RAID 10 array.

> In RAID it is acceptable that any one harddrive goes completely
> bad.

Yes, but it is short sighted to think RAID alone can provide 5 9s of uptime.

> So, if you have a computer that allows its network card to fail, sends
> you an Email requesting a new network card, and that you can change the
> network card all while the computer is still completely functional, then
> that's great. One way to solve the problem.

Care to explain how it can email you when the network card is fried? :) Best case, perhaps you're resourceful enough to have another system monitoring the network, and it notices the system is not responding. How fast can you troubleshoot and repair? If it takes longer than a few minutes, you lost that 5 9s of availability. And those few minutes INCLUDE system shutdown and system power up. Most of my Linux servers can easily take 5 minutes to hard reboot, not taking into account swapping hardware like a NIC or memory or a CPU. Having my L440GX+ motherboards do a "full" memory test on 1GB of RAM alone can take 3 minutes. That excludes the SCSI scan, NIC boot ROM initialization (can't seem to turn the damn thing off), and starting to boot the OS. By the time it's up, it can mean about 7 minutes (from start of shutdown to being back up).
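The "another system monitoring the network" idea above can be sketched in a few lines. This is my own minimal illustration, not anything from the thread; the probe target and the alert threshold are arbitrary assumptions, and a real watchdog would page someone instead of returning a string.

```python
import socket

def is_reachable(host, port, timeout=2.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

class Watchdog:
    """Alert only after several consecutive failed probes, so one
    dropped packet doesn't page anyone at 2:15 AM."""

    def __init__(self, host, port, failures_before_alert=3):
        self.host, self.port = host, port
        self.failures_before_alert = failures_before_alert
        self.failures = 0

    def probe(self):
        if is_reachable(self.host, self.port):
            self.failures = 0
            return "ok"
        self.failures += 1
        if self.failures >= self.failures_before_alert:
            return "alert"
        return "suspect"
```

Even with instant detection, the point above stands: detection is the easy part, and the drive to the office plus the hard reboot is what eats the downtime budget.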
> However if your system remains "up" because you get an Email from the
> redundant computer who took over, then that's acceptable as well, as long
> as you reach the stated goal (99.999% uptime of the SYSTEM!!).

Yes, if you have multiple computers that helps, which is why I tried to point that out in the original email. The more systems you have, the more likely you're able to achieve such reliability. But even 2 systems is not enough for that kind of uptime, I would think. I would feel more comfortable with at least 10 or 20.

> If you have staff onsite who can diagnose a broken switch in less than a
> minute and they have the spare parts to replace it within two minutes,
> you can tolerate one or two of these single-point-of-failure failures a
> year.

Yes, but who can diagnose something that fast? :) Switches are easy to take care of, though: set up a spanning tree with redundant ports across switches, not hard to do (though I've never done it). Switches need extra power supplies as well. Extreme Networks recently released a new switch (drool) which is 48 ports in 1U and supports an internal redundant power supply.

Backup cooling, backup batteries, backup generators, putting the redundant power supplies on separate circuits, or even better on separate power grids (if available), can all help achieve 5 9s of uptime.

Doing some Google searching, from what I've found, 99.999% uptime equates to about 5 minutes of downtime per year, or, according to rackspace.com's SLA, 24 seconds per month [1]. Someone may be able to achieve this level of reliability short term, but consistent, ongoing reliability takes a lot more than just a couple of servers in a clustered configuration; it takes a lot of equipment, and it's not something to be taken lightly. Which is why, it seems, when most people realize it, they end up not caring if their system is unavailable for 24 seconds per month.
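To put numbers on the downtime budget and the "more systems help" claim, here is a quick back-of-the-envelope sketch (my own arithmetic, not from the thread). It assumes node failures are independent, which shared power, cooling, and network in one room rarely deliver.

```python
# Back-of-the-envelope availability math (illustrative only).
MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960

def downtime_minutes_per_year(availability):
    """Downtime budget implied by an availability target."""
    return (1 - availability) * MINUTES_PER_YEAR

def combined_availability(per_node, n):
    """Availability of n redundant nodes, assuming independent failures:
    the service is down only when every node is down at once."""
    return 1 - (1 - per_node) ** n

print(round(downtime_minutes_per_year(0.99999), 2))  # five 9s: ~5.26 min/year
print(round(combined_availability(0.99, 2), 6))      # two 99% nodes: 0.9999
print(round(combined_availability(0.99, 10), 6))     # ten nodes: effectively 1.0
```

The caveat in that assumption is the whole argument of this email: fans, HVAC, and UPSs fail in correlated ways, so real clusters never collect the full exponent.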
It would be fun to work in an environment that had the equipment, the budget, and the desire to have 5 9s of uptime ..

nate

[1] http://www.rackspace.com/infrastructure/sla/sla.php