on Wed, Dec 11, 2002 at 04:21:21PM +0100, Rogier Wolff ([EMAIL PROTECTED]) wrote:
> On Wed, Dec 11, 2002 at 02:19:23AM -0800, nate wrote:
> > Rogier Wolff said:
> >
> > > No.
> > >
> > > Think RAID.
> >
> > think CPU fan fails, CPU overheats, CPU fails, system crashes.
>
> You misunderstand my "think Raid" remark. In a RAID configuration you
> can handle a WHOLE DISK going offline. If your SYSTEM can handle a
> whole CPU giving the ghost, then you can still achieve high uptimes
> by just taking over the jobs on another machine.

Repeating your assertion doesn't make it true.

Say, didn't the Netherlands just suffer a fire at Twente? How many
"five 9s" servers did that take out? Or an event some might recall
occurring in or about NYC on September 11, 2001.

At the level of five nines support (99.999% availability, roughly five
minutes of downtime a year), you're not talking single systems, and
you're very likely not talking single NOCs. Better, your NOCs should be
several hours' distance apart, be served by independent Internet
backbones (if 'Net attached) or WAN links, and be wired into relatively
independent power grids. Wind damage in Goose Lake, at the CA/OR
border, knocked 45% of California customers off the grid August 10,
1996[1].

Single-server uptimes of 1-2 years are not valid datapoints unless
drawn from a statistically valid sample. Otherwise you're at best
demonstrating survivor identification capabilities.

A credible record should point to a multi-year history, across multiple
individual hosts, comprising a "system". Net uptime and/or availability
of this system, in the context of anticipated service, HW and SW
upgrades, and reasonably anticipated emergency occurrences (fire, flood,
power outage, earthquake, hurricane, severe wind, civil unrest, internal
sabotage or compromise) _might_ make a credible basis for claims.

Note that with the emerging significance of highly modular redundant
x86 form factors (e.g., "blade" servers with 300+ nodes per standard 19"
rack), RAID may in fact play _no_ role, as service would consist of
wholesale replacement of individual nodes. IBM's work on "self healing"
systems doesn't even call for replacement[2]. Instead, anomalous units
are simply shut down entirely, with the system as a whole having
sufficient redundant capacity to accommodate anticipated failures over
the planned life of the system.

Peace.

----------------------------------------
Notes:

1.
   http://www.energy.ca.gov/reports/70097003.html

   This was one of _two_ major outages in summer 1996, and followed
   another extensive statewide outage, lasting upwards of a week, after
   the storm of Dec 13, 1995.

2. I sat in on a seminar on this topic at the Stanford Computer System
   Lab Colloquium, neat stuff:
   http://www.stanford.edu/class/ee380/Abstracts/011128.html

--
Karsten M. Self <[EMAIL PROTECTED]>        http://kmself.home.netcom.com/
 What Part of "Gestalt" don't you understand?
   If spam is the question, Spamassassin is the answer.
     http://spamassassin.taint.org/