On Fri, May 8, 2009 at 3:38 PM, Olivier Tharan wrote:
> On Thu, May 7, 2009 at 9:58 PM, Nathan Hruby wrote:
>> 6. Is your cooling covered? UPS's and generators aren't worth much if
>> your room cooks everything when utility power is gone.
>
> If you know that your cooling is not covered by the UPS (and you
> *know* it, right? :), then your plan should be to start powering down
> servers immediately, starting with the least useful ones. That way,
> you can withstand a power outage that lasts only 15-30 minutes and not
> start frying your most expensive stuff; or keep on shutting down
> hardware and sustain a longer outage. Of course, the fine line is
> figuring out when to call an outage and start powering down.

I know it; some facility folks and management need it expressly
mentioned so it can be worked into the incident response plan without a
giant WTF later. There's a big knowledge gap between "We have a UPS!"
and "What do we do now?!" at a lot of levels. If you have people who
think a firewall means you don't have to keep antivirus up to date or
patch systems, then you probably have people who think the UPS will
keep everything running with no problems.

> The same applies when your A/C fails and your sysadmins are not
> directly responsible for the A/C -- it generally takes a lot of time
> and physical room access to restart an A/C unit (they do not restart
> on their own), while it is usually easier to access servers remotely,
> so your sysadmin can shutdown stuff from home at 3am.

But do you have environmental monitoring to know to wake said sysadmin
up at 3AM? Or a facilities plan for keeping some form of sufficient
stopgap cooling running (warehouse fans to vent fresh air in and out,
spot chillers) after a primary chiller loss, so that critical systems
keep functioning nominally?

This is one of the reasons I like having core/critical systems be
distributed/HA'ed with component parts at separate facilities (even if
it's just a reduced room/spare office in the next building over). In
the face of a short-term site failure, you can punt the important stuff
quickly and easily over to someplace safe and not worry about it while
you deal with the rest of the junkola. You also don't have to call a
full-blown failover-able incident for something that might affect you
for less time than it'd take to run your complete DR procedure both
ways.

Everything breaks, so the special sauce should be knowing how to deal
with it before you have to.

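To make the "what do we do now?" part concrete, here is a rough sketch of a staged shutdown plan written down as code, so the decision is made before the room starts cooking. Everything in it is an assumption for illustration: the `Host` tiers, the `SHED_RULES` temperatures, and the function names are hypothetical, and the thresholds are placeholders to be replaced with numbers agreed with facilities. It only decides which hosts to shed and whether to page someone; actually reading sensors and powering things off is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    tier: int  # 0 = most critical; higher tiers are shed first


# Hypothetical thresholds in degrees Celsius (placeholders, not recommendations).
# Each rule: at or above this room temperature, shed this tier and everything
# less critical than it.
SHED_RULES = [
    (27.0, 3),
    (30.0, 2),
    (33.0, 1),
    (36.0, 0),
]


def hosts_to_shut_down(hosts, room_temp_c, on_utility_power):
    """Return the hosts to power down now, least critical first."""
    min_tier = None
    if not on_utility_power and hosts:
        # Assumes cooling is NOT on the UPS: start shedding the least
        # useful tier immediately instead of waiting for the room to heat up.
        min_tier = max(h.tier for h in hosts)
    for threshold_c, tier in SHED_RULES:
        if room_temp_c >= threshold_c:
            min_tier = tier if min_tier is None else min(min_tier, tier)
    if min_tier is None:
        return []
    doomed = [h for h in hosts if h.tier >= min_tier]
    return sorted(doomed, key=lambda h: (-h.tier, h.name))


def should_page_oncall(room_temp_c, on_utility_power, page_threshold_c=27.0):
    """Wake somebody up if power is gone or the room is getting warm."""
    return (not on_utility_power) or room_temp_c >= page_threshold_c


if __name__ == "__main__":
    inventory = [
        Host("db1", 0),
        Host("web1", 1),
        Host("build1", 2),
        Host("lab1", 3),
    ]
    for h in hosts_to_shut_down(inventory, room_temp_c=31.0, on_utility_power=True):
        print("shut down:", h.name)
```

The point of keeping the tiers and thresholds in one small table is that facilities and management can read and sign off on it ahead of time, which is the part that usually goes missing.
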
-n

-- 
-------------------------------------------
nathan hruby
metaphysically wrinkle-free
-------------------------------------------
