On Fri, May 8, 2009 at 3:38 PM, Olivier Tharan <[email protected]> wrote:
> On Thu, May 7, 2009 at 9:58 PM, Nathan Hruby <[email protected]> wrote:
>> 6. Is your cooling covered?  UPSes and generators aren't worth much if
>> your room cooks everything when utility power is gone.
>
> If you know that your cooling is not covered by the UPS (and you
> *know* it, right? :), then your plan should be to start powering down
> servers immediately, starting with the least useful ones. That way,
> you can withstand a power outage that lasts only 15-30 minutes and not
> start frying your most expensive stuff; or keep on shutting down
> hardware and sustain a longer outage. Of course, the fine line is
> figuring out when to call an outage and start powering down.

I know it, but some facility folks and management need it expressly
mentioned so it can be worked into the incident response plan without
a giant WTF later.  There's a big knowledge gap between "We have a
UPS!" and "What do we do now?!" at a lot of levels.  If you have
people who think a firewall means you don't have to keep Anti-Virus up
to date or patch systems, then you probably have people who think the
UPS will keep everything running with no problems.
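
One way to get it expressly mentioned is to write the power-down order
into the plan ahead of time.  A minimal sketch in Python, assuming a
hypothetical inventory where each host carries a priority (lowest gets
shut down first) and a power draw in watts.  The host names, numbers,
and the shutdown_plan helper are all made up for illustration, not
taken from any real site or tool:

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    priority: int  # hypothetical scale: 0 = least useful, shut down first
    watts: int     # hypothetical power draw


def shutdown_plan(hosts, target_watts):
    """Pick hosts to power down, least useful first, until the
    remaining load is at or below target_watts."""
    load = sum(h.watts for h in hosts)
    plan = []
    for host in sorted(hosts, key=lambda h: h.priority):
        if load <= target_watts:
            break
        plan.append(host)
        load -= host.watts
    return plan, load


# Made-up inventory, for illustration only.
INVENTORY = [
    Host("build-box", 0, 400),
    Host("test-db", 1, 600),
    Host("mail", 5, 500),
    Host("core-db", 9, 800),
]

if __name__ == "__main__":
    plan, remaining = shutdown_plan(INVENTORY, target_watts=1500)
    for host in plan:
        print("power down:", host.name)
    print("remaining load (W):", remaining)
```

The target load is a stand-in for whatever the room can tolerate
without cooling; picking that number is the hard part and has to come
from your own facility.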

> The same applies when your A/C fails and your sysadmins are not
> directly responsible for the A/C -- it generally takes a lot of time
> and physical room access to restart an A/C unit (they do not restart
> on their own), while it is usually easier to access servers remotely,
> so your sysadmin can shut down stuff from home at 3am.

But do you have environmental monitoring to know to wake said sysadmin
up at 3AM, or a facilities plan for keeping some form of sufficient
stopgap cooling (warehouse fans to vent in/out fresh air, spot
chillers) going in the event of a primary chiller loss, so that
critical systems keep functioning nominally?
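
The monitoring side can be as simple as two thresholds on room
temperature: one to page someone, one to start shutting things down.
A rough sketch in Python; the threshold values and function names are
assumptions picked for illustration, not recommendations for any
particular room or sensor:

```python
WARN_C = 27.0      # assumed threshold: page the on-call sysadmin
CRITICAL_C = 35.0  # assumed threshold: start powering down hosts


def classify(temp_c):
    """Map one temperature reading (Celsius) to an action."""
    if temp_c >= CRITICAL_C:
        return "shutdown"
    if temp_c >= WARN_C:
        return "page"
    return "ok"


def first_action(readings):
    """Return (index, action) for the first reading that needs
    attention, or None if every reading is fine."""
    for i, temp_c in enumerate(readings):
        action = classify(temp_c)
        if action != "ok":
            return i, action
    return None


if __name__ == "__main__":
    # Made-up readings from a room that is slowly heating up.
    print(first_action([22.0, 24.5, 28.0, 36.0]))
```

How the readings get collected and how the page gets sent depend
entirely on what sensors and alerting you already have.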

This is one of the reasons I like having core/critical systems be
distributed/HA'ed with component parts at separate facilities (even if
it's just a reduced room/spare office in the next building over).  In
the face of a short term site failure, you can punt the important
stuff quickly and easily over to someplace safe and not worry about it
while you deal with the rest of the junkola.  You also don't have to
call a full-blown failover-able incident for something that might
affect you for less time than it'd take to run your complete DR
procedure both ways.

Everything breaks, so the special sauce should be knowing how to deal
with it before you have to.

-n
-- 
-------------------------------------------
nathan hruby <[email protected]>
metaphysically wrinkle-free
-------------------------------------------

_______________________________________________
Discuss mailing list
[email protected]
http://lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
 http://lopsa.org/
