Hmm:

On Mon, Feb 23, 2009 at 9:04 PM, Russell Blau <russb...@hotmail.com> wrote:

> 2)  Within the last hour, the server log at
> http://wikitech.wikimedia.org/wiki/Server_admin_log indicates that Rob found
> and fixed the cause of srv31 (and srv32-34) being down -- a circuit breaker
> was tripped in the data center.

So we conclude that

Feb 12th: a breaker trips, taking four servers offline

(8 days go by, with a number of reports)

Feb 20th: it is noted that srv31 is down (and noted that AC is off?)

(3 days go by)

Feb 23rd: the tripped breaker is found, srv31 restarted (and 8+ hours
later, the dumps have not resumed)

Really? I mean is this for real?

The sequence ought to be something like: the breaker trips, and within
a minute or two monitoring shows that 4 servers are offline and not
scheduled to be. In the next 5 minutes someone looks at the server(s),
notes that there is no AC power, walks directly to the panel and resets
the breaker. How is this *not* done? I'm sorry, I just don't get it.
I've run data centres, and it just is not possible to have servers down
for lack of AC power for more than a few minutes unless there is a
fault one can't locate. (Or the grid is down and you are running a
subset on the generators ;-)
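
To make the monitoring step concrete, here is a rough sketch in Python
of the kind of check I mean. It is only an illustration under my own
assumptions: the function name, the two-minute threshold, the srv30
host and the timestamps are all made up, and this is not a description
of whatever monitoring is actually in place.

```python
from datetime import datetime, timedelta

# Hypothetical threshold: how long a host may be silent before we alert.
ALERT_AFTER = timedelta(minutes=2)


def unscheduled_outages(last_seen, scheduled_down, now, alert_after=ALERT_AFTER):
    """Return the hosts that have been silent for longer than alert_after
    and are not listed as scheduled downtime, sorted by name."""
    return sorted(
        host
        for host, seen in last_seen.items()
        if now - seen > alert_after and host not in scheduled_down
    )


if __name__ == "__main__":
    # Made-up timestamps: srv31-34 stopped answering five minutes ago.
    now = datetime(2009, 2, 12, 12, 0)
    last_seen = {
        "srv30": now - timedelta(seconds=20),
        "srv31": now - timedelta(minutes=5),
        "srv32": now - timedelta(minutes=5),
        "srv33": now - timedelta(minutes=5),
        "srv34": now - timedelta(minutes=5),
    }
    for host in unscheduled_outages(last_seen, scheduled_down=set(), now=now):
        print("ALERT: %s is down and was not scheduled to be" % host)
```

Four hosts going silent at the same moment, none of them scheduled, is
exactly the pattern that should send someone to look at the power.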

Can someone explain all this? Is the whole thing just completely
beyond the resources available to manage it?

Best regards,
Robert

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
