Hmm: On Mon, Feb 23, 2009 at 9:04 PM, Russell Blau <russb...@hotmail.com> wrote:
> 2) Within the last hour, the server log at
> http://wikitech.wikimedia.org/wiki/Server_admin_log indicates that Rob found
> and fixed the cause of srv31 (and srv32-34) being down -- a circuit breaker
> was tripped in the data center.

So we conclude that

Feb 12th: a breaker trips, taking four servers offline
(8 days go by, with a number of reports)
Feb 20th: it is noted that srv31 is down, (noted that AC is off?)
(3 days go by)
Feb 23rd: the tripped breaker is found, srv31 restarted
(and 8+ hours later, the dumps have not resumed)

Really? I mean is this for real?

The sequence ought to be something like: breaker trips, monitor shows
within a minute or two that 4 servers are offline, and not scheduled
to be. In the next 5 minutes someone looks at the server(s), notes that
there is no AC power, walks directly to the panel and resets the breaker.

How is this *not* done? I'm sorry, I just don't get it. I've run data
centres, and it just is not possible to have servers down for AC power
for more than a few minutes unless there is a fault one can't locate.
(Or grid down, and running a subset on the generators ;-)

Can someone explain all this? Is the whole thing just completely beyond
the resource available to manage it?

Best regards,
Robert

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l