Don't be so fast to point the finger.  Generally speaking, blame
looks obvious in the initial news reports but tends to diminish
once the facts are assessed in retrospect.

For example: it's "obvious" that serious net sites need multihoming.
But what if your multihomed bits go through the same pipe (or worse,
through the same fiber)?  Who do you blame when you find out?
Worse, in terms of blame: who can you go to beforehand who
actually knows where that can happen?
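
To make the shared-fate question concrete, here is a minimal sketch of the check you would want to run before trusting a "diverse" pair of circuits. It assumes (and this is the hard part) that each provider will actually tell you which buildings and conduits a circuit passes through. The function name and every segment label below are made up for illustration.

```python
def shared_segments(path_a, path_b):
    """Return the physical segments both paths traverse, in path_a order."""
    common = set(path_b)
    return [segment for segment in path_a if segment in common]


# Hypothetical inventories for two circuits bought from different carriers.
primary = ["building-entrance-1", "riser-a", "conduit-17", "central-office-x"]
backup = ["building-entrance-1", "riser-b", "conduit-17", "central-office-y"]

overlap = shared_segments(primary, backup)
if overlap:
    # Any overlap is a single point of failure for both circuits.
    print("shared fate at:", ", ".join(overlap))
else:
    print("no shared segments found in the data supplied")
```

An empty result only means the inventories you were given do not overlap; it says nothing about segments the providers did not disclose.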

I well remember this slide from Sean Donelan's talk at NANOG23:


What Didn't Work - Diversity and Avoidance

*       Equipment in the World Trade Center     
        primarily served tenants in complex     
        (shared fate)   
*       SONET ring through WTC tower 1 and
        alternate path through WTC tower 2      
*       Damage to 140 West Street central       
        office and surrounding underground      
*       Backup circuit routed through same      
*       “Advanced” data circuits (ISDN/DSL)     
        concentrated in a few central offices


The real answer, found elsewhere in Sean's talk, is that the
design of the net has always encouraged redundancy as an
engineering principle.  Stress situations are where that pays off,
even though it can't cover every possible eventuality (and as
has already been noted, redundant equipment fails too, and it
creates more complex failure modes).  The net had
problems on 9/11, especially around the WTC, but Sean's slides
document remarkable resiliency even in that area.

The power went off at a key spot in the San Francisco
infrastructure today.  But as far as I know, even though it was
mentioned in the Chron article, Craigslist stayed online
because they have a distributed and redundant system (which is
not to say, impervious to all failure modes).

Some shortcomings are obvious, but all I am saying is, before
rushing to cast blame, it's a good idea to try and collect some



Power restored in San Francisco
Marisa Lagos and Matthew B. Stannard, Chronicle Staff Writers
Tuesday, July 24, 2007

(07-24) 16:57 PDT SAN FRANCISCO -- Between 30,000 and 50,000 Pacific
Gas and Electric Co. customers in San Francisco and the northern
Peninsula lost power for several hours this afternoon after what
witnesses described as an explosion under a manhole cover on Mission
Street, the utility said.

Brian Swanson, a spokesman for the utility, said power failures were
reported throughout wide swaths of the east side of San Francisco,
including downtown and at PG&E's own office on Beale Street near the
Ferry Building.

The outage first occurred at about 1:50 p.m., and electricity
flickered on and off at least five times before power was restored
at about 4 p.m.

PG&E officials said the source of the power outage was an
underground failure. Standing at a manhole in a plaza at 560 Mission
St. in San Francisco, where witnesses reported hearing an explosion,
Swanson said it could have been the source of the outage, but
officials were still investigating.

The incident recalled an August 2005 explosion in an underground
vault at Post and Kearny streets that critically injured a woman who
was walking by. At the time, PG&E blamed high levels of moisture in
the attached high-voltage chambers and said it was checking the
safety of about 1,000 other high-voltage chambers.

Swanson said today's incident -- in which no one was injured -- was
caused by some sort of fault in the line.

"It is completely unrelated to what happened two years ago," he said.

Witnesses said they heard an explosion at about 1:50 p.m., then saw
flames coming from the manhole.

Actor Torino Von Jones, 32, said he was filming a Fruit of the Loom
commercial down the block at the time.

"We were standing over there waiting for the camera cue when we
heard a big explosion," he said. "Flames came up taller than I am,
and I'm 6-foot-2."

"Naturally, when you hear an explosion, you think the worst," Von
Jones said. Nevertheless, he hurried back to work. "We're Fruit of
the Loom -- we've got to make this commercial."

The outage briefly affected some Muni buses and trains, but all were
back to normal by 3 p.m., a spokeswoman said.

Workers at several downtown and South of Market offices were
reportedly sent home for the day following the outage. Additionally,
the datacenter 365 Main -- which hosts Web sites including
Craigslist and Yelp -- lost power.

------ mail forwarded, original message follows ------

From: [EMAIL PROTECTED] <Seth Mattinen>
Subject: Re: San Francisco Power Outage
Date: Tue, 24 Jul 2007 15:54:08 -0700

Jonathan Lassoff wrote:
> Just a heads up to anyone on list that PG&E has just sustained a large
> outage in San Francisco that has caused a few hiccups (both network,
> electrical, infrastructural, etc.) around the city.
> I've confirmed that both customers in 365 Main and parts of telecom 1
> have both sustained brief blackouts. No word yet from 200 Paul.
> Anyone in the area that could use a hand with anything, I'll probably
> be wrapping up fixes for my stuff soon, and would be glad to help
> however I can.

I have a question: does anyone seriously accept "oh, power trouble" as a
reason your servers went offline? Where are the generators? The UPS?
Testing of said combination of UPS and generators? What if it was
important? I honestly find it hard to believe anyone runs a facility
like that and people actually *pay* for it.

If you do accept this as a good reason for failure, why?

