Re: San Francisco Power Outage
On 7/25/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

> ... fire department evacuating the data center, cutting off electricity
> in the area, and forbidding the diesel generators to be switched on?

I know a guy who was at the US Data Centers Inc facility in Marlborough, MA (before USDCI failed). Soon after they first opened it up, they had a fire. The problem was that the fire was *in* the giant APC Silcon system they had. They had to kill the APC, and that took the load down too.

So they installed an external transfer switch, rather than depending on the one built into the APC system. There was some SNAFU with the wiring, so right after the install there was an electrical fire -- this time in the external transfer switch panel.

While I suspect poor planning/testing contributed to their woes, it still goes to show: some days you're the windshield, and some days you're the bug.

-- Ben
Re: San Francisco Power Outage
[EMAIL PROTECTED] (Jeff Aitken) writes:

> ..., we had a failure at another datacenter that uses Piller units, which
> operate on the same basic principle as the Hitec ones. ...

i guess i never understood why anyone would install a piller that far from the equator. (it spins like a top, on a vertical axis, and the angular momentum is really quite gigantic for its size -- it's heavy and it spins really really fast -- and i remember asking a piller tech why his machine wasn't tipped slightly southward to account for Coriolis, and he said i was confused. probably i am.) but for north america, whenever i had a choice, i chose hitec. (which spins with an axis parallel to gravity.)

-- Paul Vixie
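For rough scale on the gyroscopic effect Vixie is asking about: a flywheel whose spin axis is not parallel to the Earth's axis must be continuously torqued by its mounts as the Earth turns, with magnitude tau = L * Omega_earth * sin(theta). A back-of-the-envelope sketch in Python -- the rotor mass, radius, and speed below are illustrative assumptions, not actual Piller or Hitec specifications:

import math

# Illustrative flywheel figures -- assumptions, not vendor specs.
mass = 5000.0          # kg
radius = 0.5           # m
rpm = 3600.0           # spin speed

omega_spin = rpm * 2 * math.pi / 60          # rad/s (~377)
inertia = 0.5 * mass * radius**2             # solid-disc approximation, kg*m^2
L = inertia * omega_spin                     # angular momentum, kg*m^2/s

OMEGA_EARTH = 7.292e-5                       # Earth's rotation rate, rad/s
latitude = math.radians(37.8)                # San Francisco, roughly

# Vertical spin axis: angle to Earth's axis is (90 degrees - latitude).
theta = math.pi / 2 - latitude
torque = L * OMEGA_EARTH * math.sin(theta)   # steady bearing torque, N*m

print(f"L = {L:.3e} kg*m^2/s, precession torque = {torque:.1f} N*m")

On these numbers the steady torque is on the order of ten newton-metres -- trivial next to the static bearing load of a multi-tonne rotor, which may be why the Piller tech was unimpressed by the question.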
Re: San Francisco Power Outage
[EMAIL PROTECTED] ("Jonathan Lassoff") writes: > Well, the fact still remains that operating a datacenter smack-dab in > the center of some of the most inflated real estate in recent history > is quite a castly endeavor. yes. (speaking for both 365 main, and 529 bryant.) > I really wouldn't be all that surprised if 365 Main cut some corners > here and there behind the scenes to save costs while saving face. no expense was spared in the conversion of this tank turret factory into a modern data center. if there was a dark start option, MFN ordered it. (but if it required maintainance, MFN's bankruptcy interrupted that, but the current owner has never been bankrupt.) > As it is, they don't have remotely enough power to fill that facility > to capacity, and they've suffered some pretty nasty outages in the > recent past. I'm strongly considering the possibility of completely > moving out of there. 2mW/floor seemed like a lot at the time. ~6kW/rack wasn't contemplated. (is it time to build out the land adjacent to 200 paul, then?) -- Paul Vixie
Re: San Francisco Power Outage
Michael Dillon writes:

>> And the stories that the power guy I'm working with tells
>> about foreign facilities, particularly in middle east war
>> zones, are really scary...
>>
>> We fundamentally do not have the facilities problem
>> completely nailed down to the point that things will never
>> drop. Level 4 datacenters can, and will, fail. Nothing you can do,
>> including doing 48V DC for everything, is a truly foolproof solution.
>
> A single level 4 datacenter is a Single Point of Failure!
>
> Two of those middle-eastern style facilities is... ?
>
> Has anyone actually kept track of all these data center failures over
> the years and done some statistical analysis on it? Maybe two half-baked
> data centers is better than one over the long run?
>
> Remember that one 10-12 years ago in (Palo Alto, Mountain View?) where a
> lady in a car caused a backhoe driver to move out of the way, which
> resulted in him cutting a gas line, which resulted in the fire department
> evacuating the data center, cutting off electricity in the area, and
> forbidding the diesel generators to be switched on?

Santa Clara. I was working right outside the evacuation radius. Which exchange point was in the building? PB-NAP? CIX? I remember we had a net-dark event associated, but not which one. It was a bad day...

The lesson, as you point out, is that geographical redundancy is sometimes necessary. This is as true for providers as for datacenter end-users...

-george william herbert
[EMAIL PROTECTED]
Re: San Francisco Power Outage
On Tue, Jul 24, 2007 at 09:57:09PM -0500, Brandon Galbraith wrote:

> It appears that 365 is using the Hitec Continuous Power System [
> http://hitec.pageprocessor.nl/p3.php?RubriekID=2016], which is a motor,
> generator, flywheel, clutch, and Diesel engine all on the same shaft. They
> don't use batteries.

Yes. I used to work for the company that originally built the 365 Main datacenter and remember touring it near the end of the construction phase. The collection of power units up on the roof was impressive, as were the seismic isolators in the basement. But even when you try to do everything right, Murphy usually finds a way to sneak up behind you and whisper "BOHICA" in your ear.

For example, we had a failure at another datacenter that uses Piller units, which operate on the same basic principle as the Hitec ones. While running on generator, one of the engines overheated due to an oil-flow problem and threw a rod. When the on-duty electrician responded to the alarm, there were red-hot chunks of engine *outside* of the enclosure, and there was a hole in the side of the unit large enough to stick your arm in. The facility manager kept the damaged piston as a memento. :-)

I don't remember whether this was due to a design flaw, improper installation, or what, but the important points are that (1) this is the real world and shit happens, and (2) it wasn't until the generator was worked long enough that the reduction in oil flow caused enough friction to trigger a catastrophic failure. I.e., there's no guarantee that you will catch this kind of problem in your monthly tests.

On Tue, Jul 24, 2007 at 05:39:34PM -0700, George William Herbert wrote:

> Unfortunate real-world lesson: there is a functional difference between
> pushing the UPS test cutover button, and some of the stuff that can happen
> out on the power lines (including rapid voltage swings, harmonics, etc).

Precisely.

--Jeff
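The reason a diesel-rotary design like the Hitec can skip batteries is that the flywheel's kinetic energy carries the load for the few seconds the diesel needs to start and clutch in. A rough sketch of that ride-through budget -- the rotor inertia, speeds, and load below are illustrative assumptions, not Hitec specifications:

import math

# Illustrative rotor figures -- assumptions, not Hitec specifications.
inertia = 2000.0          # rotor moment of inertia, kg*m^2
rpm_full = 1800.0         # nominal spin speed
rpm_min = 1650.0          # slowest speed at which output stays in spec
load_kw = 750.0           # protected critical load

w_full = rpm_full * 2 * math.pi / 60
w_min = rpm_min * 2 * math.pi / 60

usable_j = 0.5 * inertia * (w_full**2 - w_min**2)   # extractable kinetic energy
ride_through_s = usable_j / (load_kw * 1000.0)

print(f"usable energy: {usable_j / 1000:.0f} kJ")
print(f"ride-through at {load_kw:.0f} kW: {ride_through_s:.1f} s")

Under ten seconds at full load on these numbers -- plenty to bridge a normal diesel start, and no help at all once the engine itself throws a rod, which is Jeff's point about what monthly tests do and don't catch.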
Re: San Francisco Power Outage
Speaking on Deep Background, the Press Secretary whispered:

> Level 4 datacenters can, and will, fail. Nothing you can
> do, including doing 48V DC for everything, is a truly
> foolproof solution.

Hard to find anyone who takes the -48vdc mantra to heart more than an RBOC. Ditto on lightning protection. Yet I recall the Bell South 305-255 CO taking a lightning hit on the incoming power; the 5ESS was down for 3-4 hours.

--
A host is a host from coast to coast..............[EMAIL PROTECTED]
& no one will talk to a host that's close.......[v].(301) 56-LINUX
Unless the host (that isn't close)........................pob 1433
is busy, hung or dead....................................20915-1433
Re: San Francisco Power Outage
On Tue, Jul 24, 2007 at 11:57:37PM +0000, Paul Vixie wrote:
>
> [EMAIL PROTECTED] (Seth Mattinen) writes:
>
> > I have a question: does anyone seriously accept "oh, power trouble" as a
> > reason your servers went offline? Where's the generators? UPS? Testing
> > said combination of UPS and generators? What if it was important? I
> > honestly find it hard to believe anyone runs a facility like that and
> > people actually *pay* for it.
> >
> > If you do accept this is a good reason for failure, why?
>
> sometimes the problem is in the redundancy gear itself. PAIX lost power
> twice during its first five years of operation, and both times it was due
> to faulty GFI in the UPS+redundancy gear. which had passed testing during
> construction and subsequently, but eventually some component just wore out.

I had an issue with exactly that 7 or 8 years ago at Via Networks: the switchover gear shorted and died horrifically, leading to an outage that lasted well through the night (something like 16 hours in total). Being a Friday evening, it was difficult to get people on site promptly.

The lesson learned was 'the big switch' -- a huge thing that took the weight of two adults to move, but it did mean that should something similar occur, we could transfer the whole building power manually, directly to the generator. I doubt such a beast would scale to the power loads on a large datacentre though, but then they are generally not on a single grid/UPS feed.

Steve
RE: San Francisco Power Outage
> And the stories that the power guy I'm working with tells
> about foreign facilities, particularly in middle east war
> zones, are really scary...
>
> We fundamentally do not have the facilities problem
> completely nailed down to the point that things will never
> drop. Level 4 datacenters can, and will, fail. Nothing you can do,
> including doing 48V DC for everything, is a truly foolproof solution.

A single level 4 datacenter is a Single Point of Failure!

Two of those middle-eastern style facilities is... ?

Has anyone actually kept track of all these data center failures over the years and done some statistical analysis on it? Maybe two half-baked data centers is better than one over the long run?

Remember that one 10-12 years ago in (Palo Alto, Mountain View?) where a lady in a car caused a backhoe driver to move out of the way, which resulted in him cutting a gas line, which resulted in the fire department evacuating the data center, cutting off electricity in the area, and forbidding the diesel generators to be switched on?

--Michael Dillon
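Michael's "two half-baked beats one gold-plated" intuition is easy to check with first-order availability arithmetic, assuming the two sites fail independently. The availability figures below are illustrative, not measured:

# First-order availability arithmetic -- illustrative figures, and it
# assumes the two half-baked sites fail independently (often untrue:
# shared fiber, shared grid, shared software all correlate failures).

HOURS_PER_YEAR = 8766

tier4 = 0.99995          # one well-built facility
half_baked = 0.999       # one mediocre facility

pair = 1 - (1 - half_baked) ** 2   # service is up if either site is up

for name, a in [("one level-4", tier4),
                ("one half-baked", half_baked),
                ("two half-baked", pair)]:
    print(f"{name:>15}: {a:.6f} uptime, {(1 - a) * HOURS_PER_YEAR:.2f} h/yr down")

On paper the mediocre pair wins by two orders of magnitude -- but only to the extent the failures really are independent, which is exactly George's point about geographic redundancy.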
Re: San Francisco Power Outage
From: "Justin M. Streiner" <[EMAIL PROTECTED]> Sent: Tuesday, July 24, 2007 5:58 PM Subject: Re: San Francisco Power Outage Nothing quite like the sound of a whole machine room spinning down at the same time. It gives you that lovely "oh shit" feeling in the pit of your stomach.<< Yep. I plugged in my soldering iron and (coincidentally) the whole room at State of Calif., Franchise Tax, EPO'd. Everyone immediately started staring at me of course. --Michael
Re: San Francisco Power Outage
On Tue, 24 Jul 2007, Tuc at T-B-O-H.NET wrote:

> (I remember two guys with VERY LONG screwdrivers poking a live transfer
> switch to get it to reset properly, and was told to step back 20 feet as
> that's how far they expected to get thrown if they did something wrong).
> (I also remember them resetting the switch, then TRIPPING it again just
> to make sure it could be reset again!)

Ahhh, a trip down memory lane :)

The ISP I used to work at had a small ping-and-power colo space, and we also housed a large dial/DSL POP in the same building. A customer went in to do hardware maintenance on one of their colo boxes. Two important notes here:

1. The machine was still plugged into the power outlet when they decided to do this work.
2. They decided to stick a screwdriver into the power supply WHILE said machine was plugged into said power outlet.

I guess those "no user serviceable parts inside" warning labels are just friendly recommendations and nothing more...

While the machine was fed from a circuit that other colo customers were on, the breaker apparently didn't trip quickly enough to keep the resulting short from sending the 20 kVA Liebert UPS at the back of the room into a fit. It alarmed, then shut down within 1-2 seconds of this customer doing the trick with the screwdriver. This UPS also fed said large dial and DSL POP.

Nothing quite like the sound of a whole machine room spinning down at the same time. It gives you that lovely "oh shit" feeling in the pit of your stomach.

I do remember fighting back the urge to stab said customer with that screwdriver...

jms
Re: San Francisco Power Outage
But as George mentions... Sh*t happens. There are things you can't foresee, or that would take way too much engineering to overcome for that 1-in-a-million "oops". I've been at Telehouse 25B a few times when the "I never expected something like that would happen" happened. (I remember two guys with VERY LONG screwdrivers poking a live transfer switch to get it to reset properly, and was told to step back 20 feet as that's how far they expected to get thrown if they did something wrong). (I also remember them resetting the switch, then TRIPPING it again just to make sure it could be reset again!)

Tuc/TBOH

> They should have generators running...I can't foresee any good
> datacenter not having multiple generators to keep their customers'
> servers online with UPS.
>
> -Ray
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of
> Adrian Chadd
> Sent: Tuesday, July 24, 2007 7:54 PM
> To: Seth Mattinen
> Cc: nanog list
> Subject: Re: San Francisco Power Outage
>
> On Tue, Jul 24, 2007, Seth Mattinen wrote:
>
> > I have a question: does anyone seriously accept "oh, power trouble" as a
> > reason your servers went offline? Where's the generators? UPS? Testing
> > said combination of UPS and generators? What if it was important? I
> > honestly find it hard to believe anyone runs a facility like that and
> > people actually *pay* for it.
> >
> > If you do accept this is a good reason for failure, why?
>
> Didn't you read? He paid extra for super-reliable power from his
> electricity provider..
>
> Adrian
Fw: Re: San Francisco Power Outage
Don't be so fast to point the finger. Generally speaking, blame is obvious from the initial news reports but tends to diminish with retrospective fact-based assessment.

For example: it's "obvious" that serious net sites need multihoming. But what if your multihomed bits go through the same pipe (or worse, through the same fiber)? Who do you blame when you find out? Worse, in terms of blame: who can you go to beforehand who actually knows where that can happen?

I well remember this slide from Sean Donelan's talk at NANOG23:

---
http://www.nanog.org/mtg-0110/ppt/donelan_files/v3_document.htm

What Didn't Work - Diversity and Avoidance

* Equipment in the World Trade Center primarily served tenants in complex (shared fate)
* SONET ring through WTC tower 1 and alternate path through WTC tower 2
* Damage to 140 West Street central office and surrounding underground infrastructure
* Backup circuit routed through same facility
* "Advanced" data circuits (ISDN/DSL) concentrated in a few central offices
---

The real answer, found elsewhere in Sean's talk, is that the design of the net has always encouraged redundancy as an engineering principle. Stress situations are where that pays off, even though it can't solve every possible eventuality (and, as has already been noted, redundant equipment also fails, as well as creating more complex failure modes). The net had problems on 9/11, especially around the WTC, but Sean's slides document remarkable resiliency even in that area.

The power went off at a key spot in the San Francisco infrastructure today. But as far as I know, even though it was mentioned in the Chron article, Craigslist stayed online because they have a distributed and redundant system (which is not to say impervious to all failure modes).

Some shortcomings are obvious, but all I am saying is, before rushing to cast blame, it's a good idea to try and collect some facts.

fh

---
http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2007/07/24/BAG9NR67253.DTL&tsp1

Power restored in San Francisco
Marisa Lagos and Matthew B. Stannard, Chronicle Staff Writers
Tuesday, July 24, 2007

(07-24) 16:57 PDT SAN FRANCISCO -- Between 30,000 and 50,000 Pacific Gas and Electric Co. customers in San Francisco and the northern Peninsula lost power for several hours this afternoon after what witnesses described as an explosion under a manhole cover on Mission Street, the utility said.

Brian Swanson, a spokesman for the utility, said power failures were reported throughout wide swaths of the east side of San Francisco, including downtown and at PG&E's own office on Beale Street near the Ferry Building. The outage first occurred at about 1:50 p.m., and electricity flickered on and off at least five times before power was restored at about 4 p.m.

PG&E officials said the source of the power outage was an underground failure. Standing at a manhole in a plaza at 560 Mission St. in San Francisco, where witnesses reported hearing an explosion, Swanson said it could have been the source of the outage, but officials were still investigating.

The incident recalled an August 2005 explosion in an underground vault at Post and Kearny streets that critically injured a woman who was walking by. At the time, PG&E blamed high levels of moisture in the attached high-voltage chambers and said it was checking the safety of about 1,000 other high-voltage chambers. Swanson said today's incident -- in which no one was injured -- was caused by some sort of fault in the line.
"It is completely unrelated to what happened two years ago," he said. Witnesses said they heard an explosion at about 1:50 p.m., then saw flames coming from the manhole. Actor Torino Von Jones, 32, said he was filming a Fruit of the Loom commercial down the block at the time. "We were standing over there waiting for the camera cue when we heard a big explosion," he said. "Flames came up taller than I am, and I'm 6-foot-2." "Naturally, when you hear an explosion, you think the worst," Von Jones said. Nevertheless, he hurried back to work. "We're Fruit of the Loom -- we've got to make this commercial." The outage briefly affected some Muni buses and trains, but all were back to normal by 3 p.m., a spokeswoman said. Workers at several downtown and South of Market offices were reportedly sent home for the day following the outage. Additionally, the datacenter 365 Main -- which hosts Web sites including Craigslist and Yelp -- lost power. -- mail forwarded, original message follows -- To: nanog@merit.edu From: [EMAIL PROTECTED] Subject: Re: San Francisco Power Outage Date: Tue, 24 Jul 2007 15:54:08 -0700 J
Re: San Francisco Power Outage
365 I believe has flywheels... from what I'm gathering it wasn't a full-building outage. Static switch issues again, anyone? Either way, I'm happy I moved out of there. It was overpriced even when it was working.

I hear they had a scheduled power outage for maintenance this coming weekend. I'll give them the benefit of the doubt and assume it was for something else, not that they knew they had an issue and had their fingers crossed[1].

On a related note -- one of my clients came to within 5 minutes of the DC UPSes running out today before power came back. The generator truck was still en route, but hey, power's back! So they cancel it. *sigh*

John

1: ...but not crossed tight enough.

On Tue, Jul 24, 2007 at 08:36:59PM -0400, Raymond L. Corbin wrote:
>
> They should have generators running...I can't foresee any good
> datacenter not having multiple generators to keep their customers'
> servers online with UPS.
>
> -Ray
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of
> Adrian Chadd
> Sent: Tuesday, July 24, 2007 7:54 PM
> To: Seth Mattinen
> Cc: nanog list
> Subject: Re: San Francisco Power Outage
>
> On Tue, Jul 24, 2007, Seth Mattinen wrote:
>
> > I have a question: does anyone seriously accept "oh, power trouble" as a
> > reason your servers went offline? Where's the generators? UPS? Testing
> > said combination of UPS and generators? What if it was important? I
> > honestly find it hard to believe anyone runs a facility like that and
> > people actually *pay* for it.
> >
> > If you do accept this is a good reason for failure, why?
>
> Didn't you read? He paid extra for super-reliable power from his
> electricity provider..
>
> Adrian
Re: San Francisco Power Outage
Seth wrote:

> Jonathan Lassoff wrote:
>>
>> Just a heads up to anyone on list that PG&E has just sustained a large
>> outage in San Francisco that has caused a few hiccups (network,
>> electrical, infrastructural, etc.) around the city.
>>
>> I've confirmed that customers in both 365 Main and parts of telecom 1
>> have sustained brief blackouts. No word yet from 200 Paul.
>>
>> Anyone in the area that could use a hand with anything, I'll probably
>> be wrapping up fixes for my stuff soon, and would be glad to help
>> however I can.
>
> I have a question: does anyone seriously accept "oh, power trouble" as a
> reason your servers went offline? Where's the generators? UPS? Testing
> said combination of UPS and generators? What if it was important? I
> honestly find it hard to believe anyone runs a facility like that and
> people actually *pay* for it.
>
> If you do accept this is a good reason for failure, why?

Unfortunate real-world lesson: there is a functional difference between pushing the UPS test cutover button, and some of the stuff that can happen out on the power lines (including rapid voltage swings, harmonics, etc). I know 365 Main has the equipment and tests it; I've been standing outside when the generators spool up.

I've had generator firmware upgrades generate reporting info on the serial uplink that flipped the UPSes into a permanent error state until the Liebert guys got off the plane with the replacement mainboard. I've had grid voltage fluctuations that toasted VSDs in chillers. I watched a building's electrical service go "pop" when a transformer blew and ran 10 kV into the 220 mains for a fraction of a second as it arced. I was at home but called in after a 5 MW generator popped under a sufficiently badly harmonic UPS and PDU load of only about 2.4 MW. I had a client who forgot to wire the A/C into the UPS, and nearly melted a whole server room.

And the stories that the power guy I'm working with tells about foreign facilities, particularly in middle east war zones, are really scary...

We fundamentally do not have the facilities problem completely nailed down to the point that things will never drop. Level 4 datacenters can, and will, fail. Nothing you can do, including doing 48V DC for everything, is a truly foolproof solution.

-george william herbert
[EMAIL PROTECTED]
RE: San Francisco Power Outage
They should have generators running... I can't foresee any good datacenter not having multiple generators to keep their customers' servers online with UPS.

-Ray

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Adrian Chadd
Sent: Tuesday, July 24, 2007 7:54 PM
To: Seth Mattinen
Cc: nanog list
Subject: Re: San Francisco Power Outage

On Tue, Jul 24, 2007, Seth Mattinen wrote:

> I have a question: does anyone seriously accept "oh, power trouble" as a
> reason your servers went offline? Where's the generators? UPS? Testing
> said combination of UPS and generators? What if it was important? I
> honestly find it hard to believe anyone runs a facility like that and
> people actually *pay* for it.
>
> If you do accept this is a good reason for failure, why?

Didn't you read? He paid extra for super-reliable power from his electricity provider..

Adrian
Re: San Francisco Power Outage
[EMAIL PROTECTED] (Seth Mattinen) writes:

> I have a question: does anyone seriously accept "oh, power trouble" as a
> reason your servers went offline? Where's the generators? UPS? Testing
> said combination of UPS and generators? What if it was important? I
> honestly find it hard to believe anyone runs a facility like that and
> people actually *pay* for it.
>
> If you do accept this is a good reason for failure, why?

sometimes the problem is in the redundancy gear itself. PAIX lost power twice during its first five years of operation, and both times it was due to faulty GFI in the UPS+redundancy gear. which had passed testing during construction and subsequently, but eventually some component just wore out.

-- Paul Vixie
Re: San Francisco Power Outage
On 7/24/07, Seth Mattinen <[EMAIL PROTECTED]> wrote:

> I have a question: does anyone seriously accept "oh, power trouble" as a
> reason your servers went offline? Where's the generators? UPS? Testing
> said combination of UPS and generators? What if it was important? I
> honestly find it hard to believe anyone runs a facility like that and
> people actually *pay* for it.
>
> If you do accept this is a good reason for failure, why?
>
> ~Seth

I'm unable to find a link at the moment, but many moons ago power was lost at the 350 E Cermak Equinix facility in Chicago. At the time, we didn't have production equipment there (only a firewall in a shared colo cage/cabinet). This occurred on a Friday evening and lasted for quite some time into Saturday morning because their generators would start up but would refuse to continue running. I believe the root cause was a problem related to insulation on the power cables somewhere.

I understand testing is done frequently, but I'm also aware that if I want full redundancy, I'm going to have two physically separate locations. There are some events you can't plan for, as well as failure modes that aren't easily/quickly resolved.

-brandon
Re: San Francisco Power Outage
On Tue, Jul 24, 2007, Seth Mattinen wrote:

> I have a question: does anyone seriously accept "oh, power trouble" as a
> reason your servers went offline? Where's the generators? UPS? Testing
> said combination of UPS and generators? What if it was important? I
> honestly find it hard to believe anyone runs a facility like that and
> people actually *pay* for it.
>
> If you do accept this is a good reason for failure, why?

Didn't you read? He paid extra for super-reliable power from his electricity provider..

Adrian
Re: San Francisco Power Outage
Jonathan Lassoff wrote:

> Just a heads up to anyone on list that PG&E has just sustained a large
> outage in San Francisco that has caused a few hiccups (network,
> electrical, infrastructural, etc.) around the city.
>
> I've confirmed that customers in both 365 Main and parts of telecom 1
> have sustained brief blackouts. No word yet from 200 Paul.
>
> Anyone in the area that could use a hand with anything, I'll probably
> be wrapping up fixes for my stuff soon, and would be glad to help
> however I can.

I have a question: does anyone seriously accept "oh, power trouble" as a reason your servers went offline? Where's the generators? UPS? Testing said combination of UPS and generators? What if it was important? I honestly find it hard to believe anyone runs a facility like that and people actually *pay* for it.

If you do accept this is a good reason for failure, why?

~Seth
San Francisco Power Outage
Just a heads up to anyone on list that PG&E has just sustained a large outage in San Francisco that has caused a few hiccups (network, electrical, infrastructural, etc.) around the city.

I've confirmed that customers in both 365 Main and parts of telecom 1 have sustained brief blackouts. No word yet from 200 Paul.

Anyone in the area that could use a hand with anything, I'll probably be wrapping up fixes for my stuff soon, and would be glad to help however I can.

Cheers,
jonathan

--
Jonathan Lassoff
echo thejof | sed 's/^/jof@/;s/$/.com/'
http://thejof.com
415-215-2464
GPG: 0xC8579EE5