RE: What to expect after a cooling failure

2013-07-10 Thread Tony Patti
This has been a very interesting thread.

Google pointed me to this Dell document which specs some of their servers 
having an expanded operating temperature range
*** based on the amount of time spent at the elevated temperature, as a 
percentage of annual operating hours. ***

ftp://ftp.dell.com/Manuals/all-products/esuprt_ser_stor_net/esuprt_poweredge/poweredge-r710_User%27s%20Guide4_en-us.pdf

I mention that because the 45 C limit for "1% of annual operating hours" is two 
degrees above the 43 C reported as reached in the original email.

It would seem that Dell recognizes that there might be situations, such as 
this, where the "continuous operation" range (35 C) is briefly exceeded.
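
As a rough illustration of that time-based allowance, the sketch below compares the event duration given in the original email (18:45 to 01:15) with a budget of 1% of annual operating hours. It assumes the gear follows a spec like the Dell one, that a year is 8760 operating hours, and, pessimistically, that the whole event counts against the budget. The function names are made up for this example.

```python
from datetime import datetime, timedelta

HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years (assumption)


def excursion_budget_hours(fraction_of_year: float) -> float:
    """Hours per year allowed at the elevated range for a given fraction."""
    return HOURS_PER_YEAR * fraction_of_year


def event_duration_hours(start: str, end: str) -> float:
    """Hours between two HH:MM clock times; the end may fall on the next day."""
    t0 = datetime.strptime(start, "%H:%M")
    t1 = datetime.strptime(end, "%H:%M")
    if t1 <= t0:
        t1 += timedelta(days=1)  # event ran past midnight
    return (t1 - t0).total_seconds() / 3600


budget = excursion_budget_hours(0.01)           # the "1% of annual operating hours" figure
used = event_duration_hours("18:45", "01:15")   # times from the original email
print(f"budget {budget:.1f} h, event {used:.1f} h, {used / budget:.1%} of the budget")
```

Under those assumptions the event uses only a small part of a yearly 1% allowance, though that says nothing about gear built to a different spec.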

Tony Patti
CIO
S. Walter Packaging Corp.

-Original Message-
From: Erik Levinson [mailto:erik.levin...@uberflip.com] 
Sent: Tuesday, July 09, 2013 11:28 PM
To: NANOG mailing list
Subject: What to expect after a cooling failure

As some may know, yesterday 151 Front St suffered a cooling failure after 
Enwave's facilities were flooded. 

One of the suites that we're in recovered quickly but the other took much 
longer and some of our gear shut down automatically due to overheating. We 
remotely shut down many redundant and non-essential systems in the hotter 
suite, and remotely transferred some others to the cooler suite, to ensure 
that we had a minimum of all core systems running in the hotter suite. We 
waited until the temperatures returned to normal, and brought everything back 
online. The entire event lasted from approx 18:45 until 01:15. Apparently 
ambient temperature was above 43 degrees Celsius at one point on the cool 
side of cabinets in the hotter suite.

For those who have gone through such events in the past, what can one expect in 
terms of long-term impact...should we expect some premature component failures? 
Does anyone have any stats to share? 

Thanks

--
Erik Levinson
CTO, Uberflip
416-900-3830
1183 King Street West, Suite 100
Toronto ON  M6K 3C5
www.uberflip.com
 





Re: What to expect after a cooling failure

2013-07-10 Thread Daniel Taylor
Another kind of failure I've seen connected to overheating events is AC power 
supply failure.


  








RE: What to expect after a cooling failure

2013-07-10 Thread Lorell Hathcock
Ugly.

If the batteries in the facility's power distribution system were
affected by the heat, then their life is likely significantly shortened,
both in their capacity to supply power in the event of an outage
and in their shelf life.

Lorell





Re: What to expect after a cooling failure

2013-07-10 Thread George Herbert
Numbers from memory and filed off a bit for anonymity, but

A site I was consulting with had statistically large numbers of x86 servers 
(say, 3000), SPARC enterprise gear (100), NetApp units (60) and NetApp drives 
(5000+) go through a roughly 42C excursion.  It was much hotter at ceiling 
level, but fortunately the ceilings were high (20 feet).  It came within about 
1C of the fuse temperature of the (wet-pipe) sprinkler heads... (shudder)

Both NetApp and X86 server PSUs had significantly increased failure rates for 
the next year.  Say in rough numbers 10% failed in the year.  About 2% were 
instant fails.

Hard drives had a significantly higher fail rate for the next year, also in the 
10% range.

No change in rate of motherboard or CPU or RAM failures was noted that I recall.
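
For spares planning, those rough rates can be turned into expected counts. This is only a sketch: the fleet sizes are the from-memory figures above, one PSU per chassis is an assumption made purely for illustration (real gear often has two), and the message does not say whether the 2% instant failures are included in the 10% yearly figure, so each is computed separately.

```python
# Approximate fleet sizes quoted in the message above
fleet = {"x86 servers": 3000, "NetApp units": 60, "NetApp drives": 5000}

PSU_YEAR_RATE = 0.10     # "10% failed in the year"
PSU_INSTANT_RATE = 0.02  # "About 2% were instant fails"
DRIVE_YEAR_RATE = 0.10   # "also in the 10% range"


def expected_failures(units: int, rate: float) -> int:
    """Expected number of failed units, rounded to a whole unit."""
    return round(units * rate)


# Assumption: one PSU per chassis, for illustration only
psu_count = fleet["x86 servers"] + fleet["NetApp units"]

print("PSUs failing within the year:", expected_failures(psu_count, PSU_YEAR_RATE))
print("PSUs failing immediately:    ", expected_failures(psu_count, PSU_INSTANT_RATE))
print("Drives failing in the year:  ", expected_failures(fleet["NetApp drives"], DRIVE_YEAR_RATE))
```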


George William Herbert
Sent from my iPhone




Re: What to expect after a cooling failure

2013-07-09 Thread Stefan Förster
* Erik Levinson :

[cooling failure]

> For those who have gone through such events in the past, what can
> one expect in terms of long-term impact...should we expect some
> premature component failures? Does anyone have any stats to share? 

We had a similar event (temperatures were a bit higher at 49°C,
duration was a bit shorter, 10am to 3pm) this January. In the two days
after the event, two of our HP servers had drives that went from "OK" to
"Predictive Failure", which is the SmartArray controller's way of
reporting high error rates. Two weeks after, we had a single DIMM
with an uncorrectable ECC error, causing a server reboot. Three weeks
after, a single PSU failed.

In our opinion, the disk problems were caused by the cooling failure,
while the ECC error and the faulted PSU were probably not related.

I believe that your hardware will be fine, but it probably wouldn't be
a bad idea to check if you have current maintenance contracts/warranty
for your servers, or any other way of obtaining replacement drives in
a reasonably short time.


Cheers
Stefan



Re: What to expect after a cooling failure

2013-07-09 Thread Johnny Eriksson
Jake Khuon  wrote:

> While others have already talked about what to look out for in terms of 
> systems and drives, I haven't seen anyone mention things like your UPS 
> batteries.  Were they also heat-soaked? At one place I worked at, we 
> lost a whole bank of batteries in the UPS room when it overheated.  I 
> think that was somewhere around a $95,000 replacement and required 
> rush-delivery of a lot of SLAs from all over the place.

That is one reason to have the UPS and the batteries in separate rooms.

--Johnny



Re: What to expect after a cooling failure

2013-07-09 Thread Tri Tran
I have seen DDR2 RAM give random errors from inadequate cooling. The cabinets 
were stacked to the max with servers but the doors were not meshed. DDR2 runs 
fairly hot, especially when all the banks are filled.
Tri Tran




Re: What to expect after a cooling failure

2013-07-09 Thread Mikael Abrahamsson

On Tue, 9 Jul 2013, Erik Levinson wrote:

For those who have gone through such events in the past, what can one 
expect in terms of long-term impact...should we expect some premature 
component failures? Does anyone have any stats to share?


I have experience with a different kind of event that might be of interest 
to a wider audience.


When the fire suppression system went off in a site, we had a lot of 
instant hard drive failures. I don't have any numbers, but let's say 5-10% 
of all HDDs in the room died more or less instantly. Supposedly this was 
because of the air pressure shock when the inert fire suppression gas was 
released and the vents weren't big enough to release the overpressurised 
air outside.


I did some research and there are forum posts etc about these kinds of 
events happening in other places.


So, the takeaway from this was that RAID is an uptime tool, not a substitute 
for backups, and also, get a qualified ventilation/fire suppression systems 
engineer to inspect your sites from this aspect.


--
Mikael Abrahamsson    email: swm...@swm.pp.se



Re: What to expect after a cooling failure

2013-07-09 Thread Jake Khuon

On 09/07/13 20:28, Erik Levinson wrote:


For those who have gone through such events in the past, what can one
expect in terms of long-term impact...should we expect some premature
component failures? Does anyone have any stats to share?


While others have already talked about what to look out for in terms of 
systems and drives, I haven't seen anyone mention things like your UPS 
batteries.  Were they also heat-soaked? At one place I worked at, we 
lost a whole bank of batteries in the UPS room when it overheated.  I 
think that was somewhere around a $95,000 replacement and required 
rush delivery of a lot of SLA (sealed lead-acid) batteries from all over 
the place.




Re: What to expect after a cooling failure

2013-07-09 Thread Bryan Tong
Honestly, I think your hardware will be fine. Just like everyone else said,
keep an eye on your hard drives; they are by far the most sensitive.
Anything not mechanical, if it didn't melt, you're good.

One data center we had equipment in was at 153F for about a week and all we
saw were drive failures, and they were still fairly sparse: 1 out of 10, I
would say.

Thanks




-- 

Bryan Tong
Nullivex LLC | eSited LLC
(507) 298-1624


Re: What to expect after a cooling failure

2013-07-09 Thread Jimmy Hess
On 7/9/13, Erik Levinson  wrote:
> For those who have gone through such events in the past, what can one expect
> in terms of long-term impact...should we expect some premature component
> failures? Does anyone have any stats to share?

Realistically... you had a single short-lived stress event. There
are likely to be some number of random component failures in the
future. It is unlikely that you will be able to attribute the
failures to such a short-lived stress event of that magnitude --
there might on average be a small increase over normal failure rates.

The bigger concern may be that /a lot of different components/
could have been subject to the same kind of abuse at the same time,
including sets of components that are supposed to be in a redundant
pair and not fail simultaneously.

I wouldn't necessarily be so concerned about premature failures ---
I would be more concerned that you may have redundant components
that were exposed to the same stress event at the same time; now
the assumption that their chances of failure are independent may
become more questionable --- the chance of a correlated failure in
the future might be greatly increased, reducing the level of
effective redundancy/risk reduction today.

That would apply mainly to mechanical devices such as HDDs.
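
To make the independence point concrete, here is a minimal sketch of a common-cause model for a redundant pair. All numbers are hypothetical and the model is a deliberate simplification: with probability `stress_share` the shared event weakened both members of the pair (each then fails with probability `p_weakened` over some period), otherwise each fails with the baseline probability `p_baseline`. It compares the true chance of losing both with what you would estimate from the average per-device failure rate if you assumed independence.

```python
def pair_loss_probabilities(p_baseline: float, p_weakened: float, stress_share: float):
    """Return (naive_estimate, actual) probability that both members of a pair fail.

    naive_estimate: square of the average per-device failure probability,
                    i.e. what the independence assumption would predict.
    actual:         mixture over "both weakened" and "neither weakened".
    """
    marginal = stress_share * p_weakened + (1 - stress_share) * p_baseline
    naive_estimate = marginal ** 2
    actual = stress_share * p_weakened ** 2 + (1 - stress_share) * p_baseline ** 2
    return naive_estimate, actual


# Hypothetical inputs, chosen only to show the effect
naive, actual = pair_loss_probabilities(p_baseline=0.03, p_weakened=0.30, stress_share=0.2)
print(f"assuming independence: {naive:.4%}")
print(f"with shared stress:    {actual:.4%}")
print(f"ratio:                 {actual / naive:.1f}x")
```

Whenever the weakened and baseline rates differ, the shared-stress figure comes out higher than the independence estimate, which is the reduction in effective redundancy described above.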


> Thanks
--
-JH



Re: What to expect after a cooling failure

2013-07-09 Thread Larry Sheldon

On 7/9/2013 10:28 PM, Erik Levinson wrote:

As some may know, yesterday 151 Front St suffered a cooling failure
after Enwave's facilities were flooded.

One of the suites that we're in recovered quickly but the other took
much longer and some of our gear shut down automatically due to
overheating. We remotely shut down many redundant and non-essential
systems in the hotter suite, and remotely transferred some others to
the cooler suite, to ensure that we had a minimum of all core systems
running in the hotter suite. We waited until the temperatures
returned to normal, and brought everything back online. The entire
event lasted from approx 18:45 until 01:15. Apparently ambient
temperature was above 43 degrees Celsius at one point on the cool
side of cabinets in the hotter suite.

For those who have gone through such events in the past, what can one
expect in terms of long-term impact...should we expect some premature
component failures? Does anyone have any stats to share?


No stats, but way back in the day of very large computers (1 each) in 
very large facilities, it seems like the thing we worried most about at 
restart was too-rapid cooling and the resulting condensation if the 
conditions were right.
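
Whether the conditions are "right" for condensation comes down to whether a surface is cooled below the dew point of the surrounding air. The sketch below uses the Magnus approximation for the dew point. The humidity figure is hypothetical, since none was reported in this thread, and the function names are made up for this example.

```python
import math

# Magnus approximation constants for water vapour over liquid water (degrees Celsius)
A = 17.62
B = 243.12


def dew_point_c(air_temp_c: float, rel_humidity_pct: float) -> float:
    """Approximate dew point in Celsius from air temperature and relative humidity."""
    gamma = math.log(rel_humidity_pct / 100.0) + A * air_temp_c / (B + air_temp_c)
    return B * gamma / (A - gamma)


def condensation_risk(surface_temp_c: float, air_temp_c: float, rel_humidity_pct: float) -> bool:
    """True if a surface at surface_temp_c is at or below the dew point of the air."""
    return surface_temp_c <= dew_point_c(air_temp_c, rel_humidity_pct)


# Hypothetical case: 43 C room air at 40% relative humidity, surface chilled to 20 C
print(f"dew point: {dew_point_c(43.0, 40.0):.1f} C")
print("condensation risk:", condensation_risk(20.0, 43.0, 40.0))
```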


After power-up the next thing was disk crashes that occurred on the way 
down (this was a long time ago; discs and drums are different now).


Lastly were overheat failures, which were relatively few and always in 
components with a reputation for weakness.


--
Requiescas in pace o email   Two identifying characteristics
of System Administrators:
Ex turpi causa non oritur actio  Infallibility, and the ability to
learn from their mistakes.
  (Adapted from Stephen Pinker)



Re: What to expect after a cooling failure

2013-07-09 Thread Jay Ashworth
- Original Message -
> From: "Erik Levinson" 


> For those who have gone through such events in the past, what can one
> expect in terms of long-term impact...should we expect some premature
> component failures? Does anyone have any stats to share?

If the HDDs were spinning while above rated maximum ambient intake temp,
*especially* if they're not *right out front in the intake path* (is
anything not built that way anymore?  Yeah; the back side of 45-drive
Supermicro racks, among other things), you should probably plan on doing
a preemptive replacement cycle, or at the very least, pay *very* close
attention to SMART data (smartctl / smartd), and have a good stock of 
pre-trayed replacements.
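
One way to keep that close watch is to parse the attribute table printed by `smartctl -A` (from smartmontools) and flag drives with reallocated or pending sectors. The sketch below works on captured text so it runs without a drive attached. The sample table is illustrative, not output from any machine in this thread, and it assumes the ATA attribute layout; drives behind RAID controllers or SAS drives report differently.

```python
from typing import Dict

# Attribute IDs worth watching after a heat event (a selection, not an exhaustive list)
WATCHED = {5: "Reallocated_Sector_Ct", 194: "Temperature_Celsius", 197: "Current_Pending_Sector"}


def parse_smart_attributes(text: str) -> Dict[int, int]:
    """Extract raw values of watched attributes from `smartctl -A` style text."""
    values: Dict[int, int] = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and carry at least 10 columns
        if len(fields) >= 10 and fields[0].isdigit():
            attr_id = int(fields[0])
            raw = fields[9]  # first token of RAW_VALUE
            if attr_id in WATCHED and raw.isdigit():
                values[attr_id] = int(raw)
    return values


def needs_attention(values: Dict[int, int]) -> bool:
    """Flag a drive with any reallocated or pending sectors."""
    return values.get(5, 0) > 0 or values.get(197, 0) > 0


# Illustrative sample only
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
194 Temperature_Celsius     0x0022   057   045   000    Old_age   Always       -       43 (Min/Max 21/55)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""

parsed = parse_smart_attributes(SAMPLE)
print(parsed, "needs attention:", needs_attention(parsed))
```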

Remember that you may fall in the RAID Hole if you wait for failures,
and hence lose data which isn't backed up anyway -- if more drives in a 
raid group fail *during rebuilds*, you're essentially screwed.

If your raid groups were properly dispersed across drive build dates, then
this will probably be *slightly* less dangerous, but still.
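
To put a rough number on the rebuild risk, here is a minimal sketch that converts an annual failure rate into the chance that at least one more drive in the group fails during a rebuild window. The group size, rebuild time and rates are hypothetical, and the model assumes a constant failure rate and independent drives. Drives that shared the same heat event are likely not independent, so treat the result as optimistic.

```python
HOURS_PER_YEAR = 8760


def window_failure_prob(annual_failure_rate: float, window_hours: float) -> float:
    """Chance that one drive fails within the window, assuming a constant failure rate."""
    return 1.0 - (1.0 - annual_failure_rate) ** (window_hours / HOURS_PER_YEAR)


def extra_failure_during_rebuild(surviving_drives: int,
                                 annual_failure_rate: float,
                                 rebuild_hours: float) -> float:
    """Chance that at least one surviving drive fails before the rebuild completes."""
    p = window_failure_prob(annual_failure_rate, rebuild_hours)
    return 1.0 - (1.0 - p) ** surviving_drives


# Hypothetical 12-drive group (11 survivors) with a 24-hour rebuild
for afr in (0.02, 0.10):
    risk = extra_failure_during_rebuild(surviving_drives=11,
                                        annual_failure_rate=afr,
                                        rebuild_hours=24)
    print(f"annual failure rate {afr:.0%}: {risk:.3%} chance of another failure during the rebuild")
```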

Also watch bearing-type fans.

Cheers,
-- jra
-- 
Jay R. Ashworth  Baylink   j...@baylink.com
Designer The Things I Think   RFC 2100
Ashworth & Associates http://baylink.pitas.com 2000 Land Rover DII
St Petersburg FL USA   #natog  +1 727 647 1274



Re: What to expect after a cooling failure

2013-07-09 Thread Erik Levinson
Thanks. I should also mention that most of the gear was still on but we had 
turned off many VMs on physical servers within the first 2.5 hours, so the CPU 
and hard drive I/O load was around zero on such servers. Most of the servers 
in the hotter suite had fans running at over 75% vs. about 35% in the cooler 
suite, and ambient temp was down to 32 degrees Celsius within four hours.


--
Erik Levinson
CTO, Uberflip
416-900-3830 
1183 King Street West, Suite 100
Toronto ON  M6K 3C5
www.uberflip.com
 






Re: What to expect after a cooling failure

2013-07-09 Thread Bryan Tong
Hello,

In my experience with heating issues, the only things that really degrade
quickly in the event of overheating are hard drives. If you had them spun down
they should be fine.

CPU / Memory / Motherboards will be fine.

The only other things I can think of having possible issues are PSUs, but if
they were powered off they should be fine as well. Maybe melted wires, but I
don't think it was hot enough for that.

Thanks




-- 

Bryan Tong
Nullivex LLC | eSited LLC
(507) 298-1624


What to expect after a cooling failure

2013-07-09 Thread Erik Levinson
As some may know, yesterday 151 Front St suffered a cooling failure after 
Enwave's facilities were flooded. 

One of the suites that we're in recovered quickly but the other took much 
longer and some of our gear shut down automatically due to overheating. We 
remotely shut down many redundant and non-essential systems in the hotter 
suite, and remotely transferred some others to the cooler suite, to ensure 
that we had a minimum of all core systems running in the hotter suite. We 
waited until the temperatures returned to normal, and brought everything back 
online. The entire event lasted from approx 18:45 until 01:15. Apparently 
ambient temperature was above 43 degrees Celsius at one point on the cool 
side of cabinets in the hotter suite.

For those who have gone through such events in the past, what can one expect in 
terms of long-term impact...should we expect some premature component failures? 
Does anyone have any stats to share? 

Thanks

--
Erik Levinson
CTO, Uberflip
416-900-3830
1183 King Street West, Suite 100
Toronto ON  M6K 3C5
www.uberflip.com