On 9/30/2010 10:04 AM, Giovanni Tirloni wrote:
Hello,
Recently during an electrical maintenance, we faced a problem with
some servers that had redundant PSUs. After the power was shut down on
the circuit that serves the first PSU, the second PSU failed to keep
some servers up and they rebooted (came back normal and stayed stable
after that). Tonight the same procedure was done on the other power
circuit and the second PSU failed too (on a smaller number of
machines). These are all enterprise-level servers whose vendors will
promptly replace failed PSUs, but these PSUs were working fine as far
as we can tell. Has anyone else had this problem?
I'm looking for some advice regarding proactive PSU replacements. Is
it a common practice? We do replace disks as proactively as we can by
monitoring several performance metrics but for PSUs I'm at a loss here.
Thank you,
--
Giovanni Tirloni
gtirl...@sysdroid.com
Unfortunately, a PSU is most likely to fail under the stress of a
sudden change in state. This usually happens when the machine has been
running for a few years: you lose power, it comes back on, the supply
draws 50A of inrush current for 0.1 seconds, then *pow*. But it can
happen during failover too, as you found.
You could try asking the vendor for failure stats and whether any
particular lot/revision numbers of PSUs have this issue, since it hit
you on so many machines. You can do some of this yourself by looking at
the model number on the replaceable unit. Many times they'll have a rev
like A, B, C, etc. Check whether the failed units have anything in
common and then take that to the vendor. If you don't get anywhere, you
could try demanding replacement for all units of the same model
number/rev.
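For what it's worth, the commonality check can be a few lines of script. Below is a rough sketch in Python, assuming you have (or can build) a list of which box has which PSU model/rev and whether it dropped during the power cut. The hostnames, model names and rows are made up for illustration and are not data from this thread.

```python
from collections import Counter

# Hypothetical inventory: (hostname, PSU model, revision, failed during the power cut?)
# Every row here is a placeholder; substitute what you read off the units.
inventory = [
    ("web01", "PSU-750", "A", True),
    ("web02", "PSU-750", "A", True),
    ("web03", "PSU-750", "B", False),
    ("db01", "PSU-750", "A", False),
    ("db02", "PSU-1100", "C", False),
    ("db03", "PSU-750", "B", False),
]


def failure_rates(rows):
    """Return {(model, rev): (failed, total, rate)} for each model/rev group."""
    totals = Counter()
    failures = Counter()
    for _host, model, rev, failed in rows:
        totals[(model, rev)] += 1
        if failed:
            failures[(model, rev)] += 1
    return {
        key: (failures[key], total, failures[key] / total)
        for key, total in totals.items()
    }


rates = failure_rates(inventory)

# List the groups with the highest failure rate first.
for (model, rev), (failed, total, rate) in sorted(
    rates.items(), key=lambda kv: -kv[1][2]
):
    print(f"{model} rev {rev}: {failed}/{total} failed ({rate:.0%})")
```

If one model/rev stands out, that is the list to hand to the vendor. With only a handful of units per group the percentages are noisy, so treat them as a conversation starter rather than proof.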
Are you still under service coverage?
_______________________________________________
Tech mailing list
Tech@lopsa.org
http://lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
http://lopsa.org/