>From: Keith Roberts <ke...@karsites.net>

>On Wed, 3 Nov 2010, Lamar Owen wrote:
>> Might want to check the power supply as well.  Bad/flakey 
>> power can indeed cause damage to the drive surface; been 
>> there, done that, have two Maxtor 250GB drives with 
>> scribbled servo data to prove it.

>OK.

> I'm running the server from an APC UPS Back-UPS 650, so 
> there should not be any glitches in the power supply, should 
> there?

Probably not on the AC side, although the Back-UPS 650 isn't a full online UPS 
but a switching standby UPS.  Full online units (like the APC Symmetra 16KVA 
units I have here) rectify to DC, float the batteries at all times, and run the 
output from the inverter all of the time (unless they're switched to bypass).  
The SmartUPS 1400RM I had in front of the PC that suffered the glitchy power 
is, unless I'm mistaken, also a full online pure sinewave UPS like the 
Symmetra, and is still in service (I checked its output on my oscilloscope 
first, though).

No, I was referring to the output DC voltages (+12V, +5V, +3.3V, -5V, and -12V) 
from the power supply inside the system.  

In addition to my own personal RAID1 of 250GB drives, I also, on a different 
occasion, 
lost a RAID5 array of 15K 36GB SCSI drives in a Dell 1600SC server; testing the 
power supply showed lots of noise and complete dropouts of a few milliseconds 
duration on the drive connectors' 5V supply pins.  Completely and thoroughly 
scrambled the servo data on the Hitachi drives.  Meaning they didn't just start 
showing bad sectors; they started getting seek errors.  The 5V line on the 
drive connectors was reading an AC RMS of 4V superimposed on the +5V, yielding 
an effective DC voltage of 4V.  Happened over a period of three weeks, during 
which time I had a number of mysterious failures (the Hitachi drives were 
error-correcting so well that by the time they started reporting errors, it was 
way past too late, and it became impossible for the Hitachi drives to even 
power up).  I found that the power supply in question, upon investigation, 
provided the motherboard (where the DC power sensors on that box are) with 
clean 5V, and the drives were powered from a separate 5V rail, 
meaning the Dell management system wasn't seeing the power problems.

A simple power supply tester with a built-in meter can be bought for less than 
$20; a more thorough power analyzer will run more than that.  But even the 
simple one caught the failing Dell 1600SC supply.  It took an oscilloscope to 
test the Antec in my personal box; turned out it was a cold solder joint in the 
Antec.  A new power supply is less expensive than the equivalent labor it took 
to fix the Antec.  I keep a known good 500W ATX 12V server-grade supply (8 pin 
12V plug with adapters, and 24-pin ATX plug with 20-pin adapter) around for 
testing; that's one of the very first things I check when a PC is brought in 
that is flaky.  (The very first thing is the dust accumulation, and the second 
thing is the heatsink compound).
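
As a rough illustration of what a simple tester with a meter is checking, here 
is a minimal Python sketch that compares measured rail voltages against 
tolerance bands.  The percentages (5% on the positive rails, 10% on -12V) are 
my assumption from the ATX design guides, and the readings are made up; check 
the spec for your own supply.

```python
# Nominal rail voltages and tolerances. The percentages are assumed from the
# ATX design guides; verify them against the spec for your supply.
RAILS = {
    "+12V": (12.0, 0.05),
    "+5V": (5.0, 0.05),
    "+3.3V": (3.3, 0.05),
    "-12V": (-12.0, 0.10),
}


def out_of_tolerance(readings):
    """Return {rail: reading} for every reading outside its tolerance band."""
    bad = {}
    for rail, volts in readings.items():
        nominal, tol = RAILS[rail]
        if abs(volts - nominal) > abs(nominal) * tol:
            bad[rail] = volts
    return bad


# Hypothetical meter readings taken at a drive connector.
sample = {"+12V": 11.9, "+5V": 4.0, "+3.3V": 3.31, "-12V": -11.5}
print(out_of_tolerance(sample))
```

A DC reading like this only sees the average level.  Dropouts lasting a few 
milliseconds can pass this check, which is why the intermittent fault in the 
Antec needed an oscilloscope.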

One of the first things I do on any CentOS system I put together is install 
lm_sensors and gkrellm (gkrellm from a third-party repo).  I then enable all 
the motherboard sensors that are available in the gkrellm plugins, and run it 
(either local GUI or through ssh X forwarding to my central monitoring PC).  On 
Supermicro boards I install SuperODoctor for Linux, available on the Supermicro 
site.  The GUI runs well (there are some odd dependencies, however) and will 
e-mail you on alarm conditions that you can set.  These include fan RPM, 
temperatures, and voltages.  The CLI program isn't quite so sophisticated, but 
it can be run periodically and the result sent by e-mail for health checks.
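
For boxes without a vendor tool, a periodic check can be scripted around 
lm_sensors.  The following is only a sketch: it parses the raw text that 
`sensors -u` prints and lists features whose reading is outside the min/max 
limits the chip reports.  The sample text, chip name, and labels are 
hypothetical, and the parser assumes labels are unique across chips.

```python
def parse_sensors_u(text):
    """Map each feature label to its {subfeature: value} readings.

    Assumes the `sensors -u` layout: an unindented "label:" line followed by
    indented "name: value" lines. Labels repeated on another chip overwrite.
    """
    features = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            current = None  # a blank line ends the current chip block
            continue
        if line.startswith(" ") and current is not None:
            name, _, value = line.strip().partition(":")
            try:
                features[current][name] = float(value)
            except ValueError:
                pass  # skip anything that is not a plain number
        elif line.rstrip().endswith(":"):
            current = line.rstrip()[:-1]
            features[current] = {}
    return features


def alarms(features):
    """Return labels whose *_input value is outside [*_min, *_max]."""
    flagged = []
    for label, sub in features.items():
        value = low = high = None
        for name, v in sub.items():
            if name.endswith("_input"):
                value = v
            elif name.endswith("_min"):
                low = v
            elif name.endswith("_max"):
                high = v
        if value is None:
            continue
        if (low is not None and value < low) or (high is not None and value > high):
            flagged.append(label)
    return flagged


# Hypothetical output, for illustration only.
SAMPLE = """\
it8712-isa-0290
Adapter: ISA adapter
in0:
  in0_input: 1.216
  in0_min: 0.000
  in0_max: 4.080
+5V:
  in3_input: 4.032
  in3_min: 4.750
  in3_max: 5.250
fan1:
  fan1_input: 2500.000
  fan1_min: 1000.000
"""

print(alarms(parse_sensors_u(SAMPLE)))
```

In practice the text would come from running `sensors -u` (for example via 
Python's subprocess module) from cron, with the flagged labels mailed out.  
Limits that were never configured can show up as zero and cause false alarms, 
so set sane limits in the sensors configuration first.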

Drives that are having trouble will show up with high iowaits; run iostat (from 
the sysstat package) and look at the await result.  Long awaits mean the drive 
is having trouble (or it has firmware issues like WD's EARS and EADS drives 
have in RAID configurations).
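
If sysstat isn't installed, roughly the same await figure can be derived from 
two snapshots of /proc/diskstats: milliseconds spent on completed reads and 
writes, divided by the number of I/Os completed in the interval.  This is a 
sketch of that calculation, not iostat's actual code, and the snapshot lines 
are made-up numbers.

```python
def parse_diskstats(text):
    """Map device name to (completed I/Os, ms spent on I/O) from diskstats text."""
    stats = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 11:
            continue  # not a full diskstats line
        ios = int(f[3]) + int(f[7])   # reads completed + writes completed
        ms = int(f[6]) + int(f[10])   # ms spent reading + ms spent writing
        stats[f[2]] = (ios, ms)
    return stats


def await_ms(before, after):
    """Average ms per completed I/O between two snapshots, per device."""
    result = {}
    for dev, (ios1, ms1) in after.items():
        if dev not in before:
            continue
        ios0, ms0 = before[dev]
        done = ios1 - ios0
        result[dev] = (ms1 - ms0) / done if done else 0.0
    return result


# Hypothetical snapshots taken some seconds apart.
BEFORE = "   8       0 sda 1000 10 80000 5000 2000 20 160000 15000 0 9000 20000"
AFTER = "   8       0 sda 1100 10 88000 9000 2100 20 168000 31000 0 12000 40000"

print(await_ms(parse_diskstats(BEFORE), parse_diskstats(AFTER)))
```

On a live system, read /proc/diskstats twice with a sleep in between and feed 
both texts to parse_diskstats().  A healthy drive usually averages somewhere 
in the tens of milliseconds or less; the 100 ms in this made-up sample is the 
kind of number that would make me look harder at the drive.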
_______________________________________________
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos
