Ralf Mardorf wrote:
> For me a hard disc never gets broken without click-click-click noise
> before it failed, but it's very common that cables and connections fail.
By the time a disk gets to the click-click-click phase, there has been
LOTS of warning - it's just that today's disks include lots of internal
fault-recovery mechanisms that hide things from you, unless you run
SMART diagnostics (and not just the basic "smart status" either).
For example, if you have a machine that's suddenly running VERY slowly -
it's a good sign that a drive is experiencing internal read errors (unless
it's a laptop - then a shorted battery is a good suspect). Both are lessons
learned the hard way, and not forgotten.
Turns out that modern drives have onboard processors that retry reads
multiple times - good for protecting data if you only have the one copy
on that drive, at the expense of slower disk access. Not so good if:
a. you don't notice that it's happening (the disk will eventually fail
hard), or,
b. you're running RAID - instead of the drive dropping out of the array,
the entire array slows down as it waits for the failing drive to
(eventually) respond
In either case, you'll tear your hair out trying to figure out why your
machine is running slowly (is it a virus, a file lock that didn't
release, etc., etc., etc.).
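One rough way to tell "the disk is stalling" apart from the other suspects is to time raw reads and look for chunks that take far longer than the rest. The sketch below is only an illustration: the function name, chunk size, and half-second threshold are arbitrary assumptions, and reads served from the page cache will look fast regardless of the disk's health, so point it at a large file you haven't touched recently (or, as root, at the block device itself).

```python
import time


def time_reads(path, chunk_size=1024 * 1024, slow_threshold=0.5):
    """Read `path` sequentially and report chunks that were slow to arrive.

    Returns (total_bytes_read, slow), where slow is a list of
    (byte_offset, seconds) for every read taking >= slow_threshold seconds.
    """
    slow = []
    offset = 0
    # buffering=0 opens a raw, unbuffered file object, so each read()
    # corresponds to a single read request to the operating system.
    with open(path, "rb", buffering=0) as f:
        while True:
            start = time.monotonic()
            chunk = f.read(chunk_size)
            elapsed = time.monotonic() - start
            if not chunk:
                break  # end of file
            if elapsed >= slow_threshold:
                slow.append((offset, elapsed))
            offset += len(chunk)
    return offset, slow


if __name__ == "__main__":
    import sys

    total, slow = time_reads(sys.argv[1])
    print(f"read {total} bytes, {len(slow)} slow chunk(s)")
    for off, secs in slow:
        print(f"  offset {off}: {secs:.2f} s")
```

A healthy drive should report few or no slow chunks on a sequential read; a cluster of multi-second reads at particular offsets is consistent with the drive retrying internally, though it isn't proof by itself.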
Lessons learned:
- if your machine is running really slowly, try a reboot -- if it
reboots properly, but takes twice as long (or longer) to shut down and
then come back up -- get very suspicious (if your patience lasts that long)
- if it's a laptop - pull the battery and try again - if everything is
normal, buy yourself a new battery
- if it's a server - try booting from a liveCD (if you can, first
disconnect the hard drive entirely) - if normal then you could well have
a hard drive problem (or you could have a virus)
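- install SMART utilities and run "smartctl -A /dev/<your drive>" -- the
first line is usually the raw read error rate (Raw_Read_Error_Rate) -- if
the raw value (last entry on the line) is anything except 0, that's a sign
that your drive is failing; if it's in the 1000s, failure is imminent, it's
just that your drive's internal software is hiding it from you - replace it!
(Caveat: some vendors, Seagate for one, pack other counters into that raw
field, so also look at Reallocated_Sector_Ct and Current_Pending_Sector
before condemning a drive.)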
- if you're running RAID, be sure to purchase "enterprise" drives
("desktop" drives try very hard to read a sector, despite the delay;
enterprise drives give up quickly, as they expect failure recovery to be
handled by RAID)
- you would expect software raid (md) to detect slow drives, mark them
bad, and drop them from an array -- nope, md does not keep track of delay.
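The smartctl check in the list above is easy to script so it can run from cron instead of relying on memory. What follows is a minimal sketch, assuming the usual ten-column attribute table that "smartctl -A" prints for ATA drives; the function names and the choice of watched attributes are illustrative, and the sample text uses made-up numbers rather than output from a real drive.

```python
import subprocess

# Attributes worth a look; this particular selection is an assumption.
WATCHED = ("Raw_Read_Error_Rate", "Reallocated_Sector_Ct",
           "Current_Pending_Sector")


def parse_smart_attributes(text):
    """Map attribute name -> raw value from `smartctl -A` output."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; the raw value is column 10.
        # Rows whose raw value is not a plain integer are skipped.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs


def nonzero_watched(attrs, watched=WATCHED):
    """Return the watched attributes whose raw value is not zero."""
    return {name: attrs[name] for name in watched if attrs.get(name, 0) != 0}


def read_smart(device):
    """Run smartctl on `device` (needs smartmontools, usually root)."""
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True, check=False)
    return parse_smart_attributes(result.stdout)


# Made-up sample for illustration only, not taken from a real drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1432
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       8
"""

if __name__ == "__main__":
    print(nonzero_watched(parse_smart_attributes(SAMPLE)))
```

On a live system, nonzero_watched(read_smart("/dev/sda")) would do the same against a real drive, with the device name being whatever applies on your machine. Treat a nonzero result as a prompt to look closer, not as a verdict.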
And, not really relevant for Debian, but a direct offshoot of learning
the above lessons:
- if you're running a Mac or Windows, your system may be reporting
"smart status good" - but it's not really true - it's not looking at raw
read errors
- there seems to be a bug in the smart utilities for Mac (as available
through Macports and Fink) -- the smart daemon will fail periodically,
with the only symptom being that every few minutes, your machine will
slow to a crawl (spinning beachball everywhere) for 30 seconds or so,
then recover -- a really good example of taking a pre-emptive measure
that causes a new problem (I can't tell you how long it took to track
this one down - what with downloading every performance tracking tool I
could find.)
Miles Fidelman
--
In theory, there is no difference between theory and practice.
In<fnord> practice, there is. .... Yogi Berra