Re: is this hard disk failure?

Miles Fidelman Tue, 07 Jun 2011 06:03:45 -0700

Ralf Mardorf wrote:

For me a hard disc never gets broken without click-click-click noise
before it failed, but it's very common that cables and connections fail.

By the time a disk gets to the click-click-click phase, there has been LOTS of warning - it's just that today's disks include lots of internal fault-recovery mechanisms that hide things from you, unless you run SMART diagnostics (and not just the basic "smart status" either).

For example, if you have a machine that's suddenly running VERY slowly - it's a good sign that a drive is experiencing internal read errors (unless it's a laptop - a shorted battery is a good suspect). Both are lessons learned the hard way, and not forgotten.

Turns out that modern drives have onboard processors that retry reads multiple times - good for protecting data if you only have the one copy on that drive, at the expense of slower disk access. Not so good if:

a. you don't notice that it's happening (the disk will eventually fail hard), or,

b. you're running RAID - instead of the drive dropping out of the array, the entire array slows down as it waits for the failing drive to (eventually) respond

In either case, you'll tear your hair out trying to figure out why your machine is running slowly (is it a virus, a file lock that didn't release, etc., etc., etc.).
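One rough way to see this kind of hidden retry delay from the outside is to time individual reads and look for outliers. The following is only a minimal sketch under assumptions of my own: the function names, the 1 MiB block size and the 0.5 second threshold are all hypothetical, and reads served from the page cache will look fast regardless of the drive's health, so it is most meaningful on a freshly booted machine or on data that has not been read recently.

```python
import os
import time


def read_latencies(path, block_size=1 << 20, max_blocks=64):
    """Read up to max_blocks sequential blocks from path.

    Returns a list with the wall-clock seconds each read took.
    block_size and max_blocks are illustrative defaults, not tuned values.
    """
    latencies = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for _ in range(max_blocks):
            start = time.perf_counter()
            data = os.read(fd, block_size)
            elapsed = time.perf_counter() - start
            if not data:
                # End of file or device reached; do not record this read.
                break
            latencies.append(elapsed)
    finally:
        os.close(fd)
    return latencies


def slow_reads(latencies, threshold_s=0.5):
    """Return (block index, seconds) for reads slower than threshold_s.

    The threshold is an assumption; pick one that suits your hardware.
    """
    return [(i, t) for i, t in enumerate(latencies) if t > threshold_s]
```

Pointing read_latencies at a large file (or, as root, at a device such as /dev/sda) and passing the result to slow_reads gives a crude list of stalls. A healthy drive should show few or none; reads that repeatedly take a long time are the sort of delay described above.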


Lessons learned:

- if your machine is running really slowly, try a reboot -- if it reboots properly, but takes 2 times as long (or longer) to shut down and then come back up -- get very suspicious (if your patience lasts that long)

- if it's a laptop - pull the battery and try again - if everything is normal, buy yourself a new battery

- if it's a server - try booting from a liveCD (if you can, first disconnect the hard drive entirely) - if normal then you could well have a hard drive problem (or you could have a virus)

- install SMART utilities and run "smartctl -A /dev/<your drive>" -- the first line is usually the "raw read error" rate -- if the value (last entry on the line) is anything except 0, that's the sign that your drive is failing; if it's in the 1000s, failure is imminent. It's just that your drive's internal software is hiding it from you - replace it!

- if you're running RAID, be sure to purchase "enterprise" drives (where "desktop" drives try very hard to read a sector, despite the delay, enterprise drives give up quickly as they expect failure recovery to be handled by RAID)

- you would expect software raid (md) to detect slow drives, mark thembad, and drop them from an array -- nope, md does not keep track of delay
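The smartctl check in the list above can be scripted. This is a minimal sketch, assuming smartctl is installed and that "smartctl -A" prints its usual attribute table (numeric ID in the first column, attribute name in the second, raw value in the tenth). The function names are hypothetical. Some vendors pack other information into the raw value, so a non-zero number is best read against the rule of thumb above rather than as an absolute verdict.

```python
import subprocess


def parse_raw_values(smartctl_output):
    """Map attribute name -> raw value (as a string) from `smartctl -A` text."""
    values = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header and banner lines do not, so they are skipped.
        if len(fields) >= 10 and fields[0].isdigit():
            values[fields[1]] = fields[9]
    return values


def raw_read_error_rate(device):
    """Run `smartctl -A` on device and return the raw read error rate.

    Returns None if the attribute is missing or its raw value is not a
    plain integer. Usually needs root privileges.
    """
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True,
        text=True,
    )
    raw = parse_raw_values(result.stdout).get("Raw_Read_Error_Rate")
    if raw is not None and raw.isdigit():
        return int(raw)
    return None
```

For example, raw_read_error_rate("/dev/sda") run as root returns the number that appears as the last entry on the first attribute line, which could then be logged daily to watch for growth.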

and, not really relevant for Debian, but a direct offshoot of learning the above lessons:

- if you're running a Mac or Windows, your system may be reporting "smart status good" - but it's not really true - it's not looking at raw read errors

- there seems to be a bug in the smart utilities for Mac (as available through Macports and Fink) -- the smart daemon will fail periodically, with the only symptom being that every few minutes, your machine will slow to a crawl (spinning beachball everywhere) for 30 seconds or so, then recover --- a really good example of taking a pre-emptive measure that causes a new problem (I can't tell you how long it took to track this one down - what with downloading every performance tracking tool I could find.)



Miles Fidelman

--
In theory, there is no difference between theory and practice.
In<fnord>  practice, there is.   .... Yogi Berra



Archive: http://lists.debian.org/4dee217c.9020...@meetinghouse.net
