Re: is this hard disk failure?

2011-06-09 Thread Scott Ferguson
On 09/06/11 13:46, Ron Johnson wrote:
 On 06/07/2011 08:02 AM, Miles Fidelman wrote:
 [snip]

 - install SMART utilities and run smartctl -A /dev/your drive -- the
 first line is usually the raw read error rate -- if the value (last
 entry on the line) is anything except 0, that's the sign that your drive
 is failing, if it's in the 1000s, failure is imminent, it's just that
 your drive's internal software is hiding it from you - replace it!

 
 Then why does smartctl give my disk a green light?
 
 http://members.cox.net/ron.l.johnson/smart_window.png
 

Is that a TravelStar?
Try running the extended tests and setting it for offline data
collection. I've got two factory refurbished ones that show 0 where
yours shows a scary 589825. That mine had to be refurbished means they
were sent back... and I've heard stories of hundreds being sent back to
the factory when a rollout of Ipex boxes found 1 in 5 dying during the
initial imaging.

What is the raw value for Reallocated Event Count?

Cheers

-- 
Tuttle? His name's Buttle.
There must be some mistake.
Mistake? [Chuckles]
We don't make mistakes.


-- 
To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org 
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/4df07948.9080...@gmail.com



Re: is this hard disk failure?

2011-06-09 Thread lee
surreal firewal...@gmail.com writes:

From today morning i am getting strange kind of system messages on starting 
the computer..

 I typed dmesg and found these messages

 [  304.694936] ata4.00: status: { DRDY ERR }
 [  304.694939] ata4.00: error: { ICRC ABRT }
 [  304.694954] ata4: soft resetting link
 [  304.938280] ata4.00: configured for UDMA/33
 [  304.938293] ata4: EH complete
 [  304.970866] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
 [  304.970873] ata4.00: BMDMA stat 0x26
 [  304.970884] ata4.00: cmd 25/00:38:f6:2a:94/00:00:15:00:00/e0 tag 0 dma 
 28672 in
 [  304.970887]  res 51/84:18:16:2b:94/84:00:15:00:00/e0 Emask 0x30 
 (host bus error)

 What do these messages mean? What is the solution to prevent these messages 
 from appearing? Help!

This doesn't look like the usual hardware error from a broken hard disk:
When a disk is broken, you usually get messages about sector errors.

I would check all the connections (power and SATA) and try new
cables. If the problem doesn't go away, it can be anything, like the
firmware of the drive, a problem with your mainboard, a problem with
your power supply: Backup the data, replace the drive and see if the new
one also shows errors like the above.


--
Archive: http://lists.debian.org/87mxhq98ju@yun.yagibdah.de



Re: is this hard disk failure?

2011-06-09 Thread Nico Kadel-Garcia
On Tue, Jun 7, 2011 at 9:02 AM, Miles Fidelman
mfidel...@meetinghouse.net wrote:
 Ralf Mardorf wrote:

 For me a hard disc never gets broken without click-click-click noise
 before it failed, but it's very common that cables and connections fail.



 By the time a disk gets to the click-click-click phase, there has been LOTS
 of warning - it's just that today's disks include lots of internal
 fault-recovery mechanisms that hide things from you, unless you run SMART
 diagnostics (and not just the basic smart status either).

This is not borne out by my experience, or Google's white paper on the
subject in 2007. See this study

http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en/us/papers/disk_failures.pdf

The upshot is that SMART monitoring is nowhere near 100% reliable;
you're lucky if it catches even half of your drive failures in time to
do anything besides relying on backups or on the rest of your RAID.


-- 
Archive: 
http://lists.debian.org/banlktikv12enqg6iud-vyhhew6jx9u6...@mail.gmail.com



Re: is this hard disk failure?

2011-06-09 Thread Aenn Seidhe Priest
Looks like controller failure or a broken pin/wire in the cable (more
likely).

On 09.06.2011 at 20:14 lee wrote:

surreal firewal...@gmail.com writes:

From today morning i am getting strange kind of system messages on
starting the computer..

 I typed dmesg and found these messages

 [  304.694936] ata4.00: status: { DRDY ERR }
 [  304.694939] ata4.00: error: { ICRC ABRT }
 [  304.694954] ata4: soft resetting link
 [  304.938280] ata4.00: configured for UDMA/33
 [  304.938293] ata4: EH complete
 [  304.970866] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0
action 0x6
 [  304.970873] ata4.00: BMDMA stat 0x26
 [  304.970884] ata4.00: cmd 25/00:38:f6:2a:94/00:00:15:00:00/e0
tag 0
dma 28672 in
 [  304.970887]  res 51/84:18:16:2b:94/84:00:15:00:00/e0 Emask 0x30 (host bus error)

 What do these messages mean? What is the solution to prevent these
messages from appearing? Help!

This doesn't look like the usual hardware error from a broken hard disk:
When a disk is broken, you usually get messages about sector errors.

I would check all the connections (power and SATA) and try new
cables. If the problem doesn't go away, it can be anything, like the
firmware of the drive, a problem with your mainboard, a problem with
your power supply: Backup the data, replace the drive and see if the new
one also shows errors like the above.






--
Archive: http://lists.debian.org/201106091817170078.1d84a...@portafi.com



Re: is this hard disk failure?

2011-06-08 Thread Ron Johnson

On 06/07/2011 08:02 AM, Miles Fidelman wrote:
[snip]


- install SMART utilities and run smartctl -A /dev/your drive -- the
first line is usually the raw read error rate -- if the value (last
entry on the line) is anything except 0, that's the sign that your drive
is failing, if it's in the 1000s, failure is imminent, it's just that
your drive's internal software is hiding it from you - replace it!



Then why does smartctl give my disk a green light?

http://members.cox.net/ron.l.johnson/smart_window.png

--
Neither the wisest constitution nor the wisest laws will secure
the liberty and happiness of a people whose manners are universally
corrupt.
Samuel Adams, essay in The Public Advertiser, 1749


--

Archive: http://lists.debian.org/4df04225.4090...@cox.net



Re: is this hard disk failure?

2011-06-08 Thread Miles Fidelman

Ron Johnson wrote:

On 06/07/2011 08:02 AM, Miles Fidelman wrote:
[snip]


- install SMART utilities and run smartctl -A /dev/your drive -- the
first line is usually the raw read error rate -- if the value (last
entry on the line) is anything except 0, that's the sign that your drive
is failing, if it's in the 1000s, failure is imminent, it's just that
your drive's internal software is hiding it from you - replace it!



Then why does smartctl give my disk a green light?

http://members.cox.net/ron.l.johnson/smart_window.png


Well... smartctl isn't giving you the green light; it's your GUI that's 
interpreting the numbers as a green light.


Personally, that raw read error rate would scare me, particularly in 
such a young drive.


Miles




--
In theory, there is no difference between theory and practice.
Infnord  practice, there is.    Yogi Berra



--

Archive: http://lists.debian.org/4df05d4a.4050...@meetinghouse.net



is this hard disk failure?

2011-06-07 Thread surreal
Since this morning I am getting strange system messages when starting
the computer.

I typed dmesg and found these messages:

[  304.694936] ata4.00: status: { DRDY ERR }
[  304.694939] ata4.00: error: { ICRC ABRT }
[  304.694954] ata4: soft resetting link
[  304.938280] ata4.00: configured for UDMA/33
[  304.938293] ata4: EH complete
[  304.970866] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
[  304.970873] ata4.00: BMDMA stat 0x26
[  304.970884] ata4.00: cmd 25/00:38:f6:2a:94/00:00:15:00:00/e0 tag 0 dma
28672 in
[  304.970887]  res 51/84:18:16:2b:94/84:00:15:00:00/e0 Emask 0x30
(host bus error)
[  304.970891] ata4.00: status: { DRDY ERR }
[  304.970895] ata4.00: error: { ICRC ABRT }
[  304.970909] ata4: soft resetting link
[  305.218280] ata4.00: configured for UDMA/33
[  305.218296] ata4: EH complete
[  305.880378] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
[  305.880385] ata4.00: BMDMA stat 0x26
[  305.880397] ata4.00: cmd 25/00:80:fe:22:8e/00:01:15:00:00/e0 tag 0 dma
196608 in
[  305.880399]  res 51/84:60:1e:23:8e/84:01:15:00:00/e0 Emask 0x30
(host bus error)
[  305.880404] ata4.00: status: { DRDY ERR }
[  305.880408] ata4.00: error: { ICRC ABRT }
[  305.880423] ata4: soft resetting link
[  306.126281] ata4.00: configured for UDMA/33
[  306.126297] ata4: EH complete


What do these messages mean? What can I do to prevent these messages
from appearing? Help!

-- 
Harshad Joshi


Re: is this hard disk failure?

2011-06-07 Thread Ong Chin Kiat
Couple of possibilities:
1. Hard disk is failing
2. Insufficient power available for your hard disk, causing it to spin up
then spin down again
3. Controller error
4. Faulty connection or SATA port

The more likely possibilities are 1 and 3.

If you can get another hard disk to test, that will narrow down the
possibilities.
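One detail in the log helps narrow this down: ICRC means an interface CRC error, i.e. the data was corrupted on the link between host and drive, which usually implicates the cable, connector, or controller rather than the disk surface. A rough sketch for counting ICRC errors per ATA port; in real use you would pipe live `dmesg` output in, but here a small inline sample is parsed so the logic is visible:

```shell
# Count ICRC (interface CRC) errors per ATA port. In real use:
#   dmesg | awk ...
# Here we parse an inline sample taken from the messages in this thread.
log='[  304.694939] ata4.00: error: { ICRC ABRT }
[  304.970895] ata4.00: error: { ICRC ABRT }
[  305.880408] ata4.00: error: { ICRC ABRT }'

per_port=$(printf '%s\n' "$log" |
    awk '/ICRC/ { split($3, a, "."); count[a[1]]++ }
         END { for (p in count) print p, count[p] }')
echo "$per_port"
```

If the count keeps growing on one port while other ports stay clean, swapping the cable on that port is a cheap first experiment.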

On Tue, Jun 7, 2011 at 3:47 PM, surreal firewal...@gmail.com wrote:

 From today morning i am getting strange kind of system messages on
 starting the computer..

 I typed dmesg and found these messages

 [  304.694936] ata4.00: status: { DRDY ERR }
 [  304.694939] ata4.00: error: { ICRC ABRT }
 [  304.694954] ata4: soft resetting link
 [  304.938280] ata4.00: configured for UDMA/33
 [  304.938293] ata4: EH complete
 [  304.970866] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
 [  304.970873] ata4.00: BMDMA stat 0x26
 [  304.970884] ata4.00: cmd 25/00:38:f6:2a:94/00:00:15:00:00/e0 tag 0 dma
 28672 in
 [  304.970887]  res 51/84:18:16:2b:94/84:00:15:00:00/e0 Emask 0x30
 (host bus error)
 [  304.970891] ata4.00: status: { DRDY ERR }
 [  304.970895] ata4.00: error: { ICRC ABRT }
 [  304.970909] ata4: soft resetting link
 [  305.218280] ata4.00: configured for UDMA/33
 [  305.218296] ata4: EH complete
 [  305.880378] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
 [  305.880385] ata4.00: BMDMA stat 0x26
 [  305.880397] ata4.00: cmd 25/00:80:fe:22:8e/00:01:15:00:00/e0 tag 0 dma
 196608 in
 [  305.880399]  res 51/84:60:1e:23:8e/84:01:15:00:00/e0 Emask 0x30
 (host bus error)
 [  305.880404] ata4.00: status: { DRDY ERR }
 [  305.880408] ata4.00: error: { ICRC ABRT }
 [  305.880423] ata4: soft resetting link
 [  306.126281] ata4.00: configured for UDMA/33
 [  306.126297] ata4: EH complete


 What do these messages mean? What is the solution to prevent these messages
 from appearing? Help!

 --
 Harshad Joshi





Re: is this hard disk failure?

2011-06-07 Thread Ralf Mardorf
On Tue, 2011-06-07 at 16:21 +0800, Ong Chin Kiat wrote:

 If you can get another hard disk to test, that will narrow down the
 possibilities
... and before doing this, turn off the power and disconnect and reconnect
all cables for this HDD, both on the HDD (power too) and on the mobo.

-- Ralf







-- 
Archive: http://lists.debian.org/1307443287.4467.2.camel@debian



Re: is this hard disk failure?

2011-06-07 Thread Camaleón
On Tue, 07 Jun 2011 13:17:29 +0530, surreal wrote:

 From today morning i am getting strange kind of system messages on
 starting the computer..
 
 I typed dmesg and found these messages
 
 [  304.694936] ata4.00: status: { DRDY ERR } 
 [  304.694939] ata4.00: error: { ICRC ABRT } 
 [  304.694954] ata4: soft resetting link 

(...)

What do you have attached to that port (ata 4)? 

 What do these messages mean? What is the solution to prevent these
 messages from appearing? Help!

It can be a bad cable -or bad connection- or even a kernel issue. I mean, 
it does not have to be a hard disk failure per se. Anyway, running a 
smartctl long test won't hurt either.
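For reference, the long test is started with `smartctl -t long /dev/sdX` and its result appears later in `smartctl -l selftest /dev/sdX`. A minimal sketch of checking the result, parsing a captured sample of the self-test log rather than a live device (the sample line and hours value are invented; the column layout follows smartmontools output):

```shell
# Check the most recent SMART self-test result. Real use:
#   smartctl -t long /dev/sdX        # start the test
#   smartctl -l selftest /dev/sdX    # read the log once it finishes
# Here we parse a captured sample so the check itself is testable.
selftest='Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      1234         -'

status=$(printf '%s\n' "$selftest" |
    awk '/^# 1/ { print (/Completed without error/ ? "PASS" : "CHECK DRIVE") }')
echo "$status"
```

Any status other than "Completed without error" on the most recent entry, or a populated LBA_of_first_error column, is a reason to back up and investigate.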

Greetings,

-- 
Camaleón


-- 
Archive: http://lists.debian.org/pan.2011.06.07.11.46...@gmail.com



Re: is this hard disk failure?

2011-06-07 Thread Ralf Mardorf
On Tue, 2011-06-07 at 11:46 +, Camaleón wrote:
 It can be a bad cable -or bad connection-

For me a hard disc never got broken without click-click-click noise
before it failed, but it's very common that cables and connections fail.

A tip: If there's a warranty seal, don't break it; try to loosen it with
a hairdryer. Then disconnect the cables and reconnect them.


--
Archive: http://lists.debian.org/1307447981.4467.36.camel@debian



Re: is this hard disk failure?

2011-06-07 Thread Ralf Mardorf
On Tue, 2011-06-07 at 13:59 +0200, Ralf Mardorf wrote:
 On Tue, 2011-06-07 at 11:46 +, Camaleón wrote:
  It can be a bad cable -or bad connection-
 
 For me a hard disc never gets broken without click-click-click noise
 before it failed, but it's very common that cables and connections fail.
 
 A tip: If there's a warranty seal, don't break it, try to loose it with
 a hairdryer. Then disassemble cables and remount them.

PS: Back in the old Atari days we kept the seals and tore out the screw
under the seal by force. Not every seal can be removed unscathed with a
hairdryer, but usually not all screws are needed.



--
Archive: http://lists.debian.org/1307448211.4467.40.camel@debian



Re: is this hard disk failure?

2011-06-07 Thread Miles Fidelman

Ralf Mardorf wrote:

For me a hard disc never gets broken without click-click-click noise
before it failed, but it's very common that cables and connections fail.

   


By the time a disk gets to the click-click-click phase, there has been 
LOTS of warning - it's just that today's disks include lots of internal 
fault-recovery mechanisms that hide things from you, unless you run 
SMART diagnostics (and not just the basic smart status either).


For example, if you have a machine that's suddenly running VERY slowly - 
it's a good sign that a drive is experiencing internal read errors (unless 
it's a laptop - a shorted battery is a good suspect).  Both are lessons 
learned the hard way, and not forgotten.


Turns out that modern drives have onboard processors that retry reads 
multiple times - good for protecting data if you only have the one copy 
on that drive, at the expense of reduced disk access times.  Not so good if:


a. you don't notice that it's happening (the disk will eventually fail 
hard), or,


b. you're running RAID - instead of the drive dropping out of the array, 
the entire array slows down as it waits for the failing drive to 
(eventually) respond


In either case, you'll tear your hair out trying to figure out why your 
machine is running slowly  (is it a virus, a file lock that didn't 
release, etc., etc., etc.).


Lessons learned:

- if your machine is running really slowly, try a reboot -- if it 
reboots properly, but takes 2 times as long (or longer) to shutdown and 
then come back up -- get very suspicious (if your patience lasts that long)


- if it's a laptop - pull the battery and try again - if everything is 
normal, buy yourself a new battery


- if it's a server - try booting from a liveCD (if you can, first 
disconnect the hard drive entirely) - if normal then you could well have 
a hard drive problem (or you could have a virus)


- install SMART utilities and run smartctl -A on your drive -- the 
first line is usually the raw read error rate -- if the value (last 
entry on the line) is anything except 0, that's a sign that your drive 
is failing; if it's in the 1000s, failure is imminent - it's just that 
your drive's internal software is hiding it from you - replace it!


- if you're running RAID, be sure to purchase enterprise drives (desktop 
drives try very hard to read a sector, despite the delay; enterprise 
drives give up quickly, as they expect failure recovery to be handled by 
RAID)


- you would expect software raid (md) to detect slow drives, mark them 
bad, and drop them from an array -- nope, md does not keep track of delay
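The smartctl check in the list above can be scripted. A hedged sketch, parsing a captured sample of `smartctl -A` output rather than a live device (the attribute values here are invented; the column layout is smartmontools'):

```shell
# Extract the raw Raw_Read_Error_Rate (last column) from smartctl -A output.
# Real use: smartctl -A /dev/sdX | awk ...; here we use an inline sample.
sample='ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       589825
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0'

raw=$(printf '%s\n' "$sample" | awk '$2 == "Raw_Read_Error_Rate" { print $NF }')
echo "Raw_Read_Error_Rate: $raw"
if [ "$raw" -gt 0 ]; then
    echo "WARNING: non-zero raw read error rate"
fi
```

One caveat: on some drive families the raw value of this attribute is vendor-encoded and non-zero by design, so treat a non-zero value as a prompt for a closer look (self-tests, reallocated sector counts) rather than proof of failure.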


and, not really relevant for Debian, but a direct offshoot of learning 
the above lessons:


- if you're running a Mac or Windows, your system may be reporting 
SMART status good - but it's not really true - it's not looking at raw 
read errors


- there seems to be a bug in the smart utilities for Mac (as available 
through MacPorts and Fink) -- the smart daemon will fail periodically, 
with the only symptom being that every few minutes, your machine will 
slow to a crawl (spinning beachball everywhere) for 30 seconds or so, 
then recover --- a really good example of taking a pre-emptive measure 
that causes a new problem (I can't tell you how long it took to track 
this one down - what with downloading every performance tracking tool I 
could find.)



Miles Fidelman

--
In theory, there is no difference between theory and practice.
Infnord  practice, there is.    Yogi Berra



--

Archive: http://lists.debian.org/4dee217c.9020...@meetinghouse.net



Re: is this hard disk failure?

2011-06-07 Thread Ralf Mardorf
On Tue, 2011-06-07 at 09:02 -0400, Miles Fidelman wrote:
 Ralf Mardorf wrote:
  For me a hard disc never gets broken without click-click-click noise
  before it failed, but it's very common that cables and connections fail.
 
 
 
 By the time a disk gets to the click-click-click phase,

A phase everybody knows for modern HDDs :D, but it's possible to get data
even from a disk that won't release the heads anymore [1].
For the Atari I've got a 42MB SCSI drive connected to a Lacom adapter; it
sometimes needs several boots, but it's unbreakable.

  there has been 
 LOTS of warning - it's just that today's disks include lots of internal 
 fault-recovery mechanisms that hide things from you, unless you run 
 SMART diagnostics (and not just the basic smart status either).
 
 For example, if you have a machine that's suddenly running VERY slowly

Correct! Put another way: if voodoo seems to have an impact on your
machine, it seldom is voodoo, but a broken HDD.

  - 
 it's good sign that a drive is experiencing internal read errors (unless 
 it's a laptop - a shorted battery is a good suspect).  Both are lessons 
 learned the hard way, and not forgotten.
 
 Turns out that modern drives have onboard processors that retry reads 
 multiple times - good for protecting data if you only have the one copy 
 on that drive, at the expense of reduced disk access times.  Not so good if:
 
 a. you don't notice that it's happening (the disk will eventually fail 
 hard), or,
 
 b. you're running RAID - instead of the drive dropping out of the array, 
 the entire array slows down as it waits for the failing drive to 
 (eventually) respond
 
 In either case, you'll tear your hair out trying to figure out why your 
 machine is running slowly  (is it a virus, a file lock that didn't 
 release, etc., etc., etc.).
 
 Lessons learned:
 
 - if your machine is running really slowly, try a reboot -- if it 
 reboots properly, but takes 2 times as long (or longer) to shutdown and 
 then come back up -- get very suspicious (if your patience lasts that long)
 
 - if it's a laptop - pull the battery and try again - if everything is 
 normal, buy yourself a new battery
 
 - if it's a server - try booting from a liveCD (if you can, first 
 disconnect the hard drive entirely) - if normal then you could well have 
 a hard drive problem (or you could have a virus)
 
 - install SMART utilities and run smartctl -A /dev/your drive -- the 
 first line is usually the raw read error rate -- if the value (last 
 entry on the line) is anything except 0, that's the sign that your drive 
 is failing, if it's in the 1000s, failure is imminent, it's just that 
 your drive's internal software is hiding it from you - replace it!
 
 - if you're running RAID, be sure to purchase enterprise drives (where 
 desktop try very hard to read a sector, despite the delay; enterprise 
 drives give up quickly as they expect failure recovery to be handled by 
 RAID)
 
 - you would expect software raid (md) to detect slow drives, mark them 
 bad, and drop them from an array -- nope, md does not keep track of delay
 
 and, not really relevant for Debian, but a direct offshoot of learning 
 the above lessons:
 
 - if you're running a Mac or Windows, you're system may be reporting 
 smart status good - but it's not really true - it's not looking at raw 
 read errors
 
 - there seems to be a bug in the smart utilities for Mac (as available 
 through Macports and Fink) -- the smart daemon will fail periodically, 
 with the only symptom being that every few minutes, you're machine will 
 slow to a crawl (spinning beachball everywhere) for 30 seconds or so, 
 then recover --- a really good example of taking a pre-emptive measure 
 that causes a new problem (I can't tell you how long it took to track 
 this one down - what with downloading every performance tracking tool I 
 could find.)
 
 
 Miles Fidelman
 
 -- 
 In theory, there is no difference between theory and practice.
 Infnord  practice, there is.    Yogi Berra

My Samsung SATA drives have been without failure for a suspiciously long
time now :). I very, very often turn the computer off and on.
The only bad part are the SATA connectors; a friend already planned to
solder new SATA connectors onto his mobo. Note! Nobody without experience
in soldering multi-layer boards should attempt this. I planned to do it
too.

[1] When the heads aren't released anymore after the final click, there
is still a possibility to get them working.

- Remove the HDD from the case, keeping the power and data cables
connected.
- With a rubber-headed mallet or something similar, knock against the HDD
from several angles while rebooting again and again.
- If it doesn't work, repeat this after the HDD has rested for a week.
Dunno why this helps, but it does; perhaps different room temperatures
work like gnomes.

-- Ralf


-- 

Re: is this hard disk failure?

2011-06-07 Thread Henrique de Moraes Holschuh
On Tue, 07 Jun 2011, Miles Fidelman wrote:
 b. you're running RAID - instead of the drive dropping out of the
 array, the entire array slows down as it waits for the failing drive
 to (eventually) respond

Eh, it is worse.

A failing drive _will_ drop out of the array sooner or later, and it can
be very bad if it does so 'sooner' for any reason other than an
imminent unit failure:  there is a high probability of other device(s)
deciding to also time out while the array is degraded or rebuilding, and
that results in service downtime (and usually data loss).

You never want discs dropping off the array due to
non-immediate-failure-related performance problems, the chance of
multiple drops causing an array failure is too high.  You want to know
the disk is slow, and to replace it in controlled conditions.

This problem is *common*.  Don't do hardware RAID on regular consumer
crap without SCT ERC support (aka TLER/CCTL/ERC), and don't buy
expensive crap with buggy firmware that the vendor refuses to issue a
public fix for to save face (but which you can get from your RAID card
vendor if you are very lucky).  Linux smartctl gives you access to the
drive's SCT ERC page if it is supported.
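For reference, `smartctl -l scterc /dev/sdX` reports the current SCT ERC read/write timeouts (in deciseconds), and `smartctl -l scterc,70,70 /dev/sdX` sets both to 7 seconds, a common choice for RAID members. A sketch that parses a captured sample of that report (device name and values are placeholders):

```shell
# Convert SCT ERC timeouts (reported in deciseconds) to seconds.
# Real use: smartctl -l scterc /dev/sdX; here we parse a captured sample.
scterc='SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)'

parsed=$(printf '%s\n' "$scterc" |
    awk '/Read:|Write:/ { printf "%s %d s\n", $1, $2 / 10 }')
echo "$parsed"
# To set a 7-second limit on both (assumes the drive supports SCT ERC):
#   smartctl -l scterc,70,70 /dev/sdX
```

Drives that lack SCT ERC support report "SCT Error Recovery Control command not supported" instead, which is exactly the "regular consumer crap" case described above.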

Also, any device model (not a SPECIFIC device) for which firmware
updates are available that reduce the effective throughput should be
avoided like the plague, as that indicates they have shipped models with
manufacturing or component issues, and you can never be sure of what
you'll get when you buy a new one.

If you already have bought such a device with known high design or
manufacturing defects/weakness ratio, it depends on your luck whether
you got something good or a lemon.  If SMART finds *NO* issues (no
increasing High Fly Writes, no growing reallocated sector count), and throughput
tests show the expected response, you have a good one: be happy.

If either test shows any such issues, remove it from production.
Secure-erase it, apply any firmware updates if you want to use it as
throw-away backup media (make sure the data is encrypted), or send it
for recycling.

Linux software raid is much more forgiving by default (and it can tune
the timeout for each component device separately), and will just slow
down most of the time instead of kicking component devices off the array
until dataloss happens.  Could be useful if you got duped by the vendor
and sold a defective drive that can only operate safely out-of-spec, but
can still be useful to you.

-- 
  One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie. -- The Silicon Valley Tarot
  Henrique Holschuh


-- 
Archive: http://lists.debian.org/20110607152700.gb1...@khazad-dum.debian.net



Re: is this hard disk failure?

2011-06-07 Thread Miles Fidelman

Henrique de Moraes Holschuh wrote:

On Tue, 07 Jun 2011, Miles Fidelman wrote:
   

b. you're running RAID - instead of the drive dropping out of the
array, the entire array slows down as it waits for the failing drive
to (eventually) respond
 



Linux software raid is much more forgiving by default (and it can tune
the timeout for each component device separately), and will just slow
down most of the time instead of kicking component devices off the array
until dataloss happens.  Could be useful if you got duped by the vendor
and sold a defective drive that can only operate safely out-of-spec, but
can still be useful to you.
   


Not necessarily the best strategy if you have enough drives to survive 2 
drive failures.  Sometimes better to have a drive drop out of the array 
and trigger an alarm than to have a system slow to a crawl precipitously 
(particularly as that makes it hard to run diagnostics to figure out 
which drive is bad).


Re. tuning:  How?  I've tried to find ways to get md to track timeouts, 
and never been able to find any relevant parameters.  Queries to the 
linux-raid list have yielded some fairly definitive sounding statements, 
from folks who should know, that md doesn't have any such timeouts.  If 
they're there, please.. more information!







--
In theory, there is no difference between theory and practice.
Infnord  practice, there is.    Yogi Berra



--

Archive: http://lists.debian.org/4dee4643.3060...@meetinghouse.net



Re: is this hard disk failure?

2011-06-07 Thread Henrique de Moraes Holschuh
On Tue, 07 Jun 2011, Miles Fidelman wrote:
 Linux software raid is much more forgiving by default (and it can tune
 the timeout for each component device separately), and will just slow
 down most of the time instead of kicking component devices off the array
 until dataloss happens.  Could be useful if you got duped by the vendor
 and sold a defective drive that can only operate safely out-of-spec, but
 can still be useful to you.
 
 Not necessarily the best strategy if you have enough drives to
 survive 2 drive failures.  Sometimes better to have a drive drop out
 of the array and trigger an alarm than to have a system slow to a
 crawl precipitously (particularly as that makes it hard to run
 diagnostics to figure out which drive is bad).

YMMV.  I'd never do that in a RAID array with important data in it.

External events that cause non-ERC disks to time out CAN and DO happen to
the entire set of disks in the same enclosure (such as impact vibrations
from nearby equipment or from the floor).  It is a known problem in
datacenters, but it can happen at home as well when a large truck passes
close by, or someone bumps into the shelf/table/rack :-)

If enough of those devices go over the timeout threshold because of the
external event (a threshold which is rather spartan by default on most
hardware RAID cards), the array goes offline and data loss can happen.

Worse, rebuilding a degraded array exercises the array at the time it is
most vulnerable; it is not a safe operation unless you're rebuilding an
already redundant array (which is one of the reasons why RAID6 or anything
N+2 or above is a good idea).  This is why you have to regularly scrub the
array at off-peak hours or as a background operation.

 Re. tuning:  How?  I've tried to find ways to get md to track
 timeouts, and never been able to find any relevant parameters.

It is not in md.  It is in the libata/scsi layer.  Just tune the per-device
parameters, e.g. in /sys/block/sda/device/*

AFAIK, if libata doesn't time out the device, md won't drop it off the
array.
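A sketch of what that per-device tuning looks like. On a live system the file is /sys/block/<dev>/device/timeout (in seconds; the kernel default is typically 30). The helper below takes the path as a parameter so it can be exercised against a stand-in file instead of real sysfs:

```shell
# Raise the SCSI command timeout for one component device of an md array,
# so a slow-but-alive drive is tolerated instead of timed out and dropped.
# On a live system: set_scsi_timeout /sys/block/sda/device/timeout 180
set_scsi_timeout() {
    path="$1"; seconds="$2"
    [ -w "$path" ] || { echo "cannot write $path" >&2; return 1; }
    echo "$seconds" > "$path"
}

# Exercise the helper against a temporary stand-in for the sysfs file.
tmp=$(mktemp)
echo 30 > "$tmp"              # mimic the usual kernel default
set_scsi_timeout "$tmp" 180   # be patient with a slow drive
new=$(cat "$tmp")
echo "timeout is now ${new}s"
rm -f "$tmp"
```

Note the value does not persist across reboots; on a real system you would reapply it from a udev rule or boot script.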

 Queries to the linux-raid list have yielded some fairly definitive
 sounding statements, from folks who should know, that md doesn't
 have any such timeouts.  If they're there, please.. more
 information!

md doesn't track performance (much, if at all), it does not do even a decent
job of scheduling reads/writes over multiple md devices that have components
that share the same physical device.   It is quite simple (but not to the
point of being brain-dead like dm-raid).

OTOH, md really is a separate layer on top of the component devices. You can
smart-test and performance-test the component devices, change their
libata/scsi layer parameters...

-- 
  One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie. -- The Silicon Valley Tarot
  Henrique Holschuh


-- 
Archive: http://lists.debian.org/20110607160627.gd1...@khazad-dum.debian.net



Re: is this hard disk failure?

2011-06-07 Thread Miles Fidelman

Henrique de Moraes Holschuh wrote

Re. tuning:  How?  I've tried to find ways to get md to track
timeouts, and never been able to find any relevant parameters.
 

It is not in md.  It is in the libata/scsi layer.  Just tune the per-device
parameters, e.g. in /sys/block/sda/device/*

AFAIK, if libata doesn't time out the device, md won't drop it off the
array.

   


Ahhh Thanks!

--
In theory, there is no difference between theory and practice.
Infnord  practice, there is.    Yogi Berra



--

Archive: http://lists.debian.org/4dee53ae.2050...@meetinghouse.net