Re: is this hard disk failure?
On 09/06/11 13:46, Ron Johnson wrote:

On 06/07/2011 08:02 AM, Miles Fidelman wrote: [snip] - install SMART utilities and run smartctl -A /dev/yourdrive -- the first line is usually the raw read error rate -- if the value (last entry on the line) is anything except 0, that's a sign that your drive is failing; if it's in the 1000s, failure is imminent, it's just that your drive's internal software is hiding it from you - replace it!

Then why does smartctl give my disk a green light?
http://members.cox.net/ron.l.johnson/smart_window.png

Is that a TravelStar? Try running the extended tests and setting it for offline data collection. I've got two factory-refurbished ones that show 0 where yours shows a scary 589825. That mine had to be refurbished means they were sent back... and I've heard stories of hundreds sent back to the factory when a rollout of Ipex boxes found 1 in 5 dying during the initial imaging.

What is the raw value for Reallocated Event Count?

Cheers -- Tuttle? His name's Buttle. There must be some mistake. Mistake? [Chuckles] We don't make mistakes.

-- To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org Archive: http://lists.debian.org/4df07948.9080...@gmail.com
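[A minimal sketch of the raw-read-error-rate check discussed above. The device name /dev/sda and the sample attribute line are made up for illustration (the 589825 figure is borrowed from the message above); the column layout matches typical smartctl -A output, but vendors report raw values differently, so check your own drive's table:]

```shell
# Normally you would run:  smartctl -A /dev/sda   (hypothetical device name)
# To keep this sketch runnable without a disk, parse a canned attribute line.
sample='  1 Raw_Read_Error_Rate     0x000f   100   100   062    Pre-fail  Always       -       589825'
# The raw value is the last whitespace-separated field on the line.
raw=$(printf '%s\n' "$sample" | awk '/Raw_Read_Error_Rate/ {print $NF}')
echo "raw read error rate: $raw"
```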
Re: is this hard disk failure?
surreal firewal...@gmail.com writes:

Since this morning I have been getting strange system messages when starting the computer. I typed dmesg and found these messages:

[ 304.694936] ata4.00: status: { DRDY ERR }
[ 304.694939] ata4.00: error: { ICRC ABRT }
[ 304.694954] ata4: soft resetting link
[ 304.938280] ata4.00: configured for UDMA/33
[ 304.938293] ata4: EH complete
[ 304.970866] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
[ 304.970873] ata4.00: BMDMA stat 0x26
[ 304.970884] ata4.00: cmd 25/00:38:f6:2a:94/00:00:15:00:00/e0 tag 0 dma 28672 in
[ 304.970887]          res 51/84:18:16:2b:94/84:00:15:00:00/e0 Emask 0x30 (host bus error)

What do these messages mean? What is the solution to prevent these messages from appearing? Help!

This doesn't look like the usual hardware error from a broken hard disk: when a disk is broken, you usually get messages about sector errors. I would check all the connections (power and SATA) and try new cables. If the problem doesn't go away, it can be anything: the firmware of the drive, a problem with your mainboard, a problem with your power supply. Back up the data, replace the drive and see if the new one also shows errors like the above.
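[ICRC in these logs means a CRC error on the ATA link itself, which fits the cable diagnosis above. A quick way to see how often it fires is `dmesg | grep -c ICRC`; the sketch below uses canned lines copied from the log above so it runs anywhere, without a failing disk:]

```shell
# Count interface CRC errors. On a real system:  dmesg | grep -c ICRC
# Sample lines taken verbatim from the log quoted above:
log='[  304.694939] ata4.00: error: { ICRC ABRT }
[  304.970891] ata4.00: status: { DRDY ERR }
[  305.880408] ata4.00: error: { ICRC ABRT }'
count=$(printf '%s\n' "$log" | grep -c 'ICRC')
echo "ICRC errors seen: $count"
```

[A steadily climbing count after reseating or replacing the cable would point past the cable, toward the controller or drive electronics.]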
Re: is this hard disk failure?
On Tue, Jun 7, 2011 at 9:02 AM, Miles Fidelman mfidel...@meetinghouse.net wrote:

Ralf Mardorf wrote: For me a hard disc never gets broken without click-click-click noise before it failed, but it's very common that cables and connections fail.

By the time a disk gets to the click-click-click phase, there has been LOTS of warning - it's just that today's disks include lots of internal fault-recovery mechanisms that hide things from you, unless you run SMART diagnostics (and not just the basic smart status either).

This is not borne out by my experience, or by Google's white paper on the subject from 2007. See this study: http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en/us/papers/disk_failures.pdf

The upshot is that SMART monitoring is nowhere near 100% reliable; you're lucky if it catches even half of your drive failures in time to do anything besides rely on backups or on the rest of your RAID.
Re: is this hard disk failure?
Looks like controller failure or a broken pin/wire in the cable (more likely).

On 09.06.2011 at 20:14 lee wrote:
[snip - full quote of lee's reply, which appears above]
Re: is this hard disk failure?
On 06/07/2011 08:02 AM, Miles Fidelman wrote: [snip] - install SMART utilities and run smartctl -A /dev/yourdrive -- the first line is usually the raw read error rate -- if the value (last entry on the line) is anything except 0, that's a sign that your drive is failing; if it's in the 1000s, failure is imminent, it's just that your drive's internal software is hiding it from you - replace it!

Then why does smartctl give my disk a green light?
http://members.cox.net/ron.l.johnson/smart_window.png

-- Neither the wisest constitution nor the wisest laws will secure the liberty and happiness of a people whose manners are universally corrupt. Samuel Adams, essay in The Public Advertiser, 1749
Re: is this hard disk failure?
Ron Johnson wrote: On 06/07/2011 08:02 AM, Miles Fidelman wrote: [snip]

Then why does smartctl give my disk a green light?
http://members.cox.net/ron.l.johnson/smart_window.png

Well... smartctl isn't giving you the green light; it's your GUI that's interpreting the numbers as a green light. Personally, that raw read error rate would scare me, particularly in such a young drive.

Miles

-- In theory, there is no difference between theory and practice. Infnord practice, there is. Yogi Berra
is this hard disk failure?
Since this morning I have been getting strange system messages when starting the computer. I typed dmesg and found these messages:

[ 304.694936] ata4.00: status: { DRDY ERR }
[ 304.694939] ata4.00: error: { ICRC ABRT }
[ 304.694954] ata4: soft resetting link
[ 304.938280] ata4.00: configured for UDMA/33
[ 304.938293] ata4: EH complete
[ 304.970866] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
[ 304.970873] ata4.00: BMDMA stat 0x26
[ 304.970884] ata4.00: cmd 25/00:38:f6:2a:94/00:00:15:00:00/e0 tag 0 dma 28672 in
[ 304.970887]          res 51/84:18:16:2b:94/84:00:15:00:00/e0 Emask 0x30 (host bus error)
[ 304.970891] ata4.00: status: { DRDY ERR }
[ 304.970895] ata4.00: error: { ICRC ABRT }
[ 304.970909] ata4: soft resetting link
[ 305.218280] ata4.00: configured for UDMA/33
[ 305.218296] ata4: EH complete
[ 305.880378] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
[ 305.880385] ata4.00: BMDMA stat 0x26
[ 305.880397] ata4.00: cmd 25/00:80:fe:22:8e/00:01:15:00:00/e0 tag 0 dma 196608 in
[ 305.880399]          res 51/84:60:1e:23:8e/84:01:15:00:00/e0 Emask 0x30 (host bus error)
[ 305.880404] ata4.00: status: { DRDY ERR }
[ 305.880408] ata4.00: error: { ICRC ABRT }
[ 305.880423] ata4: soft resetting link
[ 306.126281] ata4.00: configured for UDMA/33
[ 306.126297] ata4: EH complete

What do these messages mean? What is the solution to prevent these messages from appearing? Help!

-- Harshad Joshi
Re: is this hard disk failure?
A couple of possibilities:

1. The hard disk is failing
2. Insufficient power available for your hard disk, causing it to spin up then spin down again
3. Controller error
4. Faulty connection or SATA port

The more likely possibilities are 1 and 3. If you can get another hard disk to test, that will narrow down the possibilities.

On Tue, Jun 7, 2011 at 3:47 PM, surreal firewal...@gmail.com wrote:
[snip - original message quoted in full]
Re: is this hard disk failure?
On Tue, 2011-06-07 at 16:21 +0800, Ong Chin Kiat wrote: If you can get another hard disk to test, that will narrow down the possibilities

... and before doing this, turn off the power and disconnect and reconnect all cables for this HDD, both on the HDD itself (power too) and on the mobo.

-- Ralf
Re: is this hard disk failure?
On Tue, 07 Jun 2011 13:17:29 +0530, surreal wrote:

Since this morning I have been getting strange system messages when starting the computer. I typed dmesg and found these messages:

[ 304.694936] ata4.00: status: { DRDY ERR }
[ 304.694939] ata4.00: error: { ICRC ABRT }
[ 304.694954] ata4: soft resetting link
(...)

What do you have attached to that port (ata4)?

What do these messages mean? What is the solution to prevent these messages from appearing? Help!

It can be a bad cable - or a bad connection - or even a kernel issue. I mean, it does not have to be a hard disk failure per se. Anyway, running a smartctl long test won't hurt either.

Greetings,

-- Camaleón
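[The long self-test suggested above runs in the drive's own firmware; the device name /dev/sda and the result line below are illustrative (the log format follows typical smartctl -l selftest output), and the parsing runs on a canned sample so the sketch works without a disk:]

```shell
# Start the extended self-test (takes an hour or more on a large drive):
#   smartctl -t long /dev/sda
# ...then, once it has had time to finish, read the result log:
#   smartctl -l selftest /dev/sda
# Parse a canned result line so this sketch runs without any hardware:
sample='# 1  Extended offline    Completed without error       00%      1429         -'
status=$(printf '%s\n' "$sample" | grep -o 'Completed without error' || echo 'check the full log')
echo "self-test: $status"
```

[Anything other than "Completed without error" - e.g. a read-failure entry with an LBA - is the drive itself telling you a sector is bad.]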
Re: is this hard disk failure?
On Tue, 2011-06-07 at 11:46 +0000, Camaleón wrote: It can be a bad cable - or a bad connection -

In my experience a hard disc never breaks without making click-click-click noises before it fails, but it's very common for cables and connections to fail.

A tip: if there's a warranty seal, don't break it; try to loosen it with a hairdryer. Then disassemble the cables and remount them.
Re: is this hard disk failure?
On Tue, 2011-06-07 at 13:59 +0200, Ralf Mardorf wrote: [snip]

PS: Back in the old Atari days we kept the seals and tore out the screw under the seal by force. Not every seal can be removed unscathed with a hairdryer, but usually not all screws are needed.
Re: is this hard disk failure?
Ralf Mardorf wrote: For me a hard disc never gets broken without click-click-click noise before it failed, but it's very common that cables and connections fail.

By the time a disk gets to the click-click-click phase, there has been LOTS of warning - it's just that today's disks include lots of internal fault-recovery mechanisms that hide things from you, unless you run SMART diagnostics (and not just the basic SMART status either). For example, if you have a machine that's suddenly running VERY slowly - that's a good sign that a drive is experiencing internal read errors (unless it's a laptop - a shorted battery is a good suspect). Both are lessons learned the hard way, and not forgotten.

It turns out that modern drives have onboard processors that retry reads multiple times - good for protecting data if you only have the one copy on that drive, at the expense of reduced disk access times. Not so good if:

a. you don't notice that it's happening (the disk will eventually fail hard), or,
b. you're running RAID - instead of the drive dropping out of the array, the entire array slows down as it waits for the failing drive to (eventually) respond

In either case, you'll tear your hair out trying to figure out why your machine is running slowly (is it a virus, a file lock that didn't release, etc., etc., etc.).
Lessons learned:

- if your machine is running really slowly, try a reboot -- if it reboots properly, but takes 2 times as long (or longer) to shut down and then come back up -- get very suspicious (if your patience lasts that long)
- if it's a laptop - pull the battery and try again - if everything is normal, buy yourself a new battery
- if it's a server - try booting from a liveCD (if you can, first disconnect the hard drive entirely) - if everything is normal, then you could well have a hard drive problem (or you could have a virus)
- install SMART utilities and run smartctl -A /dev/yourdrive -- the first line is usually the raw read error rate -- if the value (last entry on the line) is anything except 0, that's a sign that your drive is failing; if it's in the 1000s, failure is imminent, it's just that your drive's internal software is hiding it from you - replace it!
- if you're running RAID, be sure to purchase enterprise drives (desktop drives try very hard to read a sector, despite the delay; enterprise drives give up quickly, as they expect failure recovery to be handled by the RAID)
- you would expect software RAID (md) to detect slow drives, mark them bad, and drop them from an array -- nope, md does not keep track of delay

and, not really relevant for Debian, but a direct offshoot of learning the above lessons:

- if you're running a Mac or Windows, your system may be reporting SMART status good - but it's not really true - it's not looking at raw read errors
- there seems to be a bug in the smart utilities for Mac (as available through MacPorts and Fink) -- the smart daemon will fail periodically, with the only symptom being that every few minutes your machine will slow to a crawl (spinning beachball everywhere) for 30 seconds or so, then recover --- a really good example of taking a pre-emptive measure that causes a new problem (I can't tell you how long it took to track this one down - what with downloading every performance tracking tool I could find.)
Miles Fidelman

-- In theory, there is no difference between theory and practice. Infnord practice, there is. Yogi Berra
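[Alongside the raw read error rate mentioned in the lessons above, reallocated and pending sector counts are the attributes most worth watching. A sketch of the check, parsing canned smartctl -A lines (the values here are made up; on real hardware run `smartctl -A /dev/yourdrive`):]

```shell
# Flag any SMART attribute whose raw value (last column) is nonzero.
# Canned two-line sample; real tables have ~20 attributes.
table='  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       12'
bad=$(printf '%s\n' "$table" | awk '$NF != 0 {print $2 "=" $NF}')
echo "nonzero attributes: $bad"
```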
Re: is this hard disk failure?
On Tue, 2011-06-07 at 09:02 -0400, Miles Fidelman wrote: Ralf Mardorf wrote: For me a hard disc never gets broken without click-click-click noise before it failed, but it's very common that cables and connections fail. By the time a disk gets to the click-click-click phase,

A phase everybody knows for modern HDDs :D, but it's possible to get data even from a disk that won't release the heads anymore [1]. For the Atari I've got a 42MB SCSI disk connected to a Lacom adaptor; it sometimes needs several boots, but it's unbreakable.

there has been LOTS of warning - it's just that today's disks include lots of internal fault-recovery mechanisms that hide things from you, unless you run SMART diagnostics (and not just the basic smart status either). For example, if you have a machine that's suddenly running VERY slowly

Correct! Or rather: if Voodoo seems to have an impact on your machine, it's seldom Voodoo, but a broken HDD.

- it's a good sign that a drive is experiencing internal read errors (unless it's a laptop - a shorted battery is a good suspect). [snip]
[snip - lessons-learned list, quoted in full]
My Samsung SATA drives have been without failure for a suspiciously long time now :). I very, very often turn the computer off and on. The only bad part is the SATA connectors; a friend already planned to solder new SATA connectors onto his mobo. Note! Nobody without experience in soldering multi-layer boards should attempt this. I planned to do it too.

[1] When the heads aren't released anymore after the final click, there is still a possibility of getting them working:

- Disassemble the HDD from the case, keeping the power and data cables connected.
- With a rubber-headed mallet or something similar, knock against the HDD from several angles while rebooting again and again.
- If it doesn't work, repeat this after the HDD has rested for a week.

Dunno why this helps, but it does; perhaps different room temperatures work like gnomes.

-- Ralf
Re: is this hard disk failure?
On Tue, 07 Jun 2011, Miles Fidelman wrote: b. you're running RAID - instead of the drive dropping out of the array, the entire array slows down as it waits for the failing drive to (eventually) respond

Eh, it is worse. A failing drive _will_ drop out of the array sooner or later, and it can be very bad if it does so 'sooner' for any reason other than an imminent unit failure: there is a high probability of other device(s) deciding to also time out while the array is degraded or rebuilding, and that results in service downtime (and usually data loss). You never want discs dropping off the array due to non-immediate-failure-related performance problems; the chance of multiple drops causing an array failure is too high. You want to know the disk is slow, and to replace it in controlled conditions.

This problem is *common*. Don't do hardware RAID on regular consumer crap without SCT ERC support (aka TLER/CCTL/ERC), and don't buy expensive crap with buggy firmware that the vendor refuses to issue a public fix for to save face (but which you can get from your RAID card vendor if you are very lucky). Linux smartctl gives you access to the drive's SCT ERC page if it is supported.

Also, any device model (not a SPECIFIC device) for which firmware updates are available that reduce the effective throughput should be avoided like the plague, as that indicates the vendor has shipped models with manufacturing or component issues, and you can never be sure of what you'll get when you buy a new one. If you have already bought such a device with a known high design or manufacturing defect/weakness ratio, it depends on your luck whether you got something good or a lemon. If SMART finds *NO* issues (no increasing high-fly writes, no reallocated-sector growth), and throughput tests show the expected response, you have a good one: be happy. If either test shows any such issues, remove it from production.
Secure-erase it, apply any firmware updates if you want to use it as throw-away backup media (make sure the data is encrypted), or send it for recycling.

Linux software RAID is much more forgiving by default (and it can tune the timeout for each component device separately), and will just slow down most of the time instead of kicking component devices off the array until data loss happens. That could be useful if you got duped by the vendor and sold a defective drive that can only operate safely out-of-spec, but can still be useful to you.

-- One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie. -- The Silicon Valley Tarot

Henrique Holschuh
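[Querying and setting the SCT ERC page that Henrique mentions is done through smartctl's -l scterc option; values are in tenths of a second. The device name is hypothetical and the query output below is a canned sample (matching smartmontools' usual layout) so the parsing step runs without a drive:]

```shell
# On real hardware (these touch the drive, and not all drives support SCT ERC):
#   smartctl -l scterc /dev/sda          # query current read/write timeouts
#   smartctl -l scterc,70,70 /dev/sda    # set both to 70 tenths = 7 seconds
# Parse a canned query result so this sketch runs anywhere:
sample='SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)'
read_erc=$(printf '%s\n' "$sample" | awk '/Read:/ {print $2}')
echo "read ERC: $read_erc tenths of a second"
```

[The point of a short ERC timeout under RAID is that the drive gives up on a bad sector quickly and lets the array supply the data, instead of stalling the whole array.]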
Re: is this hard disk failure?
Henrique de Moraes Holschuh wrote: On Tue, 07 Jun 2011, Miles Fidelman wrote: [snip]

Linux software raid is much more forgiving by default (and it can tune the timeout for each component device separately), and will just slow down most of the time instead of kicking component devices off the array until dataloss happens. Could be useful if you got duped by the vendor and sold a defective drive that can only operate safely out-of-spec, but can still be useful to you.

Not necessarily the best strategy if you have enough drives to survive 2 drive failures. Sometimes it's better to have a drive drop out of the array and trigger an alarm than to have a system slow to a crawl precipitously (particularly as that makes it hard to run diagnostics to figure out which drive is bad).

Re. tuning: how? I've tried to find ways to get md to track timeouts, and have never been able to find any relevant parameters. Queries to the linux-raid list have yielded some fairly definitive-sounding statements, from folks who should know, that md doesn't have any such timeouts. If they're there, please... more information!

-- In theory, there is no difference between theory and practice. Infnord practice, there is. Yogi Berra
Re: is this hard disk failure?
On Tue, 07 Jun 2011, Miles Fidelman wrote: [snip] Not necessarily the best strategy if you have enough drives to survive 2 drive failures. Sometimes better to have a drive drop out of the array and trigger an alarm than to have a system slow to a crawl precipitously (particularly as that makes it hard to run diagnostics to figure out which drive is bad).

YMMV. I'd never do that in a RAID array with important data in it. External events that cause non-ERC disks to time out CAN and DO happen to the entire set of disks in the same enclosure (such as impact vibrations from nearby equipment or from the floor). It is a known problem in datacenters, but it can happen at home as well when a large truck passes close by, or someone bumps into the shelf/table/rack :-)

If enough of those devices go over the timeout threshold because of the external event (a threshold which is rather spartan by default on most hardware RAID cards), the array goes offline and data loss can happen. Worse, rebuilding a degraded array exercises the array at the time it is most vulnerable; it is not a safe operation unless you're rebuilding an already-redundant array (which is one of the reasons why RAID6, or anything N+2 or above, is a good idea). This is why you have to regularly scrub the array at off-peak hours or as a background operation.

Re. tuning: How? I've tried to find ways to get md to track timeouts, and never been able to find any relevant parameters.

It is not in md. It is in the libata/scsi layer. Just tune the per-device parameters, e.g.
in /sys/block/sda/device/*. AFAIK, if libata doesn't time out the device, md won't drop it off the array.

Queries to the linux-raid list have yielded some fairly definitive-sounding statements, from folks who should know, that md doesn't have any such timeouts. If they're there, please... more information!

md doesn't track performance (much, if at all), and it does not do even a decent job of scheduling reads/writes over multiple md devices that have components sharing the same physical device. It is quite simple (but not to the point of being brain-dead like dm-raid). OTOH, md really is a separate layer on top of the component devices. You can smart-test and performance-test the component devices, and change their libata/scsi-layer parameters...

-- One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie. -- The Silicon Valley Tarot

Henrique Holschuh
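[The per-device knob most relevant here is the SCSI-layer command timeout in sysfs. The device name is illustrative, and writing it needs root; the sketch stands in a temp file for the sysfs node so it runs without privileges or real hardware:]

```shell
# On a real system (as root), for a hypothetical /dev/sda:
#   cat /sys/block/sda/device/timeout      # default is usually 30 (seconds)
#   echo 180 > /sys/block/sda/device/timeout
# Simulated here with a temp file standing in for the sysfs node:
tmp=$(mktemp)
echo 30 > "$tmp"    # kernel default, in seconds
echo 180 > "$tmp"   # raised for a desktop drive without SCT ERC
val=$(cat "$tmp")
rm -f "$tmp"
echo "timeout: $val seconds"
```

[Raising the libata/scsi timeout well above the drive's worst-case internal retry time is the usual workaround when a drive can't be told to give up early via SCT ERC.]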
Re: is this hard disk failure?
Henrique de Moraes Holschuh wrote: [snip] It is not in md. It is in the libata/scsi layer. Just tune the per-device parameters, e.g. in /sys/block/sda/device/*. AFAIK, if libata doesn't time out the device, md won't drop it off the array.

Ahhh. Thanks!

-- In theory, there is no difference between theory and practice. Infnord practice, there is. Yogi Berra