Re: WRITE_DMA errors on SATA drive under 5.3-RELEASE
On Sun, Feb 27, 2005 at 03:53:30PM +0100, Anthony Atkielski wrote:
> messages:Feb 27 14:48:17 freebie kernel: ad10: TIMEOUT - WRITE_DMA retrying
> (2 retries left) LBA=4848803
> messages:Feb 27 14:48:17 freebie kernel: ad10: FAILURE - WRITE_DMA timed out
[...]
> Is there a way to work backwards from the LBA to the filesystem so that
> I can see which file was being referenced when this occurred?

Theoretically, one could use 'fsdb -r' in a scripted manner to
generate a mapping of file names to blocks (relative to the partition
of the file system you are mapping). Once you have the blocks, you'll
need to do some arithmetic to map those blocks to LBA address ranges
(perhaps via GEOM or using data in disklabels). Finally, you'll have
to locate the range that contains a particular LBA address and work
backwards up to the inode #, and then to the filename(s) that link to
that inode.

Perhaps there's already a system utility or port for this? It would be
really useful!

> Anthony

Cheers,
-cpghost.

--
Cordula's Web. http://www.cordula.ws/
___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to "[EMAIL PROTECTED]"
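The arithmetic step described above (going from a disk-wide LBA to a block number relative to one partition) can be sketched roughly as follows. This is only an illustration: the 512-byte sector size, the 2048-byte addressing unit, and the partition start and length are all assumptions, and `lba_to_fs_block` is a made-up name. On a real system the offsets would have to come from the disklabel or GEOM, and the unit size from the file system itself.

```python
SECTOR_SIZE = 512  # bytes per LBA sector (assumed)


def lba_to_fs_block(lba, part_start_lba, part_sectors, unit_size=2048):
    """Map a disk-wide LBA to a block number relative to one partition.

    part_start_lba and part_sectors describe the partition in sectors;
    unit_size is the size in bytes of the unit the file system uses to
    number its blocks (an assumption here). Returns None if the LBA
    lies outside the partition.
    """
    if not (part_start_lba <= lba < part_start_lba + part_sectors):
        return None
    byte_offset = (lba - part_start_lba) * SECTOR_SIZE
    return byte_offset // unit_size


if __name__ == "__main__":
    # LBA taken from the kernel message; the partition geometry is invented.
    print(lba_to_fs_block(4848803, part_start_lba=63, part_sectors=10_000_000))
```

The reverse direction (block number to LBA range) is the same calculation run backwards, which is what a scripted 'fsdb -r' listing would need in order to be searched by LBA.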
Re: WRITE_DMA errors on SATA drive under 5.3-RELEASE
[EMAIL PROTECTED] writes:

> Theoretically, one could use 'fsdb -r' in a scripted manner, to
> generate a mapping of file names to blocks (relative to the partition
> of the file system you are mapping). Once you have the blocks, you'll
> need to do some arithmetic to map those blocks to LBA address ranges
> (perhaps via GEOM or using data in disklabels). Finally, you'll have
> to locate the range for a particular LBA address and work backwards
> up to the inode #, and then to the filename(s) that link to that inode.

Sounds complicated. Surely I'm not the first person to wish for such a
utility ... in UNIXland, there seems to be a command for just about
every conceivable purpose (?).

> Perhaps there's already a system utility or port for this? It would be
> really useful!

I'm mainly worried about exactly what the system was trying to write at
the time. It's not clear from the message whether the write succeeded
or not.

--
Anthony
Re: WRITE_DMA errors on SATA drive under 5.3-RELEASE
On Sun, Feb 27, 2005 at 05:19:32PM +0100, Anthony Atkielski wrote:
> [EMAIL PROTECTED] writes:
>
> > Theoretically, one could use 'fsdb -r' in a scripted manner, to
> > generate a mapping of file names to blocks (relative to the partition
> > of the file system you are mapping). Once you have the blocks, you'll
> > need to do some arithmetic to map those blocks to LBA address ranges
> > (perhaps via GEOM or using data in disklabels). Finally, you'll have
> > to locate the range for a particular LBA address and work backwards
> > up to the inode #, and then to the filename(s) that link to that inode.
>
> Sounds complicated. Surely I'm not the first person to wish for such a
> utility ... in UNIXland, there seems to be a command for just about
> every conceivable purpose (?).

Or you could write the missing ones :-).

Actually, it's not that hard. You need three mappings:

  1. (lba address, (filesystem, block #))
  2. ((filesystem, block #), (filesystem, inode #))
  3. ((filesystem, inode #), (list of filenames linking to inode #))

Each of those mappings could be built and displayed by a single utility.
Combining all three into an lba2filenames program would then be trivial.

> > Perhaps there's already a system utility or port for this? It would be
> > really useful!
>
> I'm mainly worried about exactly what the system was trying to write at
> the time. It's not clear from the message whether the write succeeded
> or not.

Yes, that's exactly my concern too.

> --
> Anthony

-cpghost.

--
Cordula's Web. http://www.cordula.ws/
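To make the "combining all three" step concrete, here is a minimal sketch of how the three mappings might chain together. Everything in it is hypothetical: the partition table, the block-to-inode and inode-to-filename tables, and the function names are invented for illustration. A real tool would have to fill these tables from the disklabel and from something like scripted fsdb output.

```python
from bisect import bisect_right

# Mapping 1 input: (start_lba, end_lba_exclusive, filesystem, sectors_per_block),
# sorted by start_lba. All numbers are invented, not read from a real disklabel.
PARTITIONS = [
    (63, 1_000_063, "/tmp", 4),
    (1_000_063, 9_000_063, "/var", 4),
]

# Mapping 2: (filesystem, block #) -> (filesystem, inode #). Invented data.
BLOCK_TO_INODE = {("/var", 962185): ("/var", 4711)}

# Mapping 3: (filesystem, inode #) -> filenames linking to it. Invented data.
INODE_TO_NAMES = {("/var", 4711): ["/var/log/messages", "/var/log/messages.link"]}


def lba_to_block(lba, partitions=PARTITIONS):
    """Mapping 1: find the partition holding the LBA and the block within it."""
    starts = [p[0] for p in partitions]
    i = bisect_right(starts, lba) - 1
    if i < 0:
        return None
    start, end, fs, sectors_per_block = partitions[i]
    if lba >= end:
        return None
    return (fs, (lba - start) // sectors_per_block)


def lba2filenames(lba):
    """Chain mappings 1, 2 and 3; return [] if any step has no answer."""
    block = lba_to_block(lba)
    if block is None:
        return []
    inode = BLOCK_TO_INODE.get(block)
    if inode is None:
        return []  # block is free or holds metadata, not file data
    return sorted(INODE_TO_NAMES.get(inode, []))


if __name__ == "__main__":
    print(lba2filenames(4848803))
```

Keeping each mapping separate, as suggested above, means each one can be checked on its own before the three are chained.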
Re: WRITE_DMA errors on SATA drive under 5.3-RELEASE
On Sun, 27 Feb 2005 15:53:30 +0100, in sentex.lists.freebsd.questions you wrote:
>I've gotten two messages like the ones below today on my production server
>(5.3-RELEASE):
>
>messages:Feb 27 14:48:17 freebie kernel: ad10: TIMEOUT - WRITE_DMA retrying (2
>retries left) LBA=4848803
>messages:Feb 27 14:48:17 freebie kernel: ad10: FAILURE - WRITE_DMA timed out

Could be a bad sector on the drive, or a bad cable. Hard to say. Try
/usr/ports/sysutils/smartmontools/

It can read all sorts of info off the drive and help you narrow down
what the problem might be.

---Mike

Mike Tancsa, Sentex communications http://www.sentex.net
Providing Internet Access since 1994
[EMAIL PROTECTED], (http://www.tancsa.com)
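When correlating kernel messages like the two quoted above with other diagnostics, it can help to pull the device name and LBA out of the log lines mechanically. The following is a rough sketch that assumes the messages always have the shape shown in this thread; the regular expressions and the `parse_ata_message` name are illustrative and do not come from any FreeBSD tool.

```python
import re

# Matches the part of the line that names the device, the kind of event
# and the ATA command, e.g. "kernel: ad10: TIMEOUT - WRITE_DMA".
ATA_MSG = re.compile(
    r"kernel: (?P<dev>ad\d+): (?P<kind>TIMEOUT|FAILURE) - (?P<cmd>[A-Z_]+)"
)
# The LBA only appears on some messages, so it is searched for separately.
LBA = re.compile(r"LBA=(\d+)")


def parse_ata_message(line):
    """Return a dict with dev, kind, cmd and lba (or None), or None if no match."""
    m = ATA_MSG.search(line)
    if m is None:
        return None
    lba = LBA.search(line)
    return {
        "dev": m.group("dev"),
        "kind": m.group("kind"),
        "cmd": m.group("cmd"),
        "lba": int(lba.group(1)) if lba else None,
    }


if __name__ == "__main__":
    sample = ("Feb 27 14:48:17 freebie kernel: ad10: TIMEOUT - WRITE_DMA "
              "retrying (2 retries left) LBA=4848803")
    print(parse_ata_message(sample))
```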
Re: WRITE_DMA errors on SATA drive under 5.3-RELEASE
[EMAIL PROTECTED] writes:

> Actually, it's not that hard. You need three mappings:
>
> 1. (lba address, (filesystem, block #))
> 2. ((filesystem, block #), (filesystem, inode #))
> 3. ((filesystem, inode #), (list of filenames linking to inode #))

Seems like it would be straightforward with adequate documentation.

--
Anthony
Re: WRITE_DMA errors on SATA drive under 5.3-RELEASE
Mike Tancsa writes:

> Could be a bad sector on the drive, or bad cable. Hard to say. Try
> /usr/ports/sysutils/smartmontools/
>
> It can read all sorts of info off the drive and help you narrow down
> what the problem might be.

Wow! That is a very cool tool. There's even a Windows port so I can
use it on my XP machine.

The two SATA drives show no errors. The older IDE drive (which contains
the filesystem root) shows the stuff below. There have been over 1000
read errors over the lifetime of the disk, but the disk had some hard
times back in December when it was in my overheated old server, so that
might account for part of that. The most recent errors look like they
might correlate with what I saw today (unfortunately, I'm not sure how
to interpret them):

==
smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG SV4002H
Serial Number:    0413J1FR932555
Firmware Version: QP100-07
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 1
Local Time is:    Sun Feb 27 22:52:54 2005 CET

==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

The SMART RETURN STATUS return value (smartmontools -H option/Directive)
can not be retrieved with this version of ATAng, please do not rely on this value

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (1560) seconds.
Offline data collection
capabilities:                    (0x1b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (   8) minutes.

SMART Attributes Data Structure revision number: 9
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000a   100   100   000    Old_age   Always       -       1050
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       55
  5 Reallocated_Sector_Ct   0x0033   253   253   009    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0024   253   253   000    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       2968364
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       54
194 Temperature_Celsius     0x0022   175   145   000    Old_age   Always       -       21
197 Current_Pending_Sector  0x0033   253   253   009    Pre-fail  Always       -       0
198 Offline_Uncorrectable   0x0031   253   253   009    Pre-fail  Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000b   100   100   051    Pre-fail  Always       -       0
201 Soft_Read_Error_Rate    0x000b   100   100   051    Pre-fail  Always       -       1

SMART Error Log Version: 1
Warning: ATA error count 22 inconsistent with error log pointer 4

ATA Error Count: 22 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number R
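smartctl normally prints the attribute listing as whitespace-separated columns, so a short script can pick out the error-related counters whose raw value is nonzero. The sketch below works on a few rows copied from the listing above. The choice of attributes to watch and the `nonzero_raw` name are assumptions made for illustration, and since raw values are vendor-specific, a nonzero value is a hint to look closer rather than proof of a failing drive.

```python
# A few rows copied from the listing above, in smartctl's column layout:
# ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x000a   100   100   000    Old_age   Always       -       1050
  5 Reallocated_Sector_Ct   0x0033   253   253   009    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0033   253   253   009    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000b   100   100   051    Pre-fail  Always       -       1
"""

# Attributes treated as error counters here (an assumption for this sketch).
WATCHED = {
    "Raw_Read_Error_Rate",
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
    "Soft_Read_Error_Rate",
}


def nonzero_raw(text, watched=WATCHED):
    """Return {attribute_name: raw_value} for watched rows with a nonzero raw value."""
    found = {}
    for line in text.splitlines():
        parts = line.split(None, 9)  # at most 10 columns; the raw value is last
        if len(parts) < 10 or not parts[0].isdigit():
            continue  # skip headers and anything that is not an attribute row
        name, raw = parts[1], parts[9].split()[0]
        if name in watched and raw.isdigit() and int(raw) != 0:
            found[name] = int(raw)
    return found


if __name__ == "__main__":
    print(nonzero_raw(SAMPLE))
```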
Re: WRITE_DMA errors on SATA drive under 5.3-RELEASE
On Sun, 27 Feb 2005 23:09:50 +0100, in sentex.lists.freebsd.questions you wrote:
>Mike Tancsa writes:
>
>> Could be a bad sector on the drive, or bad cable. Hard to say. Try
>> /usr/ports/sysutils/smartmontools/
>>
>> It can read all sorts of info off the drive and help you narrow down
>> what the problem might be.
>
>
>The two SATA drives show no errors. The older IDE drive (which contains
>the filesystem root) shows the stuff below. There have been over 1000
>
>Device does not support Selective Self Tests/Logging

Try running some of the tests on the SATA drives, and run the
monitoring daemon as well. With any luck, it will provide a little more
information about the error condition you are seeing.

---Mike

Mike Tancsa, Sentex communications http://www.sentex.net
Providing Internet Access since 1994
[EMAIL PROTECTED], (http://www.tancsa.com)
Re: WRITE_DMA errors on SATA drive under 5.3-RELEASE
At 3:53 PM +0100 2/27/05, Anthony Atkielski wrote:
> I've gotten two messages like the ones below today on my production
> server (5.3-RELEASE):
>
> ... kernel: ad10: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=4848803
> ... kernel: ad10: FAILURE - WRITE_DMA timed out
>
> What do these messages mean? The referenced drive is one of two
> identical SATA drives on the server; it holds /tmp and /var. I don't
> recall seeing these messages before.
>
> Is there a way to work backwards from the LBA to the filesystem so that
> I can see which file was being referenced when this occurred?

First question: which SATA controller are you using? And what is the
make and model of the hard drives that you are using?

Note: There have been several different threads on different mailing
lists from users having WRITE_DMA errors similar to this. At least
some of the problem is in the code which handles disk I/O. The
developer who works the most on that code is in the middle of a
fairly major set of improvements to it, as is described in the
thread with a subject of:

    UPDATE2: ATA mkIII first official patches - please test!

on the freebsd-current and freebsd-stable mailing lists. That major
set of improvements is still being tested, but it does solve some
ATA/SATA issues for many users. Which issues you are running into
will depend on which SATA controller you have, and the make and model
of the SATA hard disks that you have attached to the controller.

I realize that none of that info really helps you right now, but
I just thought I would say that it may be you're not having any
hardware problems. Or at least, not on the disk itself. It might
be a problem with the disk controller, or it might be fairly minor
timing problems that come up under certain kinds of load.

Of course, it still *could* be your hard disk...

Also note that I am not an expert on hard disks or disk I/O. It's
just that I've suffered through many similar problems, and I know
that Søren has been working on the newer, improved code for handling
ATA/SATA.

--
Garance Alistair Drosehn            =   [EMAIL PROTECTED]
Senior Systems Programmer           or  [EMAIL PROTECTED]
Rensselaer Polytechnic Institute    or  [EMAIL PROTECTED]
RE: WRITE_DMA errors on SATA drive under 5.3-RELEASE
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] Behalf Of Anthony
> Atkielski
> Sent: Sunday, February 27, 2005 2:10 PM
> To: freebsd-questions@freebsd.org
> Subject: Re: WRITE_DMA errors on SATA drive under 5.3-RELEASE
>
>
> Mike Tancsa writes:
>
> > Could be a bad sector on the drive, or bad cable. Hard to say. Try
> > /usr/ports/sysutils/smartmontools/
> >
> > It can read all sorts of info off the drive and help you narrow down
> > what the problem might be.
>
> Wow! That is a very cool tool. There's even a Windows port so I can
> use it on my XP machine.
>
> The two SATA drives show no errors. The older IDE drive
> (which contains
> the filesystem root) shows the stuff below. There have been over 1000
> read errors over the lifetime of the disk, but the disk had some hard
> times back in December when it was in my overheated old server, so that
> might account for part of that. The most recent errors look like they
> might correlate with what I saw today (unfortunately, I'm not sure how
> to interpret them):

Rule of thumb on IDE hard drives: if they show more than a few errors
with a tool like smartmon, they need to be thrown in the garbage.

Heat is the number one enemy of hard drives. If this drive overheated,
particularly over a long time period, resistance values and semiconductor
values can shift, permanently, in the electronics of the drive. So even
if the heads and platters are still good, you're on borrowed time with
the circuit board. And since it's the circuit board that's dodgy, the
drive surface isn't failing, so the problems aren't going to register
with S.M.A.R.T.

Despite S.M.A.R.T., the vast majority of IDE hard drives that fail,
fail without warning.

Ted
Re: WRITE_DMA errors on SATA drive under 5.3-RELEASE
Ted Mittelstaedt writes:

> Rule of thumb on IDE hard drives, if they show more than a few errors
> with a tool like smartmon, they need to be thrown in the garbage.

Seems prudent to me, but right now I don't have the budget to replace
this drive (yes, 40 GB IDE drives are cheap, but I don't have even
that).

--
Anthony
Re: WRITE_DMA errors on SATA drive under 5.3-RELEASE
Garance A Drosihn writes:

> First question: which SATA controller are you using?

The controller is built into the Asus P4P800-E motherboard, and is based
on the Intel ICH5R southbridge chipset. There's also a Promise 20378
RAID controller on board but I do NOT use it (disabled in BIOS).

> And what is the make&model of the hard drives that you are using?

The SATA drives are two identical Western Digital WD1200JD 120-GB
drives, 7200 RPM. Device ad10 holds /tmp and /var; device ad12 holds
/usr.

There is also a third drive, an older Samsung SV4002H (40 GB), connected
to the primary IDE controller. This drive holds the root /.

Although the error messages I've seen name ad10 (the first SATA drive),
smartctl says that no errors have occurred on either of these
drives--whereas it does show a log of errors on the third drive (ad0)
that seem to correspond mysteriously to the errors in the message.

> Note: There have been several different threads on different mailing
> lists from users having WRITE_DMA errors similar to this. At least
> some of the problem is in the code which handles disk I/O.

So I've surmised. The problem seems to be quite rare, but since this is
a production server I worry about disk writes not being completed; I
have no easy way to tell whether writes were actually lost or not.

> I realize that none of that info really helps you right now, but
> I just thought I would say that it may be you're not having any
> hardware problems. Or at least, not on the disk itself. It might
> be a problem with the disk-controller, or it might be fairly minor
> timing-problems that come up under certain kinds of load.

I don't think there are any hardware problems at all. This isn't a
terribly exotic configuration. It's probably a bug or configuration
problem.

--
Anthony