Re: WRITE_DMA errors on SATA drive under 5.3-RELEASE

2005-02-27 Thread cpghost
On Sun, Feb 27, 2005 at 03:53:30PM +0100, Anthony Atkielski wrote:
> messages:Feb 27 14:48:17 freebie kernel: ad10: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=4848803
> messages:Feb 27 14:48:17 freebie kernel: ad10: FAILURE - WRITE_DMA timed out

[...]

> Is there a way to work backwards from the LBA to the filesystem so that
> I can see which file was being referenced when this occurred?

Theoretically, one could use 'fsdb -r' in a scripted manner, to
generate a mapping of file names to blocks (relative to the partition
of the file system you are mapping). Once you have the blocks, you'll
need to do some arithmetic to map those blocks to LBA address ranges
(perhaps via GEOM or using data in disklabels). Finally, you'll have
to locate the range for a particular LBA address and work backwards
up to the inode #, and then to the filename(s) that link to that inode.
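The arithmetic step can be sketched roughly as follows. This is only an
illustration, not an existing utility: the partition start (sector 63), the
partition length, and the 2048-byte fragment size are hypothetical values
that would really come from the slice table, the disklabel, and the file
system's superblock. It also assumes the file system addresses its blocks in
fragment-sized units relative to the start of the partition.

```python
SECTOR_SIZE = 512  # bytes per LBA sector (assumed; typical for ATA disks)


def lba_to_fragment(lba, part_start, part_sectors, frag_size=2048):
    """Map an absolute LBA to (fragment number, sector within fragment).

    part_start   -- absolute first sector of the partition (assumption:
                    you have already derived it from slice + disklabel)
    part_sectors -- partition length in sectors
    frag_size    -- file system fragment size in bytes (hypothetical default)

    Returns None when the LBA lies outside the partition.
    """
    rel = lba - part_start
    if rel < 0 or rel >= part_sectors:
        return None
    sectors_per_frag = frag_size // SECTOR_SIZE
    return divmod(rel, sectors_per_frag)


# LBA from the kernel message; partition geometry is made up for the example.
result = lba_to_fragment(4848803, part_start=63, part_sectors=10_000_000)
print(result)
```

The fragment number obtained this way is what you would then look for in the
per-file block lists produced by the scripted fsdb run.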

Perhaps there's already a system utility or port for this? It would be
really useful!

> Anthony

Cheers,
-cpghost.

-- 
Cordula's Web. http://www.cordula.ws/
___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to "[EMAIL PROTECTED]"


Re: WRITE_DMA errors on SATA drive under 5.3-RELEASE

2005-02-27 Thread Anthony Atkielski
[EMAIL PROTECTED] writes:

> Theoretically, one could use 'fsdb -r' in a scripted manner, to
> generate a mapping of file names to blocks (relative to the partition
> of the file system you are mapping). Once you have the blocks, you'll
> need to do some arithmetic to map those blocks to LBA address ranges
> (perhaps via GEOM or using data in disklabels). Finally, you'll have
> to locate the range for a particular LBA address and work backwards
> up to the inode #, and then to the filename(s) that link to that inode.

Sounds complicated.  Surely I'm not the first person to wish for such a
utility ... in UNIXland, there seems to be a command for just about
every conceivable purpose (?).

> Perhaps there's already a system utility or port for this? It would be
> really useful!

I'm mainly worried about exactly what the system was trying to write at
the time.  It's not clear from the message whether the write succeeded
or not.

-- 
Anthony




Re: WRITE_DMA errors on SATA drive under 5.3-RELEASE

2005-02-27 Thread cpghost
On Sun, Feb 27, 2005 at 05:19:32PM +0100, Anthony Atkielski wrote:
> [EMAIL PROTECTED] writes:
> 
> > Theoretically, one could use 'fsdb -r' in a scripted manner, to
> > generate a mapping of file names to blocks (relative to the partition
> > of the file system you are mapping). Once you have the blocks, you'll
> > need to do some arithmetic to map those blocks to LBA address ranges
> > (perhaps via GEOM or using data in disklabels). Finally, you'll have
> > to locate the range for a particular LBA address and work backwards
> > up to the inode #, and then to the filename(s) that link to that inode.
> 
> Sounds complicated.  Surely I'm not the first person to wish for such a
> utility ... in UNIXland, there seems to be a command for just about
> every conceivable purpose (?).

Or you could write the missing ones :-).

Actually, it's not that hard. You need three mappings:

1. (lba address, (filesystem, block #))
2. ((filesystem, block #), (filesystem, inode #))
3. ((filesystem, inode #), (list of filenames linking to inode #))

Each of those mappings could be done and displayed by a single
utility. Combining all three into a lba2filenames program would
then be trivial.
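As a rough sketch of how the three mappings chain together (not a real
tool): the tables below are hypothetical stand-ins, including the partition
layout, the inode number, and the file name. In practice mapping 1 would be
filled from the disklabel, and mappings 2 and 3 from a walk of the file
system, for example with fsdb.

```python
SECTORS_PER_FRAGMENT = 4  # assumption: 2048-byte fragments, 512-byte sectors

# Hypothetical tables standing in for the real lookups.
PARTITIONS = [("/var", 63, 10_000_000)]  # (filesystem, first sector, sectors)
BLOCK_OWNER = {("/var", 1212185): ("/var", 4711)}       # mapping 2
INODE_NAMES = {("/var", 4711): ["/var/log/messages"]}   # mapping 3


def lba_to_block(lba):
    """Mapping 1: LBA -> (filesystem, block #), or None if no partition."""
    for fs, first, length in PARTITIONS:
        if first <= lba < first + length:
            return (fs, (lba - first) // SECTORS_PER_FRAGMENT)
    return None


def block_to_inode(fs_block):
    """Mapping 2: (filesystem, block #) -> (filesystem, inode #) or None."""
    return BLOCK_OWNER.get(fs_block)


def inode_to_names(fs_inode):
    """Mapping 3: (filesystem, inode #) -> list of file names."""
    return INODE_NAMES.get(fs_inode, [])


def lba2filenames(lba):
    """Chain the three mappings; an empty list means no file owns the LBA."""
    fs_block = lba_to_block(lba)
    if fs_block is None:
        return []
    fs_inode = block_to_inode(fs_block)
    if fs_inode is None:
        return []
    return inode_to_names(fs_inode)


print(lba2filenames(4848803))
```

An empty result is still informative: it would mean the LBA falls in free
space or file system metadata rather than in a file's data.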

> > Perhaps there's already a system utility or port for this? It would be
> > really useful!
> 
> I'm mainly worried about exactly what the system was trying to write at
> the time.  It's not clear from the message whether the write succeeded
> or not.

Yes, that's exactly my concern too.

> -- 
> Anthony

-cpghost.

-- 
Cordula's Web. http://www.cordula.ws/


Re: WRITE_DMA errors on SATA drive under 5.3-RELEASE

2005-02-27 Thread Mike Tancsa
On Sun, 27 Feb 2005 15:53:30 +0100, in sentex.lists.freebsd.questions
you wrote:

>I've gotten two messages like the ones below today on my production server
>(5.3-RELEASE):
>
>messages:Feb 27 14:48:17 freebie kernel: ad10: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=4848803
>messages:Feb 27 14:48:17 freebie kernel: ad10: FAILURE - WRITE_DMA timed out

Could be a bad sector on the drive, or bad cable. Hard to say.  Try
/usr/ports/sysutils/smartmontools/

It can read all sorts of info off the drive and help you narrow down
what the problem might be.


---Mike

Mike Tancsa, Sentex communications http://www.sentex.net
Providing Internet Access since 1994
[EMAIL PROTECTED], (http://www.tancsa.com)


Re: WRITE_DMA errors on SATA drive under 5.3-RELEASE

2005-02-27 Thread Anthony Atkielski
[EMAIL PROTECTED] writes:

> Actually, it's not that hard. You need three mappings:
>
> 1. (lba address, (filesystem, block #))
> 2. ((filesystem, block #), (filesystem, inode #))
> 3. ((filesystem, inode #), (list of filenames linking to inode #))

Seems like it would be straightforward with adequate documentation.

-- 
Anthony




Re: WRITE_DMA errors on SATA drive under 5.3-RELEASE

2005-02-27 Thread Anthony Atkielski
Mike Tancsa writes:

> Could be a bad sector on the drive, or bad cable. Hard to say.  Try
> /usr/ports/sysutils/smartmontools/
>
> It can read all sorts of info off the drive and help you narrow down
> what the problem might be.

Wow!  That is a very cool tool.  There's even a Windows port so I can
use it on my XP machine.

The two SATA drives show no errors.  The older IDE drive (which contains
the filesystem root) shows the stuff below.  There have been over 1000
read errors over the lifetime of the disk, but the disk had some hard
times back in December when it was in my overheated old server, so that
might account for part of that.  The most recent errors look like they
might correlate with what I saw today (unfortunately, I'm not sure how
to interpret them):

==
smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG SV4002H
Serial Number:    0413J1FR932555
Firmware Version: QP100-07
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 1
Local Time is:    Sun Feb 27 22:52:54 2005 CET

==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

The SMART RETURN STATUS return value (smartmontools -H option/Directive)
 can not be retrieved with this version of ATAng, please do not rely on this value
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (1560) seconds.
Offline data collection
capabilities:                    (0x1b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off
                                        support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (   8) minutes.

SMART Attributes Data Structure revision number: 9
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000a   100   100   000    Old_age   Always       -       1050
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       55
  5 Reallocated_Sector_Ct   0x0033   253   253   009    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0024   253   253   000    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       2968364
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       54
194 Temperature_Celsius     0x0022   175   145   000    Old_age   Always       -       21
197 Current_Pending_Sector  0x0033   253   253   009    Pre-fail  Always       -       0
198 Offline_Uncorrectable   0x0031   253   253   009    Pre-fail  Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000b   100   100   051    Pre-fail  Always       -       0
201 Soft_Read_Error_Rate    0x000b   100   100   051    Pre-fail  Always       -       1

SMART Error Log Version: 1
Warning: ATA error count 22 inconsistent with error log pointer 4

ATA Error Count: 22 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number R

Re: WRITE_DMA errors on SATA drive under 5.3-RELEASE

2005-02-27 Thread Mike Tancsa
On Sun, 27 Feb 2005 23:09:50 +0100, in sentex.lists.freebsd.questions
you wrote:

>Mike Tancsa writes:
>
>> Could be a bad sector on the drive, or bad cable. Hard to say.  Try
>> /usr/ports/sysutils/smartmontools/
>>
>> It can read all sorts of info off the drive and help you narrow down
>> what the problem might be.
>
>
>The two SATA drives show no errors.  The older IDE drive (which contains
>the filesystem root) shows the stuff below.  There have been over 1000
>
>Device does not support Selective Self Tests/Logging


Try running some of the tests on the SATA drives as well as run the
monitoring daemon. With any luck, it will provide a little more
information about the error condition you are seeing.
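If you end up collecting this output from several drives, the attribute
table can be scanned mechanically. The following is only a sketch: it
assumes each attribute row sits on a single line with the ten columns that
smartctl prints (ID#, name, flag, value, worst, thresh, type, updated,
when_failed, raw value), and the set of "watched" IDs is an illustrative
choice of attributes commonly associated with bad sectors (5, 197, 198) and
cable or interface trouble (199), not an official list. The sample rows are
taken from the output posted earlier in the thread.

```python
WATCHED_IDS = {5, 197, 198, 199}  # illustrative choice, not an official list


def parse_attribute_row(line):
    """Parse one smartctl attribute row into a dict, or return None."""
    fields = line.split()
    if len(fields) < 10 or not fields[0].isdigit():
        return None  # header line, blank line, or wrapped fragment
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),
        "worst": int(fields[4]),
        "thresh": int(fields[5]),
        "raw": " ".join(fields[9:]),
    }


def suspicious(rows):
    """Return watched attributes whose raw value is a nonzero integer."""
    return [
        r for r in rows
        if r["id"] in WATCHED_IDS and r["raw"].isdigit() and int(r["raw"]) > 0
    ]


SAMPLE_ROWS = [
    "ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE",
    "  1 Raw_Read_Error_Rate 0x000a 100 100 000 Old_age Always - 1050",
    "  5 Reallocated_Sector_Ct 0x0033 253 253 009 Pre-fail Always - 0",
    "197 Current_Pending_Sector 0x0033 253 253 009 Pre-fail Always - 0",
    "199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0",
]

parsed = [r for r in map(parse_attribute_row, SAMPLE_ROWS) if r is not None]
for row in suspicious(parsed):
    print(row["id"], row["name"], row["raw"])
```

With the sample rows nothing is reported, which matches the zero raw counts
for reallocated, pending, and CRC errors in the posted table.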

---Mike

Mike Tancsa, Sentex communications http://www.sentex.net
Providing Internet Access since 1994
[EMAIL PROTECTED], (http://www.tancsa.com)


Re: WRITE_DMA errors on SATA drive under 5.3-RELEASE

2005-02-27 Thread Garance A Drosihn
At 3:53 PM +0100 2/27/05, Anthony Atkielski wrote:
> I've gotten two messages like the ones below today on my
> production server (5.3-RELEASE):
>
> ... kernel: ad10: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=4848803
> ... kernel: ad10: FAILURE - WRITE_DMA timed out
>
> What do these messages mean?  The referenced drive is one of
> two identical SATA drives on the server; it holds /tmp and /var.
> I don't recall seeing these messages before.
>
> Is there a way to work backwards from the LBA to the filesystem
> so that I can see which file was being referenced when this
> occurred?

First question: which SATA controller are you using?  And what is
the make&model of the hard drives that you are using?

Note: There have been several different threads on different mailing
lists from users having WRITE_DMA errors similar to this.  At least
some of the problem is in the code which handles disk I/O.  The
developer who works the most on that code is in the middle of a
fairly major set of improvements to it, as is described in the
thread with a subject of:

    UPDATE2: ATA mkIII first official patches - please test!

on the freebsd-current and freebsd-stable mailing list.  That major
set of improvements is still being tested, but it does solve some
ATA/SATA issues for many users.  Which issues you are running into
will depend on which SATA controller you have, and the make&model
of SATA hard-disks that you have attached to the controller.

I realize that none of that info really helps you right now, but
I just thought I would say that it may be you're not having any
hardware problems.  Or at least, not on the disk itself.  It might
be a problem with the disk-controller, or it might be fairly minor
timing-problems that come up under certain kinds of load.

Of course, it still *could* be your hard disk...  Also note that I
am not an expert on hard disks or disk I/O.  It's just that I've
suffered through many similar problems, and I know that Søren has
been working on the newer, improved code for handling ATA/SATA.

--
Garance Alistair Drosehn            =   [EMAIL PROTECTED]
Senior Systems Programmer           or  [EMAIL PROTECTED]
Rensselaer Polytechnic Institute    or  [EMAIL PROTECTED]


RE: WRITE_DMA errors on SATA drive under 5.3-RELEASE

2005-02-28 Thread Ted Mittelstaedt


> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] Behalf Of Anthony
> Atkielski
> Sent: Sunday, February 27, 2005 2:10 PM
> To: freebsd-questions@freebsd.org
> Subject: Re: WRITE_DMA errors on SATA drive under 5.3-RELEASE
>
>
> Mike Tancsa writes:
>
> > Could be a bad sector on the drive, or bad cable. Hard to say.  Try
> > /usr/ports/sysutils/smartmontools/
> >
> > It can read all sorts of info off the drive and help you narrow down
> > what the problem might be.
>
> Wow!  That is a very cool tool.  There's even a Windows port so I can
> use it on my XP machine.
>
> The two SATA drives show no errors.  The older IDE drive
> (which contains
> the filesystem root) shows the stuff below.  There have been over 1000
> read errors over the lifetime of the disk, but the disk had some hard
> times back in December when it was in my overheated old server, so that
> might account for part of that.  The most recent errors look like they
> might correlate with what I saw today (unfortunately, I'm not sure how
> to interpret them):

Rule of thumb on IDE hard drives: if they show more than a few errors
with a tool like smartmon, they need to be thrown in the garbage.

Heat is the number one enemy of hard drives.  If this drive overheated,
particularly over a long time period, resistance values and semiconductor
values can shift, permanently, in the electronics of the drive.  So even
if the heads and platters are still good, you're on borrowed time with the
circuit board.  And since it's the circuit board that's dodgy, the drive
surface isn't failing, so the problems aren't going to register with
S.M.A.R.T.

Despite S.M.A.R.T., the vast majority of IDE hard drives that fail, fail
without warning.

Ted



Re: WRITE_DMA errors on SATA drive under 5.3-RELEASE

2005-02-28 Thread Anthony Atkielski
Ted Mittelstaedt writes:

> Rule of thumb on IDE hard drives, if they show more than a few errors
> with a tool like smartmon, they need to be thrown in the garbage.

Seems prudent to me, but right now I don't have the budget to replace
this drive (yes, 40 GB IDE drives are cheap, but I don't have even
that).

-- 
Anthony




Re: WRITE_DMA errors on SATA drive under 5.3-RELEASE

2005-02-28 Thread Anthony Atkielski
Garance A Drosihn writes:

> First question: which SATA controller are you using?

The controller is built into the Asus P4P800-E motherboard, and is
based on the Intel ICH5R southbridge chipset.  There's also a Promise
20378 RAID controller on board but I do NOT use it (disabled in BIOS).

> And what is the make&model of the hard drives that you are using?

The SATA drives are two identical Western Digital WD1200JD 120-GB
drives, 7200 RPM.  Device ad10 holds /tmp and /var; device ad12 holds
/usr.

There is also a third drive, an older Samsung SV4002H (40 GB), connected
to the primary IDE controller.  This drive holds the root /.

Although the error messages I've seen name ad10 (the first SATA drive),
smartctl says that no errors have occurred on either of these
drives--whereas it does show a log of errors on the third drive (ad0)
that seem to correspond mysteriously to the errors in the message.

> Note: There have been several different threads on different mailing
> lists from users having WRITE_DMA errors similar to this. At least
> some of the problem is in the code which handles disk I/O.

So I've surmised.  The problem seems to be quite rare, but since this is
a production server I worry about disk writes not being completed; I
have no easy way to tell whether writes were actually lost or not.

> I realize that none of that info really helps you right now, but
> I just thought I would say that it may be you're not having any
> hardware problems.  Or at least, not on the disk itself.  It might
> be a problem with the disk-controller, or it might be fairly minor
> timing-problems that come up under certain kinds of load.

I don't think there are any hardware problems at all.  This isn't a
terribly exotic configuration.  It's probably a bug or configuration
problem.

-- 
Anthony

