Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-02-14 Thread Remco van Bekkum
On Tue, Feb 12, 2008 at 01:03:35AM +0100, Torfinn Ingolfsen wrote:
 On Mon, 11 Feb 2008 13:00:57 +0100
 [EMAIL PROTECTED] (Remco van Bekkum) wrote:
 
  here? It's on an amd64, Asus m2a-vm with ati xp600, AMD BE-2350 CPU,
  2GB 800MHz RAM.
 
 FWIW, I have almost the same motherboard (m2a-vm hdmi) with an AMD
 Phenom 9500 and 4GB RAM[1].  Different disk, though. The (single) disk
 drive has worked without problems so far. I'm using standard ufs2
 filesystems on that disk. I'm running RELENG_7:
 [EMAIL PROTECTED] uname -a
 FreeBSD kg-vm.kg4.no 7.0-PRERELEASE FreeBSD 7.0-PRERELEASE #6: Sat Jan
 26 20:58:51 CET 2008
 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/GENERIC  amd64
 [EMAIL PROTECTED] atacontrol list
 ATA channel 0:
 Master: acd0 Optiarc DVD RW AD-5170A/1.12 ATA/ATAPI revision 0
 Slave:   no device present
 ATA channel 2:
 Master:  ad4 SAMSUNG HD501LJ/CR100-12 Serial ATA II
 Slave:   no device present
 ATA channel 3:
 Master:  no device present
 Slave:   no device present
 ATA channel 4:
 Master:  no device present
 Slave:   no device present
 ATA channel 5:
 Master:  no device present
 Slave:   no device present
 
 
 References:
 1) http://tingox.googlepages.com/asus_m2a-vm_hdmi_freebsd
 -- 
 Regards,
 Torfinn Ingolfsen
 

Thanks, here is some more detailed info from me:

xaero# dmesg | grep atapci
atapci0: ATI AHCI controller port
0xfc00-0xfc07,0xf800-0xf803,0xf400-0xf407,0xf000-0xf003,0xec00-0xec0f
mem 0xfe02f000-0xfe02f3ff irq 22 at device 18.0 on pci0
atapci0: [ITHREAD]
atapci0: AHCI Version 01.10 controller with 4 ports detected
ata2: ATA channel 0 on atapci0
ata3: ATA channel 1 on atapci0
ata4: ATA channel 2 on atapci0
ata5: ATA channel 3 on atapci0
atapci1: ATI IXP600 UDMA133 controller port
0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xe400-0xe40f at device 20.1 on pci0
ata0: ATA channel 0 on atapci1

xaero# atacontrol list
ATA channel 0:
Master:  no device present
Slave:   no device present
ATA channel 2:
Master:  ad4 Hitachi HDP725050GLA360/GM4OA50E Serial ATA II
Slave:   no device present
ATA channel 3:
Master:  ad6 Hitachi HDP725050GLA360/GM4OA50E Serial ATA II
Slave:   no device present
ATA channel 4:
Master:  ad8 Hitachi HDP725050GLA360/GM4OA50E Serial ATA II
Slave:   no device present
ATA channel 5:
Master: ad10 Hitachi HDP725050GLA360/GM4OA50E Serial ATA II
Slave:   no device present
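
The unit numbers in this listing look consistent with static device numbering in the ata(4) driver, where every channel reserves two unit numbers (master first, then slave). A minimal Python sketch of that mapping follows; it assumes the kernel was built with the `ATA_STATIC_ID` option, and the function name `ad_unit` is made up for illustration:

```python
def ad_unit(channel: int, slave: bool = False) -> str:
    """Return the adN name expected for a disk on the given ATA channel.

    Assumes static numbering: two unit numbers per channel,
    master first, slave second.
    """
    return f"ad{2 * channel + (1 if slave else 0)}"


# Channels 2-5 are the four AHCI ports (ata2..ata5) from the dmesg output.
for channel in range(2, 6):
    print(channel, ad_unit(channel))
```

Under that assumption, the masters on channels 2 through 5 come out as ad4, ad6, ad8 and ad10, which matches the listing.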

xaero# uname -a
FreeBSD xaero.spacemarines.us 7.0-PRERELEASE FreeBSD 7.0-PRERELEASE #1:
Sun Feb 10 16:07:39 CET 2008
[EMAIL PROTECTED]:/usr/obj/usr/src/sys/GENERIC  amd64

I'm using BIOS 1603 and a Seasonic 330W PSU. The errors appear to
happen at random; heavy I/O doesn't trigger them. I can rebuild world
without problems, so I guess the CPU is OK. The memory has been tested
and showed no errors. What's left is cables and mainboard. But how
error-prone are SATA cables? Considering that I've got 50% failing...
Okay, maybe that should prove that the mainboard is faulty :)

-Remco
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-02-11 Thread Remco van Bekkum
On Mon, Feb 11, 2008 at 07:24:55AM -1000, Clifton Royston wrote:
 On Mon, Feb 11, 2008 at 01:00:57PM +0100, Remco van Bekkum wrote:
  On Fri, Jan 25, 2008 at 04:38:46PM -0800, Jeremy Chadwick wrote:
  After having replaced my first SATA disk with one of the same type,
  having still the same errors, I replaced this 1TB drive with 4x500GB
  Hitachi P7K500 in raidz. It worked fine for a week, but yesterday I
  cvsupped and rebuilt world. This afternoon everything is breaking down
  again with the same errors:
  
  Feb 11 12:34:09 xaero kernel: ad6: WARNING - SETFEATURES SET TRANSFER
  MODE taskqueue timeout - completing request directly
  Feb 11 12:34:13 xaero kernel: ad6: WARNING - SETFEATURES SET TRANSFER
  MODE taskqueue timeout - completing request directly
  Feb 11 12:34:17 xaero kernel: ad6: WARNING - SETFEATURES ENABLE RCACHE
  taskqueue timeout - completing request directly
  Feb 11 12:34:21 xaero kernel: ad6: WARNING - SETFEATURES ENABLE WCACHE
  taskqueue timeout - completing request directly
  Feb 11 12:34:25 xaero kernel: ad6: WARNING - SET_MULTI taskqueue timeout
  - completing request directly
  Feb 11 12:34:25 xaero kernel: ad6: FAILURE - WRITE_DMA48 timed out
  LBA=298014274
 
   Did you try replacing cabling as a previous poster recommended?  I've
 had similar problems with both traditional parallel ATA and SATA due to
 marginal cables, which of course are not solved by swapping drives.
 
   Not saying there's not a software problem here, just that there is
 still one area to eliminate.
   -- Clifton
  
 -- 
 Clifton Royston  --  [EMAIL PROTECTED] / [EMAIL PROTECTED]
President  - I and I Computing * http://www.iandicomputing.com/
  Custom programming, network design, systems and network consulting services

Hi Clifton,

I don't recall exactly anymore, but at least 3 of the cables have been used
without problems on other systems. I do wonder about the mainboard, though,
because it sometimes acts weird as well: when I press the reset button, it
sometimes powers down.
Also, I just did a reset after it deadlocked on shutdown because of the
errors, and when the system booted, 2 disks were not seen by the BIOS.
I had to power down the box, and when it came up again the disks were back.
Can software leave the disks in a state in which the BIOS doesn't detect
them after the reset button is pressed?
I'm 100% certain that on my previous installation, in a completely different
system, I got the same errors. That should normally mean either software or
disk.
The disk has been replaced; the OS is the same. I'm either having really bad
luck or something else is wrong.
What is a good way of stress-testing disks?
Thanks!

- Remco


Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-02-11 Thread Clifton Royston
On Mon, Feb 11, 2008 at 01:00:57PM +0100, Remco van Bekkum wrote:
 On Fri, Jan 25, 2008 at 04:38:46PM -0800, Jeremy Chadwick wrote:
 After having replaced my first SATA disk with one of the same type,
 having still the same errors, I replaced this 1TB drive with 4x500GB
 Hitachi P7K500 in raidz. It worked fine for a week, but yesterday I
 cvsupped and rebuilt world. This afternoon everything is breaking down
 again with the same errors:
 
 Feb 11 12:34:09 xaero kernel: ad6: WARNING - SETFEATURES SET TRANSFER
 MODE taskqueue timeout - completing request directly
 Feb 11 12:34:13 xaero kernel: ad6: WARNING - SETFEATURES SET TRANSFER
 MODE taskqueue timeout - completing request directly
 Feb 11 12:34:17 xaero kernel: ad6: WARNING - SETFEATURES ENABLE RCACHE
 taskqueue timeout - completing request directly
 Feb 11 12:34:21 xaero kernel: ad6: WARNING - SETFEATURES ENABLE WCACHE
 taskqueue timeout - completing request directly
 Feb 11 12:34:25 xaero kernel: ad6: WARNING - SET_MULTI taskqueue timeout
 - completing request directly
 Feb 11 12:34:25 xaero kernel: ad6: FAILURE - WRITE_DMA48 timed out
 LBA=298014274

  Did you try replacing cabling as a previous poster recommended?  I've
had similar problems with both traditional parallel ATA and SATA due to
marginal cables, which of course are not solved by swapping drives.

  Not saying there's not a software problem here, just that there is
still one area to eliminate.
  -- Clifton
 
-- 
Clifton Royston  --  [EMAIL PROTECTED] / [EMAIL PROTECTED]
   President  - I and I Computing * http://www.iandicomputing.com/
 Custom programming, network design, systems and network consulting services


Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-02-11 Thread Torfinn Ingolfsen
On Mon, 11 Feb 2008 13:00:57 +0100
[EMAIL PROTECTED] (Remco van Bekkum) wrote:

 here? It's on an amd64, Asus m2a-vm with ati xp600, AMD BE-2350 CPU,
 2GB 800MHz RAM.

FWIW, I have almost the same motherboard (m2a-vm hdmi) with an AMD
Phenom 9500 and 4GB RAM[1].  Different disk, though. The (single) disk
drive has worked without problems so far. I'm using standard ufs2
filesystems on that disk. I'm running RELENG_7:
[EMAIL PROTECTED] uname -a
FreeBSD kg-vm.kg4.no 7.0-PRERELEASE FreeBSD 7.0-PRERELEASE #6: Sat Jan
26 20:58:51 CET 2008
[EMAIL PROTECTED]:/usr/obj/usr/src/sys/GENERIC  amd64
[EMAIL PROTECTED] atacontrol list
ATA channel 0:
Master: acd0 Optiarc DVD RW AD-5170A/1.12 ATA/ATAPI revision 0
Slave:   no device present
ATA channel 2:
Master:  ad4 SAMSUNG HD501LJ/CR100-12 Serial ATA II
Slave:   no device present
ATA channel 3:
Master:  no device present
Slave:   no device present
ATA channel 4:
Master:  no device present
Slave:   no device present
ATA channel 5:
Master:  no device present
Slave:   no device present


References:
1) http://tingox.googlepages.com/asus_m2a-vm_hdmi_freebsd
-- 
Regards,
Torfinn Ingolfsen



Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-02-11 Thread Remco van Bekkum
On Mon, Feb 11, 2008 at 01:00:57PM +0100, Remco van Bekkum wrote:
 On Fri, Jan 25, 2008 at 04:38:46PM -0800, Jeremy Chadwick wrote:
  Joe, I wanted to send you a note about something that I'm still in the
  process of dealing with.  The timing couldn't be more ironic.
  
  I decided it would be worthwhile to migrate from my two-disk ZFS stripe
  with a non-ZFS disk for nightly backups, to a RAIDZ pool of all 3
  disks combined (since they're all the same size).  I had another
  terminal with gstat -I500ms running in it, so I could see overall I/O.
  
  All was going well until about the 81GB mark of the copy.  gstat started
  showing 0KB in/out on all the drives, and the rsync was stalled.  ^Z did
  nothing, which is usually a bad sign.  :-)  I ssh'd in and did a dmesg
  (summarised):
  
  ad6: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
  ad6: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
  ad6: WARNING - SET_MULTI taskqueue timeout - completing request directly
  ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951071
  ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951327
  ad6: FAILURE - WRITE_DMA timed out LBA=13951071
  ad6: FAILURE - WRITE_DMA timed out LBA=13951327
  ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951583
  ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951839
  ad6: FAILURE - WRITE_DMA timed out LBA=13951583
  ad6: FAILURE - WRITE_DMA timed out LBA=13951839
  ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13952095
  ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13952351
  g_vfs_done():ad6s1d[WRITE(offset=7142916096, length=131072)]error = 5
  g_vfs_done():ad6s1d[WRITE(offset=7143047168, length=131072)]error = 5
  g_vfs_done():ad6s1d[WRITE(offset=7143178240, length=131072)]error = 5
  g_vfs_done():ad6s1d[WRITE(offset=7143309312, length=131072)]error = 5
  g_vfs_done():ad6s1d[WRITE(offset=7143440384, length=131072)]error = 5
  
  It appears my /dev/ad6 (a Seagate -- more irony) must have some bad
  blocks.  Actually, after letting things go for a while, I realised the
  box just locked up.  Probably kernel panic'd due to the I/O problem.
  I'll have to poke at SMART stats later to see what showed up.
  
  -- 
  | Jeremy Chadwick                                jdc at parodius.com |
  | Parodius Networking                       http://www.parodius.com/ |
  | UNIX Systems Administrator                  Mountain View, CA, USA |
  | Making life hard for others since 1977.              PGP: 4BD6C0CB |
  
 
 Hi all,
 
 After having replaced my first SATA disk with one of the same type,
 having still the same errors, I replaced this 1TB drive with 4x500GB
 Hitachi P7K500 in raidz. It worked fine for a week, but yesterday I
 cvsupped and rebuilt world. This afternoon everything is breaking down
 again with the same errors:
 
 Feb 11 12:34:09 xaero kernel: ad6: WARNING - SETFEATURES SET TRANSFER
 MODE taskqueue timeout - completing request directly
 Feb 11 12:34:13 xaero kernel: ad6: WARNING - SETFEATURES SET TRANSFER
 MODE taskqueue timeout - completing request directly
 Feb 11 12:34:17 xaero kernel: ad6: WARNING - SETFEATURES ENABLE RCACHE
 taskqueue timeout - completing request directly
 Feb 11 12:34:21 xaero kernel: ad6: WARNING - SETFEATURES ENABLE WCACHE
 taskqueue timeout - completing request directly
 Feb 11 12:34:25 xaero kernel: ad6: WARNING - SET_MULTI taskqueue timeout
 - completing request directly
 Feb 11 12:34:25 xaero kernel: ad6: FAILURE - WRITE_DMA48 timed out
 LBA=298014274
 
 Feb 11 12:34:29 xaero kernel: ad8: WARNING - SETFEATURES SET TRANSFER
 MODE taskqueue timeout - completing request directly
 Feb 11 12:34:33 xaero kernel: ad8: WARNING - SETFEATURES SET TRANSFER
 MODE taskqueue timeout - completing request directly
 Feb 11 12:34:37 xaero kernel: ad8: WARNING - SETFEATURES ENABLE RCACHE
 taskqueue timeout - completing request directly
 Feb 11 12:34:41 xaero kernel: ad8: WARNING - SETFEATURES ENABLE WCACHE
 taskqueue timeout - completing request directly
 Feb 11 12:34:45 xaero kernel: ad8: WARNING - SET_MULTI taskqueue timeout
 - completing request directly
 Feb 11 12:34:45 xaero kernel: ad8: FAILURE - WRITE_DMA48 timed out
 LBA=298013590
 
 So of 6 new disks, 4 show the same errors. It seems quite safe, then,
 not to blame the disks, imho. I've tested the second drive in another
 machine, but still got these timeout errors. What's wrong here?
 It's on an amd64, Asus m2a-vm with ATI IXP600, AMD BE-2350 CPU, 2GB
 800MHz RAM.
 
 Regards,
 
 Remco

Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-02-11 Thread Remco van Bekkum
On Fri, Jan 25, 2008 at 04:38:46PM -0800, Jeremy Chadwick wrote:
 Joe, I wanted to send you a note about something that I'm still in the
 process of dealing with.  The timing couldn't be more ironic.
 
 I decided it would be worthwhile to migrate from my two-disk ZFS stripe
 with a non-ZFS disk for nightly backups, to a RAIDZ pool of all 3
 disks combined (since they're all the same size).  I had another
 terminal with gstat -I500ms running in it, so I could see overall I/O.
 
 All was going well until about the 81GB mark of the copy.  gstat started
 showing 0KB in/out on all the drives, and the rsync was stalled.  ^Z did
 nothing, which is usually a bad sign.  :-)  I ssh'd in and did a dmesg
 (summarised):
 
 ad6: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
 ad6: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
 ad6: WARNING - SET_MULTI taskqueue timeout - completing request directly
 ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951071
 ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951327
 ad6: FAILURE - WRITE_DMA timed out LBA=13951071
 ad6: FAILURE - WRITE_DMA timed out LBA=13951327
 ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951583
 ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951839
 ad6: FAILURE - WRITE_DMA timed out LBA=13951583
 ad6: FAILURE - WRITE_DMA timed out LBA=13951839
 ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13952095
 ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13952351
 g_vfs_done():ad6s1d[WRITE(offset=7142916096, length=131072)]error = 5
 g_vfs_done():ad6s1d[WRITE(offset=7143047168, length=131072)]error = 5
 g_vfs_done():ad6s1d[WRITE(offset=7143178240, length=131072)]error = 5
 g_vfs_done():ad6s1d[WRITE(offset=7143309312, length=131072)]error = 5
 g_vfs_done():ad6s1d[WRITE(offset=7143440384, length=131072)]error = 5
 
 It appears my /dev/ad6 (a Seagate -- more irony) must have some bad
 blocks.  Actually, after letting things go for a while, I realised the
 box just locked up.  Probably kernel panic'd due to the I/O problem.
 I'll have to poke at SMART stats later to see what showed up.
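
As a cross-check on the log above, the LBAs in the WRITE_DMA lines and the byte offsets in the g_vfs_done() lines agree with each other: each failed write is 131072 bytes, i.e. 256 sectors, and consecutive LBAs differ by 256. A small Python sketch of the arithmetic follows. It assumes 512-byte sectors and that ad6s1d begins 63 sectors into the disk; the 63 is inferred from the numbers and is not stated anywhere in the log:

```python
SECTOR_SIZE = 512          # bytes per sector (assumed)
PARTITION_START_LBA = 63   # assumed start of ad6s1d on the disk, in sectors


def offset_to_lba(offset_bytes: int) -> int:
    """Convert a partition-relative byte offset to an absolute disk LBA."""
    return offset_bytes // SECTOR_SIZE + PARTITION_START_LBA


# (offset, length) pairs taken from the first two g_vfs_done() lines.
writes = [(7142916096, 131072), (7143047168, 131072)]
for offset, length in writes:
    print(offset_to_lba(offset), length // SECTOR_SIZE)
```

With those assumptions the two offsets map to LBA 13951071 and 13951327, the same values reported in the first two TIMEOUT lines.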
 
 -- 
 | Jeremy Chadwick                                jdc at parodius.com |
 | Parodius Networking                       http://www.parodius.com/ |
 | UNIX Systems Administrator                  Mountain View, CA, USA |
 | Making life hard for others since 1977.              PGP: 4BD6C0CB |
 

Hi all,

After having replaced my first SATA disk with one of the same type,
having still the same errors, I replaced this 1TB drive with 4x500GB
Hitachi P7K500 in raidz. It worked fine for a week, but yesterday I
cvsupped and rebuilt world. This afternoon everything is breaking down
again with the same errors:

Feb 11 12:34:09 xaero kernel: ad6: WARNING - SETFEATURES SET TRANSFER
MODE taskqueue timeout - completing request directly
Feb 11 12:34:13 xaero kernel: ad6: WARNING - SETFEATURES SET TRANSFER
MODE taskqueue timeout - completing request directly
Feb 11 12:34:17 xaero kernel: ad6: WARNING - SETFEATURES ENABLE RCACHE
taskqueue timeout - completing request directly
Feb 11 12:34:21 xaero kernel: ad6: WARNING - SETFEATURES ENABLE WCACHE
taskqueue timeout - completing request directly
Feb 11 12:34:25 xaero kernel: ad6: WARNING - SET_MULTI taskqueue timeout
- completing request directly
Feb 11 12:34:25 xaero kernel: ad6: FAILURE - WRITE_DMA48 timed out
LBA=298014274

Feb 11 12:34:29 xaero kernel: ad8: WARNING - SETFEATURES SET TRANSFER
MODE taskqueue timeout - completing request directly
Feb 11 12:34:33 xaero kernel: ad8: WARNING - SETFEATURES SET TRANSFER
MODE taskqueue timeout - completing request directly
Feb 11 12:34:37 xaero kernel: ad8: WARNING - SETFEATURES ENABLE RCACHE
taskqueue timeout - completing request directly
Feb 11 12:34:41 xaero kernel: ad8: WARNING - SETFEATURES ENABLE WCACHE
taskqueue timeout - completing request directly
Feb 11 12:34:45 xaero kernel: ad8: WARNING - SET_MULTI taskqueue timeout
- completing request directly
Feb 11 12:34:45 xaero kernel: ad8: FAILURE - WRITE_DMA48 timed out
LBA=298013590
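
To see at a glance which devices are affected, messages like the ones above can be tallied per device and severity. The following is a minimal Python sketch, not an existing tool; the regular expression only assumes the "adN: WARNING|TIMEOUT|FAILURE - ..." shape seen in these logs, and `tally` is a made-up name:

```python
import re
from collections import Counter

# Matches the start of an ata(4) kernel message such as
# "... kernel: ad6: FAILURE - WRITE_DMA48 timed out"
PATTERN = re.compile(r"\b(ad\d+): (WARNING|TIMEOUT|FAILURE) - ")


def tally(log_text: str) -> Counter:
    """Count (device, severity) pairs in a block of kernel log text."""
    return Counter(PATTERN.findall(log_text))


sample = """\
Feb 11 12:34:25 xaero kernel: ad6: WARNING - SET_MULTI taskqueue timeout
- completing request directly
Feb 11 12:34:25 xaero kernel: ad6: FAILURE - WRITE_DMA48 timed out
LBA=298014274
Feb 11 12:34:45 xaero kernel: ad8: FAILURE - WRITE_DMA48 timed out
LBA=298013590
"""
for (device, severity), count in sorted(tally(sample).items()):
    print(device, severity, count)
```

Wrapped continuation lines (such as "LBA=...") are simply ignored, so the text of /var/log/messages can be passed in unchanged.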

So of 6 new disks, 4 show the same errors. It seems quite safe, then,
not to blame the disks, imho. I've tested the second drive in another
machine, but still got these timeout errors. What's wrong here?
It's on an amd64, Asus m2a-vm with ATI IXP600, AMD BE-2350 CPU, 2GB
800MHz RAM.

Regards,

Remco


Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-01-28 Thread J H


it should /definitely/ display a diagnostic which encourages the admin 
to use /etc/rc.d/hostid


Ahhh, rather, display a diagnostic which encourages the use of zpool 
import -a.



  --JH


Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-01-28 Thread J H

Richard Todd wrote:

Workaround: always make sure you run /etc/rc.d/hostid start in single-user 
before doing any ZFS tinkering.
  


Good advice -- thank you.

But it still sounds like Jeremy's assessment, that it's a bug, is 
accurate.  ZFS could certainly check for a zero hostid.  If zero, it 
should /definitely/ display a diagnostic which encourages the admin to 
use /etc/rc.d/hostid (or a printout of it).  If zero, it /might/ 
additionally do some reads in case a likely-looking /etc/rc.d/hostid is 
available, and display the hostid, perhaps even speculatively start 
using it.  It would save some needless no datasets available hair pulling.
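
The check proposed here could look roughly like the Python sketch below. It is purely illustrative: `hostid_diagnostic` is a hypothetical helper, not part of ZFS or FreeBSD, and the wording of the message is invented.

```python
from typing import Optional


def hostid_diagnostic(hostid: int) -> Optional[str]:
    """Return a warning if the hostid is still zero, else None.

    Hypothetical helper illustrating the proposal; not part of ZFS.
    """
    if hostid == 0:
        return ("hostid is 0: run '/etc/rc.d/hostid start' "
                "or 'zpool import -a' before using ZFS")
    return None


print(hostid_diagnostic(0))
```

Any non-zero hostid returns None, so the hint would only appear in the single-user case discussed in this thread.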


   Cheers,
   JH


Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-01-27 Thread Danny Braniss

 Henri Hennebert wrote:
  Jeremy Chadwick wrote:
  On Fri, Jan 25, 2008 at 06:17:24PM -0700, Joe Peterson wrote:
  Glad you got it back!  Yes, when I was first playing with ZFS, I noticed
  that booting between single and multi user mode could make the pools
  invisible.  Import seemed to bring them back...
 
  I did go into single-user mode and attempt to do ZFS-related commands,
  which might explain the no datasets available once I was back in
  multiuser!  I would classify that as a bug, and one which is going to
  cause all sorts of hair-pulling for administrators in the future.  I
  wonder what it's caused by.
 
  In single user / is read only and so /boot/zfs/zpool.cache can't be
  created/updated
 
 But it's still readable. The issue is that hostid isn't set (by
 /etc/rc.d/hostid).

If the root is read-only, as in the case of a diskless/dataless boot, it's
the fact that /boot/zfs/zpool.cache cannot be used which causes
the problem, so adding zpool import -a solves the issue.

danny




Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-01-26 Thread Henri Hennebert

Jeremy Chadwick wrote:

On Fri, Jan 25, 2008 at 06:17:24PM -0700, Joe Peterson wrote:

Glad you got it back!  Yes, when I was first playing with ZFS, I noticed
that booting between single and multi user mode could make the pools
invisible.  Import seemed to bring them back...


I did go into single-user mode and attempt to do ZFS-related commands,
which might explain the no datasets available once I was back in
multiuser!  I would classify that as a bug, and one which is going to
cause all sorts of hair-pulling for administrators in the future.  I
wonder what it's caused by.


In single user / is read only and so /boot/zfs/zpool.cache can't be 
created/updated


Henri



The import technique I found on a forum somewhere, or possibly on a
Solaris mailing list.  I was really sweating there for a moment...


So, is the disk toast, or can you still read anything from it (part
table, etc.)?


The ad6 disk (/backups) fsck'd cleanly without any missing files or
anomalies.

The ZFS pool that has two striped disks (ad8 and ad10) is fully intact
too, with no loss of data that I can see.  I'll have to run a scrub
after I'm done copying data over to ad6, just to make sure though.





Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-01-26 Thread Joe Peterson
I performed a ZFS scrub, which finished yesterday, and no new
/var/log/messages errors were reported during that time.  However, the scrub
found something interesting:


crater# zpool status -v
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed with 1 errors on Fri Jan 25 12:52:32 2008
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       1     3     2
          ad0s1d    ONLINE       1     3     2

errors: Permanent errors have been detected in the following files:


/home/joe/music/jukebox/christmas/Esquivel/Merry_XMas_from_the_SpaceAge_Bachelor_Pad/07-Snowfall.mp3



Note that I have not touched this file since copying it to this drive.

So, it seems one file failed a checksum check during the scrub.  I now
(expectedly) get errors trying to read this file - probably ZFS indicating the
condition.  When I just logged in tonight, I got two more /var/log/messages
disk messages about WRITE_DMA48 TIMEOUT/FAILURE - might be a coincidence (just
as I was typing my password).

Also, smartctl still shows PASSED, however, this is interesting:

195 Hardware_ECC_Recovered  0x001a   061   046   000    Old_age   Always       -       9070

The number is much *smaller* now!  It was 6 a few minutes before this...
wrap around?  Hmm, I'm really not sure, at this point, what is going on.

So I have started a SeaTools (disk scanner from Seagate) long test of the
drive.  The short test passed already.  The results should be interesting.  If
it finds nothing wrong, I am going to start to wonder if I am experiencing ZFS
bugs that just happen to look like drive problems.  I already did a long read,
under linux, of disk contents, and got no messages about anything wrong.

If I can turn on any debugging info to help determine if this is
software-related, let me know the magic keywords to use.  :)

-Joe


Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-01-26 Thread Joe Peterson
Joe Peterson wrote:
 So I have started a SeaTools (disk scanner from Seagate) long test of the
 drive.  The short test passed already.  The results should be interesting.  If
 it finds nothing wrong, I am going to start to wonder if I am experiencing ZFS
 bugs that just happen to look like drive problems.  I already did a long read,
 under linux, of disk contents, and got no messages about anything wrong.

Update: both SHORT and LONG tests passed for this drive in SeaTools.
Hmph...  the mystery remains.
-Joe


Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-01-26 Thread Ivan Voras

Henri Hennebert wrote:

Jeremy Chadwick wrote:

On Fri, Jan 25, 2008 at 06:17:24PM -0700, Joe Peterson wrote:

Glad you got it back!  Yes, when I was first playing with ZFS, I noticed
that booting between single and multi user mode could make the pools
invisible.  Import seemed to bring them back...


I did go into single-user mode and attempt to do ZFS-related commands,
which might explain the no datasets available once I was back in
multiuser!  I would classify that as a bug, and one which is going to
cause all sorts of hair-pulling for administrators in the future.  I
wonder what it's caused by.


In single user / is read only and so /boot/zfs/zpool.cache can't be 
created/updated


But it's still readable. The issue is that hostid isn't set (by 
/etc/rc.d/hostid).






Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-01-26 Thread Ivan Voras

Joe Peterson wrote:

Joe Peterson wrote:

So I have started a SeaTools (disk scanner from Seagate) long test of the
drive.  The short test passed already.  The results should be interesting.  If
it finds nothing wrong, I am going to start to wonder if I am experiencing ZFS
bugs that just happen to look like drive problems.  I already did a long read,
under linux, of disk contents, and got no messages about anything wrong.


Update: both SHORT and LONG tests passed for this drive in SeaTools.
Hmph...  the mystery remains.


Were both tests done in the same machine (actually, I mean the same PSU)?





Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-01-26 Thread Richard Todd
Joe Peterson [EMAIL PROTECTED] writes:

 Glad you got it back!  Yes, when I was first playing with ZFS, I noticed
 that booting between single and multi user mode could make the pools
 invisible.  Import seemed to bring them back...

Yeah.  ZFS pools record the hostid of the system that accessed them last.
When you boot in single-user mode, /etc/rc.d/hostid doesn't get run,
so the hostid is zero, which doesn't match the hostid in the pool, so the
pool doesn't show up without an import.  Workaround: always make sure you
run /etc/rc.d/hostid start in single-user before doing any ZFS tinkering.
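
The behaviour described above can be pictured with a toy model in Python. This is only a simplified sketch of the explanation in this message, not the actual ZFS logic; the function name and the example hostid value are invented.

```python
def pool_auto_visible(pool_hostid: int, system_hostid: int) -> bool:
    """Toy model: a pool shows up without 'zpool import' only if the
    hostid recorded in it matches the running system's hostid."""
    return pool_hostid == system_hostid


MULTI_USER_HOSTID = 0x1A2B3C4D   # made-up example value
SINGLE_USER_HOSTID = 0           # /etc/rc.d/hostid has not run yet

print(pool_auto_visible(MULTI_USER_HOSTID, MULTI_USER_HOSTID))
print(pool_auto_visible(MULTI_USER_HOSTID, SINGLE_USER_HOSTID))
```

In this model the pool is visible in the multi-user case and hidden in the single-user case until the hostid is set or the pool is imported.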




Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-01-26 Thread Joe Peterson
Ivan Voras wrote:
 Were both tests done in the same machine (actually, I mean the same PSU)?

Yes - I deliberately changed nothing (not even cables) before I ran the tests.
I didn't want any variables.

-Joe


Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-01-26 Thread Jeremy Chadwick
On Sat, Jan 26, 2008 at 01:15:31PM -0700, Joe Peterson wrote:
 Joe Peterson wrote:
  So I have started a SeaTools (disk scanner from Seagate) long test of the
  drive.  The short test passed already.  The results should be interesting.  If
  it finds nothing wrong, I am going to start to wonder if I am experiencing ZFS
  bugs that just happen to look like drive problems.  I already did a long read,
  under linux, of disk contents, and got no messages about anything wrong.
 
 Update: both SHORT and LONG tests passed for this drive in SeaTools.
 Hmph...  the mystery remains.

As do mine -- I also completed both short and long tests in SeaTools on
my drive (finished early this evening).  Absolutely no errors,
everything passed.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |



Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-01-25 Thread Julian H. Stacey
Jeremy Chadwick wrote:
  wondering if this is a known issue.  Note that smartctl does not report
  errors logged and gives a PASSED to the drive.  I am running at
  UDMA100 ATA.  Also, if it matters, I am using ZFS.

 Can you please provide output of the following:
 
 * smartctl -a /dev/ad0

From ports/sysutils/smartmontools I presume ?
( Asking as I also have a DMA prob. to solve, at present
needing hw.ata.ata_dma=0 in /boot/loader.conf to boot,
( interruptions on sound on 7-stable, though no ZFS here)).
smartctl:
Not installed by /usr/src-7
No /usr/ports/*/smartctl
Clues found with locate for ports:
  sysutils/munin-node/files/patch-hddtemp_smartctl.in
  sysutils/sensors-applet/files/smartctl-helper.c
  sysutils/sensors-applet/files/smartctl-sensors-interface.c
  sysutils/sensors-applet/files/smartctl-sensors-interface.h

  sysutils/munin-main   # Not really ?
  ports/sysutils/sensors-applet - ports/sysutils/smartmontools

-- 
Julian Stacey. Munich Computer Consultant, BSD Unix C Linux. http://berklix.com


Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-01-25 Thread Peter Jeremy
On Fri, Jan 25, 2008 at 12:46:08PM -0800, Chuck Swiger wrote:
On Jan 25, 2008, at 11:24 AM, Joe Peterson wrote:
 ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
   1 Raw_Read_Error_Rate     0x000f   114   071   006    Pre-fail  Always       -       82422948
 [ ... ]
   7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always       -       286126605
 [ ... ]
 195 Hardware_ECC_Recovered  0x001a   063   046   000    Old_age   Always       -       166181300

These numbers are quite worrisome -- they should be zero or nearly so in a
healthy drive.

I see similarly weird values from a basically new drive.  I'm not sure
that there's a requirement that the raw values start from 0 and increment
on each detected event.

-- 
Peter Jeremy
Please excuse any delays as the result of my ISP's inability to implement
an MTA that is either RFC2821-compliant or matches their claimed behaviour.




Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-01-25 Thread Jeremy Chadwick
On Fri, Jan 25, 2008 at 06:17:24PM -0700, Joe Peterson wrote:
 Glad you got it back!  Yes, when I was first playing with ZFS, I noticed
 that booting between single and multi user mode could make the pools
 invisible.  Import seemed to bring them back...

I did go into single-user mode and attempt to do ZFS-related commands,
which might explain the "no datasets available" once I was back in
multiuser!  I would classify that as a bug, and one which is going to
cause all sorts of hair-pulling for administrators in the future.  I
wonder what it's caused by.

The import technique I found on a forum somewhere, or possibly on a
Solaris mailing list.  I was really sweating there for a moment...

 So, is the disk toast, or can you still read anything from it (part
 table, etc.)?

The ad6 disk (/backups) fsck'd cleanly without any missing files or
anomalies.

The ZFS pool that has two striped disks (ad8 and ad10) is fully intact
too, with no loss of data that I can see.  I'll have to run a scrub
after I'm done copying data over to ad6, just to make sure though.

-- 
| Jeremy Chadwickjdc at parodius.com |
| Parodius Networking   http://www.parodius.com/ |
| UNIX Systems Administrator  Mountain View, CA, USA |
| Making life hard for others since 1977.  PGP: 4BD6C0CB |



Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-01-25 Thread Ian Smith
On Fri, 25 Jan 2008, Jeremy Chadwick wrote:
  On Fri, Jan 25, 2008 at 06:03:33PM -0700, Joe Peterson wrote:
   Wow, pretty crazy!  Hmm, and yes, those LBAs do look close together.
   Well, let me know how the smartctl output looks.  I'd be curious if your
   bad sector count rises.
  
  Absolutely nada on the SMART statistics.  Nothing incremented or changed
  in any way.  My short and long tests did not change any of the data in
  the fields either.  Full output is below my .sig.
[..]
  It is interesting to note that we both have Seagate disks...  :-) I'll
  have to run SeaTools on my disk to see if anything comes back, or run a
  selective LBA test in smartctl (since the drive supports it).
[..]
  smartctl version 5.37 [i386-portbld-freebsd7.0] Copyright (C) 2002-6 Bruce Allen

  === START OF INFORMATION SECTION ===
  Model Family:     Seagate Barracuda 7200.10 family
  Device Model:     ST3500630AS
  Serial Number:    9QG1YWNL
  Firmware Version: 3.AAE

Same firmware as Joe's, too, though his ad1 was a bit later (3.AAG or H?)

  ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
    1 Raw_Read_Error_Rate     0x000f   114   094   006    Pre-fail  Always   -           131599973
    3 Spin_Up_Time            0x0003   094   094   000    Pre-fail  Always   -           0
    4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always   -           6
    5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always   -           0
    7 Seek_Error_Rate         0x000f   082   060   030    Pre-fail  Always   -           200325271
    9 Power_On_Hours          0x0032   097   097   000    Old_age   Always   -           2970
   10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always   -           0
   12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always   -           9
  187 Unknown_Attribute       0x0032   100   100   000    Old_age   Always   -           0
  189 Unknown_Attribute       0x003a   100   100   000    Old_age   Always   -           0
  190 Temperature_Celsius     0x0022   063   050   045    Old_age   Always   -           773849125
  194 Temperature_Celsius     0x0022   037   050   000    Old_age   Always   -           37 (Lifetime Min/Max 0/29)

I noticed Joe's Temp readings look similarly borked too - attribute 190
is likely something else, despite same flag value as 194, which then
shows clearly wrong values for min/max, though raw temp is reasonable: 

   190 Temperature_Celsius     0x0022   065   056   045    Old_age   Always   -           605749283
   194 Temperature_Celsius     0x0022   035   044   000    Old_age   Always   -           35 (Lifetime Min/Max 0/15)

.. which only goes to show, as I've seen with other attributes on other
drive brands, that smartctl's database isn't necessarily reliable over
all versions / revisions of a given drive.  Add salt to taste ..
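
One way to look at the attribute 190 raw values quoted above is to treat each as a set of packed bytes rather than as a single number. The following is only a sketch: raw_bytes is a hypothetical helper, and what the upper bytes mean is an assumption (they may be min/max readings, but nothing in this thread confirms that). What can be checked from the quoted data is that the low byte equals the attribute 194 raw temperature in both samples, and that the normalized VALUE of attribute 190 (063 and 065) equals 100 minus that temperature.

```python
# Sketch: split a 32-bit SMART raw value into its four bytes.
# raw_bytes is a hypothetical helper, not part of smartmontools.
def raw_bytes(raw: int) -> list[int]:
    """Return the four bytes of a raw value, least significant byte first."""
    return [(raw >> shift) & 0xFF for shift in (0, 8, 16, 24)]


# (attribute 190 raw, attribute 194 raw temperature, attribute 190 VALUE)
# exactly as quoted in this thread.
samples = {
    "Jeremy": (773849125, 37, 63),
    "Joe": (605749283, 35, 65),
}

for owner, (raw190, temp194, value190) in samples.items():
    b = raw_bytes(raw190)
    print(f"{owner}: raw=0x{raw190:08x} bytes(lsb first)={b}")
    print(f"  low byte == attr 194 raw temp: {b[0] == temp194}")
    print(f"  100 - temp == attr 190 VALUE:  {100 - temp194 == value190}")
```

Read this way, the huge decimal figure is not a temperature at all, which fits the caution about trusting the attribute names in smartctl's database.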

Cheers, Ian



Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-01-25 Thread Andrew MacIntyre

Jeremy Chadwick wrote:

* Getting a larger power supply (usually when lots of disks are involved)

I only have two drives, so I think the PS has enough capacity in my case.

Agreed; even a 350W PSU should handle 2 disks without a problem.

I've seen power supplies with a sagging 12V rail cause these sorts of
problems.

--
-
Andrew I MacIntyre These thoughts are mine alone...
E-mail: [EMAIL PROTECTED]  (pref) | Snail: PO Box 370
   [EMAIL PROTECTED] (alt) |Belconnen ACT 2616
Web:http://www.andymac.org/   |Australia


Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-01-25 Thread Jeremy Chadwick
On Fri, Jan 25, 2008 at 05:00:54PM -0800, Jeremy Chadwick wrote:
 icarus# zfs list
 no datasets available
 
 This doesn't bode well, and doesn't make me happy.  At all.

Pshew!  I was able to get ZFS to start seeing the pool again by doing
the following:  (Supposedly "zpool import" by itself will show you a
list of pools which it manages to see...)

icarus# zpool import -f storage
icarus# df -k /storage
Filesystem  1024-blocks      Used     Avail Capacity  Mounted on
storage       957873024 106124032 851748992    11%    /storage
icarus# zfs list
NAME  USED  AVAIL  REFER  MOUNTPOINT
storage   101G   812G   101G  /storage
icarus# zpool status
  pool: storage
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     0
          ad8       ONLINE       0     0     0
          ad10      ONLINE       0     0     0

errors: No known data errors

Back to the drawing board.
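
As a sanity check on the recovered pool, the "df -k" figures and the "zfs list" figures above can be compared. This is a small sketch under the assumption that "zfs list" reports binary units (GiB); kib_to_gib is a hypothetical helper, and the numbers are copied from the output above.

```python
# Sketch: cross-check "df -k /storage" against "zfs list" for the same pool.
KIB_PER_GIB = 1024 * 1024


def kib_to_gib(kib: int) -> float:
    """Convert a count of 1024-byte blocks to GiB."""
    return kib / KIB_PER_GIB


# Values from the df -k line (1024-byte blocks).
total, used, avail = 957873024, 106124032, 851748992

print(f"used + avail == total: {used + avail == total}")
print(f"used  = {kib_to_gib(used):.1f} GiB  (zfs list shows 101G)")
print(f"avail = {kib_to_gib(avail):.1f} GiB  (zfs list shows 812G)")
print(f"capacity = {100 * used / total:.0f}%  (df shows 11%)")
```

The two tools agree, so the imported pool at least reports a consistent view of its space; a scrub is still the way to verify the data itself.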

-- 
| Jeremy Chadwickjdc at parodius.com |
| Parodius Networking   http://www.parodius.com/ |
| UNIX Systems Administrator  Mountain View, CA, USA |
| Making life hard for others since 1977.  PGP: 4BD6C0CB |



Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-01-25 Thread Jeremy Chadwick
Joe, I wanted to send you a note about something that I'm still in the
process of dealing with.  The timing couldn't be more ironic.

I decided it would be worthwhile to migrate from my two-disk ZFS stripe
with a non-ZFS disk for nightly backups, to a RAIDZ pool of all 3
disks combined (since they're all the same size).  I had another
terminal with gstat -I500ms running in it, so I could see overall I/O.

All was going well until about the 81GB mark of the copy.  gstat started
showing 0KB in/out on all the drives, and the rsync was stalled.  ^Z did
nothing, which is usually a bad sign.  :-)  I ssh'd in and did a dmesg
(summarised):

ad6: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad6: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
ad6: WARNING - SET_MULTI taskqueue timeout - completing request directly
ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951071
ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951327
ad6: FAILURE - WRITE_DMA timed out LBA=13951071
ad6: FAILURE - WRITE_DMA timed out LBA=13951327
ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951583
ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951839
ad6: FAILURE - WRITE_DMA timed out LBA=13951583
ad6: FAILURE - WRITE_DMA timed out LBA=13951839
ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13952095
ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13952351
g_vfs_done():ad6s1d[WRITE(offset=7142916096, length=131072)]error = 5
g_vfs_done():ad6s1d[WRITE(offset=7143047168, length=131072)]error = 5
g_vfs_done():ad6s1d[WRITE(offset=7143178240, length=131072)]error = 5
g_vfs_done():ad6s1d[WRITE(offset=7143309312, length=131072)]error = 5
g_vfs_done():ad6s1d[WRITE(offset=7143440384, length=131072)]error = 5

It appears my /dev/ad6 (a Seagate -- more irony) must have some bad
blocks.  Actually, after letting things go for a while, I realised the
box just locked up.  Probably kernel panic'd due to the I/O problem.
I'll have to poke at SMART stats later to see what showed up.
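
The LBAs and the g_vfs_done offsets in the dmesg excerpt above can be cross-checked with a little arithmetic. This is a sketch that assumes 512-byte sectors (the log does not state the sector size); strides and SECTOR_BYTES are illustrative names only.

```python
# Sketch: relate the failing LBAs to the g_vfs_done byte offsets.
SECTOR_BYTES = 512  # assumption: not stated anywhere in the log

# Copied from the dmesg excerpt above.
lbas = [13951071, 13951327, 13951583, 13951839, 13952095, 13952351]
offsets = [7142916096, 7143047168, 7143178240, 7143309312, 7143440384]


def strides(values: list[int]) -> list[int]:
    """Differences between consecutive entries."""
    return [b - a for a, b in zip(values, values[1:])]


lba_strides = strides(lbas)
offset_strides = strides(offsets)

# Each failed request starts 256 sectors after the previous one, which is
# the same 131072 bytes reported as "length" by g_vfs_done.
print("LBA strides (sectors):", lba_strides)
print("LBA stride in bytes:  ", lba_strides[0] * SECTOR_BYTES)
print("offset strides (bytes):", offset_strides)

# Difference between the absolute LBA and the partition-relative offset.
partition_start = lbas[0] - offsets[0] // SECTOR_BYTES
print("ad6s1d appears to start at sector", partition_start)
```

Under that assumption the failures are back-to-back 128 KB writes from one sequential stream, and the constant 63-sector difference would be consistent with ad6s1d beginning 63 sectors into the disk. Both readings are inferences from the numbers, not something the log states, and neither says whether the cause is bad blocks or something else.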

-- 
| Jeremy Chadwickjdc at parodius.com |
| Parodius Networking   http://www.parodius.com/ |
| UNIX Systems Administrator  Mountain View, CA, USA |
| Making life hard for others since 1977.  PGP: 4BD6C0CB |



Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-01-25 Thread Jeremy Chadwick
On Fri, Jan 25, 2008 at 04:38:46PM -0800, Jeremy Chadwick wrote:
 I'll have to poke at SMART stats later to see what showed up.

So the box did indeed panic.  The backtrace contained about 1.5 screens
of function calls from the stack, which makes taking a photo of the
screen a bit worthless.  The functions shown were predominantly I/O
related; since a disk locked up (or something), this didn't surprise me.

SMART stats showed absolutely nothing wrong with ad6, or any of the
other drives on the system.

Worse: my ZFS pool appears *completely* gone -- that's about 170GB of
data.  I don't even know how that happened, because there were
absolutely no issues reported on either of the disks on the ZFS pool.
It's like the situation somehow caused ZFS to go crazy and lose all of
its metadata.

icarus# zfs list
no datasets available

This doesn't bode well, and doesn't make me happy.  At all.

-- 
| Jeremy Chadwickjdc at parodius.com |
| Parodius Networking   http://www.parodius.com/ |
| UNIX Systems Administrator  Mountain View, CA, USA |
| Making life hard for others since 1977.  PGP: 4BD6C0CB |



Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-01-25 Thread Joe Peterson
Glad you got it back!  Yes, when I was first playing with ZFS, I noticed
that booting between single and multi user mode could make the pools
invisible.  Import seemed to bring them back...

So, is the disk toast, or can you still read anything from it (part
table, etc.)?

-Joe


Jeremy Chadwick wrote:
 On Fri, Jan 25, 2008 at 05:00:54PM -0800, Jeremy Chadwick wrote:
 icarus# zfs list
 no datasets available

 This doesn't bode well, and doesn't make me happy.  At all.
 
 Pshew!  I was able to get ZFS to start seeing the pool again by doing
 the following:  (Supposedly "zpool import" by itself will show you a
 list of pools which it manages to see...)
 
 icarus# zpool import -f storage
 icarus# df -k /storage
 Filesystem  1024-blocks      Used     Avail Capacity  Mounted on
 storage       957873024 106124032 851748992    11%    /storage
 icarus# zfs list
 NAME  USED  AVAIL  REFER  MOUNTPOINT
 storage   101G   812G   101G  /storage
 icarus# zpool status
   pool: storage
  state: ONLINE
  scrub: none requested
 config:
 
         NAME        STATE     READ WRITE CKSUM
         storage     ONLINE       0     0     0
           ad8       ONLINE       0     0     0
           ad10      ONLINE       0     0     0
 
 errors: No known data errors
 
 Back to the drawing board.
 


Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-01-25 Thread Joe Peterson
Jeremy Chadwick wrote:
 Joe, I wanted to send you a note about something that I'm still in the
 process of dealing with.  The timing couldn't be more ironic.
 
 I decided it would be worthwhile to migrate from my two-disk ZFS stripe
 with a non-ZFS disk for nightly backups, to a RAIDZ pool of all 3
 disks combined (since they're all the same size).  I had another
 terminal with gstat -I500ms running in it, so I could see overall I/O.
 
 All was going well until about the 81GB mark of the copy.  gstat started
 showing 0KB in/out on all the drives, and the rsync was stalled.  ^Z did
 nothing, which is usually a bad sign.  :-)  I ssh'd in and did a dmesg
 (summarised):
 
 ad6: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
 ad6: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
 ad6: WARNING - SET_MULTI taskqueue timeout - completing request directly
 ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951071
 ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951327
 ad6: FAILURE - WRITE_DMA timed out LBA=13951071
 ad6: FAILURE - WRITE_DMA timed out LBA=13951327
 ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951583
 ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951839
 ad6: FAILURE - WRITE_DMA timed out LBA=13951583
 ad6: FAILURE - WRITE_DMA timed out LBA=13951839
 ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13952095
 ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13952351
 g_vfs_done():ad6s1d[WRITE(offset=7142916096, length=131072)]error = 5
 g_vfs_done():ad6s1d[WRITE(offset=7143047168, length=131072)]error = 5
 g_vfs_done():ad6s1d[WRITE(offset=7143178240, length=131072)]error = 5
 g_vfs_done():ad6s1d[WRITE(offset=7143309312, length=131072)]error = 5
 g_vfs_done():ad6s1d[WRITE(offset=7143440384, length=131072)]error = 5
 
 It appears my /dev/ad6 (a Seagate -- more irony) must have some bad
 blocks.  Actually, after letting things go for a while, I realised the
 box just locked up.  Probably kernel panic'd due to the I/O problem.
 I'll have to poke at SMART stats later to see what showed up.

Wow, pretty crazy!  Hmm, and yes, those LBAs do look close together.
Well, let me know how the smartctl output looks.  I'd be curious if your
bad sector count rises.  I had noticed that 1

BTW, I tried:

crater# dd if=/dev/ad1s4 of=/dev/null bs=64k
^C1408596+0 records in
1408596+0 records out
92313747456 bytes transferred in 1415.324362 secs (65224446 bytes/sec)

(I let it go for 92GB or so) - no messages about ad1.  So I wonder if
this points at either the cable connector on ad0 or the drive itself.  I
guess I'd rather have a failing drive than motherboard...
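
For reference, the dd figures above are self-consistent and easy to put in more familiar units. A small sketch, using only the numbers dd printed (the variable names are illustrative):

```python
# Sketch: recompute the dd summary line from its record count and block size.
BLOCK_SIZE = 64 * 1024   # bs=64k
records = 1408596        # "1408596+0 records in"
seconds = 1415.324362    # elapsed time reported by dd

total_bytes = records * BLOCK_SIZE
rate = total_bytes / seconds

print(f"bytes read: {total_bytes}")
print(f"that is {total_bytes / 1e9:.1f} GB at about {rate / 1e6:.1f} MB/s")
```

A sustained sequential read of that size on ad1 without a single timeout only covers reads on that one drive; it does not exercise writes, and it does not exercise ad0.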

I originally was wondering if somehow something peculiar about ZFS's
disk access pattern was making it happen...

Thanks for the recommendations.  I'll keep an eye on it, and I'll let you
know what a cable change does for me.  Still, I have not had any ad0
messages since this morning (I haven't been using the system today much,
but maybe the cron processes are more likely to trigger it...)

-Joe


Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-01-25 Thread Chuck Swiger

On Jan 25, 2008, at 1:05 PM, Thomas Hurst wrote:
  These numbers are quite worrisome -- they should be zero or nearly so in a
  healthy drive.

 No, these are perfectly reasonable for a Seagate.  I have about 12
 7200.X's and all show the same sort of behavior.  If they're nearly zero
 it's probably a sign your manufacturer isn't actually counting them
 (marketroids hate accurate SMART readings).

 Try graphing them as counters; with an idle disk you'll see periodic
 sawtooth patterns as the heads crawl from one side of the disk to the
 other.


SMART attributes which end with _Ct or _Count are supposed to increment
with every event; things which end with _Rate (i.e., Raw_Read_Error_Rate,
Seek_Error_Rate) are supposed to indicate the frequency of such errors
over time.  It would be reasonable for Hardware_ECC_Recovered to keep the
incremental count, but not the other two.
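
The naming rule of thumb described in the paragraph above can be written down as a tiny classifier. This is only a sketch of that rule as stated in this message: expected_raw_kind is a hypothetical helper, and other replies in this thread report that Seagate raw values do not follow such a simple reading.

```python
# Sketch: classify SMART attribute names by suffix, per the rule of thumb
# described above. Not an official smartmontools convention.
def expected_raw_kind(name: str) -> str:
    """Guess how a raw value is meant to behave from the attribute name."""
    if name.endswith(("_Ct", "_Count")):
        return "counter"      # expected to increment with every event
    if name.endswith("_Rate"):
        return "rate"         # expected to express frequency over time
    return "unspecified"      # the name alone gives no hint


# Attribute names taken from the smartctl output quoted in this thread.
for attr in (
    "Raw_Read_Error_Rate",
    "Seek_Error_Rate",
    "Reallocated_Sector_Ct",
    "Start_Stop_Count",
    "Hardware_ECC_Recovered",
):
    print(f"{attr:24s} {expected_raw_kind(attr)}")
```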


I agree that minor periodic errors happen over time and are not a great
concern, but a happy drive will show zero reallocated sectors, or perhaps
a few over the span of a year or two, and will have an ECC recovered or
UDMA_CRC count which is much smaller than was reported by Joe.


YMMV, of course...

--
-Chuck



Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-01-25 Thread Chuck Swiger

On Jan 25, 2008, at 11:24 AM, Joe Peterson wrote:
 ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
   1 Raw_Read_Error_Rate     0x000f   114   071   006    Pre-fail  Always   -           82422948
 [ ... ]
   7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always   -           286126605
 [ ... ]
 195 Hardware_ECC_Recovered  0x001a   063   046   000    Old_age   Always   -           166181300

These numbers are quite worrisome -- they should be zero or nearly so
in a healthy drive.


--
-Chuck



Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-01-25 Thread Jeremy Chadwick
On Fri, Jan 25, 2008 at 06:42:04PM +0100, Julian H. Stacey wrote:
 Jeremy Chadwick wrote:
   wondering if this is a known issue.  Note that smartctl does not report
   errors logged and gives a PASSED to the drive.  I am running at
   UDMA100 ATA.  Also, if it matters, I am using ZFS.
 
  Can you please provide output of the following:
  
  * smartctl -a /dev/ad0
 
 From ports/sysutils/smartmontools I presume ?
   ( Asking as I also have a DMA prob. to solve, at present
   needing hw.ata.ata_dma=0 in /boot/loader.conf to boot,
   ( interruptions on sound on 7-stable, though no ZFS here)).

Yep!  smartctl comes with ports/sysutils/smartmontools.

-- 
| Jeremy Chadwickjdc at parodius.com |
| Parodius Networking   http://www.parodius.com/ |
| UNIX Systems Administrator  Mountain View, CA, USA |
| Making life hard for others since 1977.  PGP: 4BD6C0CB |



Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-01-25 Thread Jeremy Chadwick
On Fri, Jan 25, 2008 at 08:58:41AM -0700, Joe Peterson wrote:
 I've seen mention of this kind of issue before, but I never saw a
 solution, except that someone reported that a certain version of 6.x
 seemed to make it go away - accounts of this problem are a bit vague.  I
 am running 7.0-RC1, and I am seeing the errors periodically, and I am
 wondering if this is a known issue.  Note that smartctl does not report
 errors logged and gives a PASSED to the drive.  I am running at
 UDMA100 ATA.  Also, if it matters, I am using ZFS.

What you've shown is usually the sign of a disk-related problem.  It's
very obvious when it's just one disk reporting DMA errors.  You use ZFS,
so chances are you have more than one disk in a pool/volume -- there's
no indication ad1, ad4, ad6, etc. are failing, so this seems to indicate
something specific to ad0.

Manufacturers pick very passive (non-aggressive) thresholds for error
conditions on disks, so disks which are failing very commonly show
PASSED during SMART analysis.  To make matters worse, most users I
know read SMART stats incorrectly (they're easy to misinterpret).

Can you please provide output of the following:

* smartctl -a /dev/ad0
* atacontrol cap ad0
* atacontrol info ata0, ata1, etc. -- any controller used by ZFS
* Relevant dmesg output that indicates what kind of ATA controller
  these disks are attached to.  Start with output from 'ad0:' and
  work backwards.  For example, ad0 on this machine is using an Intel
  ICH6 controller:
  atapci0: Intel ICH6 SATA150 controller port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f at device 31.2 on pci0
  ata0: ATA channel 0 on atapci0
  ad0: 238475MB WDC WD2500KS-00MJB0 02.01C03 at ata0-master SATA150
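
The requests in the list above could be gathered in one go. The sketch below only builds and prints the command lines named in that list; diagnostic_commands is a hypothetical helper, and the disk and channel names passed in are examples that would need to match the actual system.

```python
# Sketch: build the diagnostic command lines requested above for one disk.
# Nothing is executed here; the commands are only printed.
def diagnostic_commands(disk: str, channels: list[str]) -> list[list[str]]:
    """Return argv lists for the SMART, capability, channel and dmesg checks."""
    cmds = [
        ["smartctl", "-a", f"/dev/{disk}"],
        ["atacontrol", "cap", disk],
    ]
    cmds += [["atacontrol", "info", channel] for channel in channels]
    cmds.append(["dmesg"])  # then look for the controller lines before 'ad0:'
    return cmds


for argv in diagnostic_commands("ad0", ["ata0", "ata1"]):
    print(" ".join(argv))
```

Each argv list could be handed to subprocess.run on the affected machine (as root) to capture the output for posting.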

Other stuff:

SMART stats which are labelled "Offline" are only updated when a short
or long offline test is performed.  Have you tried using
"smartctl -t short /dev/ad0" and "smartctl -t long /dev/ad0" to see if
any of the raw values on the far right column increment?

Have you tried using "zpool scrub" on the ZFS pool, then "zpool status"
to see if READ/WRITE/CKSUM counters increment or if the "scrub:" line
states there were errors?

Other things which have fixed problems in the past for others:

* BIOS updates
* Change of motherboards (sometimes replacing board with same model,
  other times going with a completely different vendor (implies weird
  implementation issues or BIOS problems))
* Changing SATA cables
* Getting a larger power supply (usually when lots of disks are involved)

-- 
| Jeremy Chadwickjdc at parodius.com |
| Parodius Networking   http://www.parodius.com/ |
| UNIX Systems Administrator  Mountain View, CA, USA |
| Making life hard for others since 1977.  PGP: 4BD6C0CB |



Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-01-25 Thread Jeremy Chadwick
On Fri, Jan 25, 2008 at 06:03:33PM -0700, Joe Peterson wrote:
 Wow, pretty crazy!  Hmm, and yes, those LBAs do look close together.
 Well, let me know how the smartctl output looks.  I'd be curious if your
 bad sector count rises.

Absolutely nada on the SMART statistics.  Nothing incremented or changed
in any way.  My short and long tests did not change any of the data in
the fields either.  Full output is below my .sig.

 BTW, I tried:
 
 crater# dd if=/dev/ad1s4 of=/dev/null bs=64k
 ^C1408596+0 records in
 1408596+0 records out
 92313747456 bytes transferred in 1415.324362 secs (65224446 bytes/sec)
 
 (I let it go for 92GB or so) - no messages about ad1.  So I wonder if
 this points at either the cable connector on ad0 or the drive itself.  I
 guess I'd rather have a failing drive than motherboard...
 
 I originally was wondering if somehow something peculiar about ZFS's
 disk access pattern was making it happen...

Since I'm used to dealing with disk issues (at work and personally), I'm
left wondering if this is some strange ATA subsystem quirk, or
ultimately something with ZFS (your "something peculiar about ZFS's disk
access pattern" claim is starting to look more plausible).

This may sound suicidal, but I'm hoping to recreate the scenario
somehow, and then punt the details to Soren or Xin Li for further
investigation -- if it looks like an ATA subsystem thing, that is.

It is interesting to note that we both have Seagate disks...  :-) I'll
have to run SeaTools on my disk to see if anything comes back, or run a
selective LBA test in smartctl (since the drive supports it).

I've restarted my rsync since, and it's happily chomping away without an
issue.  If my problem was TRULY a bad block or something causing
mechanical lock-up on the disk, I'd have expected my latest rsync to
induce it.

There's always the chance of some bizarre drive firmware bug too.

 Thanks for the recommendations.  I'll keep an eye on it, and I'll let you
 know what a cable change does for me.  Still, I have not had any ad0
 messages since this morning (I haven't been using the system today much,
 but maybe the cron processes are more likely to trigger it...)

Understood.  In my case, I *know* the cables are fine, because the box
itself I just built and migrated to a few days ago (change of
motherboard, chassis, and addition of SATA hot-swap backplane).  We use
the same motherboard (Supermicro PDSMI+) in all of our production
servers in our datacenter, and they're rock-solid.

I've done hot-swapping without any issue on those systems too, and I've
never seen any SATA system issues -- one of the systems is our
datacenter backup server, which holds nightly backups for all the other
boxes (about 6).  Due to the heavy disk I/O that occurs for hours at a
time, if this was some weird system quirk, motherboard problem, or SATA
bus/cable issue, we would've seen it by now.  FWIW: all our systems,
including the backup box, use UFS2 exclusively -- no ZFS in the picture.

-- 
| Jeremy Chadwickjdc at parodius.com |
| Parodius Networking   http://www.parodius.com/ |
| UNIX Systems Administrator  Mountain View, CA, USA |
| Making life hard for others since 1977.  PGP: 4BD6C0CB |


icarus# smartctl -a /dev/ad6
smartctl version 5.37 [i386-portbld-freebsd7.0] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.10 family
Device Model:     ST3500630AS
Serial Number:    9QG1YWNL
Firmware Version: 3.AAE
User Capacity:    500,107,862,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Fri Jan 25 17:10:31 2008 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 430) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.

Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-01-25 Thread Jeremy Chadwick
On Fri, Jan 25, 2008 at 12:24:20PM -0700, Joe Peterson wrote:
 In my case, I am using only one disk (ad0) for FreeBSD, and I am only
 using one partition on this disk in my ZFS pool.  So, in this case,
 unfortunately, it's not possible to tell from the fact that only ad0 is
 listed that it is specific to this drive.

Ah ha.  Well, in your below example, you may only be using one drive for
FreeBSD (ad0), but you do have a 2nd drive (ad1) which is installed.
I would try doing some I/O on /dev/ad1 to see if you can get the
timeouts to occur on that drive as well.  You don't have to do anything
risky with ad1 either: dd if=/dev/ad1 of=/dev/null bs=64k would probably
suffice.

 Yep, I am also always skeptical of smart reports.  That's one reason I
 am very interested in ZFS.  I don't trust the drive to be completely
 reliable, and the fact that ZFS does end-to-end data integrity is very
 intriguing.

I agree entirely -- and I also use ZFS myself (across two drives in a
RAID0-like fashion, with a completely separate drive which is used for
nightly backups of the ZFS pool).  I'm absolutely thrilled with it;
finally something clean, reliable, and simple -- something I've always
wanted in a LVM or LVM-like implementation.

  * smartctl -a /dev/ad0
 
 OK, I've attached this to the end of this email.

 atapci0: Intel ICH4 UDMA100 controller port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f at device 31.1 on pci0
 ata0: ATA channel 0 on atapci0
 ata0: [ITHREAD]
 ad0: 476940MB Seagate ST3500630A 3.AAE at ata0-master UDMA100

The smartctl output for /dev/ad0 looks good, minus the one uncorrected
sector.  I'm ignoring that since it's proof that the drive knew of it
and remapped it.  If that number starts incrementing over time, though,
replace the drive ASAP, of course.

The atacontrol cap output looks fine too; nothing wonky, and the LBA
capabilities look fine.

The controller is nothing out-of-the-ordinary; it's reliable under
FreeBSD (I've had many a motherboard which used it).  Of course I
haven't used an ICH4 since FreeBSD 3.x, and the ATA layer has changed
substantially, numerous times.

 {regarding -t short and -t long}
 Also, none of the numbers that were zero incremented, esp:
 
 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline  -           0
 
 Also, no more errors were reported in the system log during the self-tests.

That seems to indicate that the drive considers itself healthy.

Another test I could recommend at this point would be one that would
require a few hours of downtime: download Seagate's SeaTools (will
require a CD burner or floppies) and consider doing both "quick" and
"long" scans.  "Quick" checks some of the stuff we've looked at here,
but it also looks at some vendor-specific stuff within the drive.
"Long" will scan every block on the disk for errors (and will not
destroy data).

 OK, I started a scrub, and it will take some more time to complete...
 But I get the following with status.  Could this be due to the timeouts
 and failures?  I suspect so, so maybe this is not surprising.

It depends on whether or not you saw more timeouts and cache errors spit
out by the kernel while zpool scrub ran.  If so, then yes, I would
definitely say they're related.

 I'd also guess that this doesn't necessarily point to the drive, but
 anything in the chain of events...  I do not have a mirror or RAID-Z,
 so I guess the reason there was no data loss (yet) is because the
 checksum passed, and maybe it just had to retry...?

I'm still new to ZFS myself, so I don't have an answer for you.  Your
conclusion is the same thing I'd conclude, though.

 I've been using this same motherboard/BIOS for a long time (as well as
 this drive), so no changes have happened to the HW recently.  The BIOS
 is the newest available, I believe (It's a Tyan Trinity S2099, so it's
 a few years old).

I'd say the BIOS is probably not responsible at this point; I'd expect
other weird things to be going on with the system if the BIOS was broken
in some way (or possibly bit rot in the flash).

It's going to be difficult to determine if maybe something on the
mainboard has decided to start failing (some transistor within the ICH4,
etc...) though.  :-(

 I'm using regular ATA 80-conductor cables.  Also, these seem to have been
 working fine for quite a while now.  But, yes, I have also witnessed bad
 cable issues on older systems in the past.  I certainly could try a new
 cable and see if it helps.

I'd try that for sure.  It's just one more thing to rule out.

  * Getting a larger power supply (usually when lots of disks are involved)
 
 I only have two drives, so I think the PS has enough capacity in my case.

Agreed; even a 350W PSU should handle 2 disks without a problem.

Here's something to ponder:

The LBAs being reported as having errors are scattered all over.  They
aren't lumped together (usually the sign of part of a platter going
bad); instead, they're all over the drive.

This would indicate either cable problems, 

Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-01-25 Thread Thomas Hurst
* Chuck Swiger ([EMAIL PROTECTED]) wrote:

 On Jan 25, 2008, at 11:24 AM, Joe Peterson wrote:
 ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
   1 Raw_Read_Error_Rate     0x000f   114   071   006    Pre-fail  Always       -       82422948
 [ ... ]
   7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always       -       286126605
 [ ... ]
 195 Hardware_ECC_Recovered  0x001a   063   046   000    Old_age   Always       -       166181300
 
 
 These numbers are quite worrisome -- they should be zero or nearly so in a
 healthy drive.

No, these are perfectly reasonable for a Seagate.  I have about 12
7200.X's and all show the same sort of behavior.  If they're nearly zero
it's probably a sign your manufacturer isn't actually counting them
(marketroids hate accurate SMART readings).

Try graphing them as counters; with an idle disk you'll see periodic
sawtooth patterns as the heads crawl from one side of the disk to the
other.
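
A minimal sketch of what "graphing them as counters" could mean in practice: plot the increase between successive raw readings rather than the raw readings themselves. The readings below are invented for illustration, and treating a drop as a reset is an assumption borrowed from how counter-style graphing tools usually behave, not something the drive documents:

```python
def deltas(samples):
    """Turn successive raw readings into per-interval increases."""
    out = []
    for prev, cur in zip(samples, samples[1:]):
        # A drop means the raw value wrapped or reset; count the new
        # reading itself as the increase since the reset.
        out.append(cur - prev if cur >= prev else cur)
    return out


# Made-up Seek_Error_Rate raw readings taken at regular intervals.
readings = [286126605, 286126700, 286126950, 120, 480]
print(deltas(readings))  # [95, 250, 120, 360]
```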

-- 
Thomas 'Freaky' Hurst
http://hur.st/
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-01-25 Thread Jeremy Chadwick
On Fri, Jan 25, 2008 at 12:46:08PM -0800, Chuck Swiger wrote:
 On Jan 25, 2008, at 11:24 AM, Joe Peterson wrote:
 ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
   1 Raw_Read_Error_Rate     0x000f   114   071   006    Pre-fail  Always       -       82422948
 [ ... ]
   7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always       -       286126605
 [ ... ]
 195 Hardware_ECC_Recovered  0x001a   063   046   000    Old_age   Always       -       166181300

 These numbers are quite worrisome -- they should be zero or nearly so in a
 healthy drive.

On some drives, yes, but not all drives.  His is a Seagate drive --
Seagate uses some of the bits in the raw data section for some sort of
internal use by the drive firmware.  So although the raw values may
appear very high, the drive appears to function normally, and the actual
adjusted SMART value (the field under VALUE) doesn't fluctuate.

I have Seagate drives all over the place which exhibit identical stats
to the above.  I've included some for comparison below; each listed is
on a different system.  Look at attribute 190 (Temperature_Celsius) for
an example; I don't think any drive can reach 773849124C.
Or, well, I sure hope not.  :-)  I believe in the case of attrib. 190,
that's why they present a human-readable value in attribute 194.
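
To illustrate the "bits used for something else" point, here is a small sketch that splits the attribute 190 raw value from the ad6 output into bytes. The low byte comes out as 36, which matches the raw value of attribute 194 on the same drive. What the remaining bytes mean is not stated anywhere in this thread, so the code does not try to interpret them:

```python
def raw_bytes(raw, width=6):
    """Split a SMART raw value (a 48-bit field) into bytes, low byte first."""
    return [(raw >> (8 * i)) & 0xFF for i in range(width)]


raw_190 = 773849124  # attribute 190 raw value reported for ad6

print(hex(raw_190))        # 0x2e200024
print(raw_bytes(raw_190))  # [36, 0, 32, 46, 0, 0]
```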

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |

==SNIP==

ad6: 476940MB Seagate ST3500630AS 3.AAE at ata3-master SATA300

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   112   094   006    Pre-fail  Always       -       221374987
  3 Spin_Up_Time            0x0003   094   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       6
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   082   060   030    Pre-fail  Always       -       29014
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2967
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       9
187 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
189 Unknown_Attribute       0x003a   100   100   000    Old_age   Always       -       0
190 Temperature_Celsius     0x0022   064   050   045    Old_age   Always       -       773849124
194 Temperature_Celsius     0x0022   036   050   000    Old_age   Always       -       36 (Lifetime Min/Max 0/29)
195 Hardware_ECC_Recovered  0x001a   066   059   000    Old_age   Always       -       36458075
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       18
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       18
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x       100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0


ad4: 114473MB Seagate ST3120827AS 3.42 at ata2-master SATA150

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   063   052   006    Pre-fail  Always       -       57703728
  3 Spin_Up_Time            0x0003   096   096   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       24
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   082   060   030    Pre-fail  Always       -       169005025
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3536
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       24
194 Temperature_Celsius     0x0022   027   040   000    Old_age   Always       -       27 (Lifetime Min/Max 0/15)
195 Hardware_ECC_Recovered  0x001a   063   052   000    Old_age   Always       -       57703728
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   

Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-01-25 Thread Joe Peterson
Jeremy Chadwick wrote:
 What you've shown is usually the sign of a disk-related problem.  It's
 very obvious when it's just one disk reporting DMA errors.  You use ZFS,
 so chances are you have more than one disk in a pool/volume -- there's
 no indication ad1, ad4, ad6, etc. are failing, so this seems to indicate
 something specific to ad0.

Jeremy, thanks for the response - I have tried to answer all of your
questions below...

In my case, I am using only one disk (ad0) for FreeBSD, and I am only
using one partition on this disk in my ZFS pool.  So, in this case,
unfortunately, it's not possible to tell from the fact that only ad0 is
listed that it is specific to this drive.

 Manufacturers pick very passive (non-aggressive) thresholds for error
 conditions on disks, so disks which are failing very commonly show
 PASSED during SMART analysis.  To make matters worse, most users I
 know read SMART stats incorrectly (they're easy to misinterpret).

Yep, I am also always skeptical of SMART reports.  That's one reason I
am very interested in ZFS.  I don't trust the drive to be completely
reliable, and the fact that ZFS does end-to-end data integrity is very
intriguing.
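
For reading the attribute table itself, the comparison that matters is the normalized VALUE against THRESH (an attribute is considered failed once VALUE drops to or below THRESH), not the size of RAW_VALUE. A hedged sketch of that check follows; the regular expression is an ad hoc assumption about the row layout rather than an official parser, and the three rows are the ones quoted earlier in this thread:

```python
import re

# One attribute row of smartctl-style output.  The pattern tolerates
# missing whitespace between THRESH and TYPE, as in mail-wrapped rows.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]*)\s+(\d{3})\s+(\d{3})\s+(\d{3})\s*"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(.*)$"
)


def parse_row(line):
    """Return the interesting fields of one attribute row, or None."""
    match = ROW.match(line)
    if match is None:
        return None
    ident, name, _flag, value, worst, thresh, kind, _upd, _failed, raw = match.groups()
    return {"id": int(ident), "name": name, "value": int(value),
            "worst": int(worst), "thresh": int(thresh),
            "type": kind, "raw": raw.strip()}


def margin(row):
    """Distance between the normalized value and its failure threshold."""
    return row["value"] - row["thresh"]


rows = [
    "  1 Raw_Read_Error_Rate     0x000f   114   071   006    Pre-fail  Always       -       82422948",
    "  7 Seek_Error_Rate 0x000f   084   060   030Pre-fail  Always   -   286126605",
    "195 Hardware_ECC_Recovered  0x001a   063   046   000    Old_age   Always       -       166181300",
]

for line in rows:
    row = parse_row(line)
    status = "FAILED" if margin(row) <= 0 else "ok"
    print(row["name"], margin(row), status)
```

By this measure all three attributes are well clear of their thresholds (margins of 108, 54 and 63), despite the large raw numbers.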

 Can you please provide output of the following:
 
 * smartctl -a /dev/ad0

OK, I've attached this to the end of this email.

 * atacontrol cap ad0

Protocol              ATA/ATAPI revision 7
device model          ST3500630A
serial number         9QG0DG03
firmware revision     3.AAE
cylinders             16383
heads                 16
sectors/track         63
lba supported         268435455 sectors
lba48 supported       976773168 sectors
dma supported
overlap not supported

Feature                      Support  Enable    Value           Vendor
write cache                    yes      yes
read ahead                     yes      yes
Tagged Command Queuing (TCQ)   no       no      0/0x00
SMART                          yes      yes
microcode download             yes      yes
security                       yes      no
power management               yes      yes
advanced power management      no       no      65278/0xFEFE
automatic acoustic management  no       no      0/0x00  208/0xD0
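
As a sanity check on those sector counts, a short worked calculation, assuming 512-byte sectors (the output above does not state the sector size). The lba48 figure works out to the same 476940MB that dmesg prints for ad0, and the lba28 figure is simply the 28-bit addressing limit:

```python
SECTOR_BYTES = 512  # assumption: not shown in the atacontrol output

lba28_sectors = 268435455
lba48_sectors = 976773168

size_bytes = lba48_sectors * SECTOR_BYTES
print(size_bytes)                    # 500107862016
print(size_bytes // (1024 * 1024))   # 476940
print(lba28_sectors == 2**28 - 1)    # True
```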

 * atacontrol info ata0, ata1, etc. -- any controller used by ZFS

Master:  ad0 ST3500630A/3.AAE ATA/ATAPI revision 7
Slave:   ad1 ST3160812A/3.AAH ATA/ATAPI revision 7

(but note that ad1 is not used by FreeBSD)

 * Relevant dmesg output that indicates what kind of ATA controller
   these disks are attached to.  Start with output from 'ad0:' and
   work backwards.  For example, ad0 on this machine is using an Intel
   ICH6 controller:
   atapci0: Intel ICH6 SATA150 controller port 
 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f at device 31.2 on pci0
   ata0: ATA channel 0 on atapci0
   ad0: 238475MB WDC WD2500KS-00MJB0 02.01C03 at ata0-master SATA150

atapci0: Intel ICH4 UDMA100 controller port
0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f at device 31.1 on pci0

ata0: ATA channel 0 on atapci0

ata0: [ITHREAD]
ad0: 476940MB Seagate ST3500630A 3.AAE at ata0-master UDMA100

 SMART stats which are labelled Offline are only updated when a short
 or long offline test is performed.  Have you tried using smartctl -t
 short /dev/ad0 and smartctl -t long /dev/ad0 to see if any of the raw
 values on the far right column increment?

I just tried one:

# 1  Short offline       Completed without error       00%      5252         -
# 2  Short offline       Completed without error       00%      5252         -

Also, none of the numbers that were zero incremented, esp:

198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0

Also, no more errors were reported in the system log during the self-tests.

 Have you tried using zpool scrub on the ZFS pool, then zpool status
 to see if READ/WRITE/CKSUM counters increment or if the scrub line
 states there were errors?

OK, I started a scrub, and it will take some more time to complete...
But I get the following with status.  Could this be due to the timeouts
and failures?  I suspect so, so maybe this is not surprising.  I'd also
guess that this doesn't necessarily point to the drive, but to anything in
the chain of events...  I do not have a mirror or RAID-Z, so I guess the
reason there was no data loss (yet) is because the checksum passed,
and maybe it just had to retry...?  Anyway, here's the output so far:

  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress, 2.50% done, 1h58m to go
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       1     3     0
          ad0s1d    ONLINE       1     3     0

errors: No known data errors
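
To watch whether those counters keep climbing between scrubs, the config section can be pulled apart with a few lines of Python. This is only a sketch: the function name is made up, it assumes the counters are plain integers as in the output above, and it silently skips any row where they are not:

```python
def parse_zpool_counters(status_text):
    """Return {device_name: (read, write, cksum)} from zpool status text."""
    counters = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields == ["config:"]:
            in_config = True
            continue
        # Device rows have five fields, the last three purely numeric.
        if in_config and len(fields) == 5 and all(f.isdigit() for f in fields[2:]):
            counters[fields[0]] = tuple(int(f) for f in fields[2:])
    return counters


STATUS = """\
  pool: tank
 state: ONLINE
 scrub: scrub in progress, 2.50% done, 1h58m to go
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       1     3     0
          ad0s1d    ONLINE       1     3     0

errors: No known data errors
"""

result = parse_zpool_counters(STATUS)
print(result)  # {'tank': (1, 3, 0), 'ad0s1d': (1, 3, 0)}
```

Here the READ and WRITE counts are non-zero while CKSUM stays at 0.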

 Other 

Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1

2008-01-25 Thread Joe Peterson
Chuck Swiger wrote:
 On Jan 25, 2008, at 11:24 AM, Joe Peterson wrote:
 ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
   1 Raw_Read_Error_Rate     0x000f   114   071   006    Pre-fail  Always       -       82422948
 [ ... ]
   7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always       -       286126605
 [ ... ]
 195 Hardware_ECC_Recovered  0x001a   063   046   000    Old_age   Always       -       166181300
 
 These numbers are quite worrisome -- they should be zero or nearly so
 in a healthy drive.

It seems to depend on the drive manufacturer.  E.g. this is a Seagate.  Every
Seagate I've ever had (or heard about on the web via smartctl dumps) reports
very large numbers for these values.  I've heard it described that Seagate
shows you the raw numbers (and correctable errors do happen all the time in
all drives).

In Western Digital drives (IIRC), the numbers shown are the ones that *should*
be zero, thereby hiding the low-level errors.

Hard to say if my numbers are too high, but these corrected error counts
are always frighteningly high in Seagates.

-Joe
