Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
On Tue, Feb 12, 2008 at 01:03:35AM +0100, Torfinn Ingolfsen wrote:
> On Mon, 11 Feb 2008 13:00:57 +0100 [EMAIL PROTECTED] (Remco van Bekkum) wrote:
> > here? It's on an amd64, Asus m2a-vm with ati xp600, AMD BE-2350 CPU, 2GB 800MHz RAM.
>
> FWIW, I have almost the same motherboard (m2a-vm hdmi) with an AMD Phenom 9500 and 4GB RAM[1]. Different disk, though. The (single) disk drive has worked without problems so far. I'm using standard ufs2 filesystems on that disk. I'm running RELENG_7:
>
> [EMAIL PROTECTED] uname -a
> FreeBSD kg-vm.kg4.no 7.0-PRERELEASE FreeBSD 7.0-PRERELEASE #6: Sat Jan 26 20:58:51 CET 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/GENERIC amd64
> [EMAIL PROTECTED] atacontrol list
> ATA channel 0:
>     Master: acd0 Optiarc DVD RW AD-5170A/1.12 ATA/ATAPI revision 0
>     Slave:  no device present
> ATA channel 2:
>     Master: ad4 SAMSUNG HD501LJ/CR100-12 Serial ATA II
>     Slave:  no device present
> ATA channel 3:
>     Master: no device present
>     Slave:  no device present
> ATA channel 4:
>     Master: no device present
>     Slave:  no device present
> ATA channel 5:
>     Master: no device present
>     Slave:  no device present
>
> References:
> 1) http://tingox.googlepages.com/asus_m2a-vm_hdmi_freebsd
> --
> Regards,
> Torfinn Ingolfsen

Thanks, here is some more detailed info from me:

xaero# dmesg | grep atapci
atapci0: ATI AHCI controller port 0xfc00-0xfc07,0xf800-0xf803,0xf400-0xf407,0xf000-0xf003,0xec00-0xec0f mem 0xfe02f000-0xfe02f3ff irq 22 at device 18.0 on pci0
atapci0: [ITHREAD]
atapci0: AHCI Version 01.10 controller with 4 ports detected
ata2: ATA channel 0 on atapci0
ata3: ATA channel 1 on atapci0
ata4: ATA channel 2 on atapci0
ata5: ATA channel 3 on atapci0
atapci1: ATI IXP600 UDMA133 controller port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xe400-0xe40f at device 20.1 on pci0
ata0: ATA channel 0 on atapci1

xaero# atacontrol list
ATA channel 0:
    Master: no device present
    Slave:  no device present
ATA channel 2:
    Master: ad4 Hitachi HDP725050GLA360/GM4OA50E Serial ATA II
    Slave:  no device present
ATA channel 3:
    Master: ad6 Hitachi HDP725050GLA360/GM4OA50E Serial ATA II
    Slave:  no device present
ATA channel 4:
    Master: ad8 Hitachi HDP725050GLA360/GM4OA50E Serial ATA II
    Slave:  no device present
ATA channel 5:
    Master: ad10 Hitachi HDP725050GLA360/GM4OA50E Serial ATA II
    Slave:  no device present

xaero# uname -a
FreeBSD xaero.spacemarines.us 7.0-PRERELEASE FreeBSD 7.0-PRERELEASE #1: Sun Feb 10 16:07:39 CET 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/GENERIC amd64

I'm using BIOS 1603 and a Seasonic 330W PSU. The errors appear to happen at random; heavy I/O doesn't trigger them. I can rebuild world without problems, so I guess the CPU is OK. The memory has been tested and showed no errors. What's left is cables and mainboard. But how error-prone are SATA cables? Considering that I've got 50% failing... Okay, maybe that should prove that the mainboard is faulty :)

-Remco
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
On Mon, Feb 11, 2008 at 07:24:55AM -1000, Clifton Royston wrote:
> On Mon, Feb 11, 2008 at 01:00:57PM +0100, Remco van Bekkum wrote:
> > On Fri, Jan 25, 2008 at 04:38:46PM -0800, Jeremy Chadwick wrote:
> > After having replaced my first SATA disk with one of the same type, and still having the same errors, I replaced this 1TB drive with 4x500GB Hitachi P7K500 in raidz. It worked fine for a week, but yesterday I cvsupped and rebuilt world. This afternoon everything is breaking down again with the same errors:
> >
> > Feb 11 12:34:09 xaero kernel: ad6: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> > Feb 11 12:34:13 xaero kernel: ad6: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> > Feb 11 12:34:17 xaero kernel: ad6: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
> > Feb 11 12:34:21 xaero kernel: ad6: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
> > Feb 11 12:34:25 xaero kernel: ad6: WARNING - SET_MULTI taskqueue timeout - completing request directly
> > Feb 11 12:34:25 xaero kernel: ad6: FAILURE - WRITE_DMA48 timed out LBA=298014274
>
> Did you try replacing cabling as a previous poster recommended? I've had similar problems with both traditional parallel ATA and SATA due to marginal cables, which of course are not solved by swapping drives.
>
> Not saying there's not a software problem here, just that there is still one area to eliminate.
>
> -- Clifton
>
> --
> Clifton Royston -- [EMAIL PROTECTED] / [EMAIL PROTECTED]
> President - I and I Computing * http://www.iandicomputing.com/
> Custom programming, network design, systems and network consulting services

Hi Clifton,

I don't recall exactly anymore, but at least 3 cables have been used without problems on other systems. I'm wondering, because the mainboard acts weird sometimes as well: when I press the reset button, it sometimes powers down. Also, I just did a reset after it deadlocked on shutdown because of the errors, and when the system booted, 2 disks were not seen by the BIOS. I had to power down the box, and when it came up again the disks were back. Can software leave the disks in a state where the BIOS doesn't detect them after pressing the reset button?

I'm 100% certain that on my previous installation, in a 100% different system, I got the same errors. That should normally mean either software or disk. The disk has been replaced, the OS is the same. I'm either having really bad luck or something else is wrong. What is a good way of stress testing disks?

Thanks!
- Remco
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
On Mon, Feb 11, 2008 at 01:00:57PM +0100, Remco van Bekkum wrote:
> On Fri, Jan 25, 2008 at 04:38:46PM -0800, Jeremy Chadwick wrote:
> After having replaced my first SATA disk with one of the same type, and still having the same errors, I replaced this 1TB drive with 4x500GB Hitachi P7K500 in raidz. It worked fine for a week, but yesterday I cvsupped and rebuilt world. This afternoon everything is breaking down again with the same errors:
>
> Feb 11 12:34:09 xaero kernel: ad6: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> Feb 11 12:34:13 xaero kernel: ad6: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> Feb 11 12:34:17 xaero kernel: ad6: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
> Feb 11 12:34:21 xaero kernel: ad6: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
> Feb 11 12:34:25 xaero kernel: ad6: WARNING - SET_MULTI taskqueue timeout - completing request directly
> Feb 11 12:34:25 xaero kernel: ad6: FAILURE - WRITE_DMA48 timed out LBA=298014274

Did you try replacing cabling as a previous poster recommended? I've had similar problems with both traditional parallel ATA and SATA due to marginal cables, which of course are not solved by swapping drives.

Not saying there's not a software problem here, just that there is still one area to eliminate.

-- Clifton

--
Clifton Royston -- [EMAIL PROTECTED] / [EMAIL PROTECTED]
President - I and I Computing * http://www.iandicomputing.com/
Custom programming, network design, systems and network consulting services
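When comparing reports like the one quoted above, it can help to reduce the kernel messages to a per-device tally and the spacing between them. The following is only a rough Python sketch: the regular expression and the helper names (`LINE_RE`, `to_seconds`, `parse`) are made up for this illustration and cover just the message shapes seen in this thread, not every message the ata driver can print.

```python
import re
from collections import Counter

# Kernel messages copied verbatim from Remco's report quoted above.
LOG = """\
Feb 11 12:34:09 xaero kernel: ad6: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Feb 11 12:34:13 xaero kernel: ad6: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Feb 11 12:34:17 xaero kernel: ad6: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
Feb 11 12:34:21 xaero kernel: ad6: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
Feb 11 12:34:25 xaero kernel: ad6: WARNING - SET_MULTI taskqueue timeout - completing request directly
Feb 11 12:34:25 xaero kernel: ad6: FAILURE - WRITE_DMA48 timed out LBA=298014274
"""

# Hypothetical pattern for this sketch: syslog stamp, host, "kernel:",
# then "<device>: <LEVEL> - <message text>".
LINE_RE = re.compile(
    r"^\w{3} +\d+ (?P<time>\d\d:\d\d:\d\d) \S+ kernel: "
    r"(?P<dev>ad\d+): (?P<level>[A-Z]+) - (?P<text>.*)$"
)


def to_seconds(hms):
    """Convert HH:MM:SS to seconds since midnight."""
    hours, minutes, secs = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + secs


def parse(log):
    """Return (seconds, device, level, text) tuples for matching lines."""
    events = []
    for line in log.splitlines():
        match = LINE_RE.match(line)
        if match:
            events.append(
                (to_seconds(match["time"]), match["dev"],
                 match["level"], match["text"])
            )
    return events


events = parse(LOG)
# How many messages of each severity per device.
levels = Counter((dev, level) for _, dev, level, _ in events)
# Seconds between consecutive messages.
gaps = [later[0] - earlier[0] for earlier, later in zip(events, events[1:])]

print(dict(levels))
print(gaps)
```

On the quoted excerpt this gives five warnings and one failure for ad6, with the warnings arriving at a steady 4-second spacing. That regular spacing is visible in the timestamps themselves; what it implies about the cause (drive, cable, controller, or driver) is not something the tally can decide.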
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
On Mon, 11 Feb 2008 13:00:57 +0100 [EMAIL PROTECTED] (Remco van Bekkum) wrote:
> here? It's on an amd64, Asus m2a-vm with ati xp600, AMD BE-2350 CPU, 2GB 800MHz RAM.

FWIW, I have almost the same motherboard (m2a-vm hdmi) with an AMD Phenom 9500 and 4GB RAM[1]. Different disk, though. The (single) disk drive has worked without problems so far. I'm using standard ufs2 filesystems on that disk. I'm running RELENG_7:

[EMAIL PROTECTED] uname -a
FreeBSD kg-vm.kg4.no 7.0-PRERELEASE FreeBSD 7.0-PRERELEASE #6: Sat Jan 26 20:58:51 CET 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/GENERIC amd64
[EMAIL PROTECTED] atacontrol list
ATA channel 0:
    Master: acd0 Optiarc DVD RW AD-5170A/1.12 ATA/ATAPI revision 0
    Slave:  no device present
ATA channel 2:
    Master: ad4 SAMSUNG HD501LJ/CR100-12 Serial ATA II
    Slave:  no device present
ATA channel 3:
    Master: no device present
    Slave:  no device present
ATA channel 4:
    Master: no device present
    Slave:  no device present
ATA channel 5:
    Master: no device present
    Slave:  no device present

References:
1) http://tingox.googlepages.com/asus_m2a-vm_hdmi_freebsd
--
Regards,
Torfinn Ingolfsen
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
On Fri, Jan 25, 2008 at 04:38:46PM -0800, Jeremy Chadwick wrote:
> Joe,
>
> I wanted to send you a note about something that I'm still in the process of dealing with. The timing couldn't be more ironic.
>
> I decided it would be worthwhile to migrate from my two-disk ZFS stripe with a non-ZFS disk for nightly backups, to a RAIDZ pool of all 3 disks combined (since they're all the same size).
>
> I had another terminal with gstat -I500ms running in it, so I could see overall I/O.
>
> All was going well until about the 81GB mark of the copy. gstat started showing 0KB in/out on all the drives, and the rsync was stalled. ^Z did nothing, which is usually a bad sign. :-) I ssh'd in and did a dmesg (summarised):
>
> ad6: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad6: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
> ad6: WARNING - SET_MULTI taskqueue timeout - completing request directly
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951071
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951327
> ad6: FAILURE - WRITE_DMA timed out LBA=13951071
> ad6: FAILURE - WRITE_DMA timed out LBA=13951327
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951583
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951839
> ad6: FAILURE - WRITE_DMA timed out LBA=13951583
> ad6: FAILURE - WRITE_DMA timed out LBA=13951839
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13952095
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13952351
> g_vfs_done():ad6s1d[WRITE(offset=7142916096, length=131072)]error = 5
> g_vfs_done():ad6s1d[WRITE(offset=7143047168, length=131072)]error = 5
> g_vfs_done():ad6s1d[WRITE(offset=7143178240, length=131072)]error = 5
> g_vfs_done():ad6s1d[WRITE(offset=7143309312, length=131072)]error = 5
> g_vfs_done():ad6s1d[WRITE(offset=7143440384, length=131072)]error = 5
>
> It appears my /dev/ad6 (a Seagate -- more irony) must have some bad blocks.
>
> Actually, after letting things go for a while, I realised the box just locked up. Probably kernel panic'd due to the I/O problem.
>
> I'll have to poke at SMART stats later to see what showed up.
>
> --
> | Jeremy Chadwick                                jdc at parodius.com |
> | Parodius Networking                       http://www.parodius.com/ |
> | UNIX Systems Administrator                  Mountain View, CA, USA |
> | Making life hard for others since 1977.              PGP: 4BD6C0CB |

Hi all,

After having replaced my first SATA disk with one of the same type, and still having the same errors, I replaced this 1TB drive with 4x500GB Hitachi P7K500 in raidz. It worked fine for a week, but yesterday I cvsupped and rebuilt world. This afternoon everything is breaking down again with the same errors:

Feb 11 12:34:09 xaero kernel: ad6: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Feb 11 12:34:13 xaero kernel: ad6: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Feb 11 12:34:17 xaero kernel: ad6: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
Feb 11 12:34:21 xaero kernel: ad6: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
Feb 11 12:34:25 xaero kernel: ad6: WARNING - SET_MULTI taskqueue timeout - completing request directly
Feb 11 12:34:25 xaero kernel: ad6: FAILURE - WRITE_DMA48 timed out LBA=298014274
Feb 11 12:34:29 xaero kernel: ad8: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Feb 11 12:34:33 xaero kernel: ad8: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Feb 11 12:34:37 xaero kernel: ad8: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
Feb 11 12:34:41 xaero kernel: ad8: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
Feb 11 12:34:45 xaero kernel: ad8: WARNING - SET_MULTI taskqueue timeout - completing request directly
Feb 11 12:34:45 xaero kernel: ad8: FAILURE - WRITE_DMA48 timed out LBA=298013590

So of 6 new disks, I have 4 with the same errors. It would be quite safe, then, not to blame the disks, imho. I've tested the second drive in another machine, but still got these timeout errors. What's wrong here? It's on an amd64, Asus m2a-vm with ati xp600, AMD BE-2350 CPU, 2GB 800MHz RAM.

Regards,
Remco
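The LBAs and byte offsets in Jeremy's summarised dmesg, quoted above, can be cross-checked with a little arithmetic. The Python sketch below assumes 512-byte sectors, and it pairs the WRITE_DMA LBAs with the g_vfs_done() offsets purely by order of appearance; both are assumptions of this illustration, since the thread shows neither the sector size nor the partition table.

```python
SECTOR_BYTES = 512  # assumption: 512-byte logical sectors

# LBAs from the WRITE_DMA TIMEOUT/FAILURE lines, in order of first appearance.
lbas = [13951071, 13951327, 13951583, 13951839, 13952095, 13952351]

# (offset, length) pairs from the g_vfs_done() lines for ad6s1d.
writes = [
    (7142916096, 131072),
    (7143047168, 131072),
    (7143178240, 131072),
    (7143309312, 131072),
    (7143440384, 131072),
]

# Distance in sectors between consecutive failing LBAs.
strides = [b - a for a, b in zip(lbas, lbas[1:])]

# Size of each failed write, expressed in sectors.
sectors_per_write = {length // SECTOR_BYTES for _, length in writes}

# Difference between the disk LBA and the offset inside ad6s1d, in sectors,
# pairing the two lists by position (an assumption, see above).
skew = [lba - offset // SECTOR_BYTES for lba, (offset, _) in zip(lbas, writes)]

print(strides)
print(sectors_per_write)
print(skew)
```

Under those assumptions the failing LBAs are exactly 256 sectors apart, each failed write is 131072 bytes (256 sectors), and every LBA sits a constant 63 sectors above its offset within ad6s1d. In other words, the errors cover one contiguous run of back-to-back 128 KB writes rather than scattered locations. A constant 63-sector difference would be consistent with a slice that starts at sector 63, but that reading is a guess; the numbers alone do not say whether the cause was bad blocks or a stalled controller.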
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
> it should /definitely/ display a diagnostic which encourages the admin to use /etc/rc.d/hostid

Ahhh, rather, display a diagnostic which encourages the use of zpool import -a.

--JH
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
Richard Todd wrote:
> Workaround: always make sure you run /etc/rc.d/hostid start in single-user before doing any ZFS tinkering.

Good advice -- thank you. But it still sounds like Jeremy's assessment, that it's a bug, is accurate. ZFS could certainly check for a zero hostid.

If zero, it should /definitely/ display a diagnostic which encourages the admin to use /etc/rc.d/hostid (or a printout of it).

If zero, it /might/ additionally do some reads in case a likely-looking /etc/rc.d/hostid is available, and display the hostid, perhaps even speculatively start using it.

It would save some needless "no datasets available" hair pulling.

Cheers,
JH
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
> Henri Hennebert wrote:
> > Jeremy Chadwick wrote:
> > > On Fri, Jan 25, 2008 at 06:17:24PM -0700, Joe Peterson wrote:
> > > > Glad you got it back! Yes, when I was first playing with ZFS, I noticed that booting between single and multi user mode could make the pools invisible. Import seemed to bring them back...
> > >
> > > I did go into single-user mode and attempt to do ZFS-related commands, which might explain the "no datasets available" once I was back in multiuser! I would classify that as a bug, and one which is going to cause all sorts of hair-pulling for administrators in the future. I wonder what it's caused by.
> >
> > In single user / is read only and so /boot/zfs/zpool.cache can't be created/updated
>
> But it's still readable. The issue is that hostid isn't set (by /etc/rc.d/hostid).

If the root is read only, as in the case of a diskless/dataless boot, it's the fact that /boot/zfs/zpool.cache cannot be used which causes the problem, so adding zpool import -a solves the issue.

danny
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
Jeremy Chadwick wrote:
> On Fri, Jan 25, 2008 at 06:17:24PM -0700, Joe Peterson wrote:
> > Glad you got it back! Yes, when I was first playing with ZFS, I noticed that booting between single and multi user mode could make the pools invisible. Import seemed to bring them back...
>
> I did go into single-user mode and attempt to do ZFS-related commands, which might explain the "no datasets available" once I was back in multiuser! I would classify that as a bug, and one which is going to cause all sorts of hair-pulling for administrators in the future. I wonder what it's caused by.

In single user / is read only and so /boot/zfs/zpool.cache can't be created/updated

Henri

> The import technique I found on a forum somewhere, or possibly on a Solaris mailing list. I was really sweating there for a moment...
>
> > So, is the disk toast, or can you still read anything from it (part table, etc.)?
>
> The ad6 disk (/backups) fsck'd cleanly without any missing files or anomalies. The ZFS pool that has two striped disks (ad8 and ad10) is fully intact too, with no loss of data that I can see. I'll have to run a scrub after I'm done copying data over to ad6, just to make sure though.
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
I performed a ZFS scrub, which finished yesterday, and no new /var/log/messages errors were reported during that time. However, the scrub found something interesting:

crater# zpool status -v
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed with 1 errors on Fri Jan 25 12:52:32 2008
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       1     3     2
          ad0s1d    ONLINE       1     3     2

errors: Permanent errors have been detected in the following files:

        /home/joe/music/jukebox/christmas/Esquivel/Merry_XMas_from_the_SpaceAge_Bachelor_Pad/07-Snowfall.mp3

Note that I have not touched this file since copying it to this drive. So, it seems one file failed a checksum check during the scrub. I now (expectedly) get errors trying to read this file - probably ZFS indicating the condition.

When I just logged in tonight, I got two more /var/log/messages disk messages about WRITE_DMA48 TIMEOUT/FAILURE - might be a coincidence (just as I was typing my password).

Also, smartctl still shows PASSED, however, this is interesting:

195 Hardware_ECC_Recovered  0x001a   061   046   000    Old_age   Always       -       9070

The number is much *smaller* now! It was 6 a few minutes before this... wrap around?

Hmm, I'm really not sure, at this point, what is going on. So I have started a SeaTools (disk scanner from Seagate) long test of the drive. The short test passed already. The results should be interesting. If it finds nothing wrong, I am going to start to wonder if I am experiencing ZFS bugs that just happen to look like drive problems. I already did a long read, under linux, of disk contents, and got no messages about anything wrong.

If I can turn on any debugging info to help determine if this is software-related, let me know the magic keywords to use. :)

-Joe
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
Joe Peterson wrote:
> So I have started a SeaTools (disk scanner from Seagate) long test of the drive. The short test passed already. The results should be interesting. If it finds nothing wrong, I am going to start to wonder if I am experiencing ZFS bugs that just happen to look like drive problems. I already did a long read, under linux, of disk contents, and got no messages about anything wrong.

Update: both SHORT and LONG tests passed for this drive in SeaTools. Hmph... the mystery remains.

-Joe
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
Henri Hennebert wrote:
> Jeremy Chadwick wrote:
> > On Fri, Jan 25, 2008 at 06:17:24PM -0700, Joe Peterson wrote:
> > > Glad you got it back! Yes, when I was first playing with ZFS, I noticed that booting between single and multi user mode could make the pools invisible. Import seemed to bring them back...
> >
> > I did go into single-user mode and attempt to do ZFS-related commands, which might explain the "no datasets available" once I was back in multiuser! I would classify that as a bug, and one which is going to cause all sorts of hair-pulling for administrators in the future. I wonder what it's caused by.
>
> In single user / is read only and so /boot/zfs/zpool.cache can't be created/updated

But it's still readable. The issue is that hostid isn't set (by /etc/rc.d/hostid).
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
Joe Peterson wrote:
> Joe Peterson wrote:
> > So I have started a SeaTools (disk scanner from Seagate) long test of the drive. The short test passed already. The results should be interesting. If it finds nothing wrong, I am going to start to wonder if I am experiencing ZFS bugs that just happen to look like drive problems. I already did a long read, under linux, of disk contents, and got no messages about anything wrong.
>
> Update: both SHORT and LONG tests passed for this drive in SeaTools. Hmph... the mystery remains.

Were both tests done in the same machine (actually, I mean the same PSU)?
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
Joe Peterson [EMAIL PROTECTED] writes:
> Glad you got it back! Yes, when I was first playing with ZFS, I noticed that booting between single and multi user mode could make the pools invisible. Import seemed to bring them back...

Yeah. ZFS pools record the hostid of the system that accessed them last. When you boot in single-user mode, /etc/rc.d/hostid doesn't get run, so the hostid is zero, which doesn't match the hostid in the pool, so the pool doesn't show up without an import.

Workaround: always make sure you run

    /etc/rc.d/hostid start

in single-user before doing any ZFS tinkering.
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
Ivan Voras wrote:
> Were both tests done in the same machine (actually, I mean the same PSU)?

Yes - I deliberately changed nothing (not even cables) before I ran the tests. I didn't want any variables.

-Joe
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
On Sat, Jan 26, 2008 at 01:15:31PM -0700, Joe Peterson wrote:
> Joe Peterson wrote:
> > So I have started a SeaTools (disk scanner from Seagate) long test of the drive. The short test passed already. The results should be interesting. If it finds nothing wrong, I am going to start to wonder if I am experiencing ZFS bugs that just happen to look like drive problems. I already did a long read, under linux, of disk contents, and got no messages about anything wrong.
>
> Update: both SHORT and LONG tests passed for this drive in SeaTools. Hmph... the mystery remains.

As do mine -- I also completed both short and long tests in SeaTools on my drive (finished early this evening). Absolutely no errors, everything passed.

--
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
Jeremy Chadwick wrote:
> > wondering if this is a known issue. Note that smartctl does not report errors logged and gives a PASSED to the drive. I am running at UDMA100 ATA. Also, if it matters, I am using ZFS.
>
> Can you please provide output of the following:
>
> * smartctl -a /dev/ad0

From ports/sysutils/smartmontools, I presume? (Asking as I also have a DMA prob. to solve, at present needing hw.ata.ata_dma=0 in /boot/loader.conf to boot (interruptions on sound on 7-stable, though no ZFS here).)

smartctl: Not installed by /usr/src-7
No /usr/ports/*/smartctl
Clues found with locate for ports:
    sysutils/munin-node/files/patch-hddtemp_smartctl.in
    sysutils/sensors-applet/files/smartctl-helper.c
    sysutils/sensors-applet/files/smartctl-sensors-interface.c
    sysutils/sensors-applet/files/smartctl-sensors-interface.h
sysutils/munin-main # Not really ?
ports/sysutils/sensors-applet - ports/sysutils/smartmontools

--
Julian Stacey. Munich Computer Consultant, BSD Unix C Linux. http://berklix.com
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
On Fri, Jan 25, 2008 at 12:46:08PM -0800, Chuck Swiger wrote:
> On Jan 25, 2008, at 11:24 AM, Joe Peterson wrote:
> > ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
> >   1 Raw_Read_Error_Rate     0x000f   114   071   006    Pre-fail  Always       -       82422948
> > [ ... ]
> >   7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always       -       286126605
> > [ ... ]
> > 195 Hardware_ECC_Recovered  0x001a   063   046   000    Old_age   Always       -       166181300
>
> These numbers are quite worrisome -- they should be zero or nearly so in a healthy drive.

I see similarly weird values from a basically new drive. I'm not sure that there's a requirement that the raw values start from 0 and increment on each detected event.

--
Peter Jeremy
Please excuse any delays as the result of my ISP's inability to implement an MTA that is either RFC2821-compliant or matches their claimed behaviour.
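Peter's point, that the raw values need not count from zero, can be made concrete by looking at the normalized columns instead. The Python sketch below parses the three attribute rows quoted above and applies the rule that an attribute counts as failed when its normalized VALUE is less than or equal to THRESH. The type and helper names (`SmartAttr`, `parse_row`, `failing_now`) are invented for this illustration, and the parser assumes the ten-column layout shown in this thread with a purely numeric RAW_VALUE.

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int   # normalized current value
    worst: int   # lowest normalized value recorded
    thresh: int  # vendor failure threshold
    raw: int     # vendor-specific raw counter


# Attribute rows as quoted in the thread (columns:
# ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE).
ROWS = """\
  1 Raw_Read_Error_Rate     0x000f   114   071   006    Pre-fail  Always       -       82422948
  7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always       -       286126605
195 Hardware_ECC_Recovered  0x001a   063   046   000    Old_age   Always       -       166181300
"""


def parse_row(line):
    """Split one whitespace-separated attribute row into a SmartAttr."""
    fields = line.split()
    return SmartAttr(
        attr_id=int(fields[0]),
        name=fields[1],
        value=int(fields[3]),
        worst=int(fields[4]),
        thresh=int(fields[5]),
        raw=int(fields[9]),
    )


def failing_now(attr):
    """True when the normalized value has reached the threshold."""
    return attr.value <= attr.thresh


attrs = [parse_row(line) for line in ROWS.splitlines() if line.strip()]

for attr in attrs:
    status = "FAILING" if failing_now(attr) else "ok"
    print(f"{attr.attr_id:3d} {attr.name:<24} value={attr.value:3d} "
          f"thresh={attr.thresh:3d} raw={attr.raw} {status}")
```

For the quoted rows, all three normalized values are above their thresholds, which agrees with the "-" in the WHEN_FAILED column and with smartctl reporting PASSED, however large the raw numbers look. How a vendor encodes the raw field is not covered by this sketch.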
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
On Fri, Jan 25, 2008 at 06:17:24PM -0700, Joe Peterson wrote:
> Glad you got it back! Yes, when I was first playing with ZFS, I noticed that booting between single and multi user mode could make the pools invisible. Import seemed to bring them back...

I did go into single-user mode and attempt to do ZFS-related commands, which might explain the "no datasets available" once I was back in multiuser! I would classify that as a bug, and one which is going to cause all sorts of hair-pulling for administrators in the future. I wonder what it's caused by.

The import technique I found on a forum somewhere, or possibly on a Solaris mailing list. I was really sweating there for a moment...

> So, is the disk toast, or can you still read anything from it (part table, etc.)?

The ad6 disk (/backups) fsck'd cleanly without any missing files or anomalies. The ZFS pool that has two striped disks (ad8 and ad10) is fully intact too, with no loss of data that I can see. I'll have to run a scrub after I'm done copying data over to ad6, just to make sure though.

--
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
On Fri, 25 Jan 2008, Jeremy Chadwick wrote:
> On Fri, Jan 25, 2008 at 06:03:33PM -0700, Joe Peterson wrote:
>> Wow, pretty crazy! Hmm, and yes, those LBAs do look close together. Well, let me know how the smartctl output looks. I'd be curious if your bad sector count rises.
>
> Absolutely nada on the SMART statistics. Nothing incremented or changed in any way. My short and long tests did not change any of the data in the fields either. Full output is below my .sig.
[..]
> It is interesting to note that we both have Seagate disks... :-) I'll have to run SeaTools on my disk to see if anything comes back, or run a selective LBA test in smartctl (since the drive supports it).
[..]
> smartctl version 5.37 [i386-portbld-freebsd7.0] Copyright (C) 2002-6 Bruce Allen
> === START OF INFORMATION SECTION ===
> Model Family:     Seagate Barracuda 7200.10 family
> Device Model:     ST3500630AS
> Serial Number:    9QG1YWNL
> Firmware Version: 3.AAE

Same firmware as Joe's, too, though his ad1 was a bit later (3.AAG or H?)

> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   114   094   006    Pre-fail  Always       -       131599973
>   3 Spin_Up_Time            0x0003   094   094   000    Pre-fail  Always       -       0
>   4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       6
>   5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x000f   082   060   030    Pre-fail  Always       -       200325271
>   9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2970
>  10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       9
> 187 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
> 189 Unknown_Attribute       0x003a   100   100   000    Old_age   Always       -       0
> 190 Temperature_Celsius     0x0022   063   050   045    Old_age   Always       -       773849125
> 194 Temperature_Celsius     0x0022   037   050   000    Old_age   Always       -       37 (Lifetime Min/Max 0/29)

I noticed Joe's temperature readings look similarly borked too. Attribute 190 is likely something else, despite having the same flag value as 194, which in turn shows clearly wrong values for min/max, though the raw temp is reasonable:

190 Temperature_Celsius     0x0022   065   056   045    Old_age   Always       -       605749283
194 Temperature_Celsius     0x0022   035   044   000    Old_age   Always       -       35 (Lifetime Min/Max 0/15)

.. which only goes to show, as I've seen with other attributes on other drive brands, that smartctl's database isn't necessarily reliable over all versions / revisions of a given drive. Add salt to taste ..

Cheers, Ian
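The odd raw values reported for attribute 190 in this thread may not be random. Below is a minimal sketch, assuming the current temperature sits in the low byte of the raw value. That assumption is a guess drawn from the three samples posted here, not from vendor documentation, and `low_byte` is a made-up helper name.

```sh
#!/bin/sh
# low_byte RAW: print the least significant byte of a raw SMART value.
# Assumption: on these Seagate drives attribute 190 keeps the current
# temperature there; the meaning of the upper bytes is not known.
low_byte() {
    echo $(( $1 & 255 ))
}

# The three attribute-190 raw values quoted in this thread.
for raw in 773849124 773849125 605749283; do
    printf '%s = 0x%08X -> low byte %s\n' "$raw" "$raw" "$(low_byte "$raw")"
done
```

The low bytes come out as 36, 37 and 35, which match the raw value of attribute 194 in the same three reports. Three samples are thin evidence, so treat it as a hint rather than a decoding rule.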
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
Jeremy Chadwick wrote:
>>> * Getting a larger power supply (usually when lots of disks are involved)
>> I only have two drives, so I think the PS has enough capacity in my case.
> Agreed; even a 350W PSU should handle 2 disks without a problem.

I've seen power supplies with a sagging 12V rail cause these sorts of problems.

-- 
Andrew I MacIntyre             "These thoughts are mine alone..."
E-mail: [EMAIL PROTECTED]  (pref) | Snail: PO Box 370
        [EMAIL PROTECTED]  (alt)  |        Belconnen ACT 2616
Web:    http://www.andymac.org/   |        Australia
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
On Fri, Jan 25, 2008 at 05:00:54PM -0800, Jeremy Chadwick wrote:
> icarus# zfs list
> no datasets available
>
> This doesn't bode well, and doesn't make me happy. At all.

Pshew! I was able to get ZFS to start seeing the pool again by doing the following:

(Supposedly zpool import by itself will show you a list of pools which it manages to see...)

icarus# zpool import -f storage
icarus# df -k /storage
Filesystem  1024-blocks      Used     Avail Capacity  Mounted on
storage       957873024 106124032 851748992    11%    /storage
icarus# zfs list
NAME      USED  AVAIL  REFER  MOUNTPOINT
storage   101G   812G   101G  /storage
icarus# zpool status
  pool: storage
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     0
          ad8       ONLINE       0     0     0
          ad10      ONLINE       0     0     0

errors: No known data errors

Back to the drawing board.

-- 
| Jeremy Chadwick                                    jdc at parodius.com |
| Parodius Networking                           http://www.parodius.com/ |
| UNIX Systems Administrator                      Mountain View, CA, USA |
| Making life hard for others since 1977.                  PGP: 4BD6C0CB |
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
Joe,

I wanted to send you a note about something that I'm still in the process of dealing with. The timing couldn't be more ironic.

I decided it would be worthwhile to migrate from my two-disk ZFS stripe with a non-ZFS disk for nightly backups, to a RAIDZ pool of all 3 disks combined (since they're all the same size).

I had another terminal with gstat -I500ms running in it, so I could see overall I/O. All was going well until about the 81GB mark of the copy. gstat started showing 0KB in/out on all the drives, and the rsync was stalled. ^Z did nothing, which is usually a bad sign. :-)

I ssh'd in and did a dmesg (summarised):

ad6: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad6: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
ad6: WARNING - SET_MULTI taskqueue timeout - completing request directly
ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951071
ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951327
ad6: FAILURE - WRITE_DMA timed out LBA=13951071
ad6: FAILURE - WRITE_DMA timed out LBA=13951327
ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951583
ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951839
ad6: FAILURE - WRITE_DMA timed out LBA=13951583
ad6: FAILURE - WRITE_DMA timed out LBA=13951839
ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13952095
ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13952351
g_vfs_done():ad6s1d[WRITE(offset=7142916096, length=131072)]error = 5
g_vfs_done():ad6s1d[WRITE(offset=7143047168, length=131072)]error = 5
g_vfs_done():ad6s1d[WRITE(offset=7143178240, length=131072)]error = 5
g_vfs_done():ad6s1d[WRITE(offset=7143309312, length=131072)]error = 5
g_vfs_done():ad6s1d[WRITE(offset=7143440384, length=131072)]error = 5

It appears my /dev/ad6 (a Seagate -- more irony) must have some bad blocks.

Actually, after letting things go for a while, I realised the box just locked up. Probably kernel panic'd due to the I/O problem. I'll have to poke at SMART stats later to see what showed up.

-- 
| Jeremy Chadwick                                    jdc at parodius.com |
| Parodius Networking                           http://www.parodius.com/ |
| UNIX Systems Administrator                      Mountain View, CA, USA |
| Making life hard for others since 1977.                  PGP: 4BD6C0CB |
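The dmesg excerpt in this message is internally consistent, which a little arithmetic shows. This is a minimal sketch, assuming 512-byte sectors; the helper names are made up for illustration.

```sh
#!/bin/sh
SECTOR=512   # assumed sector size in bytes

# bytes_between LBA1 LBA2: distance between two LBAs, in bytes.
bytes_between() {
    echo $(( ($2 - $1) * SECTOR ))
}

# offset_to_sector OFFSET: byte offset within a provider -> sector number.
offset_to_sector() {
    echo $(( $1 / SECTOR ))
}

# Consecutive failing LBAs are 256 sectors apart ...
bytes_between 13951071 13951327                          # 131072, the g_vfs_done length
# ... and the first g_vfs_done offset sits 63 sectors below the first LBA.
echo $(( 13951071 - $(offset_to_sector 7142916096) ))    # 63
```

A 63-sector gap would be consistent with ad6s1d starting at the conventional first-slice offset of sector 63. If so, the filesystem errors and the WRITE_DMA failures describe the same back-to-back 128 KB writes, rather than damage scattered across the disk.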
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
On Fri, Jan 25, 2008 at 04:38:46PM -0800, Jeremy Chadwick wrote:
> I'll have to poke at SMART stats later to see what showed up.

So the box did indeed panic. The backtrace contained about 1.5 screens of function calls from the stack, which makes taking a photo of the screen a bit worthless. The functions shown were predominantly I/O related; since a disk locked up (or something), this didn't surprise me.

SMART stats showed absolutely nothing wrong with ad6, or any of the other drives on the system.

Worse: my ZFS pool appears *completely* gone -- that's about 170GB of data. I don't even know how that happened, because there were absolutely no issues reported on either of the disks in the ZFS pool. It's like the situation somehow caused ZFS to go crazy and lose all of its metadata.

icarus# zfs list
no datasets available

This doesn't bode well, and doesn't make me happy. At all.

-- 
| Jeremy Chadwick                                    jdc at parodius.com |
| Parodius Networking                           http://www.parodius.com/ |
| UNIX Systems Administrator                      Mountain View, CA, USA |
| Making life hard for others since 1977.                  PGP: 4BD6C0CB |
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
Glad you got it back! Yes, when I was first playing with ZFS, I noticed that booting between single and multi user mode could make the pools invisible. Import seemed to bring them back...

So, is the disk toast, or can you still read anything from it (part table, etc.)?

-Joe

Jeremy Chadwick wrote:
> On Fri, Jan 25, 2008 at 05:00:54PM -0800, Jeremy Chadwick wrote:
>> icarus# zfs list
>> no datasets available
>>
>> This doesn't bode well, and doesn't make me happy. At all.
>
> Pshew! I was able to get ZFS to start seeing the pool again by doing the following:
>
> (Supposedly zpool import by itself will show you a list of pools which it manages to see...)
>
> icarus# zpool import -f storage
> icarus# df -k /storage
> Filesystem  1024-blocks      Used     Avail Capacity  Mounted on
> storage       957873024 106124032 851748992    11%    /storage
> icarus# zfs list
> NAME      USED  AVAIL  REFER  MOUNTPOINT
> storage   101G   812G   101G  /storage
> icarus# zpool status
>   pool: storage
>  state: ONLINE
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         storage     ONLINE       0     0     0
>           ad8       ONLINE       0     0     0
>           ad10      ONLINE       0     0     0
>
> errors: No known data errors
>
> Back to the drawing board.
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
Jeremy Chadwick wrote:
> Joe,
>
> I wanted to send you a note about something that I'm still in the process of dealing with. The timing couldn't be more ironic.
>
> I decided it would be worthwhile to migrate from my two-disk ZFS stripe with a non-ZFS disk for nightly backups, to a RAIDZ pool of all 3 disks combined (since they're all the same size).
>
> I had another terminal with gstat -I500ms running in it, so I could see overall I/O. All was going well until about the 81GB mark of the copy. gstat started showing 0KB in/out on all the drives, and the rsync was stalled. ^Z did nothing, which is usually a bad sign. :-)
>
> I ssh'd in and did a dmesg (summarised):
>
> ad6: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad6: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
> ad6: WARNING - SET_MULTI taskqueue timeout - completing request directly
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951071
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951327
> ad6: FAILURE - WRITE_DMA timed out LBA=13951071
> ad6: FAILURE - WRITE_DMA timed out LBA=13951327
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951583
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951839
> ad6: FAILURE - WRITE_DMA timed out LBA=13951583
> ad6: FAILURE - WRITE_DMA timed out LBA=13951839
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13952095
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13952351
> g_vfs_done():ad6s1d[WRITE(offset=7142916096, length=131072)]error = 5
> g_vfs_done():ad6s1d[WRITE(offset=7143047168, length=131072)]error = 5
> g_vfs_done():ad6s1d[WRITE(offset=7143178240, length=131072)]error = 5
> g_vfs_done():ad6s1d[WRITE(offset=7143309312, length=131072)]error = 5
> g_vfs_done():ad6s1d[WRITE(offset=7143440384, length=131072)]error = 5
>
> It appears my /dev/ad6 (a Seagate -- more irony) must have some bad blocks.
>
> Actually, after letting things go for a while, I realised the box just locked up. Probably kernel panic'd due to the I/O problem. I'll have to poke at SMART stats later to see what showed up.

Wow, pretty crazy! Hmm, and yes, those LBAs do look close together. Well, let me know how the smartctl output looks. I'd be curious if your bad sector count rises. I had noticed that 1

BTW, I tried:

crater# dd if=/dev/ad1s4 of=/dev/null bs=64k
^C1408596+0 records in
1408596+0 records out
92313747456 bytes transferred in 1415.324362 secs (65224446 bytes/sec)

(I let it go for 92GB or so) - no messages about ad1. So I wonder if this points at either the cable connector on ad0 or the drive itself. I guess I'd rather have a failing drive than motherboard... I originally was wondering if somehow something peculiar about ZFS's disk access pattern was making it happen...

Thanks for the recommendations. I'll keep an eye on it, and I'll let you know what a cable change does for me. Still, I have not had any ad0 messages since this morning (I haven't been using the system today much, but maybe the cron processes are more likely to trigger it...)

-Joe
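As a sanity check on the dd run in this message, here is a minimal sketch; `dd_bytes` is a hypothetical helper name.

```sh
#!/bin/sh
# dd_bytes RECORDS BLOCKSIZE: total bytes for a run of full-sized records.
dd_bytes() {
    echo $(( $1 * $2 ))
}

dd_bytes 1408596 65536             # 92313747456, as dd reported (bs=64k)
# Reported transfer rate in whole MiB/s (integer division).
echo $(( 65224446 / 1048576 ))     # 62
```

Roughly 62 MiB/s sustained over 92 GB with no kernel messages suggests ad1 and its channel read cleanly. It is a read-only test of a different drive, though, so it says little about the WRITE_DMA timeouts on ad0.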
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
On Jan 25, 2008, at 1:05 PM, Thomas Hurst wrote:
>> These numbers are quite worrisome -- they should be zero or nearly so in a healthy drive.
>
> No, these are perfectly reasonable for a Seagate. I have about 12 7200.X's and all show the same sort of behavior. If they're nearly zero it's probably a sign your manufacturer isn't actually counting them (marketroids hate accurate SMART readings).
>
> Try graphing them as counters; with an idle disk you'll see periodic sawtooth patterns as the heads crawl from one side of the disk to the other.

SMART attributes which end with _Ct or _Count are supposed to increment with every event; things which end with _Rate (i.e., Raw_Read_Error_Rate, Seek_Error_Rate) are supposed to indicate the frequency of such errors over time. It would be reasonable for Hardware_ECC_Recovered to keep the incremental count, but not the other two.

I agree that minor periodic errors happen over time and are not a great concern, but a happy drive will show zero reallocated sectors, or perhaps a few over the span of a year or two, and will have an ECC recovered or UDMA_CRC count which is much smaller than was reported by Joe.

YMMV, of course...

-- 
-Chuck
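Since the disagreement here is about what RAW_VALUE means, one check that avoids vendor encodings is to compare the normalised VALUE against THRESH: by SMART convention an attribute has failed once VALUE drops to or below THRESH. A minimal sketch follows, assuming the usual smartctl attribute-table column order; `smart_check` is a made-up name, and the sample rows are Joe's three attributes from this thread.

```sh
#!/bin/sh
# smart_check: read smartctl attribute rows on stdin and compare the
# normalised VALUE (column 4) with THRESH (column 6), ignoring RAW_VALUE.
smart_check() {
    awk '
        NF >= 10 && $1 ~ /^[0-9]+$/ {
            value = $4 + 0
            thresh = $6 + 0
            status = (value <= thresh) ? "FAILING" : "ok"
            printf "%s %s value=%d thresh=%d %s\n", $1, $2, value, thresh, status
        }'
}

smart_check <<'EOF'
  1 Raw_Read_Error_Rate     0x000f   114   071   006    Pre-fail  Always       -       82422948
  7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always       -       286126605
195 Hardware_ECC_Recovered  0x001a   063   046   000    Old_age   Always       -       166181300
EOF
```

All three rows come out as ok, which is why smartctl still reports PASSED for this drive. That does not settle whether the thresholds themselves are too lenient, which is a separate complaint raised elsewhere in the thread.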
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
On Jan 25, 2008, at 11:24 AM, Joe Peterson wrote:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   114   071   006    Pre-fail  Always       -       82422948
[ ... ]
>   7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always       -       286126605
[ ... ]
> 195 Hardware_ECC_Recovered  0x001a   063   046   000    Old_age   Always       -       166181300

These numbers are quite worrisome -- they should be zero or nearly so in a healthy drive.

-- 
-Chuck
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
On Fri, Jan 25, 2008 at 06:42:04PM +0100, Julian H. Stacey wrote:
> Jeremy Chadwick wrote:
>>> wondering if this is a known issue. Note that smartctl does not report errors logged and gives a PASSED to the drive. I am running at UDMA100 ATA. Also, if it matters, I am using ZFS.
>>
>> Can you please provide output of the following:
>>
>> * smartctl -a /dev/ad0
>
> From ports/sysutils/smartmontools I presume? (Asking as I also have a DMA prob. to solve, at present needing hw.ata.ata_dma=0 in /boot/loader.conf to boot; interruptions on sound on 7-stable, though no ZFS here.)

Yep! smartctl comes with ports/sysutils/smartmontools.

-- 
| Jeremy Chadwick                                    jdc at parodius.com |
| Parodius Networking                           http://www.parodius.com/ |
| UNIX Systems Administrator                      Mountain View, CA, USA |
| Making life hard for others since 1977.                  PGP: 4BD6C0CB |
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
On Fri, Jan 25, 2008 at 08:58:41AM -0700, Joe Peterson wrote:
> I've seen mention of this kind of issue before, but I never saw a solution, except that someone reported that a certain version of 6.x seemed to make it go away - accounts of this problem are a bit vague. I am running 7.0-RC1, and I am seeing the errors periodically, and I am wondering if this is a known issue. Note that smartctl does not report errors logged and gives a PASSED to the drive. I am running at UDMA100 ATA. Also, if it matters, I am using ZFS.

What you've shown is usually the sign of a disk-related problem. It's very obvious when it's just one disk reporting DMA errors. You use ZFS, so chances are you have more than one disk in a pool/volume -- there's no indication ad1, ad4, ad6, etc. are failing, so this seems to indicate something specific to ad0.

Manufacturers pick very passive (non-aggressive) thresholds for error conditions on disks, so disks which are failing very commonly show PASSED during SMART analysis. To make matters worse, most users I know read SMART stats incorrectly (they're easy to misinterpret).

Can you please provide output of the following:

* smartctl -a /dev/ad0
* atacontrol cap ad0
* atacontrol info ata0, ata1, etc. -- any controller used by ZFS
* Relevant dmesg output that indicates what kind of ATA controller these disks are attached to. Start with output from 'ad0:' and work backwards. For example, ad0 on this machine is using an Intel ICH6 controller:

atapci0: Intel ICH6 SATA150 controller port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f at device 31.2 on pci0
ata0: ATA channel 0 on atapci0
ad0: 238475MB WDC WD2500KS-00MJB0 02.01C03 at ata0-master SATA150

Other stuff:

SMART stats which are labelled Offline are only updated when a short or long offline test is performed. Have you tried using smartctl -t short /dev/ad0 and smartctl -t long /dev/ad0 to see if any of the raw values in the far right column increment?

Have you tried using zpool scrub on the ZFS pool, then zpool status to see if the READ/WRITE/CKSUM counters increment or if the scrub line states there were errors?

Other things which have fixed problems in the past for others:

* BIOS updates
* Change of motherboards (sometimes replacing the board with the same model, other times going with a completely different vendor, which implies weird implementation issues or BIOS problems)
* Changing SATA cables
* Getting a larger power supply (usually when lots of disks are involved)

-- 
| Jeremy Chadwick                                    jdc at parodius.com |
| Parodius Networking                           http://www.parodius.com/ |
| UNIX Systems Administrator                      Mountain View, CA, USA |
| Making life hard for others since 1977.                  PGP: 4BD6C0CB |
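Checking whether any raw value moved across a self-test is easier with two saved snapshots than by eye. Below is a minimal sketch, assuming the attribute table was saved with `smartctl -A` before and after the test; `smart_diff` and the file names are hypothetical.

```sh
#!/bin/sh
# smart_diff BEFORE AFTER: list attributes whose RAW_VALUE (column 10 of
# the smartctl attribute table) differs between two saved snapshots.
smart_diff() {
    awk '
        NF >= 10 && $1 ~ /^[0-9]+$/ {
            if (FILENAME == ARGV[1])
                before[$1] = $10
            else if (($1 in before) && before[$1] != $10)
                print $1, $2, before[$1], "->", $10
        }' "$1" "$2"
}

# Typical use (file names are placeholders):
#   smartctl -A /dev/ad0 > before.txt
#   smartctl -t short /dev/ad0      # wait for the test to finish
#   smartctl -A /dev/ad0 > after.txt
#   smart_diff before.txt after.txt
```

No output means nothing changed. On drives like the Seagates in this thread, expect attributes 1, 7 and 195 to show up in almost every diff, so the rows worth watching are 5, 197 and 198.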
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
On Fri, Jan 25, 2008 at 06:03:33PM -0700, Joe Peterson wrote:
> Wow, pretty crazy! Hmm, and yes, those LBAs do look close together. Well, let me know how the smartctl output looks. I'd be curious if your bad sector count rises.

Absolutely nada on the SMART statistics. Nothing incremented or changed in any way. My short and long tests did not change any of the data in the fields either. Full output is below my .sig.

> BTW, I tried:
>
> crater# dd if=/dev/ad1s4 of=/dev/null bs=64k
> ^C1408596+0 records in
> 1408596+0 records out
> 92313747456 bytes transferred in 1415.324362 secs (65224446 bytes/sec)
>
> (I let it go for 92GB or so) - no messages about ad1. So I wonder if this points at either the cable connector on ad0 or the drive itself. I guess I'd rather have a failing drive than motherboard... I originally was wondering if somehow something peculiar about ZFS's disk access pattern was making it happen...

Since I'm used to dealing with disk issues (at work and personally), I'm left wondering if this is some strange ATA subsystem quirk, or ultimately something with ZFS (your "something peculiar about ZFS's disk access pattern" claim is starting to look more plausible).

This may sound suicidal, but I'm hoping to recreate the scenario somehow, and then punt the details to Soren or Xin Li for further investigation -- if it looks like an ATA subsystem thing, that is.

It is interesting to note that we both have Seagate disks... :-) I'll have to run SeaTools on my disk to see if anything comes back, or run a selective LBA test in smartctl (since the drive supports it).

I've restarted my rsync since, and it's happily chomping away without an issue. If my problem was TRULY a bad block or something causing mechanical lock-up on the disk, I'd have expected my latest rsync to induce it. There's always the chance of some bizarre drive firmware bug too.

> Thanks for the recommendations. I'll keep an eye on it, and I'll let you know what a cable change does for me. Still, I have not had any ad0 messages since this morning (I haven't been using the system today much, but maybe the cron processes are more likely to trigger it...)

Understood. In my case, I *know* the cables are fine, because I built the box itself and migrated to it just a few days ago (change of motherboard, chassis, and addition of SATA hot-swap backplane).

We use the same motherboard (Supermicro PDSMI+) in all of our production servers in our datacenter, and they're rock-solid. I've done hot-swapping without any issue on those systems too, and I've never seen any SATA system issues -- one of the systems is our datacenter backup server, which holds nightly backups for all the other boxes (about 6). Due to the heavy disk I/O that occurs for hours at a time, if this was some weird system quirk, motherboard problem, or SATA bus/cable issue, we would've seen it by now.

FWIW: all our systems, including the backup box, use UFS2 exclusively -- no ZFS in the picture.

-- 
| Jeremy Chadwick                                    jdc at parodius.com |
| Parodius Networking                           http://www.parodius.com/ |
| UNIX Systems Administrator                      Mountain View, CA, USA |
| Making life hard for others since 1977.                  PGP: 4BD6C0CB |

icarus# smartctl -a /dev/ad6
smartctl version 5.37 [i386-portbld-freebsd7.0] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.10 family
Device Model:     ST3500630AS
Serial Number:    9QG1YWNL
Firmware Version: 3.AAE
User Capacity:    500,107,862,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Fri Jan 25 17:10:31 2008 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 430) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
On Fri, Jan 25, 2008 at 12:24:20PM -0700, Joe Peterson wrote:
> In my case, I am using only one disk (ad0) for FreeBSD, and I am only using one partition on this disk in my ZFS pool. So, in this case, unfortunately, it's not possible to tell from the fact that only ad0 is listed that it is specific to this drive.

Ah ha. Well, in your below example, you may only be using one drive for FreeBSD (ad0), but you do have a 2nd drive (ad1) which is installed. I would try doing some I/O on /dev/ad1 to see if you can get the timeouts to occur on that drive as well. You don't have to do anything risky with ad1 either: dd if=/dev/ad1 of=/dev/null bs=64k would probably suffice.

> Yep, I am also always skeptical of SMART reports. That's one reason I am very interested in ZFS. I don't trust the drive to be completely reliable, and the fact that ZFS does end-to-end data integrity is very intriguing.

I agree entirely -- and I also use ZFS myself (across two drives in a RAID0-like fashion, with a completely separate drive which is used for nightly backups of the ZFS pool). I'm absolutely thrilled with it; finally something clean, reliable, and simple -- something I've always wanted in a LVM or LVM-like implementation.

>> * smartctl -a /dev/ad0
>
> OK, I've attached this to the end of this email.
>
> atapci0: Intel ICH4 UDMA100 controller port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f at device 31.1 on pci0
> ata0: ATA channel 0 on atapci0
> ata0: [ITHREAD]
> ad0: 476940MB Seagate ST3500630A 3.AAE at ata0-master UDMA100

The smartctl output for /dev/ad0 looks good, minus the one uncorrected sector. I'm ignoring that since it's proof that the drive knew of it and remapped it. If that number starts incrementing over time, though, replace the drive ASAP, of course.

The atacontrol cap output looks fine too; nothing wonky, and the LBA capabilities look fine.

The controller is nothing out-of-the-ordinary; it's reliable under FreeBSD (I've had many a motherboard which used it). Of course I haven't used an ICH4 since FreeBSD 3.x, and the ATA layer has changed substantially, numerous times.

{regarding -t short and -t long}
> Also, none of the numbers that were zero incremented, esp:
>
> 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
>
> Also, no more errors were reported in the system log during the self-tests.

That seems to indicate that the drive considers itself healthy.

Another test I could recommend at this point would be one that would require a few hours of downtime: download Seagate's SeaTools (will require a CD burner or floppies) and consider doing both quick and long scans. Quick checks some of the stuff we've looked at here, but it also looks at some vendor-specific stuff within the drive. Long will scan every block on the disk for errors (and will not destroy data).

> OK, I started a scrub, and it will take some more time to complete... But I get the following with status. Could this be due to the timeouts and failures? I suspect so, so maybe this is not surprising.

It depends on whether or not you saw more timeouts and cache errors spit out by the kernel while zpool scrub ran. If so, then yes, I would definitely say they're related.

> I'd also guess that this doesn't necessarily point to the drive, but anything in the chain of events... I do not have a mirror or RAID-Z, so I guess the reason there was no data loss (yet) is because the checksum passed, and maybe it just had to retry...?

I'm still new to ZFS myself, so I don't have an answer for you. Your conclusion is the same thing I'd conclude, though.

> I've been using this same motherboard/BIOS for a long time (as well as this drive), so no changes have happened to the HW recently. The BIOS is the newest available, I believe (it's a Tyan Trinity S2099, so it's a few years old).

I'd say the BIOS is probably not responsible at this point; I'd expect other weird things to be going on with the system if the BIOS was broken in some way (or possibly bit rot in the flash). It's going to be difficult to determine if maybe something on the mainboard has decided to start failing (some transistor within the ICH4, etc...) though. :-(

> I'm using regular ATA 80-pin cables. Also, these seem to have been working fine for quite a while now. But, yes, I have also witnessed bad cable issues on older systems in the past. I certainly could try a new cable and see if it helps.

I'd try that for sure. It's just one more thing to rule out.

>> * Getting a larger power supply (usually when lots of disks are involved)
>
> I only have two drives, so I think the PS has enough capacity in my case.

Agreed; even a 350W PSU should handle 2 disks without a problem.

Here's something to ponder:

The LBAs being reported as having errors are scattered all over. They aren't lumped together (usually the sign of part of a platter going bad); instead, they're all over the drive. This would indicate either cable problems,
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
* Chuck Swiger ([EMAIL PROTECTED]) wrote:
> On Jan 25, 2008, at 11:24 AM, Joe Peterson wrote:
>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>>   1 Raw_Read_Error_Rate     0x000f   114   071   006    Pre-fail  Always       -       82422948
> [ ... ]
>>   7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always       -       286126605
> [ ... ]
>> 195 Hardware_ECC_Recovered  0x001a   063   046   000    Old_age   Always       -       166181300
>
> These numbers are quite worrisome -- they should be zero or nearly so in a healthy drive.

No, these are perfectly reasonable for a Seagate. I have about 12 7200.X's and all show the same sort of behavior. If they're nearly zero it's probably a sign your manufacturer isn't actually counting them (marketroids hate accurate SMART readings).

Try graphing them as counters; with an idle disk you'll see periodic sawtooth patterns as the heads crawl from one side of the disk to the other.

-- 
Thomas 'Freaky' Hurst
http://hur.st/
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
On Fri, Jan 25, 2008 at 12:46:08PM -0800, Chuck Swiger wrote:
> On Jan 25, 2008, at 11:24 AM, Joe Peterson wrote:
> > ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
> >   1 Raw_Read_Error_Rate     0x000f   114   071   006    Pre-fail  Always       -       82422948
> > [ ... ]
> >   7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always       -       286126605
> > [ ... ]
> > 195 Hardware_ECC_Recovered  0x001a   063   046   000    Old_age   Always       -       166181300
>
> These numbers are quite worrisome -- they should be zero or nearly so in a
> healthy drive.

On some drives, yes, but not all drives.

His is a Seagate drive -- Seagate uses some of the bits in the raw data
section for some sort of internal use by the drive firmware. So although the
raw numbers may appear very high, the drive appears to function normally, and
the actual adjusted SMART value (the field under VALUE) doesn't fluctuate.

I have Seagate drives all over the place which exhibit identical stats to the
above. I've included some for comparison below; each listed is on a different
system.

Look at attribute 190 (Temperature Celsius) for an example; I don't think any
drive can reach 773849124C. Or, well, I sure hope not. :-) I believe that, in
the case of attrib. 190, that's why they present a human-readable value in
attribute 194.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |

==SNIP==

ad6: 476940MB Seagate ST3500630AS 3.AAE at ata3-master SATA300

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   112   094   006    Pre-fail  Always       -       221374987
  3 Spin_Up_Time            0x0003   094   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       6
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   082   060   030    Pre-fail  Always       -       29014
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2967
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       9
187 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
189 Unknown_Attribute       0x003a   100   100   000    Old_age   Always       -       0
190 Temperature_Celsius     0x0022   064   050   045    Old_age   Always       -       773849124
194 Temperature_Celsius     0x0022   036   050   000    Old_age   Always       -       36 (Lifetime Min/Max 0/29)
195 Hardware_ECC_Recovered  0x001a   066   059   000    Old_age   Always       -       36458075
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       18
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       18
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x       100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

ad4: 114473MB Seagate ST3120827AS 3.42 at ata2-master SATA150

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   063   052   006    Pre-fail  Always       -       57703728
  3 Spin_Up_Time            0x0003   096   096   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       24
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   082   060   030    Pre-fail  Always       -       169005025
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3536
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       24
194 Temperature_Celsius     0x0022   027   040   000    Old_age   Always       -       27 (Lifetime Min/Max 0/15)
195 Hardware_ECC_Recovered  0x001a   063   052   000    Old_age   Always       -       57703728
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100
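
The remark about attribute 190 can be checked with a little arithmetic. The sketch below is only an illustration: it assumes the raw field is a 48-bit integer and simply splits it into bytes. `split_raw` is a hypothetical helper (not part of smartmontools), and nothing in this thread documents what the upper bytes mean.

```python
def split_raw(raw: int, width: int = 6) -> list[int]:
    """Split a SMART raw value into `width` bytes, most significant first.

    Assumption: the raw field is 48 bits (6 bytes) wide.
    """
    return list(raw.to_bytes(width, "big"))


# Raw value reported for attribute 190 on ad6 in the listing above.
raw_190 = 773849124
raw_bytes = split_raw(raw_190)

print(hex(raw_190))      # 0x2e200024
print(raw_bytes)         # [0, 0, 46, 32, 0, 36]
print(raw_bytes[-1])     # lowest byte: 36
```

For the ad6 value, the lowest byte comes out as 36, the same number attribute 194 reports as its raw value on that drive. That is consistent with the firmware packing other data into the upper bytes, though it does not tell us what those bytes represent.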
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
Jeremy Chadwick wrote:
> What you've shown is usually the sign of a disk-related problem. It's
> very obvious when it's just one disk reporting DMA errors. You use ZFS,
> so chances are you have more than one disk in a pool/volume -- there's no
> indication ad1, ad4, ad6, etc. are failing, so this seems to indicate
> something specific to ad0.

Jeremy, thanks for the response - I have tried to answer all of your
questions below...

In my case, I am using only one disk (ad0) for FreeBSD, and I am only using
one partition on this disk in my ZFS pool. So, unfortunately, the fact that
only ad0 is listed does not tell us whether the problem is specific to this
drive.

> Manufacturers pick very passive (non-aggressive) thresholds for error
> conditions on disks, so disks which are failing very commonly show PASSED
> during SMART analysis. To make matters worse, most users I know read
> SMART stats incorrectly (they're easy to misinterpret).

Yep, I am also always skeptical of SMART reports. That's one reason I am
very interested in ZFS. I don't trust the drive to be completely reliable,
and the fact that ZFS does end-to-end data integrity is very intriguing.

> Can you please provide output of the following:
>
> * smartctl -a /dev/ad0

OK, I've attached this to the end of this email.

> * atacontrol cap ad0

Protocol              ATA/ATAPI revision 7
device model          ST3500630A
serial number         9QG0DG03
firmware revision     3.AAE
cylinders             16383
heads                 16
sectors/track         63
lba supported         268435455 sectors
lba48 supported       976773168 sectors
dma supported
overlap not supported

Feature                        Support  Enable  Value          Vendor
write cache                    yes      yes
read ahead                     yes      yes
Tagged Command Queuing (TCQ)   no       no      0/0x00
SMART                          yes      yes
microcode download             yes      yes
security                       yes      no
power management               yes      yes
advanced power management      no       no      65278/0xFEFE
automatic acoustic management  no       no      0/0x00         208/0xD0

> * atacontrol info ata0, ata1, etc. -- any controller used by ZFS

Master:  ad0 ST3500630A/3.AAE ATA/ATAPI revision 7
Slave:   ad1 ST3160812A/3.AAH ATA/ATAPI revision 7

(but note that ad1 is not used by FreeBSD)

> * Relevant dmesg output that indicates what kind of ATA controller these
>   disks are attached to. Start with output from 'ad0:' and work
>   backwards. For example, ad0 on this machine is using an Intel ICH6
>   controller:
>
> atapci0: Intel ICH6 SATA150 controller port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f at device 31.2 on pci0
> ata0: ATA channel 0 on atapci0
> ad0: 238475MB WDC WD2500KS-00MJB0 02.01C03 at ata0-master SATA150

atapci0: Intel ICH4 UDMA100 controller port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f at device 31.1 on pci0
ata0: ATA channel 0 on atapci0
ata0: [ITHREAD]
ad0: 476940MB Seagate ST3500630A 3.AAE at ata0-master UDMA100

> SMART stats which are labelled Offline are only updated when a short or
> long offline test is performed. Have you tried using
> smartctl -t short /dev/ad0 and smartctl -t long /dev/ad0 to see if any of
> the raw values on the far right column increment?

I just tried one:

# 1  Short offline       Completed without error       00%      5252         -
# 2  Short offline       Completed without error       00%      5252         -

Also, none of the numbers that were zero incremented, especially:

198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0

Also, no more errors were reported in the system log during the self-tests.

> Have you tried using zpool scrub on the ZFS pool, then zpool status to
> see if READ/WRITE/CKSUM counters increment or if the scrub line states
> there were errors?

OK, I started a scrub, and it will take some more time to complete... But I
get the following with status. Could this be due to the timeouts and
failures? I suspect so, so maybe this is not surprising. I'd also guess that
this doesn't necessarily point to the drive, but to anything in the chain of
events... I do not have a mirror or RAID-Z, so I guess the reason there was
no data loss (yet) is because the checksum passed, and maybe it just had to
retry...? Anyway, here's the output so far:

  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress, 2.50% done, 1h58m to go
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       1     3     0
          ad0s1d    ONLINE       1     3     0

errors: No known data errors

Other
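
To compare the READ/WRITE/CKSUM counters before and after a scrub, the config rows of `zpool status` can be pulled out mechanically. This is a minimal sketch, assuming the column layout shown in the output above and plain integer counters; `error_counters` and `SAMPLE` are hypothetical names for illustration, and other ZFS versions may print states or abbreviated counts that this pattern does not cover.

```python
import re

# One config row: device name, state, then three integer counters.
# Assumption: counters are plain integers, as in the output quoted above.
ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)\s*$"
)


def error_counters(status_text: str) -> dict[str, tuple[int, int, int]]:
    """Map each pool/device name to its (READ, WRITE, CKSUM) counters."""
    counters = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if match:
            name, _state, read, write, cksum = match.groups()
            counters[name] = (int(read), int(write), int(cksum))
    return counters


# Config section copied from the status output in this message.
SAMPLE = """\
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       1     3     0
          ad0s1d    ONLINE       1     3     0

errors: No known data errors
"""

before = error_counters(SAMPLE)
for name, (read, write, cksum) in before.items():
    print(f"{name}: read={read} write={write} cksum={cksum}")
```

Running the same function on the status output once the scrub finishes and comparing the two dictionaries would show whether any counter has grown. Here the CKSUM column is 0 for both rows, which fits the "No known data errors" line.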
Re: ad0: TIMEOUT - WRITE_DMA type errors with 7.0-RC1
Chuck Swiger wrote:
> On Jan 25, 2008, at 11:24 AM, Joe Peterson wrote:
> > ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
> >   1 Raw_Read_Error_Rate     0x000f   114   071   006    Pre-fail  Always       -       82422948
> > [ ... ]
> >   7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always       -       286126605
> > [ ... ]
> > 195 Hardware_ECC_Recovered  0x001a   063   046   000    Old_age   Always       -       166181300
>
> These numbers are quite worrisome -- they should be zero or nearly so in a
> healthy drive.

It seems to depend on the drive manufacturer. E.g., this is a Seagate. Every
Seagate I've ever had (or heard about on the web via smartctl dumps) reports
very large numbers for these values. I've heard it described that Seagate
shows you the raw numbers (and correctable errors do happen all the time in
all drives). In Western Digital drives (IIRC), the numbers shown are the ones
that *should* be zero, thereby hiding the low-level errors.

Hard to say if my numbers are too high, but these corrected error counts are
always frighteningly high in Seagates.

-Joe
_______________________________________________
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]
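
Since the raw counts are hard to compare across vendors, one way to read the three attributes quoted above is to look at the normalized VALUE against THRESH instead. The sketch below is an illustration only: it assumes the whitespace-separated column order of the smartctl attribute table, and it uses the usual smartmontools convention that an attribute counts as failed when VALUE is less than or equal to THRESH. `Attr`, `parse_attr` and `margin` are hypothetical helpers.

```python
from typing import NamedTuple


class Attr(NamedTuple):
    ident: int
    name: str
    value: int    # normalized VALUE column
    worst: int    # normalized WORST column
    thresh: int   # vendor THRESH column
    raw: str      # RAW_VALUE column, kept as text (vendor-specific)


def parse_attr(line: str) -> Attr:
    """Parse one attribute row; assumes the 10-column smartctl layout."""
    f = line.split(None, 9)
    return Attr(int(f[0]), f[1], int(f[3]), int(f[4]), int(f[5]), f[9])


def margin(attr: Attr) -> int:
    """Distance between the normalized value and its failure threshold."""
    return attr.value - attr.thresh


# The three rows quoted at the top of this message.
ROWS = [
    "  1 Raw_Read_Error_Rate     0x000f   114   071   006    Pre-fail  Always       -       82422948",
    "  7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always       -       286126605",
    "195 Hardware_ECC_Recovered  0x001a   063   046   000    Old_age   Always       -       166181300",
]

attrs = [parse_attr(row) for row in ROWS]
failed = [a.name for a in attrs if a.value <= a.thresh]

for a in attrs:
    print(f"{a.name}: value={a.value} thresh={a.thresh} margin={margin(a)}")
print("failed:", failed)
```

By this measure none of the three attributes is at or below its threshold, despite the large raw numbers. As noted earlier in the thread, vendor thresholds are conservative, so a comfortable margin is weak evidence rather than a clean bill of health.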