Re: [zfs-discuss] Cause for data corruption?
> I thought RAIDZ would correct data errors automatically with the parity data.

Right. However, if the data is corrupted while in memory (e.g. on a PC with non-parity memory), there's nothing ZFS can do to detect that. I mean, not even theoretically. The best we could do would be to narrow the window of vulnerability by recomputing the checksum every time we accessed an in-memory object, which would be terribly expensive.

Jeff

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
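[Editor's note: Jeff's point can be illustrated outside ZFS entirely. The following is only a sketch, with sha256sum standing in for ZFS's block checksums: if a byte is corrupted in memory before the checksum is computed, the checksum faithfully matches the already-corrupted data, so no amount of later verification can flag it.]

```shell
# Sketch: a checksum computed *after* in-memory corruption simply
# certifies the corrupted data. 'hellp' is 'hello' with one bad byte.
good=$(printf 'hello zfs' | sha256sum | cut -d' ' -f1)     # taken before the corruption
bad=$(printf 'hellp zfs' | sha256sum | cut -d' ' -f1)      # taken after the corruption
recheck=$(printf 'hellp zfs' | sha256sum | cut -d' ' -f1)  # later verification pass

# The corrupted buffer verifies cleanly against its own checksum...
[ "$bad" = "$recheck" ] && echo "verification passes: corruption invisible"
# ...and only a checksum taken before the corruption could have caught it.
[ "$good" != "$bad" ] && echo "pre-corruption checksum differs: detectable"
```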
Re: [zfs-discuss] Cause for data corruption?
Thanks for your reassuring post, loomy :)

I'm pretty sure the reason for all this is some bad hardware.. But I can't get VTS to work; it looks like it's not supported for this kind of hardware. And in order to run some other stress-test software I would have to connect a monitor, keyboard and DVD-ROM.. which I'm just so sick of doing :) Hopefully I can motivate myself on the weekend. I'll keep you all here updated when I find something.

This message posted from opensolaris.org
Re: [zfs-discuss] Cause for data corruption?
> So I scrubbed the whole pool and it found a lot more corrupted files.

My condolences :)

General questions and comments about ZFS and data corruption:

I thought RAIDZ would correct data errors automatically with the parity data. How wrong am I on that? Perhaps a parity correction was already tried, and there was too much corruption for it to succeed, implying a very significant amount of data corruption?

Assuming the errors are being generated by bad hardware somewhere between the disk and the CPU (inclusive), how could ZFS be configured to handle these errors automatically? Set the copies property to 2, I think. Anything else?
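[Editor's note: the copies suggestion above would look something like the following sketch. 'data/files' is a hypothetical dataset name; 'data' is the poster's pool. Note that copies=2 only applies to blocks written after the property is set, and it cannot help when the corruption happens in RAM before the copies are written out.]

```shell
# Keep two copies of each data block, on top of the RAIDZ parity.
# 'data/files' is a hypothetical dataset name used for illustration.
zfs set copies=2 data/files

# Verify the property took effect:
zfs get copies data/files

# Existing data must be rewritten (e.g. copied in place) before it
# gains the extra redundancy; only new writes get two copies.
```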
Re: [zfs-discuss] Cause for data corruption?
haha, very funny :D

Just the controllers are on a 32-bit PCI bus.. Solaris itself is running 64-bit:

[EMAIL PROTECTED] /var/tmp/ # isainfo
amd64 i386

And besides, a lot of our customers are having serious problems with their Thumpers and ZFS and stuff...
Re: [zfs-discuss] Cause for data corruption?
On Tuesday, 26 February 2008, at 05:59 -0800, Sandro wrote:
> Hey
>
> Thanks for your answers guys.
>
> I'll run VTS to stresstest cpu and memory.
>
> And I just checked the block diagram of my motherboard (Gigabyte M61P-S3).
> It doesn't even have 64bit pci slots.. just standard old 33mhz 32bit pci ..
> and a couple of newer pci-e.
> But my two controllers are both the same vendor / version and are both
> connected to the same pci bus.

Looks like 32 bits & ZFS definitely hurts :D

-- 
Nicolas Szalay
Systems & network administrator

 _    ASCII ribbon campaign
( )   - against HTML email
 X      & vCards
/ \
Re: [zfs-discuss] Cause for data corruption?
Hey

Thanks for your answers guys.

I'll run VTS to stresstest cpu and memory.

And I just checked the block diagram of my motherboard (Gigabyte M61P-S3). It doesn't even have 64-bit PCI slots.. just standard old 33 MHz 32-bit PCI.. and a couple of newer PCI-E. But my two controllers are both the same vendor / version and are both connected to the same PCI bus.
Re: [zfs-discuss] Cause for data corruption?
On Monday, 25 February 2008, at 11:05 -0800, Sandro wrote:
> hi folks

Hi,

> I've been running my fileserver at home with linux for a couple of years and
> last week I finally reinstalled it with solaris 10 u4.
>
> I borrowed a bunch of disks from a friend, copied over all the files,
> reinstalled my fileserver and copied the data back.
>
> Everything went fine, but after a few days now, quite a lot of files got
> corrupted. here's the output:
>
> # zpool status data
>   pool: data
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
>         corruption. Applications may be affected.
> action: Restore the file in question if possible. Otherwise restore the
>         entire pool from backup.
>    see: http://www.sun.com/msg/ZFS-8000-8A
>  scrub: scrub completed with 422 errors on Mon Feb 25 00:32:18 2008
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         data        ONLINE       0     0 5.52K
>           raidz1    ONLINE       0     0 5.52K
>             c0t0d0  ONLINE       0     0 10.72
>             c0t1d0  ONLINE       0     0 4.59K
>             c0t2d0  ONLINE       0     0 5.18K
>             c0t3d0  ONLINE       0     0 9.10K
>             c1t0d0  ONLINE       0     0 7.64K
>             c1t1d0  ONLINE       0     0 3.75K
>             c1t2d0  ONLINE       0     0 4.39K
>             c1t3d0  ONLINE       0     0 6.04K
>
> errors: 388 data errors, use '-v' for a list
>
> Last night I found out about this, it told me there were errors in like 50
> files. So I scrubbed the whole pool and it found a lot more corrupted files.
>
> The temporary system which I used to hold the data while I'm installing
> solaris on my fileserver is running nv build 80 and no errors on there.
>
> What could be the cause of these errors??
> I don't see any hw errors on my disks..
>
> # iostat -En | grep -i error
> c3d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
>         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> c4d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
>         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> c0t0d0  Soft Errors: 574 Hard Errors: 0 Transport Errors: 0
>         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> c1t0d0  Soft Errors: 549 Hard Errors: 0 Transport Errors: 0
>         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> c0t1d0  Soft Errors: 14 Hard Errors: 0 Transport Errors: 0
>         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> c0t2d0  Soft Errors: 549 Hard Errors: 0 Transport Errors: 0
>         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> c0t3d0  Soft Errors: 549 Hard Errors: 0 Transport Errors: 0
>         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> c1t1d0  Soft Errors: 548 Hard Errors: 0 Transport Errors: 0
>         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> c1t2d0  Soft Errors: 14 Hard Errors: 0 Transport Errors: 0
>         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> c1t3d0  Soft Errors: 548 Hard Errors: 0 Transport Errors: 0
>         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
>
> although a lot of soft errors.
> Linux said that one disk had gone bad, but I figured the sata cable was
> somehow broken, so I replaced that before installing solaris. And solaris
> didn't and doesn't see any actual hw errors on the disks, does it?

I had the same symptoms recently. I also thought the disks were dying, but I was wrong. I suspected the RAM; no. In the end it was because I had mixed RAID cards across different PCI buses: two 64-bit buses (no problem with those) and one 32-bit PCI bus, which caused *all* the checksum errors. I kicked out the card on the 32-bit PCI bus and everything worked fine.

Hope it helps,

-- 
Nicolas Szalay
Systems & network administrator

 _    ASCII ribbon campaign
( )   - against HTML email
 X      & vCards
/ \
Re: [zfs-discuss] Cause for data corruption?
My guess is that you have some defective hardware in the system that's causing bit flips in the checksum or the data payload. I'd suggest running some sort of system diagnostics for a few hours to see if you can locate the bad piece of hardware. My suspicion would be your memory or CPU, but that's just a wild guess, based on the number of errors you have and the number of devices it's spread over. Could it be that you have been corrupting data for some time and not known it?

Oh - and I'd also look around based on your disk controller and ensure that there are no newer patches for it, just in case it's one for which there was a known problem (which was worked around in the driver). I *think* there was an issue with at least one or two...

Cheers!
Nathan.

Sandro wrote:
> hi folks
>
> I've been running my fileserver at home with linux for a couple of years and
> last week I finally reinstalled it with solaris 10 u4.
>
> I borrowed a bunch of disks from a friend, copied over all the files,
> reinstalled my fileserver and copied the data back.
>
> Everything went fine, but after a few days now, quite a lot of files got
> corrupted. here's the output:
>
> # zpool status data
>   pool: data
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
>         corruption. Applications may be affected.
> action: Restore the file in question if possible. Otherwise restore the
>         entire pool from backup.
>    see: http://www.sun.com/msg/ZFS-8000-8A
>  scrub: scrub completed with 422 errors on Mon Feb 25 00:32:18 2008
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         data        ONLINE       0     0 5.52K
>           raidz1    ONLINE       0     0 5.52K
>             c0t0d0  ONLINE       0     0 10.72
>             c0t1d0  ONLINE       0     0 4.59K
>             c0t2d0  ONLINE       0     0 5.18K
>             c0t3d0  ONLINE       0     0 9.10K
>             c1t0d0  ONLINE       0     0 7.64K
>             c1t1d0  ONLINE       0     0 3.75K
>             c1t2d0  ONLINE       0     0 4.39K
>             c1t3d0  ONLINE       0     0 6.04K
>
> errors: 388 data errors, use '-v' for a list
>
> Last night I found out about this, it told me there were errors in like 50
> files. So I scrubbed the whole pool and it found a lot more corrupted files.
>
> The temporary system which I used to hold the data while I'm installing
> solaris on my fileserver is running nv build 80 and no errors on there.
>
> What could be the cause of these errors??
> I don't see any hw errors on my disks..
>
> # iostat -En | grep -i error
> c3d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
>         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> c4d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
>         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> c0t0d0  Soft Errors: 574 Hard Errors: 0 Transport Errors: 0
>         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> c1t0d0  Soft Errors: 549 Hard Errors: 0 Transport Errors: 0
>         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> c0t1d0  Soft Errors: 14 Hard Errors: 0 Transport Errors: 0
>         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> c0t2d0  Soft Errors: 549 Hard Errors: 0 Transport Errors: 0
>         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> c0t3d0  Soft Errors: 549 Hard Errors: 0 Transport Errors: 0
>         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> c1t1d0  Soft Errors: 548 Hard Errors: 0 Transport Errors: 0
>         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> c1t2d0  Soft Errors: 14 Hard Errors: 0 Transport Errors: 0
>         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> c1t3d0  Soft Errors: 548 Hard Errors: 0 Transport Errors: 0
>         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
>
> although a lot of soft errors.
> Linux said that one disk had gone bad, but I figured the sata cable was
> somehow broken, so I replaced that before installing solaris. And solaris
> didn't and doesn't see any actual hw errors on the disks, does it?
[zfs-discuss] Cause for data corruption?
hi folks

I've been running my fileserver at home with linux for a couple of years and last week I finally reinstalled it with solaris 10 u4.

I borrowed a bunch of disks from a friend, copied over all the files, reinstalled my fileserver and copied the data back.

Everything went fine, but after a few days now, quite a lot of files got corrupted. Here's the output:

# zpool status data
  pool: data
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed with 422 errors on Mon Feb 25 00:32:18 2008
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0 5.52K
          raidz1    ONLINE       0     0 5.52K
            c0t0d0  ONLINE       0     0 10.72
            c0t1d0  ONLINE       0     0 4.59K
            c0t2d0  ONLINE       0     0 5.18K
            c0t3d0  ONLINE       0     0 9.10K
            c1t0d0  ONLINE       0     0 7.64K
            c1t1d0  ONLINE       0     0 3.75K
            c1t2d0  ONLINE       0     0 4.39K
            c1t3d0  ONLINE       0     0 6.04K

errors: 388 data errors, use '-v' for a list

Last night I found out about this, it told me there were errors in like 50 files. So I scrubbed the whole pool and it found a lot more corrupted files.

The temporary system which I used to hold the data while I'm installing solaris on my fileserver is running nv build 80, and no errors on there.

What could be the cause of these errors?? I don't see any hw errors on my disks..

# iostat -En | grep -i error
c3d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
        Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
c4d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
        Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
c0t0d0  Soft Errors: 574 Hard Errors: 0 Transport Errors: 0
        Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
c1t0d0  Soft Errors: 549 Hard Errors: 0 Transport Errors: 0
        Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
c0t1d0  Soft Errors: 14 Hard Errors: 0 Transport Errors: 0
        Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
c0t2d0  Soft Errors: 549 Hard Errors: 0 Transport Errors: 0
        Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
c0t3d0  Soft Errors: 549 Hard Errors: 0 Transport Errors: 0
        Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
c1t1d0  Soft Errors: 548 Hard Errors: 0 Transport Errors: 0
        Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
c1t2d0  Soft Errors: 14 Hard Errors: 0 Transport Errors: 0
        Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
c1t3d0  Soft Errors: 548 Hard Errors: 0 Transport Errors: 0
        Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0

Although a lot of soft errors. Linux said that one disk had gone bad, but I figured the sata cable was somehow broken, so I replaced that before installing solaris. And solaris didn't and doesn't see any actual hw errors on the disks, does it?
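[Editor's note: the follow-up commands suggested by the `zpool status` output above would look something like this sketch. All are standard zpool subcommands; 'data' is the pool from the output above, and the commands obviously need the live pool to run.]

```shell
# List the individual files affected by the checksum errors
# (the '-v' hinted at in the "errors:" line above):
zpool status -v data

# After the suspect hardware has been found and replaced, reset the
# per-device error counters and re-scrub to confirm the corruption
# has actually stopped:
zpool clear data
zpool scrub data

# Watch the scrub progress; the CKSUM column should now stay at zero:
zpool status data
```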