comment below...

Adrian Saul wrote:
> Howdy,
>
> I have at several times had issues with consumer-grade PC hardware and ZFS
> not getting along. The problem is not the disks but the fact that I don't
> have ECC and end-to-end checking on the data path. What is happening is
> that random memory errors and bit flips are written out to disk, and when
> read back again ZFS reports them as checksum failures:
>
>   pool: myth
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
>         corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise restore the
>         entire pool from backup.
>    see: http://www.sun.com/msg/ZFS-8000-8A
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         myth        ONLINE       0     0    48
>           raidz1    ONLINE       0     0    48
>             c7t1d0  ONLINE       0     0     0
>             c7t3d0  ONLINE       0     0     0
>             c6t1d0  ONLINE       0     0     0
>             c6t2d0  ONLINE       0     0     0
>
> errors: Permanent errors have been detected in the following files:
>
>         /myth/tv/1504_20080216203700.mpg
>         /myth/tv/1509_20080217192700.mpg
>
> Note there are no disk errors, just whole-RAID errors. I get the same
> thing on a mirror pool where both sides of the mirror have identical
> errors. All I can assume is that the data was corrupted after the checksum
> was calculated and was flushed to disk like that. In the past it was a
> motherboard capacitor that had popped - but it was enough to generate
> these errors under load.
>
> At any rate, ZFS is doing the right thing by telling me - what I don't
> like is that from that point on I can't convince ZFS to ignore it. The
> data in question is video files - a bit flip here or there won't matter.
> But if ZFS reads the affected block it returns an I/O error, and until I
> restore the file I have no option but to try to make the application skip
> over it. If it were UFS, for example, I would never have known, but ZFS
> makes a point of stopping anything using it - understandably, but
> annoyingly as well.
>
> What I would like to see is an option to ZFS in the style of 'onerror'
> for UFS, i.e. the ability to tell ZFS to join fight club - let what
> doesn't matter truly slide. For example:
>
>     zfs set erroraction=[iofail|log|ignore]
>
> This would default to the current action of "iofail", but in the event
> you wanted to try to recover or repair data, you could set "log" to, say,
> generate an FMA event that there are bad checksums, or "ignore" to get on
> with your day.
>
> As mentioned, I see this mostly as an option to help repair data after
> the issue is identified or fixed. Of course it is data-specific, but if
> the application can allow it or handle it, why should ZFS get in the way?
>
> Just a thought.
>
> Cheers,
> Adrian
>
> PS: And yes, I am now buying some ECC memory.
>
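To make the quoted proposal concrete: per-dataset, it might be used like
this. Note this is entirely hypothetical syntax - 'erroraction' is the
poster's suggestion, not an actual ZFS property.

```shell
# Hypothetical sketch only -- 'erroraction' is the proposed property from
# the message above, not a real ZFS property.
zfs set erroraction=log myth/tv      # log an FMA event, return the data anyway
zfs set erroraction=ignore myth/tv   # return the data, no error reported
zfs set erroraction=iofail myth/tv   # today's behavior: return EIO (default)
```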
I don't recall when this arrived in NV, but the failmode parameter for
storage pools has already been implemented.  From zpool(1m):

     failmode=wait | continue | panic

         Controls the system behavior in the event of catastrophic pool
         failure. This condition is typically a result of a loss of
         connectivity to the underlying storage device(s) or a failure
         of all devices within the pool. The behavior of such an event
         is determined as follows:

         wait        Blocks all I/O access until the device connectivity
                     is recovered and the errors are cleared. This is
                     the default behavior.

         continue    Returns EIO to any new write I/O requests but
                     allows reads to any of the remaining healthy
                     devices. Any write requests that have yet to be
                     committed to disk would be blocked.

         panic       Prints out a message to the console and generates a
                     system crash dump.

 -- richard

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
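For reference, a sketch of how failmode is set and inspected, using the
pool name "myth" from the status output above. Keep in mind that, per the
man page text, failmode governs whole-pool failure behavior rather than
per-file checksum errors, so repairing Adrian's case still means restoring
the files and then clearing and scrubbing the pool:

```shell
# Set the pool-wide behavior on catastrophic pool failure.
zpool set failmode=continue myth

# Inspect the current setting.
zpool get failmode myth

# After restoring the affected files, clear the logged errors and
# re-verify all data in the pool.
zpool clear myth
zpool scrub myth
zpool status -v myth
```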