comment below...

Adrian Saul wrote:
> Howdy,
>
> I have at several times had issues with consumer-grade PC hardware and ZFS
> not getting along. The problem is not the disks but the fact that I don't
> have ECC and end-to-end checking on the data path. What is happening is
> that random memory errors and bit flips are written out to disk, and when
> read back again ZFS reports them as checksum failures:
>
>   pool: myth
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
>         corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise restore the
>         entire pool from backup.
>    see: http://www.sun.com/msg/ZFS-8000-8A
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         myth        ONLINE       0     0    48
>           raidz1    ONLINE       0     0    48
>             c7t1d0  ONLINE       0     0     0
>             c7t3d0  ONLINE       0     0     0
>             c6t1d0  ONLINE       0     0     0
>             c6t2d0  ONLINE       0     0     0
>
> errors: Permanent errors have been detected in the following files:
>
>         /myth/tv/1504_20080216203700.mpg
>         /myth/tv/1509_20080217192700.mpg
>
> Note there are no disk errors, just whole-RAID errors. I get the same
> thing on a mirror pool where both sides of the mirror have identical
> errors. All I can assume is that the data was corrupted after the checksum
> was calculated and was flushed to disk like that. In the past it was a
> motherboard capacitor that had popped - but it was enough to generate
> these errors under load.
>
> At any rate, ZFS is doing the right thing by telling me - what I don't
> like is that from that point on I can't convince ZFS to ignore it. The
> data in question is video files - a bit flip here or there won't matter.
> But if ZFS reads the affected block it returns an I/O error, and until I
> restore the file I have no option but to try to make the application skip
> over it. If it were UFS, for example, I would never have known, but ZFS
> makes a point of stopping anything using it - understandably, but
> annoyingly as well.
>
> What I would like to see is an option to ZFS in the style of 'onerror'
> for UFS, i.e. the ability to tell ZFS to join fight club - let what
> doesn't matter truly slide. For example:
>
>     zfs set erroraction=[iofail|log|ignore]
>
> This would default to the current action of "iofail", but in the event
> you wanted to try to recover or repair data, you could set "log" to, say,
> generate an FMA event that there are bad checksums, or "ignore" to get on
> with your day.
>
> As mentioned, I see this mostly as an option to help repair data after
> the issue is identified or fixed. Of course it is data-specific, but if
> the application can allow it or handle it, why should ZFS get in the way?
>
> Just a thought.
>
> Cheers,
> Adrian
>
> PS: And yes, I am now buying some ECC memory.
>
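To make the quoted proposal concrete: per-dataset, it might be used like
this. Note this is entirely hypothetical syntax - 'erroraction' is the
poster's suggestion, not an actual ZFS property.

```shell
# Hypothetical sketch only -- 'erroraction' is the proposed property from
# the message above, not a real ZFS property.
zfs set erroraction=log myth/tv      # log an FMA event, return the data anyway
zfs set erroraction=ignore myth/tv   # return the data, no error reported
zfs set erroraction=iofail myth/tv   # today's behavior: return EIO (default)
```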
I don't recall when this arrived in NV, but the failmode parameter for
storage pools has already been implemented.  From zpool(1m):

     failmode=wait | continue | panic

         Controls the system behavior in the event of catastrophic pool
         failure. This condition is typically a result of a loss of
         connectivity to the underlying storage device(s) or a failure
         of all devices within the pool. The behavior of such an event
         is determined as follows:

         wait        Blocks all I/O access until the device connectivity
                     is recovered and the errors are cleared. This is
                     the default behavior.

         continue    Returns EIO to any new write I/O requests but
                     allows reads to any of the remaining healthy
                     devices. Any write requests that have yet to be
                     committed to disk would be blocked.

         panic       Prints out a message to the console and generates a
                     system crash dump.

 -- richard

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
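For reference, a sketch of how failmode is set and inspected, using the
pool name "myth" from the status output above. Keep in mind that, per the
man page text, failmode governs whole-pool failure behavior rather than
per-file checksum errors, so repairing Adrian's case still means restoring
the files and then clearing and scrubbing the pool:

```shell
# Set the pool-wide behavior on catastrophic pool failure.
zpool set failmode=continue myth

# Inspect the current setting.
zpool get failmode myth

# After restoring the affected files, clear the logged errors and
# re-verify all data in the pool.
zpool clear myth
zpool scrub myth
zpool status -v myth
```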