folks, please, keep the chatting going - don't let me stop you, we are all open folks.


[but darn]

ok, thank you very much for the anticipation. For something actually useful, here 
is another thing I shared with MS Storage but not with you folks yet --

we win with real advantages - not lies, not sales, only real know-how.

cheers,
z



----- Original Message ----- 
From: "JZ" <j...@excelsioritsolutions.com>
To: "A Darren Dunham" <ddun...@taos.com>; <zfs-discuss@opensolaris.org>
Sent: Wednesday, January 14, 2009 7:38 PM
Subject: Re: [zfs-discuss] What are the usual suspects in data errors?


> darn, Darren, learning fast!
>
> best,
> z
>
>
> ----- Original Message ----- 
> From: "A Darren Dunham" <ddun...@taos.com>
> To: <zfs-discuss@opensolaris.org>
> Sent: Wednesday, January 14, 2009 6:15 PM
> Subject: Re: [zfs-discuss] What are the usual suspects in data errors?
>
>
>> On Wed, Jan 14, 2009 at 04:39:03PM -0600, Gary Mills wrote:
>>> I realize that any error can occur in a storage subsystem, but most
>>> of these have an extremely low probability.  For this discussion, I'm
>>> interested only in those that do occur occasionally and that are not
>>> catastrophic.
>>
>> What level is "extremely low" here?
>>
>>> Many of those components have their own error checking.  Some have
>>> error correction.  For example, parity checking is done on a SCSI bus,
>>> unless it's specifically disabled.  Do SATA and PATA connections also
>>> do error checking?  Disk sector I/O uses CRC error checking and
>>> correction.  Memory buffers would often be protected by parity memory.
>>> Is there any more that I've missed?
>>
>> Reports suggest that bugs in drive firmware can account for a
>> non-negligible share of errors.
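
To make the thread's point concrete - per-hop checks like link CRCs cannot catch corruption that happens *between* hops (e.g. a firmware bug scribbling on a buffer), which is exactly why ZFS keeps an end-to-end checksum - here is a toy Python sketch. The "hops" and the simulated bug are made up for illustration; this is not how any real driver is structured:

```python
import zlib

payload = b"important filesystem block"
end_to_end = zlib.crc32(payload)          # checksum stored with the data, ZFS-style

# Hop 1: the link CRC protects the wire, and it passes...
link_crc = zlib.crc32(payload)
received = payload
assert zlib.crc32(received) == link_crc

# ...but a (simulated) firmware bug corrupts the buffer *after* the link check.
corrupted = bytearray(received)
corrupted[0] ^= 0xFF
corrupted = bytes(corrupted)

# Hop 2 computes its CRC over the already-corrupted buffer, so it also passes:
# each hop only ever verifies what it received, not what was originally sent.
hop2_crc = zlib.crc32(corrupted)
assert zlib.crc32(corrupted) == hop2_crc

# Only the end-to-end checksum, kept with the data from the start, sees the damage.
assert zlib.crc32(corrupted) != end_to_end
print("per-hop CRCs passed; only the end-to-end checksum caught the corruption")
```

The per-hop checks are still worth having - they localize wire errors - but they compose into no guarantee about the whole path.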
>>
>>> What can go wrong with the disk controller?  A simple seek to the
>>> wrong track is not a problem because the track number is encoded on
>>> the platter.  The controller will simply recalibrate the mechanism and
>>> retry the seek.  If it computes the wrong sector, that would be a
>>> problem.  Does this happen with any frequency?
>>
>> Netapp documents certain rewrite bugs that they've specifically seen.  I
>> would imagine they have good data on the frequency that they see it in
>> the field.
>>
>>> In this case, ZFS
>>> would detect a checksum error and obtain the data from its redundant
>>> copy.
>>
>> Correct.
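
A toy sketch of that self-healing read path - verify the checksum, fall back to a redundant copy, and rewrite the bad copy. The function name and the flat `mirrors` list are hypothetical simplifications, not ZFS internals:

```python
import zlib

def self_healing_read(mirrors, expected_crc):
    """Return a copy whose checksum matches, repairing any copy that doesn't.

    Hypothetical sketch of a ZFS-style mirrored read; `mirrors` is a
    mutable list standing in for the redundant on-disk copies.
    """
    good = None
    bad = []
    for i, copy in enumerate(mirrors):
        if zlib.crc32(copy) == expected_crc:
            good = copy
        else:
            bad.append(i)
    if good is None:
        raise IOError("all copies failed checksum verification")
    for i in bad:               # "self-heal": overwrite the corrupted copies
        mirrors[i] = good
    return good

data = b"block contents"
crc = zlib.crc32(data)
mirrors = [b"garbage!!!", data]         # one mirror holds corrupted data
assert self_healing_read(mirrors, crc) == data
assert mirrors[0] == data               # the corrupted copy was repaired
```

Note the failure mode the thread alludes to: if *every* copy fails verification, all the checksum can do is refuse to return bad data.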
>>
>>> A logic error in ZFS might result in incorrect metadata being written
>>> with valid checksum.  In this case, ZFS might panic on import or might
>>> correct the error.  How is this sort of error prevented?
>>
>> It's very difficult to protect yourself from software bugs with the same
>> piece of software.  You can create assertions that are hopefully simpler
>> and less prone to errors, but they will not catch all bugs.
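
For illustration, here is what such assertions might look like - independent sanity checks on metadata before it is written, deliberately simpler than the code that built the block, so a bug in one is unlikely to be mirrored in the other. The field names and limits are invented for the sketch:

```python
def commit_metadata(block, current_txg):
    """Hypothetical pre-write invariant checks on a metadata block.

    These catch a class of logic errors (metadata that is internally
    inconsistent) but, as noted above, cannot catch every bug.
    """
    assert block["birth_txg"] <= current_txg, "block claims a future txg"
    assert 0 < block["size"] <= 128 * 1024, "implausible block size"
    assert block["level"] >= 0, "negative indirection level"
    return True  # would hand off to the real write path here

ok = commit_metadata({"birth_txg": 41, "size": 4096, "level": 0},
                     current_txg=42)
assert ok
```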
>>
>>> Some errors might result from a loss of power if some ZFS data was
>>> written to a disk cache but never was written to the disk platter.
>>> Again, ZFS might panic on import or might correct the error.  How is
>>> this sort of error prevented?
>>
>> ZFS uses a multi-stage commit.  It relies on the "disk" responding to a
>> request to flush caches to the disk.  If that assumption is correct,
>> then there is no problem in general with power issues.  The disk is
>> consistent both before and after the cache is flushed.
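
The ordering Darren describes can be sketched with a toy disk model (cache lost on power cut, platter durable). `FakeDisk`, `commit_txg`, and the addresses are all made up; the point is only the two flush barriers - new blocks become durable first, then a single root-block write switches the tree:

```python
UBERBLOCK_ADDR = 0

class FakeDisk:
    """Toy disk: writes land in a volatile cache until flush() is called."""
    def __init__(self):
        self.platter = {}   # durable; survives power loss
        self.cache = {}     # volatile; lost on power loss
    def write(self, addr, data):
        self.cache[addr] = data
    def flush(self):        # the "flush caches" request the drive must honor
        self.platter.update(self.cache)
        self.cache.clear()

def commit_txg(disk, new_blocks, new_uberblock):
    # Stage 1: write the new block tree; the old tree is never overwritten,
    # so losing power here leaves the old, still-consistent tree in place.
    for addr, data in new_blocks:
        disk.write(addr, data)
    disk.flush()            # barrier: new tree is durable before the switch
    # Stage 2: one uberblock write atomically makes the new tree the root.
    disk.write(UBERBLOCK_ADDR, new_uberblock)
    disk.flush()            # barrier: the commit itself is durable

disk = FakeDisk()
commit_txg(disk, [(100, b"data"), (101, b"indirect")], b"uberblock-v2")
assert disk.platter[UBERBLOCK_ADDR] == b"uberblock-v2"
```

If the drive lies about the flush (acknowledges it while data is still only in its cache), the barrier is gone and this guarantee evaporates - which is the assumption Darren flags.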
>>
>> -- 
>> Darren
>> _______________________________________________
>> zfs-discuss mailing list
>> zfs-discuss@opensolaris.org
>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

