> I could use a little clarification on how these unrecoverable disk errors
> behave -- or maybe a lot, depending on one's point of view.
> 
> So, when one of these "once in around ten (or 100) terabytes read" events
> occurs, my understanding is that a read error is returned by the drive,
> and the corresponding data is lost as far as the drive is concerned.

Yes -- the data being one or more disk blocks.  (You can't lose a smaller
amount of data, from the drive's point of view, since the error correction
code covers the whole block.)

> If my assumptions are correct about how these unrecoverable disk errors
> are manifested, then a "dumb" scrubber will find such errors by simply
> trying to read everything on disk -- no additional checksum is required.
> Without some form of parity or replication, the data is lost, but at
> least somebody will know about it.

Right.  Generally, if you have replication and scrubbing, then you'll also
re-write any data that was found to be unreadable, thus fixing the
problem (and protecting yourself against future loss of the second copy).

> Now it seems to me that without parity/replication, there's not much
> point in doing the scrubbing, because you could just wait for the error
> to be detected when someone tries to read the data for real.  It's
> only if you can repair such an error (before the data is needed) that
> such scrubbing is useful.

Pretty much, though if you're keeping backups, you could recover the
data from backup at this point. Of course, backups could be considered
a form of replication, but most of us in file systems don't think of them
that way.

Anton
 
 
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
