Carsten Aulbert <carsten.aulbert <at> aei.mpg.de> writes:
>
> Well, I probably need to wade through the paper (and recall Galois field
> theory) before answering this. We did a few tests on a 16-disk RAID-6
> where we wrote data to the RAID, powered the system down, pulled out one
> disk, inserted it into another computer, and corrupted the sector
> checksums of a few sectors (using hdparm's --make-bad-sector option).
> Then we reinserted the disk into the original box, powered it up, and
> ran a volume check, and the controller did indeed find the corrupted
> sectors and repaired them without destroying data on another disk (as
> far as we know and tested).

Note that there are cases of single-disk corruption that are trivially
recoverable (for example, if the corruption affects the P or Q parity
block rather than the data blocks). Maybe that's what you inadvertently
tested? To be sure you are really stress-testing the self-healing
mechanism, overwrite enough contiguous sectors on a single disk to span
at least three stripes.
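To make the data-block case concrete, here is a toy sketch of how RAID-6
can locate a single corrupted block from the P/Q syndromes. This is my own
illustration, not any controller's actual firmware: it assumes one byte per
disk per stripe, the 0x11d field polynomial used by Linux md's RAID-6, and
hypothetical function names (raid6_pq, locate).

```python
# Multiply in GF(2^8) with polynomial 0x11d (the field Linux md RAID-6 uses).
def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return r

# log/antilog tables for the generator g = 2.
EXP, LOG = [0] * 255, [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x = gf_mul(x, 2)

def raid6_pq(data):
    """P (plain XOR) and Q (Reed-Solomon) parity over one byte per disk."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(EXP[i], d)   # Q accumulates g^i * d_i
    return p, q

def locate(data, p_stored, q_stored):
    """Locate a single corrupted block: a data-disk index, 'P', 'Q', or None."""
    p_calc, q_calc = raid6_pq(data)
    dp, dq = p_calc ^ p_stored, q_calc ^ q_stored
    if dp == 0 and dq == 0:
        return None              # stripe is consistent
    if dp == 0:
        return 'Q'               # only Q disagrees: the Q block itself is bad
    if dq == 0:
        return 'P'               # only P disagrees: the P block itself is bad
    z = (LOG[dq] - LOG[dp]) % 255   # dq/dp = g^z pinpoints data disk z
    return z if z < len(data) else 'ambiguous'
```

Corrupting the P or Q block only perturbs one syndrome, which is why that
case is trivial; corrupting a data block perturbs both, and the ratio of
the syndromes identifies which disk it was. XORing dp back into that block
repairs it.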

> For the other point: dual-disk corruption can (to my understanding)
> never be healed by the controller, since there is no redundant
> information left to check against. I don't recall whether we performed
> any tests on that part, but maybe we should, to learn how the
> controller behaves. As a matter of fact, at that point it should just
> start crying out loud and tell me that it cannot recover from that.

The paper explains that the best RAID-6 can do is use probabilistic
methods to distinguish between single- and dual-disk corruption, e.g.
"there is a 95% chance this is single-disk corruption, so I am going to
fix it under that assumption, but there is a 5% chance I will actually
corrupt more data, and I just can't tell which." I wouldn't want to rely
on a RAID controller that takes gambles :-)
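Here is a small self-contained demonstration (my own toy construction,
again assuming one byte per disk and the 0x11d polynomial) of why the
controller can be fooled: corrupt two data disks with the same error
pattern, and the syndromes become indistinguishable from a corrupted Q
block.

```python
# Toy demo: two-disk data corruption masquerading as a bad Q block.
def gf_mul(a, b):
    """Multiply in GF(2^8) with polynomial 0x11d."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return r

def pq(data):
    """P and Q parity bytes over one byte per disk (generator g = 2)."""
    p = q = 0
    g = 1
    for d in data:
        p ^= d
        q ^= gf_mul(g, d)   # Q accumulates g^i * d_i
        g = gf_mul(g, 2)
    return p, q

data = [0x10, 0x20, 0x30, 0x40]
p, q = pq(data)

bad = data[:]
bad[0] ^= 0x07          # corrupt disk 0 ...
bad[1] ^= 0x07          # ... and disk 1 with the same error pattern
p2, q2 = pq(bad)
dp, dq = p2 ^ p, q2 ^ q

# The two errors cancel in the XOR parity but not in Q, so the syndromes
# (dp == 0, dq != 0) look exactly like a corrupted Q block. A "self-healing"
# pass would happily rewrite Q and leave two silently corrupted data blocks.
assert dp == 0 and dq != 0
```

This is the gamble in miniature: the controller sees syndromes consistent
with a cheap, "obvious" single-block fix, and nothing in the P/Q math tells
it that two data blocks are actually wrong.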

-marc


_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
