Carsten Aulbert <carsten.aulbert <at> aei.mpg.de> writes:
> Well, I probably need to wade through the paper (and recall Galois field
> theory) before answering this. We did a few tests in a 16-disk RAID-6
> where we wrote data to the RAID, powered the system down, pulled out one
> disk, inserted it into another computer, and changed the sector checksum
> of a few sectors (using hdparm's --make-bad-sector option). Then we
> reinserted the disk into the original box, powered it up, and ran a
> volume check, and the controller did indeed find the corrupted sector
> and repair it correctly without destroying data on another disk (as far
> as we know and tested).
Note that there are cases of single-disk corruption that are trivially recoverable (for example, if the corruption affects the P or Q parity block rather than a data block). Maybe that is what you inadvertently tested? To be sure you are properly stress-testing the self-healing mechanism, overwrite a run of contiguous sectors spanning at least three stripes on a single disk.

> For the other point: dual-disk corruption can (to my understanding)
> never be healed by the controller since there is no redundant
> information available to check against. I don't recall if we performed
> some tests on that part as well, but maybe we should do that to learn
> how the controller will behave. As a matter of fact, at that point it
> should just start crying out loud and tell me that it cannot recover
> from that.

The paper explains that the best RAID-6 can do is use probabilistic methods to distinguish between single- and dual-disk corruption, e.g. "there is a 95% chance this is single-disk corruption, so I am going to fix it under that assumption, but there is a 5% chance I will actually be corrupting more data; I just can't tell." I wouldn't want to rely on a RAID controller that takes gambles :-)

-marc
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
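For the curious, the single- vs dual-disk distinction falls out of the two RAID-6 syndromes. Below is a minimal illustrative sketch in Python (byte-level GF(2^8) arithmetic; this is a teaching toy, not any controller's actual firmware): with one corrupted data block, the P and Q syndromes satisfy S_Q = g^z * S_P for exactly one drive index z, which locates the bad disk. With two corrupted blocks there is generally no such z, or occasionally a spurious one, which is exactly the gamble discussed above.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11d)."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
    return p

def gf_pow(a, n):
    """Repeated GF(2^8) multiplication: a**n."""
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def pq(data):
    """P = XOR of data bytes; Q = sum of g^i * d_i with generator g = 2."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(gf_pow(2, i), d)
    return p, q

def locate_single_corruption(data, p, q):
    """Index of the single corrupted data byte, or None if there isn't
    exactly one corrupted data disk (clean stripe, bad parity block,
    or >= 2 corrupted disks)."""
    p2, q2 = pq(data)
    sp, sq = p ^ p2, q ^ q2
    if sp == 0 and sq == 0:
        return None          # stripe is consistent
    if sp == 0 or sq == 0:
        return None          # only a parity block differs; data is intact
    # One bad data byte at index z implies sq = g^z * sp; search for z.
    for z in range(len(data)):
        if gf_mul(gf_pow(2, z), sp) == sq:
            return z
    return None              # no single-disk explanation fits

stripe = [0x12, 0x34, 0x56, 0x78]   # one byte per "disk"
p, q = pq(stripe)
stripe[2] ^= 0xFF                   # corrupt disk 2
print(locate_single_corruption(stripe, p, q))  # -> 2
```

The ambiguity the paper points at lives in the final loop: two simultaneous corruptions can, with small probability, still produce syndromes matching some single g^z, and a controller that "repairs" disk z in that case destroys good data.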