On 10/12/2016 07:58 AM, Chris Murphy wrote:
https://btrfs.wiki.kernel.org/index.php/Status
Scrub + RAID56 Unstable will verify but not repair
This doesn't seem quite accurate. It does repair the vast majority of
the time. On scrub, though, there's maybe a 1 in 3 or 1 in 4 chance that a
bad data strip results in: a) the data strip being fixed up from parity,
but b) a wrong recomputation of the replacement parity, c) good parity
being silently overwritten with bad, and d) if parity reconstruction is
needed in the future (e.g. device or sector failure), it results in EIO,
a kind of data loss.
Bad bug. For sure.
But consider the identical scenario with md or LVM raid5, or any
conventional hardware raid5. A scrub check simply reports a mismatch.
It's unknown whether data or parity is bad, so the bad data strip is
propagated upward to user space without error. On a scrub repair, the
data strip is assumed to be good, and good parity is overwritten with
bad.
Totally true.
The original RAID5/6 design only handles missing devices, not rotted bits.
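To make that concrete, here is a minimal sketch of what a conventional
RAID5 scrub can do on one stripe, assuming a 2-data + 1-parity layout.
This is only an illustration, not md or LVM code: parity is the XOR of
the data strips, a "check" can only report that the stripe is
inconsistent, and a "repair" recomputes parity from data strips it
blindly trusts.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NR_DATA         2       /* data strips per stripe (assumed layout) */
#define STRIPE_SIZE     4096

/* Parity of a stripe is just the XOR of its data strips. */
static void compute_parity(uint8_t data[NR_DATA][STRIPE_SIZE],
                           uint8_t parity[STRIPE_SIZE])
{
        memset(parity, 0, STRIPE_SIZE);
        for (int d = 0; d < NR_DATA; d++)
                for (size_t i = 0; i < STRIPE_SIZE; i++)
                        parity[i] ^= data[d][i];
}

/*
 * "check": we can only say the stripe is inconsistent, with no way to
 * tell whether a data strip or the parity strip rotted.
 */
static bool stripe_consistent(uint8_t data[NR_DATA][STRIPE_SIZE],
                              uint8_t parity[STRIPE_SIZE])
{
        uint8_t expect[STRIPE_SIZE];

        compute_parity(data, expect);
        return memcmp(expect, parity, STRIPE_SIZE) == 0;
}

/*
 * "repair": the data strips are assumed to be good, so a rotted data
 * strip silently replaces previously good parity with bad parity.
 */
static void stripe_repair(uint8_t data[NR_DATA][STRIPE_SIZE],
                          uint8_t parity[STRIPE_SIZE])
{
        compute_parity(data, parity);
}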
So while I agree in total that Btrfs raid56 isn't mature or tested
enough to consider it production ready, I think that's because of the
UNKNOWN causes for the problems we've seen with raid56, not the parity
scrub bug, which is, yeah, NOT good, not least because the data
integrity guarantees Btrfs is purported to make are substantially
negated by it. I think the bark is worse than the bite. It is not the
bark we'd like Btrfs to have though, for sure.
The problem with the current btrfs RAID5/6 scrub is that we don't make
full use of the tree and data checksums.
In the ideal situation, btrfs should detect which stripe is corrupted,
and only accept recovered data/parity if the recovered data matches its
checksum.
For example, for a very traditional RAID5 layout like the following:
Disk 1 | Disk 2 | Disk 3 |
-----------------------------------------
Data 1 | Data 2 | Parity |
Scrub should check data stripes 1 and 2 against their checksums first
(a sketch of the whole decision flow follows the case list below).
[All data extents have csum]
1) All csums match
   Good, then check parity.
   1.1) Parity matches
        Nothing wrong at all.
   1.2) Parity mismatches
        Just recalculate parity. The corruption may be in unused data
        space or in the parity itself; either way, recalculating parity
        is good enough.
2) One data stripe's csum mismatches (or the stripe is missing), and
   parity mismatches too
   We only know that one data stripe mismatches; we are not sure whether
   parity is OK.
   Try to recover that data stripe from parity, and recheck its csum.
   2.1) Recovered data stripe matches its csum
        That data stripe was corrupted and parity is OK.
        Recoverable.
   2.2) Recovered data stripe mismatches its csum
        Both that data stripe and the parity are corrupted.
        Unrecoverable.
3) Two data stripes' csums mismatch, no matter whether parity matches
   At least 2 stripes are screwed up; there is no fix anyway.
[Some data extents have no csum (nodatasum)]
4) All existing csums match (or there is no csum at all), and parity
   matches
   Good, nothing to worry about.
5) An existing csum mismatches for one data stripe, and parity mismatches
   Like 2), try to recover that data stripe and re-check its csum.
   5.1) Recovered data stripe matches its csum
        At least we can recover the data covered by csums.
        Corrupted no-csum data is not our concern.
   5.2) Recovered data stripe mismatches its csum
        Screwed up.
6) No csum at all, and parity mismatches
   We are screwed, just like traditional RAID5.
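Here is the sketch referenced above: a self-contained illustration of
that decision flow for the 2-data + 1-parity layout. It is not kernel or
btrfs-progs code; the struct, the toy checksum (a stand-in for the real
btrfs data csum) and every helper name are made up for this example.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NR_DATA         2
#define STRIPE_SIZE     4096

/* Hypothetical in-memory view of one full stripe plus its csum info. */
struct full_stripe {
        uint8_t         data[NR_DATA][STRIPE_SIZE];
        uint8_t         parity[STRIPE_SIZE];
        bool            has_csum[NR_DATA];      /* false for nodatasum */
        uint32_t        csum[NR_DATA];          /* from the csum tree */
};

enum scrub_result {
        STRIPE_GOOD,            /* cases 1.1 and 4: nothing to do */
        PARITY_REWRITTEN,       /* case 1.2: csums fine, parity recomputed */
        DATA_RECOVERED,         /* cases 2.1 and 5.1: rebuilt, csum verified */
        STRIPE_IGNORED,         /* case 6: no csum to arbitrate with */
        UNRECOVERABLE,          /* cases 2.2, 3 and 5.2 */
};

/* Toy checksum, a stand-in for the real btrfs data csum (crc32c). */
static uint32_t toy_csum(const uint8_t *buf)
{
        uint32_t c = 0;

        for (size_t i = 0; i < STRIPE_SIZE; i++)
                c = c * 31 + buf[i];
        return c;
}

static void compute_parity(const struct full_stripe *fs, uint8_t *parity)
{
        memset(parity, 0, STRIPE_SIZE);
        for (int d = 0; d < NR_DATA; d++)
                for (size_t i = 0; i < STRIPE_SIZE; i++)
                        parity[i] ^= fs->data[d][i];
}

static bool parity_ok(const struct full_stripe *fs)
{
        uint8_t expect[STRIPE_SIZE];

        compute_parity(fs, expect);
        return memcmp(expect, fs->parity, STRIPE_SIZE) == 0;
}

/* Rebuild data stripe @bad from parity and the other data stripes (XOR). */
static void recover_data(struct full_stripe *fs, int bad)
{
        memcpy(fs->data[bad], fs->parity, STRIPE_SIZE);
        for (int d = 0; d < NR_DATA; d++)
                if (d != bad)
                        for (size_t i = 0; i < STRIPE_SIZE; i++)
                                fs->data[bad][i] ^= fs->data[d][i];
}

static enum scrub_result scrub_full_stripe(struct full_stripe *fs)
{
        int bad = -1, nr_bad = 0;

        /* Check every data stripe that has a csum against that csum first. */
        for (int d = 0; d < NR_DATA; d++) {
                if (fs->has_csum[d] && toy_csum(fs->data[d]) != fs->csum[d]) {
                        bad = d;
                        nr_bad++;
                }
        }

        if (nr_bad == 0) {
                if (parity_ok(fs))
                        return STRIPE_GOOD;     /* cases 1.1 and 4 */
                /*
                 * Parity mismatches but no csum says any data is bad.
                 * If some data has no csum we cannot tell data rot from
                 * parity rot, so leave the stripe alone (case 6 policy).
                 */
                for (int d = 0; d < NR_DATA; d++)
                        if (!fs->has_csum[d])
                                return STRIPE_IGNORED;
                compute_parity(fs, fs->parity); /* case 1.2 */
                return PARITY_REWRITTEN;
        }
        if (nr_bad > 1)                         /* case 3 */
                return UNRECOVERABLE;

        /* Cases 2 and 5: rebuild the single bad stripe from parity, then
         * re-check its csum before trusting or writing back the result. */
        recover_data(fs, bad);
        if (toy_csum(fs->data[bad]) == fs->csum[bad])
                return DATA_RECOVERED;          /* cases 2.1 and 5.1 */
        return UNRECOVERABLE;                   /* cases 2.2 and 5.2 */
}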
And I'm coding the above cases into btrfs-progs to implement an
off-line scrub tool.
Currently it looks good, and it can already handle cases 1) to 3).
I tend to ignore any full stripe that lacks checksums and whose parity
mismatches.
But as you can see, there are so many things involved in btrfs RAID5
(whether csums exist and match, whether parity matches, missing
devices), and RAID6 will be even more complex. It's already much more
complex than traditional RAID5/6 or the current scrub implementation.
So what the current kernel scrub lacks is:
1) Detection of which stripes are good and which are bad
2) Recheck of recovery attempts (see the sketch below)
But that's also what traditional RAID5/6 lacks, and it cannot do better
unless there is some extra checksum it can use, like the one btrfs has.
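Point 2) is really the crux of the fix: a recovery attempt should only
be written back after the rebuilt data verifies against its csum;
otherwise the scrub should report the error instead of overwriting
anything. Reusing struct full_stripe, toy_csum() and recover_data() from
the sketch above (all made up for illustration), and with
write_back_stripe() as one more hypothetical placeholder, the guard is
just:

#include <errno.h>

/* Hypothetical placeholder: persist data stripe @bad of @fs to disk. */
int write_back_stripe(struct full_stripe *fs, int bad);

static int repair_one_stripe(struct full_stripe *fs, int bad)
{
        recover_data(fs, bad);                  /* rebuild from parity */
        if (toy_csum(fs->data[bad]) != fs->csum[bad])
                return -EIO;                    /* don't make things worse */
        return write_back_stripe(fs, bad);      /* only now is it safe */
}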
Thanks,
Qu