Re: Adventures in btrfs raid5 disk recovery

Henk Slager Mon, 27 Jun 2016 14:03:28 -0700

On Mon, Jun 27, 2016 at 6:17 PM, Chris Murphy <li...@colorremedies.com> wrote:
> On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn
> <ahferro...@gmail.com> wrote:
>> On 2016-06-25 12:44, Chris Murphy wrote:
>>>
>>> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
>>> <ahferro...@gmail.com> wrote:
>>>
>>>> Well, the obvious major advantage that comes to mind for me to
>>>> checksumming
>>>> parity is that it would let us scrub the parity data itself and verify
>>>> it.
>>>
>>>
>>> OK but hold on. During scrub, it should read data, compute checksums
>>> *and* parity, and compare those to what's on-disk - > EXTENT_CSUM in
>>> the checksum tree, and the parity strip in the chunk tree. And if
>>> parity is wrong, then it should be replaced.
>>
>> Except that's horribly inefficient.  With limited exceptions involving
>> highly situational co-processors, computing a checksum of a parity block is
>> always going to be faster than computing parity for the stripe.  By using
>> that to check parity, we can safely speed up the common case of near zero
>> errors during a scrub by a pretty significant factor.
>
> OK I'm in favor of that. Although somehow md gets away with this by
> computing and checking parity for its scrubs, and still manages to
> keep drives saturated in the process - at least HDDs, I'm not sure how
> it fares on SSDs.


What I read in this thread clarifies the different flavors of errors I
saw when trying btrfs raid5 while corrupting 1 device or just
unexpectedly removing a device and replacing it with a fresh one.
Especially the lack of parity csums I was not aware of and I think
this is really wrong.

Consider a 4 disk btrfs raid10 and a 3 disk btrfs raid5. Both protect
against the loss of 1 device or badblocks on 1 device. In the current
design (unoptimized for performance), raid10 reads from 2 disk and
raid5 as well (as far as I remember) per task/process.
Which pair of strips for raid10 is pseudo-random AFAIK, so one could
get low throughput if some device in the array is older/slower and
that one is picked. From device to fs logical layer is just a simple
function, namely copy, so having the option to keep data in-place
(zerocopy). The data is at least read by the csum check and in case of
failure, the btrfs code picks the alternative strip and corrects etc.

For raid5, assuming it does avoid the parity in principle, it is also
a strip pair and csum check. In case of csum failure, one needs the
parity strip parity calculation. To me, It looks like that the 'fear'
of this calculation has made raid56 as a sort of add-on, instead of a
more integral part.

Looking at raid6 perf test at boot in dmesg, it is 30GByte/s, even
higher than memory bandwidth. So although a calculation is needed in
case data0strip+paritystrip would be used instead of
data0strip+data1strip, I think looking at total cost, it can be
cheaper than spending time on seeks, at least on HDDs. If the parity
calculation is treated in a transparent way, same as copy, then there
is more flexibility in selecting disks (and strips) and enables easier
design and performance optimizations I think.

>> The ideal situation that I'd like to see for scrub WRT parity is:
>> 1. Store checksums for the parity itself.
>> 2. During scrub, if the checksum is good, the parity is good, and we just
>> saved the time of computing the whole parity block.
>> 3. If the checksum is not good, then compute the parity.  If the parity just
>> computed matches what is there already, the checksum is bad and should be
>> rewritten (and we should probably recompute the whole block of checksums
>> it's in), otherwise, the parity was bad, write out the new parity and update
>> the checksum.

This 3rd point: if parity matches but csum is not good, then there is
a btrfs design error or some hardware/CPU/memory problem. Compare with
btrfs raid10: if the copies match but csum wrong, then there is
something fatally wrong. Just the first step, csum check and if wrong,
it would mean you generate the assumed corrupt strip newly from the 3
others. And for 3 disk raid5 from the 2 others, whether it is copying
or paritycalculation.

>> 4. Have an option to skip the csum check on the parity and always compute
>> it.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Adventures in btrfs raid5 disk recovery

Reply via email to