On 2017-04-26 02:13, Qu Wenruo wrote:
> At 04/26/2017 01:58 AM, Goffredo Baroncelli wrote:
>> Hi Qu,
>>
>> I tested these two patches on top of 4.10.12; however, when I
>> corrupt disk1, it seems that BTRFS is still unable to rebuild
>> parity.
>>
>> Because in the past the patch set V4 was composed of 5 patches
>> and this one (V5) is composed of only 2 patches, are these 2
>> sufficient to solve all known bugs of raid56? Or do I have to
>> cherry-pick the other ones?
>>
>> BR
>> G.Baroncelli
>
> These 2 patches are for David to merge.
>
> The remaining 3 patches were already reviewed by Liu Bo and need no
> further modification.
>
> To test the full recovery ability, please try the latest mainline
> master, which includes not only my RAID56 scrub patches but also
> patches from Liu Bo that make scrub recovery work correctly.

You are right; I tested 4.11-rc8 and it is able to detect and correct
the defects.

BR
G.Baroncelli

> Thanks,
> Qu
>
>> On 2017-04-14 02:35, Qu Wenruo wrote:
>>> When scrubbing a RAID5 which has recoverable data corruption
>>> (only one data stripe is corrupted), sometimes scrub will report
>>> more csum errors than expected. Sometimes even an unrecoverable
>>> error will be reported.
>>>
>>> The problem can be easily reproduced by the following steps:
>>> 1) Create a btrfs with RAID5 data profile with 3 devs
>>> 2) Mount it with nospace_cache or space_cache=v2
>>>    to avoid extra data space usage
>>> 3) Create a 128K file and sync the fs, then unmount it
>>>    Now the 128K file lies at the beginning of the data chunk
>>> 4) Locate the physical bytenr of the data chunk on dev3
>>>    Dev3 holds the 1st data stripe.
>>> 5) Corrupt the first 64K of the data chunk stripe on dev3
>>> 6) Mount the fs and scrub it
>>>
>>> The correct csum error number should be 16 (assuming x86_64,
>>> i.e. a 4K page size: 64K of corrupted data / 4K per checksummed
>>> block = 16). A larger csum error number is reported with about a
>>> 1/3 chance, and an unrecoverable error with about a 1/10 chance.
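For step 5, a minimal stand-alone corruption helper could look like the
sketch below. It is hypothetical (not part of the original thread): it
assumes the physical bytenr of the data stripe on dev3 has already been
located, e.g. with btrfs-map-logical from btrfs-progs, and that the fs
is unmounted while the device is overwritten.

/* corrupt64k.c - overwrite 64K at a given physical offset on a device */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char buf[64 * 1024];
	off_t offset;
	int fd;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <device> <physical offset>\n",
			argv[0]);
		return 1;
	}
	offset = strtoull(argv[2], NULL, 0);

	fd = open(argv[1], O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	memset(buf, 0xaa, sizeof(buf));	/* any non-matching pattern will do */
	if (pwrite(fd, buf, sizeof(buf), offset) != (ssize_t)sizeof(buf)) {
		perror("pwrite");
		return 1;
	}
	fsync(fd);
	close(fd);
	return 0;
}

After remounting, "btrfs scrub start" should then report exactly the 16
expected csum errors once the race described below is fixed.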
>>> The root cause of the problem is that the RAID5/6 recovery code
>>> has a race condition, due to the fact that a full scrub is
>>> initiated per device.
>>>
>>> For other, mirror-based profiles, each mirror is independent of
>>> the others, so the race won't cause any big problem.
>>>
>>> For example:
>>>
>>>     Corrupted           |       Correct         |      Correct
>>>  Scrub dev3 (D1)        |  Scrub dev2 (D2)      |  Scrub dev1 (P)
>>> ---------------------------------------------------------------------
>>> Read out D1             |Read out D2            |Read full stripe
>>> Check csum              |Check csum             |Check parity
>>> Csum mismatch           |Csum match, continue   |Parity mismatch
>>> handle_errored_block    |                       |handle_errored_block
>>>  Read out full stripe   |                       | Read out full stripe
>>>  D1 csum error (err++)  |                       | D1 csum error (err++)
>>>  Recover D1             |                       | Recover D1
>>>
>>> So D1's csum error is accounted twice, just because
>>> handle_errored_block() doesn't have enough protection, and the
>>> race can happen.
>>>
>>> In an even worse case, D1's recovery code may be re-writing
>>> D1/D2/P while P's recovery code is just reading out the full
>>> stripe; then we can cause an unrecoverable error.
>>>
>>> This patch uses the previously introduced lock_full_stripe() and
>>> unlock_full_stripe() to protect the whole
>>> scrub_handle_errored_block() function for RAID56 recovery, so
>>> there are no extra csum errors nor unrecoverable errors.
>>>
>>> Reported-by: Goffredo Baroncelli <kreij...@libero.it>
>>> Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com>
>>> ---
>>>  fs/btrfs/scrub.c | 22 ++++++++++++++++++++++
>>>  1 file changed, 22 insertions(+)
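The lock_full_stripe()/unlock_full_stripe() helpers used below are
introduced by an earlier patch in this series. As a rough mental model
only (a sketch with hypothetical names, not the implementation actually
merged), lock_full_stripe() can be pictured as taking a per-full-stripe
mutex looked up by the bytenr at which the full stripe starts:

/* Sketch only -- hypothetical names, not the code merged upstream. */
struct full_stripe_lock {
	struct rb_node node;	/* indexed in an rb-tree of lock records */
	u64 logical;		/* full stripe start bytenr, the key */
	u64 refs;		/* holders + waiters; record freed at zero */
	struct mutex mutex;	/* serializes recovery of this stripe */
};

static int lock_full_stripe(struct btrfs_fs_info *fs_info, u64 logical,
			    bool *locked)
{
	struct full_stripe_lock *fstripe_lock;
	u64 fstripe_start;

	*locked = false;

	/* Only RAID56 block groups need full stripe locking. */
	if (!logical_is_raid56(fs_info, logical))	/* hypothetical */
		return 0;

	/* Round logical down to the start of its full stripe. */
	fstripe_start = get_full_stripe_start(fs_info, logical);

	/* Find or insert the lock record, bumping its ref count. */
	fstripe_lock = insert_full_stripe_lock(fs_info, fstripe_start);
	if (IS_ERR(fstripe_lock))
		return PTR_ERR(fstripe_lock);

	/*
	 * Every scrub thread repairing anything inside this full stripe
	 * serializes here, so one thread's read-check-rewrite can never
	 * interleave with another thread's recovery of the same stripe.
	 */
	mutex_lock(&fstripe_lock->mutex);
	*locked = true;
	return 0;
}

unlock_full_stripe() would do the reverse: drop the mutex, decrement the
reference count, and free the record when the count reaches zero. The
patch itself follows.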
>>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>>> index 980deb8aae47..7d8ce57fd08a 100644
>>> --- a/fs/btrfs/scrub.c
>>> +++ b/fs/btrfs/scrub.c
>>> @@ -1113,6 +1113,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
>>>  	int mirror_index;
>>>  	int page_num;
>>>  	int success;
>>> +	bool full_stripe_locked;
>>>  	static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
>>>  				      DEFAULT_RATELIMIT_BURST);
>>> @@ -1138,6 +1139,24 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
>>>  	have_csum = sblock_to_check->pagev[0]->have_csum;
>>>  	dev = sblock_to_check->pagev[0]->dev;
>>>
>>> +	/*
>>> +	 * For RAID5/6, a race can happen between different dev scrub threads.
>>> +	 * On data corruption, the Parity and Data threads will both try
>>> +	 * to recover the data.
>>> +	 * The race can lead to a doubly added csum error, or even an
>>> +	 * unrecoverable error.
>>> +	 */
>>> +	ret = lock_full_stripe(fs_info, logical, &full_stripe_locked);
>>> +	if (ret < 0) {
>>> +		spin_lock(&sctx->stat_lock);
>>> +		if (ret == -ENOMEM)
>>> +			sctx->stat.malloc_errors++;
>>> +		sctx->stat.read_errors++;
>>> +		sctx->stat.uncorrectable_errors++;
>>> +		spin_unlock(&sctx->stat_lock);
>>> +		return ret;
>>> +	}
>>> +
>>>  	if (sctx->is_dev_replace && !is_metadata && !have_csum) {
>>>  		sblocks_for_recheck = NULL;
>>>  		goto nodatasum_case;
>>> @@ -1472,6 +1491,9 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
>>>  		kfree(sblocks_for_recheck);
>>>  	}
>>>
>>> +	ret = unlock_full_stripe(fs_info, logical, full_stripe_locked);
>>> +	if (ret < 0)
>>> +		return ret;
>>>  	return 0;
>>>  }

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5