On Thu, Mar 30, 2017 at 02:32:48PM +0800, Qu Wenruo wrote: > When scrubbing a RAID5 which has recoverable data corruption (only one > data stripe is corrupted), sometimes scrub will report more csum errors > than expected. Sometimes even unrecoverable error will be reported. > > The problem can be easily reproduced by the following steps: > 1) Create a btrfs with RAID5 data profile with 3 devs > 2) Mount it with nospace_cache or space_cache=v2 > To avoid extra data space usage. > 3) Create a 128K file and sync the fs, unmount it > Now the 128K file lies at the beginning of the data chunk > 4) Locate the physical bytenr of data chunk on dev3 > Dev3 is the 1st data stripe. > 5) Corrupt the first 64K of the data chunk stripe on dev3 > 6) Mount the fs and scrub it > > The correct csum error number should be 16(assuming using x86_64). > Larger csum error number can be reported in a 1/3 chance. > And unrecoverable error can also be reported in a 1/10 chance. > > The root cause of the problem is RAID5/6 recover code has race > condition, due to the fact that full scrub is initiated per device. > > While for other mirror based profiles, each mirror is independent with > each other, so race won't cause any big problem. > > For example: > Corrupted | Correct | Correct | > | Scrub dev3 (D1) | Scrub dev2 (D2) | Scrub dev1(P) | > ------------------------------------------------------------------------ > Read out D1 |Read out D2 |Read full stripe | > Check csum |Check csum |Check parity | > Csum mismatch |Csum match, continue |Parity mismatch | > handle_errored_block | |handle_errored_block | > Read out full stripe | | Read out full stripe| > D1 csum error(err++) | | D1 csum error(err++)| > Recover D1 | | Recover D1 | > > So D1's csum error is accounted twice, just because > handle_errored_block() doesn't have enough protect, and race can happen. > > On even worse case, for example D1's recovery code is re-writing > D1/D2/P, and P's recovery code is just reading out full stripe, then we > can cause unrecoverable error. > > This patch will use previously introduced lock_full_stripe() and > unlock_full_stripe() to protect the whole scrub_handle_errored_block() > function for RAID56 recovery. > So no extra csum error nor unrecoverable error. > > Reported-by: Goffredo Baroncelli <kreij...@libero.it> > Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com> > --- > fs/btrfs/scrub.c | 23 +++++++++++++++++++++++ > 1 file changed, 23 insertions(+) > > diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c > index 5fc99a92b4ff..4bbefc96485d 100644 > --- a/fs/btrfs/scrub.c > +++ b/fs/btrfs/scrub.c > @@ -1109,6 +1109,7 @@ static int scrub_handle_errored_block(struct > scrub_block *sblock_to_check) > int mirror_index; > int page_num; > int success; > + bool full_stripe_locked; > static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL, > DEFAULT_RATELIMIT_BURST); > > @@ -1134,6 +1135,25 @@ static int scrub_handle_errored_block(struct > scrub_block *sblock_to_check) > have_csum = sblock_to_check->pagev[0]->have_csum; > dev = sblock_to_check->pagev[0]->dev; > > + /* > + * For RAID5/6 race can happen for different dev scrub thread. > + * For data corruption, Parity and Data thread will both try > + * to recovery the data. > + * Race can lead to double added csum error, or even unrecoverable > + * error. > + */ > + ret = lock_full_stripe(fs_info, logical, &full_stripe_locked); > + if (ret < 0) { > + spin_lock(&sctx->stat_lock); > + /* Either malloc failure or bg_cache not found */ > + if (ret == -ENOMEM) > + sctx->stat.malloc_errors++; > + else > + sctx->stat.uncorrectable_errors++;
Other places in scrub_handle_errored_block() also set read_errors and uncorrectable_errors, why the above is an exception? I'm fine with putting a bool into lock_full_stripe(), but it's easier to tell whether locking is successful by setting the flag after locking. Thanks, -liubo > + spin_unlock(&sctx->stat_lock); > + return ret; > + } > + > if (sctx->is_dev_replace && !is_metadata && !have_csum) { > sblocks_for_recheck = NULL; > goto nodatasum_case; > @@ -1468,6 +1488,9 @@ static int scrub_handle_errored_block(struct > scrub_block *sblock_to_check) > kfree(sblocks_for_recheck); > } > > + ret = unlock_full_stripe(fs_info, logical, full_stripe_locked); > + if (ret < 0) > + return ret; > return 0; > } > > -- > 2.12.1 > > > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html