At 11/18/2016 02:09 AM, Goffredo Baroncelli wrote:
Hi Qu,

On 2016-11-15 03:50, Qu Wenruo wrote:
Fix the so-called famous RAID5/6 scrub error.

Thanks to Goffredo Baroncelli for reporting the bug and bringing it to our
attention.
(Yes, without the Phoronix report on this,
https://www.phoronix.com/scan.php?page=news_item&px=Btrfs-RAID-56-Is-Bad,
I would never have been aware of it.)

Unlike what many of us (including myself) assumed, it's not a time bomb buried
deep in the RAID5/6 code, but a race condition in the scrub recovery
code.

The problem went unnoticed because normal mirror-based profiles aren't
affected by the race, since the mirrors are independent of each other.

Unfortunately, even with these patches btrfs still fails to rebuild a raid5
array in my tests.
To perform the tests I created a three-disk raid5 filesystem; then I created a
file:

# python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" >out.txt

The filesystem layout is composed of stripes of 128k of data plus 64k of
parity.

The idea is that the first third of the stripe (64k, on disk0) is filled with
"adaaa...";
the second third (again 64k, on disk1) is filled with "bdbbbb..."; and finally the
parity (disk0 ^ disk1) is stored on the last portion of the stripe, on disk2.

Doing some calculation, it is possible to work out where the data is physically
stored.
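
For example, something like the sketch below locates the data (just a sketch:
it assumes the images are attached as /dev/loop0..2 and that the "ad"/"bd"
markers only occur in this file):

for dev in /dev/loop0 /dev/loop1 /dev/loop2; do
        echo "=== $dev ==="
        # -a: treat the device as text, -b: print the byte offset of the match,
        # -o: print only the matching part
        grep -abo "adaaaaaaaa" "$dev" | head -n 1
        grep -abo "bdbbbbbbbb" "$dev" | head -n 1
done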

I performed 3 tests (a sketch of one iteration follows the list):
1) corrupt some bytes of the data stored on disk0; mount the filesystem; run
scrub; unmount the filesystem; check on all the disks whether the bytes of the
stripe were corrected
2) corrupt some bytes of the data stored on disk1; mount the filesystem; run
scrub; unmount the filesystem; check on all the disks whether the bytes of the
stripe were corrected
3) corrupt some bytes of the parity stored on disk2; mount the filesystem; run
scrub; unmount the filesystem; check on all the disks whether the bytes of the
stripe were corrected
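
A rough sketch of one iteration of test 1 (the mount point and the offset
handling are simplified here; the full script is attached):

MNT=/mnt/test
D0_OFF=$1    # physical offset of the data stripe on /dev/loop0, found as above
# corrupt 5 bytes of the data stripe on disk0
dd if=/dev/urandom of=/dev/loop0 bs=1 count=5 seek=$D0_OFF conv=notrunc
mount /dev/loop0 $MNT
btrfs scrub start -B $MNT    # -B: wait until the scrub finishes
umount $MNT
# check whether the corrupted bytes were repaired on disk
dd if=/dev/loop0 bs=1 count=16 skip=$D0_OFF 2>/dev/null | od -c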

Before your patches, all my tests failed (not every time, but very often).
After your patches, test 1 and test 2 still fail. In my tests, test 3 succeeded.

Thanks for your report and script.

I compared my script with yours, and found the difference is that I'm corrupting the whole data stripe (64K, and mounting without the space cache to make sure all the data is in the 1st full stripe), while you only corrupt 5 bytes of it.
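
Roughly, what my script does is something like the sketch below (device names,
the mount point and the offset are assumptions, not the exact script):

D0_OFF=$1    # physical offset of the data stripe on /dev/loop0 (64K aligned)
mkfs.btrfs -f -d raid5 -m raid5 /dev/loop0 /dev/loop1 /dev/loop2
# mount without the space cache so the file data lands in the 1st full stripe
mount -o nospace_cache /dev/loop0 /mnt/test
python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" > /mnt/test/out.txt
sync
umount /mnt/test
# wipe the whole 64K data stripe on the first device, not just a few bytes
dd if=/dev/zero of=/dev/loop0 bs=64K count=1 seek=$((D0_OFF / 65536)) conv=notrunc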

And in that case, btrfs won't report more errors than the expected 1, which means the race fix is working.

But the new btrfs-progs scrub will report that the recovered data stripes are all OK, while P/Q is not correct.
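
This can be double-checked by hand with something like the following sketch
(python 2, like your one-liner; the stripe offsets are assumptions, taken from
the layout described in your mail):

D0_OFF=$1; D1_OFF=$2; P_OFF=$3   # physical offsets of data0, data1 and parity
# dump the two 64K data stripes and the 64K parity from the devices
dd if=/dev/loop0 of=/tmp/d0 bs=64K count=1 skip=$((D0_OFF / 65536)) 2>/dev/null
dd if=/dev/loop1 of=/tmp/d1 bs=64K count=1 skip=$((D1_OFF / 65536)) 2>/dev/null
dd if=/dev/loop2 of=/tmp/p  bs=64K count=1 skip=$((P_OFF / 65536)) 2>/dev/null
# RAID5 parity must be the byte-wise XOR of the two data stripes
python -c "
d0 = open('/tmp/d0', 'rb').read()
d1 = open('/tmp/d1', 'rb').read()
p  = open('/tmp/p',  'rb').read()
calc = ''.join(chr(ord(a) ^ ord(b)) for a, b in zip(d0, d1))
print('parity OK' if calc == p else 'parity MISMATCH')
"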

This seems to be a new bug, unrelated to the RAID5/6 scrub race (whose main symptoms are an incorrect csum error count and sometimes even unrecoverable errors).


I assume it will be easier to fix than the race, but I could be totally wrong until I dig into it further.

Anyway I'll fix it in a new patchset.

Thanks,
Qu


Enclosed is my script which performs the tests (it uses loop devices; please run it in
a VM; first you have to uncomment the function make_imgs to create the disk
images).

Let me know if I can help you providing more information.

BR
G.Baroncelli



Although this time the fix doesn't affect the scrub code much, it should
warn us that the current scrub code is really hard to maintain.

The abuse of workqueues to delay work makes the full-fs scrub race prone.

An xfstest will follow a little later, as we don't yet have good enough tools
to corrupt data stripes with pinpoint accuracy.

Qu Wenruo (2):
  btrfs: scrub: Introduce full stripe lock for RAID56
  btrfs: scrub: Fix RAID56 recovery race condition

 fs/btrfs/ctree.h       |   4 ++
 fs/btrfs/extent-tree.c |   3 +
 fs/btrfs/scrub.c       | 192 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 199 insertions(+)





Attachment: raid56_recover.sh
Description: application/shellscript
