Hi Qu, On 2016-11-15 03:50, Qu Wenruo wrote: > Fix the so-called famous RAID5/6 scrub error. > > Thanks Goffredo Baroncelli for reporting the bug, and make it into our > sight. > (Yes, without the Phoronix report on this, > https://www.phoronix.com/scan.php?page=news_item&px=Btrfs-RAID-56-Is-Bad, > I won't ever be aware of it) > > Unlike many of us(including myself) assumed, it's not a timed bomb buried > deeply into the RAID5/6 code, but a race condition in scrub recovery > code. > > The problem is not found because normal mirror based profiles aren't > affected by the race, since they are independent with each other.
Unfortunately, even with these patches btrfs still fails to rebuild a raid5 array in my tests. To perform the tests I create a filesystem raid5-three disks; then I created a file # python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" >out.txt The filesystem layout is composed by stripes large 128k of data, and 64k of parity. The idea is that the first third of the stripe (64k on disk0) is composed by "adaaa..."; the second third (again 64k on disk1 is composed by "bdbbbb"; and finally the parity (disk0^disk1) is stored on the last portion of the stripe. Doing some calculation it is possible to know where the data is physically stored. I perform 3 tests: 1) I corrupt some bytes of the data stored on disk0; mount the filesystem; run scrub; unmount the filesystem; check all the disks if the bytes of the stripe are corrected 2) I corrupt some bytes of the data stored on disk1; mount the filesystem; run scrub; unmount the filesystem; check all the disks the bytes of the stripe are corrected 3) I corrupt some bytes of the parity stored on disk2; mount the filesystem; run scrub; unmount the filesystem; check all the disks the bytes of the stripe are corrected Before your patches, my all tests fail (not all the times, but very often). After your patches, test1 and test2 still fail. In my test test3 succeded. Enclosed my script which performs the tests (it uses loop device; please run in a VM; firts you have to uncomment the function make_imgs to create the disk image.). Let me know if I can help you providing more information. BR G.Baroncelli > > Although this time the fix doesn't affect the scrub code much, it should > warn us that current scrub code is really hard to maintain. > > Abuse of workquque to delay works and the full fs scrub is race prone. > > Xfstest will follow a little later, as we don't have good enough tools > to corrupt data stripes pinpointly. > > Qu Wenruo (2): > btrfs: scrub: Introduce full stripe lock for RAID56 > btrfs: scrub: Fix RAID56 recovery race condition > > fs/btrfs/ctree.h | 4 ++ > fs/btrfs/extent-tree.c | 3 + > fs/btrfs/scrub.c | 192 > +++++++++++++++++++++++++++++++++++++++++++++++++ > 3 files changed, 199 insertions(+) > -- gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
test-btrfs.sh
Description: Bourne shell script