Re: [PATCH 0/2] RAID5/6 scrub race fix
On Fri, Nov 18, 2016 at 07:09:34PM +0100, Goffredo Baroncelli wrote:
> Hi Zygo
>
> On 2016-11-18 00:13, Zygo Blaxell wrote:
> > On Tue, Nov 15, 2016 at 10:50:22AM +0800, Qu Wenruo wrote:
> >> Fix the so-called famous RAID5/6 scrub error.
> >>
> >> Thanks Goffredo Baroncelli for reporting the bug, and make it into our
> >> sight.
> >> (Yes, without the Phoronix report on this,
> >> https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad,
> >> I won't ever be aware of it)
> >
> > If you're hearing about btrfs RAID5 bugs for the first time through
> > Phoronix, then your testing coverage is *clearly* inadequate.
> >
> > Fill up a RAID5 array, start a FS stress test, pull a drive out while
> > that's running, let the FS stress test run for another hour, then try
> > to replace or delete the missing device.  If there are any crashes,
> > corruptions, or EIO during any part of this process (assuming all the
> > remaining disks are healthy), then btrfs RAID5 is still broken, and
> > you've found another bug to fix.
> >
> > The fact that so many problems in btrfs can still be found this way
> > indicates to me that nobody is doing this basic level of testing
> > (or if they are, they're not doing anything about the results).
> [...]
>
> Sorry, but I don't find this kind of discussion useful.  Yes, btrfs
> RAID5/6 needs a lot of care.  Yes, *our* test coverage is far from
> complete; but this is not the fault of a single person, and Qu tried
> to solve one issue, for which we should say only thanks.
>
> Even if you don't find the work of Qu (and my little one :-) )
> valuable, it required some time and deserves to be respected.

I do find this work valuable, and I do thank you and Qu for it.  I've
been following it with great interest because I haven't had time to
dive into it myself.  It's a use case I used before and would like to
use again.
Most of my recent frustration, if directed at anyone, is really directed
at Phoronix for conflating "one bug was fixed" with "ready for production
use today," and I wanted to ensure that the latter rumor was promptly
quashed.

This is why I'm excited about Qu's work: on my list of 7 btrfs-raid5
recovery bugs (6 I found plus yours), Qu has fixed at least 2 of them,
maybe as many as 4, with the patches so far.  I can fix 2 of the others,
for a total of 6 fixed out of 7.  Specifically, the 7 bugs I know of are:

1-2. BUG_ONs in functions that should return errors (I had fixed both
     already when trying to recover my broken arrays)

3.   scrub can't identify which drives or files are corrupted (Qu might
     have fixed this--I won't know until I do testing)

4-6. symptom groups related to wrong data or EIO in scrub recovery,
     including Goffredo's (Qu might have fixed all of these, but from a
     quick read of the patch I think at least two are done).

7.   the write hole.

I'll know more after I've had a chance to run Qu's patches through
testing, which I intend to do at some point.  Optimistically, this means
there could be only *one* bug remaining in the critical path for btrfs
RAID56 single disk failure recovery.

That last bug is the write hole, which is why I keep going on about it.
It's the only bug I know exists in btrfs RAID56 that has neither an
existing fix nor any evidence of someone actively working on it, even at
the design proposal stage.  Please, I'd love to be wrong about this.

When I described the situation recently as "a thin layer of bugs on top
of a design defect", I was not trying to be mean.  I was trying to
describe the situation *precisely*.  The thin layer of bugs is much
thinner thanks to Qu's work, and thanks in part to his work, I now have
confidence that further investment in this area won't be wasted.

> Finally, I don't think that we should compare the RAID-hole with this
> kind of bug(fix).
> The former is a design issue; the latter is a bug related to one of
> the basic features of a RAID system (recovering from the loss of a
> disk or from corruption).
>
> Even the MD subsystem (which is far behind btrfs) tolerated the
> raid-hole until last year.

My frustration on this point is with the attitude that mdadm was ever
good enough, much less a model to emulate in the future.  It's
2016--there have been some advancements in the state of the art since
the IBM patent describing RAID5 30 years ago, yet in the btrfs world,
we seem to insist on repeating all the same mistakes in the same order.

"We're as good as some existing broken-by-design thing" isn't a really
useful attitude.  We should aspire to do *better* than the existing
broken-by-design things.  If we didn't, we wouldn't be here; we'd all
be lurking on some other list, running ext4 or xfs on mdadm or lvm.

> And its solution is far from cheap (basically the MD subsystem writes
> the data first to the journal, then to the disk... which is the kind
> of issue that a COW filesystem would solve).

Journalling isn't required.  It's
Re: [PATCH 0/2] RAID5/6 scrub race fix
Hi Zygo

On 2016-11-18 00:13, Zygo Blaxell wrote:
> On Tue, Nov 15, 2016 at 10:50:22AM +0800, Qu Wenruo wrote:
>> Fix the so-called famous RAID5/6 scrub error.
>>
>> Thanks Goffredo Baroncelli for reporting the bug, and make it into our
>> sight.
>> (Yes, without the Phoronix report on this,
>> https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad,
>> I won't ever be aware of it)
>
> If you're hearing about btrfs RAID5 bugs for the first time through
> Phoronix, then your testing coverage is *clearly* inadequate.
>
> Fill up a RAID5 array, start a FS stress test, pull a drive out while
> that's running, let the FS stress test run for another hour, then try
> to replace or delete the missing device.  If there are any crashes,
> corruptions, or EIO during any part of this process (assuming all the
> remaining disks are healthy), then btrfs RAID5 is still broken, and
> you've found another bug to fix.
>
> The fact that so many problems in btrfs can still be found this way
> indicates to me that nobody is doing this basic level of testing
> (or if they are, they're not doing anything about the results).
[...]

Sorry, but I don't find this kind of discussion useful.  Yes, btrfs
RAID5/6 needs a lot of care.  Yes, *our* test coverage is far from
complete; but this is not the fault of a single person, and Qu tried
to solve one issue, for which we should say only thanks.

Even if you don't find the work of Qu (and my little one :-) )
valuable, it required some time and deserves to be respected.

Finally, I don't think that we should compare the RAID-hole with this
kind of bug(fix).  The former is a design issue; the latter is a bug
related to one of the basic features of a RAID system (recovering from
the loss of a disk or from corruption).

Even the MD subsystem (which is far behind btrfs) tolerated the
raid-hole until last year.  And its solution is far from cheap
(basically the MD subsystem writes the data first to the journal, then
to the disk...
which is the kind of issue that a COW filesystem would solve).

BR
G.Baroncelli

--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
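[Editorial note: the raid-hole the thread keeps referring to is easy to
demonstrate outside any filesystem.  Below is a toy simulation--plain
Python, illustrative only, not btrfs or md code--of a 3-device RAID5
stripe where a crash lands between the data write and the parity write;
the stale parity then silently destroys a chunk that was never even
written.]

```python
# Toy 3-device RAID5 write-hole demo: two 4-byte data chunks plus XOR parity.
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

d0, d1 = b"AAAA", b"BBBB"
parity = xor(d0, d1)                 # consistent stripe: d0 ^ d1 == parity

# RMW update of d0 only: the new data hits the disk, then the machine
# crashes *before* the matching parity update is written.
d0 = b"CCCC"                         # parity on disk is now stale

# Later, the device holding d1 dies; RAID5 reconstructs d1 from the
# surviving chunks -- using the stale parity.
reconstructed_d1 = xor(d0, parity)   # garbage, not b"BBBB"
assert reconstructed_d1 != b"BBBB"   # d1 was never touched, yet it's lost
```

md's journal closes this hole by logging data and parity before the
in-place writes; the point of the COW argument above is that a
filesystem which never overwrites a live stripe in place would not need
the journal at all.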
Re: [PATCH 0/2] RAID5/6 scrub race fix
On 11/18/2016 12:43 PM, Zygo Blaxell wrote: On Fri, Nov 18, 2016 at 10:42:23AM +0800, Qu Wenruo wrote: At 11/18/2016 09:56 AM, Hugo Mills wrote: On Fri, Nov 18, 2016 at 09:19:11AM +0800, Qu Wenruo wrote: At 11/18/2016 07:13 AM, Zygo Blaxell wrote: On Tue, Nov 15, 2016 at 10:50:22AM +0800, Qu Wenruo wrote: Fix the so-called famous RAID5/6 scrub error. Thanks Goffredo Baroncelli for reporting the bug, and make it into our sight. (Yes, without the Phoronix report on this, https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad, I won't ever be aware of it) If you're hearing about btrfs RAID5 bugs for the first time through Phoronix, then your testing coverage is *clearly* inadequate. I'm not fixing everything, I'm just focusing on the exact one bug reported by Goffredo Baroncelli. Although it seems that, the bug reported by him is in fact two bugs. One is race condition I'm fixing, another one is that recovery is recovering data correctly, but screwing up parity. I just don't understand why you always want to fix everything in one step. Fix the important, fundamental things first, and the others later. This, from my understanding of Zygo's comments, appears to be one of the others. It's papering over the missing bricks in the wall instead of chipping out the mortar and putting new bricks in. It may need to be fixed, but it's not the fundamental "OMG, everything's totally broken" problem. If anything, it's only a serious problem *because* the other thing (write hole) is still there. It just seems like a piece of mis-prioritised effort. It seems that, we have different standards on the priority. My concern isn't priority. Easier bugs often get fixed first. That's just the way Linux development works. I am very concerned by articles like this: http://phoronix.com/scan.php?page=news_item=Btrfs-RAID5-RAID6-Fixed with headlines like "btrfs RAID5/RAID6 support is finally fixed" when that's very much not the case. 
Only one bug has been removed for the key use case that makes RAID5 interesting, and it's just the first of many that still remain in the path of a user trying to recover from a normal disk failure. Admittedly this is Michael's (Phoronix's) problem more than Qu's, but it's important to always be clear and _complete_ when stating bug status because people quote statements out of context. When the article quoted the text "it's not a timed bomb buried deeply into the RAID5/6 code, but a race condition in scrub recovery code" the commenters on Phoronix are clearly interpreting this to mean "famous RAID5/6 scrub error" had been fixed *and* the issue reported by Goffredo was the time bomb issue. It's more accurate to say something like "Goffredo's issue is not the time bomb buried deeply in the RAID5/6 code, but a separate issue caused by a race condition in scrub recovery code" Reading the Phoronix article, one might imagine RAID5 is now working as well as RAID1 on btrfs. To be clear, it's not--although the gap is now significantly narrower. For me, if some function on the very basic/minimal environment can't work reliably, then it's a high priority bug. In this case, in a very minimal setup, with only 128K data spreading on 3 devices RAID5. With a data stripe fully corrupted, without any other thing interfering. Scrub can't return correct csum error number and even cause false unrecoverable error, then it's a high priority thing. If the problem involves too many steps like removing devices, degraded mode, fsstress and some time. Then it's not that priority unless one pin-downs the root case to, for example, degraded mode itself with special sequenced operations. There are multiple bugs in the stress + remove device case. Some are quite easy to isolate. They range in difficulty from simple BUG_ON instead of error returns to finally solving the RMW update problem. 
> Run the test, choose any of the bugs that occur to work on, repeat
> until the test stops finding new bugs for a while.  There are
> currently several bugs to choose from with various levels of
> difficulty to fix them, and you should hit the first level of bugs in
> a matter of hours if not minutes.
>
> Using this method, you would have discovered Goffredo's bug years ago.
> Instead, you only discovered it after Phoronix quoted the conclusion
> of an investigation that started because of problems I reported here
> on this list, when I discovered half a dozen of btrfs's critical RAID5
> bugs from analyzing *one* disk failure event.

Nope, I don't want to touch anything related to degraded mode if any
fundamental function is broken, and the bug Goffredo reported has
nothing to do with any of the steps you mentioned.

How I found the problem has nothing to do with the fix.  Yes, I found
it through Phoronix or whatever other source--does that have anything
to do with the fix?  You can continue your testing, fixing and
reporting, but that has nothing to do with the
Re: [PATCH 0/2] RAID5/6 scrub race fix
On Fri, Nov 18, 2016 at 10:42:23AM +0800, Qu Wenruo wrote: > > > At 11/18/2016 09:56 AM, Hugo Mills wrote: > >On Fri, Nov 18, 2016 at 09:19:11AM +0800, Qu Wenruo wrote: > >> > >> > >>At 11/18/2016 07:13 AM, Zygo Blaxell wrote: > >>>On Tue, Nov 15, 2016 at 10:50:22AM +0800, Qu Wenruo wrote: > Fix the so-called famous RAID5/6 scrub error. > > Thanks Goffredo Baroncelli for reporting the bug, and make it into our > sight. > (Yes, without the Phoronix report on this, > https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad, > I won't ever be aware of it) > >>> > >>>If you're hearing about btrfs RAID5 bugs for the first time through > >>>Phoronix, then your testing coverage is *clearly* inadequate. > >> > >>I'm not fixing everything, I'm just focusing on the exact one bug > >>reported by Goffredo Baroncelli. > >> > >>Although it seems that, the bug reported by him is in fact two bugs. > >>One is race condition I'm fixing, another one is that recovery is > >>recovering data correctly, but screwing up parity. > >> > >>I just don't understand why you always want to fix everything in one step. > > > > Fix the important, fundamental things first, and the others > >later. This, from my understanding of Zygo's comments, appears to be > >one of the others. > > > > It's papering over the missing bricks in the wall instead of > >chipping out the mortar and putting new bricks in. It may need to be > >fixed, but it's not the fundamental "OMG, everything's totally broken" > >problem. If anything, it's only a serious problem *because* the other > >thing (write hole) is still there. > > > > It just seems like a piece of mis-prioritised effort. > > It seems that, we have different standards on the priority. My concern isn't priority. Easier bugs often get fixed first. That's just the way Linux development works. 
I am very concerned by articles like this: http://phoronix.com/scan.php?page=news_item=Btrfs-RAID5-RAID6-Fixed with headlines like "btrfs RAID5/RAID6 support is finally fixed" when that's very much not the case. Only one bug has been removed for the key use case that makes RAID5 interesting, and it's just the first of many that still remain in the path of a user trying to recover from a normal disk failure. Admittedly this is Michael's (Phoronix's) problem more than Qu's, but it's important to always be clear and _complete_ when stating bug status because people quote statements out of context. When the article quoted the text "it's not a timed bomb buried deeply into the RAID5/6 code, but a race condition in scrub recovery code" the commenters on Phoronix are clearly interpreting this to mean "famous RAID5/6 scrub error" had been fixed *and* the issue reported by Goffredo was the time bomb issue. It's more accurate to say something like "Goffredo's issue is not the time bomb buried deeply in the RAID5/6 code, but a separate issue caused by a race condition in scrub recovery code" Reading the Phoronix article, one might imagine RAID5 is now working as well as RAID1 on btrfs. To be clear, it's not--although the gap is now significantly narrower. > For me, if some function on the very basic/minimal environment can't work > reliably, then it's a high priority bug. > > In this case, in a very minimal setup, with only 128K data spreading on 3 > devices RAID5. With a data stripe fully corrupted, without any other thing > interfering. > Scrub can't return correct csum error number and even cause false > unrecoverable error, then it's a high priority thing. > If the problem involves too many steps like removing devices, degraded mode, > fsstress and some time. Then it's not that priority unless one pin-downs the > root case to, for example, degraded mode itself with special sequenced > operations. There are multiple bugs in the stress + remove device case. 
Some are quite easy to isolate. They range in difficulty from simple BUG_ON instead of error returns to finally solving the RMW update problem. Run the test, choose any of the bugs that occur to work on, repeat until the test stops finding new bugs for a while. There are currently several bugs to choose from with various levels of difficulty to fix them, and you should hit the first level of bugs in a matter of hours if not minutes. Using this method, you would have discovered Goffredo's bug years ago. Instead, you only discovered it after Phoronix quoted the conclusion of an investigation that started because of problems I reported here on this list when I discovered half a dozen of btrfs's critical RAID5 bugs from analyzing *one* disk failure event. > Yes, in fact sometimes our developers are fixing such bug at high priority, > but that's because other part has no such obvious problem. > But if we have obvious and easy to reproduce bug, then we should start from > that. To be able to use a RAID5 in production it must be possible to recover from
Re: [PATCH 0/2] RAID5/6 scrub race fix
At 11/18/2016 09:56 AM, Hugo Mills wrote: On Fri, Nov 18, 2016 at 09:19:11AM +0800, Qu Wenruo wrote: At 11/18/2016 07:13 AM, Zygo Blaxell wrote: On Tue, Nov 15, 2016 at 10:50:22AM +0800, Qu Wenruo wrote: Fix the so-called famous RAID5/6 scrub error. Thanks Goffredo Baroncelli for reporting the bug, and make it into our sight. (Yes, without the Phoronix report on this, https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad, I won't ever be aware of it) If you're hearing about btrfs RAID5 bugs for the first time through Phoronix, then your testing coverage is *clearly* inadequate. I'm not fixing everything, I'm just focusing on the exact one bug reported by Goffredo Baroncelli. Although it seems that, the bug reported by him is in fact two bugs. One is race condition I'm fixing, another one is that recovery is recovering data correctly, but screwing up parity. I just don't understand why you always want to fix everything in one step. Fix the important, fundamental things first, and the others later. This, from my understanding of Zygo's comments, appears to be one of the others. It's papering over the missing bricks in the wall instead of chipping out the mortar and putting new bricks in. It may need to be fixed, but it's not the fundamental "OMG, everything's totally broken" problem. If anything, it's only a serious problem *because* the other thing (write hole) is still there. It just seems like a piece of mis-prioritised effort. It seems that, we have different standards on the priority. For me, if some function on the very basic/minimal environment can't work reliably, then it's a high priority bug. In this case, in a very minimal setup, with only 128K data spreading on 3 devices RAID5. With a data stripe fully corrupted, without any other thing interfering. Scrub can't return correct csum error number and even cause false unrecoverable error, then it's a high priority thing. 
If the problem involves too many steps--removing devices, degraded
mode, fsstress, and some time--then it's not that high a priority,
unless one pins down the root cause to, for example, degraded mode
itself with a special sequence of operations.

Yes, in fact sometimes our developers fix such bugs at high priority,
but that's because the other parts have no such obvious problems.  But
if we have an obvious and easy-to-reproduce bug, then we should start
from that.  So I don't consider the problem Zygo described a high
priority problem.

Fill up a RAID5 array, start a FS stress test, pull a drive out while
that's running, let the FS stress test run for another hour, then try
to replace or delete the missing device.  If there are any crashes,
corruptions, or EIO during any part of this process (assuming all the
remaining disks are healthy), then btrfs RAID5 is still broken, and
you've found another bug to fix.

Then it will be another bug to fix, not the bug I'm fixing.  Unless you
can prove that whatever the bug is, it is related to this fix, I don't
see how that helps.

The fact that so many problems in btrfs can still be found this way
indicates to me that nobody is doing this basic level of testing
(or if they are, they're not doing anything about the results).

Unlike many of us (including myself) assumed, it's not a time bomb
buried deeply in the RAID5/6 code, but a race condition in scrub
recovery code.

I don't see how this patch fixes the write hole issue at the core of
btrfs RAID56.  It just makes the thin layer of bugs over that issue a
little thinner.  There's still the metadata RMW update timebomb at the
bottom of the bug pile that can't be fixed by scrub (the filesystem is
unrecoverably damaged when the bomb goes off, so scrub isn't possible).

Not "the core of btrfs RAID56", but "the core of all RAID56".  And this
is just another unrelated problem; I didn't even want to repeat what
I've already written.

That's the one that needs fixing *first*, though.
(IMO, YMMV, contents may have settled in transit).

And I forgot to mention: we don't even have an idea of how to fix the
generic RAID56 RMW problem, even for mdadm RAID56.  BTW, for mdadm
RAID56, they just scrub the whole array after every power loss; we
could just ask the user to do that, and btrfs csums can help scrub do a
better job than mdadm raid56.  Although we must first make sure btrfs
scrub on raid56 won't cause false alerts (this patchset) and won't
screw up parity (I'll fix that soon).

Thanks,
Qu

Hugo.

Thanks,
Qu

The problem is not found because normal mirror based profiles aren't
affected by the race, since they are independent of each other.

True.

Although this time the fix doesn't affect the scrub code much, it
should warn us that the current scrub code is really hard to maintain.

This last sentence is true.  I found and fixed three BUG_ONs in RAID5
code on the first day I started testing in degraded mode, then hit the
scrub code and had to give up.  It was like a brick wall made out of
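[Editorial note: Qu's remark that "btrfs csum can assist scrub to do a
better job than mdadm raid56" is worth unpacking.  A parity-only scrub
can detect a mismatch but cannot tell which device is lying, while
per-block checksums identify the bad copy and let the rebuilt result be
verified.  A toy sketch follows--plain Python with CRC32 standing in
for btrfs csums; illustrative only, not btrfs code.]

```python
import zlib

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# A 3-device RAID5 stripe: two data chunks, their XOR parity, and a
# per-chunk checksum (as a checksumming filesystem can keep).
d0, d1 = b"good data 0.", b"good data 1."
csums = [zlib.crc32(d0), zlib.crc32(d1)]
parity = xor(d0, d1)

d0 = b"BITROT DATA "  # silent corruption on device 0 (same length)

# md-style scrub: a parity mismatch is detected, but with no checksums
# there is no way to tell which of the three devices is wrong.
assert xor(d0, d1) != parity

# csum-assisted scrub: the checksum pinpoints the bad chunk...
assert zlib.crc32(d0) != csums[0] and zlib.crc32(d1) == csums[1]

# ...so it can be rebuilt from the other two chunks and verified again.
d0 = xor(d1, parity)
assert zlib.crc32(d0) == csums[0] and xor(d0, d1) == parity
```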
Re: [PATCH 0/2] RAID5/6 scrub race fix
On Fri, Nov 18, 2016 at 09:19:11AM +0800, Qu Wenruo wrote: > > > At 11/18/2016 07:13 AM, Zygo Blaxell wrote: > >On Tue, Nov 15, 2016 at 10:50:22AM +0800, Qu Wenruo wrote: > >>Fix the so-called famous RAID5/6 scrub error. > >> > >>Thanks Goffredo Baroncelli for reporting the bug, and make it into our > >>sight. > >>(Yes, without the Phoronix report on this, > >>https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad, > >>I won't ever be aware of it) > > > >If you're hearing about btrfs RAID5 bugs for the first time through > >Phoronix, then your testing coverage is *clearly* inadequate. > > I'm not fixing everything, I'm just focusing on the exact one bug > reported by Goffredo Baroncelli. > > Although it seems that, the bug reported by him is in fact two bugs. > One is race condition I'm fixing, another one is that recovery is > recovering data correctly, but screwing up parity. > > I just don't understand why you always want to fix everything in one step. Fix the important, fundamental things first, and the others later. This, from my understanding of Zygo's comments, appears to be one of the others. It's papering over the missing bricks in the wall instead of chipping out the mortar and putting new bricks in. It may need to be fixed, but it's not the fundamental "OMG, everything's totally broken" problem. If anything, it's only a serious problem *because* the other thing (write hole) is still there. It just seems like a piece of mis-prioritised effort. > >Fill up a RAID5 array, start a FS stress test, pull a drive out while > >that's running, let the FS stress test run for another hour, then try > >to replace or delete the missing device. If there are any crashes, > >corruptions, or EIO during any part of this process (assuming all the > >remaining disks are healthy), then btrfs RAID5 is still broken, and > >you've found another bug to fix. > > Then it will be another bug to fix, not the bug I'm fixing. 
> Unless you can prove whatever the bug is, is relating to the fix, I > don't see any help then. > > > > >The fact that so many problems in btrfs can still be found this way > >indicates to me that nobody is doing this basic level of testing > >(or if they are, they're not doing anything about the results). > > > >>Unlike many of us(including myself) assumed, it's not a timed bomb buried > >>deeply into the RAID5/6 code, but a race condition in scrub recovery > >>code. > > > >I don't see how this patch fixes the write hole issue at the core of > >btrfs RAID56. It just makes the thin layer of bugs over that issue a > >little thinner. There's still the metadata RMW update timebomb at the > >bottom of the bug pile that can't be fixed by scrub (the filesystem is > >unrecoverably damaged when the bomb goes off, so scrub isn't possible). > > Not "the core of btrfs RAID56", but "the core of all RAID56". > > And, this is just another unrelated problem, I didn't even want to > repeat what I've written. That's the one that needs fixing *first*, though. (IMO, YMMV, Contents may have settled in transit). Hugo. > Thanks, > Qu > > > > >>The problem is not found because normal mirror based profiles aren't > >>affected by the race, since they are independent with each other. > > > >True. > > > >>Although this time the fix doesn't affect the scrub code much, it should > >>warn us that current scrub code is really hard to maintain. > > > >This last sentence is true. I found and fixed three BUG_ONs in RAID5 > >code on the first day I started testing in degraded mode, then hit > >the scrub code and had to give up. It was like a brick wall made out > >of mismatched assumptions and layering inversions, using uninitialized > >kernel data as mortar (though I suppose the "uninitialized" data symptom > >might just have been an unprotected memory access). > > > >>Abuse of workquque to delay works and the full fs scrub is race prone. 
> >>
> >>Xfstests will follow a little later, as we don't have good enough
> >>tools to corrupt data stripes with pinpoint accuracy.
> >>
> >>Qu Wenruo (2):
> >>  btrfs: scrub: Introduce full stripe lock for RAID56
> >>  btrfs: scrub: Fix RAID56 recovery race condition
> >>
> >> fs/btrfs/ctree.h       |   4 ++
> >> fs/btrfs/extent-tree.c |   3 +
> >> fs/btrfs/scrub.c       | 192 +
> >> 3 files changed, 199 insertions(+)
> >>

--
Hugo Mills             | UDP jokes: It's OK if no-one gets them.
hugo@... carfax.org.uk | http://carfax.org.uk/ | PGP: E2AB1DE4
Re: [PATCH 0/2] RAID5/6 scrub race fix
At 11/18/2016 07:13 AM, Zygo Blaxell wrote:
> On Tue, Nov 15, 2016 at 10:50:22AM +0800, Qu Wenruo wrote:
>> Fix the so-called famous RAID5/6 scrub error.
>>
>> Thanks Goffredo Baroncelli for reporting the bug, and make it into our
>> sight.
>> (Yes, without the Phoronix report on this,
>> https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad,
>> I won't ever be aware of it)
>
> If you're hearing about btrfs RAID5 bugs for the first time through
> Phoronix, then your testing coverage is *clearly* inadequate.

I'm not fixing everything, I'm just focusing on the exact one bug
reported by Goffredo Baroncelli.

Although it seems that the bug reported by him is in fact two bugs.
One is the race condition I'm fixing; the other is that recovery
recovers the data correctly, but screws up the parity.

I just don't understand why you always want to fix everything in one
step.

> Fill up a RAID5 array, start a FS stress test, pull a drive out while
> that's running, let the FS stress test run for another hour, then try
> to replace or delete the missing device.  If there are any crashes,
> corruptions, or EIO during any part of this process (assuming all the
> remaining disks are healthy), then btrfs RAID5 is still broken, and
> you've found another bug to fix.

Then it will be another bug to fix, not the bug I'm fixing.  Unless you
can prove that whatever the bug is, it is related to this fix, I don't
see how that helps.

> The fact that so many problems in btrfs can still be found this way
> indicates to me that nobody is doing this basic level of testing
> (or if they are, they're not doing anything about the results).

>> Unlike many of us (including myself) assumed, it's not a time bomb
>> buried deeply in the RAID5/6 code, but a race condition in scrub
>> recovery code.

> I don't see how this patch fixes the write hole issue at the core of
> btrfs RAID56.  It just makes the thin layer of bugs over that issue a
> little thinner.  There's still the metadata RMW update timebomb at the
> bottom of the bug pile that can't be fixed by scrub (the filesystem is
> unrecoverably damaged when the bomb goes off, so scrub isn't possible).

Not "the core of btrfs RAID56", but "the core of all RAID56".  And this
is just another unrelated problem; I didn't even want to repeat what
I've already written.

Thanks,
Qu

>> The problem is not found because normal mirror based profiles aren't
>> affected by the race, since they are independent of each other.
>
> True.
>
>> Although this time the fix doesn't affect the scrub code much, it
>> should warn us that the current scrub code is really hard to maintain.
>
> This last sentence is true.  I found and fixed three BUG_ONs in RAID5
> code on the first day I started testing in degraded mode, then hit the
> scrub code and had to give up.  It was like a brick wall made out of
> mismatched assumptions and layering inversions, using uninitialized
> kernel data as mortar (though I suppose the "uninitialized" data
> symptom might just have been an unprotected memory access).
>
>> Abuse of workqueue to delay works and the full fs scrub is race prone.
>>
>> Xfstests will follow a little later, as we don't have good enough
>> tools to corrupt data stripes with pinpoint accuracy.
>>
>> Qu Wenruo (2):
>>   btrfs: scrub: Introduce full stripe lock for RAID56
>>   btrfs: scrub: Fix RAID56 recovery race condition
>>
>>  fs/btrfs/ctree.h       |   4 ++
>>  fs/btrfs/extent-tree.c |   3 +
>>  fs/btrfs/scrub.c       | 192 +
>>  3 files changed, 199 insertions(+)
>>
>> --
>> 2.10.2
Re: [PATCH 0/2] RAID5/6 scrub race fix
At 11/18/2016 02:09 AM, Goffredo Baroncelli wrote:
> Hi Qu,
>
> On 2016-11-15 03:50, Qu Wenruo wrote:
>> Fix the so-called famous RAID5/6 scrub error.
>>
>> Thanks Goffredo Baroncelli for reporting the bug, and make it into our
>> sight.
>> (Yes, without the Phoronix report on this,
>> https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad,
>> I won't ever be aware of it)
>>
>> Unlike many of us (including myself) assumed, it's not a time bomb
>> buried deeply in the RAID5/6 code, but a race condition in scrub
>> recovery code.
>>
>> The problem is not found because normal mirror based profiles aren't
>> affected by the race, since they are independent of each other.
>
> Unfortunately, even with these patches btrfs still fails to rebuild a
> raid5 array in my tests.
>
> To perform the tests I created a three-disk raid5 filesystem, then
> created a file:
>
>   # python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" >out.txt
>
> The filesystem layout is composed of stripes of 128k of data and 64k
> of parity.  The idea is that the first third of the stripe (64k on
> disk0) is composed of "adaaa..."; the second third (again 64k, on
> disk1) is composed of "bdbbb..."; and finally the parity
> (disk0^disk1) is stored in the last portion of the stripe.  Doing
> some calculation, it is possible to know where the data is physically
> stored.
>
> I performed 3 tests:
>
> 1) I corrupt some bytes of the data stored on disk0; mount the
>    filesystem; run scrub; unmount the filesystem; check on all the
>    disks whether the bytes of the stripe were corrected.
>
> 2) I corrupt some bytes of the data stored on disk1; mount the
>    filesystem; run scrub; unmount the filesystem; check on all the
>    disks whether the bytes of the stripe were corrected.
>
> 3) I corrupt some bytes of the parity stored on disk2; mount the
>    filesystem; run scrub; unmount the filesystem; check on all the
>    disks whether the bytes of the stripe were corrected.
>
> Before your patches, all my tests failed (not every time, but very
> often).  After your patches, test1 and test2 still fail.  In my
> tests, test3 succeeded.

Thanks for your report and script.
I compared my script with yours, and found the difference: I'm
corrupting the whole data stripe (64K, and without the space cache, to
make sure all the data is in the 1st full stripe), while you only
corrupt 5 bytes of it.

In that case, btrfs won't report more errors than the expected 1, which
means the race fix is working.

But the new btrfs-progs scrub will report that the recovered data
stripes are all OK while P/Q is not correct. This seems to be a new
bug, unrelated to the RAID5/6 scrub race (whose main symptom is an
incorrect csum error count and sometimes even an unrecoverable error).

I assume it will be easier to fix than the race, but I could be totally
wrong until I dig into it further. Anyway, I'll fix it in a new
patchset.

Thanks,
Qu

> Enclosed my script which performs the tests (it uses loop devices;
> please run it in a VM; first you have to uncomment the function
> make_imgs to create the disk images). Let me know if I can help you by
> providing more information.
>
> BR
> G.Baroncelli
>
>> Although this time the fix doesn't affect the scrub code much, it
>> should warn us that the current scrub code is really hard to
>> maintain.
>>
>> Abuse of the workqueue to delay work, and the full fs scrub, are
>> race prone.
>>
>> An xfstest will follow a little later, as we don't have good enough
>> tools to corrupt data stripes precisely.
>>
>> Qu Wenruo (2):
>>   btrfs: scrub: Introduce full stripe lock for RAID56
>>   btrfs: scrub: Fix RAID56 recovery race condition
>>
>>  fs/btrfs/ctree.h       |   4 ++
>>  fs/btrfs/extent-tree.c |   3 +
>>  fs/btrfs/scrub.c       | 192 +
>>  3 files changed, 199 insertions(+)

[Attachment: raid56_recover.sh (shell script)]
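Qu's point above (corrupting the whole 64K data stripe rather than a
few bytes) can be sketched as follows. This is a minimal illustration
against a scratch image file, not either of the actual attached
scripts; the offset of 0 is a placeholder, since on a real filesystem
the physical location of the stripe would have to be worked out (e.g.
with btrfs-map-logical):

```python
# Sketch: overwrite one whole 64K data stripe in a disk image with
# random garbage, the way Qu's test does, instead of only 5 bytes.
# IMG and STRIPE_OFFSET are placeholders for illustration only.
import os

IMG = "disk0.img"            # stand-in for a loop-device backing file
STRIPE_OFFSET = 0            # placeholder physical offset of the stripe
STRIPE_SIZE = 64 * 1024

# Create a 1 MiB scratch image so the sketch is runnable as-is.
with open(IMG, "wb") as f:
    f.write(b"\0" * (1024 * 1024))

# Corrupt the full stripe in place (the file size must not change).
with open(IMG, "r+b") as f:
    f.seek(STRIPE_OFFSET)
    f.write(os.urandom(STRIPE_SIZE))

assert os.path.getsize(IMG) == 1024 * 1024  # in-place, size unchanged
```

A whole-stripe corruption forces scrub to reconstruct the full stripe
element from parity, rather than merely re-checksum a few changed bytes.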
Re: [PATCH 0/2] RAID5/6 scrub race fix
On Tue, Nov 15, 2016 at 10:50:22AM +0800, Qu Wenruo wrote:
> Fix the so-called famous RAID5/6 scrub error.
>
> Thanks Goffredo Baroncelli for reporting the bug, and make it into our
> sight.
> (Yes, without the Phoronix report on this,
> https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad,
> I won't ever be aware of it)

If you're hearing about btrfs RAID5 bugs for the first time through
Phoronix, then your testing coverage is *clearly* inadequate.

Fill up a RAID5 array, start a FS stress test, pull a drive out while
that's running, let the FS stress test run for another hour, then try
to replace or delete the missing device. If there are any crashes,
corruptions, or EIO during any part of this process (assuming all the
remaining disks are healthy), then btrfs RAID5 is still broken, and
you've found another bug to fix.

The fact that so many problems in btrfs can still be found this way
indicates to me that nobody is doing this basic level of testing (or
if they are, they're not doing anything about the results).

> Unlike many of us (including myself) assumed, it's not a time bomb
> buried deeply in the RAID5/6 code, but a race condition in scrub
> recovery code.

I don't see how this patch fixes the write hole issue at the core of
btrfs RAID56. It just makes the thin layer of bugs over that issue a
little thinner. There's still the metadata RMW update timebomb at the
bottom of the bug pile that can't be fixed by scrub (the filesystem is
unrecoverably damaged when the bomb goes off, so scrub isn't possible).

> The problem is not found because normal mirror based profiles aren't
> affected by the race, since they are independent of each other.

True.

> Although this time the fix doesn't affect the scrub code much, it
> should warn us that current scrub code is really hard to maintain.

This last sentence is true.
I found and fixed three BUG_ONs in the RAID5 code on the first day I
started testing in degraded mode, then hit the scrub code and had to
give up. It was like a brick wall made out of mismatched assumptions
and layering inversions, using uninitialized kernel data as mortar
(though I suppose the "uninitialized" data symptom might just have
been an unprotected memory access).

> Abuse of the workqueue to delay work and the full fs scrub is race
> prone.
>
> An xfstest will follow a little later, as we don't have good enough
> tools to corrupt data stripes precisely.
>
> Qu Wenruo (2):
>   btrfs: scrub: Introduce full stripe lock for RAID56
>   btrfs: scrub: Fix RAID56 recovery race condition
>
>  fs/btrfs/ctree.h       |   4 ++
>  fs/btrfs/extent-tree.c |   3 +
>  fs/btrfs/scrub.c       | 192 +
>  3 files changed, 199 insertions(+)
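The degraded-recovery exercise described above (fill the array, stress
it, pull a drive, keep the load running, then replace the missing
device) can be written down as a test plan. The sketch below is a
dry-run only: the device paths are placeholders, the commands are
printed rather than executed, and this is not an actual harness from
the thread:

```python
# Dry-run sketch of the degraded raid5 recovery test described above.
# /dev/sdb.../dev/sde, /mnt, and the devid 3 are all placeholders.
import subprocess

STEPS = [
    "mkfs.btrfs -f -d raid5 -m raid5 /dev/sdb /dev/sdc /dev/sdd",
    "mount /dev/sdb /mnt",
    "fsstress -d /mnt/stress -n 100000 -p 4 &",  # any FS stress tool works
    "echo 1 > /sys/block/sdd/device/delete",     # simulate pulling a drive
    "sleep 3600",                                # another hour under load
    "btrfs replace start -B 3 /dev/sde /mnt",    # replace missing devid 3
    "dmesg | grep -iE 'btrfs|raid56|error'",     # any crash/EIO => a bug
]

def run_plan(dry_run=True):
    """Print each step; only execute when dry_run is explicitly False."""
    for cmd in STEPS:
        print("+", cmd)
        if not dry_run:
            subprocess.run(cmd, shell=True, check=True)

run_plan()  # dry run: print the plan only
```

The pass criterion is the one stated above: any crash, corruption, or
EIO with all remaining disks healthy means another bug to fix.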
Re: [PATCH 0/2] RAID5/6 scrub race fix
Hi Qu,

On 2016-11-15 03:50, Qu Wenruo wrote:
> Fix the so-called famous RAID5/6 scrub error.
>
> Thanks Goffredo Baroncelli for reporting the bug, and make it into our
> sight.
> (Yes, without the Phoronix report on this,
> https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad,
> I won't ever be aware of it)
>
> Unlike many of us (including myself) assumed, it's not a time bomb
> buried deeply in the RAID5/6 code, but a race condition in scrub
> recovery code.
>
> The problem is not found because normal mirror based profiles aren't
> affected by the race, since they are independent of each other.

Unfortunately, even with these patches btrfs still fails to rebuild a
raid5 array in my tests.

To perform the tests I created a three-disk raid5 filesystem, then
created a file:

  # python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" >out.txt

The filesystem layout is composed of stripes of 128k of data plus 64k
of parity. The idea is that the first third of the stripe (64k on
disk0) is "adaaa..."; the second third (again 64k, on disk1) starts
with "bd"; and the parity (disk0 ^ disk1) is stored on the last
portion of the stripe. With some calculation it is possible to know
where the data is physically stored.

I performed 3 tests:
1) I corrupt some bytes of the data stored on disk0; mount the
   filesystem; run scrub; unmount the filesystem; check on all the
   disks whether the bytes of the stripe were corrected
2) I corrupt some bytes of the data stored on disk1; mount the
   filesystem; run scrub; unmount the filesystem; check on all the
   disks whether the bytes of the stripe were corrected
3) I corrupt some bytes of the parity stored on disk2; mount the
   filesystem; run scrub; unmount the filesystem; check on all the
   disks whether the bytes of the stripe were corrected

Before your patches, all my tests failed (not every time, but very
often). After your patches, test 1 and test 2 still fail. In my
testing, test 3 succeeded.
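The parity relationship in this layout (parity = disk0 ^ disk1 over the
64K stripe elements) can be checked in Python. This is a sketch of the
math only, not btrfs's actual on-disk code; it assumes Python 2's
print appended a newline to the test file, making it exactly 128K (two
full 64K data stripe elements):

```python
# RAID5 parity math for the two-data-disk layout described above:
# P = D0 XOR D1, so any single lost stripe element can be rebuilt
# from the other two.
STRIPE = 64 * 1024

# Same contents as the test file, including print's trailing newline.
data = ("ad" + "a" * 65534 + "bd" + "b" * 65533 + "\n").encode()
assert len(data) == 2 * STRIPE

d0, d1 = data[:STRIPE], data[STRIPE:2 * STRIPE]

# Parity is the bytewise XOR of the data stripe elements (disk0 ^ disk1).
parity = bytes(a ^ b for a, b in zip(d0, d1))

# Rebuild d0 from parity and d1, as scrub would after disk0 is corrupted.
rebuilt = bytes(p ^ b for p, b in zip(parity, d1))
assert rebuilt == d0
```

This is why corrupting any single one of disk0, disk1, or disk2 in the
three tests should always be recoverable: the other two stripe elements
together determine the third.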
Enclosed my script which performs the tests (it uses loop devices;
please run it in a VM; first you have to uncomment the function
make_imgs to create the disk images). Let me know if I can help you by
providing more information.

BR
G.Baroncelli

> Although this time the fix doesn't affect the scrub code much, it
> should warn us that current scrub code is really hard to maintain.
>
> Abuse of the workqueue to delay work and the full fs scrub is race
> prone.
>
> An xfstest will follow a little later, as we don't have good enough
> tools to corrupt data stripes precisely.
>
> Qu Wenruo (2):
>   btrfs: scrub: Introduce full stripe lock for RAID56
>   btrfs: scrub: Fix RAID56 recovery race condition
>
>  fs/btrfs/ctree.h       |   4 ++
>  fs/btrfs/extent-tree.c |   3 +
>  fs/btrfs/scrub.c       | 192 +
>  3 files changed, 199 insertions(+)

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5

[Attachment: test-btrfs.sh (Bourne shell script)]