Re: [PATCH 0/2] RAID5/6 scrub race fix

2016-11-18 Thread Zygo Blaxell
On Fri, Nov 18, 2016 at 07:09:34PM +0100, Goffredo Baroncelli wrote:
> Hi Zygo
> On 2016-11-18 00:13, Zygo Blaxell wrote:
> > On Tue, Nov 15, 2016 at 10:50:22AM +0800, Qu Wenruo wrote:
> >> Fix the so-called famous RAID5/6 scrub error.
> >>
> >> Thanks Goffredo Baroncelli for reporting the bug, and make it into our
> >> sight.
> >> (Yes, without the Phoronix report on this,
> >> https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad,
> >> I won't ever be aware of it)
> > 
> > If you're hearing about btrfs RAID5 bugs for the first time through
> > Phoronix, then your testing coverage is *clearly* inadequate.
> > 
> > Fill up a RAID5 array, start a FS stress test, pull a drive out while
> > that's running, let the FS stress test run for another hour, then try
> > to replace or delete the missing device.  If there are any crashes,
> > corruptions, or EIO during any part of this process (assuming all the
> > remaining disks are healthy), then btrfs RAID5 is still broken, and
> > you've found another bug to fix.
> > 
> > The fact that so many problems in btrfs can still be found this way
> > indicates to me that nobody is doing this basic level of testing
> > (or if they are, they're not doing anything about the results).
> 
> [...]
> 
> Sorry, but I don't find this kind of discussion useful.  Yes, BTRFS
> RAID5/6 needs a lot of care. Yes, *our* test coverage is far from
> complete; but this is not the fault of a single person, and Qu tried to
> solve one issue; for that we should only say thanks.
>
> Even if you don't find the work of Qu (and my little contribution :-) )
> valuable, it required some time and deserves to be respected.

I do find this work valuable, and I do thank you and Qu for it.
I've been following it with great interest because I haven't had time
to dive into it myself.  It's a use case I've used before and would like
to use again.

Most of my recent frustration, if directed at anyone, is really directed
at Phoronix for conflating "one bug was fixed" with "ready for production
use today," and I wanted to ensure that the latter rumor was promptly
quashed.

This is why I'm excited about Qu's work:  on my list of 7 btrfs-raid5
recovery bugs (6 I found plus yours), Qu has fixed at least 2 of them,
maybe as many as 4, with the patches so far.  I can fix 2 of the others,
for a total of 6 fixed out of 7.

Specifically, the 7 bugs I know of are:

1-2. BUG_ONs in functions that should return errors (I had
fixed both already when trying to recover my broken arrays)

3. scrub can't identify which drives or files are corrupted
(Qu might have fixed this--I won't know until I do testing)

4-6. symptom groups related to wrong data or EIO in scrub
recovery, including Goffredo's (Qu might have fixed all of these,
but from a quick read of the patch I think at least two are done).

7. the write hole.

I'll know more after I've had a chance to run Qu's patches through
testing, which I intend to do at some point.

Optimistically, this means there could be only *one* bug remaining
in the critical path for btrfs RAID56 single disk failure recovery.
That last bug is the write hole, which is why I keep going on about it.
It's the only bug I know exists in btrfs RAID56 that has neither an
existing fix nor any evidence of someone actively working on it, even
at the design proposal stage.  Please, I'd love to be wrong about this.

When I described the situation recently as "a thin layer of bugs on
top of a design defect", I was not trying to be mean.  I was trying to
describe the situation *precisely*.

The thin layer of bugs is much thinner thanks to Qu's work, and thanks
in part to his work, I now have confidence that further investment in
this area won't be wasted.

> Finally, I don't think that we should compare the RAID hole with this
> kind of bug (fix). The former is a design issue; the latter is a bug in
> one of the basic features of a RAID system (recovering from the loss of
> a disk or from corruption).
>
> Even the MD subsystem (which is far behind btrfs) had tolerated
> the raid-hole until last year. 

My frustration on this point is with the attitude that mdadm was ever
good enough, much less a model to emulate in the future.  It's 2016--there
have been some advancements in the state of the art since the IBM patent
describing RAID5 30 years ago, yet in the btrfs world, we seem to insist
on repeating all the same mistakes in the same order.

"We're as good as some existing broken-by-design thing" isn't a really
useful attitude.  We should aspire to do *better* than the existing
broken-by-design things.  If we didn't, we wouldn't be here, we'd all
be lurking on some other list, running ext4 or xfs on mdadm or lvm.

> And its solution is far from cheap
> (basically the MD subsystem writes the data first to the journal and
> then to the disk... which is the kind of issue that a COW filesystem
> would solve).

Journalling isn't required.  It's 

Re: [PATCH 0/2] RAID5/6 scrub race fix

2016-11-18 Thread Goffredo Baroncelli
Hi Zygo
On 2016-11-18 00:13, Zygo Blaxell wrote:
> On Tue, Nov 15, 2016 at 10:50:22AM +0800, Qu Wenruo wrote:
>> Fix the so-called famous RAID5/6 scrub error.
>>
>> Thanks Goffredo Baroncelli for reporting the bug, and make it into our
>> sight.
>> (Yes, without the Phoronix report on this,
>> https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad,
>> I won't ever be aware of it)
> 
> If you're hearing about btrfs RAID5 bugs for the first time through
> Phoronix, then your testing coverage is *clearly* inadequate.
> 
> Fill up a RAID5 array, start a FS stress test, pull a drive out while
> that's running, let the FS stress test run for another hour, then try
> to replace or delete the missing device.  If there are any crashes,
> corruptions, or EIO during any part of this process (assuming all the
> remaining disks are healthy), then btrfs RAID5 is still broken, and
> you've found another bug to fix.
> 
> The fact that so many problems in btrfs can still be found this way
> indicates to me that nobody is doing this basic level of testing
> (or if they are, they're not doing anything about the results).

[...]

Sorry, but I don't find this kind of discussion useful.
Yes, BTRFS RAID5/6 needs a lot of care. Yes, *our* test coverage is far from 
complete; but this is not the fault of a single person, and Qu tried to solve one 
issue; for that we should only say thanks.

Even if you don't find the work of Qu (and my little contribution :-) ) valuable, 
it required some time and deserves to be respected.

Finally, I don't think that we should compare the RAID hole with this kind of 
bug (fix). The former is a design issue; the latter is a bug in one of the basic 
features of a RAID system (recovering from the loss of a disk or from 
corruption).

Even the MD subsystem (which is far behind btrfs) had tolerated the RAID hole 
until last year. And its solution is far from cheap (basically the MD 
subsystem writes the data first to the journal and then to the disk... which is 
the kind of issue that a COW filesystem would solve).
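For reference, this is roughly how md sets up such a journal (a sketch only;
it assumes mdadm >= 3.4, and the device names are just examples):

# every stripe update is written to the journal device before the member
# disks, closing the write hole at the cost of an extra write
mdadm --create /dev/md0 --level=5 --raid-devices=3 \
      /dev/sda1 /dev/sdb1 /dev/sdc1 \
      --write-journal /dev/nvme0n1p1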

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH 0/2] RAID5/6 scrub race fix

2016-11-17 Thread Qu Wenruo



On 11/18/2016 12:43 PM, Zygo Blaxell wrote:

On Fri, Nov 18, 2016 at 10:42:23AM +0800, Qu Wenruo wrote:



At 11/18/2016 09:56 AM, Hugo Mills wrote:

On Fri, Nov 18, 2016 at 09:19:11AM +0800, Qu Wenruo wrote:



At 11/18/2016 07:13 AM, Zygo Blaxell wrote:

On Tue, Nov 15, 2016 at 10:50:22AM +0800, Qu Wenruo wrote:

Fix the so-called famous RAID5/6 scrub error.

Thanks Goffredo Baroncelli for reporting the bug, and make it into our
sight.
(Yes, without the Phoronix report on this,
https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad,
I won't ever be aware of it)


If you're hearing about btrfs RAID5 bugs for the first time through
Phoronix, then your testing coverage is *clearly* inadequate.


I'm not fixing everything, I'm just focusing on the exact one bug
reported by Goffredo Baroncelli.

Although it seems that, the bug reported by him is in fact two bugs.
One is race condition I'm fixing, another one is that recovery is
recovering data correctly, but screwing up parity.

I just don't understand why you always want to fix everything in one step.


  Fix the important, fundamental things first, and the others
later. This, from my understanding of Zygo's comments, appears to be
one of the others.

  It's papering over the missing bricks in the wall instead of
chipping out the mortar and putting new bricks in. It may need to be
fixed, but it's not the fundamental "OMG, everything's totally broken"
problem. If anything, it's only a serious problem *because* the other
thing (write hole) is still there.

  It just seems like a piece of mis-prioritised effort.


It seems that, we have different standards on the priority.


My concern isn't priority.  Easier bugs often get fixed first.  That's
just the way Linux development works.

I am very concerned by articles like this:

http://phoronix.com/scan.php?page=news_item=Btrfs-RAID5-RAID6-Fixed

with headlines like "btrfs RAID5/RAID6 support is finally fixed" when
that's very much not the case.  Only one bug has been removed for the
key use case that makes RAID5 interesting, and it's just the first of
many that still remain in the path of a user trying to recover from a
normal disk failure.

Admittedly this is Michael's (Phoronix's) problem more than Qu's, but
it's important to always be clear and _complete_ when stating bug status
because people quote statements out of context.  When the article quoted
the text

"it's not a timed bomb buried deeply into the RAID5/6 code,
but a race condition in scrub recovery code"

the commenters on Phoronix are clearly interpreting this to mean "famous
RAID5/6 scrub error" had been fixed *and* the issue reported by Goffredo
was the time bomb issue.  It's more accurate to say something like

"Goffredo's issue is not the time bomb buried deeply in the
RAID5/6 code, but a separate issue caused by a race condition
in scrub recovery code"

Reading the Phoronix article, one might imagine RAID5 is now working
as well as RAID1 on btrfs.  To be clear, it's not--although the gap
is now significantly narrower.


For me, if some function on the very basic/minimal environment can't work
reliably, then it's a high priority bug.

In this case, in a very minimal setup, with only 128K data spreading on 3
devices RAID5. With a data stripe fully corrupted, without any other thing
interfering.
Scrub can't return correct csum error number and even cause false
unrecoverable error, then it's a high priority thing.



If the problem involves too many steps like removing devices, degraded mode,
fsstress and some time. Then it's not that priority unless one pin-downs the
root case to, for example, degraded mode itself with special sequenced
operations.


There are multiple bugs in the stress + remove device case.  Some are
quite easy to isolate.  They range in difficulty from simple BUG_ON
instead of error returns to finally solving the RMW update problem.

Run the test, choose any of the bugs that occur to work on, repeat until
the test stops finding new bugs for a while.  There are currently several
bugs to choose from with various levels of difficulty to fix them, and you
should hit the first level of bugs in a matter of hours if not minutes.

Using this method, you would have discovered Goffredo's bug years ago.
Instead, you only discovered it after Phoronix quoted the conclusion
of an investigation that started because of problems I reported here on
this list when I discovered half a dozen of btrfs's critical RAID5 bugs
from analyzing *one* disk failure event.


Nope, I don't want to touch anything related to degraded mode if any 
fundamental function is broken, and the bug Goffredo reported has 
nothing to do with any of the steps you mentioned.


How I found the problem has nothing to do with the bug fix.
Yeah, I found it via Phoronix or whatever other source; does that have 
anything to do with the fix?


You can continue your test or fix or report. But that has nothing 
related to the 

Re: [PATCH 0/2] RAID5/6 scrub race fix

2016-11-17 Thread Zygo Blaxell
On Fri, Nov 18, 2016 at 10:42:23AM +0800, Qu Wenruo wrote:
> 
> 
> At 11/18/2016 09:56 AM, Hugo Mills wrote:
> >On Fri, Nov 18, 2016 at 09:19:11AM +0800, Qu Wenruo wrote:
> >>
> >>
> >>At 11/18/2016 07:13 AM, Zygo Blaxell wrote:
> >>>On Tue, Nov 15, 2016 at 10:50:22AM +0800, Qu Wenruo wrote:
> Fix the so-called famous RAID5/6 scrub error.
> 
> Thanks Goffredo Baroncelli for reporting the bug, and make it into our
> sight.
> (Yes, without the Phoronix report on this,
> https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad,
> I won't ever be aware of it)
> >>>
> >>>If you're hearing about btrfs RAID5 bugs for the first time through
> >>>Phoronix, then your testing coverage is *clearly* inadequate.
> >>
> >>I'm not fixing everything, I'm just focusing on the exact one bug
> >>reported by Goffredo Baroncelli.
> >>
> >>Although it seems that, the bug reported by him is in fact two bugs.
> >>One is race condition I'm fixing, another one is that recovery is
> >>recovering data correctly, but screwing up parity.
> >>
> >>I just don't understand why you always want to fix everything in one step.
> >
> >   Fix the important, fundamental things first, and the others
> >later. This, from my understanding of Zygo's comments, appears to be
> >one of the others.
> >
> >   It's papering over the missing bricks in the wall instead of
> >chipping out the mortar and putting new bricks in. It may need to be
> >fixed, but it's not the fundamental "OMG, everything's totally broken"
> >problem. If anything, it's only a serious problem *because* the other
> >thing (write hole) is still there.
> >
> >   It just seems like a piece of mis-prioritised effort.
> 
> It seems that, we have different standards on the priority.

My concern isn't priority.  Easier bugs often get fixed first.  That's
just the way Linux development works.

I am very concerned by articles like this:

http://phoronix.com/scan.php?page=news_item=Btrfs-RAID5-RAID6-Fixed

with headlines like "btrfs RAID5/RAID6 support is finally fixed" when
that's very much not the case.  Only one bug has been removed for the
key use case that makes RAID5 interesting, and it's just the first of
many that still remain in the path of a user trying to recover from a
normal disk failure.

Admittedly this is Michael's (Phoronix's) problem more than Qu's, but
it's important to always be clear and _complete_ when stating bug status
because people quote statements out of context.  When the article quoted
the text

"it's not a timed bomb buried deeply into the RAID5/6 code,
but a race condition in scrub recovery code"

the commenters on Phoronix are clearly interpreting this to mean "famous
RAID5/6 scrub error" had been fixed *and* the issue reported by Goffredo
was the time bomb issue.  It's more accurate to say something like

"Goffredo's issue is not the time bomb buried deeply in the
RAID5/6 code, but a separate issue caused by a race condition
in scrub recovery code"

Reading the Phoronix article, one might imagine RAID5 is now working
as well as RAID1 on btrfs.  To be clear, it's not--although the gap
is now significantly narrower.

> For me, if some function on the very basic/minimal environment can't work
> reliably, then it's a high priority bug.
> 
> In this case, in a very minimal setup, with only 128K data spreading on 3
> devices RAID5. With a data stripe fully corrupted, without any other thing
> interfering.
> Scrub can't return correct csum error number and even cause false
> unrecoverable error, then it's a high priority thing.

> If the problem involves too many steps like removing devices, degraded mode,
> fsstress and some time. Then it's not that priority unless one pin-downs the
> root case to, for example, degraded mode itself with special sequenced
> operations.

There are multiple bugs in the stress + remove device case.  Some are
quite easy to isolate.  They range in difficulty from simple BUG_ON
instead of error returns to finally solving the RMW update problem.

Run the test, choose any of the bugs that occur to work on, repeat until
the test stops finding new bugs for a while.  There are currently several
bugs to choose from with various levels of difficulty to fix them, and you
should hit the first level of bugs in a matter of hours if not minutes.

Using this method, you would have discovered Goffredo's bug years ago.
Instead, you only discovered it after Phoronix quoted the conclusion
of an investigation that started because of problems I reported here on
this list when I discovered half a dozen of btrfs's critical RAID5 bugs
from analyzing *one* disk failure event.

> Yes, in fact sometimes our developers are fixing such bug at high priority,
> but that's because other part has no such obvious problem.
> But if we have obvious and easy to reproduce bug, then we should start from
> that.

To be able to use a RAID5 in production it must be possible to recover
from 

Re: [PATCH 0/2] RAID5/6 scrub race fix

2016-11-17 Thread Qu Wenruo



At 11/18/2016 09:56 AM, Hugo Mills wrote:

On Fri, Nov 18, 2016 at 09:19:11AM +0800, Qu Wenruo wrote:



At 11/18/2016 07:13 AM, Zygo Blaxell wrote:

On Tue, Nov 15, 2016 at 10:50:22AM +0800, Qu Wenruo wrote:

Fix the so-called famous RAID5/6 scrub error.

Thanks Goffredo Baroncelli for reporting the bug, and make it into our
sight.
(Yes, without the Phoronix report on this,
https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad,
I won't ever be aware of it)


If you're hearing about btrfs RAID5 bugs for the first time through
Phoronix, then your testing coverage is *clearly* inadequate.


I'm not fixing everything, I'm just focusing on the exact one bug
reported by Goffredo Baroncelli.

Although it seems that, the bug reported by him is in fact two bugs.
One is race condition I'm fixing, another one is that recovery is
recovering data correctly, but screwing up parity.

I just don't understand why you always want to fix everything in one step.


   Fix the important, fundamental things first, and the others
later. This, from my understanding of Zygo's comments, appears to be
one of the others.

   It's papering over the missing bricks in the wall instead of
chipping out the mortar and putting new bricks in. It may need to be
fixed, but it's not the fundamental "OMG, everything's totally broken"
problem. If anything, it's only a serious problem *because* the other
thing (write hole) is still there.

   It just seems like a piece of mis-prioritised effort.


It seems that we have different standards on priority.

For me, if some function can't work reliably in even a very basic/minimal 
environment, then it's a high-priority bug.


In this case we have a very minimal setup: only 128K of data spread over 
a 3-device RAID5, with one data stripe fully corrupted and nothing else 
interfering.
If scrub can't return the correct csum error count, and even causes false 
unrecoverable errors, then it's a high-priority problem.


If the problem involves too many steps, like removing devices, degraded 
mode, fsstress and some amount of time, then it's not as high a priority 
unless someone pins down the root cause to, for example, degraded mode 
itself with specially sequenced operations.


Yes, in fact our developers sometimes do fix such bugs at high 
priority, but that's because the other parts have no such obvious problems.
But if we have an obvious and easy-to-reproduce bug, then we should start 
from that.


So I don't consider the problem Zygo described to be a high-priority problem.




Fill up a RAID5 array, start a FS stress test, pull a drive out while
that's running, let the FS stress test run for another hour, then try
to replace or delete the missing device.  If there are any crashes,
corruptions, or EIO during any part of this process (assuming all the
remaining disks are healthy), then btrfs RAID5 is still broken, and
you've found another bug to fix.


Then it will be another bug to fix, not the bug I'm fixing.
Unless you can prove whatever the bug is, is relating to the fix, I
don't see any help then.



The fact that so many problems in btrfs can still be found this way
indicates to me that nobody is doing this basic level of testing
(or if they are, they're not doing anything about the results).


Unlike many of us(including myself) assumed, it's not a timed bomb buried
deeply into the RAID5/6 code, but a race condition in scrub recovery
code.


I don't see how this patch fixes the write hole issue at the core of
btrfs RAID56.  It just makes the thin layer of bugs over that issue a
little thinner.  There's still the metadata RMW update timebomb at the
bottom of the bug pile that can't be fixed by scrub (the filesystem is
unrecoverably damaged when the bomb goes off, so scrub isn't possible).


Not "the core of btrfs RAID56", but "the core of all RAID56".

And, this is just another unrelated problem, I didn't even want to
repeat what I've written.


   That's the one that needs fixing *first*, though. (IMO, YMMV,
Contents may have settled in transit).


And I forgot to mention: we don't even have an idea of how to fix the 
generic RAID56 RMW problem, even for mdadm RAID56.


BTW, mdadm RAID56 just scrubs the whole array after every power loss; 
we could simply ask the user to do the same, and btrfs csums can help 
scrub do a better job there than mdadm RAID56 can.
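Something like this (only a sketch; the md device name and the mount point 
are just examples):

# md: trigger a full check of the array after an unclean shutdown
echo check > /sys/block/md0/md/sync_action

# btrfs: a full scrub, where data csums tell us which copy is actually bad
btrfs scrub start -B /mnt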


Although first we must make sure that btrfs scrub on RAID56 won't cause 
false alerts (this patchset) and won't screw up the parity (which I'll 
fix soon).


Thanks,
Qu



   Hugo.


Thanks,
Qu




The problem is not found because normal mirror based profiles aren't
affected by the race, since they are independent with each other.


True.


Although this time the fix doesn't affect the scrub code much, it should
warn us that current scrub code is really hard to maintain.


This last sentence is true.  I found and fixed three BUG_ONs in RAID5
code on the first day I started testing in degraded mode, then hit
the scrub code and had to give up.  It was like a brick wall made out
of 

Re: [PATCH 0/2] RAID5/6 scrub race fix

2016-11-17 Thread Hugo Mills
On Fri, Nov 18, 2016 at 09:19:11AM +0800, Qu Wenruo wrote:
> 
> 
> At 11/18/2016 07:13 AM, Zygo Blaxell wrote:
> >On Tue, Nov 15, 2016 at 10:50:22AM +0800, Qu Wenruo wrote:
> >>Fix the so-called famous RAID5/6 scrub error.
> >>
> >>Thanks Goffredo Baroncelli for reporting the bug, and make it into our
> >>sight.
> >>(Yes, without the Phoronix report on this,
> >>https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad,
> >>I won't ever be aware of it)
> >
> >If you're hearing about btrfs RAID5 bugs for the first time through
> >Phoronix, then your testing coverage is *clearly* inadequate.
> 
> I'm not fixing everything, I'm just focusing on the exact one bug
> reported by Goffredo Baroncelli.
> 
> Although it seems that, the bug reported by him is in fact two bugs.
> One is race condition I'm fixing, another one is that recovery is
> recovering data correctly, but screwing up parity.
> 
> I just don't understand why you always want to fix everything in one step.

   Fix the important, fundamental things first, and the others
later. This, from my understanding of Zygo's comments, appears to be
one of the others.

   It's papering over the missing bricks in the wall instead of
chipping out the mortar and putting new bricks in. It may need to be
fixed, but it's not the fundamental "OMG, everything's totally broken"
problem. If anything, it's only a serious problem *because* the other
thing (write hole) is still there.

   It just seems like a piece of mis-prioritised effort.

> >Fill up a RAID5 array, start a FS stress test, pull a drive out while
> >that's running, let the FS stress test run for another hour, then try
> >to replace or delete the missing device.  If there are any crashes,
> >corruptions, or EIO during any part of this process (assuming all the
> >remaining disks are healthy), then btrfs RAID5 is still broken, and
> >you've found another bug to fix.
> 
> Then it will be another bug to fix, not the bug I'm fixing.
> Unless you can prove whatever the bug is, is relating to the fix, I
> don't see any help then.
> 
> >
> >The fact that so many problems in btrfs can still be found this way
> >indicates to me that nobody is doing this basic level of testing
> >(or if they are, they're not doing anything about the results).
> >
> >>Unlike many of us(including myself) assumed, it's not a timed bomb buried
> >>deeply into the RAID5/6 code, but a race condition in scrub recovery
> >>code.
> >
> >I don't see how this patch fixes the write hole issue at the core of
> >btrfs RAID56.  It just makes the thin layer of bugs over that issue a
> >little thinner.  There's still the metadata RMW update timebomb at the
> >bottom of the bug pile that can't be fixed by scrub (the filesystem is
> >unrecoverably damaged when the bomb goes off, so scrub isn't possible).
> 
> Not "the core of btrfs RAID56", but "the core of all RAID56".
> 
> And, this is just another unrelated problem, I didn't even want to
> repeat what I've written.

   That's the one that needs fixing *first*, though. (IMO, YMMV,
Contents may have settled in transit).

   Hugo.

> Thanks,
> Qu
> 
> >
> >>The problem is not found because normal mirror based profiles aren't
> >>affected by the race, since they are independent with each other.
> >
> >True.
> >
> >>Although this time the fix doesn't affect the scrub code much, it should
> >>warn us that current scrub code is really hard to maintain.
> >
> >This last sentence is true.  I found and fixed three BUG_ONs in RAID5
> >code on the first day I started testing in degraded mode, then hit
> >the scrub code and had to give up.  It was like a brick wall made out
> >of mismatched assumptions and layering inversions, using uninitialized
> >kernel data as mortar (though I suppose the "uninitialized" data symptom
> >might just have been an unprotected memory access).
> >
> >>Abuse of workquque to delay works and the full fs scrub is race prone.
> >>
> >>Xfstest will follow a little later, as we don't have good enough tools
> >>to corrupt data stripes pinpointly.
> >>
> >>Qu Wenruo (2):
> >>  btrfs: scrub: Introduce full stripe lock for RAID56
> >>  btrfs: scrub: Fix RAID56 recovery race condition
> >>
> >> fs/btrfs/ctree.h   |   4 ++
> >> fs/btrfs/extent-tree.c |   3 +
> >> fs/btrfs/scrub.c   | 192 +
> >> 3 files changed, 199 insertions(+)
> >>

-- 
Hugo Mills | UDP jokes: It's OK if no-one gets them.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: [PATCH 0/2] RAID5/6 scrub race fix

2016-11-17 Thread Qu Wenruo



At 11/18/2016 07:13 AM, Zygo Blaxell wrote:

On Tue, Nov 15, 2016 at 10:50:22AM +0800, Qu Wenruo wrote:

Fix the so-called famous RAID5/6 scrub error.

Thanks Goffredo Baroncelli for reporting the bug, and make it into our
sight.
(Yes, without the Phoronix report on this,
https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad,
I won't ever be aware of it)


If you're hearing about btrfs RAID5 bugs for the first time through
Phoronix, then your testing coverage is *clearly* inadequate.


I'm not fixing everything; I'm just focusing on the one exact bug 
reported by Goffredo Baroncelli.


Although it seems that the bug reported by him is in fact two bugs.
One is the race condition I'm fixing; the other is that recovery is 
recovering the data correctly but screwing up the parity.


I just don't understand why you always want to fix everything in one step.



Fill up a RAID5 array, start a FS stress test, pull a drive out while
that's running, let the FS stress test run for another hour, then try
to replace or delete the missing device.  If there are any crashes,
corruptions, or EIO during any part of this process (assuming all the
remaining disks are healthy), then btrfs RAID5 is still broken, and
you've found another bug to fix.


Then it will be another bug to fix, not the bug I'm fixing.
Unless you can prove that whatever bug you find is related to this fix, I 
don't see how that helps here.




The fact that so many problems in btrfs can still be found this way
indicates to me that nobody is doing this basic level of testing
(or if they are, they're not doing anything about the results).


Unlike many of us(including myself) assumed, it's not a timed bomb buried
deeply into the RAID5/6 code, but a race condition in scrub recovery
code.


I don't see how this patch fixes the write hole issue at the core of
btrfs RAID56.  It just makes the thin layer of bugs over that issue a
little thinner.  There's still the metadata RMW update timebomb at the
bottom of the bug pile that can't be fixed by scrub (the filesystem is
unrecoverably damaged when the bomb goes off, so scrub isn't possible).


Not "the core of btrfs RAID56", but "the core of all RAID56".

And this is just another, unrelated problem; I didn't even want to 
repeat what I've already written.


Thanks,
Qu




The problem is not found because normal mirror based profiles aren't
affected by the race, since they are independent with each other.


True.


Although this time the fix doesn't affect the scrub code much, it should
warn us that current scrub code is really hard to maintain.


This last sentence is true.  I found and fixed three BUG_ONs in RAID5
code on the first day I started testing in degraded mode, then hit
the scrub code and had to give up.  It was like a brick wall made out
of mismatched assumptions and layering inversions, using uninitialized
kernel data as mortar (though I suppose the "uninitialized" data symptom
might just have been an unprotected memory access).


Abuse of workquque to delay works and the full fs scrub is race prone.

Xfstest will follow a little later, as we don't have good enough tools
to corrupt data stripes pinpointly.

Qu Wenruo (2):
  btrfs: scrub: Introduce full stripe lock for RAID56
  btrfs: scrub: Fix RAID56 recovery race condition

 fs/btrfs/ctree.h   |   4 ++
 fs/btrfs/extent-tree.c |   3 +
 fs/btrfs/scrub.c   | 192 +
 3 files changed, 199 insertions(+)

--
2.10.2





Re: [PATCH 0/2] RAID5/6 scrub race fix

2016-11-17 Thread Qu Wenruo



At 11/18/2016 02:09 AM, Goffredo Baroncelli wrote:

Hi Qu,

On 2016-11-15 03:50, Qu Wenruo wrote:

Fix the so-called famous RAID5/6 scrub error.

Thanks Goffredo Baroncelli for reporting the bug, and make it into our
sight.
(Yes, without the Phoronix report on this,
https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad,
I won't ever be aware of it)

Unlike many of us(including myself) assumed, it's not a timed bomb buried
deeply into the RAID5/6 code, but a race condition in scrub recovery
code.

The problem is not found because normal mirror based profiles aren't
affected by the race, since they are independent with each other.


Unfortunately, even with these patches btrfs still fails to rebuild a raid5 
array in my tests.
To perform the tests I create a filesystem raid5-three disks; then I created a 
file

# python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" >out.txt

The filesystem layout is composed by stripes large 128k of data, and 64k of 
parity.

The idea is that the first third of the stripe (64k on disk0) is composed by 
"adaaa...";
the second third (again 64k on disk1 is composed by "bd"; and finally the 
parity (disk0^disk1) is stored on the last portion of the stripe.

Doing some calculation it is possible to know where the data is physically 
stored.

I perform 3 tests:
1) I corrupt some bytes of the data stored on disk0; mount the filesystem; run 
scrub; unmount the filesystem; check all the disks if the bytes of the stripe 
are corrected
2) I corrupt some bytes of the data stored on disk1; mount the filesystem; run 
scrub; unmount the filesystem; check all the disks the bytes of the stripe are 
corrected
3) I corrupt some bytes of the parity stored on disk2; mount the filesystem; 
run scrub; unmount the filesystem; check all the disks the bytes of the stripe 
are corrected

Before your patches, my all tests fail (not all the times, but very often).
After your patches, test1 and test2 still fail. In my test test3 succeded.


Thanks for your report and script.

I compared my script with yours, and found that the difference is that I'm 
corrupting the whole data stripe (64K, and without the space cache, to make 
sure all the data is in the 1st full stripe), while you only corrupt 5 
bytes of it.


And in that case, btrfs won't report more errors than the expected 1, 
which means the race fix is working.


But the new btrfs-progs scrub will report that the recovered data stripes 
are all OK while P/Q is not correct.


This seems to be a new bug, unrelated to the RAID5/6 scrub race (whose 
main symptoms are an incorrect csum error count and sometimes even an 
unrecoverable error).



I assume it will be easier to fix than the race, but I could be totally 
wrong until I dig into it further.


Anyway I'll fix it in a new patchset.

Thanks,
Qu



Enclosed my script which performs the tests (it uses loop device; please run in 
a VM; firts you have to uncomment the function make_imgs to create the disk 
image.).

Let me know if I can help you providing more information.

BR
G.Baroncelli




Although this time the fix doesn't affect the scrub code much, it should
warn us that current scrub code is really hard to maintain.

Abuse of workquque to delay works and the full fs scrub is race prone.

Xfstest will follow a little later, as we don't have good enough tools
to corrupt data stripes pinpointly.

Qu Wenruo (2):
  btrfs: scrub: Introduce full stripe lock for RAID56
  btrfs: scrub: Fix RAID56 recovery race condition

 fs/btrfs/ctree.h   |   4 ++
 fs/btrfs/extent-tree.c |   3 +
 fs/btrfs/scrub.c   | 192 +
 3 files changed, 199 insertions(+)

raid56_recover.sh
Description: application/shellscript


Re: [PATCH 0/2] RAID5/6 scrub race fix

2016-11-17 Thread Zygo Blaxell
On Tue, Nov 15, 2016 at 10:50:22AM +0800, Qu Wenruo wrote:
> Fix the so-called famous RAID5/6 scrub error.
> 
> Thanks Goffredo Baroncelli for reporting the bug, and make it into our
> sight.
> (Yes, without the Phoronix report on this,
> https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad,
> I won't ever be aware of it)

If you're hearing about btrfs RAID5 bugs for the first time through
Phoronix, then your testing coverage is *clearly* inadequate.

Fill up a RAID5 array, start a FS stress test, pull a drive out while
that's running, let the FS stress test run for another hour, then try
to replace or delete the missing device.  If there are any crashes,
corruptions, or EIO during any part of this process (assuming all the
remaining disks are healthy), then btrfs RAID5 is still broken, and
you've found another bug to fix.
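Concretely, something like this (a sketch only; device names, the mount
point and the fsstress options are just examples, and fsstress is the one
from xfstests):

mkfs.btrfs -f -d raid5 -m raid5 /dev/sdb /dev/sdc /dev/sdd
mount /dev/sdb /mnt
mkdir -p /mnt/stress
fsstress -d /mnt/stress -n 1000000 -p 8 &   # keep the filesystem busy
sleep 300
echo 1 > /sys/block/sdd/device/delete       # simulate pulling a SCSI disk
sleep 3600                                  # let the stress run degraded for an hour
kill %1; wait
btrfs replace start -B 3 /dev/sde /mnt      # assuming the missing disk was devid 3
# or: btrfs device delete missing /mnt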

The fact that so many problems in btrfs can still be found this way
indicates to me that nobody is doing this basic level of testing
(or if they are, they're not doing anything about the results).

> Unlike many of us(including myself) assumed, it's not a timed bomb buried
> deeply into the RAID5/6 code, but a race condition in scrub recovery
> code.

I don't see how this patch fixes the write hole issue at the core of
btrfs RAID56.  It just makes the thin layer of bugs over that issue a
little thinner.  There's still the metadata RMW update timebomb at the
bottom of the bug pile that can't be fixed by scrub (the filesystem is
unrecoverably damaged when the bomb goes off, so scrub isn't possible).

> The problem is not found because normal mirror based profiles aren't
> affected by the race, since they are independent with each other.

True.

> Although this time the fix doesn't affect the scrub code much, it should
> warn us that current scrub code is really hard to maintain.

This last sentence is true.  I found and fixed three BUG_ONs in RAID5
code on the first day I started testing in degraded mode, then hit
the scrub code and had to give up.  It was like a brick wall made out
of mismatched assumptions and layering inversions, using uninitialized
kernel data as mortar (though I suppose the "uninitialized" data symptom
might just have been an unprotected memory access).

> Abuse of workquque to delay works and the full fs scrub is race prone.
> 
> Xfstest will follow a little later, as we don't have good enough tools
> to corrupt data stripes pinpointly.
> 
> Qu Wenruo (2):
>   btrfs: scrub: Introduce full stripe lock for RAID56
>   btrfs: scrub: Fix RAID56 recovery race condition
> 
>  fs/btrfs/ctree.h   |   4 ++
>  fs/btrfs/extent-tree.c |   3 +
>  fs/btrfs/scrub.c   | 192 +
>  3 files changed, 199 insertions(+)
> 
> -- 
> 2.10.2
> 
> 
> 




Re: [PATCH 0/2] RAID5/6 scrub race fix

2016-11-17 Thread Goffredo Baroncelli
Hi Qu,

On 2016-11-15 03:50, Qu Wenruo wrote:
> Fix the so-called famous RAID5/6 scrub error.
> 
> Thanks Goffredo Baroncelli for reporting the bug, and make it into our
> sight.
> (Yes, without the Phoronix report on this,
> https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad,
> I won't ever be aware of it)
> 
> Unlike many of us(including myself) assumed, it's not a timed bomb buried
> deeply into the RAID5/6 code, but a race condition in scrub recovery
> code.
> 
> The problem is not found because normal mirror based profiles aren't
> affected by the race, since they are independent with each other.

Unfortunately, even with these patches btrfs still fails to rebuild a RAID5 
array in my tests.
To perform the tests I created a three-disk RAID5 filesystem; then I created a 
file:

# python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" >out.txt

The filesystem layout is composed of stripes of 128k of data plus 64k of 
parity.

The idea is that the first third of the stripe (64k on disk0) is composed of 
"adaaa...";
the second third (again 64k, on disk1) is composed of "bdbbb..."; and finally the 
parity (disk0^disk1) is stored on the last portion of the stripe.

Doing some calculation it is possible to know where the data is physically 
stored.
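For example (only a sketch; note that for btrfs the "physical" offset printed 
by filefrag is actually the btrfs logical address, which btrfs-map-logical 
can then map to a device and a physical offset):

# logical address of the file's first (and only) extent
filefrag -v /mnt/out.txt
# map that logical address to the member devices and their physical offsets
btrfs-map-logical -l <logical_from_filefrag> /dev/loop0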

I perform 3 tests (test 1 is sketched below):
1) I corrupt some bytes of the data stored on disk0; mount the filesystem; run 
scrub; unmount the filesystem; check on all the disks whether the bytes of the 
stripe were corrected
2) I corrupt some bytes of the data stored on disk1; mount the filesystem; run 
scrub; unmount the filesystem; check on all the disks whether the bytes of the 
stripe were corrected
3) I corrupt some bytes of the parity stored on disk2; mount the filesystem; 
run scrub; unmount the filesystem; check on all the disks whether the bytes of 
the stripe were corrected
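
In shell terms, test 1 is more or less the following (a sketch only; 
$OFF_DATA0 stands for the physical offset of the data strip on disk0, 
computed as above):

umount /mnt
# corrupt a few bytes of the data strip on disk0
dd if=/dev/urandom of=/dev/loop0 bs=1 count=5 seek=$OFF_DATA0 conv=notrunc
mount /dev/loop0 /mnt
btrfs scrub start -B /mnt    # scrub should detect and repair the corruption
umount /mnt
# then check all three disks: both data strips must match the ad.../bd...
# pattern again, and the parity strip must equal their byte-wise XOR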

Before your patches, all my tests failed (not every time, but very often).
After your patches, test 1 and test 2 still fail. In my tests, test 3 succeeded.

Enclosed is my script which performs the tests (it uses loop devices; please run it 
in a VM; first you have to uncomment the function make_imgs to create the disk 
images).

Let me know if I can help you by providing more information.

BR
G.Baroncelli


> 
> Although this time the fix doesn't affect the scrub code much, it should
> warn us that current scrub code is really hard to maintain.
> 
> Abuse of workquque to delay works and the full fs scrub is race prone.
> 
> Xfstest will follow a little later, as we don't have good enough tools
> to corrupt data stripes pinpointly.
> 
> Qu Wenruo (2):
>   btrfs: scrub: Introduce full stripe lock for RAID56
>   btrfs: scrub: Fix RAID56 recovery race condition
> 
>  fs/btrfs/ctree.h   |   4 ++
>  fs/btrfs/extent-tree.c |   3 +
>  fs/btrfs/scrub.c   | 192 +
>  3 files changed, 199 insertions(+)
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


test-btrfs.sh
Description: Bourne shell script