Re: RAID system with adaption to changed number of disks

2016-10-14 Thread Zygo Blaxell
On Fri, Oct 14, 2016 at 04:30:42PM -0600, Chris Murphy wrote:
> Also, is there RMW with raid0, or even raid10? 

No.  Mirroring is writing the same data in two isolated places.  Striping
is writing data at different isolated places.  No matter which sectors
you write through these layers, it does not affect the correctness of
data in any sector at a different logical address.  None of these use
RMW--you read or write only complete sectors and act only on the specific
sectors requested.  Only parity RAID does RMW.

e.g. in RAID0, when you modify block 47, you may actually modify block
93 on a different disk, but there's always a 1:1 mapping between every
logical and physical address.  If there is a crash we go back to an
earlier tree that does not contain block 47/93 so we don't care if the
write was interrupted.

e.g. in RAID1, when you modify block 47, you modify physical block 47 on
two separate disks.  The state of disk1-block47 may be different from
the state of disk2-block47 if the write is interrupted.  If there is a
crash we go back to an earlier tree that does not contain either copy
of block 47 so we don't care about any inconsistency there.

So raid0, single, dup, raid1, and raid10 are OK--they fall into one or
both of the above cases.  CoW works there.  None of these properties
change in degraded mode with the mirroring profiles.

Parity RAID is writing data in non-isolated places.  When you write to
some sectors, additional sectors are implicitly modified in degraded mode
(whether you are in degraded mode at the time of the writes or not).
This is different from the other cases because the other cases never
modify any sectors that were not explicitly requested by the upper layer.
This is OK if and only if the CoW layer is aware of this behavior and
works around it.
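
To make "implicitly modified" concrete, here is a rough sketch in Python
(not btrfs code; single-byte "blocks" and plain XOR parity).  Watch what
a missing member reconstructs to while a neighbouring block has been
rewritten but its parity has not:

    def xor(blocks):
        p = 0
        for b in blocks:
            p ^= b
        return p

    data = [0x11, 0x22, 0x33, 0x44]       # four data blocks in one stripe
    stripe = data + [xor(data)]           # parity appended as the 5th member

    def reconstruct(missing):
        # XOR of everything except the missing member
        return xor(b for i, b in enumerate(stripe) if i != missing)

    print(hex(reconstruct(1)))            # 0x22 -- the lost member is recoverable
    stripe[3] = 0x99                      # RMW lands the new data block first...
    print(hex(reconstruct(1)))            # 0xff -- stale parity, recovery is garbage

The upper layer only asked to write one block, but until the matching
parity write lands too, every other block in the stripe is effectively
modified as far as degraded-mode reconstruction is concerned.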

> Or is that always CoW
> for metadata and data, just like single and dup? 

It's always CoW at the higher levels, even for parity RAID.  The problem
is that the CoW layer is not aware of the RMW behavior buried in the
parity RAID layer, so the combination doesn't work properly.

CoW thinks it's modifying only block 47, when in fact it's modifying
an entire stripe in degraded mode.  Let's assume 5-disk RAID5 with a
strip size of one block for this example, and say blocks 45-48 are one
RAID stripe.  If there is a crash, data in blocks 45, 46, 47, and 48
may be irretrievably damaged by inconsistent modification of parity and
data blocks.  When we try to go back to an earlier tree that does not
contain block 47, we will end up with a tree that contains corruption in
one of the blocks 45, 46, or 48.  This corruption will only be visible
when something else goes wrong (parity mismatch, data csum failure,
disk failure, or scrub) so a damaged filesystem that isn't degraded
could appear to be healthy for a long time.

If the CoW layer is aware of this, it can arrange operations such
that no stripe is modified while it is referenced by a committed tree.
Suppose the stripe at blocks 49-52 is empty, so we write our CoW block at
block 49 instead of 47.  Since blocks 50-52 contain no data we care about,
we don't even have to bother reading them (just fill the other blocks
with zero or find some other data to write in the same commit), and we
can eliminate many slow RMW operations entirely*.  If there is a crash
we just fall back to an earlier tree that does not contain block 49.
This tree is not damaged because we left blocks 45-48 alone.
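
A sketch of that allocation rule (illustrative only, not the actual btrfs
allocator): pack new CoW blocks into stripes that hold no committed data,
pad the unused slots in the same commit, and compute parity from blocks
already in memory, so nothing is read and no committed stripe is touched:

    STRIPE_DATA_BLOCKS = 4                # 5-disk RAID5, strip size one block

    def cow_stripe_writes(empty_stripes, new_blocks):
        """Return (stripe_no, data_blocks, parity) tuples; pad, never RMW."""
        writes = []
        for i in range(0, len(new_blocks), STRIPE_DATA_BLOCKS):
            chunk = new_blocks[i:i + STRIPE_DATA_BLOCKS]
            chunk += [0] * (STRIPE_DATA_BLOCKS - len(chunk))  # fill blocks 50-52
            parity = 0
            for b in chunk:
                parity ^= b
            writes.append((empty_stripes.pop(0), chunk, parity))
        return writes

    # One CoW block, destined for the empty stripe at blocks 49-52:
    print(cow_stripe_writes([49], [0xAB]))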

One way to tell this is done right is that all data in each RAID stripe will
always belong to exactly zero or one transaction, not dozens of different
transactions as stripes do now.

The other way to fix things is to make stripe RMW atomic so that CoW
works properly.  You can tell this is done right if you can find a stripe
update journal in the disk format or the code.
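
For comparison, a bare-bones sketch of the journal approach (invented
names, behaving roughly like mdadm's raid5 journal): the new stripe
contents are made durable in a log before the in-place overwrite, so a
crash in the middle of the overwrite can always be replayed:

    journal = []                      # stands in for a dedicated log device

    def journaled_stripe_update(stripes, stripe_no, new_data):
        parity = 0
        for b in new_data:
            parity ^= b
        record = (stripe_no, list(new_data), parity)
        journal.append(record)        # 1. log and flush: now crash-safe
        stripes[stripe_no] = (list(new_data), parity)   # 2. overwrite in place
        journal.remove(record)        # 3. retire the log record

    def replay(stripes):
        # On mount after a crash, re-apply anything still in the journal, so a
        # torn in-place write is never the only copy of the stripe.
        for stripe_no, data, parity in journal:
            stripes[stripe_no] = (data, parity)

    stripes = {}
    journaled_stripe_update(stripes, 7, [0x10, 0x20, 0x30, 0x40])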

> If raid0 is always
> CoW, then I don't think it's correct to consider raid5 minus parity to
> be anything like raid0 - in a Btrfs context anyway. Outside of that
> context, I understand the argument.
> 
> 
> 
> -- 
> Chris Murphy

[*] We'd still need parity RAID RMW for nodatacow and PREALLOC because
neither uses the CoW layer.  That doesn't matter for nodatacow because
nodatacow is how users tell us they don't want to read their data any
more, but it has interesting implications for PREALLOC.  Maybe a solution
for PREALLOC is to do the first write strictly in RAID-stripe-sized units?
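
A guess at what "strictly in RAID-stripe-sized units" could look like
(an interpretation of the idea above, not existing btrfs behaviour):
round the first write out to stripe boundaries and zero-fill the rest,
so the whole stripe is written once and never RMW'd afterwards:

    STRIPE_BYTES = 256 * 1024           # e.g. 4 data disks x 64K strips

    def stripe_aligned_range(offset, length):
        """Expand a prealloc write to whole stripes; extra bytes are zeroed."""
        start = (offset // STRIPE_BYTES) * STRIPE_BYTES
        end = -(-(offset + length) // STRIPE_BYTES) * STRIPE_BYTES  # round up
        return start, end - start

    print(stripe_aligned_range(300 * 1024, 8 * 1024))   # (262144, 262144)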




Re: RAID system with adaption to changed number of disks

2016-10-14 Thread Chris Murphy
On Fri, Oct 14, 2016 at 3:38 PM, Chris Murphy  wrote:
> On Fri, Oct 14, 2016 at 1:55 PM, Zygo Blaxell
>  wrote:
>
>>
>>> And how common is RMW for metadata operations?
>>
>> RMW in metadata is the norm.  It happens on nearly all commits--the only
>> exception seems to be when both ends of a commit write happen to land
>> on stripe boundaries accidentally, which is less than 1% of the time on
>> 3 disks.
>
> In the interest of due diligence, and the fact I can't confirm or deny
> this myself from reading the code (although I do see many comments
> involving RMW in the code), I must ask Qu if he can corroborate this.
>
> Because that basically means btrfs raid56 is not better than md raid56 - by
> design. It has nothing to do with bugs. This is substantially worse
> than the scrub->wrong parity bug.
>
> Does it make sense to proscribe raid5 profile for metadata? As in,
> disallow -m raid5 at mkfs time? Maybe recommend raid1. Even raid6
> seems like it could be specious - yes there are two copies but if
> there is constant RMW, then there's no CoW and we're not really
> protected that well with all of these overwrites, statistically
> speaking.
>
> Basically you have to have a setup where there's no chance of torn or
> misdirected writes, and no corruptions, in which case Btrfs checksums
> aren't really helpful, you're using it for other reasons (snapshots
> and what not).
>
> Really seriously the CoW part of Btrfs being violated by all of this
> RMW to me sounds like it reduces the pros of Btrfs.


Also, is there RMW with raid0, or even raid10? Or is that always CoW
for metadata and data, just like single and dup? If raid0 is always
CoW, then I don't think it's correct to consider raid5 minus parity to
be anything like raid0 - in a Btrfs context anyway. Outside of that
context, I understand the argument.



-- 
Chris Murphy


Re: RAID system with adaption to changed number of disks

2016-10-14 Thread Chris Murphy
On Fri, Oct 14, 2016 at 1:55 PM, Zygo Blaxell
 wrote:

>
>> And how common is RMW for metadata operations?
>
> RMW in metadata is the norm.  It happens on nearly all commits--the only
> exception seems to be when both ends of a commit write happen to land
> on stripe boundaries accidentally, which is less than 1% of the time on
> 3 disks.

In the interest of due diligence, and the fact I can't confirm or deny
this myself from reading the code (although I do see many comments
involving RMW in the code), I must ask Qu if he can corroborate this.

Because that basically means btrfs raid56 is not better than md raid56 - by
design. It has nothing to do with bugs. This is substantially worse
than the scrub->wrong parity bug.

Does it make sense to proscribe raid5 profile for metadata? As in,
disallow -m raid5 at mkfs time? Maybe recommend raid1. Even raid6
seems like it could be specious - yes there are two copies but if
there is constant RMW, then there's no CoW and we're not really
protected that well with all of these overwrites, statistically
speaking.

Basically you have to have a setup where there's no chance of torn or
misdirected writes, and no corruptions, in which case Btrfs checksums
aren't really helpful, you're using it for other reasons (snapshots
and what not).

Really seriously the CoW part of Btrfs being violated by all of this
RMW to me sounds like it reduces the pros of Btrfs.



-- 
Chris Murphy


Re: RAID system with adaption to changed number of disks

2016-10-14 Thread Duncan
Zygo Blaxell posted on Fri, 14 Oct 2016 15:55:45 -0400 as excerpted:

> The current btrfs raid5 implementation is a thin layer of bugs on top of
> code that is still missing critical pieces.  There is no mechanism to
> prevent RMW-related failures combined with zero tolerance for
> RMW-related failures in metadata, so I expect a btrfs filesystem using
> raid5 metadata to be extremely fragile.  Failure is not likely--it's
> *inevitable*.

Wow, that's a signature-quality quote reflecting just how dire the 
situation with btrfs parity-raid is ATM.  First sentence for a short sig, 
full paragraph for a longer one.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: RAID system with adaption to changed number of disks

2016-10-14 Thread Zygo Blaxell
On Fri, Oct 14, 2016 at 01:16:05AM -0600, Chris Murphy wrote:
> OK so we know for raid5 data block groups there can be RMW. And
> because of that, any interruption results in the write hole. On Btrfs
> though, the write hole is on disk only. If there's a lost strip
> (failed drive or UNC read), reconstruction from wrong parity results
> in a checksum error and EIO. That's good.
> 
> However, what happens in the metadata case? If metadata is raid5, and
> there's a crash or power failure during metadata RMW, same problem,
> wrong parity, bad reconstruction, csum mismatch, and EIO. So what's
> the effect of EIO when reading metadata? 

The effect is you can't access the page or anything referenced by
the page.  If the page happens to be a root or interior node of
something important, large parts of the filesystem are inaccessible,
or the filesystem is not mountable at all.  RAID device management and
balance operations don't work because they abort as soon as they find
the first unreadable metadata page.

In theory it's still possible to rebuild parts of the filesystem offline
using backrefs or brute-force search.  Using an old root might work too,
but in bad cases the newest viable root could be thousands of generations
old (i.e. it's more likely that no viable root exists at all).

> And how common is RMW for metadata operations?

RMW in metadata is the norm.  It happens on nearly all commits--the only
exception seems to be when both ends of a commit write happen to land
on stripe boundaries accidentally, which is less than 1% of the time on
3 disks.

> I wonder whether this is where all of these damn strange cases come from,
> where people can't do anything at all with a normally degraded raid5 - one
> device failed, and no other failures, but they can't mount due to a bunch
> of csum errors.

I'm *astonished* to hear about real-world successes with raid5 metadata.
The total-loss failure reports are the result I expect.

The current btrfs raid5 implementation is a thin layer of bugs on top
of code that is still missing critical pieces.  There is no mechanism to
prevent RMW-related failures combined with zero tolerance for RMW-related
failures in metadata, so I expect a btrfs filesystem using raid5 metadata
to be extremely fragile.  Failure is not likely--it's *inevitable*.

The non-RMW-aware allocator almost maximizes the risk of RMW data loss.
Every transaction commit contains multiple tree root pages, which
are the most critical metadata that could be lost due to RMW failure.
There is a window at least a few milliseconds wide, and potentially
several seconds wide, where some data on disk is in an unrecoverable
state due to RMW.  This happens twice a minute with the default commit
interval and 99% of commits are affected.  That's a million opportunities
per machine-year to lose metadata.  If a crash lands on one of those,
boom, no more filesystem.
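
The "million opportunities" figure is just the commit rate multiplied
out (rough arithmetic, using the two-commits-a-minute default mentioned
above):

    commits_per_year = 2 * 60 * 24 * 365        # two commits a minute, all year
    rmw_commits = int(commits_per_year * 0.99)  # ~99% of commits involve RMW
    print(commits_per_year, rmw_commits)        # 1051200 1040688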

I expect one random crash (i.e. a crash that is not strongly correlated
to RMW activity) out of 30-2000 (depending on filesystem size, workload,
rotation speed, btrfs mount parameters) will destroy a filesystem under
typical conditions.  Real world crashes tend not to be random (i.e. they
are strongly correlated to RMW activity), so filesystem loss will be
much more frequent in practice.


> 
> Chris Murphy




Re: RAID system with adaption to changed number of disks

2016-10-14 Thread Chris Murphy
OK so we know for raid5 data block groups there can be RMW. And
because of that, any interruption results in the write hole. On Btrfs
though, the write hole is on disk only. If there's a lost strip
(failed drive or UNC read), reconstruction from wrong parity results
in a checksum error and EIO. That's good.

However, what happens in the metadata case? If metadata is raid5, and
there's a crash or power failure during metadata RMW, same problem,
wrong parity, bad reconstruction, csum mismatch, and EIO. So what's
the effect of EIO when reading metadata? And how common is RMW for
metadata operations?

I wonder whether this is where all of these damn strange cases come from,
where people can't do anything at all with a normally degraded raid5 - one
device failed, and no other failures, but they can't mount due to a bunch
of csum errors.


Chris Murphy


Re: RAID system with adaption to changed number of disks

2016-10-13 Thread Qu Wenruo



At 10/14/2016 05:03 AM, Zygo Blaxell wrote:

On Thu, Oct 13, 2016 at 08:35:02AM +0800, Qu Wenruo wrote:

At 10/13/2016 01:19 AM, Zygo Blaxell wrote:

On Wed, Oct 12, 2016 at 01:48:58PM +0800, Qu Wenruo wrote:

True, but if we ignore parity, we'd find that, RAID5 is just RAID0.


Degraded RAID5 is not RAID0.  RAID5 has strict constraints that RAID0
does not.  The way a RAID5 implementation behaves in degraded mode is
the thing that usually matters after a disk fails.


COW ensures (cowed) data and metadata are all safe and checksum will ensure
they are OK, so even for RAID0, it's not a problem for cases like power loss.


This is not true.  btrfs does not use stripes correctly to get CoW to
work on RAID5/6.  This is why power failures result in small amounts of
data loss, if not filesystem-destroying disaster.


See my below comments.

And, I already said, forget parity.
In that case, RAID5 without parity is just RAID0 with device rotation.


This is only true in one direction.

If you start with RAID0, add parity, and rotate the blocks on the devices,
you get RAID5.  Each individual non-parity block is independent of every
other block on every other disk.

If you start with RAID5 and remove one device, the result is *not* RAID0.
Each individual block is now entangled with N other blocks on all the
other disks.

On RAID0 there's no parity.  On RAID5 with no failed devices parity is
irrelevant.  On RAID5 with a failed device, parity touches *all* data.


I understand all this.

But the point is, RAID5 should never reconstruct wrong/corrupted data or
parity.

It should either reconstruct a good copy, or recover nothing.

So RAID5 should be:
1) RAID0 if nothing goes wrong (with RMW overhead)
2) RAID0 plus a somewhat higher chance (not always 100%) of recovering one
missing device.




For CoW to work you have to make sure that you never modify a RAID stripe
that already contains committed data.  Let's consider a 5-disk array
and look at what we get when we try to reconstruct disk 2:

Disk1  Disk2  Disk3  Disk4  Disk5
Data1  Data2  Parity Data3  Data4

Suppose one transaction writes Data1-Data4 and Parity.  This is OK
because no metadata reference would point to this stripe before it
was committed to disk.  Here's some data as an example:

Disk1  Disk2  Disk3  Disk4  Disk5  Reconstructed Disk2
               


Why do the d*mn reconstruction without checking csum?


If a disk fails, we need stripe reconstruction to rebuild the data before
we can verify its csum.  There is no disk to read the data from directly.


NOOO! Never recover anything without checking csum.
And that's the problem with the current kernel scrub.

The root cause may be the non-atomic full stripe write, but the silent 
data corruption is what we should avoid.


We can read out all existing data stripes and parity into memory, and
try to recover the missing device.

If the recovered part (or any existing data stripe) mismatches its csum,
then there is nothing we can recover reliably.

If the data is not reliable, there is no point in recovering it.
Wrong data is never better than no data.
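
As a minimal sketch of that policy (hypothetical helper names, not the
kernel scrub code): rebuild the missing block in memory, and refuse to
touch the disk unless every surviving block and the rebuilt block all
pass their csums:

    def try_repair(blocks, parity, missing, csums, csum):
        """Return the rebuilt block, or None if anything fails verification."""
        rebuilt = parity
        for i, b in enumerate(blocks):
            if i == missing:
                continue
            if csum(b) != csums[i]:
                return None            # a surviving block is already bad
            rebuilt ^= b
        if csum(rebuilt) != csums[missing]:
            return None                # parity was stale: do not "repair"
        return rebuilt                 # only now is it safe to write back

    def crc(x):                        # stand-in checksum, just for the sketch
        return x & 0xff

    blocks, good = [0x11, None, 0x33, 0x44], 0x22   # member 1 is the lost disk
    parity = 0x11 ^ good ^ 0x33 ^ 0x44
    print(hex(try_repair(blocks, parity, 1,
                         [crc(0x11), crc(good), crc(0x33), crc(0x44)], crc)))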




My strategy is clear enough.
Never trust parity unless all data and reconstructed data match csum.

Csum is more precious than the unreliable parity.


We always have csum to verify at the end (and if we aren't verifying it at
the end, that's a bug).  It doesn't help the parity to be more reliable.


It's a bug: we didn't verify csum before writing recovered data stripes
to disk, and it even writes wrong data over correct data stripes.

That's what all these RAID5/6 kernel scrub reports are about.




So, please forget csum first, just consider it as RAID0, and add parity back
when all csum matches with each other.


I can't reconstruct parity in the degraded RAID5 write case.  That only
works in the scrub case.


Solve the normal case first, then the more complex case.

If btrfs RAID5/6 scrub can't even handle the normal case, there is no need
to consider the recovery case.




Even with all disks present on RAID5, parity gets corrupted during writes.
The corruption is hidden by the fact that we can ignore parity and use the
data blocks instead, but it is revealed when one of the disks is missing
or has a csum failure.


For the missing-device case, try the recovery in memory, and re-check the
existing data stripes and the recovered stripe against csum.

If anything mismatches, that full stripe is just screwed up.
For the csum-mismatch case, it is the same.

It just lowers the chance of recovering one device from 100% to
something lower, depending on how many screwed-up parities there are.


But we should never recover anything wrong.




(to keep things simpler I'm just using Parity = Data1 ^ Data2 ^ Data4 ^
Data5 here)

Later, a transaction deletes Data3 and Data 4.  Still OK, because
we didn't modify any data in the stripe, so we may still be able to
reconstruct the data from missing disks.  The checksums for Data4 and
Data5 are missing, so if there is any bitrot we lose the whole stripe
(we can't tell whether the data is wrong or parity, we can't ignore the
rotted data because it's included in the parity, and we didn't update
the parity because deleting an extent doesn't modify its data stripe).

Re: RAID system with adaption to changed number of disks

2016-10-13 Thread Zygo Blaxell
On Thu, Oct 13, 2016 at 08:35:02AM +0800, Qu Wenruo wrote:
> At 10/13/2016 01:19 AM, Zygo Blaxell wrote:
> >On Wed, Oct 12, 2016 at 01:48:58PM +0800, Qu Wenruo wrote:
> >>True, but if we ignore parity, we'd find that, RAID5 is just RAID0.
> >
> >Degraded RAID5 is not RAID0.  RAID5 has strict constraints that RAID0
> >does not.  The way a RAID5 implementation behaves in degraded mode is
> >the thing that usually matters after a disk fails.
> >
> >>COW ensures (cowed) data and metadata are all safe and checksum will ensure
> >>they are OK, so even for RAID0, it's not a problem for cases like power loss.
> >
> >This is not true.  btrfs does not use stripes correctly to get CoW to
> >work on RAID5/6.  This is why power failures result in small amounts of
> >data loss, if not filesystem-destroying disaster.
> 
> See my below comments.
> 
> And, I already said, forget parity.
> In that case, RAID5 without parity is just RAID0 with device rotation.

This is only true in one direction.

If you start with RAID0, add parity, and rotate the blocks on the devices,
you get RAID5.  Each individual non-parity block is independent of every
other block on every other disk.

If you start with RAID5 and remove one device, the result is *not* RAID0.
Each individual block is now entangled with N other blocks on all the
other disks.

On RAID0 there's no parity.  On RAID5 with no failed devices parity is
irrelevant.  On RAID5 with a failed device, parity touches *all* data.

> >For CoW to work you have to make sure that you never modify a RAID stripe
> >that already contains committed data.  Let's consider a 5-disk array
> >and look at what we get when we try to reconstruct disk 2:
> >
> > Disk1  Disk2  Disk3  Disk4  Disk5
> > Data1  Data2  Parity Data3  Data4
> >
> >Suppose one transaction writes Data1-Data4 and Parity.  This is OK
> >because no metadata reference would point to this stripe before it
> >was committed to disk.  Here's some data as an example:
> >
> > Disk1  Disk2  Disk3  Disk4  Disk5  Reconstructed Disk2
> >                
> 
> Why do the d*mn reconstruction without checking csum?

If a disk fails, we need stripe reconstruction to rebuild the data before
we can verify its csum.  There is no disk to read the data from directly.

> My strategy is clear enough.
> Never trust parity unless all data and reconstructed data match csum.
> 
> Csum is more precious than the unreliable parity.

We always have csum to verify at the end (and if we aren't verifying it at
the end, that's a bug).  It doesn't help the parity to be more reliable.

> So, please forget csum first, just consider it as RAID0, and add parity back
> when all csum matches with each other.

I can't reconstruct parity in the degraded RAID5 write case.  That only
works in the scrub case.

Even with all disks present on RAID5, parity gets corrupted during writes.
The corruption is hidden by the fact that we can ignore parity and use the
data blocks instead, but it is revealed when one of the disks is missing
or has a csum failure.

> >(to keep things simpler I'm just using Parity = Data1 ^ Data2 ^ Data4 ^
> >Data5 here)
> >
> >Later, a transaction deletes Data3 and Data 4.  Still OK, because
> >we didn't modify any data in the stripe, so we may still be able to
> >reconstruct the data from missing disks.  The checksums for Data4 and
> >Data5 are missing, so if there is any bitrot we lose the whole stripe
> >(we can't tell whether the data is wrong or parity, we can't ignore the
> >rotted data because it's included in the parity, and we didn't update
> >the parity because deleting an extent doesn't modify its data stripe).
> >
> > Disk1  Disk2  Disk3  Disk4  Disk5  Reconstructed Disk2
> >                
> 
> So data stripes are , , , .
> If trans committed, then csum and extents of , is also deleted.
> If trans not committed, , csum and extents exists.
> 
> Any way, if we check data stripes against their csum, they should match.

Let's assume they do.  If any one of the csums are wrong and all the
disks are online, we need correct parity to reconstruct the data blocks
with bad csums.  This imposes a requirement that we keep parity correct!

If any of the csums are wrong *and* a disk is missing, the affected
data blocks are irretrievable because there is no redundant data to
reconstruct them.  Since that case isn't very interesting, let's only
consider what happens with no csum failures anywhere (only disk failures).

> Either way, we know all data stripes match their csum, and that's enough.
> No matter whether parity matches or not, it's just rubbish.
> Re-calculate it using scrub.

When one of the disks is missing, we must reconstruct from parity.
At this point we still can, because the stripe isn't modified when we
delete extents within it.

> >Now a third transaction allocates Data3 and Data 4.  Bad.  First, Disk4
> >is written and existing data is 

Re: RAID system with adaption to changed number of disks

2016-10-12 Thread Adam Borowski
On Wed, Oct 12, 2016 at 05:10:18PM -0400, Zygo Blaxell wrote:
> On Wed, Oct 12, 2016 at 09:55:28PM +0200, Adam Borowski wrote:
> > On Wed, Oct 12, 2016 at 01:19:37PM -0400, Zygo Blaxell wrote:
> > > I had been thinking that we could inject "plug" extents to fill up
> > > RAID5 stripes.
> > Your idea sounds good, but there's one problem: most real users don't
> > balance.  Ever.  Contrary to the tribal wisdom here, this actually works
> > fine, unless you had a pathologic load skewed to either data or metadata on
> > the first write then fill the disk to near-capacity with a load skewed the
> > other way.
> 
> > Most usage patterns produce a mix of transient and persistent data (and at
> > write time you don't know which file is which), meaning that with time every
> > stripe will contain a smidge of cold data plus a fill of plug extents.
> 
> Yes, it'll certainly reduce storage efficiency.  I think all the
> RMW-avoidance strategies have this problem.  The alternative is to risk
> losing data or the entire filesystem on disk failure, so any of the
> RMW-avoidance strategies are probably a worthwhile tradeoff.  Big RAID5/6
> arrays tend to be used mostly for storing large sequentially-accessed
> files which are less susceptible to this kind of problem.
> 
> If the pattern is lots of small random writes then performance on raid5
> will be terrible anyway (though it may even be improved by using plug
> extents, since RMW stripe updates would be replaced with pure CoW).

I've looked at some simple scenarios, and it appears that, with your scheme,
the total amount of I/O would increase, but it would not hinder performance
as increases happen only when the disk would be otherwise idle.  There's
also a latency win and a fragmentation win -- all while fixing the write
hole!

Let's assume leaf size 16KB, stripe size 64KB.  The disk has four stripes,
each 75% full 25% deleted.  '*' marks cold data, '.' deleted/plug space, 'x'
new data.  I'm not drawing entirely empty stripes.
***.
***.
***.
***.
The user wants to write 64KB of data.
RMW needs to read 12 leafs, write 16, no matter if the data comes in one
commit or four.
***x
***x
***x
***x
Latency 28 (big commit)/7 per commit (small commits), total I/O 28.

The plug extents scheme requires compaction (partial balance):



I/O so far 24.
Big commit:




Latency 4, total I/O 28.
If we had to compact on-demand, the latency is 28 (assuming we can do
stripe-sized balance).

Small commits, no concurrent writes:



x...
x...
x...
x...
Latency 1 per commit, I/O so far 28, need another compaction:




Total I/O 32.

Small io, concurrent writes that peg the disk:



xyyy
xyyy
xyyy
xyyy
Total I/O 28 (not counting concurrent writes).


Other scenarios I've analyzed give similar results.

I'm not sure if my thinking is correct, but if it is, the outcome is quite
surprising: no performance loss even though we had to rewrite the stripes!

> > Thus, while the plug extents idea doesn't suffer from problems of big
> > sectors you just mentioned, we'd need some kind of auto-balance.
> 
> Another way to approach the problem is to relocate the blocks in
> partially filled RMW stripes so they can be effectively CoW stripes;
> however, the requirement to do full extent relocations leads to some
> nasty write amplification and performance ramifications.  Balance is
> hugely heavy I/O load and there are good reasons not to incur it at
> unexpected times.

We don't need balance in btrfs sense, it's enough to compact stripes -- ie,
something akin to balance except done at stripe level rather than allocation
block level.

As for write amplification, F2FS guys solved the issue by having two types
of cleaning (balancing):
* on demand (when there is no free space and thus it needs to be done NOW)
* in the background (done only on cold data)

The on-demand clean goes for juiciest targets first (least data/stripe),
background clean on the other hand uses a formula that takes into account
both the amount of space to reclaim and age of the stripe.  If the data is
hot, it shouldn't be cleaned yet -- it's likely to be deleted/modified soon.
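
Roughly, the two victim-selection policies look like this (an LFS-style
sketch with an assumed cost-benefit formula; the exact F2FS heuristics
differ): on-demand cleaning greedily picks the stripe with the least
live data, while background cleaning weighs reclaimable space against
age so hot stripes are left alone:

    def pick_on_demand(stripes):
        # stripes: list of (live_blocks, age); greedy: least live data first
        return min(range(len(stripes)), key=lambda i: stripes[i][0])

    def pick_background(stripes, stripe_blocks=4):
        # Cost-benefit: prefer stripes that are both mostly empty and old (cold)
        def score(i):
            live, age = stripes[i]
            u = live / stripe_blocks
            return (1 - u) * age / (1 + u)
        return max(range(len(stripes)), key=score)

    stripes = [(1, 10), (1, 50000), (3, 90000)]     # (live blocks, age)
    print(pick_on_demand(stripes), pick_background(stripes))   # 0 1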


Meow!
-- 
A MAP07 (Dead Simple) raspberry tincture recipe: 0.5l 95% alcohol, 1kg
raspberries, 0.4kg sugar; put into a big jar for 1 month.  Filter out and
throw away the fruits (can dump them into a cake, etc), let the drink age
at least 3-6 months.


Re: RAID system with adaption to changed number of disks

2016-10-12 Thread Qu Wenruo



At 10/13/2016 01:19 AM, Zygo Blaxell wrote:

On Wed, Oct 12, 2016 at 01:48:58PM +0800, Qu Wenruo wrote:

btrfs also doesn't avoid the raid5 write hole properly.  After a crash,
a btrfs filesystem (like mdadm raid[56]) _must_ be scrubbed (resynced)
to reconstruct any parity that was damaged by an incomplete data stripe
update.
As long as all disks are working, the parity can be reconstructed
from the data disks.  If a disk fails prior to the completion of the
scrub, any data stripes that were written during previous crashes may
be destroyed.  And all that assumes the scrub bugs are fixed first.


This is true.
I didn't take this into account.

But this is not a *single* problem, but 2 problems.
1) Power loss
2) Device crash

Before making things complex, why not focus on a single problem?


Solve one problem at a time--but don't lose sight of the whole list of
problems either, especially when they are interdependent.


Not to mention that the possibility is much smaller than for a single problem.


Having field experience with both problems, I disagree with that.
The power loss/system crash problem is much more common than the device
failure/scrub problems.  More data is lost when a disk fails, but the
amount of data lost in a power failure isn't zero.  Before I gave up
on btrfs raid5, it worked out to about equal amounts of admin time
recovering from the two different failure modes.


If writes occur after a disk fails, they all temporarily corrupt small
amounts of data in the filesystem.  btrfs cannot tolerate any metadata
corruption (it relies on redundant metadata to self-repair), so when a
write to metadata is interrupted, the filesystem is instantly doomed
(damaged beyond the current tools' ability to repair and mount
read-write).


That's why we use a higher duplication level for metadata by default.
And considering metadata size, it's quite acceptable to use RAID1 for
metadata rather than RAID5/6.


Data RAID5 metadata RAID1 makes a limited amount of sense.  Small amounts
of data are still lost on power failures due to RMW on the data stripes.
It just doesn't break the entire filesystem because the metadata is
on RAID1 and RAID1 doesn't use RMW.

Data RAID6 does not make sense, unless we also have a way to have RAID1
make more than one mirror copy.  With one mirror copy an array is not
able to tolerate two disk failures, so the Q stripe for RAID6 is wasted
CPU and space.


Currently the upper layers of the filesystem assume that once data
blocks are written to disk, they are stable.  This is not true in raid5/6
because the parity and data blocks within each stripe cannot be updated
atomically.


True, but if we ignore parity, we'd find that, RAID5 is just RAID0.


Degraded RAID5 is not RAID0.  RAID5 has strict constraints that RAID0
does not.  The way a RAID5 implementation behaves in degraded mode is
the thing that usually matters after a disk fails.


COW ensures (cowed) data and metadata are all safe and checksum will ensure
they are OK, so even for RAID0, it's not a problem for cases like power loss.


This is not true.  btrfs does not use stripes correctly to get CoW to
work on RAID5/6.  This is why power failures result in small amounts of
data loss, if not filesystem-destroying disaster.


See my below comments.

And, I already said, forget parity.
In that case, RAID5 without parity is just RAID0 with device rotation.



For CoW to work you have to make sure that you never modify a RAID stripe
that already contains committed data.  Let's consider a 5-disk array
and look at what we get when we try to reconstruct disk 2:

Disk1  Disk2  Disk3  Disk4  Disk5
Data1  Data2  Parity Data3  Data4

Suppose one transaction writes Data1-Data4 and Parity.  This is OK
because no metadata reference would point to this stripe before it
was committed to disk.  Here's some data as an example:

Disk1  Disk2  Disk3  Disk4  Disk5  Reconstructed Disk2
               


Why do the d*mn reconstruction without checking csum?

My strategy is clear enough.
Never trust parity unless all data and reconstructed data match csum.

Csum is more precious than the unreliable parity.

So, please forget csum first, just consider it as RAID0, and add parity 
back when all csum matches with each other.




(to keep things simpler I'm just using Parity = Data1 ^ Data2 ^ Data4 ^
Data5 here)

Later, a transaction deletes Data3 and Data 4.  Still OK, because
we didn't modify any data in the stripe, so we may still be able to
reconstruct the data from missing disks.  The checksums for Data4 and
Data5 are missing, so if there is any bitrot we lose the whole stripe
(we can't tell whether the data is wrong or parity, we can't ignore the
rotted data because it's included in the parity, and we didn't update
the parity because deleting an extent doesn't modify its data stripe).

Disk1  Disk2  Disk3  Disk4  Disk5  Reconstructed Disk2
               


So 

Re: RAID system with adaption to changed number of disks

2016-10-12 Thread Zygo Blaxell
On Wed, Oct 12, 2016 at 09:55:28PM +0200, Adam Borowski wrote:
> On Wed, Oct 12, 2016 at 01:19:37PM -0400, Zygo Blaxell wrote:
> > On Wed, Oct 12, 2016 at 01:48:58PM +0800, Qu Wenruo wrote:
> > > In fact, the _concept_ to solve such RMW behavior is quite simple:
> > > 
> > > Make sector size equal to stripe length. (Or vice versa if you like)
> > > 
> > > Although the implementation will be more complex, people like Chandan are
> > > already working on sub page size sector size support.
> > 
> > So...metadata blocks would be 256K on the 5-disk RAID5 example above,
> > and any file smaller than 256K would be stored inline?  Ouch.  That would
> > also imply the compressed extent size limit (currently 128K) has to become
> > much larger.
> > 
> > I had been thinking that we could inject "plug" extents to fill up
> > RAID5 stripes.  This lets us keep the 4K block size for allocations,
> > but at commit (or delalloc) time we would fill up any gaps in new RAID
> > stripes to prevent them from being modified.  As the real data is deleted
> > from the RAID stripes, it would be replaced by "plug" extents to keep any
> > new data from being allocated in the stripe.  When the stripe consists
> > entirely of "plug" extents, the plug extent would be deleted, allowing
> > the stripe to be allocated again.  The "plug" data would be zero for
> > the purposes of parity reconstruction, regardless of what's on the disk.
> > Balance would just throw the plug extents away (no need to relocate them).
> 
> Your idea sounds good, but there's one problem: most real users don't
> balance.  Ever.  Contrary to the tribal wisdom here, this actually works
> fine, unless you had a pathologic load skewed to either data or metadata on
> the first write then fill the disk to near-capacity with a load skewed the
> other way.

> Most usage patterns produce a mix of transient and persistent data (and at
> write time you don't know which file is which), meaning that with time every
> stripe will contain a smidge of cold data plus a fill of plug extents.

Yes, it'll certainly reduce storage efficiency.  I think all the
RMW-avoidance strategies have this problem.  The alternative is to risk
losing data or the entire filesystem on disk failure, so any of the
RMW-avoidance strategies are probably a worthwhile tradeoff.  Big RAID5/6
arrays tend to be used mostly for storing large sequentially-accessed
files which are less susceptible to this kind of problem.

If the pattern is lots of small random writes then performance on raid5
will be terrible anyway (though it may even be improved by using plug
extents, since RMW stripe updates would be replaced with pure CoW).
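
A sketch of the bookkeeping the plug-extent scheme quoted above implies
(illustrative structure, not an on-disk format): free slots become plugs
at commit so later transactions can never RMW the stripe, deletes turn
data into plugs, and a stripe made entirely of plugs goes back to the
allocator whole:

    class Stripe:
        def __init__(self, nslots=4):
            self.slots = ["free"] * nslots

        def commit(self, nblocks):
            # Fill with new data, then plug whatever free slots are left over.
            for i, s in enumerate(self.slots):
                if s != "free":
                    continue
                if nblocks > 0:
                    self.slots[i] = "data"
                    nblocks -= 1
                else:
                    self.slots[i] = "plug"

        def delete(self, i):
            # Freed space is plugged, not reused, until the whole stripe is idle.
            self.slots[i] = "plug"
            if all(s == "plug" for s in self.slots):
                self.slots = ["free"] * len(self.slots)

    s = Stripe()
    s.commit(1)          # -> ['data', 'plug', 'plug', 'plug']
    s.delete(0)          # all plugs -> stripe is reusable again
    print(s.slots)       # ['free', 'free', 'free', 'free']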

> Thus, while the plug extents idea doesn't suffer from problems of big
> sectors you just mentioned, we'd need some kind of auto-balance.

Another way to approach the problem is to relocate the blocks in
partially filled RMW stripes so they can be effectively CoW stripes;
however, the requirement to do full extent relocations leads to some
nasty write amplification and performance ramifications.  Balance is
hugely heavy I/O load and there are good reasons not to incur it at
unexpected times.


> 
> -- 
> A MAP07 (Dead Simple) raspberry tincture recipe: 0.5l 95% alcohol, 1kg
> raspberries, 0.4kg sugar; put into a big jar for 1 month.  Filter out and
> throw away the fruits (can dump them into a cake, etc), let the drink age
> at least 3-6 months.
> 




Re: RAID system with adaption to changed number of disks

2016-10-12 Thread Chris Murphy
On Wed, Oct 12, 2016 at 11:19 AM, Zygo Blaxell
 wrote:

> Degraded RAID5 is not RAID0.  RAID5 has strict constraints that RAID0
> does not.  The way a RAID5 implementation behaves in degraded mode is
> the thing that usually matters after a disk fails.

Is there degraded raid5 xfstesting happening? Or are the tests mainly
done non-degraded? In particular, 2x device fail degraded raid6,
because it's so expensive, has potential to expose even more bugs.


> So...metadata blocks would be 256K on the 5-disk RAID5 example above,
> and any file smaller than 256K would be stored inline?  Ouch.  That would
> also imply the compressed extent size limit (currently 128K) has to become
> much larger.

There are patches to set strip size. Does it make sense to specify
4KiB strip size for metadata block groups and 64+KiB for data block
groups?



-- 
Chris Murphy


Re: RAID system with adaption to changed number of disks

2016-10-12 Thread Zygo Blaxell
On Thu, Oct 13, 2016 at 12:33:31AM +0500, Roman Mamedov wrote:
> On Wed, 12 Oct 2016 15:19:16 -0400
> Zygo Blaxell  wrote:
> 
> > I'm not even sure btrfs does this--I haven't checked precisely what
> > it does in dup mode.  It could send both copies of metadata to the
> > disks with a single barrier to separate both metadata updates from
> > the superblock updates.  That would be bad in this particular case.
> 
> It would be bad in any case, including a single physical disk and no RAID, and

No, a single disk does not have these problems.  On a single disk we don't
have to deal with temporarily corrupted metadata _outside_ the areas we
are writing, as the disk will confine damaged data to individual sectors.
On RAID5, data damage is only limited at the stripe level, a unit orders
of magnitude larger than a sector.

> I don't think there's any basis to speculate that mdadm doesn't implement
> write barriers properly.

btrfs and mdadm have to use them properly together.  It's possible to
get it fatally wrong from the btrfs side even if mdadm does everything
perfectly.  Single disks don't have stripe consistency requirements,
so if btrfs has single-disk assumptions about the behavior of writes
then it can do the wrong thing on multi-disk systems.

> > In degraded RAID5/6 mode, all writes temporarily corrupt data, so if there
> > is an interruption (system crash, a disk times out, etc) in degraded mode,
> 
> Moreover, in any non-COW system writes temporarily corrupt data. So again,
> writing to a (degraded or not) mdadm RAID5 is not much different than writing
> to a single physical disk. However I believe in the Btrfs case metadata is
> always COW, so this particular problem may be not as relevant here in the
> first place.

Degraded RAID5 does not behave like a single disk.  That's the point
people seem to keep missing when thinking about this.  btrfs CoW relies
on single-disk behavior, and fails badly when it doesn't get it.

btrfs CoW requires that writes to one sector don't modify or jeopardize
data integrity in any other sectors.  mdadm in degraded raid5/6 mode with
no stripe journal device cannot deliver this requirement.  Writes always
temporarily disrupt data on other disks in the same RAID stripe.  Each
individual disruption lasts only milliseconds, but there may be hundreds
or thousands of failure windows per second.

> 
> -- 
> With respect,
> Roman






Re: RAID system with adaption to changed number of disks

2016-10-12 Thread Adam Borowski
On Wed, Oct 12, 2016 at 01:19:37PM -0400, Zygo Blaxell wrote:
> On Wed, Oct 12, 2016 at 01:48:58PM +0800, Qu Wenruo wrote:
> > In fact, the _concept_ to solve such RMW behavior is quite simple:
> > 
> > Make sector size equal to stripe length. (Or vice versa if you like)
> > 
> > Although the implementation will be more complex, people like Chandan are
> > already working on sub page size sector size support.
> 
> So...metadata blocks would be 256K on the 5-disk RAID5 example above,
> and any file smaller than 256K would be stored inline?  Ouch.  That would
> also imply the compressed extent size limit (currently 128K) has to become
> much larger.
> 
> I had been thinking that we could inject "plug" extents to fill up
> RAID5 stripes.  This lets us keep the 4K block size for allocations,
> but at commit (or delalloc) time we would fill up any gaps in new RAID
> stripes to prevent them from being modified.  As the real data is deleted
> from the RAID stripes, it would be replaced by "plug" extents to keep any
> new data from being allocated in the stripe.  When the stripe consists
> entirely of "plug" extents, the plug extent would be deleted, allowing
> the stripe to be allocated again.  The "plug" data would be zero for
> the purposes of parity reconstruction, regardless of what's on the disk.
> Balance would just throw the plug extents away (no need to relocate them).

Your idea sounds good, but there's one problem: most real users don't
balance.  Ever.  Contrary to the tribal wisdom here, this actually works
fine, unless you had a pathologic load skewed to either data or metadata on
the first write then fill the disk to near-capacity with a load skewed the
other way.

Most usage patterns produce a mix of transient and persistent data (and at
write time you don't know which file is which), meaning that with time every
stripe will contain a smidge of cold data plus a fill of plug extents.

Thus, while the plug extents idea doesn't suffer from problems of big
sectors you just mentioned, we'd need some kind of auto-balance.

-- 
A MAP07 (Dead Simple) raspberry tincture recipe: 0.5l 95% alcohol, 1kg
raspberries, 0.4kg sugar; put into a big jar for 1 month.  Filter out and
throw away the fruits (can dump them into a cake, etc), let the drink age
at least 3-6 months.


Re: RAID system with adaption to changed number of disks

2016-10-12 Thread Roman Mamedov
On Wed, 12 Oct 2016 15:19:16 -0400
Zygo Blaxell  wrote:

> I'm not even sure btrfs does this--I haven't checked precisely what
> it does in dup mode.  It could send both copies of metadata to the
> disks with a single barrier to separate both metadata updates from
> the superblock updates.  That would be bad in this particular case.

It would be bad in any case, including a single physical disk and no RAID, and
I don't think there's any basis to speculate that mdadm doesn't implement
write barriers properly.

> In degraded RAID5/6 mode, all writes temporarily corrupt data, so if there
> is an interruption (system crash, a disk times out, etc) in degraded mode,

Moreover, in any non-COW system writes temporarily corrupt data. So again,
writing to a (degraded or not) mdadm RAID5 is not much different than writing
to a single physical disk. However I believe in the Btrfs case metadata is
always COW, so this particular problem may be not as relevant here in the
first place.

-- 
With respect,
Roman




Re: RAID system with adaption to changed number of disks

2016-10-12 Thread Zygo Blaxell
On Wed, Oct 12, 2016 at 01:31:41PM -0400, Zygo Blaxell wrote:
> On Wed, Oct 12, 2016 at 12:25:51PM +0500, Roman Mamedov wrote:
> > Zygo Blaxell  wrote:
> > 
> > > A btrfs -dsingle -mdup array on a mdadm raid[56] device might have a
> > > snowball's chance in hell of surviving a disk failure on a live array
> > > with only data losses.  This would work if mdadm and btrfs successfully
> > > arrange to have each dup copy of metadata updated separately, and one
> > > of the copies survives the raid5 write hole.  I've never tested this
> > > configuration, and I'd test the heck out of it before considering
> > > using it.
> > 
> > Not sure what you mean here, a non-fatal disk failure (i.e. within being
> > compensated by redundancy) is invisible to the upper layers on mdadm arrays.
> > They do not need to "arrange" anything, on such failure from the point of view
> > of Btrfs nothing whatsoever has happened to the /dev/mdX block device, it's
> > still perfectly and correctly readable and writable.
> 
> btrfs hurls a bunch of writes for one metadata copy to mdadm, mdadm
> forwards those writes to the disks.  btrfs sends a barrier to mdadm,
> mdadm must properly forward that barrier to all the disks and wait until
> they're all done.  Repeat the above for the other metadata copy.

I'm not even sure btrfs does this--I haven't checked precisely what
it does in dup mode.  It could send both copies of metadata to the
disks with a single barrier to separate both metadata updates from
the superblock updates.  That would be bad in this particular case.

> If that's all implemented correctly in mdadm, all is well; otherwise,
> mdadm and btrfs fail to arrange to have each dup copy of metadata
> updated separately.

To be clearer about the consequences of this:

If both copies of metadata are updated at the same time (because btrfs
and mdadm failed to get the barriers right), it's possible to have both
copies of metadata in an inconsistent (unreadable) state at the same time,
ending the filesystem.

In degraded RAID5/6 mode, all writes temporarily corrupt data, so if there
is an interruption (system crash, a disk times out, etc) in degraded mode,
one of the metadata copies will be damaged.  The damage may not be limited
to the current commit, so we need the second copy of the metadata intact
to recover from broken changes to the first copy.  Usually metadata chunks
are larger than RAID5 stripes, so this works out for btrfs on mdadm RAID5
(maybe not if two metadata chunks are adjacent and not stripe-aligned,
but that's a rare case, and one that only affects array sizes that are
not a power of 2 + 1 disk for RAID5, or power of 2 + 2 disks for RAID6).

> The present state of the disks is irrelevant.  The array could go
> degraded due to a disk failure at any time, so for practical failure
> analysis purposes, only the behavior in degraded mode is relevant.
> 
> > 
> > -- 
> > With respect,
> > Roman
> 
> 






Re: RAID system with adaption to changed number of disks

2016-10-12 Thread Zygo Blaxell
On Wed, Oct 12, 2016 at 12:25:51PM +0500, Roman Mamedov wrote:
> Zygo Blaxell  wrote:
> 
> > A btrfs -dsingle -mdup array on a mdadm raid[56] device might have a
> > snowball's chance in hell of surviving a disk failure on a live array
> > with only data losses.  This would work if mdadm and btrfs successfully
> > arrange to have each dup copy of metadata updated separately, and one
> > of the copies survives the raid5 write hole.  I've never tested this
> > configuration, and I'd test the heck out of it before considering
> > using it.
> 
> Not sure what you mean here, a non-fatal disk failure (i.e. within being
> compensated by redundancy) is invisible to the upper layers on mdadm arrays.
> They do not need to "arrange" anything, on such failure from the point of view
> of Btrfs nothing whatsoever has happened to the /dev/mdX block device, it's
> still perfectly and correctly readable and writable.

btrfs hurls a bunch of writes for one metadata copy to mdadm, mdadm
forwards those writes to the disks.  btrfs sends a barrier to mdadm,
mdadm must properly forward that barrier to all the disks and wait until
they're all done.  Repeat the above for the other metadata copy.

If that's all implemented correctly in mdadm, all is well; otherwise,
mdadm and btrfs fail to arrange to have each dup copy of metadata
updated separately.

The present state of the disks is irrelevant.  The array could go
degraded due to a disk failure at any time, so for practical failure
analysis purposes, only the behavior in degraded mode is relevant.

> 
> -- 
> With respect,
> Roman






Re: RAID system with adaption to changed number of disks

2016-10-12 Thread Zygo Blaxell
On Wed, Oct 12, 2016 at 01:48:58PM +0800, Qu Wenruo wrote:
> >btrfs also doesn't avoid the raid5 write hole properly.  After a crash,
> >a btrfs filesystem (like mdadm raid[56]) _must_ be scrubbed (resynced)
> >to reconstruct any parity that was damaged by an incomplete data stripe
> >update.
> > As long as all disks are working, the parity can be reconstructed
> >from the data disks.  If a disk fails prior to the completion of the
> >scrub, any data stripes that were written during previous crashes may
> >be destroyed.  And all that assumes the scrub bugs are fixed first.
> 
> This is true.
> I didn't take this into account.
> 
> But this is not a *single* problem, but 2 problems.
> 1) Power loss
> 2) Device crash
> 
> Before making things complex, why not focus on a single problem?

Solve one problem at a time--but don't lose sight of the whole list of
problems either, especially when they are interdependent.

> Not to mention that the possibility is much smaller than for a single problem.

Having field experience with both problems, I disagree with that.
The power loss/system crash problem is much more common than the device
failure/scrub problems.  More data is lost when a disk fails, but the
amount of data lost in a power failure isn't zero.  Before I gave up
on btrfs raid5, it worked out to about equal amounts of admin time
recovering from the two different failure modes.

> >If writes occur after a disk fails, they all temporarily corrupt small
> >amounts of data in the filesystem.  btrfs cannot tolerate any metadata
> >corruption (it relies on redundant metadata to self-repair), so when a
> >write to metadata is interrupted, the filesystem is instantly doomed
> >(damaged beyond the current tools' ability to repair and mount
> >read-write).
> 
> That's why we use a higher duplication level for metadata by default.
> And considering metadata size, it's quite acceptable to use RAID1 for
> metadata rather than RAID5/6.

Data RAID5 metadata RAID1 makes a limited amount of sense.  Small amounts
of data are still lost on power failures due to RMW on the data stripes.
It just doesn't break the entire filesystem because the metadata is
on RAID1 and RAID1 doesn't use RMW.

Data RAID6 does not make sense, unless we also have a way to have RAID1
make more than one mirror copy.  With one mirror copy an array is not
able to tolerate two disk failures, so the Q stripe for RAID6 is wasted
CPU and space.

> >Currently the upper layers of the filesystem assume that once data
> >blocks are written to disk, they are stable.  This is not true in raid5/6
> >because the parity and data blocks within each stripe cannot be updated
> >atomically.
> 
> True, but if we ignore parity, we'd find that, RAID5 is just RAID0.

Degraded RAID5 is not RAID0.  RAID5 has strict constraints that RAID0
does not.  The way a RAID5 implementation behaves in degraded mode is
the thing that usually matters after a disk fails.

> COW ensures (cowed) data and metadata are all safe and checksum will ensure
> they are OK, so even for RAID0, it's not a problem for cases like power loss.

This is not true.  btrfs does not use stripes correctly to get CoW to
work on RAID5/6.  This is why power failures result in small amounts of
data loss, if not filesystem-destroying disaster.

For CoW to work you have to make sure that you never modify a RAID stripe
that already contains committed data.  Let's consider a 5-disk array
and look at what we get when we try to reconstruct disk 2:

Disk1  Disk2  Disk3  Disk4  Disk5
Data1  Data2  Parity Data3  Data4

Suppose one transaction writes Data1-Data4 and Parity.  This is OK
because no metadata reference would point to this stripe before it
was committed to disk.  Here's some data as an example:

Disk1  Disk2  Disk3  Disk4  Disk5  Reconstructed Disk2
               

(to keep things simpler I'm just using Parity = Data1 ^ Data2 ^ Data4 ^
Data5 here)

Later, a transaction deletes Data3 and Data 4.  Still OK, because
we didn't modify any data in the stripe, so we may still be able to
reconstruct the data from missing disks.  The checksums for Data4 and
Data5 are missing, so if there is any bitrot we lose the whole stripe
(we can't tell whether the data is wrong or parity, we can't ignore the
rotted data because it's included in the parity, and we didn't update
the parity because deleting an extent doesn't modify its data stripe).

Disk1  Disk2  Disk3  Disk4  Disk5  Reconstructed Disk2
               

Now a third transaction allocates Data3 and Data 4.  Bad.  First, Disk4
is written and existing data is temporarily corrupted:

Disk1  Disk2  Disk3  Disk4  Disk5  Reconstructed Disk2
         1234      7452

then Disk5 is written, and the data is still corrupted:

Disk1  Disk2  Disk3  Disk4  Disk5  Reconstructed Disk2
         1234   5678   aaa2

then parity is written, and the 
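
Replaying the same sequence with made-up single-byte values (a sketch,
not the original numbers) shows what the reconstructed Disk2 column does
at each step:

    def xor(blocks):
        p = 0
        for b in blocks:
            p ^= b
        return p

    d1, d2, d3, d4 = 0x11, 0x22, 0x33, 0x44   # Data1/Data2 and Data3/Data4
    parity = xor([d1, d2, d3, d4])

    def recon_disk2():                        # rebuild Disk2 from the others
        return xor([d1, parity, d3, d4])

    print(hex(recon_disk2()))   # 0x22: transaction 1 committed, Disk2 recoverable
    # Transaction 2 deletes Data3/Data4: nothing on disk changes, still 0x22.
    d3 = 0x55                   # transaction 3 writes Disk4 first...
    print(hex(recon_disk2()))   # wrong: committed Data2 is unrecoverable right now
    d4 = 0x66                   # ...then Disk5...
    print(hex(recon_disk2()))   # still wrong
    parity = xor([d1, d2, d3, d4])            # ...then parity lands last
    print(hex(recon_disk2()))   # 0x22 again: the corruption window has closed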

Re: RAID system with adaption to changed number of disks

2016-10-12 Thread Roman Mamedov
On Tue, 11 Oct 2016 17:58:22 -0600
Chris Murphy  wrote:

> But consider the identical scenario with md or LVM raid5, or any
> conventional hardware raid5. A scrub check simply reports a mismatch.
> It's unknown whether data or parity is bad, so the bad data strip is
> propagated upward to user space without error. On a scrub repair, the
> data strip is assumed to be good, and good parity is overwritten with
> bad.

That's why I love to use Btrfs on top of mdadm RAID5/6 -- combining a mature
and stable RAID implementation with Btrfs anti-corruption checksumming
"watchdog". In the case that you described, no silent corruption will occur,
as Btrfs will report an uncorrectable read error -- and I can just restore the
file in question from backups.


On Wed, 12 Oct 2016 00:37:19 -0400
Zygo Blaxell  wrote:

> A btrfs -dsingle -mdup array on a mdadm raid[56] device might have a
> snowball's chance in hell of surviving a disk failure on a live array
> with only data losses.  This would work if mdadm and btrfs successfully
> arrange to have each dup copy of metadata updated separately, and one
> of the copies survives the raid5 write hole.  I've never tested this
> configuration, and I'd test the heck out of it before considering
> using it.

Not sure what you mean here, a non-fatal disk failure (i.e. within being
compensated by redundancy) is invisible to the upper layers on mdadm arrays.
They do not need to "arrange" anything, on such failure from the point of view
of Btrfs nothing whatsoever has happened to the /dev/mdX block device, it's
still perfectly and correctly readable and writable.

-- 
With respect,
Roman




Re: RAID system with adaption to changed number of disks

2016-10-12 Thread Anand Jain





Missing device is the _only_ thing the current design handles.


Right. The below patches in the ML added two more device states,
offline and failed. It is tested with raid1.

[PATCH 11/13] btrfs: introduce device dynamic state transition to offline or failed

[PATCH 12/13] btrfs: check device for critical errors and mark failed

Thanks, Anand


Re: RAID system with adaption to changed number of disks

2016-10-11 Thread Qu Wenruo



At 10/12/2016 12:37 PM, Zygo Blaxell wrote:

On Wed, Oct 12, 2016 at 09:32:17AM +0800, Qu Wenruo wrote:

But consider the identical scenario with md or LVM raid5, or any
conventional hardware raid5. A scrub check simply reports a mismatch.
It's unknown whether data or parity is bad, so the bad data strip is
propagated upward to user space without error. On a scrub repair, the
data strip is assumed to be good, and good parity is overwritten with
bad.


Totally true.

Original RAID5/6 design is only to handle missing device, not rotted bits.


Missing device is the _only_ thing the current design handles.  i.e. you
umount the filesystem cleanly, remove a disk, and mount it again degraded,
and then the only thing you can safely do with the filesystem is delete
or replace a device.  There is also a probability of being able to repair
bitrot under some circumstances.

If your disk failure looks any different from this, btrfs can't handle it.
If a disk fails while the array is running and the filesystem is writing,
the filesystem is likely to be severely damaged, possibly unrecoverably.

A btrfs -dsingle -mdup array on a mdadm raid[56] device might have a
snowball's chance in hell of surviving a disk failure on a live array
with only data losses.  This would work if mdadm and btrfs successfully
arrange to have each dup copy of metadata updated separately, and one
of the copies survives the raid5 write hole.  I've never tested this
configuration, and I'd test the heck out of it before considering
using it.


So while I agree in total that Btrfs raid56 isn't mature or tested
enough to consider it production ready, I think that's because of the
UNKNOWN causes for problems we've seen with raid56. Not the parity
scrub bug which - yeah NOT good, not least of which is the data
integrity guarantees Btrfs is purported to make are substantially
negated by this bug. I think the bark is worse than the bite. It is
not the bark we'd like Btrfs to have though, for sure.



Current btrfs RAID5/6 scrub problem is, we don't take full usage of tree and
data checksum.

[snip]

This leads directly to a variety of problems with the diagnostic tools,
e.g.  scrub reports errors randomly across devices, and cannot report the
path of files containing corrupted blocks if it's the parity block that
gets corrupted.


At least better than screwing up good stripes.

The tool just lets the user know whether there are any corrupted stripes,
like the kernel scrub does, but with better behavior -- for example, it
won't reconstruct stripes while ignoring checksums.



A human-readable report is not that hard to implement (compared to the
complex csum and parity checks) and can be added later.
For parity corruption, there is no way to output a human-readable result
(such as a file path) anyway.




btrfs also doesn't avoid the raid5 write hole properly.  After a crash,
a btrfs filesystem (like mdadm raid[56]) _must_ be scrubbed (resynced)
to reconstruct any parity that was damaged by an incomplete data stripe
update.  As long as all disks are working, the parity can be reconstructed
from the data disks.  If a disk fails prior to the completion of the
scrub, any data stripes that were written during previous crashes may
be destroyed.  And all that assumes the scrub bugs are fixed first.


This is true.
I didn't take this into account.

But this is not a *single* problem, but 2 problems.
1) Power loss
2) Device crash

Before making things complex, why not focus on a single problem first?

Not to mention that the probability of both happening together is much
smaller than that of either single problem.



If writes occur after a disk fails, they all temporarily corrupt small
amounts of data in the filesystem.  btrfs cannot tolerate any metadata
corruption (it relies on redundant metadata to self-repair), so when a
write to metadata is interrupted, the filesystem is instantly doomed
(damaged beyond the current tools' ability to repair and mount
read-write).


That's why we use a higher duplication level for metadata by default.
And considering the size of metadata, it's quite acceptable to use RAID1
for metadata rather than RAID5/6.




Currently the upper layers of the filesystem assume that once data
blocks are written to disk, they are stable.  This is not true in raid5/6
because the parity and data blocks within each stripe cannot be updated
atomically.


True, but if we ignore parity, we'd find that RAID5 is just RAID0.

COW ensures (cowed) data and metadata are all safe, and checksums ensure
they are OK, so even for RAID0 a case like power loss is not a problem.


So we should trust csum first and then parity.

If we follow this principle, RAID5 becomes a raid0 with a somewhat higher
chance of recovery in some cases, like one missing device.


So, I'd like to fix RAID5 scrub to make it at least better than RAID0, 
not worse than RAID0.




 btrfs doesn't avoid writing new data in the same RAID stripe
as old data (it provides a rmw function for raid56, which is simply a bug
in a CoW filesystem), so previously committed data can be lost.

Re: RAID system with adaption to changed number of disks

2016-10-11 Thread Zygo Blaxell
On Wed, Oct 12, 2016 at 09:32:17AM +0800, Qu Wenruo wrote:
> >But consider the identical scenario with md or LVM raid5, or any
> >conventional hardware raid5. A scrub check simply reports a mismatch.
> >It's unknown whether data or parity is bad, so the bad data strip is
> >propagated upward to user space without error. On a scrub repair, the
> >data strip is assumed to be good, and good parity is overwritten with
> >bad.
> 
> Totally true.
> 
> Original RAID5/6 design is only to handle missing device, not rotted bits.

Missing device is the _only_ thing the current design handles.  i.e. you
umount the filesystem cleanly, remove a disk, and mount it again degraded,
and then the only thing you can safely do with the filesystem is delete
or replace a device.  There is also a probability of being able to repair
bitrot under some circumstances.

If your disk failure looks any different from this, btrfs can't handle it.
If a disk fails while the array is running and the filesystem is writing,
the filesystem is likely to be severely damaged, possibly unrecoverably.

A btrfs -dsingle -mdup array on a mdadm raid[56] device might have a
snowball's chance in hell of surviving a disk failure on a live array
with only data losses.  This would work if mdadm and btrfs successfully
arrange to have each dup copy of metadata updated separately, and one
of the copies survives the raid5 write hole.  I've never tested this
configuration, and I'd test the heck out of it before considering
using it.

> >So while I agree in total that Btrfs raid56 isn't mature or tested
> >enough to consider it production ready, I think that's because of the
> >UNKNOWN causes for problems we've seen with raid56. Not the parity
> >scrub bug which - yeah NOT good, not least of which is the data
> >integrity guarantees Btrfs is purported to make are substantially
> >negated by this bug. I think the bark is worse than the bite. It is
> >not the bark we'd like Btrfs to have though, for sure.
> >
> 
> Current btrfs RAID5/6 scrub problem is, we don't take full usage of tree and
> data checksum.
[snip]

This leads directly to a variety of problems with the diagnostic tools,
e.g.  scrub reports errors randomly across devices, and cannot report the
path of files containing corrupted blocks if it's the parity block that
gets corrupted.

btrfs also doesn't avoid the raid5 write hole properly.  After a crash,
a btrfs filesystem (like mdadm raid[56]) _must_ be scrubbed (resynced)
to reconstruct any parity that was damaged by an incomplete data stripe
update.  As long as all disks are working, the parity can be reconstructed
from the data disks.  If a disk fails prior to the completion of the
scrub, any data stripes that were written during previous crashes may
be destroyed.  And all that assumes the scrub bugs are fixed first.

If writes occur after a disk fails, they all temporarily corrupt small
amounts of data in the filesystem.  btrfs cannot tolerate any metadata
corruption (it relies on redundant metadata to self-repair), so when a
write to metadata is interrupted, the filesystem is instantly doomed
(damaged beyond the current tools' ability to repair and mount
read-write).

Currently the upper layers of the filesystem assume that once data
blocks are written to disk, they are stable.  This is not true in raid5/6
because the parity and data blocks within each stripe cannot be updated
atomically.  btrfs doesn't avoid writing new data in the same RAID stripe
as old data (it provides a rmw function for raid56, which is simply a bug
in a CoW filesystem), so previously committed data can be lost.  If the
previously committed data is part of the metadata tree, the filesystem
is doomed; for ordinary data blocks there are just a few dozen to a few
thousand corrupted files for the admin to clean up after each crash.

It might be possible to hack up the allocator to pack writes into empty
stripes to avoid the write hole, but every time I think about this it
looks insanely hard to do (or insanely wasteful of space) for data
stripes.
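
Very roughly, the packing idea looks something like this toy sketch
(made-up names, single parity, one-block strips -- nothing here is real
btrfs code):

# Toy model of "pack writes into empty stripes": CoW writes only go to
# stripes that hold no committed data, so an interrupted parity update
# can never damage data referenced by an already-committed tree.

STRIPE_WIDTH = 4                     # data blocks per stripe (5-disk raid5)

class ToyAllocator:
    def __init__(self, n_stripes):
        self.committed = [False] * n_stripes   # does stripe s hold committed data?

    def alloc_stripe(self):
        """Pick a stripe with no committed data in it."""
        for s, used in enumerate(self.committed):
            if not used:
                return s
        raise RuntimeError("no empty stripes left; would need GC or RMW here")

    def commit(self, blocks):
        """Write one transaction's blocks as whole stripes, padding the tail."""
        placed = []
        while blocks:
            s = self.alloc_stripe()
            chunk, blocks = blocks[:STRIPE_WIDTH], blocks[STRIPE_WIDTH:]
            chunk += [0] * (STRIPE_WIDTH - len(chunk))   # the space cost
            self.committed[s] = True
            placed.append((s, chunk))
        return placed

alloc = ToyAllocator(n_stripes=8)
print(alloc.commit([10, 20, 30, 40, 50, 60]))
# -> [(0, [10, 20, 30, 40]), (1, [50, 60, 0, 0])]

The padding line is where the space waste comes from: every small or tail
write burns a whole stripe unless something later repacks partially-used
stripes.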





Re: RAID system with adaption to changed number of disks

2016-10-11 Thread Dan Mons
Ignoring the RAID56 bugs for a moment, if you have mismatched drives,
BtrFS RAID1 is a pretty good way of utilising available space and
having redundancy.

My home array is BtrFS with a cobbled-together collection of disks
ranging from 500GB to 3TB (and 5 of them, so it's not an even number).
I have a grand total of 8TB of linear space, and with BtrFS RAID1 I
can use exactly 50% of this (4TB) even with the weird combination of
disks.  That's something other RAID1 implementations can't do (they're
limited to the size of the smallest disk of any pair, and need an even
number of disks all up), and I get free compression and snapshotting,
so yay for that.
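
For anyone wondering how that 50% figure falls out of mismatched disks,
here's a rough back-of-the-envelope sketch (the sizes in the example are
just a guessed mix that adds up to 8TB, not my exact disks):

# Rough usable-space estimate for btrfs -d raid1 across mixed-size disks.
# Every chunk gets mirrored on two different devices, so usable space is
# capped both by half the raw total and by how much the largest disk can
# be paired against.  Simplified model, not what btrfs literally does.

def raid1_usable(sizes_gb):
    total = sum(sizes_gb)
    return min(total // 2, total - max(sizes_gb))

# Guessed mix of 5 disks between 500GB and 3TB totalling 8TB:
print(raid1_usable([3000, 2000, 1500, 1000, 500]))   # -> 4000 (about 4TB)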

As drives die of natural old age, I replace them ad-hoc with bigger
drives (whatever is the sane price-point at the time).  A replace
followed by a rebalance later, and I'm back to using all available
space (which grows every time I throw a bigger drive in the mix),
which again is incredibly handy when you're a home user looking for
sane long-term storage that doesn't require complete rebuilds of your
array.

-Dan


Dan Mons - VFX Sysadmin
Cutting Edge
http://cuttingedge.com.au


On 12 October 2016 at 01:14, Philip Louis Moetteli
 wrote:
> Hello,
>
>
> I have to build a RAID 6 with the following 3 requirements:
>
> • Use different kinds of disks with different sizes.
> • When a disk fails and there's enough space, the RAID should be able 
> to reconstruct itself out of the degraded state. Meaning, if I have e. g. a 
> RAID with 8 disks and 1 fails, I should be able to chose to transform this in 
> a non-degraded (!) RAID with 7 disks.
> • Also the other way round: If I add a disk of what size ever, it 
> should redistribute the data, so that it becomes a RAID with 9 disks.
>
> I don’t care, if I have to do it manually.
> I don’t care so much about speed either.
>
> Is BTrFS capable of doing that?
>
>
> Thanks a lot for your help!
>


Re: RAID system with adaption to changed number of disks

2016-10-11 Thread Qu Wenruo



At 10/12/2016 07:58 AM, Chris Murphy wrote:

https://btrfs.wiki.kernel.org/index.php/Status
Scrub + RAID56 Unstable will verify but not repair

This doesn't seem quite accurate. It does repair the vast majority of
the time. On scrub though, there's maybe a 1 in 3 or 1 in 4 chance bad
data strip results in a.) fixed up data strip from parity b.) wrong
recomputation of replacement parity c.) good parity is overwritten
with bad, silently, d.) if parity reconstruction is needed in the
future e.g. device or sector failure, it results in EIO, a kind of
data loss.

Bad bug. For sure.

But consider the identical scenario with md or LVM raid5, or any
conventional hardware raid5. A scrub check simply reports a mismatch.
It's unknown whether data or parity is bad, so the bad data strip is
propagated upward to user space without error. On a scrub repair, the
data strip is assumed to be good, and good parity is overwritten with
bad.


Totally true.

Original RAID5/6 design is only to handle missing device, not rotted bits.



So while I agree in total that Btrfs raid56 isn't mature or tested
enough to consider it production ready, I think that's because of the
UNKNOWN causes for problems we've seen with raid56. Not the parity
scrub bug which - yeah NOT good, not least of which is the data
integrity guarantees Btrfs is purported to make are substantially
negated by this bug. I think the bark is worse than the bite. It is
not the bark we'd like Btrfs to have though, for sure.



The problem with the current btrfs RAID5/6 scrub is that we don't take
full advantage of the tree and data checksums.


In an ideal situation, btrfs should detect which stripe is corrupted, and
only try to recover data/parity if the recovered data's checksum matches.


For example, for a very traditional RAID5 layout like the following:

  Disk 1 |  Disk 2 |  Disk 3 |
 --------+---------+---------+
  Data 1 |  Data 2 |  Parity |

Scrub should check data stripes 1 and 2 against their checksums first.

[All data extents have csum]
1) All csums match
   Good, then check parity.
   1.1) Parity matches
        Nothing wrong at all.

   1.2) Parity mismatches
        Just recalculate parity. The corruption may be in unused data
        space or in the parity itself; either way, recalculating parity
        is good enough.

2) One data stripe's csum mismatches (or is missing), parity mismatches too
   We only know that one data stripe mismatches, not whether parity is OK.
   Try to recover that data stripe from parity, and recheck its csum.

   2.1) Recovered data stripe matches csum
        That data stripe was corrupted and parity is OK.
        Recoverable.

   2.2) Recovered data stripe mismatches csum
        Both that data stripe and the parity are corrupted.

3) Two data stripes' csums mismatch, whether parity matches or not
   At least 2 stripes are screwed up; no fix anyway.

[Some data extents have no csum (nodatasum)]
4) Existing csums (or no csum at all) match, parity matches
   Good, nothing to worry about.

5) Existing csum mismatches for one data stripe, parity mismatches
   Like 2), try to recover that data stripe, and re-check its csum.

   5.1) Recovered data stripe matches csum
        At least we can recover the data covered by csum.
        Corrupted no-csum data is not our concern.

   5.2) Recovered data stripe mismatches csum
        Screwed up.

6) No csum at all, parity mismatches
   We are screwed, just like traditional RAID5.
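
A compact sketch of those cases for a single raid5 stripe (toy code, not
the btrfs-progs implementation I'm writing; crc32 stands in for the real
csum, and partially covered stripes are lumped in with case 1.2 for
simplicity):

# Toy version of the scrub decision tree above for one raid5 stripe.
# data: list of data strips (bytes), parity: bytes, csums[i]: expected
# crc32 of strip i, or None for nodatasum extents.
import zlib

def xor(blocks):
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, x in enumerate(b):
            out[i] ^= x
    return bytes(out)

def csum_ok(block, expected):
    return expected is None or zlib.crc32(block) == expected

def scrub_stripe(data, parity, csums):
    bad = [i for i, d in enumerate(data) if not csum_ok(d, csums[i])]
    parity_ok = parity == xor(data)

    if not bad:
        if parity_ok:
            return "ok"                                    # cases 1.1) and 4)
        if all(c is None for c in csums):
            return "ignore: no csum at all, parity bad"    # case 6)
        return "recompute parity"                          # case 1.2)

    if len(bad) == 1:                                      # cases 2) and 5)
        i = bad[0]
        rebuilt = xor([parity] + [d for j, d in enumerate(data) if j != i])
        if csum_ok(rebuilt, csums[i]):
            return "repair strip %d from parity" % i       # cases 2.1) and 5.1)
        return "unrecoverable: strip and parity both bad"  # cases 2.2) and 5.2)

    return "unrecoverable: 2+ bad strips"                  # case 3)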

And I'm coding for the above cases in btrfs-progs to implement an 
off-line scrub tool.


Currently it looks good, and can already handle cases 1) to 3).
And I tend to ignore any full stripe that lacks checksums and whose
parity mismatches.


But as you can see, there are so many things involved in btrfs RAID5
(whether csums exist and match, whether parity matches, missing devices)
-- and RAID6 will be more complex -- that it's already much more complex
than traditional RAID5/6 or the current scrub implementation.



So what the current kernel scrub lacks is:
1) Detection of good/bad stripes
2) Recheck of recovery attempts

But those are things traditional RAID5/6 also lacks, unless it has some
hidden checksum like btrfs's that it can use.


Thanks,
Qu




Re: RAID system with adaption to changed number of disks

2016-10-11 Thread Chris Murphy
https://btrfs.wiki.kernel.org/index.php/Status
Scrub + RAID56 Unstable will verify but not repair

This doesn't seem quite accurate. It does repair the vast majority of
the time. On scrub though, there's maybe a 1 in 3 or 1 in 4 chance bad
data strip results in a.) fixed up data strip from parity b.) wrong
recomputation of replacement parity c.) good parity is overwritten
with bad, silently, d.) if parity reconstruction is needed in the
future e.g. device or sector failure, it results in EIO, a kind of
data loss.

Bad bug. For sure.

But consider the identical scenario with md or LVM raid5, or any
conventional hardware raid5. A scrub check simply reports a mismatch.
It's unknown whether data or parity is bad, so the bad data strip is
propagated upward to user space without error. On a scrub repair, the
data strip is assumed to be good, and good parity is overwritten with
bad.

So while I agree in total that Btrfs raid56 isn't mature or tested
enough to consider it production ready, I think that's because of the
UNKNOWN causes for problems we've seen with raid56. Not the parity
scrub bug which - yeah NOT good, not least of which is the data
integrity guarantees Btrfs is purported to make are substantially
negated by this bug. I think the bark is worse than the bite. It is
not the bark we'd like Btrfs to have though, for sure.


-- 
Chris Murphy


Re: RAID system with adaption to changed number of disks

2016-10-11 Thread ronnie sahlberg
On Tue, Oct 11, 2016 at 8:14 AM, Philip Louis Moetteli
 wrote:
>
> Hello,
>
>
> I have to build a RAID 6 with the following 3 requirements:


You should under no circumstances use RAID5/6 for anything other than
test and throw-away data.
It has several known issues that will eat your data. Total data loss
is a real possibility.

(the capability to even create raid5/6 filesystems should imho be
removed from btrfs until this changes.)

>
> • Use different kinds of disks with different sizes.
> • When a disk fails and there's enough space, the RAID should be able 
> to reconstruct itself out of the degraded state. Meaning, if I have e. g. a 
> RAID with 8 disks and 1 fails, I should be able to chose to transform this in 
> a non-degraded (!) RAID with 7 disks.
> • Also the other way round: If I add a disk of what size ever, it 
> should redistribute the data, so that it becomes a RAID with 9 disks.
>
> I don’t care, if I have to do it manually.
> I don’t care so much about speed either.
>
> Is BTrFS capable of doing that?
>
>
> Thanks a lot for your help!


Re: RAID system with adaption to changed number of disks

2016-10-11 Thread Tomasz Kusmierz
I think you just described all the benefits of btrfs in that type of
configuration... unfortunately, after btrfs RAID 5 & 6 was marked as
OK it got marked as "it will eat your data" (and there is a ton of
people in random places popping up with raid 5 & 6 setups that just
killed their data).

On 11 October 2016 at 16:14, Philip Louis Moetteli
 wrote:
> Hello,
>
>
> I have to build a RAID 6 with the following 3 requirements:
>
> • Use different kinds of disks with different sizes.
> • When a disk fails and there's enough space, the RAID should be able 
> to reconstruct itself out of the degraded state. Meaning, if I have e. g. a 
> RAID with 8 disks and 1 fails, I should be able to chose to transform this in 
> a non-degraded (!) RAID with 7 disks.
> • Also the other way round: If I add a disk of what size ever, it 
> should redistribute the data, so that it becomes a RAID with 9 disks.
>
> I don’t care, if I have to do it manually.
> I don’t care so much about speed either.
>
> Is BTrFS capable of doing that?
>
>
> Thanks a lot for your help!
>


Re: RAID system with adaption to changed number of disks

2016-10-11 Thread Hugo Mills
On Tue, Oct 11, 2016 at 03:14:30PM +, Philip Louis Moetteli wrote:
> Hello,
> 
> 
> I have to build a RAID 6 with the following 3 requirements:
> 
>   • Use different kinds of disks with different sizes.
>   • When a disk fails and there's enough space, the RAID should be able 
> to reconstruct itself out of the degraded state. Meaning, if I have e. g. a 
> RAID with 8 disks and 1 fails, I should be able to chose to transform this in 
> a non-degraded (!) RAID with 7 disks.
>   • Also the other way round: If I add a disk of what size ever, it 
> should redistribute the data, so that it becomes a RAID with 9 disks.
> 
> I don’t care, if I have to do it manually.
> I don’t care so much about speed either.
> 
> Is BTrFS capable of doing that?

1) Take a look at http://carfax.org.uk/btrfs-usage/ which will tell
   you how much space you can get out of a btrfs array with different
   sized devices.

2) Btrfs's parity RAID implementation is not in good shape right
   now. It has known data corruption issues, and should not be used in
   production.

3) The redistribution of space is something that btrfs can do. It
   needs to be triggered manually at the moment, but it definitely
   works.

   Hugo.

-- 
Hugo Mills | We are all lying in the gutter, but some of us are
hugo@... carfax.org.uk | looking at the stars.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   Oscar Wilde




Re: RAID system with adaption to changed number of disks

2016-10-11 Thread Austin S. Hemmelgarn

On 2016-10-11 11:14, Philip Louis Moetteli wrote:

Hello,


I have to build a RAID 6 with the following 3 requirements:

• Use different kinds of disks with different sizes.
• When a disk fails and there's enough space, the RAID should be able 
to reconstruct itself out of the degraded state. Meaning, if I have e. g. a 
RAID with 8 disks and 1 fails, I should be able to chose to transform this in a 
non-degraded (!) RAID with 7 disks.
• Also the other way round: If I add a disk of what size ever, it 
should redistribute the data, so that it becomes a RAID with 9 disks.

I don’t care, if I have to do it manually.
I don’t care so much about speed either.

Is BTrFS capable of doing that?

In theory yes.  In practice, BTRFS RAID5/6 mode should not be used in
production due to a number of known serious issues relating to 
rebuilding and reshaping arrays.




RAID system with adaption to changed number of disks

2016-10-11 Thread Philip Louis Moetteli
Hello,


I have to build a RAID 6 with the following 3 requirements:

• Use different kinds of disks with different sizes.
• When a disk fails and there's enough space, the RAID should be able 
to reconstruct itself out of the degraded state. Meaning, if I have e. g. a 
RAID with 8 disks and 1 fails, I should be able to choose to transform this into a
non-degraded (!) RAID with 7 disks.
• Also the other way round: If I add a disk of what size ever, it 
should redistribute the data, so that it becomes a RAID with 9 disks.

I don’t care, if I have to do it manually.
I don’t care so much about speed either.

Is BTrFS capable of doing that?


Thanks a lot for your help!