Re: Status of RAID5/6

2018-04-04 Thread Zygo Blaxell
On Wed, Apr 04, 2018 at 11:31:33PM +0200, Goffredo Baroncelli wrote:
> On 04/04/2018 08:01 AM, Zygo Blaxell wrote:
> > On Wed, Apr 04, 2018 at 07:15:54AM +0200, Goffredo Baroncelli wrote:
> >> On 04/04/2018 12:57 AM, Zygo Blaxell wrote:
> [...]
> >> Earlier you pointed out that writing non-contiguous blocks has
> >> an impact on performance. I am replying that the switch to a
> >> different BG happens at the disk-stripe boundary, so in any case the
> >> block is physically interrupted and switched to another disk
> > 
> > The difference is that the write is switched to a different local address
> > on the disk.
> > 
> > It's not "another" disk if it's a different BG.  Recall in this plan
> > there is a full-width BG that is on _every_ disk, which means every
> > small-width BG shares a disk with the full-width BG.  Every extent tail
> > write requires a seek on a minimum of two disks in the array for raid5,
> > three disks for raid6.  A tail that is stripe-width minus one will hit
> > N - 1 disks twice in an N-disk array.
> 
> Below I made a little simulation; my results tell me another thing:
> 
> Current BTRFS (w/write hole)
> 
> Supposing 5 disk raid 6 and stripe size=64kb*3=192kb (disk stripe=64kb)
> 
> Case A.1): extent size = 192kb:
> 5 writes of 64kb spread on 5 disks (3 data + 2 parity)
> 
> Case A.2.2): extent size = 256kb: (optimistic case: contiguous space 
> available)
> 5 writes of 64kb spread on 5 disks (3 data + 2 parity)
> 2 reads of 64 kb spread on 2 disks (the two old data blocks of the stripe) [**]
> 3 writes of 64 kb spread on 3 disks (data + 2 parity)
> 
> Note that the two reads are contiguous to the 5 writes both in terms of
> space and time. The three writes are contiguous only in terms of space,
> but not in terms of time, because these could happen only after the 2
> reads and the consequent parity computations. So we should consider
> that between these two events some disk activity happens; this means
> seeks between the 2 reads and the 3 writes.
> 
> 
> BTRFS with multiple BG (wo/write hole)
> 
> Supposing 5 disk raid 6 and stripe size=64kb*3=192kb (disk stripe=64kb)
> 
> Case B.1): extent size = 192kb:
> 5 writes of 64kb spread on 5 disks
> 
> Case B.2): extent size = 256kb:
> 5 writes of 64kb spread on 5 disks in BG#1
> 3 writes of 64 kb spread on 3 disks in BG#2 (which requires 3 seeks)
> 
> So if I count correctly:
> - case B1 vs A1: these are equivalent
> - case B2 vs A2.1/A2.2:
>   8 writes vs 8 writes
>   3 seeks vs 3 seeks
>   0 reads vs 2 reads
> 
> So to me it seems that the cost of doing a RMW cycle is worse than
> seeking to another BG.

Well, RMW cycles are dangerous, so being slow as well is just a second
reason never to do them.

> Anyway I am reaching the conclusion, also thanks to this discussion,
> that this is not enough. Even if we solved the problem of the
> "extent smaller than stripe" write, we would still face this issue again
> when part of the file is changed.
> In this case the file update breaks the old extent and creates three
> extents: the first part, the new part, and the last part. Up to that point
> everything is OK. However the "old" part of the file would be marked
> as free space, and using this space could require an RMW cycle

You cannot use that free space within RAID stripes because it would
require RMW, and RMW causes write hole.  The space would have to be kept
unavailable until the rest of the RAID stripe was deleted.

OTOH, if you can solve that free space management problem, you don't
have to do anything else to solve write hole.  If you never RMW then
you never have the write hole in the first place.

> I am concluding that the only two reliable solutions are 
> a) variable stripe size (like ZFS does) 
> or b) logging the RMW cycle of a stripe 

Those are the only solutions that don't require a special process for
reclaiming unused space in RAID stripes.  If you have that, you have a
few more options; however, they all involve making a second copy of the
data at a later time (as opposed to option b, which makes a second
copy of the data during the original write).

a) also doesn't support nodatacow files (AFAIK ZFS doesn't have those)
and it would require defrag to get the inefficiently used space back.

b) is the best of the terrible options.  It minimizes the impact on the
rest of the filesystem since it can fix RMW inconsistency without having
to eliminate the RMW cases.  It doesn't require rewriting the allocator
nor does it require users to run defrag or balance periodically.

> [**] Does anyone know if the checksums are checked during this read?
> [...]
>  
> BR
> G.Baroncelli
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli 
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Status of RAID5/6

2018-04-04 Thread Goffredo Baroncelli
On 04/04/2018 08:01 AM, Zygo Blaxell wrote:
> On Wed, Apr 04, 2018 at 07:15:54AM +0200, Goffredo Baroncelli wrote:
>> On 04/04/2018 12:57 AM, Zygo Blaxell wrote:
[...]
>> Earlier you pointed out that writing non-contiguous blocks has
>> an impact on performance. I am replying that the switch to a
>> different BG happens at the disk-stripe boundary, so in any case the
>> block is physically interrupted and switched to another disk
> 
> The difference is that the write is switched to a different local address
> on the disk.
> 
> It's not "another" disk if it's a different BG.  Recall in this plan
> there is a full-width BG that is on _every_ disk, which means every
> small-width BG shares a disk with the full-width BG.  Every extent tail
> write requires a seek on a minimum of two disks in the array for raid5,
three disks for raid6.  A tail that is stripe-width minus one will hit
> N - 1 disks twice in an N-disk array.

Below I made a little simulation; my results tell me another thing:

Current BTRFS (w/write hole)

Supposing 5 disk raid 6 and stripe size=64kb*3=192kb (disk stripe=64kb)

Case A.1): extent size = 192kb:
5 writes of 64kb spread on 5 disks (3 data + 2 parity)

Case A.2.2): extent size = 256kb: (optimistic case: contiguous space available)
5 writes of 64kb spread on 5 disks (3 data + 2 parity)
2 reads of 64 kb spread on 2 disks (the two old data blocks of the stripe) [**]
3 writes of 64 kb spread on 3 disks (data + 2 parity)

Note that the two reads are contiguous to the 5 writes both in terms of space 
and time. The three writes are contiguous only in terms of space, but not in 
terms of time, because these could happen only after the 2 reads and the 
consequent parity computations. So we should consider that between these two 
events some disk activity happens; this means seeks between the 2 reads and 
the 3 writes.


BTRFS with multiple BG (wo/write hole)

Supposing 5 disk raid 6 and stripe size=64kb*3=192kb (disk stripe=64kb)

Case B.1): extent size = 192kb:
5 writes of 64kb spread on 5 disks

Case B.2): extent size = 256kb:
5 writes of 64kb spread on 5 disks in BG#1
3 writes of 64 kb spread on 3 disks in BG#2 (which requires 3 seeks)

So if I count correctly:
- case B1 vs A1: these are equivalent
- case B2 vs A2.1/A2.2:
8 writes vs 8 writes
3 seeks vs 3 seeks
0 reads vs 2 reads

So to me it seems that the cost of doing a RMW cycle is worse than seeking to 
another BG.
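
A rough sketch of the same accounting, for anyone who wants to play with other
extent sizes (this is only a model of the cases above, not btrfs code; the
"one seek per disk touched in a second pass" rule is my own assumption):

DISK_STRIPE = 64 * 1024          # bytes per disk per stripe
DATA_DISKS, PARITY_DISKS = 3, 2  # 5-disk raid6, as above

def rmw_layout(extent):
    """Case A.2.2: full stripes, then an RMW pass on the tail stripe."""
    full, tail = divmod(extent, DATA_DISKS * DISK_STRIPE)
    writes = full * (DATA_DISKS + PARITY_DISKS)
    reads = seeks = 0
    if tail:
        tail_disks = tail // DISK_STRIPE
        reads = DATA_DISKS - tail_disks       # old data of the stripe
        writes += tail_disks + PARITY_DISKS   # new data + parity
        seeks = tail_disks + PARITY_DISKS     # second pass, after the reads
    return writes, reads, seeks

def multi_bg_layout(extent):
    """Case B.2: full stripes in BG#1, the tail written to a narrower BG#2."""
    full, tail = divmod(extent, DATA_DISKS * DISK_STRIPE)
    writes = full * (DATA_DISKS + PARITY_DISKS)
    seeks = 0
    if tail:
        tail_disks = tail // DISK_STRIPE
        writes += tail_disks + PARITY_DISKS
        seeks = tail_disks + PARITY_DISKS     # jump to the other BG
    return writes, 0, seeks

for name, fn in (("A.2.2 (RMW)", rmw_layout), ("B.2 (multi-BG)", multi_bg_layout)):
    w, r, s = fn(256 * 1024)
    print(name, "->", w, "writes,", r, "reads,", s, "seeks")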

Anyway I am reaching the conclusion, also thanks to this discussion, that this 
is not enough. Even if we solved the problem of the "extent smaller than 
stripe" write, we would still face this issue again when part of the file is 
changed.
In this case the file update breaks the old extent and creates three extents: 
the first part, the new part, and the last part. Up to that point everything is 
OK. However the "old" part of the file would be marked as free space, and using 
this space could require an RMW cycle

I am concluding that the only two reliable solutions are 
a) variable stripe size (like ZFS does) 
or b) logging the RMW cycle of a stripe 


[**] Does anyone know if the checksums are checked during this read?
[...]
 
BR
G.Baroncelli


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Status of RAID5/6

2018-04-04 Thread Zygo Blaxell
On Tue, Apr 03, 2018 at 09:08:01PM -0600, Chris Murphy wrote:
> On Tue, Apr 3, 2018 at 11:03 AM, Goffredo Baroncelli  
> wrote:
> > On 04/03/2018 02:31 AM, Zygo Blaxell wrote:
> >> On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote:
> >>> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote:
>  On 2018-04-02 11:18, Goffredo Baroncelli wrote:
> > I thought that a possible solution is to create BG with different
>  number of data disks. E.g. supposing to have a raid 6 system with 6
>  disks, where 2 are parity disk; we should allocate 3 BG
> > BG #1: 1 data disk, 2 parity disks
> > BG #2: 2 data disks, 2 parity disks,
> > BG #3: 4 data disks, 2 parity disks
> >
> > For simplicity, the disk-stripe length is assumed = 4K.
> >
> > So if you have a write with a length of 4 KB, this should be placed
>  in BG#1; if you have a write with a length of 3*4KB = 12KB, the first 8KB
>  should be placed in BG#2, then the rest in BG#1.
> > This would avoid space wasting, even if the fragmentation will
>  increase (but does the fragmentation matter with modern solid
>  state disks ?).
> >>> I don't really see why this would increase fragmentation or waste space.
> >
> >> Oh, wait, yes I do.  If there's a write of 6 blocks, we would have
> >> to split an extent between BG #3 (the first 4 blocks) and BG #2 (the
> >> remaining 2 blocks).  It also flips the usual order of "determine size
> >> of extent, then allocate space for it" which might require major surgery
> >> on the btrfs allocator to implement.
> >
> > I have to point out that in any case the extent is physically interrupted 
> > at the disk-stripe size. Assuming disk-stripe=64KB, if you want to write 
> > 128KB, the first half is written in the first disk, the other in the 2nd 
> > disk.  If you want to write 96kb, the first 64 are written in the first 
> > disk, the last part in the 2nd, only on a different BG.
> > So yes there is a fragmentation from a logical point of view; from a 
> > physical point of view the data is spread on the disks in any case.
> >
> > In any case, you are right, we should gather some data, because the 
> > performance impact is not so clear.
> 
> They're pretty clear, and there's a lot written about small file size
> and parity raid performance being shit, no matter the implementation
> (md, ZFS, Btrfs, hardware maybe less so just because of all the
> caching and extra processing hardware that's dedicated to the task).

Pretty much everything goes fast if you put a faster non-volatile cache
in front of it.

> The linux-raid@ list is full of optimizations for this that are use
> case specific. One of those that often comes up is how badly suited
> raid56 are for e.g. mail servers, tons of small file reads and writes,
> and all the disk contention that comes up, and it's even worse when
> you lose a disk, and even if you're running raid 6 and lose two disks
> it's really god awful. It can be unexpectedly a disqualifying setup
> without prior testing in that condition: can your workload really be
> usable for two or three days in a double degraded state on that raid6?
> *shrug*
> 
> Parity raid is well suited for full stripe reads and writes, lots of
> sequential writes. Ergo a small file is anything less than a full
> stripe write. Of course, delayed allocation can end up making for more
> full stripe writes. But now you have more RMW which is the real
> performance killer, again no matter the raid.

RMW isn't necessary if you have properly configured COW on top.
ZFS doesn't do RMW at all.  OTOH for some workloads COW is a step in a
different wrong direction--the btrfs raid5 problems with nodatacow
files can be solved by stripe logging and nothing else.
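
(For concreteness, a minimal sketch of what "stripe logging" means here--this
is a toy model, not btrfs or mdadm code, and every name in it is made up.
The point is only the ordering: the new strips and parity become durable in a
log before the in-place RMW, so an interrupted stripe update can be replayed
instead of leaving the stripe half old and half new.)

import json, os

LOG = "/tmp/stripe-log"              # stand-in for a dedicated log device/area

def write_stripe_in_place(stripe_no, new_strips, parity):
    pass                             # the actual RMW of the stripe would go here

def log_stripe_update(stripe_no, new_strips, parity):
    entry = {"stripe": stripe_no,
             "strips": {i: s.hex() for i, s in new_strips.items()},
             "parity": parity.hex()}
    with open(LOG, "a") as f:
        f.write(json.dumps(entry) + "\n")
        f.flush()
        os.fsync(f.fileno())         # durable before the stripe is touched

def update_stripe(stripe_no, new_strips, parity):
    log_stripe_update(stripe_no, new_strips, parity)      # 1. journal (data written twice)
    write_stripe_in_place(stripe_no, new_strips, parity)  # 2. in-place RMW
    open(LOG, "w").close()                                # 3. retire the log entry

def replay_after_unclean_shutdown():
    """On the next mount, redo any stripe updates that were still logged."""
    if not os.path.exists(LOG):
        return
    with open(LOG) as f:
        for line in f:
            e = json.loads(line)
            strips = {int(i): bytes.fromhex(s) for i, s in e["strips"].items()}
            write_stripe_in_place(e["stripe"], strips, bytes.fromhex(e["parity"]))
    open(LOG, "w").close()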

Some equivalent of autodefrag that repacks your small RAID stripes
into bigger ones will burn 3x your write IOPS eventually--it just
lets you defer the inevitable until a hopefully more convenient time.
A continuously loaded server never has a more convenient time, so it
needs a different solution.

> > I am not worried about having different BGs; we have problems with these 
> > because we never developed tools to handle this issue properly (i.e. a 
> > daemon which starts a balance when needed). But I hope that this will be 
> > solved in the future.
> >
> > In any case, all the solutions proposed have their trade-offs:
> >
> > - a) as is: write hole bug
> > - b) variable stripe size (like ZFS): big impact on how btrfs handles the 
> > extents; limited waste of space
> > - c) logging data before writing: we write the data twice in a short 
> > time window. Moreover the log area is written several orders of magnitude 
> > more than the other areas; there were some patches around
> > - d) rounding the write up to the stripe size: waste of space; simple to 
> > implement;
> > - e) different BGs with different stripe sizes: limited waste of space; 
> > logical fragmentation.
> 
> I'd say for sure you're 

Re: Status of RAID5/6

2018-04-04 Thread Zygo Blaxell
On Wed, Apr 04, 2018 at 07:15:54AM +0200, Goffredo Baroncelli wrote:
> On 04/04/2018 12:57 AM, Zygo Blaxell wrote:
> >> I have to point out that in any case the extent is physically
> >> interrupted at the disk-stripe size. Assuming disk-stripe=64KB, if
> >> you want to write 128KB, the first half is written in the first disk,
> >> the other in the 2nd disk.  If you want to write 96kb, the first 64
> >> are written in the first disk, the last part in the 2nd, only on a
> >> different BG.
> > The "only on a different BG" part implies something expensive, either
> > a seek or a new erase page depending on the hardware.  Without that,
> > nearby logical blocks are nearby physical blocks as well.
> 
> In any case it happens on a different disk

No it doesn't.  The small-BG could be on the same disk(s) as the big-BG.

> >> So yes there is a fragmentation from a logical point of view; from a
> >> physical point of view the data is spread on the disks in any case.
> 
> > What matters is the extent-tree point of view.  There is (currently)
> > no fragmentation there, even for RAID5/6.  The extent tree is unaware
> > of RAID5/6 (to its peril).
> 
> Earlier you pointed out that writing non-contiguous blocks has
> an impact on performance. I am replying that the switch to a
> different BG happens at the disk-stripe boundary, so in any case the
> block is physically interrupted and switched to another disk

The difference is that the write is switched to a different local address
on the disk.

It's not "another" disk if it's a different BG.  Recall in this plan
there is a full-width BG that is on _every_ disk, which means every
small-width BG shares a disk with the full-width BG.  Every extent tail
write requires a seek on a minimum of two disks in the array for raid5,
three disks for raid6.  A tail that is stripe-width minus one will hit
N - 1 disks twice in an N-disk array.

> However yes: from an extent-tree point of view there will be an increase
> in the number of extents, because the tail of the write is allocated to
> another BG (if the size is not stripe-aligned)
> 
> > If an application does a loop writing 68K then fsync(), the multiple-BG
> > solution adds two seeks to read every 68K.  That's expensive if sequential
> > read bandwidth is more scarce than free space.
> 
> Why do you talk about additional seeks? In any case (even without the
> additional BG) the read happens from other disks

See above:  not another disk, usually a different location on two or
more of the same disks.

> >> * c),d),e) are applied only for the tail of the extent, in case the
> >> size is less than the stripe size.
> > 
> > It's only necessary to split an extent if there are no other writes
> > in the same transaction that could be combined with the extent tail
> > into a single RAID stripe.  As long as everything in the RAID stripe
> > belongs to a single transaction, there is no write hole
> 
> Maybe a simpler optimization would be to close the transaction
> when the data reaches the stripe boundary... But I suspect that it is
> not so simple to implement.

Transactions exist in btrfs to batch up writes into big contiguous extents
already.  The trick is to _not_ do that when one transaction ends and
the next begins, i.e. leave a space at the end of the partially-filled
stripe so that the next transaction begins in an empty stripe.
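
(A sketch of that trick, in case it is unclear--purely illustrative, not the
btrfs allocator, and the numbers assume the 3+2 layout used elsewhere in this
thread: extents are packed back to back within a transaction, and the cursor
is only bumped to the next full-stripe boundary at commit.)

FULL_STRIPE = 3 * 64 * 1024     # data portion of one stripe: 3 data disks x 64KB

class StripeAwareCursor:
    def __init__(self):
        self.pos = 0            # next free byte in the data address space

    def alloc(self, length):
        """Allocate space for an extent inside the current transaction."""
        start = self.pos
        self.pos += length
        return start

    def commit(self):
        """End the transaction: skip the rest of a partially-filled stripe."""
        remainder = self.pos % FULL_STRIPE
        if remainder:
            # gap stays empty; a later balance can repack it
            self.pos += FULL_STRIPE - remainder

cur = StripeAwareCursor()
cur.alloc(68 * 1024)            # a 68KB extent
cur.commit()                    # next transaction starts on a fresh stripe
assert cur.pos == FULL_STRIPE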

This does mean that there will only be extra seeks during transaction
commit and fsync()--which were already very seeky to begin with.  It's
not necessary to write a partial stripe when there are other extents to
combine.

So there will be double the amount of seeking, but depending on the
workload, it could double a very small percentage of writes.

> > Not for d.  Balance doesn't know how to get rid of unreachable blocks
> > in extents (it just moves the entire extent around) so after a balance
> > the writes would still be rounded up to the stripe size.  Balance would
> > never be able to free the rounded-up space.  That space would just be
> > gone until the file was overwritten, deleted, or defragged.
> 
> If balance is capable of moving the extents, why not place one near the
> other during a balance? The goal is not to limit writing the end of
> an extent, but to avoid writing the end of an extent without
> further data (e.g. the gap to the stripe boundary has to be filled in the
> same transaction)

That's plan f (leave gaps in RAID stripes empty).  Balance will repack
short extents into RAID stripes nicely.

Plan d can't do that because plan d overallocates the extent so that
the extent fills the stripe (only some of the extent is used for data).
Small but important difference.

> BR
> G.Baroncelli
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli 
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Status of RAID5/6

2018-04-03 Thread Goffredo Baroncelli
On 04/04/2018 12:57 AM, Zygo Blaxell wrote:
>> I have to point out that in any case the extent is physically
>> interrupted at the disk-stripe size. Assuming disk-stripe=64KB, if
>> you want to write 128KB, the first half is written in the first disk,
>> the other in the 2nd disk.  If you want to write 96kb, the first 64
>> are written in the first disk, the last part in the 2nd, only on a
>> different BG.
> The "only on a different BG" part implies something expensive, either
> a seek or a new erase page depending on the hardware.  Without that,
> nearby logical blocks are nearby physical blocks as well.

In any case it happens on a different disk

> 
>> So yes there is a fragmentation from a logical point of view; from a
>> physical point of view the data is spread on the disks in any case.

> What matters is the extent-tree point of view.  There is (currently)
> no fragmentation there, even for RAID5/6.  The extent tree is unaware
> of RAID5/6 (to its peril).

Earlier you pointed out that writing non-contiguous blocks has an impact on 
performance. I am replying that the switch to a different BG happens at 
the disk-stripe boundary, so in any case the block is physically interrupted 
and switched to another disk

However yes: from an extent-tree point of view there will be an increase in the 
number of extents, because the tail of the write is allocated to another BG (if 
the size is not stripe-aligned)

> If an application does a loop writing 68K then fsync(), the multiple-BG
> solution adds two seeks to read every 68K.  That's expensive if sequential
> read bandwidth is more scarce than free space.

Why do you talk about additional seeks? In any case (even without the 
additional BG) the read happens from other disks

>> * c),d),e) are applied only for the tail of the extent, in case the
>> size is less than the stripe size.
> 
> It's only necessary to split an extent if there are no other writes
> in the same transaction that could be combined with the extent tail
> into a single RAID stripe.  As long as everything in the RAID stripe
> belongs to a single transaction, there is no write hole

Maybe a simpler optimization would be to close the transaction when 
the data reaches the stripe boundary... But I suspect that it is not so simple 
to implement.

> Not for d.  Balance doesn't know how to get rid of unreachable blocks
> in extents (it just moves the entire extent around) so after a balance
> the writes would still be rounded up to the stripe size.  Balance would
> never be able to free the rounded-up space.  That space would just be
> gone until the file was overwritten, deleted, or defragged.

If balance is capable of moving the extents, why not place one near the other 
during a balance? The goal is not to limit writing the end of an extent, but to 
avoid writing the end of an extent without further data (e.g. the gap to the 
stripe boundary has to be filled in the same transaction)

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Status of RAID5/6

2018-04-03 Thread Chris Murphy
On Tue, Apr 3, 2018 at 11:03 AM, Goffredo Baroncelli  wrote:
> On 04/03/2018 02:31 AM, Zygo Blaxell wrote:
>> On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote:
>>> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote:
 On 2018-04-02 11:18, Goffredo Baroncelli wrote:
> I thought that a possible solution is to create BG with different
 number of data disks. E.g. supposing to have a raid 6 system with 6
 disks, where 2 are parity disk; we should allocate 3 BG
> BG #1: 1 data disk, 2 parity disks
> BG #2: 2 data disks, 2 parity disks,
> BG #3: 4 data disks, 2 parity disks
>
> For simplicity, the disk-stripe length is assumed = 4K.
>
> So if you have a write with a length of 4 KB, this should be placed
 in BG#1; if you have a write with a length of 3*4KB = 12KB, the first 8KB
 should be placed in BG#2, then the rest in BG#1.
> This would avoid space wasting, even if the fragmentation will
 increase (but does the fragmentation matter with modern solid
 state disks ?).
>>> I don't really see why this would increase fragmentation or waste space.
>
>> Oh, wait, yes I do.  If there's a write of 6 blocks, we would have
>> to split an extent between BG #3 (the first 4 blocks) and BG #2 (the
>> remaining 2 blocks).  It also flips the usual order of "determine size
>> of extent, then allocate space for it" which might require major surgery
>> on the btrfs allocator to implement.
>
> I have to point out that in any case the extent is physically interrupted at 
> the disk-stripe size. Assuming disk-stripe=64KB, if you want to write 128KB, 
> the first half is written in the first disk, the other in the 2nd disk.  If 
> you want to write 96kb, the first 64 are written in the first disk, the last 
> part in the 2nd, only on a different BG.
> So yes there is a fragmentation from a logical point of view; from a physical 
> point of view the data is spread on the disks in any case.
>
> In any case, you are right, we should gather some data, because the 
> performance impact is not so clear.

They're pretty clear, and there's a lot written about small file size
and parity raid performance being shit, no matter the implementation
(md, ZFS, Btrfs, hardware maybe less so just because of all the
caching and extra processing hardware that's dedicated to the task).

The linux-raid@ list is full of optimizations for this that are use
case specific. One of those that often comes up is how badly suited
raid56 are for e.g. mail servers, tons of small file reads and writes,
and all the disk contention that comes up, and it's even worse when
you lose a disk, and even if you're running raid 6 and lose two disks
it's really god awful. It can be unexpectedly a disqualifying setup
without prior testing in that condition: can your workload really be
usable for two or three days in a double degraded state on that raid6?
*shrug*

Parity raid is well suited for full stripe reads and writes, lots of
sequential writes. Ergo a small file is anything less than a full
stripe write. Of course, delayed allocation can end up making for more
full stripe writes. But now you have more RMW which is the real
performance killer, again no matter the raid.


>
> I am not worried about having different BGs; we have problems with these because 
> we never developed tools to handle this issue properly (i.e. a daemon which 
> starts a balance when needed). But I hope that this will be solved in the future.
>
> In any case, all the solutions proposed have their trade-offs:
>
> - a) as is: write hole bug
> - b) variable stripe size (like ZFS): big impact on how btrfs handles the 
> extents; limited waste of space
> - c) logging data before writing: we write the data twice in a short time 
> window. Moreover the log area is written several orders of magnitude more than 
> the other areas; there were some patches around
> - d) rounding the write up to the stripe size: waste of space; simple to 
> implement;
> - e) different BGs with different stripe sizes: limited waste of space; logical 
> fragmentation.

I'd say for sure you're worse off with metadata raid5 vs metadata
raid1. And if there are many devices you might be better off with
metadata raid1 even on a raid6, it's not an absolute certainty you
lose the file system with a 2nd drive failure - it depends on the
device and what chunk copies happen to be on it. But at the least if
you have a script or some warning you can relatively easily rebalance
... HMMM

Actually that should be a test. Single drive degraded raid6 with
metadata raid1, can you do a metadata only balance to force the
missing copy of metadata to be replicated again? In theory this should
be quite fast.



-- 
Chris Murphy


Re: Status of RAID5/6

2018-04-03 Thread Zygo Blaxell
On Tue, Apr 03, 2018 at 07:03:06PM +0200, Goffredo Baroncelli wrote:
> On 04/03/2018 02:31 AM, Zygo Blaxell wrote:
> > On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote:
> >> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote:
> >>> On 2018-04-02 11:18, Goffredo Baroncelli wrote:
>  I thought that a possible solution is to create BG with different
> >>> number of data disks. E.g. supposing to have a raid 6 system with 6
> >>> disks, where 2 are parity disk; we should allocate 3 BG
>  BG #1: 1 data disk, 2 parity disks
>  BG #2: 2 data disks, 2 parity disks,
>  BG #3: 4 data disks, 2 parity disks
> 
>  For simplicity, the disk-stripe length is assumed = 4K.
> 
>  So if you have a write with a length of 4 KB, this should be placed
> >>> in BG#1; if you have a write with a length of 3*4KB = 12KB, the first 8KB
> >>> should be placed in BG#2, then the rest in BG#1.
>  This would avoid space wasting, even if the fragmentation will
> >>> increase (but does the fragmentation matter with modern solid
> >>> state disks ?).
> >> I don't really see why this would increase fragmentation or waste space.
> 
> > Oh, wait, yes I do.  If there's a write of 6 blocks, we would have
> > to split an extent between BG #3 (the first 4 blocks) and BG #2 (the
> > remaining 2 blocks).  It also flips the usual order of "determine size
> > of extent, then allocate space for it" which might require major surgery
> > on the btrfs allocator to implement.
> 
> I have to point out that in any case the extent is physically
> interrupted at the disk-stripe size. Assuming disk-stripe=64KB, if
> you want to write 128KB, the first half is written in the first disk,
> the other in the 2nd disk.  If you want to write 96kb, the first 64
> are written in the first disk, the last part in the 2nd, only on a
> different BG.

The "only on a different BG" part implies something expensive, either
a seek or a new erase page depending on the hardware.  Without that,
nearby logical blocks are nearby physical blocks as well.

> So yes there is a fragmentation from a logical point of view; from a
> physical point of view the data is spread on the disks in any case.

What matters is the extent-tree point of view.  There is (currently)
no fragmentation there, even for RAID5/6.  The extent tree is unaware
of RAID5/6 (to its peril).

ZFS makes its thing-like-the-extent-tree aware of RAID5/6, and it can
put a stripe of any size anywhere.  If we're going to do that in btrfs,
you might as well just do what ZFS does.

OTOH, variable-size block groups give us read-compatibility with old
kernel versions (and write-compatibility for that matter--a kernel that
didn't know about the BG separation would just work but have write hole).

If an application does a loop writing 68K then fsync(), the multiple-BG
solution adds two seeks to read every 68K.  That's expensive if sequential
read bandwidth is more scarce than free space.

> In any case, you are right, we should gather some data, because the
> performance impact is not so clear.
> 
> I am not worried about having different BGs; we have problems with these
> because we never developed tools to handle this issue properly (i.e. a
> daemon which starts a balance when needed). But I hope that this will
> be solved in the future.

Balance daemons are easy to the point of being trivial to write in Python.

The balancing itself is quite expensive and invasive:  can't usefully
ionice it, can only abort it on block group boundaries, can't delete
snapshots while it's running.

If balance could be given a vrange that was the size of one extent...then
we could talk about daemons.

> In any case, all the solutions proposed have their trade-offs:
> 
> - a) as is: write hole bug
> - b) variable stripe size (like ZFS): big impact on how btrfs handles
> the extents; limited waste of space
> - c) logging data before writing: we write the data twice in a
> short time window. Moreover the log area is written several orders of
> magnitude more than the other areas; there were some patches around
> - d) rounding the write up to the stripe size: waste of space; simple
> to implement;
> - e) different BGs with different stripe sizes: limited waste of space;
> logical fragmentation.

Also:

  - f) avoiding writes to partially filled stripes: free space
  fragmentation; simple to implement (ssd_spread does it accidentally)

The difference between d) and f) is that d) allocates the space to the
extent while f) leaves the space unallocated, but skips any free space
fragments smaller than the stripe size when allocating.  f) gets the
space back with a balance (i.e. it is exactly as space-efficient as (a)
after balance).
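
(A tiny sketch of the f) policy, just to make the allocator behaviour
concrete--illustrative only, not btrfs code, with the 3+2 geometry assumed:)

FULL_STRIPE = 3 * 64 * 1024          # data bytes per stripe: 3 data disks x 64KB

def pick_free_fragment(free_fragments, need):
    """free_fragments: (start, length) holes in the data address space."""
    for start, length in free_fragments:
        if length >= max(FULL_STRIPE, need):
            return start             # only stripe-sized (or larger) holes are used
    return None                      # otherwise allocate a fresh chunk

# A 32KB hole left inside an old stripe is skipped; the 1MB hole is used.
holes = [(0, 32 * 1024), (4 * 1024 * 1024, 1024 * 1024)]
assert pick_free_fragment(holes, 68 * 1024) == 4 * 1024 * 1024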

> * c),d),e) are applied only for the tail of the extent, in case the
> size is less than the stripe size.

It's only necessary to split an extent if there are no other writes
in the same transaction that could be combined with the extent tail
into a single RAID stripe.  As long as 

Re: Status of RAID5/6

2018-04-03 Thread Goffredo Baroncelli
On 04/03/2018 02:31 AM, Zygo Blaxell wrote:
> On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote:
>> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote:
>>> On 2018-04-02 11:18, Goffredo Baroncelli wrote:
 I thought that a possible solution is to create BG with different
>>> number of data disks. E.g. supposing to have a raid 6 system with 6
>>> disks, where 2 are parity disk; we should allocate 3 BG
 BG #1: 1 data disk, 2 parity disks
 BG #2: 2 data disks, 2 parity disks,
 BG #3: 4 data disks, 2 parity disks

 For simplicity, the disk-stripe length is assumed = 4K.

 So if you have a write with a length of 4 KB, this should be placed
>>> in BG#1; if you have a write with a length of 3*4KB = 12KB, the first 8KB
>>> should be placed in BG#2, then the rest in BG#1.
 This would avoid space wasting, even if the fragmentation will
>>> increase (but does the fragmentation matter with modern solid
>>> state disks ?).
>> I don't really see why this would increase fragmentation or waste space.

> Oh, wait, yes I do.  If there's a write of 6 blocks, we would have
> to split an extent between BG #3 (the first 4 blocks) and BG #2 (the
> remaining 2 blocks).  It also flips the usual order of "determine size
> of extent, then allocate space for it" which might require major surgery
> on the btrfs allocator to implement.

I have to point out that in any case the extent is physically interrupted at 
the disk-stripe size. Assuming disk-stripe=64KB, if you want to write 128KB, 
the first half is written in the first disk, the other in the 2nd disk.  If you 
want to write 96kb, the first 64 are written in the first disk, the last part 
in the 2nd, only on a different BG.
So yes there is a fragmentation from a logical point of view; from a physical 
point of view the data is spread on the disks in any case.

In any case, you are right, we should gather some data, because the performance 
impact is not so clear.

I am not worried about having different BGs; we have problems with these because 
we never developed tools to handle this issue properly (i.e. a daemon which 
starts a balance when needed). But I hope that this will be solved in the future.

In any case, all the solutions proposed have their trade-offs:

- a) as is: write hole bug
- b) variable stripe size (like ZFS): big impact on how btrfs handles the 
extents; limited waste of space
- c) logging data before writing: we write the data twice in a short time 
window. Moreover the log area is written several orders of magnitude more than 
the other areas; there were some patches around
- d) rounding the write up to the stripe size: waste of space; simple to 
implement;
- e) different BGs with different stripe sizes: limited waste of space; logical 
fragmentation.


* c),d),e) are applied only to the tail of the extent, in case the size is 
less than the stripe size.
* for b),d),e), the waste of space may be reduced with a balance

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Status of RAID5/6

2018-04-02 Thread Zygo Blaxell
On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote:
> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote:
> > On 2018-04-02 11:18, Goffredo Baroncelli wrote:
> > > I thought that a possible solution is to create BG with different
> > number of data disks. E.g. supposing to have a raid 6 system with 6
> > disks, where 2 are parity disk; we should allocate 3 BG
> > > 
> > > BG #1: 1 data disk, 2 parity disks
> > > BG #2: 2 data disks, 2 parity disks,
> > > BG #3: 4 data disks, 2 parity disks
> > > 
> > > For simplicity, the disk-stripe length is assumed = 4K.
> > > 
> > > So if you have a write with a length of 4 KB, this should be placed
> > in BG#1; if you have a write with a length of 3*4KB = 12KB, the first 8KB
> > should be placed in BG#2, then the rest in BG#1.
> > > 
> > > This would avoid space wasting, even if the fragmentation will
> > increase (but does the fragmentation matter with modern solid
> > state disks ?).
> 
> I don't really see why this would increase fragmentation or waste space.

Oh, wait, yes I do.  If there's a write of 6 blocks, we would have
to split an extent between BG #3 (the first 4 blocks) and BG #2 (the
remaining 2 blocks).  It also flips the usual order of "determine size
of extent, then allocate space for it" which might require major surgery
on the btrfs allocator to implement.

If we round that write up to 8 blocks (so we can put both pieces in
BG #3), it degenerates into the "pretend partially filled RAID stripes
are completely full" case, something like what ssd_spread already does.
That trades less file fragmentation for more free space fragmentation.

> The extent size is determined before allocation anyway, all that changes
> in this proposal is where those small extents ultimately land on the disk.
> 
> If anything, it might _reduce_ fragmentation since everything in BG #1
> and BG #2 will be of uniform size.
> 
> It does solve write hole (one transaction per RAID stripe).
> 
> > Also, you're still going to be wasting space, it's just that less space will
> > be wasted, and it will be wasted at the chunk level instead of the block
> > level, which opens up a whole new set of issues to deal with, most
> > significantly that it becomes functionally impossible without brute-force
> > search techniques to determine when you will hit the common-case of -ENOSPC
> > due to being unable to allocate a new chunk.
> 
> Hopefully the allocator only keeps one of each size of small block groups
> around at a time.  The allocator can take significant short cuts because
> the size of every extent in the small block groups is known (they are
> all the same size by definition).
> 
> When a small block group fills up, the next one should occupy the
> most-empty subset of disks--which is the opposite of the usual RAID5/6
> allocation policy.  This will probably lead to "interesting" imbalances
> since there are now two allocators on the filesystem with different goals
> (though it is no worse than -draid5 -mraid1, and I had no problems with
> free space when I was running that).
> 
> There will be an increase in the amount of allocated but not usable space,
> though, because now the amount of free space depends on how much data
> is batched up before fsync() or sync().  Probably best to just not count
> any space in the small block groups as 'free' in statvfs terms at all.
> 
> There are a lot of variables implied there.  Without running some
> simulations I have no idea if this is a good idea or not.
> 
> > > Time to time, a re-balance should be performed to empty the BG #1,
> > and #2. Otherwise a new BG should be allocated.
> 
> That shouldn't be _necessary_ (the filesystem should just allocate
> whatever BGs it needs), though it will improve storage efficiency if it
> is done.
> 
> > > The cost should be comparable to the logging/journaling (each
> > data shorter than a full-stripe, has to be written two times); the
> > implementation should be quite easy, because already NOW btrfs support
> > BG with different set of disks.
> 




Re: Status of RAID5/6

2018-04-02 Thread Zygo Blaxell
On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote:
> On 2018-04-02 11:18, Goffredo Baroncelli wrote:
> > On 04/02/2018 07:45 AM, Zygo Blaxell wrote:
> > [...]
> > > It is possible to combine writes from a single transaction into full
> > > RMW stripes, but this *does* have an impact on fragmentation in btrfs.
> > > Any partially-filled stripe is effectively read-only and the space within
> > > it is inaccessible until all data within the stripe is overwritten,
> > > deleted, or relocated by balance.
> > > 
> > > btrfs could do a mini-balance on one RAID stripe instead of a RMW stripe
> > > update, but that has a significant write magnification effect (and before
> > > kernel 4.14, non-trivial CPU load as well).
> > > 
> > > btrfs could also just allocate the full stripe to an extent, but emit
> > > only extent ref items for the blocks that are in use.  No fragmentation
> > > but lots of extra disk space used.  Also doesn't quite work the same
> > > way for metadata pages.
> > > 
> > > If btrfs adopted the ZFS approach, the extent allocator and all higher
> > > layers of the filesystem would have to know about--and skip over--the
> > > parity blocks embedded inside extents.  Making this change would mean
> > > that some btrfs RAID profiles start interacting with stuff like balance
> > > and compression which they currently do not.  It would create a new
> > > block group type and require an incompatible on-disk format change for
> > > both reads and writes.
> > 
> > I thought that a possible solution is to create BG with different
> number of data disks. E.g. supposing to have a raid 6 system with 6
> disks, where 2 are parity disk; we should allocate 3 BG
> > 
> > BG #1: 1 data disk, 2 parity disks
> > BG #2: 2 data disks, 2 parity disks,
> > BG #3: 4 data disks, 2 parity disks
> > 
> > For simplicity, the disk-stripe length is assumed = 4K.
> > 
> > So if you have a write with a length of 4 KB, this should be placed
> in BG#1; if you have a write with a length of 3*4KB = 12KB, the first 8KB
> should be placed in BG#2, then the rest in BG#1.
> > 
> > This would avoid space wasting, even if the fragmentation will
> increase (but does the fragmentation matter with modern solid
> state disks ?).

I don't really see why this would increase fragmentation or waste space.
The extent size is determined before allocation anyway, all that changes
in this proposal is where those small extents ultimately land on the disk.

If anything, it might _reduce_ fragmentation since everything in BG #1
and BG #2 will be of uniform size.

It does solve write hole (one transaction per RAID stripe).

> Also, you're still going to be wasting space, it's just that less space will
> be wasted, and it will be wasted at the chunk level instead of the block
> level, which opens up a whole new set of issues to deal with, most
> significantly that it becomes functionally impossible without brute-force
> search techniques to determine when you will hit the common-case of -ENOSPC
> due to being unable to allocate a new chunk.

Hopefully the allocator only keeps one of each size of small block groups
around at a time.  The allocator can take significant short cuts because
the size of every extent in the small block groups is known (they are
all the same size by definition).

When a small block group fills up, the next one should occupy the
most-empty subset of disks--which is the opposite of the usual RAID5/6
allocation policy.  This will probably lead to "interesting" imbalances
since there are now two allocators on the filesystem with different goals
(though it is no worse than -draid5 -mraid1, and I had no problems with
free space when I was running that).

There will be an increase in the amount of allocated but not usable space,
though, because now the amount of free space depends on how much data
is batched up before fsync() or sync().  Probably best to just not count
any space in the small block groups as 'free' in statvfs terms at all.

There are a lot of variables implied there.  Without running some
simulations I have no idea if this is a good idea or not.

> > Time to time, a re-balance should be performed to empty the BG #1,
> and #2. Otherwise a new BG should be allocated.

That shouldn't be _necessary_ (the filesystem should just allocate
whatever BGs it needs), though it will improve storage efficiency if it
is done.

> > The cost should be comparable to the logging/journaling (each
> data shorter than a full-stripe, has to be written two times); the
> implementation should be quite easy, because already NOW btrfs support
> BG with different set of disks.



Re: Status of RAID5/6

2018-04-02 Thread Goffredo Baroncelli
On 04/02/2018 07:45 AM, Zygo Blaxell wrote:
[...]
> It is possible to combine writes from a single transaction into full
> RMW stripes, but this *does* have an impact on fragmentation in btrfs.
> Any partially-filled stripe is effectively read-only and the space within
> it is inaccessible until all data within the stripe is overwritten,
> deleted, or relocated by balance.
>
> btrfs could do a mini-balance on one RAID stripe instead of a RMW stripe
> update, but that has a significant write magnification effect (and before
> kernel 4.14, non-trivial CPU load as well).
> 
> btrfs could also just allocate the full stripe to an extent, but emit
> only extent ref items for the blocks that are in use.  No fragmentation
> but lots of extra disk space used.  Also doesn't quite work the same
> way for metadata pages.
> 
> If btrfs adopted the ZFS approach, the extent allocator and all higher
> layers of the filesystem would have to know about--and skip over--the
> parity blocks embedded inside extents.  Making this change would mean
> that some btrfs RAID profiles start interacting with stuff like balance
> and compression which they currently do not.  It would create a new
> block group type and require an incompatible on-disk format change for
> both reads and writes.

I thought that a possible solution is to create BGs with different numbers of 
data disks. E.g. supposing we have a raid 6 system with 6 disks, where 2 are 
parity disks; we should allocate 3 BGs:

BG #1: 1 data disk, 2 parity disks
BG #2: 2 data disks, 2 parity disks,
BG #3: 4 data disks, 2 parity disks

For simplicity, the disk-stripe length is assumed = 4K.

So if you have a write with a length of 4 KB, it should be placed in BG#1; if 
you have a write with a length of 3*4KB = 12KB, the first 8KB should be placed 
in BG#2 and the rest in BG#1.
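
Something like this placement rule (just a sketch of the idea, not a proposed
implementation; the BG widths and the greedy splitting are as described above):

BG_WIDTHS = [4, 2, 1]                  # data disks in BG#3, BG#2, BG#1
BLOCK = 4 * 1024                       # disk-stripe length assumed = 4K

def split_write(length_bytes):
    blocks = -(-length_bytes // BLOCK)           # round up to 4K blocks
    plan = []
    for width in BG_WIDTHS:
        while blocks >= width:
            plan.append((width, width * BLOCK))  # one full stripe in this BG
            blocks -= width
    return plan                                  # widths 4, 2, 1 cover any count

print(split_write(12 * 1024))   # [(2, 8192), (1, 4096)]: 8KB in BG#2, 4KB in BG#1
print(split_write(24 * 1024))   # [(4, 16384), (2, 8192)]: BG#3 then BG#2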

This would avoid space wasting, even if the fragmentation will increase (but 
does the fragmentation matter with modern solid state disks?).

From time to time, a re-balance should be performed to empty BG #1 and #2. 
Otherwise a new BG should be allocated.

The cost should be comparable to logging/journaling (each write shorter than 
a full stripe has to be written twice); the implementation should be quite 
easy, because btrfs already NOW supports BGs with different sets of disks.

BR 
G.Baroncelli


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Status of RAID5/6

2018-04-01 Thread Zygo Blaxell
On Sun, Apr 01, 2018 at 03:11:04PM -0600, Chris Murphy wrote:
> (I hate it when my palm rubs the trackpad and hits send prematurely...)
> 
> 
> On Sun, Apr 1, 2018 at 2:51 PM, Chris Murphy  wrote:
> 
> >> Users can run scrub immediately after _every_ unclean shutdown to
> >> reduce the risk of inconsistent parity and unrecoverable data should
> >> a disk fail later, but this can only prevent future write hole events,
> >> not recover data lost during past events.
> >
> > Problem is, Btrfs assumes a leaf is correct if it passes checksum. And
> > such a leaf containing EXTENT_CSUM means that EXTENT_CSUM
> 
> means that EXTENT_CSUM is assumed to be correct. But in fact it could
> be stale. It's just as possible the metadata and superblock update is
> what's missing due to the interruption, while both data and parity
> strip writes succeeded. The window for either the data or parity write
> to fail is way shorter of a time interval, than that of the numerous
> metadata writes, followed by superblock update. 

csums cannot be wrong due to write interruption.  The data and metadata
blocks are written first, then barrier, then superblock updates pointing
to the data and csums previously written in the same transaction.
Unflushed data is not included in the metadata.  If there is a write
interruption then the superblock update doesn't occur and btrfs reverts
to the previous unmodified data+csum trees.

This works on non-raid5/6 because all the writes that make up a
single transaction are ordered and independent, and no data from older
transactions is modified during any tree update.

On raid5/6 every RMW operation modifies data from old transactions
by creating data/parity inconsistency.  If there was no data in the
stripe from an old transaction, the operation would be just a write,
no read and modify.  In the write hole case, the csum *is* correct,
it is the data that is wrong.
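
A toy illustration of that last point (raid5 with XOR parity, nothing to do
with the actual btrfs code paths):

from zlib import crc32

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

d1_old = b"old data in D1  "         # committed in an earlier transaction
d2_old = b"old data in D2  "
parity = xor(d1_old, d2_old)         # consistent stripe: P = D1 ^ D2
csum_d1 = crc32(d1_old)              # csum tree entry for D1, still valid

d2_new = b"new data in D2!!"
# Interrupted RMW: the new D2 reaches the disk, the parity update does not,
# so the parity on disk still equals d1_old ^ d2_old.

# Later the disk holding D1 dies; D1 is "reconstructed" from what is left:
d1_rebuilt = xor(d2_new, parity)

print(d1_rebuilt == d1_old)          # False: old data destroyed by the write hole
print(crc32(d1_rebuilt) == csum_d1)  # False: the csum is correct, the data is wrong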

> In such a case, the
> old metadata is what's pointed to, including EXTENT_CSUM. Therefore
> your scrub would always show csum error, even if both data and parity
> are correct. You'd have to init-csum in this case, I suppose.

No, the csums are correct.  The data does not match the csum because the
data is corrupted.  Assuming barriers work on your disk, and you're not
having some kind of direct IO data consistency bug, and you can read the
csum tree at all, then the csums are correct, even with write hole.

When write holes and other write interruption patterns affect the csum
tree itself, this results in parent transid verify failures, csum tree
page csum failures, or both.  This forces the filesystem read-only so
it's easy to spot when it happens.

Note that the data blocks with wrong csum from raid5/6 reconstruction
after a write hole event always belong to _old_ transactions damaged
by the write hole.  If the writes are interrupted, the new data blocks
in a RMW stripe will not be committed and will have no csums to verify,
so they can't have _wrong_ csums.  The old data blocks do not have their
csum changed by the write hole (the csum is stored on a separate tree
in a different block group) so the csums are intact.  When a write hole
event corrupts the data reconstruction on a degraded array, the csum
doesn't match because the csum is correct and the data is not.

> Pretty much it's RMW with a (partial) stripe overwrite upending COW,
> and therefore upending the atomicity, and thus consistency of Btrfs in
> the raid56 case where any portion of the transaction is interrupted.

Not any portion, only the RMW stripe update can produce data loss due
to write interruption (well, that, and fsync() log-tree replay bugs).

If any other part of the transaction is interrupted then btrfs recovers
just fine with its COW tree update algorithm and write barriers.

> And this is amplified if metadata is also raid56.

Data and metadata are mangled the same way.  The difference is the impact.

btrfs tolerates exactly 0 bits of damaged metadata after RAID recovery,
and enforces this intolerance with metadata transids and csums, so write
hole on metadata _always_ breaks the filesystem.

> ZFS avoids the problem at the expense of probably a ton of
> fragmentation, by taking e.g. 4KiB RMW and writing a full length
> stripe of 8KiB fully COW, rather than doing stripe modification with
> an overwrite. And that's because it has dynamic stripe lengths. 

I think that's technically correct but could be clearer.

ZFS never does RMW.  It doesn't need to.  Parity blocks are allocated
at the extent level and RAID stripes are built *inside* the extents (or
"groups of contiguous blocks written in a single transaction" which
seems to be the closest ZFS equivalent of the btrfs extent concept).

Since every ZFS RAID stripe is bespoke sized to exactly fit a single
write operation, no two ZFS transactions can ever share a RAID stripe.
No transactions sharing a stripe means no write hole.
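
Roughly (a sketch of the accounting only, not actual ZFS code; raidz1-like,
one parity column per row is assumed):

NPARITY = 1

def blocks_for_write(n_data_blocks, data_disks):
    """Blocks the allocation occupies: the data plus per-row parity."""
    rows = -(-n_data_blocks // data_disks)       # ceil division
    return n_data_blocks + rows * NPARITY

# A 1-block write occupies 2 blocks, a 3-block write 4, a 7-block write on
# 5 data disks 9 -- each allocation is self-contained, so nothing is shared
# between transactions and there is nothing to RMW.
print(blocks_for_write(1, 5), blocks_for_write(3, 5), blocks_for_write(7, 5))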

There is no impact on fragmentation on ZFS--space is 

Re: Status of RAID5/6

2018-04-01 Thread Chris Murphy
(I hate it when my palm rubs the trackpad and hits send prematurely...)


On Sun, Apr 1, 2018 at 2:51 PM, Chris Murphy  wrote:

>> Users can run scrub immediately after _every_ unclean shutdown to
>> reduce the risk of inconsistent parity and unrecoverable data should
>> a disk fail later, but this can only prevent future write hole events,
>> not recover data lost during past events.
>
> Problem is, Btrfs assumes a leaf is correct if it passes checksum. And
> such a leaf containing EXTENT_CSUM means that EXTENT_CSUM

means that EXTENT_CSUM is assumed to be correct. But in fact it could
be stale. It's just as possible the metadata and superblock update is
what's missing due to the interruption, while both data and parity
strip writes succeeded. The window for either the data or parity write
to fail is way shorter of a time interval, than that of the numerous
metadata writes, followed by superblock update. In such a case, the
old metadata is what's pointed to, including EXTENT_CSUM. Therefore
your scrub would always show csum error, even if both data and parity
are correct. You'd have to init-csum in this case, I suppose.

Pretty much it's RMW with a (partial) stripe overwrite upending COW,
and therefore upending the atomicity, and thus consistency of Btrfs in
the raid56 case where any portion of the transaction is interrupted.

And this is amplified if metadata is also raid56.

ZFS avoids the problem at the expense of probably a ton of
fragmentation, by taking e.g. 4KiB RMW and writing a full length
stripe of 8KiB fully COW, rather than doing stripe modification with
an overwrite. And that's because it has dynamic stripe lengths. For
Btrfs to always do COW would mean that 4KiB change goes into a new
full stripe, 64KiB * num devices, assuming no other changes are ready
at commit time.

So yeah, avoiding the problem is best. But if it's going to be a
journal, it's going to make things pretty damn slow I'd think, unless
the journal can be explicitly placed something faster than the array,
like an SSD/NVMe device. And that's what mdadm allows and expects.



-- 
Chris Murphy


Re: Status of RAID5/6

2018-04-01 Thread Chris Murphy
On Sat, Mar 31, 2018 at 9:45 PM, Zygo Blaxell
 wrote:
> On Sat, Mar 31, 2018 at 04:34:58PM -0600, Chris Murphy wrote:

>> Write hole happens on disk in Btrfs, but the ensuing corruption on
>> rebuild is detected. Corrupt data never propagates.
>
> Data written with nodatasum or nodatacow is corrupted without detection
> (same as running ext3/ext4/xfs on top of mdadm raid5 without a parity
> journal device).

Yeah I guess I'm not very worried about nodatasum/nodatacow if the
user isn't. Perhaps it's not a fair bias, but bias nonetheless.


>
> Metadata always has csums, and files have checksums if they are created
> with default attributes and mount options.  Those cases are covered,
> any corrupted data will give EIO on reads (except once per 4 billion
> blocks, where the corrupted CRC matches at random).
>
>> The problem is that Btrfs gives up when it's detected.
>
> Before recent kernels (4.14 or 4.15) btrfs would not attempt all possible
> combinations of recovery blocks for raid6, and earlier kernels than
> those would not recover correctly for raid5 either.  I think this has
> all been fixed in recent kernels but I haven't tested these myself so
> don't quote me on that.

Looks like 4.15
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/diff/fs/btrfs/raid56.c?id=v4.15=v4.14

And those parts aren't yet backported to 4.14
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/diff/fs/btrfs/raid56.c?id=v4.15.15=v4.14.32

And more in 4.16
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/diff/fs/btrfs/raid56.c?id=v4.16-rc7=v4.15


>
>> If it assumes just a bit flip - not always a correct assumption but
>> might be reasonable most of the time, it could iterate very quickly.
>
> That is not how write hole works (or csum recovery for that matter).
> Write hole producing a single bit flip would occur extremely rarely
> outside of contrived test cases.

Yes, what I wrote is definitely wrong, and I know better. I guess I
had a torn write in my brain!



> Users can run scrub immediately after _every_ unclean shutdown to
> reduce the risk of inconsistent parity and unrecoverable data should
> a disk fail later, but this can only prevent future write hole events,
> not recover data lost during past events.

Problem is, Btrfs assumes a leaf is correct if it passes checksum. And
such a leaf containing EXTENT_CSUM means that EXTENT_CSUM




>
> If one of the data blocks is not available, its content cannot be
> recomputed from parity due to the inconsistency within the stripe.
> This will likely be detected as a csum failure (unless the data block
> is part of a nodatacow/nodatasum file, in which case corruption occurs
> but is not detected) except for the one time out of 4 billion when
> two CRC32s on random data match at random.
>
> If a damaged block contains btrfs metadata, the filesystem will be
> severely affected:  read-only, up to 100% of data inaccessible, only
> recovery methods involving brute force search will work.
>
>> Flip bit, and recompute and compare checksum. It doesn't have to
>> iterate across 64KiB times the number of devices. It really only has
>> to iterate bit flips on the particular 4KiB block that has failed csum
>> (or in the case of metadata, 16KiB for the default leaf size, up to a
>> max of 64KiB).
>
> Write hole is effectively 32768 possible bit flips in a 4K block--assuming
> only one block is affected, which is not very likely.  Each disk in an
> array can have dozens of block updates in flight when an interruption
> occurs, so there can be millions of bits corrupted in a single write
> interruption event (and dozens of opportunities to encounter the nominally
> rare write hole itself).
>
> An experienced forensic analyst armed with specialized tools, a database
> of file formats, and a recent backup of the filesystem might be able to
> recover the damaged data or deduce what it was.  btrfs, being only mere
> software running in the kernel, cannot.
>
> There are two ways to solve the write hole problem and this is not one
> of them.
>
>> That's a maximum of 4096 iterations and comparisons. It'd be quite
>> fast. And going for two bit flips while a lot slower is probably not
>> all that bad either.
>
> You could use that approach to fix a corrupted parity or data block
> on a degraded array, but not a stripe that has data blocks destroyed
> by an update with a write hole event.  Also this approach assumes that
> whatever is flipping bits in RAM is not in and of itself corrupting data
> or damaging the filesystem in unrecoverable ways, but most RAM-corrupting
> agents in the real world do not limit themselves only to detectable and
> recoverable mischief.
>
> Aside:  As a best practice, if you see one-bit corruptions on your
> btrfs filesystem, it is time to start replacing hardware, possibly also
> finding a new hardware vendor or model (assuming the corruption is coming
> from hardware, not a kernel 

Re: Status of RAID5/6

2018-03-31 Thread Zygo Blaxell
On Sat, Mar 31, 2018 at 04:34:58PM -0600, Chris Murphy wrote:
> On Sat, Mar 31, 2018 at 12:57 AM, Goffredo Baroncelli
>  wrote:
> > On 03/31/2018 07:03 AM, Zygo Blaxell wrote:
>  btrfs has no optimization like mdadm write-intent bitmaps; recovery
>  is always a full-device operation.  In theory btrfs could track
>  modifications at the chunk level but this isn't even specified in the
>  on-disk format, much less implemented.
> >>> It could go even further; it would be sufficient to track which
> >>> *partial* stripe updates will be performed before a commit, in one
> >>> of the btrfs logs. Then in case of a mount of an unclean filesystem,
> >>> a scrub on these stripes would be sufficient.
> >
> >> A scrub cannot fix a raid56 write hole--the data is already lost.
> >> The damaged stripe updates must be replayed from the log.
> >
> > Your statement is correct, but you don't consider the COW nature of btrfs.
> >
> > The key is that if a data write is interrupted, the whole transaction is
> > interrupted and aborted. And due to the COW nature of btrfs, the "old
> > state" is restored at the next reboot.
> >
> > What is needed in any case is a rebuild of parity to avoid the "write-hole"
> > bug.
> 
> Write hole happens on disk in Btrfs, but the ensuing corruption on
> rebuild is detected. Corrupt data never propagates. 

Data written with nodatasum or nodatacow is corrupted without detection
(same as running ext3/ext4/xfs on top of mdadm raid5 without a parity
journal device).

Metadata always has csums, and files have checksums if they are created
with default attributes and mount options.  Those cases are covered,
any corrupted data will give EIO on reads (except once per 4 billion
blocks, where the corrupted CRC matches at random).

> The problem is that Btrfs gives up when it's detected.

Before recent kernels (4.14 or 4.15) btrfs would not attempt all possible
combinations of recovery blocks for raid6, and earlier kernels than
those would not recover correctly for raid5 either.  I think this has
all been fixed in recent kernels but I haven't tested these myself so
don't quote me on that.

Other than that, btrfs doesn't give up in the write hole case.
It rebuilds the data according to the raid5/6 parity algorithm, but
the algorithm doesn't produce correct data for interrupted RMW writes
when there is no stripe update journal.  There is nothing else to try
at that point.  By the time the error is detected the opportunity to
recover the data has long passed.

The data that comes out of the recovery algorithm is a mixture of old
and new data from the filesystem.  The "new" data is something that
was written just before a failure, but the "old" data could be data
of any age, even a block of free space, that previously existed on the
filesystem.  If you bypass the EIO from the failing csums (e.g. by using
btrfs rescue) it will appear as though someone took the XOR of pairs of
random blocks from the disk and wrote it over one of the data blocks
at random.  When this happens to btrfs metadata, it is effectively a
fuzz tester for tools like 'btrfs check' which will often splat after
a write hole failure happens.
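A toy sketch of that failure mode (illustrative Python, not btrfs code): one
raid5 stripe in which a data block is rewritten but the matching parity write
is lost, followed by a degraded rebuild.

```python
# Toy demonstration of the raid5 write hole.  One stripe on a 3-disk raid5:
# two data blocks (d0, d1) plus their parity block.
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

d0     = bytes.fromhex("11111111")   # committed data from an old transaction
d1_old = bytes.fromhex("22222222")   # committed data from an old transaction
parity = xor(d0, d1_old)             # stripe is consistent on disk

d1_new = bytes.fromhex("33333333")   # RMW rewrites d1 for a new transaction...
# ...but the matching parity write is lost (crash/power failure), so the
# on-disk stripe is now d0, d1_new and a stale parity block.

# Later the disk holding d0 fails; raid5 "recovers" d0 from what survives:
d0_rebuilt = xor(parity, d1_new)

print("original d0:", d0.hex())          # 11111111
print("rebuilt  d0:", d0_rebuilt.hex())  # 00000000 -- old XOR new garbage
```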

> If it assumes just a bit flip - not always a correct assumption but
> might be reasonable most of the time, it could iterate very quickly.

That is not how write hole works (or csum recovery for that matter).
Write hole producing a single bit flip would occur extremely rarely
outside of contrived test cases.

Recall that in a write hole, one or more 4K blocks are updated on some
of the disks in a stripe, but other blocks retain their original values
from prior to the update.  This is OK as long as all disks are online,
since the parity can be ignored or recomputed from the data blocks.  It is
also OK if the writes on all disks are completed without interruption,
since the data and parity eventually become consistent when all writes
complete as intended.  It is also OK if the entire stripe is written at
once, since then there is only one transaction referring to the stripe,
and if that transaction is not committed then the content of the stripe
is irrelevant.

The write hole error event is when all of the following occur:

- a stripe containing committed data from one or more btrfs
transactions is modified by raid5/6 RMW update in a new
transaction.  This is the usual case on a btrfs filesystem
with the default, 'nossd' or 'ssd' mount options.

- the write is not completed (due to crash, power failure, disk
failure, bad sector, SCSI timeout, bad cable, firmware bug, etc),
so the parity block is out of sync with modified data blocks
(before or after, order doesn't matter).

- the array is already degraded, or later becomes degraded before
the parity block can be recomputed by a scrub.

Users can run scrub immediately after _every_ unclean shutdown to
reduce the risk of inconsistent parity 

Re: Status of RAID5/6

2018-03-31 Thread Chris Murphy
On Sat, Mar 31, 2018 at 12:57 AM, Goffredo Baroncelli
 wrote:
> On 03/31/2018 07:03 AM, Zygo Blaxell wrote:
 btrfs has no optimization like mdadm write-intent bitmaps; recovery
 is always a full-device operation.  In theory btrfs could track
 modifications at the chunk level but this isn't even specified in the
 on-disk format, much less implemented.
>>> It could go even further; it would be sufficient to track which
>>> *partial* stripe updates will be performed before a commit, in one
>>> of the btrfs logs. Then in case of a mount of an unclean filesystem,
>>> a scrub on these stripes would be sufficient.
>
>> A scrub cannot fix a raid56 write hole--the data is already lost.
>> The damaged stripe updates must be replayed from the log.
>
> Your statement is correct, but you don't consider the COW nature of btrfs.
>
> The key is that if a data write is interrupted, the whole transaction is
> interrupted and aborted. And due to the COW nature of btrfs, the "old state"
> is restored at the next reboot.
>
> What is needed in any case is a rebuild of parity to avoid the "write-hole" bug.

Write hole happens on disk in Btrfs, but the ensuing corruption on
rebuild is detected. Corrupt data never propagates. The problem is
that Btrfs gives up when it's detected.

If it assumes just a bit flip - not always a correct assumption but
might be reasonable most of the time, it could iterate very quickly.
Flip bit, and recompute and compare checksum. It doesn't have to
iterate across 64KiB times the number of devices. It really only has
to iterate bit flips on the particular 4KiB block that has failed csum
(or in the case of metadata, 16KiB for the default leaf size, up to a
max of 64KiB).

That's a maximum of 4096 iterations and comparisons. It'd be quite
fast. And going for two bit flips while a lot slower is probably not
all that bad either.
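As an aside, a minimal sketch of that brute-force idea (illustrative only;
zlib.crc32 stands in for btrfs's crc32c, which is not in the Python standard
library, and note that a 4KiB block has 4096 * 8 = 32768 candidate
single-bit flips):

```python
import zlib

def recover_single_bitflip(block: bytes, expected_csum: int):
    """Return a corrected copy of `block` if flipping exactly one bit makes
    the checksum match, otherwise None."""
    buf = bytearray(block)
    for i in range(len(buf)):            # every byte...
        for bit in range(8):             # ...and every bit within it
            buf[i] ^= 1 << bit
            if zlib.crc32(bytes(buf)) == expected_csum:
                return bytes(buf)
            buf[i] ^= 1 << bit           # undo and keep searching
    return None

good = bytes(4096)                        # a 4KiB block of zeros
csum = zlib.crc32(good)
bad = bytearray(good)
bad[1000] ^= 0x10                         # simulate a one-bit corruption
assert recover_single_bitflip(bytes(bad), csum) == good
```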

Now if it's the kind of corruption you get from a torn or misdirected
write, there's enough damage that you're trying to find a collision on
crc32c with a partial match as a guide. That'd take a while, and who
knows, you might actually get corrupted data anyway since crc32c isn't
cryptographically secure.


-- 
Chris Murphy


Re: Status of RAID5/6

2018-03-31 Thread Zygo Blaxell
On Sat, Mar 31, 2018 at 11:36:50AM +0300, Andrei Borzenkov wrote:
> On 31.03.2018 11:16, Goffredo Baroncelli wrote:
> > On 03/31/2018 09:43 AM, Zygo Blaxell wrote:
> >>> The key is that if a data write is interrupted, the whole transaction
> >>> is interrupted and aborted. And due to the COW nature of btrfs, the
> >>> "old state" is restored at the next reboot.
> > 
> >> This is not presently true with raid56 and btrfs.  RAID56 on btrfs uses
> >> RMW operations which are not COW and don't provide any data integrity
> >> guarantee.  Old data (i.e. data from very old transactions that are not
> >> part of the currently written transaction) can be destroyed by this.
> > 
> > Could you elaborate a bit ?
> > 
> > Generally speaking, updating part of a stripe requires an RMW cycle, because
> > - you need to read all the data strips of the stripe (and the parity, in
> > case of a problem)
> > - then you need to write
> > - the new data
> > - the new parity (calculated on the basis of the first read, and the
> > new data)
> >
> > However the "old" data should be untouched; or are you saying that the
> > "old" data is rewritten with the same content?
> > 
> 
> If an old data block becomes unavailable, it can no longer be reconstructed
> because the old content of the "new data" and "new parity" blocks is lost.
> Fortunately, if checksums are in use this does not cause silent data
> corruption, but it effectively means data loss.
>
> Writing data belonging to an unrelated transaction affects previous
> transactions precisely because of the RMW cycle. This fundamentally violates
> the btrfs claim of always having either the old or the new consistent state.

Correct.

To fix this, any RMW stripe update on raid56 has to be written to a
log first.  All RMW updates must be logged because a disk failure could
happen at any time.

Full stripe writes don't need to be logged because all the data in the
stripe belongs to the same transaction, so if a disk fails the entire
stripe is either committed or it is not.
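A rough sketch of that distinction, assuming a hypothetical fixed geometry
(64KiB per-disk strips, a given number of data disks); the real stripe
geometry in btrfs is decided per block group:

```python
STRIP = 64 * 1024

def needs_rmw_log(offset: int, length: int, data_disks: int) -> bool:
    """True if the write touches any stripe it does not fully cover, i.e. it
    would mix new data into a stripe that may hold older committed data."""
    stripe = STRIP * data_disks
    return offset % stripe != 0 or length % stripe != 0

assert not needs_rmw_log(0, 192 * 1024, 3)  # full-stripe write on 3 data disks
assert needs_rmw_log(0, 256 * 1024, 3)      # full stripe plus a 64KiB tail
```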

One way to avoid the logging is to change the btrfs allocation parameters
so that the filesystem doesn't allocate data in RAID stripes that are
already occupied by data from older transactions.  This is similar to
what 'ssd_spread' does, although the ssd_spread option wasn't designed
for this and won't be effective on large arrays.  This avoids modifying
stripes that contain old committed data, but it also means the free space
on the filesystem will become heavily fragmented over time.  Users will
have to run balance *much* more often to defragment the free space.





Re: Status of RAID5/6

2018-03-31 Thread Goffredo Baroncelli
On 03/31/2018 09:43 AM, Zygo Blaxell wrote:
>> The key is that if a data write is interrupted, the whole transaction
>> is interrupted and aborted. And due to the COW nature of btrfs, the
>> "old state" is restored at the next reboot.

> This is not presently true with raid56 and btrfs.  RAID56 on btrfs uses
> RMW operations which are not COW and don't provide any data integrity
> guarantee.  Old data (i.e. data from very old transactions that are not
> part of the currently written transaction) can be destroyed by this.

Could you elaborate a bit?

Generally speaking, updating part of a stripe requires an RMW cycle, because
- you need to read all the data strips of the stripe (and the parity, in case
of a problem)
- then you need to write
- the new data
- the new parity (calculated on the basis of the first read, and the
new data)

However the "old" data should be untouched; or are you saying that the "old"
data is rewritten with the same content?
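For reference, a toy sketch of the parity math in such an RMW update
(illustrative Python, not btrfs code); only the modified data block and the
parity block are rewritten, the untouched data blocks stay as they were:

```python
from functools import reduce

def xor(*blocks):
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

d = [bytes([i]) * 4 for i in (1, 2, 3)]   # three data blocks in one stripe
parity = xor(*d)                          # initial, consistent parity

new_d1 = bytes([9]) * 4                   # rewrite only the middle block
# RMW shortcut: read the old data block and the old parity, then write the
# new data block and the new parity.
new_parity = xor(parity, d[1], new_d1)
d[1] = new_d1

assert new_parity == xor(*d)              # same as recomputing from all data
# d[0] and d[2] were never rewritten; only d[1] and the parity changed.
```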

BR
G.Baroncelli


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Status of RAID5/6

2018-03-31 Thread Zygo Blaxell
On Sat, Mar 31, 2018 at 08:57:18AM +0200, Goffredo Baroncelli wrote:
> On 03/31/2018 07:03 AM, Zygo Blaxell wrote:
> >>> btrfs has no optimization like mdadm write-intent bitmaps; recovery
> >>> is always a full-device operation.  In theory btrfs could track
> >>> modifications at the chunk level but this isn't even specified in the
> >>> on-disk format, much less implemented.
> >> It could go even further; it would be sufficient to track which
> >> *partial* stripe updates will be performed before a commit, in one
> >> of the btrfs logs. Then in case of a mount of an unclean filesystem,
> >> a scrub on these stripes would be sufficient.
> 
> > A scrub cannot fix a raid56 write hole--the data is already lost.
> > The damaged stripe updates must be replayed from the log.
> 
> Your statement is correct, but you don't consider the COW nature of btrfs.
> 
> The key is that if a data write is interrupted, the whole transaction
> is interrupted and aborted. And due to the COW nature of btrfs, the
> "old state" is restored at the next reboot.

This is not presently true with raid56 and btrfs.  RAID56 on btrfs uses
RMW operations which are not COW and don't provide any data integrity
guarantee.  Old data (i.e. data from very old transactions that are not
part of the currently written transaction) can be destroyed by this.

> What is needed in any case is a rebuild of parity to avoid the
> "write-hole" bug, and this is needed only for a partial stripe
> write. For a full stripe write, since the commit is not flushed,
> no scrub is needed at all.
> 
> Of course for a NODATACOW file this is not entirely true; but I
> don't see the gain in switching from the cost of COW to the cost of a log.
> 
> The above sentences are correct (IMHO) if we don't consider a power
> failure + missing device case. However, in that case even logging the
> "new data" would not be sufficient.
> 
> BR
> G.Baroncelli
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli 
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5




Re: Status of RAID5/6

2018-03-31 Thread Goffredo Baroncelli
On 03/31/2018 07:03 AM, Zygo Blaxell wrote:
>>> btrfs has no optimization like mdadm write-intent bitmaps; recovery
>>> is always a full-device operation.  In theory btrfs could track
>>> modifications at the chunk level but this isn't even specified in the
>>> on-disk format, much less implemented.
>> It could go even further; it would be sufficient to track which
>> *partial* stripe updates will be performed before a commit, in one
>> of the btrfs logs. Then in case of a mount of an unclean filesystem,
>> a scrub on these stripes would be sufficient.

> A scrub cannot fix a raid56 write hole--the data is already lost.
> The damaged stripe updates must be replayed from the log.

Your statement is correct, but you don't consider the COW nature of btrfs.

The key is that if a data write is interrupted, the whole transaction is
interrupted and aborted. And due to the COW nature of btrfs, the "old state" is
restored at the next reboot.

What is needed in any case is a rebuild of parity to avoid the "write-hole" bug,
and this is needed only for a partial stripe write. For a full stripe write,
since the commit is not flushed, no scrub is needed at all.

Of course for a NODATACOW file this is not entirely true; but I don't see the
gain in switching from the cost of COW to the cost of a log.

The above sentences are correct (IMHO) if we don't consider a power
failure + missing device case. However, in that case even logging the "new data"
would not be sufficient.

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Status of RAID5/6

2018-03-30 Thread Zygo Blaxell
On Fri, Mar 30, 2018 at 06:14:52PM +0200, Goffredo Baroncelli wrote:
> On 03/29/2018 11:50 PM, Zygo Blaxell wrote:
> > On Wed, Mar 21, 2018 at 09:02:36PM +0100, Christoph Anton Mitterer wrote:
> >> Hey.
> >>
> >> Some things would IMO be nice to get done/clarified (i.e. documented in
> >> the Wiki and manpages) from users'/admin's  POV:
> [...]
> > 
> > btrfs has no optimization like mdadm write-intent bitmaps; recovery
> > is always a full-device operation.  In theory btrfs could track
> > modifications at the chunk level but this isn't even specified in the
> > on-disk format, much less implemented.
> 
> It could go even further; it would be sufficient to track which
> *partial* stripe updates will be performed before a commit, in one
> of the btrfs logs. Then in case of a mount of an unclean filesystem,
> a scrub on these stripes would be sufficient.

A scrub cannot fix a raid56 write hole--the data is already lost.
The damaged stripe updates must be replayed from the log.

A scrub could fix raid1/raid10 partial updates but only if the filesystem
can reliably track which blocks failed to be updated by the disconnected
disks.

It would be nice if scrub could be filtered the same way balance is, e.g.
only certain block ranges, or only metadata blocks; however, this is not
presently implemented.

> BR
> G.Baroncelli
> 
> [...]
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli 
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5




Re: Status of RAID5/6

2018-03-30 Thread Zygo Blaxell
On Fri, Mar 30, 2018 at 09:21:00AM +0200, Menion wrote:
>  Thanks for the detailed explanation. I think that a summary of this
> should go in the btrfs raid56 wiki status page, because now it is
> completely inconsistent, and if a user comes there, he may get the
> impression that raid56 is just broken.
> Still I have the 1 billion dollar question: from your words I understand
> that even in RAID56 the metadata are spread across the devices in a complex
> way, but shall I assume that the array can survive the sudden death
> of one HDD (two for raid6) in the array?

I wouldn't assume that.  There is still the write hole, and while there
is a small probability of having a write hole failure, it's a probability
that applies on *every* write in degraded mode, and since disks can fail
at any time, the array can enter degraded mode at any time.

It's similar to lottery tickets--buy one ticket, you probably won't win,
but if you buy millions of tickets, you'll claim the prize eventually.
The "prize" in this case is a severely damaged, possibly unrecoverable
filesystem.

If the data is raid5 and the metadata is raid1, the filesystem can
survive a single disk failure easily; however, some of the data may be
lost if writes to the remaining disks are interrupted by a system crash
or power failure and the write hole issue occurs.  Note that the damage
is not necessarily limited to recently written data--it's any random
data that is merely located adjacent to written data on the filesystem.

I wouldn't use raid6 until the write hole issue is resolved.  There is
no configuration where two disks can fail and metadata can still be
updated reliably.

Some users use the 'ssd_spread' mount option to reduce the probability
of write hole failure, which happens to be helpful by accident on some
array configurations, but it has a fairly high cost when the array is
not degraded due to all the extra balancing required.



> Bye




Re: Status of RAID5/6

2018-03-30 Thread Goffredo Baroncelli
On 03/29/2018 11:50 PM, Zygo Blaxell wrote:
> On Wed, Mar 21, 2018 at 09:02:36PM +0100, Christoph Anton Mitterer wrote:
>> Hey.
>>
>> Some things would IMO be nice to get done/clarified (i.e. documented in
>> the Wiki and manpages) from users'/admin's  POV:
[...]
> 
>>   - changing raid lvls?
> 
> btrfs uses a brute-force RAID conversion algorithm which always works, but
> takes zero short cuts.  e.g. there is no speed optimization implemented
> for cases like "convert 2-disk raid1 to 1-disk single" which can be
> very fast in theory.  The worst-case running time is the only running
> time available in btrfs.

[...]
What is reported by Zygo is an excellent source of information. However, I
have to point out that BTRFS has a little optimization: scrub/balance only
works on allocated chunks, so a partially filled filesystem requires less time
than a nearly full one.

> 
> btrfs has no optimization like mdadm write-intent bitmaps; recovery
> is always a full-device operation.  In theory btrfs could track
> modifications at the chunk level but this isn't even specified in the
> on-disk format, much less implemented.

It could go even further; it would be sufficient to track which *partial*
stripe updates will be performed before a commit, in one of the btrfs logs.
Then, in case of a mount of an unclean filesystem, a scrub on these stripes
would be sufficient.

BR
G.Baroncelli

[...]


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Status of RAID5/6

2018-03-30 Thread Menion
 Thanks for the detailed explanation. I think that a summary of this
should go in the btrfs raid56 wiki status page, because now it is
completely inconsistent, and if a user comes there, he may get the
impression that raid56 is just broken.
Still I have the 1 billion dollar question: from your words I understand
that even in RAID56 the metadata are spread across the devices in a complex
way, but shall I assume that the array can survive the sudden death
of one HDD (two for raid6) in the array?
Bye


Re: Status of RAID5/6

2018-03-29 Thread Zygo Blaxell
On Wed, Mar 21, 2018 at 09:02:36PM +0100, Christoph Anton Mitterer wrote:
> Hey.
> 
> Some things would IMO be nice to get done/clarified (i.e. documented in
> the Wiki and manpages) from users'/admin's  POV:
> 
> Some basic questions:

I can answer some easy ones:

>   - compression+raid?

There is no interaction between compression and raid.  They happen on
different data trees at different levels of the stack.  So if the raid
works, compression does too.

>   - rebuild / replace of devices?

"replace" needs raid-level-specific support.  If the raid level doesn't
support replace, then users have to do device add followed by device
delete, which is considerably (orders of magnitude) slower.

>   - changing raid lvls?

btrfs uses a brute-force RAID conversion algorithm which always works, but
takes zero short cuts.  e.g. there is no speed optimization implemented
for cases like "convert 2-disk raid1 to 1-disk single" which can be
very fast in theory.  The worst-case running time is the only running
time available in btrfs.

Also, users have to understand how the different raid allocators work
to understand their behavior in specific situations.  Without this
understanding, the set of restrictions that pop up in practice can seem
capricious and arbitrary.  e.g. after adding 1 disk to a nearly-full
raid1, full balance is required to make the new space available, but
adding 2 disks makes all the free space available immediately.
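A toy estimate of that raid1 example (not btrfs's actual allocator logic):
every raid1 chunk needs unallocated space on two different devices at once.

```python
def raid1_usable(unallocated):
    """Crude usable-capacity estimate from per-device unallocated bytes."""
    total = sum(unallocated)
    return min(total // 2, total - max(unallocated))

print(raid1_usable([0, 0, 1000]))        # 1 disk added to a full raid1: 0
print(raid1_usable([0, 0, 1000, 1000]))  # 2 disks added: 1000 usable at once
```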

Generally it always works if you repeatedly run full-balances in a loop
until you stop running out of space, but again, this is the worst case.

>   - anything to consider with raid when doing snapshots, send/receive
> or defrag?

Snapshot deletes cannot run at the same time as RAID convert/device
delete/device shrink/resize.  If one is started while the other is
running, it will be blocked until the other finishes.  Internally these
operations block each other on a mutex.

I don't know if snapshot deletes interact with device replace (the case
has never come up for me).  I wouldn't expect it to as device replace
is more similar to scrub than balance, and scrub has no such interaction.

Also note you can only run one balance, device shrink, or device delete
at a time.  If you start one of these three operations while another is
already running, the new request is rejected immediately.

As far as I know there are no other restrictions.

>   => and for each of these: for which raid levels?

Most of those features don't interact with anything specific to a raid
layer, so they work on all raid levels.

Device replace is the exception: all RAID levels in use on the filesystem
must support it, or the user must use device add and device delete instead.

[Aside:  I don't know if any RAID levels that do not support device
replace still exist, which makes my answer longer than it otherwise
would be]

>   Perhaps also confirmation for previous issues:
>   - I vaguely remember there were issues with either device delete or
> replace and that one of them was possibly super-slow?

Device replace is faster than device delete.  Replace does not modify
any metadata, while delete rewrites all the metadata referring to the
removed device.

Delete can be orders of magnitude slower than expected because of the
metadata modifications required.

>   - I also remember there were cases in which a fs could end up in
> permanent read-only state?

Any unrecovered metadata error 1 bit or larger will do that.  RAID level
is relevant only in terms of how well it can recover corrupted or
unreadable metadata blocks.

> - Clarifying questions on what is expected to work and how things are
>   expected to behave, e.g.:
>   - Can one plug a device (without deleting/removing it first) just
> under operation and will btrfs survive it?

On raid1 and raid10, yes.  On raid5/6 you will be at risk of write hole
problems if the filesystem is modified while the device is unplugged.

If the device is later reconnected, you should immediately scrub to
bring the metadata on the devices back in sync.  Data written to the
filesystem while the device was offline will be corrected if the csum
is different on the removed device.  If there is no csum data will be
silently corrupted.  If the csum is correct, but the data is not (this
occurs with 2^-32 probability on random data where the CRC happens to
be identical) then the data will be silently corrupted.

A full replace of the removed device would be better than a scrub,
as that will get a known good copy of the data.

If the device is offline for a long time, it should be wiped before being
reintroduced to the rest of the array to avoid data integrity issues.

It may be necessary to specify a different device name when mounting
a filesystem that has had a disk removed and later reinserted until
the scrub or replace action above is completed.

btrfs has no optimization like mdadm write-intent bitmaps; recovery
is always a full-device operation.  In theory btrfs 

Re: Status of RAID5/6

2018-03-22 Thread waxhead

Liu Bo wrote:

On Wed, Mar 21, 2018 at 9:50 AM, Menion <men...@gmail.com> wrote:

Hi all
I am trying to understand the status of RAID5/6 in BTRFS
I know that there are some discussion ongoing on the RFC patch
proposed by Liu bo
But it seems that everything stopped last summer. Also it mentioned
about a "separate disk for journal", does it mean that the final
implementation of RAID5/6 will require a dedicated HDD for the
journaling?


Thanks for the interest in btrfs and raid56.

The patch set is to plug the write hole, which is very rare in practice, tbh.
The feedback is to use existing space instead of another dedicated
"fast device" as the journal, in order to get some extent of raid
protection.  I'd need some time to pick it up.

With that being said, we have several data reconstruction fixes for
raid56 (esp. raid6) in 4.15, I'd say please deploy btrfs with the
upstream kernel or some distros which do kernel updates frequently,
the most important one is

8810f7517a3b Btrfs: make raid6 rebuild retry more
https://patchwork.kernel.org/patch/10091755/

AFAIK, no other data corruptions showed up.

I am very interested in the "raid"5/6 like behavior myself. Actually,
calling it RAID in the past may have had its benefits, but these days
continuing to use the RAID term is not helping. Even technically minded
people seem to get confused.


For example: it was suggested that "raid"5/6 should have hot-spare
support. In BTRFS terms a hot spare device sounds wrong to me, but
reserving extra space for a "hot-space" so any "raid"5/6 like system can
(auto?) rebalance the missing blocks to the rest of the pool sounds
sensible enough (as long as the number of devices allows separating the
different bits and pieces).


Anyway, I got carried away a bit there. Sorry about that.
What I really wanted to comment on is the usability of "raid"5/6:
how would a metadata "raid"1 + data "raid"5 or 6 setup really compare to,
say, mdraid 5 or 6 from a reliability point of view?


Sure, mdraid has the advantage, but even with the write hole and the risk
of corruption of data (not the filesystem), would not BTRFS in "theory"
be safer than at least mdraid 5 if run with metadata "raid"5?!
You have to run scrub on both mdraid as well as BTRFS to ensure data is
not corrupted.


PS! It might be worth mentioning that I am slightly affected by a 
Glenfarclas 105 Whisky while writing this, so please bear with me in case 
something is too far off :)



Re: Status of RAID5/6

2018-03-22 Thread Austin S. Hemmelgarn

On 2018-03-21 16:02, Christoph Anton Mitterer wrote:
On the note of maintenance specifically:

- Maintenance tools
   - How to get the status of the RAID? (Querying kernel logs is IMO
 rather a bad way for this)
 This includes:
 - Is the raid degraded or not?
Check for the 'degraded' flag in the mount options.  Assuming you're 
doing things sensibly and not specifying it on mount, it gets added when 
the array goes degraded.
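A minimal sketch of that check (hypothetical helper; it just parses
/proc/self/mounts):

```python
def degraded_btrfs_mounts():
    """Return mount points of btrfs filesystems mounted with 'degraded'."""
    hits = []
    with open("/proc/self/mounts") as f:
        for line in f:
            dev, mnt, fstype, opts = line.split()[:4]
            if fstype == "btrfs" and "degraded" in opts.split(","):
                hits.append(mnt)
    return hits

print(degraded_btrfs_mounts() or "no degraded btrfs mounts")
```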



 - Are scrubs/repairs/rebuilds/reshapes in progress and how far are
   they? (Reshape would be: if the raid level is changed or the raid
   grown/shrunk: has all data been replicated enough to be
   "complete" for the desired raid lvl/number of devices/size?)
A bit trickier, but still not hard, just check the output of `btrfs 
scrub status`, `btrfs balance status`, and `btrfs replace status` for 
the volume.  It won't check automatic spot-repairs (that is, repairing 
individual blocks that fail checksums), but most people really don't care



- What should one regularly do? scrubs? balance? How often?
  Do we get any automatic (but configurable) tools for this?
There aren't any such tools that I know of currently.  storaged might 
have some, but I've never really looked at it so I can't comment (I'm 
kind of averse to having hundreds of background services running to do 
stuff that can just as easily be done in a polling manner from cron 
without compromising their utility).  Right now though, it's _trivial_ 
to automate things with cron, or systemd timers, or even third-party 
tools like monit (which has the bonus that if the maintenance fails, you 
get an e-mail about it).



- There should be support in commonly used tools, e.g. Icinga/Nagios
  check_raid
Agreed.  I think there might already be a Nagios plugin for the basic 
checks, not sure about anything else though.


Netdata has had basic monitoring support for a while now, but it only 
looks at allocations, not error counters, so while it will help catch 
impending ENOSPC issues, it can't really help much with data corruption 
issues.



- Ideally there should also be some desktop notification tool, which
  tells about raid (and btrfs errors in general) as small
  installations with raids typically run no Icinga/Nagios but rely
  on e.g. email or gui notifications.
Desktop notifications would be nice, but are out of scope for the main 
btrfs-progs.  Not even LVM, MDADM, or ZFS ship desktop notification 
support from upstream.  You don't need Icinga or Nagios for monitoring 
either.  Netdata works pretty well for covering the allocation checks 
(and I'm planning to have something soon), and it's trivial to set up 
e-mail notifications with cron or systemd timers or even tools like monit.


On the note of generic monitoring though, I've been working on a Python 
3 script (with no dependencies beyond the Python standard library) to do 
the same checks that Netdata does regarding allocations, as well as 
checking device error counters and mount options that should be 
reasonable as a simple warning tool run from cron or a systemd timer. 
I'm hoping to get it included in the upstream btrfs-progs, but I don't 
have it in a state yet that it's ready to be posted (the checks are 
working, but I'm still having issues reliably mapping between mount 
points and filesystem UUID's).
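In that spirit, a stripped-down sketch of such an allocation report (this is
not that script; it assumes the total_bytes/bytes_used files under
/sys/fs/btrfs/<UUID>/allocation/ exposed by recent kernels, and a real check
would also look at device error counters and unallocated device space):

```python
#!/usr/bin/env python3
from pathlib import Path

def report_allocations(sysfs=Path("/sys/fs/btrfs")):
    for fs in sorted(sysfs.iterdir()):
        alloc = fs / "allocation"
        if not alloc.is_dir():
            continue  # skip 'features' and other non-filesystem entries
        for profile in ("data", "metadata", "system"):
            d = alloc / profile
            try:
                total = int((d / "total_bytes").read_text())
                used = int((d / "bytes_used").read_text())
            except (FileNotFoundError, ValueError):
                continue
            pct = 100 * used / total if total else 0.0
            print(f"{fs.name} {profile:8s} {used}/{total} bytes used "
                  f"({pct:.1f}% of allocated chunks)")

if __name__ == "__main__":
    report_allocations()
```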



I think especially for such tools it's important that these are
maintained by upstream (and yes I know you guys are rather fs
developers not)... but since these tools are so vital, having them done
3rd party can easily lead to the situation where something changes in
btrfs, the tools don't notice and errors remain undetected.
It depends on what they look at.  All the stuff under /sys/fs/btrfs 
should never change (new things might get added, but none of the old 
stuff is likely to ever change because /sys is classified as part of the 
userspace ABI, and any changes would get shot down by Linus), so 
anything that just uses those will likely have no issues (Netdata falls 
into this category for example).  Same goes for anything using ioctls 
directly, as those are also userspace ABI.



Re: Status of RAID5/6

2018-03-21 Thread Menion
I am on 4.15.5 :)
Yes, I agree that journaling is better on the same array; it should still
be tolerant of a unit failure, so maybe it should go in a RAID1 scheme.
Will a raid56 array built with an older kernel be compatible with the new
forthcoming code?
Bye

2018-03-21 18:24 GMT+01:00 Liu Bo <obuil.li...@gmail.com>:
> On Wed, Mar 21, 2018 at 9:50 AM, Menion <men...@gmail.com> wrote:
>> Hi all
>> I am trying to understand the status of RAID5/6 in BTRFS
>> I know that there are some discussion ongoing on the RFC patch
>> proposed by Liu bo
>> But it seems that everything stopped last summer. Also it mentioned
>> about a "separate disk for journal", does it mean that the final
>> implementation of RAID5/6 will require a dedicated HDD for the
>> journaling?
>
> Thanks for the interest in btrfs and raid56.
>
> The patch set is to plug the write hole, which is very rare in practice, tbh.
> The feedback is to use existing space instead of another dedicated
> "fast device" as the journal, in order to get some extent of raid
> protection.  I'd need some time to pick it up.
>
> With that being said, we have several data reconstruction fixes for
> raid56 (esp. raid6) in 4.15, I'd say please deploy btrfs with the
> upstream kernel or some distros which do kernel updates frequently,
> the most important one is
>
> 8810f7517a3b Btrfs: make raid6 rebuild retry more
> https://patchwork.kernel.org/patch/10091755/
>
> AFAIK, no other data corruptions showed up.
>
> thanks,
> liubo


Re: Status of RAID5/6

2018-03-21 Thread Christoph Anton Mitterer
Hey.

Some things would IMO be nice to get done/clarified (i.e. documented in
the Wiki and manpages) from users'/admin's  POV:

Some basic questions:
- Starting with which kernels (including stable kernel versions) are
the fixes for the bigger issues from some time ago included?

- Exactly what does not work yet (only the write hole?)?
  What's the roadmap for such non-working things?

- Ideally some explicit confirmations of what's considered to work,
  like:
  - compression+raid?
  - rebuild / replace of devices?
  - changing raid lvls?
  - repairing data (i.e. picking the right block according to csums in
case of silent data corruption)?
  - scrub (and scrub+repair)?
  - anything to consider with raid when doing snapshots, send/receive
or defrag?
  => and for each of these: for which raid levels?

  Perhaps also confirmation for previous issues:
  - I vaguely remember there were issues with either device delete or
replace and that one of them was possibly super-slow?
  - I also remember there were cases in which a fs could end up in
permanent read-only state?


- Clarifying questions on what is expected to work and how things are
  expected to behave, e.g.:
  - Can one plug a device (without deleting/removing it first) just
under operation and will btrfs survive it?
  - If an error is found (e.g. silent data corruption based on csums),
when will it repair (fix = write the repaired data) the data?
On the read that finds the bad data?
Only on scrub (i.e. do users need to regularly run scrubs)? 
  - What happens if error cannot be repaired, e.g. no csum information
or all blocks bad?
EIO? Or are there cases where it gives no EIO (I guess at least in
nodatacow case)
  - What happens if data cannot be fixed (i.e. trying to write the
repaired block again fails)?
And if the repaired block is written, will it be immediately
checked again (to find cases of blocks that give different results
again)?
  - Will a scrub check only the data on "one" device... or will it
check all the copies (or parity blocks) on all devices in the raid?
  - Does a fsck check all devices or just one?
  - Does a balance implicitly contain a scrub?
  - If a rebuild/repair/reshape is performed... can these be
interrupted? What if they are forcibly interrupted (power loss)?


- Explaining common workflows:
  - Replacing a faulty or simply an old disk.
How to stop btrfs from using a device (without bricking the fs)?
How to do the rebuild.
  - Best practices, like: should one do regular balances (and if so, as
asked above, do these include the scrubs, so basically: is it
enough to do one of them)
  - How to grow/shrink raid btrfs... and if this is done... how to
replicate the data already on the fs to the newly added disks (or
is this done automatically - and if so, how to see that it's
finished)?
  - What will actually trigger repairs? (i.e. one wants to get silent
block errors fixed ASAP and not only when the data is read - and
when it's possibly too late)
  - In the rebuild/repair phase (e.g. one replaces a device): Can one
somehow give priority to the rebuild/repair? (e.g. in case of a
degraded raid, one may want to get that solved ASAP and rather slow
down other reads or stop them completely.
  - Is there anything to notice when btrfs raid is placed above dm-
crypt from a security PoV?
With MD raid that wasn't much of a problem as it's typically placed
below dm-crypt... but btrfs raid would need to be placed above it.
So maybe there are some known attacks against crypto modes, if
equal (RAID 1 / 10) or similar/equal (RAID 5/6) data is written
above multiple crypto devices? (Probably something one would need
to ask their experts).


- Maintenance tools
  - How to get the status of the RAID? (Querying kernel logs is IMO
rather a bad way for this)
This includes:
- Is the raid degraded or not?
- Are scrubs/repairs/rebuilds/reshapes in progress and how far are
  they? (Reshape would be: if the raid level is changed or the raid
  grown/shrunk: has all data been replicated enough to be
  "complete" for the desired raid lvl/number of devices/size?)
   - What should one regularly do? scrubs? balance? How often?
 Do we get any automatic (but configurable) tools for this?
   - There should be support in commonly used tools, e.g. Icinga/Nagios
 check_raid
   - Ideally there should also be some desktop notification tool, which
 tells about raid (and btrfs errors in general) as small
 installations with raids typically run no Icinga/Nagios but rely
 on e.g. email or gui notifications.

I think especially for such tools it's important that these are
maintained by upstream (and yes I know you guys are rather fs
developers not)... but since these tools are so vital, having them done
3rd party can easily lead to the situation where something changes in

Re: Status of RAID5/6

2018-03-21 Thread Liu Bo
On Wed, Mar 21, 2018 at 9:50 AM, Menion <men...@gmail.com> wrote:
> Hi all
> I am trying to understand the status of RAID5/6 in BTRFS
> I know that there are some discussion ongoing on the RFC patch
> proposed by Liu bo
> But it seems that everything stopped last summer. Also it mentioned
> about a "separate disk for journal", does it mean that the final
> implementation of RAID5/6 will require a dedicated HDD for the
> journaling?

Thanks for the interest in btrfs and raid56.

The patch set is to plug the write hole, which is very rare in practice, tbh.
The feedback is to use existing space instead of another dedicated
"fast device" as the journal, in order to get some extent of raid
protection.  I'd need some time to pick it up.

With that being said, we have several data reconstruction fixes for
raid56 (esp. raid6) in 4.15, I'd say please deploy btrfs with the
upstream kernel or some distros which do kernel updates frequently,
the most important one is

8810f7517a3b Btrfs: make raid6 rebuild retry more
https://patchwork.kernel.org/patch/10091755/

AFAIK, no other data corruptions showed up.

thanks,
liubo


Status of RAID5/6

2018-03-21 Thread Menion
Hi all
I am trying to understand the status of RAID5/6 in BTRFS
I know that there are some discussion ongoing on the RFC patch
proposed by Liu bo
But it seems that everything stopped last summer. Also it mentioned
about a "separate disk for journal", does it mean that the final
implementation of RAID5/6 will require a dedicated HDD for the
journaling?
Bye


Status of raid5/6 in 2014?

2014-01-03 Thread Dave
Back in Feb 2013 there was quite a bit of press about the preliminary
raid5/6 implementation in Btrfs.  At the time it wasn't useful for
anything other than testing, and it's my understanding that this is
still the case.

I've seen a few git commits and some chatter on this list but it would
appear the developers are largely silent.  Parity based raid would be
a powerful addition to the Btrfs feature stack and it's the feature I
most anxiously await.  Are there any milestones planned for 2014?

Keep up the good work...
-- 
-=[dave]=-

Entropy isn't what it used to be.


Re: Status of raid5/6 in 2014?

2014-01-03 Thread Hans-Kristian Bakke
I personally consider proper RAID6 support, with graceful non-intrusive
handling of failing drives and a proper warning mechanism, the most
important missing feature of btrfs, and I know this view is shared by
many others with software RAID based storage systems, currently
limited by the existing choices on Linux.
But having been a (naughty) user of btrfs the last few months I fully
understand that there are important bugs, performance fixes and issues
in the existing state of btrfs that need more immediate attention as
they affect the currently installed base.

I will however stress that the faster the functionality gets
implemented the sooner users like myself can begin using it and
reporting issues, and hence btrfs gets ready for enterprise usage and
general deployment sooner.

Regards,
Hans-Kristian Bakke
Mvh

Hans-Kristian Bakke


On 3 January 2014 17:45, Dave d...@thekilempire.com wrote:
 Back in Feb 2013 there was quite a bit of press about the preliminary
 raid5/6 implementation in Btrfs.  At the time it wasn't useful for
 anything other than testing, and it's my understanding that this is
 still the case.

 I've seen a few git commits and some chatter on this list but it would
 appear the developers are largely silent.  Parity based raid would be
 a powerful addition to the Btrfs feature stack and it's the feature I
 most anxiously await.  Are there any milestones planned for 2014?

 Keep up the good work...
 --
 -=[dave]=-

 Entropy isn't what it used to be.