Re: Status of RAID5/6
On Wed, Apr 04, 2018 at 11:31:33PM +0200, Goffredo Baroncelli wrote:
> On 04/04/2018 08:01 AM, Zygo Blaxell wrote:
> > On Wed, Apr 04, 2018 at 07:15:54AM +0200, Goffredo Baroncelli wrote:
> >> On 04/04/2018 12:57 AM, Zygo Blaxell wrote:
> [...]
> >> Before you pointed out that the non-contiguous block written has
> >> an impact on performance. I am replying that the switch to a
> >> different BG happens at the disk-stripe boundary, so in any case the
> >> block is physically interrupted and switched to another disk
> >
> > The difference is that the write is switched to a different local
> > address on the disk.
> >
> > It's not "another" disk if it's a different BG.  Recall in this plan
> > there is a full-width BG that is on _every_ disk, which means every
> > small-width BG shares a disk with the full-width BG.  Every extent tail
> > write requires a seek on a minimum of two disks in the array for raid5,
> > three disks for raid6.  A tail that is stripe-width minus one will hit
> > N - 1 disks twice in an N-disk array.
>
> Below I made a little simulation; my results tell me another thing:
>
> Current BTRFS (w/write hole)
>
> Supposing 5-disk raid6 and stripe size = 64kb*3 = 192kb (disk stripe = 64kb)
>
> Case A.1): extent size = 192kb:
>   5 writes of 64kb spread over 5 disks (3 data + 2 parity)
>
> Case A.2.2): extent size = 256kb (optimistic case: contiguous space
> available):
>   5 writes of 64kb spread over 5 disks (3 data + 2 parity)
>   2 reads of 64kb spread over 2 disks (two old data of the stripe) [**]
>   3 writes of 64kb spread over 3 disks (data + 2 parity)
>
> Note that the two reads are contiguous to the 5 writes both in terms of
> space and time. The three writes are contiguous only in terms of space,
> but not in terms of time, because they can happen only after the 2 reads
> and the consequent parity computations. So we should consider that
> between these two events some disk activity happens; this means seeks
> between the 2 reads and the 3 writes.
>
> BTRFS with multiple BGs (wo/write hole)
>
> Supposing 5-disk raid6 and stripe size = 64kb*3 = 192kb (disk stripe = 64kb)
>
> Case B.1): extent size = 192kb:
>   5 writes of 64kb spread over 5 disks
>
> Case B.2): extent size = 256kb:
>   5 writes of 64kb spread over 5 disks in BG#1
>   3 writes of 64kb spread over 3 disks in BG#2 (which requires 3 seeks)
>
> So if I count correctly:
> - case B1 vs A1: these are equivalent
> - case B2 vs A2.1/A2.2:
>     8 writes vs 8 writes
>     3 seeks vs 3 seeks
>     0 reads vs 2 reads
>
> So to me it seems that the cost of doing an RMW cycle is worse than
> seeking to another BG.

Well, RMW cycles are dangerous, so being slow as well is just a second
reason never to do them.

> Anyway I am reaching the conclusion, also thanks to this discussion,
> that this is not enough. Even if we solved the problem of the "extent
> smaller than stripe" write, we would still face this issue again when
> part of the file is changed.
> In this case the file update breaks the old extent and creates three
> extents: the first part, the new part, the last part. Up to that point
> everything is OK. However the "old" part of the file would be marked as
> free space. But using this part could require an RMW cycle.

You cannot use that free space within RAID stripes because it would
require RMW, and RMW causes write hole.  The space would have to be kept
unavailable until the rest of the RAID stripe was deleted.

OTOH, if you can solve that free space management problem, you don't
have to do anything else to solve write hole.  If you never RMW then you
never have the write hole in the first place.

> I am concluding that the only two reliable solutions are
> a) variable stripe size (like ZFS does)
> or b) logging the RMW cycle of a stripe

Those are the only solutions that don't require a special process for
reclaiming unused space in RAID stripes.  If you have that, you have
a few more options; however, they all involve making a second copy of
the data at a later time (as opposed to option b, which makes a second
copy of the data during the original write).

a) also doesn't support nodatacow files (AFAIK ZFS doesn't have those)
and it would require defrag to get the inefficiently used space back.

b) is the best of the terrible options.  It minimizes the impact on the
rest of the filesystem since it can fix RMW inconsistency without having
to eliminate the RMW cases.  It doesn't require rewriting the allocator,
nor does it require users to run defrag or balance periodically.

> [**] Does someone know if the checksums are checked during this read?
> [...]
>
> BR
> G.Baroncelli
>
> --
> gpg @keyserver.linux.it: Goffredo Baroncelli
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
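Option b (logging the RMW cycle of a stripe) amounts to a write-ahead
journal for stripes. The following toy sketch is purely illustrative:
`rmw_with_log`, `replay`, and the dict-based "disks" are invented names,
not btrfs code or any proposed on-disk format. It only shows why logging
closes the write hole: the complete new stripe is durable before the
in-place overwrite begins.

```python
# Toy write-ahead log for RMW stripe updates (option b above).
# A crash mid-overwrite normally leaves data and parity inconsistent
# (the write hole); with the log, recovery replays the intended
# stripe instead of trusting torn on-disk contents.

def rmw_with_log(disks, log, sid, new_strips, crash=False):
    stripe = dict(disks[sid])
    stripe.update(new_strips)
    log[sid] = dict(stripe)          # 1. log the full new stripe (durable)
    if crash:                        # simulate power loss mid-overwrite:
        disks[sid] = dict.fromkeys(stripe)   # stripe is torn/garbage
        return
    disks[sid] = stripe              # 2. overwrite the stripe in place
    del log[sid]                     # 3. retire the log entry

def replay(disks, log):
    """Mount-time recovery: reapply any stripe updates still logged."""
    for sid, stripe in log.items():
        disks[sid] = dict(stripe)
    log.clear()

disks = {0: {"d0": b"A", "d1": b"B", "p": b"AB"}}
log = {}
rmw_with_log(disks, log, 0, {"d0": b"X", "p": b"XB"}, crash=True)
replay(disks, log)                   # stripe is consistent again
```

The cost Goffredo notes in the thread is visible in the sketch: every
RMW stripe is written twice, and the log area absorbs all of that extra
traffic.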
Re: Status of RAID5/6
On 04/04/2018 08:01 AM, Zygo Blaxell wrote:
> On Wed, Apr 04, 2018 at 07:15:54AM +0200, Goffredo Baroncelli wrote:
>> On 04/04/2018 12:57 AM, Zygo Blaxell wrote:
[...]
>> Before you pointed out that the non-contiguous block written has
>> an impact on performance. I am replying that the switch to a
>> different BG happens at the disk-stripe boundary, so in any case the
>> block is physically interrupted and switched to another disk
>
> The difference is that the write is switched to a different local
> address on the disk.
>
> It's not "another" disk if it's a different BG.  Recall in this plan
> there is a full-width BG that is on _every_ disk, which means every
> small-width BG shares a disk with the full-width BG.  Every extent tail
> write requires a seek on a minimum of two disks in the array for raid5,
> three disks for raid6.  A tail that is stripe-width minus one will hit
> N - 1 disks twice in an N-disk array.

Below I made a little simulation; my results tell me another thing:

Current BTRFS (w/write hole)

Supposing 5-disk raid6 and stripe size = 64kb*3 = 192kb (disk stripe = 64kb)

Case A.1): extent size = 192kb:
  5 writes of 64kb spread over 5 disks (3 data + 2 parity)

Case A.2.2): extent size = 256kb (optimistic case: contiguous space
available):
  5 writes of 64kb spread over 5 disks (3 data + 2 parity)
  2 reads of 64kb spread over 2 disks (two old data of the stripe) [**]
  3 writes of 64kb spread over 3 disks (data + 2 parity)

Note that the two reads are contiguous to the 5 writes both in terms of
space and time. The three writes are contiguous only in terms of space,
but not in terms of time, because they can happen only after the 2 reads
and the consequent parity computations. So we should consider that
between these two events some disk activity happens; this means seeks
between the 2 reads and the 3 writes.

BTRFS with multiple BGs (wo/write hole)

Supposing 5-disk raid6 and stripe size = 64kb*3 = 192kb (disk stripe = 64kb)

Case B.1): extent size = 192kb:
  5 writes of 64kb spread over 5 disks

Case B.2): extent size = 256kb:
  5 writes of 64kb spread over 5 disks in BG#1
  3 writes of 64kb spread over 3 disks in BG#2 (which requires 3 seeks)

So if I count correctly:
- case B1 vs A1: these are equivalent
- case B2 vs A2.1/A2.2:
    8 writes vs 8 writes
    3 seeks vs 3 seeks
    0 reads vs 2 reads

So to me it seems that the cost of doing an RMW cycle is worse than
seeking to another BG.

Anyway I am reaching the conclusion, also thanks to this discussion,
that this is not enough. Even if we solved the problem of the "extent
smaller than stripe" write, we would still face this issue again when
part of the file is changed.
In this case the file update breaks the old extent and creates three
extents: the first part, the new part, the last part. Up to that point
everything is OK. However the "old" part of the file would be marked as
free space. But using this part could require an RMW cycle.

I am concluding that the only two reliable solutions are
a) variable stripe size (like ZFS does)
or b) logging the RMW cycle of a stripe

[**] Does someone know if the checksums are checked during this read?
[...]

BR
G.Baroncelli

--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
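For readers following along, Goffredo's tallies can be reproduced with a
small I/O counter. This is a toy model of the accounting only (`STRIP`,
`cost_rmw`, etc. are invented names, not btrfs internals); seeks are not
modeled, just strip reads and writes.

```python
# Toy I/O accounting for the cases above: 5-disk raid6, 64 KiB disk
# stripe, so one full stripe = 3 data strips + 2 parity strips.

STRIP = 64 * 1024      # per-disk strip size
DATA = 3               # data strips per full stripe
PARITY = 2             # raid6 parity strips

def cost_rmw(extent_bytes):
    """Current btrfs: a partial tail stripe needs a read-modify-write.
    Returns (writes, reads) counted in 64 KiB strips."""
    full, tail = divmod(extent_bytes // STRIP, DATA)
    writes = full * (DATA + PARITY)
    reads = 0
    if tail:
        reads = DATA - tail           # old data strips to read back
        writes += tail + PARITY       # new data + both parities
    return writes, reads

def cost_multi_bg(extent_bytes):
    """Multi-BG plan: the tail lands in a narrower BG as a full
    (smaller) stripe, so no reads are ever needed."""
    full, tail = divmod(extent_bytes // STRIP, DATA)
    writes = full * (DATA + PARITY)
    if tail:
        writes += tail + PARITY
    return writes, 0

KB = 1024
assert cost_rmw(192 * KB) == (5, 0)       # case A.1
assert cost_multi_bg(192 * KB) == (5, 0)  # case B.1
assert cost_rmw(256 * KB) == (8, 2)       # case A.2.2: 2 extra reads
assert cost_multi_bg(256 * KB) == (8, 0)  # case B.2
```

Both schemes issue 8 writes for the 256kb extent; the difference is
entirely the 2 reads (and the stall waiting for them) in the RMW case,
which is the comparison made above.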
Re: Status of RAID5/6
On Tue, Apr 03, 2018 at 09:08:01PM -0600, Chris Murphy wrote:
> On Tue, Apr 3, 2018 at 11:03 AM, Goffredo Baroncelli wrote:
> > On 04/03/2018 02:31 AM, Zygo Blaxell wrote:
> >> On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote:
> >>> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote:
> >>>> On 2018-04-02 11:18, Goffredo Baroncelli wrote:
> >>>>> I thought that a possible solution is to create BGs with different
> >>>>> numbers of data disks. E.g. supposing we have a raid6 system with
> >>>>> 6 disks, where 2 are parity disks, we should allocate 3 BGs:
> >>>>> BG #1: 1 data disk, 2 parity disks
> >>>>> BG #2: 2 data disks, 2 parity disks
> >>>>> BG #3: 4 data disks, 2 parity disks
> >>>>>
> >>>>> For simplicity, the disk-stripe length is assumed = 4K.
> >>>>>
> >>>>> So if you have a write with a length of 4KB, this should be placed
> >>>>> in BG#1; if you have a write with a length of 4*3KB, the first 8KB
> >>>>> should be placed in BG#2, then in BG#1.
> >>>>> This would avoid space wasting, even if the fragmentation will
> >>>>> increase (but does fragmentation matter with modern solid state
> >>>>> disks?).
> >>> I don't really see why this would increase fragmentation or waste
> >>> space.
> >>
> >> Oh, wait, yes I do.  If there's a write of 6 blocks, we would have
> >> to split an extent between BG #3 (the first 4 blocks) and BG #2 (the
> >> remaining 2 blocks).  It also flips the usual order of "determine
> >> size of extent, then allocate space for it" which might require major
> >> surgery on the btrfs allocator to implement.
> >
> > I have to point out that in any case the extent is physically
> > interrupted at the disk-stripe size. Assuming disk-stripe=64KB, if you
> > want to write 128KB, the first half is written on the first disk, the
> > other on the 2nd disk. If you want to write 96kb, the first 64 are
> > written on the first disk, the last part on the 2nd, only on a
> > different BG.
> > So yes there is fragmentation from a logical point of view; from a
> > physical point of view the data is spread over the disks in any case.
> >
> > In any case, you are right, we should gather some data, because the
> > performance impact is not so clear.
>
> They're pretty clear, and there's a lot written about small file sizes
> and parity raid performance being shit, no matter the implementation
> (md, ZFS, Btrfs; hardware maybe less so, just because of all the
> caching and extra processing hardware that's dedicated to the task).

Pretty much everything goes fast if you put a faster non-volatile cache
in front of it.

> The linux-raid@ list is full of optimizations for this that are use
> case specific. One that often comes up is how badly suited raid56 is
> for e.g. mail servers: tons of small file reads and writes, and all the
> disk contention that comes with them, and it's even worse when you lose
> a disk, and even if you're running raid6 and lose two disks it's really
> god awful. It can be an unexpectedly disqualifying setup without prior
> testing in that condition: can your workload really be usable for two
> or three days in a double-degraded state on that raid6? *shrug*
>
> Parity raid is well suited for full stripe reads and writes, lots of
> sequential writes. Ergo a small file is anything less than a full
> stripe write. Of course, delayed allocation can end up making for more
> full stripe writes. But now you have more RMW, which is the real
> performance killer, again no matter the raid.

RMW isn't necessary if you have properly configured COW on top.
ZFS doesn't do RMW at all.

OTOH for some workloads COW is a step in a different wrong direction--the
btrfs raid5 problems with nodatacow files can be solved by stripe logging
and nothing else.

Some equivalent of autodefrag that repacks your small RAID stripes into
bigger ones will burn 3x your write IOPS eventually--it just lets you
defer the inevitable until a hopefully more convenient time.
A continuously loaded server never has a more convenient time, so it
needs a different solution.

> > I am not worried about having different BGs; we have problems with
> > these because we never developed tools to handle this issue properly
> > (i.e. a daemon which starts a balance when needed). But I hope that
> > this will be solved in the future.
> >
> > In any case, all the solutions proposed have their trade-offs:
> >
> > - a) as is: write hole bug
> > - b) variable stripe size (like ZFS): big impact on how btrfs handles
> >   extents; limited waste of space
> > - c) logging data before writing: we write the data two times in a
> >   short time window. Moreover the log area is written several orders
> >   of magnitude more than the other areas; there were some patches
> >   around
> > - d) rounding the writing to the stripe size: waste of space; simple
> >   to implement
> > - e) different BGs with different stripe sizes: limited waste of
> >   space; logical fragmentation.
>
> I'd say for sure you're
Re: Status of RAID5/6
On Wed, Apr 04, 2018 at 07:15:54AM +0200, Goffredo Baroncelli wrote:
> On 04/04/2018 12:57 AM, Zygo Blaxell wrote:
> >> I have to point out that in any case the extent is physically
> >> interrupted at the disk-stripe size. Assuming disk-stripe=64KB, if
> >> you want to write 128KB, the first half is written on the first disk,
> >> the other on the 2nd disk. If you want to write 96kb, the first 64
> >> are written on the first disk, the last part on the 2nd, only on a
> >> different BG.
> >
> > The "only on a different BG" part implies something expensive, either
> > a seek or a new erase page depending on the hardware.  Without that,
> > nearby logical blocks are nearby physical blocks as well.
>
> In any case it happens on a different disk

No it doesn't.  The small-BG could be on the same disk(s) as the big-BG.

> >> So yes there is fragmentation from a logical point of view; from a
> >> physical point of view the data is spread over the disks in any case.
> >
> > What matters is the extent-tree point of view.  There is (currently)
> > no fragmentation there, even for RAID5/6.  The extent tree is unaware
> > of RAID5/6 (to its peril).
>
> Before you pointed out that the non-contiguous block written has an
> impact on performance. I am replying that the switch to a different BG
> happens at the disk-stripe boundary, so in any case the block is
> physically interrupted and switched to another disk

The difference is that the write is switched to a different local
address on the disk.

It's not "another" disk if it's a different BG.  Recall in this plan
there is a full-width BG that is on _every_ disk, which means every
small-width BG shares a disk with the full-width BG.  Every extent tail
write requires a seek on a minimum of two disks in the array for raid5,
three disks for raid6.  A tail that is stripe-width minus one will hit
N - 1 disks twice in an N-disk array.

> However yes: from an extent-tree point of view there will be an
> increase in the number of extents, because the end of the write is
> allocated to another BG (if the size is not stripe-boundary)
>
> > If an application does a loop writing 68K then fsync(), the
> > multiple-BG solution adds two seeks to read every 68K.  That's
> > expensive if sequential read bandwidth is more scarce than free space.
>
> Why do you talk about additional seeks? In any case (even without the
> additional BG) the read happens from other disks

See above: not another disk, usually a different location on two or more
of the same disks.

> >> * c),d),e) are applied only to the tail of the extent, in case the
> >> size is less than the stripe size.
> >
> > It's only necessary to split an extent if there are no other writes
> > in the same transaction that could be combined with the extent tail
> > into a single RAID stripe.  As long as everything in the RAID stripe
> > belongs to a single transaction, there is no write hole
>
> Maybe a "simpler" optimization would be to close the transaction when
> the data reaches the stripe boundary... But I suspect that it is not so
> simple to implement.

Transactions exist in btrfs to batch up writes into big contiguous
extents already.  The trick is to _not_ do that when one transaction
ends and the next begins, i.e. leave a space at the end of the
partially-filled stripe so that the next transaction begins in an empty
stripe.

This does mean that there will only be extra seeks during transaction
commit and fsync()--which were already very seeky to begin with.  It's
not necessary to write a partial stripe when there are other extents to
combine.  So there will be double the amount of seeking, but depending
on the workload, it could double a very small percentage of writes.

> > Not for d.  Balance doesn't know how to get rid of unreachable blocks
> > in extents (it just moves the entire extent around) so after a balance
> > the writes would still be rounded up to the stripe size.  Balance
> > would never be able to free the rounded-up space.  That space would
> > just be gone until the file was overwritten, deleted, or defragged.
>
> If balance is capable of moving the extent, why not place one near the
> other during a balance? The goal is not to limit the writing of the end
> of an extent, but to avoid writing the end of an extent without further
> data (e.g. the gap to the stripe has to be filled in the same
> transaction)

That's plan f (leave gaps in RAID stripes empty).  Balance will repack
short extents into RAID stripes nicely.

Plan d can't do that because plan d overallocates the extent so that the
extent fills the stripe (only some of the extent is used for data).
Small but important difference.

> BR
> G.Baroncelli
>
> --
> gpg @keyserver.linux.it: Goffredo Baroncelli
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
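Zygo's "leave a space at the end of the partially-filled stripe" trick
amounts to rounding the allocator's write pointer up to a stripe
boundary at commit time. A minimal sketch under that reading
(`TxnAllocator` and its fields are hypothetical names, not the btrfs
allocator):

```python
# Pack extents back to back within a transaction, but never let two
# transactions share a RAID stripe: commit() rounds the next-free
# pointer up to the stripe boundary, leaving the tail of the last
# stripe empty (reclaimable later by a balance).

STRIPE = 3 * 64 * 1024   # 3 data strips of 64 KiB, as in the thread

class TxnAllocator:
    def __init__(self):
        self.next_free = 0

    def alloc(self, size):
        start = self.next_free
        self.next_free += size
        return start

    def commit(self):
        # Round up so the next transaction starts in a stripe nobody
        # committed into: no RMW of a committed stripe, no write hole.
        self.next_free = -(-self.next_free // STRIPE) * STRIPE

a = TxnAllocator()
a.alloc(68 * 1024)       # the 68K-then-fsync() workload from the thread
a.commit()
assert a.next_free == STRIPE   # 124 KiB of slack left for balance
```

The slack stays unallocated (unlike plan d, which would assign it to the
extent), which is what lets a later balance repack it.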
Re: Status of RAID5/6
On 04/04/2018 12:57 AM, Zygo Blaxell wrote:
>> I have to point out that in any case the extent is physically
>> interrupted at the disk-stripe size. Assuming disk-stripe=64KB, if
>> you want to write 128KB, the first half is written on the first disk,
>> the other on the 2nd disk. If you want to write 96kb, the first 64
>> are written on the first disk, the last part on the 2nd, only on a
>> different BG.
>
> The "only on a different BG" part implies something expensive, either
> a seek or a new erase page depending on the hardware.  Without that,
> nearby logical blocks are nearby physical blocks as well.

In any case it happens on a different disk

>> So yes there is fragmentation from a logical point of view; from a
>> physical point of view the data is spread over the disks in any case.
>
> What matters is the extent-tree point of view.  There is (currently)
> no fragmentation there, even for RAID5/6.  The extent tree is unaware
> of RAID5/6 (to its peril).

Before you pointed out that the non-contiguous block written has an
impact on performance. I am replying that the switch to a different BG
happens at the disk-stripe boundary, so in any case the block is
physically interrupted and switched to another disk.

However yes: from an extent-tree point of view there will be an increase
in the number of extents, because the end of the write is allocated to
another BG (if the size is not stripe-boundary).

> If an application does a loop writing 68K then fsync(), the multiple-BG
> solution adds two seeks to read every 68K.  That's expensive if
> sequential read bandwidth is more scarce than free space.

Why do you talk about additional seeks? In any case (even without the
additional BG) the read happens from other disks.

>> * c),d),e) are applied only to the tail of the extent, in case the
>> size is less than the stripe size.
>
> It's only necessary to split an extent if there are no other writes
> in the same transaction that could be combined with the extent tail
> into a single RAID stripe.  As long as everything in the RAID stripe
> belongs to a single transaction, there is no write hole

Maybe a "simpler" optimization would be to close the transaction when
the data reaches the stripe boundary... But I suspect that it is not so
simple to implement.

> Not for d.  Balance doesn't know how to get rid of unreachable blocks
> in extents (it just moves the entire extent around) so after a balance
> the writes would still be rounded up to the stripe size.  Balance would
> never be able to free the rounded-up space.  That space would just be
> gone until the file was overwritten, deleted, or defragged.

If balance is capable of moving the extent, why not place one near the
other during a balance? The goal is not to limit the writing of the end
of an extent, but to avoid writing the end of an extent without further
data (e.g. the gap to the stripe has to be filled in the same
transaction).

BR
G.Baroncelli

--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: Status of RAID5/6
On Tue, Apr 3, 2018 at 11:03 AM, Goffredo Baroncelli wrote:
> On 04/03/2018 02:31 AM, Zygo Blaxell wrote:
>> On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote:
>>> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote:
>>>> On 2018-04-02 11:18, Goffredo Baroncelli wrote:
>>>>> I thought that a possible solution is to create BGs with different
>>>>> numbers of data disks. E.g. supposing we have a raid6 system with 6
>>>>> disks, where 2 are parity disks, we should allocate 3 BGs:
>>>>> BG #1: 1 data disk, 2 parity disks
>>>>> BG #2: 2 data disks, 2 parity disks
>>>>> BG #3: 4 data disks, 2 parity disks
>>>>>
>>>>> For simplicity, the disk-stripe length is assumed = 4K.
>>>>>
>>>>> So if you have a write with a length of 4KB, this should be placed
>>>>> in BG#1; if you have a write with a length of 4*3KB, the first 8KB
>>>>> should be placed in BG#2, then in BG#1.
>>>>> This would avoid space wasting, even if the fragmentation will
>>>>> increase (but does fragmentation matter with modern solid state
>>>>> disks?).
>>> I don't really see why this would increase fragmentation or waste
>>> space.
>>
>> Oh, wait, yes I do.  If there's a write of 6 blocks, we would have
>> to split an extent between BG #3 (the first 4 blocks) and BG #2 (the
>> remaining 2 blocks).  It also flips the usual order of "determine size
>> of extent, then allocate space for it" which might require major
>> surgery on the btrfs allocator to implement.
>
> I have to point out that in any case the extent is physically
> interrupted at the disk-stripe size. Assuming disk-stripe=64KB, if you
> want to write 128KB, the first half is written on the first disk, the
> other on the 2nd disk. If you want to write 96kb, the first 64 are
> written on the first disk, the last part on the 2nd, only on a
> different BG.
> So yes there is fragmentation from a logical point of view; from a
> physical point of view the data is spread over the disks in any case.
>
> In any case, you are right, we should gather some data, because the
> performance impact is not so clear.

They're pretty clear, and there's a lot written about small file sizes
and parity raid performance being shit, no matter the implementation
(md, ZFS, Btrfs; hardware maybe less so, just because of all the caching
and extra processing hardware that's dedicated to the task).

The linux-raid@ list is full of optimizations for this that are use case
specific. One that often comes up is how badly suited raid56 is for e.g.
mail servers: tons of small file reads and writes, and all the disk
contention that comes with them, and it's even worse when you lose a
disk, and even if you're running raid6 and lose two disks it's really
god awful. It can be an unexpectedly disqualifying setup without prior
testing in that condition: can your workload really be usable for two or
three days in a double-degraded state on that raid6? *shrug*

Parity raid is well suited for full stripe reads and writes, lots of
sequential writes. Ergo a small file is anything less than a full stripe
write. Of course, delayed allocation can end up making for more full
stripe writes. But now you have more RMW, which is the real performance
killer, again no matter the raid.

> I am not worried about having different BGs; we have problems with
> these because we never developed tools to handle this issue properly
> (i.e. a daemon which starts a balance when needed). But I hope that
> this will be solved in the future.
>
> In any case, all the solutions proposed have their trade-offs:
>
> - a) as is: write hole bug
> - b) variable stripe size (like ZFS): big impact on how btrfs handles
>   extents; limited waste of space
> - c) logging data before writing: we write the data two times in a
>   short time window. Moreover the log area is written several orders of
>   magnitude more than the other areas; there were some patches around
> - d) rounding the writing to the stripe size: waste of space; simple to
>   implement
> - e) different BGs with different stripe sizes: limited waste of space;
>   logical fragmentation.

I'd say for sure you're worse off with metadata raid5 vs metadata raid1.
And if there are many devices you might be better off with metadata
raid1 even on a raid6; it's not an absolute certainty you lose the file
system with a 2nd drive failure - it depends on the device and what
chunk copies happen to be on it. But at the least, if you have a script
or some warning, you can relatively easily rebalance ... HMMM.

Actually that should be a test. Single-drive-degraded raid6 with
metadata raid1: can you do a metadata-only balance to force the missing
copy of metadata to be replicated again? In theory this should be quite
fast.

--
Chris Murphy
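Chris's point that RMW is "the real performance killer, no matter the
raid" follows from the parity arithmetic itself: rewriting one data
strip in place requires reading the old data and old parity first. A
raid5 (single XOR parity) illustration; raid6's second parity is
analogous but uses Reed-Solomon coefficients rather than plain XOR.
Function names here are illustrative, not from any implementation.

```python
# Sub-stripe writes on parity raid: the parity must become
# old_parity XOR old_data XOR new_data, so the old contents have to
# be read before anything can be written (the "RMW" cycle).

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def parity(strips):
    """Full-stripe write: parity computed from data already in hand,
    no reads needed at all."""
    p = strips[0]
    for s in strips[1:]:
        p = xor(p, s)
    return p

def rmw_parity(old_parity, old_data, new_data):
    """Small write: 2 reads (old data, old parity) + 2 writes
    (new data, new parity), whatever the stripe width."""
    return xor(xor(old_parity, old_data), new_data)

strips = [b"\x01\x02", b"\x10\x20", b"\x0f\x0f"]
p = parity(strips)
old = strips[1]
strips[1] = b"\x55\xaa"
assert rmw_parity(p, old, strips[1]) == parity(strips)
```

The write hole is the crash window inside `rmw_parity`'s two writes:
if new data lands but the matching parity doesn't (or vice versa), the
stripe is internally inconsistent, and a later disk loss reconstructs
garbage.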
Re: Status of RAID5/6
On Tue, Apr 03, 2018 at 07:03:06PM +0200, Goffredo Baroncelli wrote:
> On 04/03/2018 02:31 AM, Zygo Blaxell wrote:
> > On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote:
> >> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote:
> >>> On 2018-04-02 11:18, Goffredo Baroncelli wrote:
> >>>> I thought that a possible solution is to create BGs with different
> >>>> numbers of data disks. E.g. supposing we have a raid6 system with 6
> >>>> disks, where 2 are parity disks, we should allocate 3 BGs:
> >>>> BG #1: 1 data disk, 2 parity disks
> >>>> BG #2: 2 data disks, 2 parity disks
> >>>> BG #3: 4 data disks, 2 parity disks
> >>>>
> >>>> For simplicity, the disk-stripe length is assumed = 4K.
> >>>>
> >>>> So if you have a write with a length of 4KB, this should be placed
> >>>> in BG#1; if you have a write with a length of 4*3KB, the first 8KB
> >>>> should be placed in BG#2, then in BG#1.
> >>>> This would avoid space wasting, even if the fragmentation will
> >>>> increase (but does fragmentation matter with modern solid state
> >>>> disks?).
> >> I don't really see why this would increase fragmentation or waste
> >> space.
> >
> > Oh, wait, yes I do.  If there's a write of 6 blocks, we would have
> > to split an extent between BG #3 (the first 4 blocks) and BG #2 (the
> > remaining 2 blocks).  It also flips the usual order of "determine size
> > of extent, then allocate space for it" which might require major
> > surgery on the btrfs allocator to implement.
>
> I have to point out that in any case the extent is physically
> interrupted at the disk-stripe size. Assuming disk-stripe=64KB, if
> you want to write 128KB, the first half is written on the first disk,
> the other on the 2nd disk. If you want to write 96kb, the first 64
> are written on the first disk, the last part on the 2nd, only on a
> different BG.

The "only on a different BG" part implies something expensive, either
a seek or a new erase page depending on the hardware.  Without that,
nearby logical blocks are nearby physical blocks as well.

> So yes there is fragmentation from a logical point of view; from a
> physical point of view the data is spread over the disks in any case.

What matters is the extent-tree point of view.  There is (currently)
no fragmentation there, even for RAID5/6.  The extent tree is unaware
of RAID5/6 (to its peril).

ZFS makes its thing-like-the-extent-tree aware of RAID5/6, and it can
put a stripe of any size anywhere.  If we're going to do that in btrfs,
you might as well just do what ZFS does.  OTOH, variable-size block
groups give us read-compatibility with old kernel versions (and
write-compatibility for that matter--a kernel that didn't know about
the BG separation would just work but have write hole).

If an application does a loop writing 68K then fsync(), the multiple-BG
solution adds two seeks to read every 68K.  That's expensive if
sequential read bandwidth is more scarce than free space.

> In any case, you are right, we should gather some data, because the
> performance impact is not so clear.
>
> I am not worried about having different BGs; we have problems with
> these because we never developed tools to handle this issue properly
> (i.e. a daemon which starts a balance when needed). But I hope that
> this will be solved in the future.

Balance daemons are easy to the point of being trivial to write in
Python.  The balancing itself is quite expensive and invasive: you can't
usefully ionice it, you can only abort it on block group boundaries, and
you can't delete snapshots while it's running.  If balance could be
given a vrange that was the size of one extent...then we could talk
about daemons.

> In any case, all the solutions proposed have their trade-offs:
>
> - a) as is: write hole bug
> - b) variable stripe size (like ZFS): big impact on how btrfs handles
>   extents; limited waste of space
> - c) logging data before writing: we write the data two times in a
>   short time window. Moreover the log area is written several orders of
>   magnitude more than the other areas; there were some patches around
> - d) rounding the writing to the stripe size: waste of space; simple to
>   implement
> - e) different BGs with different stripe sizes: limited waste of space;
>   logical fragmentation.

Also:

- f) avoiding writes to partially filled stripes: free space
  fragmentation; simple to implement (ssd_spread does it accidentally)

The difference between d) and f) is that d) allocates the space to the
extent while f) leaves the space unallocated, but skips any free space
fragments smaller than the stripe size when allocating.  f) gets the
space back with a balance (i.e. it is exactly as space-efficient as (a)
after balance).

> * c),d),e) are applied only to the tail of the extent, in case the
> size is less than the stripe size.

It's only necessary to split an extent if there are no other writes
in the same transaction that could be combined with the extent tail
into a single RAID stripe.  As long as
Re: Status of RAID5/6
On 04/03/2018 02:31 AM, Zygo Blaxell wrote: > On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote: >> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote: >>> On 2018-04-02 11:18, Goffredo Baroncelli wrote: I thought that a possible solution is to create BG with different >>> number of data disks. E.g. supposing to have a raid 6 system with 6 >>> disks, where 2 are parity disk; we should allocate 3 BG BG #1: 1 data disk, 2 parity disks BG #2: 2 data disks, 2 parity disks, BG #3: 4 data disks, 2 parity disks For simplicity, the disk-stripe length is assumed = 4K. So If you have a write with a length of 4 KB, this should be placed >>> in BG#1; if you have a write with a length of 4*3KB, the first 8KB, >>> should be placed in in BG#2, then in BG#1. This would avoid space wasting, even if the fragmentation will >>> increase (but shall the fragmentation matters with the modern solid >>> state disks ?). >> I don't really see why this would increase fragmentation or waste space. > Oh, wait, yes I do. If there's a write of 6 blocks, we would have > to split an extent between BG #3 (the first 4 blocks) and BG #2 (the > remaining 2 blocks). It also flips the usual order of "determine size > of extent, then allocate space for it" which might require major surgery > on the btrfs allocator to implement. I have to point out that in any case the extent is physically interrupted at the disk-stripe size. Assuming disk-stripe=64KB, if you want to write 128KB, the first half is written in the first disk, the other in the 2nd disk. If you want to write 96kb, the first 64 are written in the first disk, the last part in the 2nd, only on a different BG. So yes there is a fragmentation from a logical point of view; from a physical point of view the data is spread on the disks in any case. In any case, you are right, we should gather some data, because the performance impact are no so clear. 
I am not worried about having different BG; we have problems with these because we never developed tools to handle this issue properly (i.e. a daemon which starts a balance when needed). But I hope that this will be solved in the future.

In any case, all the proposed solutions have their trade-offs:
- a) as is: write hole bug
- b) variable stripe size (like ZFS): big impact on how btrfs handles the extent; limited waste of space
- c) logging data before writing: we write the data two times in a short time window. Moreover the log area is written several orders of magnitude more than the other area; there were some patches around
- d) rounding the writing to the stripe size: waste of space; simple to implement
- e) different BG with different stripe size: limited waste of space; logical fragmentation

* c), d), e) are applied only to the tail of the extent, in case the size is less than the stripe size.
* for b), d), e), the waste of space may be reduced with a balance

--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Status of RAID5/6
On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote:
> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote:
> > On 2018-04-02 11:18, Goffredo Baroncelli wrote:
> > > I thought that a possible solution is to create BG with different
> > > number of data disks. E.g. supposing to have a raid 6 system with 6
> > > disks, where 2 are parity disks; we should allocate 3 BG
> > >
> > > BG #1: 1 data disk, 2 parity disks
> > > BG #2: 2 data disks, 2 parity disks
> > > BG #3: 4 data disks, 2 parity disks
> > >
> > > For simplicity, the disk-stripe length is assumed = 4K.
> > >
> > > So if you have a write with a length of 4 KB, this should be placed
> > > in BG#1; if you have a write with a length of 4*3KB, the first 8KB
> > > should be placed in BG#2, then in BG#1.
> > >
> > > This would avoid space wasting, even if the fragmentation will
> > > increase (but does fragmentation matter with modern solid state
> > > disks?).
>
> I don't really see why this would increase fragmentation or waste space.

Oh, wait, yes I do. If there's a write of 6 blocks, we would have to split an extent between BG #3 (the first 4 blocks) and BG #2 (the remaining 2 blocks). It also flips the usual order of "determine size of extent, then allocate space for it" which might require major surgery on the btrfs allocator to implement.

If we round that write up to 8 blocks (so we can put both pieces in BG #3), it degenerates into the "pretend partially filled RAID stripes are completely full" case, something like what ssd_spread already does. That trades less file fragmentation for more free space fragmentation.

> The extent size is determined before allocation anyway, all that changes
> in this proposal is where those small extents ultimately land on the disk.
>
> If anything, it might _reduce_ fragmentation since everything in BG #1
> and BG #2 will be of uniform size.
>
> It does solve write hole (one transaction per RAID stripe).
> > > Also, you're still going to be wasting space, it's just that less space will > > be wasted, and it will be wasted at the chunk level instead of the block > > level, which opens up a whole new set of issues to deal with, most > > significantly that it becomes functionally impossible without brute-force > > search techniques to determine when you will hit the common-case of -ENOSPC > > due to being unable to allocate a new chunk. > > Hopefully the allocator only keeps one of each size of small block groups > around at a time. The allocator can take significant short cuts because > the size of every extent in the small block groups is known (they are > all the same size by definition). > > When a small block group fills up, the next one should occupy the > most-empty subset of disks--which is the opposite of the usual RAID5/6 > allocation policy. This will probably lead to "interesting" imbalances > since there are now two allocators on the filesystem with different goals > (though it is no worse than -draid5 -mraid1, and I had no problems with > free space when I was running that). > > There will be an increase in the amount of allocated but not usable space, > though, because now the amount of free space depends on how much data > is batched up before fsync() or sync(). Probably best to just not count > any space in the small block groups as 'free' in statvfs terms at all. > > There are a lot of variables implied there. Without running some > simulations I have no idea if this is a good idea or not. > > > > Time to time, a re-balance should be performed to empty the BG #1, > > and #2. Otherwise a new BG should be allocated. > > That shouldn't be _necessary_ (the filesystem should just allocate > whatever BGs it needs), though it will improve storage efficiency if it > is done. 
> > > The cost should be comparable to the logging/journaling (each
> > > data shorter than a full-stripe, has to be written two times); the
> > > implementation should be quite easy, because already NOW btrfs
> > > support BG with different set of disks.
Re: Status of RAID5/6
On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote:
> On 2018-04-02 11:18, Goffredo Baroncelli wrote:
> > On 04/02/2018 07:45 AM, Zygo Blaxell wrote:
> > [...]
> > > It is possible to combine writes from a single transaction into full
> > > RMW stripes, but this *does* have an impact on fragmentation in btrfs.
> > > Any partially-filled stripe is effectively read-only and the space
> > > within it is inaccessible until all data within the stripe is
> > > overwritten, deleted, or relocated by balance.
> > >
> > > btrfs could do a mini-balance on one RAID stripe instead of a RMW
> > > stripe update, but that has a significant write magnification effect
> > > (and before kernel 4.14, non-trivial CPU load as well).
> > >
> > > btrfs could also just allocate the full stripe to an extent, but emit
> > > only extent ref items for the blocks that are in use. No fragmentation
> > > but lots of extra disk space used. Also doesn't quite work the same
> > > way for metadata pages.
> > >
> > > If btrfs adopted the ZFS approach, the extent allocator and all higher
> > > layers of the filesystem would have to know about--and skip over--the
> > > parity blocks embedded inside extents. Making this change would mean
> > > that some btrfs RAID profiles start interacting with stuff like balance
> > > and compression which they currently do not. It would create a new
> > > block group type and require an incompatible on-disk format change for
> > > both reads and writes.
> >
> > I thought that a possible solution is to create BG with different
> > number of data disks. E.g. supposing to have a raid 6 system with 6
> > disks, where 2 are parity disks; we should allocate 3 BG
> >
> > BG #1: 1 data disk, 2 parity disks
> > BG #2: 2 data disks, 2 parity disks
> > BG #3: 4 data disks, 2 parity disks
> >
> > For simplicity, the disk-stripe length is assumed = 4K.
> > So if you have a write with a length of 4 KB, this should be placed
> > in BG#1; if you have a write with a length of 4*3KB, the first 8KB
> > should be placed in BG#2, then in BG#1.
> >
> > This would avoid space wasting, even if the fragmentation will
> > increase (but does fragmentation matter with modern solid state
> > disks?).

I don't really see why this would increase fragmentation or waste space. The extent size is determined before allocation anyway; all that changes in this proposal is where those small extents ultimately land on the disk.

If anything, it might _reduce_ fragmentation since everything in BG #1 and BG #2 will be of uniform size.

It does solve write hole (one transaction per RAID stripe).

> Also, you're still going to be wasting space, it's just that less space
> will be wasted, and it will be wasted at the chunk level instead of the
> block level, which opens up a whole new set of issues to deal with, most
> significantly that it becomes functionally impossible without brute-force
> search techniques to determine when you will hit the common case of
> -ENOSPC due to being unable to allocate a new chunk.

Hopefully the allocator only keeps one of each size of small block groups around at a time. The allocator can take significant shortcuts because the size of every extent in the small block groups is known (they are all the same size by definition).

When a small block group fills up, the next one should occupy the most-empty subset of disks--which is the opposite of the usual RAID5/6 allocation policy. This will probably lead to "interesting" imbalances since there are now two allocators on the filesystem with different goals (though it is no worse than -draid5 -mraid1, and I had no problems with free space when I was running that).

There will be an increase in the amount of allocated but not usable space, though, because now the amount of free space depends on how much data is batched up before fsync() or sync().
Probably best to just not count any space in the small block groups as 'free' in statvfs terms at all. There are a lot of variables implied there. Without running some simulations I have no idea if this is a good idea or not. > > Time to time, a re-balance should be performed to empty the BG #1, > and #2. Otherwise a new BG should be allocated. That shouldn't be _necessary_ (the filesystem should just allocate whatever BGs it needs), though it will improve storage efficiency if it is done. > > The cost should be comparable to the logging/journaling (each > data shorter than a full-stripe, has to be written two times); the > implementation should be quite easy, because already NOW btrfs support > BG with different set of disks.
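The "most-empty subset of disks" policy for placing the next small block group can be sketched in a few lines. This is only an illustration of the selection rule described above; the function name and data layout are mine, not btrfs code.

```python
# Pick disks for the next small block group: choose the disks with the
# most free space (the opposite of the usual raid5/6 policy of striping
# across all disks). Illustrative sketch, not btrfs internals.

def pick_disks_for_small_bg(free_bytes_per_disk, data_disks, parity_disks):
    """Return the ids of the emptiest disks for a new small block group."""
    want = data_disks + parity_disks
    # Sort disk ids by free space, most free first.
    by_free = sorted(free_bytes_per_disk, key=free_bytes_per_disk.get,
                     reverse=True)
    return sorted(by_free[:want])

free = {0: 10, 1: 50, 2: 40, 3: 5, 4: 30}   # disk id -> GiB free
print(pick_disks_for_small_bg(free, 1, 2))  # [1, 2, 4]
```

Repeating this choice as groups fill up would tend to level the imbalance the two-allocator setup creates, though as the mail says, only a simulation would tell.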
Re: Status of RAID5/6
On 04/02/2018 07:45 AM, Zygo Blaxell wrote:
[...]
> It is possible to combine writes from a single transaction into full
> RMW stripes, but this *does* have an impact on fragmentation in btrfs.
> Any partially-filled stripe is effectively read-only and the space
> within it is inaccessible until all data within the stripe is
> overwritten, deleted, or relocated by balance.
>
> btrfs could do a mini-balance on one RAID stripe instead of a RMW
> stripe update, but that has a significant write magnification effect
> (and before kernel 4.14, non-trivial CPU load as well).
>
> btrfs could also just allocate the full stripe to an extent, but emit
> only extent ref items for the blocks that are in use. No fragmentation
> but lots of extra disk space used. Also doesn't quite work the same
> way for metadata pages.
>
> If btrfs adopted the ZFS approach, the extent allocator and all higher
> layers of the filesystem would have to know about--and skip over--the
> parity blocks embedded inside extents. Making this change would mean
> that some btrfs RAID profiles start interacting with stuff like balance
> and compression which they currently do not. It would create a new
> block group type and require an incompatible on-disk format change for
> both reads and writes.

I thought that a possible solution is to create BG with a different number of data disks. E.g. supposing to have a raid 6 system with 6 disks, where 2 are parity disks, we should allocate 3 BG:

BG #1: 1 data disk, 2 parity disks
BG #2: 2 data disks, 2 parity disks
BG #3: 4 data disks, 2 parity disks

For simplicity, the disk-stripe length is assumed = 4K.

So if you have a write with a length of 4 KB, this should be placed in BG#1; if you have a write with a length of 4*3KB, the first 8KB should be placed in BG#2, then the rest in BG#1.

This would avoid space wasting, even if the fragmentation will increase (but does fragmentation matter with modern solid state disks?).
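The placement rule above amounts to a greedy split of each write into per-BG pieces, largest data width first. A minimal sketch, using the example widths from the mail (4, 2, 1 data disks); the function name is mine, not btrfs code:

```python
# Greedily split a write of nblocks disk-stripes into pieces, each sized
# to exactly fill a stripe in the BG with that many data disks.

def place_write(nblocks, widths=(4, 2, 1)):
    """Return the data widths of the BG pieces covering nblocks."""
    pieces = []
    for w in widths:
        while nblocks >= w:
            pieces.append(w)    # one piece goes to the BG with w data disks
            nblocks -= w
    return pieces               # widths end at 1, so nothing is left over

# 3 blocks (4*3KB): first 2 blocks to BG#2, the last block to BG#1,
# matching the example above; 6 blocks split as BG#3 + BG#2.
print(place_write(3))   # [2, 1]
print(place_write(6))   # [4, 2]
```

Every piece fills its stripe exactly, so no RMW is ever needed; the 6-block case is the split discussed later in the thread.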
From time to time, a re-balance should be performed to empty BG #1 and #2. Otherwise a new BG should be allocated.

The cost should be comparable to the logging/journaling (each write shorter than a full stripe has to be written two times); the implementation should be quite easy, because btrfs already NOW supports BG with different sets of disks.

BR
G.Baroncelli

--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: Status of RAID5/6
On Sun, Apr 01, 2018 at 03:11:04PM -0600, Chris Murphy wrote:
> (I hate it when my palm rubs the trackpad and hits send prematurely...)
>
> On Sun, Apr 1, 2018 at 2:51 PM, Chris Murphy wrote:
>
> >> Users can run scrub immediately after _every_ unclean shutdown to
> >> reduce the risk of inconsistent parity and unrecoverable data should
> >> a disk fail later, but this can only prevent future write hole events,
> >> not recover data lost during past events.
> >
> > Problem is, Btrfs assumes a leaf is correct if it passes checksum. And
> > such a leaf containing EXTENT_CSUM is assumed to be correct.
>
> But in fact it could be stale. It's just as possible the metadata and
> superblock update is what's missing due to the interruption, while both
> data and parity strip writes succeeded. The window for either the data
> or parity write to fail is way shorter of a time interval than that of
> the numerous metadata writes, followed by superblock update.

csums cannot be wrong due to write interruption. The data and metadata blocks are written first, then barrier, then superblock updates pointing to the data and csums previously written in the same transaction. Unflushed data is not included in the metadata. If there is a write interruption then the superblock update doesn't occur and btrfs reverts to the previous unmodified data+csum trees.

This works on non-raid5/6 because all the writes that make up a single transaction are ordered and independent, and no data from older transactions is modified during any tree update.

On raid5/6 every RMW operation modifies data from old transactions by creating data/parity inconsistency. If there was no data in the stripe from an old transaction, the operation would be just a write, with no read and modify.

In the write hole case, the csum *is* correct; it is the data that is wrong.

> In such a case, the old metadata is what's pointed to, including
> EXTENT_CSUM.
> Therefore your scrub would always show csum errors, even if both data
> and parity are correct. You'd have to init-csum in this case, I suppose.

No, the csums are correct. The data does not match the csum because the data is corrupted. Assuming barriers work on your disk, and you're not having some kind of direct IO data consistency bug, and you can read the csum tree at all, then the csums are correct, even with write hole.

When write holes and other write interruption patterns affect the csum tree itself, this results in parent transid verify failures, csum tree page csum failures, or both. This forces the filesystem read-only, so it's easy to spot when it happens.

Note that the data blocks with wrong csums from raid5/6 reconstruction after a write hole event always belong to _old_ transactions damaged by the write hole. If the writes are interrupted, the new data blocks in a RMW stripe will not be committed and will have no csums to verify, so they can't have _wrong_ csums. The old data blocks do not have their csums changed by the write hole (the csum is stored in a separate tree in a different block group), so the csums are intact. When a write hole event corrupts the data reconstruction on a degraded array, the csum doesn't match because the csum is correct and the data is not.

> Pretty much it's RMW with a (partial) stripe overwrite upending COW,
> and therefore upending the atomicity, and thus consistency of Btrfs in
> the raid56 case where any portion of the transaction is interrupted.

Not any portion; only the RMW stripe update can produce data loss due to write interruption (well, that, and fsync() log-tree replay bugs). If any other part of the transaction is interrupted then btrfs recovers just fine with its COW tree update algorithm and write barriers.

> And this is amplified if metadata is also raid56.

Data and metadata are mangled the same way. The difference is the impact.
btrfs tolerates exactly 0 bits of damaged metadata after RAID recovery, and enforces this intolerance with metadata transids and csums, so write hole on metadata _always_ breaks the filesystem.

> ZFS avoids the problem at the expense of probably a ton of
> fragmentation, by taking e.g. a 4KiB RMW and writing a full-length
> stripe of 8KiB fully COW, rather than doing stripe modification with
> an overwrite. And that's because it has dynamic stripe lengths.

I think that's technically correct but could be clearer. ZFS never does RMW. It doesn't need to. Parity blocks are allocated at the extent level and RAID stripes are built *inside* the extents (or "groups of contiguous blocks written in a single transaction", which seems to be the closest ZFS equivalent of the btrfs extent concept).

Since every ZFS RAID stripe is bespoke sized to exactly fit a single write operation, no two ZFS transactions can ever share a RAID stripe. No transactions sharing a stripe means no write hole.

There is no impact on fragmentation on ZFS--space is
Re: Status of RAID5/6
(I hate it when my palm rubs the trackpad and hits send prematurely...)

On Sun, Apr 1, 2018 at 2:51 PM, Chris Murphy wrote:
>> Users can run scrub immediately after _every_ unclean shutdown to
>> reduce the risk of inconsistent parity and unrecoverable data should
>> a disk fail later, but this can only prevent future write hole events,
>> not recover data lost during past events.
>
> Problem is, Btrfs assumes a leaf is correct if it passes checksum. And
> such a leaf containing EXTENT_CSUM is assumed to be correct.

But in fact it could be stale. It's just as possible the metadata and superblock update is what's missing due to the interruption, while both data and parity strip writes succeeded. The window for either the data or parity write to fail is way shorter of a time interval than that of the numerous metadata writes, followed by superblock update. In such a case, the old metadata is what's pointed to, including EXTENT_CSUM. Therefore your scrub would always show csum errors, even if both data and parity are correct. You'd have to init-csum in this case, I suppose.

Pretty much it's RMW with a (partial) stripe overwrite upending COW, and therefore upending the atomicity, and thus consistency of Btrfs in the raid56 case where any portion of the transaction is interrupted. And this is amplified if metadata is also raid56.

ZFS avoids the problem at the expense of probably a ton of fragmentation, by taking e.g. a 4KiB RMW and writing a full-length stripe of 8KiB fully COW, rather than doing stripe modification with an overwrite. And that's because it has dynamic stripe lengths.

For Btrfs to always do COW would mean that 4KiB change goes into a new full stripe, 64KiB * num devices, assuming no other changes are ready at commit time. So yeah, avoiding the problem is best.
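The cost gap described above is easy to put numbers on. A back-of-envelope sketch comparing a fixed full-stripe COW write against a ZFS-style stripe sized to the write; the helper names and sizes are illustrative, following the 64KiB-strip example in the mail, not real btrfs or ZFS internals:

```python
import math

STRIP = 64 * 1024            # per-disk strip size assumed in the thread

def fixed_stripe_cost(write_bytes, data_disks, parity_disks):
    """Bytes hitting disk if every write occupies whole stripes (COW)."""
    stripes = math.ceil(write_bytes / (data_disks * STRIP))
    return stripes * (data_disks + parity_disks) * STRIP

def variable_stripe_cost(write_bytes, parity_disks, block=4096):
    """Bytes hitting disk when the stripe is sized to fit the write."""
    data_blocks = math.ceil(write_bytes / block)
    return (data_blocks + parity_disks) * block

# A 4 KiB change on 3 data + 2 parity disks:
print(fixed_stripe_cost(4096, 3, 2))   # 327680: 5 disks x 64 KiB
print(variable_stripe_cost(4096, 2))   # 12288: 1 data + 2 parity blocks
```

The fixed-stripe COW path writes roughly 27x as much for a single 4KiB change, which is why the dynamic-stripe approach is attractive despite the fragmentation it causes.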
But if it's going to be a journal, it's going to make things pretty damn slow I'd think, unless the journal can be explicitly placed on something faster than the array, like an SSD/NVMe device. And that's what mdadm allows and expects.

--
Chris Murphy
Re: Status of RAID5/6
On Sat, Mar 31, 2018 at 9:45 PM, Zygo Blaxell wrote:
> On Sat, Mar 31, 2018 at 04:34:58PM -0600, Chris Murphy wrote:
>> Write hole happens on disk in Btrfs, but the ensuing corruption on
>> rebuild is detected. Corrupt data never propagates.
>
> Data written with nodatasum or nodatacow is corrupted without detection
> (same as running ext3/ext4/xfs on top of mdadm raid5 without a parity
> journal device).

Yeah, I guess I'm not very worried about nodatasum/nodatacow if the user isn't. Perhaps it's not a fair bias, but bias nonetheless.

> Metadata always has csums, and files have checksums if they are created
> with default attributes and mount options. Those cases are covered,
> any corrupted data will give EIO on reads (except once per 4 billion
> blocks, where the corrupted CRC matches at random).
>
>> The problem is that Btrfs gives up when it's detected.
>
> Before recent kernels (4.14 or 4.15) btrfs would not attempt all possible
> combinations of recovery blocks for raid6, and earlier kernels than
> those would not recover correctly for raid5 either. I think this has
> all been fixed in recent kernels but I haven't tested these myself so
> don't quote me on that.

Looks like 4.15
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/diff/fs/btrfs/raid56.c?id=v4.15&id2=v4.14

And those parts aren't yet backported to 4.14
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/diff/fs/btrfs/raid56.c?id=v4.15.15&id2=v4.14.32

And more in 4.16
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/diff/fs/btrfs/raid56.c?id=v4.16-rc7&id2=v4.15

>> If it assumes just a bit flip - not always a correct assumption but
>> might be reasonable most of the time, it could iterate very quickly.
>
> That is not how write hole works (or csum recovery for that matter).
> Write hole producing a single bit flip would occur extremely rarely
> outside of contrived test cases.

Yes, what I wrote is definitely wrong, and I know better.
I guess I had a torn write in my brain! > Users can run scrub immediately after _every_ unclean shutdown to > reduce the risk of inconsistent parity and unrecoverable data should > a disk fail later, but this can only prevent future write hole events, > not recover data lost during past events. Problem is, Btrfs assumes a leaf is correct if it passes checksum. And such a leaf containing EXTENT_CSUM means that EXTENT_CSUM > > If one of the data blocks is not available, its content cannot be > recomputed from parity due to the inconsistency within the stripe. > This will likely be detected as a csum failure (unless the data block > is part of a nodatacow/nodatasum file, in which case corruption occurs > but is not detected) except for the one time out of 4 billion when > two CRC32s on random data match at random. > > If a damaged block contains btrfs metadata, the filesystem will be > severely affected: read-only, up to 100% of data inaccessible, only > recovery methods involving brute force search will work. > >> Flip bit, and recompute and compare checksum. It doesn't have to >> iterate across 64KiB times the number of devices. It really only has >> to iterate bit flips on the particular 4KiB block that has failed csum >> (or in the case of metadata, 16KiB for the default leaf size, up to a >> max of 64KiB). > > Write hole is effectively 32768 possible bit flips in a 4K block--assuming > only one block is affected, which is not very likely. Each disk in an > array can have dozens of block updates in flight when an interruption > occurs, so there can be millions of bits corrupted in a single write > interruption event (and dozens of opportunities to encounter the nominally > rare write hole itself). > > An experienced forensic analyst armed with specialized tools, a database > of file formats, and a recent backup of the filesystem might be able to > recover the damaged data or deduce what it was. btrfs, being only mere > software running in the kernel, cannot. 
> > There are two ways to solve the write hole problem and this is not one > of them. > >> That's a maximum of 4096 iterations and comparisons. It'd be quite >> fast. And going for two bit flips while a lot slower is probably not >> all that bad either. > > You could use that approach to fix a corrupted parity or data block > on a degraded array, but not a stripe that has data blocks destroyed > by an update with a write hole event. Also this approach assumes that > whatever is flipping bits in RAM is not in and of itself corrupting data > or damaging the filesystem in unrecoverable ways, but most RAM-corrupting > agents in the real world do not limit themselves only to detectable and > recoverable mischief. > > Aside: As a best practice, if you see one-bit corruptions on your > btrfs filesystem, it is time to start replacing hardware, possibly also > finding a new hardware vendor or model (assuming the corruption is coming > from hardware, not a kernel
Re: Status of RAID5/6
On Sat, Mar 31, 2018 at 04:34:58PM -0600, Chris Murphy wrote:
> On Sat, Mar 31, 2018 at 12:57 AM, Goffredo Baroncelli wrote:
> > On 03/31/2018 07:03 AM, Zygo Blaxell wrote:
> > > > > btrfs has no optimization like mdadm write-intent bitmaps;
> > > > > recovery is always a full-device operation. In theory btrfs
> > > > > could track modifications at the chunk level but this isn't even
> > > > > specified in the on-disk format, much less implemented.
> > > >
> > > > It could go even further; it would be sufficient to track which
> > > > *partial* stripe updates will be performed before a commit, in one
> > > > of the btrfs logs. Then in case of a mount of an unclean
> > > > filesystem, a scrub on these stripes would be sufficient.
> > >
> > > A scrub cannot fix a raid56 write hole--the data is already lost.
> > > The damaged stripe updates must be replayed from the log.
> >
> > Your statement is correct, but you don't consider the COW nature of
> > btrfs.
> >
> > The key is that if a data write is interrupted, the whole transaction
> > is interrupted and aborted. And due to the COW nature of btrfs, the
> > "old state" is restored at the next reboot.
> >
> > What is needed in any case is a rebuild of parity to avoid the
> > "write-hole" bug.
>
> Write hole happens on disk in Btrfs, but the ensuing corruption on
> rebuild is detected. Corrupt data never propagates.

Data written with nodatasum or nodatacow is corrupted without detection (same as running ext3/ext4/xfs on top of mdadm raid5 without a parity journal device).

Metadata always has csums, and files have checksums if they are created with default attributes and mount options. Those cases are covered; any corrupted data will give EIO on reads (except once per 4 billion blocks, where the corrupted CRC matches at random).

> The problem is that Btrfs gives up when it's detected.

Before recent kernels (4.14 or 4.15) btrfs would not attempt all possible combinations of recovery blocks for raid6, and earlier kernels than those would not recover correctly for raid5 either.
I think this has all been fixed in recent kernels but I haven't tested these myself so don't quote me on that. Other than that, btrfs doesn't give up in the write hole case. It rebuilds the data according to the raid5/6 parity algorithm, but the algorithm doesn't produce correct data for interrupted RMW writes when there is no stripe update journal. There is nothing else to try at that point. By the time the error is detected the opportunity to recover the data has long passed. The data that comes out of the recovery algorithm is a mixture of old and new data from the filesystem. The "new" data is something that was written just before a failure, but the "old" data could be data of any age, even a block of free space, that previously existed on the filesystem. If you bypass the EIO from the failing csums (e.g. by using btrfs rescue) it will appear as though someone took the XOR of pairs of random blocks from the disk and wrote it over one of the data blocks at random. When this happens to btrfs metadata, it is effectively a fuzz tester for tools like 'btrfs check' which will often splat after a write hole failure happens. > If it assumes just a bit flip - not always a correct assumption but > might be reasonable most of the time, it could iterate very quickly. That is not how write hole works (or csum recovery for that matter). Write hole producing a single bit flip would occur extremely rarely outside of contrived test cases. Recall that in a write hole, one or more 4K blocks are updated on some of the disks in a stripe, but other blocks retain their original values from prior to the update. This is OK as long as all disks are online, since the parity can be ignored or recomputed from the data blocks. It is also OK if the writes on all disks are completed without interruption, since the data and parity eventually become consistent when all writes complete as intended. 
It is also OK if the entire stripe is written at once, since then there is only one transaction referring to the stripe, and if that transaction is not committed then the content of the stripe is irrelevant.

The write hole error event is when all of the following occur:

- a stripe containing committed data from one or more btrfs transactions
  is modified by a raid5/6 RMW update in a new transaction. This is the
  usual case on a btrfs filesystem with the default, 'nossd' or 'ssd'
  mount options.

- the write is not completed (due to crash, power failure, disk failure,
  bad sector, SCSI timeout, bad cable, firmware bug, etc), so the parity
  block is out of sync with the modified data blocks (before or after,
  order doesn't matter).

- the array is already degraded, or later becomes degraded before the
  parity block can be recomputed by a scrub.

Users can run scrub immediately after _every_ unclean shutdown to reduce the risk of inconsistent parity
Re: Status of RAID5/6
On Sat, Mar 31, 2018 at 12:57 AM, Goffredo Baroncelli wrote:
> On 03/31/2018 07:03 AM, Zygo Blaxell wrote:
>>>> btrfs has no optimization like mdadm write-intent bitmaps; recovery
>>>> is always a full-device operation. In theory btrfs could track
>>>> modifications at the chunk level but this isn't even specified in the
>>>> on-disk format, much less implemented.
>>>
>>> It could go even further; it would be sufficient to track which
>>> *partial* stripe updates will be performed before a commit, in one
>>> of the btrfs logs. Then in case of a mount of an unclean filesystem,
>>> a scrub on these stripes would be sufficient.
>>
>> A scrub cannot fix a raid56 write hole--the data is already lost.
>> The damaged stripe updates must be replayed from the log.
>
> Your statement is correct, but you don't consider the COW nature of
> btrfs.
>
> The key is that if a data write is interrupted, the whole transaction
> is interrupted and aborted. And due to the COW nature of btrfs, the
> "old state" is restored at the next reboot.
>
> What is needed in any case is a rebuild of parity to avoid the
> "write-hole" bug.

Write hole happens on disk in Btrfs, but the ensuing corruption on rebuild is detected. Corrupt data never propagates. The problem is that Btrfs gives up when it's detected.

If it assumes just a bit flip - not always a correct assumption but might be reasonable most of the time - it could iterate very quickly. Flip a bit, then recompute and compare the checksum. It doesn't have to iterate across 64KiB times the number of devices. It really only has to iterate bit flips on the particular 4KiB block that has failed csum (or in the case of metadata, 16KiB for the default leaf size, up to a max of 64KiB).

That's a maximum of 4096 iterations and comparisons. It'd be quite fast. And going for two bit flips, while a lot slower, is probably not all that bad either.
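The search loop being proposed is simple to sketch. Note that btrfs actually uses crc32c; `zlib.crc32` stands in here only to show the shape of the iteration, and the function name is mine:

```python
import zlib

def repair_single_bitflip(block: bytearray, want_csum: int):
    """Try every 1-bit flip; return the fixed bytes if one matches."""
    for byte in range(len(block)):
        for bit in range(8):
            block[byte] ^= 1 << bit
            if zlib.crc32(bytes(block)) == want_csum:
                return bytes(block)       # found the flipped bit
            block[byte] ^= 1 << bit       # undo and keep searching
    return None                           # not a single-bit error

good = bytes(range(256)) * 16             # a 4 KiB block
csum = zlib.crc32(good)
bad = bytearray(good)
bad[100] ^= 0x04                          # inject one bit flip
assert repair_single_bitflip(bad, csum) == good
```

For a 4KiB block this is 4096 bytes x 8 bits = 32768 candidate flips (the figure the reply below uses), each needing one checksum recomputation, so it is indeed fast when the damage really is a single flipped bit.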
Now if it's the kind of corruption you get from a torn or misdirected write, there's enough corruption that now you're trying to find a collision on crc32c with a partial match as a guide. That'd take a while, and who knows, you might actually get corrupted data anyway since crc32c isn't cryptographically secure.

--
Chris Murphy
Re: Status of RAID5/6
On Sat, Mar 31, 2018 at 11:36:50AM +0300, Andrei Borzenkov wrote: > 31.03.2018 11:16, Goffredo Baroncelli wrote: > > On 03/31/2018 09:43 AM, Zygo Blaxell wrote: > >>> The key is that if a data write is interrupted, all the transaction > >>> is interrupted and aborted. And due to the COW nature of btrfs, the > >>> "old state" is restored at the next reboot. > > > >> This is not presently true with raid56 and btrfs. RAID56 on btrfs uses > >> RMW operations which are not COW and don't provide any data integrity > >> guarantee. Old data (i.e. data from very old transactions that are not > >> part of the currently written transaction) can be destroyed by this. > > > > Could you elaborate a bit ? > > > > Generally speaking, updating a part of a stripe require a RMW cycle, because > > - you need to read all data stripe (with parity in case of a problem) > > - then you should write > > - the new data > > - the new parity (calculated on the basis of the first read, and the > > new data) > > > > However the "old" data should be untouched; or you are saying that the > > "old" data is rewritten with the same data ? > > > > If old data block becomes unavailable, it can no more be reconstructed > because old content of "new data" and "new parity" blocks are lost. > Fortunately if checksum is in use it does not cause silent data > corruption but it effectively means data loss. > > Writing of data belonging to unrelated transaction affects previous > transactions precisely due to RMW cycle. This fundamentally violates > btrfs claim of always having either old or new consistent state. Correct. To fix this, any RMW stripe update on raid56 has to be written to a log first. All RMW updates must be logged because a disk failure could happen at any time. Full stripe writes don't need to be logged because all the data in the stripe belongs to the same transaction, so if a disk fails the entire stripe is either committed or it is not.
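The data loss Andrei describes (a torn RMW update destroying an old, unrelated block) follows directly from the RAID5 XOR math and can be simulated. A minimal illustrative sketch with made-up strip contents:

```python
def xor_strips(a: bytes, b: bytes) -> bytes:
    """XOR two equal-sized strips (RAID5 parity arithmetic)."""
    return bytes(x ^ y for x, y in zip(a, b))

# A committed stripe on a 3-disk RAID5: two data strips plus parity.
d0_old = b"OLD-DATA-0......"
d1_old = b"OLD-DATA-1......"
parity_old = xor_strips(d0_old, d1_old)

# An RMW update of d1 is torn by a crash: the new d1 reaches its disk,
# but the matching parity write does not.  That is the write hole.
d1_new = b"NEW-DATA-1......"

# Later the disk holding d0 fails.  Reconstruction XORs the survivors,
# which now come from two different transactions:
d0_rebuilt = xor_strips(d1_new, parity_old)

# d0 was never written in this transaction, yet it is destroyed.
assert d0_rebuilt != d0_old
```

With checksums the bad reconstruction is detected rather than returned silently, which is exactly the "no silent corruption, but effectively data loss" point above.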
One way to avoid the logging is to change the btrfs allocation parameters so that the filesystem doesn't allocate data in RAID stripes that are already occupied by data from older transactions. This is similar to what 'ssd_spread' does, although the ssd_spread option wasn't designed for this and won't be effective on large arrays. This avoids modifying stripes that contain old committed data, but it also means the free space on the filesystem will become heavily fragmented over time. Users will have to run balance *much* more often to defragment the free space.
Re: Status of RAID5/6
On 03/31/2018 09:43 AM, Zygo Blaxell wrote: >> The key is that if a data write is interrupted, all the transaction >> is interrupted and aborted. And due to the COW nature of btrfs, the >> "old state" is restored at the next reboot. > This is not presently true with raid56 and btrfs. RAID56 on btrfs uses > RMW operations which are not COW and don't provide any data integrity > guarantee. Old data (i.e. data from very old transactions that are not > part of the currently written transaction) can be destroyed by this. Could you elaborate a bit ? Generally speaking, updating a part of a stripe requires an RMW cycle, because - you need to read all the data stripes (with parity in case of a problem) - then you should write - the new data - the new parity (calculated on the basis of the first read, and the new data) However the "old" data should be untouched; or are you saying that the "old" data is rewritten with the same data ? BR G.Baroncelli -- gpg @keyserver.linux.it: Goffredo Baroncelli Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
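The RMW cycle described above reduces to XOR arithmetic. A minimal sketch of one common variant, where the new parity is computed from just the old data strip and the old parity rather than by re-reading every other strip (names and strip values here are illustrative, not btrfs internals):

```python
def bxor(a: bytes, b: bytes) -> bytes:
    """XOR two equal-sized byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

# A 4-disk RAID5 stripe: three data strips and their parity.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
stripe_parity = bxor(bxor(d0, d1), d2)

# RMW shortcut to update only d1: read the old d1 and the old parity,
# cancel d1's old contribution, then add the new one.  d0 and d2 are
# never read.
d1_new = b"XXXX"
new_parity = bxor(bxor(stripe_parity, d1), d1_new)

# Same result as recomputing parity over the whole new stripe.
assert new_parity == bxor(bxor(d0, d1_new), d2)
```

Either way, the new data and new parity land on disk as two separate writes with no transactional guarantee between them, which is where the write hole comes from.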
Re: Status of RAID5/6
On Sat, Mar 31, 2018 at 08:57:18AM +0200, Goffredo Baroncelli wrote: > On 03/31/2018 07:03 AM, Zygo Blaxell wrote: > >>> btrfs has no optimization like mdadm write-intent bitmaps; recovery > >>> is always a full-device operation. In theory btrfs could track > >>> modifications at the chunk level but this isn't even specified in the > >>> on-disk format, much less implemented. > >> It could go even further; it would be sufficient to track which > >> *partial* stripes update will be performed before a commit, in one > >> of the btrfs logs. Then in case of a mount of an unclean filesystem, > >> a scrub on these stripes would be sufficient. > > > A scrub cannot fix a raid56 write hole--the data is already lost. > > The damaged stripe updates must be replayed from the log. > > Your statement is correct, but you doesn't consider the COW nature of btrfs. > > The key is that if a data write is interrupted, all the transaction > is interrupted and aborted. And due to the COW nature of btrfs, the > "old state" is restored at the next reboot. This is not presently true with raid56 and btrfs. RAID56 on btrfs uses RMW operations which are not COW and don't provide any data integrity guarantee. Old data (i.e. data from very old transactions that are not part of the currently written transaction) can be destroyed by this. > What is needed in any case is rebuild of parity to avoid the > "write-hole" bug. And this is needed only for a partial stripe > write. For a full stripe write, due to the fact that the commit is > not flushed, it is not needed the scrub at all. > > Of course for the NODATACOW file this is not entirely true; but I > don't see the gain to switch from the cost of COW to the cost of a log. > > The above sentences are correct (IMHO) if we don't consider a power > failure+device missing case. However in this case even logging the > "new data" would be not sufficient. 
Re: Status of RAID5/6
On 03/31/2018 07:03 AM, Zygo Blaxell wrote: >>> btrfs has no optimization like mdadm write-intent bitmaps; recovery >>> is always a full-device operation. In theory btrfs could track >>> modifications at the chunk level but this isn't even specified in the >>> on-disk format, much less implemented. >> It could go even further; it would be sufficient to track which >> *partial* stripes update will be performed before a commit, in one >> of the btrfs logs. Then in case of a mount of an unclean filesystem, >> a scrub on these stripes would be sufficient. > A scrub cannot fix a raid56 write hole--the data is already lost. > The damaged stripe updates must be replayed from the log. Your statement is correct, but you don't consider the COW nature of btrfs. The key is that if a data write is interrupted, the whole transaction is interrupted and aborted. And due to the COW nature of btrfs, the "old state" is restored at the next reboot. What is needed in any case is a rebuild of parity to avoid the "write-hole" bug. And this is needed only for a partial stripe write. For a full stripe write, because the commit is not flushed, no scrub is needed at all. Of course for NODATACOW files this is not entirely true; but I don't see the gain in switching from the cost of COW to the cost of a log. The above sentences are correct (IMHO) if we don't consider a power failure+device missing case. However in this case even logging the "new data" would not be sufficient. BR G.Baroncelli
Re: Status of RAID5/6
On Fri, Mar 30, 2018 at 06:14:52PM +0200, Goffredo Baroncelli wrote: > On 03/29/2018 11:50 PM, Zygo Blaxell wrote: > > On Wed, Mar 21, 2018 at 09:02:36PM +0100, Christoph Anton Mitterer wrote: > >> Hey. > >> > >> Some things would IMO be nice to get done/clarified (i.e. documented in > >> the Wiki and manpages) from users'/admin's POV: > [...] > > > > btrfs has no optimization like mdadm write-intent bitmaps; recovery > > is always a full-device operation. In theory btrfs could track > > modifications at the chunk level but this isn't even specified in the > > on-disk format, much less implemented. > > It could go even further; it would be sufficient to track which > *partial* stripes update will be performed before a commit, in one > of the btrfs logs. Then in case of a mount of an unclean filesystem, > a scrub on these stripes would be sufficient. A scrub cannot fix a raid56 write hole--the data is already lost. The damaged stripe updates must be replayed from the log. A scrub could fix raid1/raid10 partial updates but only if the filesystem can reliably track which blocks failed to be updated by the disconnected disks. It would be nice if scrub could be filtered the same way balance is, e.g. only certain block ranges, or only metadata blocks; however, this is not presently implemented.
Re: Status of RAID5/6
On Fri, Mar 30, 2018 at 09:21:00AM +0200, Menion wrote: > Thanks for the detailed explanation. I think that a summary of this > should go in the btrfs raid56 wiki status page, because now it is > completely inconsistent and if a user comes there, ihe may get > the impression that the raid56 is just broken > Still I have the 1 bilion dollar question: from your word I understand > that even in RAID56 the metadata are spread on the devices in a coplex > way, but shall I assume that the array can survice to the sudden death > of one (two for raid6) HDD in the array? I wouldn't assume that. There is still the write hole, and while there is a small probability of having a write hole failure, it's a probability that applies on *every* write in degraded mode, and since disks can fail at any time, the array can enter degraded mode at any time. It's similar to lottery tickets--buy one ticket, you probably won't win, but if you buy millions of tickets, you'll claim the prize eventually. The "prize" in this case is a severely damaged, possibly unrecoverable filesystem. If the data is raid5 and the metadata is raid1, the filesystem can survive a single disk failure easily; however, some of the data may be lost if writes to the remaining disks are interrupted by a system crash or power failure and the write hole issue occurs. Note that the damage is not necessarily limited to recently written data--it's any random data that is merely located adjacent to written data on the filesystem. I wouldn't use raid6 until the write hole issue is resolved. There is no configuration where two disks can fail and metadata can still be updated reliably. Some users use the 'ssd_spread' mount option to reduce the probability of write hole failure, which happens to be helpful by accident on some array configurations, but it has a fairly high cost when the array is not degraded due to all the extra balancing required. > Bye
Re: Status of RAID5/6
On 03/29/2018 11:50 PM, Zygo Blaxell wrote: > On Wed, Mar 21, 2018 at 09:02:36PM +0100, Christoph Anton Mitterer wrote: >> Hey. >> >> Some things would IMO be nice to get done/clarified (i.e. documented in >> the Wiki and manpages) from users'/admin's POV: [...] > >> - changing raid lvls? > > btrfs uses a brute-force RAID conversion algorithm which always works, but > takes zero short cuts. e.g. there is no speed optimization implemented > for cases like "convert 2-disk raid1 to 1-disk single" which can be > very fast in theory. The worst-case running time is the only running > time available in btrfs. [...] What Zygo reports is an excellent source of information. However I have to point out that BTRFS has a little optimization: i.e. scrub/balance only work on allocated chunks. So a partially filled filesystem requires less time than a nearly full one. > > btrfs has no optimization like mdadm write-intent bitmaps; recovery > is always a full-device operation. In theory btrfs could track > modifications at the chunk level but this isn't even specified in the > on-disk format, much less implemented. It could go even further; it would be sufficient to track which *partial* stripes update will be performed before a commit, in one of the btrfs logs. Then in case of a mount of an unclean filesystem, a scrub on these stripes would be sufficient. BR G.Baroncelli [...]
Re: Status of RAID5/6
Thanks for the detailed explanation. I think that a summary of this should go in the btrfs raid56 wiki status page, because now it is completely inconsistent and if a user comes there, he may get the impression that raid56 is just broken. Still I have the 1 billion dollar question: from your words I understand that even in RAID56 the metadata are spread on the devices in a complex way, but shall I assume that the array can survive the sudden death of one (two for raid6) HDD in the array? Bye
Re: Status of RAID5/6
On Wed, Mar 21, 2018 at 09:02:36PM +0100, Christoph Anton Mitterer wrote: > Hey. > > Some things would IMO be nice to get done/clarified (i.e. documented in > the Wiki and manpages) from users'/admin's POV: > > Some basic questions: I can answer some easy ones: > - compression+raid? There is no interaction between compression and raid. They happen on different data trees at different levels of the stack. So if the raid works, compression does too. > - rebuild / replace of devices? "replace" needs raid-level-specific support. If the raid level doesn't support replace, then users have to do device add followed by device delete, which is considerably (orders of magnitude) slower. > - changing raid lvls? btrfs uses a brute-force RAID conversion algorithm which always works, but takes zero short cuts. e.g. there is no speed optimization implemented for cases like "convert 2-disk raid1 to 1-disk single" which can be very fast in theory. The worst-case running time is the only running time available in btrfs. Also, users have to understand how the different raid allocators work to understand their behavior in specific situations. Without this understanding, the set of restrictions that pop up in practice can seem capricious and arbitrary. e.g. after adding 1 disk to a nearly-full raid1, full balance is required to make the new space available, but adding 2 disks makes all the free space available immediately. Generally it always works if you repeatedly run full-balances in a loop until you stop running out of space, but again, this is the worst case. > - anything to consider with raid when doing snapshots, send/receive > or defrag? Snapshot deletes cannot run at the same time as RAID convert/device delete/device shrink/resize. If one is started while the other is running, it will be blocked until the other finishes. Internally these operations block each other on a mutex. I don't know if snapshot deletes interact with device replace (the case has never come up for me). 
I wouldn't expect it to as device replace is more similar to scrub than balance, and scrub has no such interaction. Also note you can only run one balance, device shrink, or device delete at a time. If you start one of these three operations while another is already running, the new request is rejected immediately. As far as I know there are no other restrictions. > => and for each of these: for which raid levels? Most of those features don't interact with anything specific to a raid layer, so they work on all raid levels. Device replace is the exception: all RAID levels in use on the filesystem must support it, or the user must use device add and device delete instead. [Aside: I don't know if any RAID levels that do not support device replace still exist, which makes my answer longer than it otherwise would be] > Perhaps also confirmation for previous issues: > - I vaguely remember there were issues with either device delete or > replace and that one of them was possibly super-slow? Device replace is faster than device delete. Replace does not modify any metadata, while delete rewrites all the metadata referring to the removed device. Delete can be orders of magnitude slower than expected because of the metadata modifications required. > - I also remember there were cases in which a fs could end up in > permanent read-only state? Any unrecovered metadata error 1 bit or larger will do that. RAID level is relevant only in terms of how well it can recover corrupted or unreadable metadata blocks. > - Clarifying questions on what is expected to work and how things are > expected to behave, e.g.: > - Can one plug a device (without deleting/removing it first) just > under operation and will btrfs survive it? On raid1 and raid10, yes. On raid5/6 you will be at risk of write hole problems if the filesystem is modified while the device is unplugged. If the device is later reconnected, you should immediately scrub to bring the metadata on the devices back in sync. 
Data written to the filesystem while the device was offline will be corrected if the csum is different on the removed device. If there is no csum, the data will be silently corrupted. If the csum is correct, but the data is not (this occurs with 2^-32 probability on random data where the CRC happens to be identical), then the data will be silently corrupted. A full replace of the removed device would be better than a scrub, as that will get a known good copy of the data. If the device is offline for a long time, it should be wiped before being reintroduced to the rest of the array to avoid data integrity issues. It may be necessary to specify a different device name when mounting a filesystem that has had a disk removed and later reinserted until the scrub or replace action above is completed. btrfs has no optimization like mdadm write-intent bitmaps; recovery is always a full-device operation. In theory btrfs
Re: Status of RAID5/6
Liu Bo wrote: On Wed, Mar 21, 2018 at 9:50 AM, Menion <men...@gmail.com> wrote: Hi all I am trying to understand the status of RAID5/6 in BTRFS I know that there are some discussion ongoing on the RFC patch proposed by Liu bo But it seems that everything stopped last summary. Also it mentioned about a "separate disk for journal", does it mean that the final implementation of RAID5/6 will require a dedicated HDD for the journaling? Thanks for the interest on btrfs and raid56. The patch set is to plug write hole, which is very rare in practice, tbh. The feedback is to use existing space instead of another dedicate "fast device" as the journal in order to get some extent of raid protection. I'd need some time to pick it up. With that being said, we have several data reconstruction fixes for raid56 (esp. raid6) in 4.15, I'd say please deploy btrfs with the upstream kernel or some distros which do kernel updates frequently, the most important one is 8810f7517a3b Btrfs: make raid6 rebuild retry more https://patchwork.kernel.org/patch/10091755/ AFAIK, no other data corruptions showed up. I am very interested in the "raid"5/6 like behavior myself. Actually calling it RAID in the past may have had its benefits, but these days continuing to use the RAID term is not helping. Even technically minded people seem to get confused. For example: It was suggested that "raid"5/6 should have hot-spare support. In BTRFS terms a hot spare device sounds wrong to me, but reserving extra space for a "hot-space" so any "raid"5/6 like system can (auto?) rebalance the missing blocks to the rest of the pool sounds sensible enough (as long as the number of devices allows to separate the different bits and pieces). Anyway, I got carried away a bit there. Sorry about that. What I really wanted to comment on is the usability of "raid"5/6. How would a metadata "raid"1 + data "raid"5 or 6 setup really compare to, say, mdraid 5 or 6 from a reliability point of view?
Sure mdraid has the advantage, but even with the write hole and the risk of corruption of data (not the filesystem), would not BTRFS in "theory" be safer than at least mdraid 5 if run with metadata "raid"5?! You have to run scrub on both mdraid as well as BTRFS to ensure data is not corrupted. PS! It might be worth mentioning that I am slightly affected by a Glenfarclas 105 Whisky while writing this so please bear with me in case something is too far off :)
Re: Status of RAID5/6
On 2018-03-21 16:02, Christoph Anton Mitterer wrote: On the note of maintenance specifically: - Maintenance tools - How to get the status of the RAID? (Querying kernel logs is IMO rather a bad way for this) This includes: - Is the raid degraded or not? Check for the 'degraded' flag in the mount options. Assuming you're doing things sensibly and not specifying it on mount, it gets added when the array goes degraded. - Are scrubs/repairs/rebuilds/reshapes in progress and how far are they? (Reshape would be: if the raid level is changed or the raid grown/shrunk: has all data been replicated enough to be "complete" for the desired raid lvl/number of devices/size?) A bit trickier, but still not hard, just check the output of `btrfs scrub status`, `btrfs balance status`, and `btrfs replace status` for the volume. It won't check automatic spot-repairs (that is, repairing individual blocks that fail checksums), but most people really don't care. - What should one regularly do? scrubs? balance? How often? Do we get any automatic (but configurable) tools for this? There aren't any such tools that I know of currently. storaged might have some, but I've never really looked at it so I can't comment (I'm kind of averse to having hundreds of background services running to do stuff that can just as easily be done in a polling manner from cron without compromising their utility). Right now though, it's _trivial_ to automate things with cron, or systemd timers, or even third-party tools like monit (which has the bonus that if the maintenance fails, you get an e-mail about it). - There should be support in commonly used tools, e.g. Icinga/Nagios check_raid Agreed. I think there might already be a Nagios plugin for the basic checks, not sure about anything else though.
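The 'degraded' mount-flag check described above is easy to automate by parsing the mount table. A small sketch (the function name is mine; real code would read /proc/self/mounts rather than the sample string shown here):

```python
def degraded_btrfs_mounts(mounts_text: str):
    """Return the mount points of btrfs filesystems whose mount
    options include 'degraded', given /proc/self/mounts content."""
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # mounts format: device mountpoint fstype options dump pass
        if len(fields) >= 4 and fields[2] == "btrfs":
            if "degraded" in fields[3].split(","):
                hits.append(fields[1])
    return hits

# Shown with sample content instead of reading /proc/self/mounts:
sample = (
    "/dev/sda2 / ext4 rw,relatime 0 0\n"
    "/dev/sdb1 /data btrfs rw,relatime,degraded,space_cache 0 0\n"
)
print(degraded_btrfs_mounts(sample))  # -> ['/data']
```

Run from cron or a systemd timer, a non-empty result is a reasonable trigger for an alert e-mail.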
Netdata has had basic monitoring support for a while now, but it only looks at allocations, not error counters, so while it will help catch impending ENOSPC issues, it can't really help much with data corruption issues. - Ideally there should also be some desktop notification tool, which tells about raid (and btrfs errors in general) as small installations with raids typically run no Icinga/Nagios but rely on e.g. email or gui notifications. Desktop notifications would be nice, but are out of scope for the main btrfs-progs. Not even LVM, MDADM, or ZFS ship desktop notification support from upstream. You don't need Icinga or Nagios for monitoring either. Netdata works pretty well for covering the allocation checks (and I'm planning to have something more soon), and it's trivial to set up e-mail notifications with cron or systemd timers or even tools like monit. On the note of generic monitoring though, I've been working on a Python 3 script (with no dependencies beyond the Python standard library) to do the same checks that Netdata does regarding allocations, as well as checking device error counters and mount options, that should be reasonable as a simple warning tool run from cron or a systemd timer. I'm hoping to get it included in the upstream btrfs-progs, but I don't have it in a state yet where it's ready to be posted (the checks are working, but I'm still having issues reliably mapping between mount points and filesystem UUIDs). I think especially for such tools it's important that these are maintained by upstream (and yes I know you guys are rather fs developers, not tool developers)... but since these tools are so vital, having them done 3rd party can easily lead to the situation where something changes in btrfs, the tools don't notice, and errors remain undetected. It depends on what they look at.
All the stuff under /sys/fs/btrfs should never change (new things might get added, but none of the old stuff is likely to ever change because /sys is classified as part of the userspace ABI, and any changes would get shot down by Linus), so anything that just uses those will likely have no issues (Netdata falls into this category for example). Same goes for anything using ioctls directly, as those are also userspace ABI.
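As an example of leaning only on that stable sysfs ABI, allocation levels can be read from /sys/fs/btrfs/<UUID>/allocation/<type>/{bytes_used,total_bytes} (present on reasonably recent kernels). The helper names and the 90% threshold below are my own, a sketch of the kind of cron-driven check being discussed:

```python
import os

SYSFS_BTRFS = "/sys/fs/btrfs"  # one directory per mounted filesystem UUID

def _read_u64(path: str) -> int:
    """Read a single decimal counter from a sysfs file."""
    with open(path) as f:
        return int(f.read().strip())

def allocation_report(uuid: str) -> dict:
    """Map each block-group type to (bytes_used, total_bytes) from sysfs."""
    report = {}
    for kind in ("data", "metadata", "system"):
        base = os.path.join(SYSFS_BTRFS, uuid, "allocation", kind)
        report[kind] = (_read_u64(os.path.join(base, "bytes_used")),
                        _read_u64(os.path.join(base, "total_bytes")))
    return report

def nearly_full(report: dict, threshold: float = 0.9) -> list:
    """Return block-group types whose usage ratio exceeds the threshold."""
    return [kind for kind, (used, total) in report.items()
            if total and used / total > threshold]
```

A monitoring script would call `nearly_full(allocation_report(uuid))` per filesystem and mail any non-empty result; note this covers allocation pressure only, not device error counters.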
Re: Status of RAID5/6
I am on 4.15.5 :) Yes, I agree that journaling is better on the same array, but it should still be unit-failure tolerant, so maybe it should go in a RAID1 scheme. Will a raid56 array built with an older kernel be compatible with the new forthcoming code? Bye 2018-03-21 18:24 GMT+01:00 Liu Bo <obuil.li...@gmail.com>: > On Wed, Mar 21, 2018 at 9:50 AM, Menion <men...@gmail.com> wrote: >> Hi all >> I am trying to understand the status of RAID5/6 in BTRFS >> I know that there are some discussion ongoing on the RFC patch >> proposed by Liu bo >> But it seems that everything stopped last summary. Also it mentioned >> about a "separate disk for journal", does it mean that the final >> implementation of RAID5/6 will require a dedicated HDD for the >> journaling? > > Thanks for the interest on btrfs and raid56. > > The patch set is to plug write hole, which is very rare in practice, tbh. > The feedback is to use existing space instead of another dedicate > "fast device" as the journal in order to get some extent of raid > protection. I'd need some time to pick it up. > > With that being said, we have several data reconstruction fixes for > raid56 (esp. raid6) in 4.15, I'd say please deploy btrfs with the > upstream kernel or some distros which do kernel updates frequently, > the most important one is > > 8810f7517a3b Btrfs: make raid6 rebuild retry more > https://patchwork.kernel.org/patch/10091755/ > > AFAIK, no other data corruptions showed up. > > thanks, > liubo
Re: Status of RAID5/6
Hey. Some things would IMO be nice to get done/clarified (i.e. documented in the Wiki and manpages) from users'/admin's POV: Some basic questions: - Starting with which kernels (including stable kernel versions) does it contain the fixes for the bigger issues from some time ago? - Exactly what does not work yet (only the write hole?)? What's the roadmap for such non-working things? - Ideally some explicit confirmations of what's considered to work, like: - compression+raid? - rebuild / replace of devices? - changing raid lvls? - repairing data (i.e. picking the right block according to csums in case of silent data corruption)? - scrub (and scrub+repair)? - anything to consider with raid when doing snapshots, send/receive or defrag? => and for each of these: for which raid levels? Perhaps also confirmation for previous issues: - I vaguely remember there were issues with either device delete or replace and that one of them was possibly super-slow? - I also remember there were cases in which a fs could end up in permanent read-only state? - Clarifying questions on what is expected to work and how things are expected to behave, e.g.: - Can one plug a device (without deleting/removing it first) just under operation and will btrfs survive it? - If an error is found (e.g. silent data corruption based on csums), when will it repair (fix = write the repaired data) the data? On the read that finds the bad data? Only on scrub (i.e. do users need to regularly run scrubs)? - What happens if error cannot be repaired, e.g. no csum information or all blocks bad? EIO? Or are there cases where it gives no EIO (I guess at least in nodatacow case) - What happens if data cannot be fixed (i.e. trying to write the repaired block again fails)? And if the repaired block is written, will it be immediately checked again (to find cases of blocks that give different results again)? - Will a scrub check only the data on "one" device... 
or will it check all the copies (or parity blocks) on all devices in the raid? - Does a fsck check all devices or just one? - Does a balance implicitly contain a scrub? - If a rebuild/repair/reshape is performed... can these be interrupted? What if they are forcibly interrupted (power loss)? - Explaining common workflows: - Replacing a faulty or simply an old disk. How to stop btrfs from using a device (without bricking the fs)? How to do the rebuild. - Best practices, like: should one do regular balances (and if so, as asked above, do these include the scrubs, so basically: is it enough to do one of them) - How to grow/shrink raid btrfs... and if this is done... how to replicate the data already on the fs to the newly added disks (or is this done automatically - and if so, how to see that it's finished)? - What will actually trigger repairs? (i.e. one wants to get silent block errors fixed ASAP and not only when the data is read - and when it's possibly too late) - In the rebuild/repair phase (e.g. one replaces a device): Can one somehow give priority to the rebuild/repair? (e.g. in case of a degraded raid, one may want to get that solved ASAP and rather slow down other reads or stop them completely.) - Is there anything to notice when btrfs raid is placed above dm-crypt from a security PoV? With MD raid that wasn't much of a problem as it's typically placed below dm-crypt... but btrfs raid would need to be placed above it. So maybe there are some known attacks against crypto modes, if equal (RAID 1 / 10) or similar/equal (RAID 5/6) data is written above multiple crypto devices? (Probably something one would need to ask their experts). - Maintenance tools - How to get the status of the RAID? (Querying kernel logs is IMO rather a bad way for this) This includes: - Is the raid degraded or not? - Are scrubs/repairs/rebuilds/reshapes in progress and how far are they?
(Reshape would be: if the raid level is changed or the raid grown/shrunk: has all data been replicated enough to be "complete" for the desired raid lvl/number of devices/size?) - What should one regularly do? scrubs? balance? How often? Do we get any automatic (but configurable) tools for this? - There should be support in commonly used tools, e.g. Icinga/Nagios check_raid - Ideally there should also be some desktop notification tool, which tells about raid (and btrfs errors in general) as small installations with raids typically run no Icinga/Nagios but rely on e.g. email or gui notifications. I think especially for such tools it's important that these are maintained by upstream (and yes I know you guys are rather fs developers, not tool developers)... but since these tools are so vital, having them done 3rd party can easily lead to the situation where something changes in
Re: Status of RAID5/6
On Wed, Mar 21, 2018 at 9:50 AM, Menion <men...@gmail.com> wrote:
> Hi all
> I am trying to understand the status of RAID5/6 in BTRFS.
> I know that there is some discussion ongoing on the RFC patch
> proposed by Liu Bo, but it seems that everything stopped last summer.
> It also mentioned a "separate disk for journal"; does this mean that
> the final implementation of RAID5/6 will require a dedicated HDD for
> the journaling?

Thanks for the interest in btrfs and raid56.

The patch set is to plug the write hole, which is very rare in
practice, tbh. The feedback was to use existing space instead of
another dedicated "fast device" as the journal, in order to get some
extent of raid protection. I'd need some time to pick it up.

With that being said, we have several data reconstruction fixes for
raid56 (esp. raid6) in 4.15, so I'd say please deploy btrfs with the
upstream kernel or some distro which does kernel updates frequently.
The most important one is

8810f7517a3b Btrfs: make raid6 rebuild retry more
https://patchwork.kernel.org/patch/10091755/

AFAIK, no other data corruptions have showed up.

thanks,
liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
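[Editorial note: for readers unfamiliar with the write hole this patch set plugs: with parity raid, updating one data block requires updating the parity block too. If power fails between those two writes, parity no longer matches the data, and a later device loss reconstructs garbage. A toy raid5-style XOR illustration in Python, with made-up block values:]

```python
# Toy raid5 stripe: two data blocks plus an XOR parity block.
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

d0, d1 = b"\x0a" * 4, b"\x05" * 4
parity = xor(d0, d1)                 # consistent stripe on "disk"

# While parity matches the data, reconstruction works:
assert xor(d1, parity) == d0

# RMW update of d0, interrupted before parity is rewritten (write hole):
d0_new = b"\xff" * 4                 # new data block hits its disk...
# ...power loss here: parity still describes the OLD d0.

# Now lose the disk holding d1 and rebuild it from d0_new + stale parity:
d1_rebuilt = xor(d0_new, parity)
assert d1_rebuilt != d1              # silently reconstructs garbage
```

This is why the journal (or, as discussed upthread, never doing RMW at all) matters: the data+parity pair must be made atomic somehow.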
Status of RAID5/6
Hi all
I am trying to understand the status of RAID5/6 in BTRFS.
I know that there is some discussion ongoing on the RFC patch proposed
by Liu Bo, but it seems that everything stopped last summer. It also
mentioned a "separate disk for journal"; does this mean that the final
implementation of RAID5/6 will require a dedicated HDD for the
journaling?
Bye
Status of raid5/6 in 2014?
Back in Feb 2013 there was quite a bit of press about the preliminary
raid5/6 implementation in Btrfs. At the time it wasn't useful for
anything other than testing, and it's my understanding that this is
still the case. I've seen a few git commits and some chatter on this
list, but it would appear the developers are largely silent.

Parity-based raid would be a powerful addition to the Btrfs feature
stack, and it's the feature I most anxiously await. Are there any
milestones planned for 2014?

Keep up the good work...
--
-=[dave]=-
Entropy isn't what it used to be.
Re: Status of raid5/6 in 2014?
I personally consider proper RAID6 support, with graceful,
non-intrusive handling of failing drives and a proper warning
mechanism, the most important missing feature of btrfs, and I know
this view is shared by many others with software RAID based storage
systems, currently limited by the existing choices on Linux.

But having been a (naughty) user of btrfs the last few months, I fully
understand that there are important bugs, performance fixes and issues
in the existing state of btrfs that need more immediate attention, as
they affect the currently installed base. I will however stress that
the faster the functionality gets implemented, the sooner users like
myself can begin using it and reporting issues, and hence the sooner
btrfs gets ready for enterprise usage and general deployment.

Regards,
Hans-Kristian Bakke

On 3 January 2014 17:45, Dave <d...@thekilempire.com> wrote:
> Back in Feb 2013 there was quite a bit of press about the preliminary
> raid5/6 implementation in Btrfs. At the time it wasn't useful for
> anything other than testing and it's my understanding that this is
> still the case. I've seen a few git commits and some chatter on this
> list but it would appear the developers are largely silent.
> Parity based raid would be a powerful addition to the Btrfs feature
> stack and it's the feature I most anxiously await. Are there any
> milestones planned for 2014? Keep up the good work...
> --
> -=[dave]=-
> Entropy isn't what it used to be.