Re: RAID system with adaption to changed number of disks
On Fri, Oct 14, 2016 at 04:30:42PM -0600, Chris Murphy wrote:
> Also, is there RMW with raid0, or even raid10?

No. Mirroring is writing the same data in two isolated places. Striping is writing data at different isolated places. No matter which sectors you write through these layers, it does not affect the correctness of data in any sector at a different logical address. None of these use RMW--you read or write only complete sectors and act only on the specific sectors requested. Only parity RAID does RMW.

e.g. in RAID0, when you modify block 47, you may actually modify block 93 on a different disk, but there's always a 1:1 mapping between every logical and physical address. If there is a crash we go back to an earlier tree that does not contain block 47/93, so we don't care if the write was interrupted.

e.g. in RAID1, when you modify block 47, you modify physical block 47 on two separate disks. The state of disk1-block47 may differ from the state of disk2-block47 if the write is interrupted. If there is a crash we go back to an earlier tree that does not contain either copy of block 47, so we don't care about any inconsistency there.

So raid0, single, dup, raid1, and raid10 are OK--they fall into one or both of the above cases. CoW works there. None of these properties change in degraded mode with the mirroring profiles.

Parity RAID is writing data in non-isolated places. When you write to some sectors, additional sectors are implicitly modified in degraded mode (whether you are in degraded mode at the time of the writes or not). This is different from the other cases because the other cases never modify any sectors that were not explicitly requested by the upper layer. This is OK if and only if the CoW layer is aware of this behavior and works around it.

> Or is that always CoW
> for metadata and data, just like single and dup?

It's always CoW at the higher levels, even for parity RAID.
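The distinction can be sketched in a few lines of Python (a toy model with made-up layouts, not btrfs code): striping remaps addresses 1:1 and touches exactly the sector asked for, while a parity-RAID update necessarily rewrites a second, unrequested sector.

```python
# Toy model (hypothetical layout, 4 data disks + 1 parity) of why parity
# RAID needs RMW while striping does not.

def raid0_write(disks, logical, block):
    # Striping: a 1:1 logical->physical mapping; exactly one sector changes.
    disks[logical % len(disks)][logical // len(disks)] = block

def raid5_write(disks, parity_disk, stripe, col, block):
    # Parity RAID: updating one data block implicitly rewrites the parity
    # block too -- read old data + old parity, modify, write (RMW).
    old = disks[col][stripe]
    disks[col][stripe] = block
    disks[parity_disk][stripe] ^= old ^ block   # second sector modified

disks = [[0] * 8 for _ in range(5)]
raid5_write(disks, parity_disk=4, stripe=0, col=1, block=0b1010)
# The write touched disk 1 *and* disk 4 (parity):
assert disks[1][0] == 0b1010 and disks[4][0] == 0b1010

logical_map = [[0] * 8 for _ in range(5)]
raid0_write(logical_map, 7, 5)
# Exactly one sector changed anywhere in the raid0 array:
assert sum(b != 0 for d in logical_map for b in d) == 1
```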
The problem is that the CoW layer is not aware of the RMW behavior buried in the parity RAID layer, so the combination doesn't work properly. CoW thinks it's modifying only block 47, when in fact it's modifying an entire stripe in degraded mode. Let's assume 5-disk RAID5 with a strip size of one block for this example, and say blocks 45-48 are one RAID stripe. If there is a crash, data in blocks 45, 46, 47, and 48 may be irretrievably damaged by inconsistent modification of parity and data blocks. When we try to go back to an earlier tree that does not contain block 47, we will end up with a tree that contains corruption in one of the blocks 45, 46, or 48. This corruption will only be visible when something else goes wrong (parity mismatch, data csum failure, disk failure, or scrub), so a damaged filesystem that isn't degraded could appear to be healthy for a long time.

If the CoW layer is aware of this, it can arrange operations such that no stripe is modified while it is referenced by a committed tree. Suppose the stripe at blocks 49-52 is empty, so we write our CoW block at block 49 instead of 47. Since blocks 50-52 contain no data we care about, we don't even have to bother reading them (just fill the other blocks with zero or find some other data to write in the same commit), and we can eliminate many slow RMW operations entirely*. If there is a crash we just fall back to an earlier tree that does not contain block 49. This tree is not damaged because we left blocks 45-48 alone. One way to tell this is done right: all data in each RAID stripe will always belong to exactly zero or one transaction, not dozens of different transactions as stripes have now.

The other way to fix things is to make stripe RMW atomic so that CoW works properly. You can tell this is done right if you can find a stripe update journal in the disk format or the code.
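The crash scenario above can be simulated directly (a toy model with one-block strips and integers standing in for blocks): an interrupted RMW of block 47 leaves parity stale, and once a disk dies, a committed block we never touched reconstructs as garbage.

```python
# Toy model of the write hole: blocks 45-48 plus parity form one committed
# RAID5 stripe. A crash lands between the data write and the parity write
# of an RMW update to block 47.
data = [45, 46, 47, 48]                        # committed stripe contents
parity = data[0] ^ data[1] ^ data[2] ^ data[3]

data[2] = 99        # new block 47 reaches disk...
                    # ...crash: the matching parity update is lost (stale parity)

# Later, the disk holding block 45 dies; reconstruct it from parity:
reconstructed_45 = parity ^ data[1] ^ data[2] ^ data[3]
assert reconstructed_45 != 45   # committed data we never wrote to is gone
```

Note that rolling back to the earlier tree doesn't help: block 45 belongs to that earlier tree, and it is the block that was silently damaged.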
> If raid0 is always
> CoW, then I don't think it's correct to consider raid5 minus parity to
> be anything like raid0 - in a Btrfs context anyway. Outside of that
> context, I understand the argument.
>
> --
> Chris Murphy

[*] We'd still need parity RAID RMW for nodatacow and PREALLOC because neither uses the CoW layer. That doesn't matter for nodatacow because nodatacow is how users tell us they don't want to read their data any more, but it has interesting implications for PREALLOC. Maybe a solution for PREALLOC is to do the first write strictly in RAID-stripe-sized units?
Re: RAID system with adaption to changed number of disks
On Fri, Oct 14, 2016 at 3:38 PM, Chris Murphy wrote:
> On Fri, Oct 14, 2016 at 1:55 PM, Zygo Blaxell wrote:
>>
>>> And how common is RMW for metadata operations?
>>
>> RMW in metadata is the norm. It happens on nearly all commits--the only
>> exception seems to be when both ends of a commit write happen to land
>> on stripe boundaries accidentally, which is less than 1% of the time on
>> 3 disks.
>
> In the interest of due diligence, and the fact I can't confirm or deny
> this myself from reading the code (although I do see many comments
> involving RMW in the code), I must ask Qu if he can corroborate this.
>
> Because that basically means btrfs raid56 is not better than md raid56 -
> by design. It has nothing to do with bugs. This is substantially worse
> than the scrub->wrong parity bug.
>
> Does it make sense to proscribe raid5 profile for metadata? As in,
> disallow -m raid5 at mkfs time? Maybe recommend raid1. Even raid6
> seems like it could be specious - yes there are two copies but if
> there is constant RMW, then there's no CoW and we're not really
> protected that well with all of these overwrites, statistically
> speaking.
>
> Basically you have to have a setup where there's no chance of torn or
> misdirected writes, and no corruptions, in which case Btrfs checksums
> aren't really helpful, you're using it for other reasons (snapshots
> and what not).
>
> Really seriously, the CoW part of Btrfs being violated by all of this
> RMW to me sounds like it reduces the pros of Btrfs.

Also, is there RMW with raid0, or even raid10? Or is that always CoW for metadata and data, just like single and dup? If raid0 is always CoW, then I don't think it's correct to consider raid5 minus parity to be anything like raid0 - in a Btrfs context anyway. Outside of that context, I understand the argument.
--
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID system with adaption to changed number of disks
On Fri, Oct 14, 2016 at 1:55 PM, Zygo Blaxell wrote:
>
>> And how common is RMW for metadata operations?
>
> RMW in metadata is the norm. It happens on nearly all commits--the only
> exception seems to be when both ends of a commit write happen to land
> on stripe boundaries accidentally, which is less than 1% of the time on
> 3 disks.

In the interest of due diligence, and the fact I can't confirm or deny this myself from reading the code (although I do see many comments involving RMW in the code), I must ask Qu if he can corroborate this.

Because that basically means btrfs raid56 is not better than md raid56 - by design. It has nothing to do with bugs. This is substantially worse than the scrub->wrong parity bug.

Does it make sense to proscribe raid5 profile for metadata? As in, disallow -m raid5 at mkfs time? Maybe recommend raid1. Even raid6 seems like it could be specious - yes there are two copies but if there is constant RMW, then there's no CoW and we're not really protected that well with all of these overwrites, statistically speaking.

Basically you have to have a setup where there's no chance of torn or misdirected writes, and no corruptions, in which case Btrfs checksums aren't really helpful, you're using it for other reasons (snapshots and what not).

Really seriously, the CoW part of Btrfs being violated by all of this RMW to me sounds like it reduces the pros of Btrfs.

--
Chris Murphy
Re: RAID system with adaption to changed number of disks
Zygo Blaxell posted on Fri, 14 Oct 2016 15:55:45 -0400 as excerpted:

> The current btrfs raid5 implementation is a thin layer of bugs on top of
> code that is still missing critical pieces. There is no mechanism to
> prevent RMW-related failures combined with zero tolerance for
> RMW-related failures in metadata, so I expect a btrfs filesystem using
> raid5 metadata to be extremely fragile. Failure is not likely--it's
> *inevitable*.

Wow, that's a signature-quality quote reflecting just how dire the situation with btrfs parity-raid is ATM. First sentence for a short sig, full paragraph for a longer one.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: RAID system with adaption to changed number of disks
On Fri, Oct 14, 2016 at 01:16:05AM -0600, Chris Murphy wrote:
> OK so we know for raid5 data block groups there can be RMW. And
> because of that, any interruption results in the write hole. On Btrfs
> though, the write hole is on disk only. If there's a lost strip
> (failed drive or UNC read), reconstruction from wrong parity results
> in a checksum error and EIO. That's good.
>
> However, what happens in the metadata case? If metadata is raid5, and
> there's a crash or power failure during metadata RMW, same problem:
> wrong parity, bad reconstruction, csum mismatch, and EIO. So what's
> the effect of EIO when reading metadata?

The effect is you can't access the page or anything referenced by the page. If the page happens to be a root or interior node of something important, large parts of the filesystem are inaccessible, or the filesystem is not mountable at all. RAID device management and balance operations don't work because they abort as soon as they find the first unreadable metadata page.

In theory it's still possible to rebuild parts of the filesystem offline using backrefs or brute-force search. Using an old root might work too, but in bad cases the newest viable root could be thousands of generations old (i.e. it's more likely that no viable root exists at all).

> And how common is RMW for metadata operations?

RMW in metadata is the norm. It happens on nearly all commits--the only exception seems to be when both ends of a commit write happen to land on stripe boundaries accidentally, which is less than 1% of the time on 3 disks.

> I wonder if this is where all of these damn strange cases come from,
> where people can't do anything at all with a normally degraded raid5 -
> one device failed, and no other failures, but they can't mount due to a
> bunch of csum errors.

I'm *astonished* to hear about real-world successes with raid5 metadata. The total-loss failure reports are the result I expect.
The current btrfs raid5 implementation is a thin layer of bugs on top of code that is still missing critical pieces. There is no mechanism to prevent RMW-related failures combined with zero tolerance for RMW-related failures in metadata, so I expect a btrfs filesystem using raid5 metadata to be extremely fragile. Failure is not likely--it's *inevitable*.

The non-RMW-aware allocator almost maximizes the risk of RMW data loss. Every transaction commit contains multiple tree root pages, which are the most critical metadata that could be lost due to RMW failure. There is a window at least a few milliseconds wide, and potentially several seconds wide, where some data on disk is in an unrecoverable state due to RMW. This happens twice a minute with the default commit interval, and 99% of commits are affected. That's a million opportunities per machine-year to lose metadata. If a crash lands on one of those, boom, no more filesystem.

I expect one random crash (i.e. a crash that is not strongly correlated to RMW activity) out of 30-2000 (depending on filesystem size, workload, rotation speed, btrfs mount parameters) will destroy a filesystem under typical conditions. Real-world crashes tend not to be random (i.e. they are strongly correlated to RMW activity), so filesystem loss will be much more frequent in practice.
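The "million opportunities per machine-year" figure is easy to sanity-check (toy arithmetic assuming the default 30-second commit interval, not numbers from the btrfs source):

```python
# Sanity check of the figures above: two commits per minute (default
# 30-second commit interval), 99% of them containing an RMW window.
commits_per_year = 2 * 60 * 24 * 365          # commits per machine-year
rmw_commits = commits_per_year * 99 // 100    # commits with a vulnerable window
assert commits_per_year == 1_051_200          # ~a million opportunities
assert rmw_commits == 1_040_688
```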
Re: RAID system with adaption to changed number of disks
OK so we know for raid5 data block groups there can be RMW. And because of that, any interruption results in the write hole. On Btrfs though, the write hole is on disk only. If there's a lost strip (failed drive or UNC read), reconstruction from wrong parity results in a checksum error and EIO. That's good.

However, what happens in the metadata case? If metadata is raid5, and there's a crash or power failure during metadata RMW, same problem: wrong parity, bad reconstruction, csum mismatch, and EIO. So what's the effect of EIO when reading metadata?

And how common is RMW for metadata operations?

I wonder if this is where all of these damn strange cases come from, where people can't do anything at all with a normally degraded raid5 - one device failed, and no other failures, but they can't mount due to a bunch of csum errors.

Chris Murphy
Re: RAID system with adaption to changed number of disks
At 10/14/2016 05:03 AM, Zygo Blaxell wrote:
> On Thu, Oct 13, 2016 at 08:35:02AM +0800, Qu Wenruo wrote:
>> At 10/13/2016 01:19 AM, Zygo Blaxell wrote:
>>> On Wed, Oct 12, 2016 at 01:48:58PM +0800, Qu Wenruo wrote:
>>>> True, but if we ignore parity, we'd find that, RAID5 is just RAID0.
>>>
>>> Degraded RAID5 is not RAID0. RAID5 has strict constraints that RAID0
>>> does not. The way a RAID5 implementation behaves in degraded mode is
>>> the thing that usually matters after a disk fails.
>>>
>>>> COW ensures (cowed) data and metadata are all safe and checksum will
>>>> ensure they are OK, so even for RAID0, it's not a problem for case
>>>> like power loss.
>>>
>>> This is not true. btrfs does not use stripes correctly to get CoW to
>>> work on RAID5/6. This is why power failures result in small amounts of
>>> data loss, if not filesystem-destroying disaster.
>>
>> See my below comments.
>>
>> And, I already said, forget parity.
>> In that case, RAID5 without parity is just RAID0 with device rotation.
>
> This is only true in one direction. If you start with RAID0, add parity,
> and rotate the blocks on the devices, you get RAID5. Each individual
> non-parity block is independent of every other block on every other disk.
>
> If you start with RAID5 and remove one device, the result is *not* RAID0.
> Each individual block is now entangled with N other blocks on all the
> other disks. On RAID0 there's no parity. On RAID5 with no failed devices
> parity is irrelevant. On RAID5 with a failed device, parity touches
> *all* data.

I understand all this. But the point is, RAID5 should never reconstruct wrong/corrupted data or parity. It should either reconstruct a good copy, or recover nothing.

So RAID5 should be:
1) RAID0 if nothing goes wrong (with RMW overhead)
2) A little higher chance (not always 100%) to recover one missing device.

>>> For CoW to work you have to make sure that you never modify a RAID
>>> stripe that already contains committed data.
>>> Let's consider a 5-disk array and look at what we get when we try to
>>> reconstruct disk 2:
>>>
>>>         Disk1   Disk2   Disk3   Disk4   Disk5
>>>         Data1   Data2   Parity  Data3   Data4
>>>
>>> Suppose one transaction writes Data1-Data4 and Parity. This is OK
>>> because no metadata reference would point to this stripe before it was
>>> committed to disk. Here's some data as an example:
>>>
>>>         Disk1   Disk2   Disk3   Disk4   Disk5   Reconstructed Disk2
>>
>> Why do the d*mn reconstruction without checking csum?
>
> If a disk fails, we need stripe reconstruction to rebuild the data before
> we can verify its csum. There is no disk to read the data from directly.

NOOO! Never recover anything without checking csum. And that's the problem of the current kernel scrub.

The root cause may be the non-atomic full stripe write, but the silent data corruption is what we should avoid.

We can read out all existing data stripes and parity into memory, and try to recover the missing device. If the recovered part (or existing data stripes) mismatches csum, then there is nothing we can recover reliably. If data is not reliable, there is no meaning in recovering it. Wrong data is never better than no data.

>> My strategy is clear enough.
>> Never trust parity unless all data and reconstructed data matches csum.
>>
>> Csum is more precious than the unreliable parity.
>
> We always have csum to verify at the end (and if we aren't verifying it
> at the end, that's a bug). It doesn't help the parity to be more
> reliable.

It's a bug: we didn't verify csum before writing recovered data stripes into disk, and it even writes wrong data into correct data stripes. That's what all these RAID5/6 kernel scrub reports are about.

>> So, please forget csum first, just consider it as RAID0, and add parity
>> back when all csum matches with each other.
>
> I can't reconstruct parity in the degraded RAID5 write case. That only
> works in the scrub case.

Solve the normal case first, then the more complex case. If btrfs RAID5/6 scrub can't even handle the normal case, no need to consider the recovery case.
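Qu's "never trust parity unless everything matches csum" rule can be sketched like this (a toy model with hypothetical helper names, CRC32 standing in for the btrfs csum, and XOR parity; not the kernel implementation):

```python
# Csum-gated reconstruction: rebuild the missing block from parity in
# memory, but accept the result only if every block in the stripe matches
# its checksum. Otherwise recover nothing.
import zlib
from functools import reduce

def reconstruct(stripe, missing, parity, csums):
    """stripe: dict col -> bytes (missing col absent); returns bytes or None."""
    candidate = reduce(
        lambda a, b: bytes(x ^ y for x, y in zip(a, b)),
        list(stripe.values()) + [parity])
    rebuilt = {**stripe, missing: candidate}
    # Never trust parity unless all data (and the rebuilt block) match csum.
    if all(zlib.crc32(rebuilt[c]) == csums[c] for c in rebuilt):
        return candidate
    return None                    # wrong data is never better than no data

blocks = {0: b"aa", 1: b"bb", 3: b"dd"}        # col 2 is on the dead disk
parity = bytes(a ^ b ^ c ^ d for a, b, c, d in
               zip(b"aa", b"bb", b"cc", b"dd"))
csums = {c: zlib.crc32(v) for c, v in blocks.items()}
csums[2] = zlib.crc32(b"cc")
assert reconstruct(blocks, 2, parity, csums) == b"cc"

# A stale parity (the write hole) makes the rebuilt block fail its csum,
# so we refuse to "recover" garbage:
assert reconstruct(blocks, 2, bytes(2), csums) is None
```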
> Even with all disks present on RAID5, parity gets corrupted during
> writes. The corruption is hidden by the fact that we can ignore parity
> and use the data blocks instead, but it is revealed when one of the
> disks is missing or has a csum failure.

For the device missing case, try to recover in memory, and re-check the existing data stripes and the recovered stripe against csum. If any mismatches, that full stripe is just screwed up.

For the csum mismatch case, it is the same. It just lowers the possibility of recovering one device from 100% to something lower, depending on how many screwed-up parities there are. But we should never recover anything wrong.

>>> (to keep things simpler I'm just using Parity = Data1 ^ Data2 ^ Data4 ^
>>> Data5 here)
>>>
>>> Later, a transaction deletes Data3 and Data 4. Still OK, because we
>>> didn't modify any data in the stripe, so we may still be able to
>>> reconstruct the data from missing disks. The checksums for Data4 and
>>> Data5 are missing, so if there is any bitrot we lose the whole stripe
>>> (we
Re: RAID system with adaption to changed number of disks
On Thu, Oct 13, 2016 at 08:35:02AM +0800, Qu Wenruo wrote:
> At 10/13/2016 01:19 AM, Zygo Blaxell wrote:
> >On Wed, Oct 12, 2016 at 01:48:58PM +0800, Qu Wenruo wrote:
> >>True, but if we ignore parity, we'd find that, RAID5 is just RAID0.
> >
> >Degraded RAID5 is not RAID0. RAID5 has strict constraints that RAID0
> >does not. The way a RAID5 implementation behaves in degraded mode is
> >the thing that usually matters after a disk fails.
> >
> >>COW ensures (cowed) data and metadata are all safe and checksum will ensure
> >>they are OK, so even for RAID0, it's not a problem for case like power loss.
> >
> >This is not true. btrfs does not use stripes correctly to get CoW to
> >work on RAID5/6. This is why power failures result in small amounts of
> >data loss, if not filesystem-destroying disaster.
>
> See my below comments.
>
> And, I already said, forget parity.
> In that case, RAID5 without parity is just RAID0 with device rotation.

This is only true in one direction. If you start with RAID0, add parity, and rotate the blocks on the devices, you get RAID5. Each individual non-parity block is independent of every other block on every other disk.

If you start with RAID5 and remove one device, the result is *not* RAID0. Each individual block is now entangled with N other blocks on all the other disks. On RAID0 there's no parity. On RAID5 with no failed devices parity is irrelevant. On RAID5 with a failed device, parity touches *all* data.

> >For CoW to work you have to make sure that you never modify a RAID stripe
> >that already contains committed data. Let's consider a 5-disk array
> >and look at what we get when we try to reconstruct disk 2:
> >
> >        Disk1   Disk2   Disk3   Disk4   Disk5
> >        Data1   Data2   Parity  Data3   Data4
> >
> >Suppose one transaction writes Data1-Data4 and Parity. This is OK
> >because no metadata reference would point to this stripe before it
> >was committed to disk.
> >Here's some data as an example:
> >
> >        Disk1   Disk2   Disk3   Disk4   Disk5   Reconstructed Disk2
>
> Why do the d*mn reconstruction without checking csum?

If a disk fails, we need stripe reconstruction to rebuild the data before we can verify its csum. There is no disk to read the data from directly.

> My strategy is clear enough.
> Never trust parity unless all data and reconstructed data matches csum.
>
> Csum is more precious than the unreliable parity.

We always have csum to verify at the end (and if we aren't verifying it at the end, that's a bug). It doesn't help the parity to be more reliable.

> So, please forget csum first, just consider it as RAID0, and add parity back
> when all csum matches with each other.

I can't reconstruct parity in the degraded RAID5 write case. That only works in the scrub case.

Even with all disks present on RAID5, parity gets corrupted during writes. The corruption is hidden by the fact that we can ignore parity and use the data blocks instead, but it is revealed when one of the disks is missing or has a csum failure.

> >(to keep things simpler I'm just using Parity = Data1 ^ Data2 ^ Data4 ^
> >Data5 here)
> >
> >Later, a transaction deletes Data3 and Data 4. Still OK, because
> >we didn't modify any data in the stripe, so we may still be able to
> >reconstruct the data from missing disks. The checksums for Data4 and
> >Data5 are missing, so if there is any bitrot we lose the whole stripe
> >(we can't tell whether the data is wrong or parity, we can't ignore the
> >rotted data because it's included in the parity, and we didn't update
> >the parity because deleting an extent doesn't modify its data stripe).
> >
> >        Disk1   Disk2   Disk3   Disk4   Disk5   Reconstructed Disk2
>
> So data stripes are , , , .
> If trans committed, then csum and extents of , is also deleted.
> If trans not committed, , csum and extents exists.
>
> Any way, if we check data stripes against their csum, they should match.

Let's assume they do.
If any one of the csums are wrong and all the disks are online, we need correct parity to reconstruct the data blocks with bad csums. This imposes a requirement that we keep parity correct!

If any of the csums are wrong *and* a disk is missing, the affected data blocks are irretrievable because there is no redundant data to reconstruct them. Since that case isn't very interesting, let's only consider what happens with no csum failures anywhere (only disk failures).

> Either way, we know all data stripes matches their csum, that's enough.
> No matter parity matches or not, it's just rubbish.
> Re-calculate it using scrub.

When one of the disks is missing, we must reconstruct from parity. At this point we still can, because the stripe isn't modified when we delete extents within it.

> >Now a third transaction allocates Data3 and Data 4. Bad. First, Disk4
> >is written and existing data is
Re: RAID system with adaption to changed number of disks
On Wed, Oct 12, 2016 at 05:10:18PM -0400, Zygo Blaxell wrote:
> On Wed, Oct 12, 2016 at 09:55:28PM +0200, Adam Borowski wrote:
> > On Wed, Oct 12, 2016 at 01:19:37PM -0400, Zygo Blaxell wrote:
> > > I had been thinking that we could inject "plug" extents to fill up
> > > RAID5 stripes.
> > Your idea sounds good, but there's one problem: most real users don't
> > balance. Ever. Contrary to the tribal wisdom here, this actually works
> > fine, unless you had a pathologic load skewed to either data or metadata on
> > the first write then fill the disk to near-capacity with a load skewed the
> > other way.
> > Most usage patterns produce a mix of transient and persistent data (and at
> > write time you don't know which file is which), meaning that with time every
> > stripe will contain a smidge of cold data plus a fill of plug extents.
>
> Yes, it'll certainly reduce storage efficiency. I think all the
> RMW-avoidance strategies have this problem. The alternative is to risk
> losing data or the entire filesystem on disk failure, so any of the
> RMW-avoidance strategies are probably a worthwhile tradeoff. Big RAID5/6
> arrays tend to be used mostly for storing large sequentially-accessed
> files which are less susceptible to this kind of problem.
>
> If the pattern is lots of small random writes then performance on raid5
> will be terrible anyway (though it may even be improved by using plug
> extents, since RMW stripe updates would be replaced with pure CoW).

I've looked at some simple scenarios, and it appears that, with your scheme, the total amount of I/O would increase, but it would not hinder performance as increases happen only when the disk would be otherwise idle. There's also a latency win and a fragmentation win -- all while fixing the write hole!

Let's assume leaf size 16KB, stripe size 64KB. The disk has four stripes, each 75% full, 25% deleted. '*' marks cold data, '.' deleted/plug space, 'x' new data. I'm not drawing entirely empty stripes.

***.
***.
***.
***.

The user wants to write 64KB of data. RMW needs to read 12 leafs, write 16, no matter if the data comes in one commit or four.

***x
***x
***x
***x

Latency 28 (big commit) / 7 per commit (small commits), total I/O 28.

The plug extents scheme requires compaction (partial balance): I/O so far 24.

Big commit: latency 4, total I/O 28. If we had to compact on-demand, the latency is 28 (assuming we can do stripe-sized balance).

Small commits, no concurrent writes:

x...
x...
x...
x...

Latency 1 per commit, I/O so far 28, need another compaction: total I/O 32.

Small I/O, concurrent writes that peg the disk:

xyyy
xyyy
xyyy
xyyy

Total I/O 28 (not counting concurrent writes).

Other scenarios I've analyzed give similar results. I'm not sure if my thinking is correct, but if it is, the outcome is quite surprising: no performance loss even though we had to rewrite the stripes!

> > Thus, while the plug extents idea doesn't suffer from problems of big
> > sectors you just mentioned, we'd need some kind of auto-balance.
>
> Another way to approach the problem is to relocate the blocks in
> partially filled RMW stripes so they can be effectively CoW stripes;
> however, the requirement to do full extent relocations leads to some
> nasty write amplification and performance ramifications. Balance is
> hugely heavy I/O load and there are good reasons not to incur it at
> unexpected times.

We don't need balance in the btrfs sense; it's enough to compact stripes -- i.e., something akin to balance except done at stripe level rather than allocation block level.

As for write amplification, the F2FS guys solved the issue by having two types of cleaning (balancing):
* on demand (when there is no free space and thus it needs to be done NOW)
* in the background (done only on cold data)

The on-demand clean goes for the juiciest targets first (least data/stripe); the background clean, on the other hand, uses a formula that takes into account both the amount of space to reclaim and the age of the stripe.
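The RMW tally above can be reproduced with a toy model (my assumption of the counting: each 64KB stripe holds four 16KB leaves, three of them cold, and an RMW update reads the cold leaves then rewrites the whole stripe; this is one way to arrive at the stated numbers, not necessarily how the original count was made):

```python
# Toy model of the four-stripe scenario: each 64KB stripe = four 16KB
# leaves, three cold ('*') and one free ('.'). Writing one new leaf per
# stripe via RMW reads the cold leaves and rewrites the full stripe.
STRIPES, LEAVES_PER_STRIPE, COLD = 4, 4, 3

reads = STRIPES * COLD                  # 12 leaf reads
writes = STRIPES * LEAVES_PER_STRIPE    # 16 leaf writes (full-stripe rewrite)
assert (reads, writes, reads + writes) == (12, 16, 28)
```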
If the data is hot, it shouldn't be cleaned yet -- it's likely to be deleted/modified soon.

Meow!
--
A MAP07 (Dead Simple) raspberry tincture recipe: 0.5l 95% alcohol, 1kg
raspberries, 0.4kg sugar; put into a big jar for 1 month. Filter out and
throw away the fruits (can dump them into a cake, etc), let the drink age
at least 3-6 months.
Re: RAID system with adaption to changed number of disks
At 10/13/2016 01:19 AM, Zygo Blaxell wrote:
> On Wed, Oct 12, 2016 at 01:48:58PM +0800, Qu Wenruo wrote:
>>> btrfs also doesn't avoid the raid5 write hole properly. After a crash,
>>> a btrfs filesystem (like mdadm raid[56]) _must_ be scrubbed (resynced)
>>> to reconstruct any parity that was damaged by an incomplete data
>>> stripe update. As long as all disks are working, the parity can be
>>> reconstructed from the data disks. If a disk fails prior to the
>>> completion of the scrub, any data stripes that were written during
>>> previous crashes may be destroyed. And all that assumes the scrub bugs
>>> are fixed first.
>>
>> This is true. I didn't take this into account.
>> But this is not a *single* problem, but 2 problems.
>> 1) Power loss
>> 2) Device crash
>> Before making things complex, why not focusing on single problem.
>
> Solve one problem at a time--but don't lose sight of the whole list of
> problems either, especially when they are interdependent.
>
>> Not to mention the possibility is much smaller than single problem.
>
> Having field experience with both problems, I disagree with that. The
> power loss/system crash problem is much more common than the device
> failure/scrub problems. More data is lost when a disk fails, but the
> amount of data lost in a power failure isn't zero. Before I gave up on
> btrfs raid5, it worked out to about equal amounts of admin time
> recovering from the two different failure modes.
>
>>> If writes occur after a disk fails, they all temporarily corrupt small
>>> amounts of data in the filesystem. btrfs cannot tolerate any metadata
>>> corruption (it relies on redundant metadata to self-repair), so when a
>>> write to metadata is interrupted, the filesystem is instantly doomed
>>> (damaged beyond the current tools' ability to repair and mount
>>> read-write).
>>
>> That's why we used higher duplication level for metadata by default.
>> And considering metadata size, it's much more acceptable to use RAID1
>> for metadata than RAID5/6.
>
> Data RAID5 metadata RAID1 makes a limited amount of sense.
> Small amounts of data are still lost on power failures due to RMW on the
> data stripes. It just doesn't break the entire filesystem because the
> metadata is on RAID1 and RAID1 doesn't use RMW.
>
> Data RAID6 does not make sense, unless we also have a way to have RAID1
> make more than one mirror copy. With one mirror copy an array is not
> able to tolerate two disk failures, so the Q stripe for RAID6 is wasted
> CPU and space.
>
>>> Currently the upper layers of the filesystem assume that once data
>>> blocks are written to disk, they are stable. This is not true in
>>> raid5/6 because the parity and data blocks within each stripe cannot
>>> be updated atomically.
>>
>> True, but if we ignore parity, we'd find that, RAID5 is just RAID0.
>
> Degraded RAID5 is not RAID0. RAID5 has strict constraints that RAID0
> does not. The way a RAID5 implementation behaves in degraded mode is the
> thing that usually matters after a disk fails.
>
>> COW ensures (cowed) data and metadata are all safe and checksum will
>> ensure they are OK, so even for RAID0, it's not a problem for case like
>> power loss.
>
> This is not true. btrfs does not use stripes correctly to get CoW to
> work on RAID5/6. This is why power failures result in small amounts of
> data loss, if not filesystem-destroying disaster.

See my below comments.

And, I already said, forget parity. In that case, RAID5 without parity is just RAID0 with device rotation.

> For CoW to work you have to make sure that you never modify a RAID
> stripe that already contains committed data. Let's consider a 5-disk
> array and look at what we get when we try to reconstruct disk 2:
>
>         Disk1   Disk2   Disk3   Disk4   Disk5
>         Data1   Data2   Parity  Data3   Data4
>
> Suppose one transaction writes Data1-Data4 and Parity. This is OK
> because no metadata reference would point to this stripe before it was
> committed to disk. Here's some data as an example:
>
>         Disk1   Disk2   Disk3   Disk4   Disk5   Reconstructed Disk2

Why do the d*mn reconstruction without checking csum?

My strategy is clear enough.
Never trust parity unless all data and reconstructed data matches csum. Csum is more precious than the unreliable parity. So, please forget csum first, just consider it as RAID0, and add parity back when all csum matches with each other. (to keep things simpler I'm just using Parity = Data1 ^ Data2 ^ Data4 ^ Data5 here) Later, a transaction deletes Data3 and Data 4. Still OK, because we didn't modify any data in the stripe, so we may still be able to reconstruct the data from missing disks. The checksums for Data4 and Data5 are missing, so if there is any bitrot we lose the whole stripe (we can't tell whether the data is wrong or parity, we can't ignore the rotted data because it's included in the parity, and we didn't update the parity because deleting an extent doesn't modify its data stripe). Disk1 Disk2 Disk3 Disk4 Disk5 Reconstructed Disk2 So
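[Editorial note: the "csum first, parity second" strategy argued above can be sketched in a few lines. This is an illustrative toy, not btrfs code; the function names and the use of crc32 as the checksum are assumptions. A parity reconstruction is accepted only if every block in the resulting stripe matches its stored checksum.]

```python
# Toy model of checksum-validated reconstruction: never trust parity
# unless the rebuilt stripe passes csum verification.
import zlib
from functools import reduce

def xor_blocks(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def reconstruct_checked(data, parity, missing, csums):
    """data: list of data blocks, with data[missing] unknown (None).
    Returns the rebuilt block, or None if the stripe fails csum checks."""
    known = [b for i, b in enumerate(data) if i != missing]
    candidate = reduce(xor_blocks, known, parity)
    rebuilt = [candidate if i == missing else b for i, b in enumerate(data)]
    # Csum is more precious than parity: verify every block before trusting.
    if all(zlib.crc32(b) == c for b, c in zip(rebuilt, csums)):
        return candidate
    return None  # bitrot somewhere in the stripe: refuse the "repair"
```

With pristine surviving blocks the rebuild matches the checksums and is returned; corrupt any surviving block and the function refuses rather than silently writing back a bad reconstruction, which is the behavior the kernel scrub is criticized for above.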
Re: RAID system with adaption to changed number of disks
On Wed, Oct 12, 2016 at 09:55:28PM +0200, Adam Borowski wrote: > On Wed, Oct 12, 2016 at 01:19:37PM -0400, Zygo Blaxell wrote: > > On Wed, Oct 12, 2016 at 01:48:58PM +0800, Qu Wenruo wrote: > > > In fact, the _concept_ to solve such RMW behavior is quite simple: > > > > > > Make sector size equal to stripe length. (Or vice versa if you like) > > > > > > Although the implementation will be more complex, people like Chandan are > > > already working on sub page size sector size support. > > > > So...metadata blocks would be 256K on the 5-disk RAID5 example above, > > and any file smaller than 256K would be stored inline? Ouch. That would > > also imply the compressed extent size limit (currently 128K) has to become > > much larger. > > > > I had been thinking that we could inject "plug" extents to fill up > > RAID5 stripes. This lets us keep the 4K block size for allocations, > > but at commit (or delalloc) time we would fill up any gaps in new RAID > > stripes to prevent them from being modified. As the real data is deleted > > from the RAID stripes, it would be replaced by "plug" extents to keep any > > new data from being allocated in the stripe. When the stripe consists > > entirely of "plug" extents, the plug extent would be deleted, allowing > > the stripe to be allocated again. The "plug" data would be zero for > > the purposes of parity reconstruction, regardless of what's on the disk. > > Balance would just throw the plug extents away (no need to relocate them). > > Your idea sounds good, but there's one problem: most real users don't > balance. Ever. Contrary to the tribal wisdom here, this actually works > fine, unless you had a pathologic load skewed to either data or metadata on > the first write then fill the disk to near-capacity with a load skewed the > other way. 
> Most usage patterns produce a mix of transient and persistent data (and at > write time you don't know which file is which), meaning that with time every > stripe will contain a smidge of cold data plus a fill of plug extents. Yes, it'll certainly reduce storage efficiency. I think all the RMW-avoidance strategies have this problem. The alternative is to risk losing data or the entire filesystem on disk failure, so any of the RMW-avoidance strategies are probably a worthwhile tradeoff. Big RAID5/6 arrays tend to be used mostly for storing large sequentially-accessed files which are less susceptible to this kind of problem. If the pattern is lots of small random writes then performance on raid5 will be terrible anyway (though it may even be improved by using plug extents, since RMW stripe updates would be replaced with pure CoW). > Thus, while the plug extents idea doesn't suffer from problems of big > sectors you just mentioned, we'd need some kind of auto-balance. Another way to approach the problem is to relocate the blocks in partially filled RMW stripes so they can be effectively CoW stripes; however, the requirement to do full extent relocations leads to some nasty write amplification and performance ramifications. Balance is hugely heavy I/O load and there are good reasons not to incur it at unexpected times. > > -- > A MAP07 (Dead Simple) raspberry tincture recipe: 0.5l 95% alcohol, 1kg > raspberries, 0.4kg sugar; put into a big jar for 1 month. Filter out and > throw away the fruits (can dump them into a cake, etc), let the drink age > at least 3-6 months.
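[Editorial note: the "plug extent" allocation discipline debated above can be sketched as a toy allocator. All names here are invented for illustration; this is not the btrfs allocator. New extents go only into stripes with no committed data, and leftover slots are plugged so a later transaction can never RMW the stripe.]

```python
STRIPE_WIDTH = 4   # data blocks per RAID stripe (e.g. 5 disks, 1 parity)
PLUG = object()    # placeholder extent: reads as zero, never reallocated

class PlugAllocator:
    def __init__(self, n_stripes):
        self.stripes = [[None] * STRIPE_WIDTH for _ in range(n_stripes)]

    def commit(self, blocks):
        """Place a transaction's blocks using only all-free stripes,
        plugging any gaps so the stripe is never modified again."""
        blocks = list(blocks)
        for stripe in self.stripes:
            if not blocks:
                return True
            if any(slot is not None for slot in stripe):
                continue   # stripe already holds committed data or plugs
            for i in range(STRIPE_WIDTH):
                stripe[i] = blocks.pop(0) if blocks else PLUG
        return not blocks  # False if we ran out of free stripes
```

The toy also shows Adam's objection directly: every partially filled commit burns a whole stripe, so without some auto-balance that reclaims stripes of cold data plus plugs, free space erodes over time.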
Re: RAID system with adaption to changed number of disks
On Wed, Oct 12, 2016 at 11:19 AM, Zygo Blaxell wrote: > Degraded RAID5 is not RAID0. RAID5 has strict constraints that RAID0 > does not. The way a RAID5 implementation behaves in degraded mode is > the thing that usually matters after a disk fails. Is there degraded raid5 xfstesting happening? Or are the tests mainly done non-degraded? In particular, 2x device fail degraded raid6, because it's so expensive, has potential to expose even more bugs. > So...metadata blocks would be 256K on the 5-disk RAID5 example above, > and any file smaller than 256K would be stored inline? Ouch. That would > also imply the compressed extent size limit (currently 128K) has to become > much larger. There are patches to set strip size. Does it make sense to specify 4KiB strip size for metadata block groups and 64+KiB for data block groups? -- Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
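[Editorial note: the arithmetic behind Chris's question, as a sketch. The 16 KiB figure assumes the usual btrfs default nodesize; the strip sizes are the ones proposed above.]

```python
def full_stripe_kib(n_disks, strip_kib, n_parity=1):
    # Data capacity of one full RAID stripe, in KiB.
    return (n_disks - n_parity) * strip_kib

# 5-disk RAID5 with 4 KiB metadata strips: one full stripe equals one
# 16 KiB metadata node, so every node write could be a full-stripe
# write with no RMW.
metadata_stripe = full_stripe_kib(5, 4)
# The same array with 64 KiB data strips gives the 256 KiB full stripe
# quoted in the discussion above.
data_stripe = full_stripe_kib(5, 64)
```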
Re: RAID system with adaption to changed number of disks
On Thu, Oct 13, 2016 at 12:33:31AM +0500, Roman Mamedov wrote: > On Wed, 12 Oct 2016 15:19:16 -0400 > Zygo Blaxell wrote: > > > I'm not even sure btrfs does this--I haven't checked precisely what > > it does in dup mode. It could send both copies of metadata to the > > disks with a single barrier to separate both metadata updates from > > the superblock updates. That would be bad in this particular case. > > It would be bad in any case, including a single physical disk and no RAID, and No, a single disk does not have these problems. On a single disk we don't have to deal with temporarily corrupted metadata _outside_ the areas we are writing, as the disk will confine damaged data to individual sectors. On RAID5, data damage is only limited at the stripe level, a unit orders of magnitude larger than a sector. > I don't think there's any basis to speculate that mdadm doesn't implement > write barriers properly. btrfs and mdadm have to use them properly together. It's possible to get it fatally wrong from the btrfs side even if mdadm does everything perfectly. Single disks don't have stripe consistency requirements, so if btrfs has single-disk assumptions about the behavior of writes then it can do the wrong thing on multi-disk systems. > > In degraded RAID5/6 mode, all writes temporarily corrupt data, so if there > > is an interruption (system crash, a disk times out, etc) in degraded mode, > > Moreover, in any non-COW system writes temporarily corrupt data. So again, > writing to a (degraded or not) mdadm RAID5 is not much different than writing > to a single physical disk. However I believe in the Btrfs case metadata is > always COW, so this particular problem may be not as relevant here in the > first place. Degraded RAID5 does not behave like a single disk. That's the point people seem to keep missing when thinking about this. btrfs CoW relies on single-disk behavior, and fails badly when it doesn't get it.
btrfs CoW requires that writes to one sector don't modify or jeopardize data integrity in any other sectors. mdadm in degraded raid5/6 mode with no stripe journal device cannot deliver this requirement. Writes always temporarily disrupt data on other disks in the same RAID stripe. Each individual disruption lasts only milliseconds, but there may be hundreds or thousands of failure windows per second. > > -- > With respect, > Roman
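[Editorial note: the barrier argument in this exchange can be made concrete with a toy model, my own construction rather than mdadm or btrfs code. A barrier splits the write stream into groups; a crash may persist any subset of the in-flight group. With one barrier covering both dup metadata copies, some crash states tear both copies at once; with a barrier between the copies, one copy always survives intact.]

```python
from itertools import combinations

def crash_states(groups):
    """groups: list of write sets separated by barriers. Yield every
    on-disk state a crash can leave: all fully flushed groups plus an
    arbitrary subset of the group currently in flight."""
    done = set()
    for group in groups:
        for r in range(len(group) + 1):
            for subset in combinations(sorted(group), r):
                yield done | set(subset)
        done |= group

COPY_A = {"A1", "A2"}   # blocks of dup metadata copy A
COPY_B = {"B1", "B2"}   # blocks of dup metadata copy B

def torn(state, copy):
    return 0 < len(state & copy) < len(copy)   # partially written copy

single_barrier = [COPY_A | COPY_B, {"superblock"}]
split_barriers = [COPY_A, COPY_B, {"superblock"}]

both_torn_single = any(torn(s, COPY_A) and torn(s, COPY_B)
                       for s in crash_states(single_barrier))
both_torn_split = any(torn(s, COPY_A) and torn(s, COPY_B)
                      for s in crash_states(split_barriers))
```

The model only captures ordering, not the degraded-RAID5 stripe disruption Zygo describes; on a parity array in degraded mode even a correctly barriered copy can be damaged by writes to neighboring blocks in the same stripe.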
Re: RAID system with adaption to changed number of disks
On Wed, Oct 12, 2016 at 01:19:37PM -0400, Zygo Blaxell wrote: > On Wed, Oct 12, 2016 at 01:48:58PM +0800, Qu Wenruo wrote: > > In fact, the _concept_ to solve such RMW behavior is quite simple: > > > > Make sector size equal to stripe length. (Or vice versa if you like) > > > > Although the implementation will be more complex, people like Chandan are > > already working on sub page size sector size support. > > So...metadata blocks would be 256K on the 5-disk RAID5 example above, > and any file smaller than 256K would be stored inline? Ouch. That would > also imply the compressed extent size limit (currently 128K) has to become > much larger. > > I had been thinking that we could inject "plug" extents to fill up > RAID5 stripes. This lets us keep the 4K block size for allocations, > but at commit (or delalloc) time we would fill up any gaps in new RAID > stripes to prevent them from being modified. As the real data is deleted > from the RAID stripes, it would be replaced by "plug" extents to keep any > new data from being allocated in the stripe. When the stripe consists > entirely of "plug" extents, the plug extent would be deleted, allowing > the stripe to be allocated again. The "plug" data would be zero for > the purposes of parity reconstruction, regardless of what's on the disk. > Balance would just throw the plug extents away (no need to relocate them). Your idea sounds good, but there's one problem: most real users don't balance. Ever. Contrary to the tribal wisdom here, this actually works fine, unless you had a pathologic load skewed to either data or metadata on the first write then fill the disk to near-capacity with a load skewed the other way. Most usage patterns produce a mix of transient and persistent data (and at write time you don't know which file is which), meaning that with time every stripe will contain a smidge of cold data plus a fill of plug extents. 
Thus, while the plug extents idea doesn't suffer from problems of big sectors you just mentioned, we'd need some kind of auto-balance. -- A MAP07 (Dead Simple) raspberry tincture recipe: 0.5l 95% alcohol, 1kg raspberries, 0.4kg sugar; put into a big jar for 1 month. Filter out and throw away the fruits (can dump them into a cake, etc), let the drink age at least 3-6 months.
Re: RAID system with adaption to changed number of disks
On Wed, 12 Oct 2016 15:19:16 -0400 Zygo Blaxell wrote: > I'm not even sure btrfs does this--I haven't checked precisely what > it does in dup mode. It could send both copies of metadata to the > disks with a single barrier to separate both metadata updates from > the superblock updates. That would be bad in this particular case. It would be bad in any case, including a single physical disk and no RAID, and I don't think there's any basis to speculate that mdadm doesn't implement write barriers properly. > In degraded RAID5/6 mode, all writes temporarily corrupt data, so if there > is an interruption (system crash, a disk times out, etc) in degraded mode, Moreover, in any non-COW system writes temporarily corrupt data. So again, writing to a (degraded or not) mdadm RAID5 is not much different than writing to a single physical disk. However I believe in the Btrfs case metadata is always COW, so this particular problem may be not as relevant here in the first place. -- With respect, Roman
Re: RAID system with adaption to changed number of disks
On Wed, Oct 12, 2016 at 01:31:41PM -0400, Zygo Blaxell wrote: > On Wed, Oct 12, 2016 at 12:25:51PM +0500, Roman Mamedov wrote: > > Zygo Blaxell wrote: > > > > > A btrfs -dsingle -mdup array on a mdadm raid[56] device might have a > > > snowball's chance in hell of surviving a disk failure on a live array > > > with only data losses. This would work if mdadm and btrfs successfully > > > arrange to have each dup copy of metadata updated separately, and one > > > of the copies survives the raid5 write hole. I've never tested this > > > configuration, and I'd test the heck out of it before considering > > > using it. > > > > Not sure what you mean here, a non-fatal disk failure (i.e. within being > > compensated by redundancy) is invisible to the upper layers on mdadm arrays. > > They do not need to "arrange" anything, on such failure from the point of > > view > > of Btrfs nothing whatsoever has happened to the /dev/mdX block device, it's > > still perfectly and correctly readable and writable. > > btrfs hurls a bunch of writes for one metadata copy to mdadm, mdadm > forwards those writes to the disks. btrfs sends a barrier to mdadm, > mdadm must properly forward that barrier to all the disks and wait until > they're all done. Repeat the above for the other metadata copy. I'm not even sure btrfs does this--I haven't checked precisely what it does in dup mode. It could send both copies of metadata to the disks with a single barrier to separate both metadata updates from the superblock updates. That would be bad in this particular case. > If that's all implemented correctly in mdadm, all is well; otherwise, > mdadm and btrfs fail to arrange to have each dup copy of metadata > updated separately.
To be clearer about the consequences of this: If both copies of metadata are updated at the same time (because btrfs and mdadm failed to get the barriers right), it's possible to have both copies of metadata in an inconsistent (unreadable) state at the same time, ending the filesystem. In degraded RAID5/6 mode, all writes temporarily corrupt data, so if there is an interruption (system crash, a disk times out, etc) in degraded mode, one of the metadata copies will be damaged. The damage may not be limited to the current commit, so we need the second copy of the metadata intact to recover from broken changes to the first copy. Usually metadata chunks are larger than RAID5 stripes, so this works out for btrfs on mdadm RAID5 (maybe not if two metadata chunks are adjacent and not stripe-aligned, but that's a rare case, and one that only affects array sizes that are not a power of 2 + 1 disk for RAID5, or power of 2 + 2 disks for RAID6). > The present state of the disks is irrelevant. The array could go > degraded due to a disk failure at any time, so for practical failure > analysis purposes, only the behavior in degraded mode is relevant. > > -- > With respect, > Roman
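[Editorial note: the stripe-alignment aside is easy to check with arithmetic. The 64 KiB strip size and the 256 MiB metadata chunk size are assumptions for illustration, not values taken from the thread.]

```python
KIB = 1024
MIB = 1024 * KIB

def data_stripe_bytes(n_disks, strip=64 * KIB, n_parity=1):
    # Bytes of data in one full RAID stripe: data strips times strip size.
    return (n_disks - n_parity) * strip

chunk = 256 * MIB  # assumed metadata chunk size

# 5-disk RAID5: 4 data strips (a power of two), so a chunk is a whole
# number of stripes and chunk boundaries stay stripe-aligned.
aligned_5 = chunk % data_stripe_bytes(5) == 0
# 6-disk RAID5: 5 data strips, so chunk boundaries drift off stripe
# boundaries -- the "not a power of 2 + 1 disk" case mentioned above.
aligned_6 = chunk % data_stripe_bytes(6) == 0
```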
Re: RAID system with adaption to changed number of disks
On Wed, Oct 12, 2016 at 12:25:51PM +0500, Roman Mamedov wrote: > Zygo Blaxell wrote: > > > A btrfs -dsingle -mdup array on a mdadm raid[56] device might have a > > snowball's chance in hell of surviving a disk failure on a live array > > with only data losses. This would work if mdadm and btrfs successfully > > arrange to have each dup copy of metadata updated separately, and one > > of the copies survives the raid5 write hole. I've never tested this > > configuration, and I'd test the heck out of it before considering > > using it. > > Not sure what you mean here, a non-fatal disk failure (i.e. within being > compensated by redundancy) is invisible to the upper layers on mdadm arrays. > They do not need to "arrange" anything, on such failure from the point of view > of Btrfs nothing whatsoever has happened to the /dev/mdX block device, it's > still perfectly and correctly readable and writable. btrfs hurls a bunch of writes for one metadata copy to mdadm, mdadm forwards those writes to the disks. btrfs sends a barrier to mdadm, mdadm must properly forward that barrier to all the disks and wait until they're all done. Repeat the above for the other metadata copy. If that's all implemented correctly in mdadm, all is well; otherwise, mdadm and btrfs fail to arrange to have each dup copy of metadata updated separately. The present state of the disks is irrelevant. The array could go degraded due to a disk failure at any time, so for practical failure analysis purposes, only the behavior in degraded mode is relevant. > > -- > With respect, > Roman
Re: RAID system with adaption to changed number of disks
On Wed, Oct 12, 2016 at 01:48:58PM +0800, Qu Wenruo wrote: > >btrfs also doesn't avoid the raid5 write hole properly. After a crash, > >a btrfs filesystem (like mdadm raid[56]) _must_ be scrubbed (resynced) > >to reconstruct any parity that was damaged by an incomplete data stripe > >update. > > As long as all disks are working, the parity can be reconstructed > >from the data disks. If a disk fails prior to the completion of the > >scrub, any data stripes that were written during previous crashes may > >be destroyed. And all that assumes the scrub bugs are fixed first. > > This is true. > I didn't take this into account. > > But this is not a *single* problem, but 2 problems. > 1) Power loss > 2) Device crash > > Before making things complex, why not focusing on single problem. Solve one problem at a time--but don't lose sight of the whole list of problems either, especially when they are interdependent. > Not to mention the possibility is much smaller than single problem. Having field experience with both problems, I disagree with that. The power loss/system crash problem is much more common than the device failure/scrub problems. More data is lost when a disk fails, but the amount of data lost in a power failure isn't zero. Before I gave up on btrfs raid5, it worked out to about equal amounts of admin time recovering from the two different failure modes. > >If writes occur after a disk fails, they all temporarily corrupt small > >amounts of data in the filesystem. btrfs cannot tolerate any metadata > >corruption (it relies on redundant metadata to self-repair), so when a > >write to metadata is interrupted, the filesystem is instantly doomed > >(damaged beyond the current tools' ability to repair and mount > >read-write). > > That's why we used higher duplication level for metadata by default. > And considering metadata size, it's much acceptable to use RAID1 for > metadata other than RAID5/6. Data RAID5 metadata RAID1 makes a limited amount of sense.
Small amounts of data are still lost on power failures due to RMW on the data stripes. It just doesn't break the entire filesystem because the metadata is on RAID1 and RAID1 doesn't use RMW. Data RAID6 does not make sense, unless we also have a way to have RAID1 make more than one mirror copy. With one mirror copy an array is not able to tolerate two disk failures, so the Q stripe for RAID6 is wasted CPU and space. > >Currently the upper layers of the filesystem assume that once data > >blocks are written to disk, they are stable. This is not true in raid5/6 > >because the parity and data blocks within each stripe cannot be updated > >atomically. > > True, but if we ignore parity, we'd find that, RAID5 is just RAID0. Degraded RAID5 is not RAID0. RAID5 has strict constraints that RAID0 does not. The way a RAID5 implementation behaves in degraded mode is the thing that usually matters after a disk fails. > COW ensures (cowed) data and metadata are all safe and checksum will ensure > they are OK, so even for RAID0, it's not a problem for case like power loss. This is not true. btrfs does not use stripes correctly to get CoW to work on RAID5/6. This is why power failures result in small amounts of data loss, if not filesystem-destroying disaster. For CoW to work you have to make sure that you never modify a RAID stripe that already contains committed data. Let's consider a 5-disk array and look at what we get when we try to reconstruct disk 2:

Disk1  Disk2  Disk3   Disk4  Disk5
Data1  Data2  Parity  Data3  Data4

Suppose one transaction writes Data1-Data4 and Parity. This is OK because no metadata reference would point to this stripe before it was committed to disk. Here's some data as an example:

Disk1  Disk2  Disk3  Disk4  Disk5   Reconstructed Disk2

(to keep things simpler I'm just using Parity = Data1 ^ Data2 ^ Data3 ^ Data4 here) Later, a transaction deletes Data3 and Data4. Still OK, because we didn't modify any data in the stripe, so we may still be able to reconstruct the data from missing disks. The checksums for Data3 and Data4 are missing, so if there is any bitrot we lose the whole stripe (we can't tell whether the data is wrong or parity, we can't ignore the rotted data because it's included in the parity, and we didn't update the parity because deleting an extent doesn't modify its data stripe).

Disk1  Disk2  Disk3  Disk4  Disk5   Reconstructed Disk2

Now a third transaction allocates Data3 and Data4. Bad. First, Disk4 is written and existing data is temporarily corrupted:

Disk1  Disk2  Disk3  Disk4  Disk5   Reconstructed Disk2
1234 7452

then Disk5 is written, and the data is still corrupted:

Disk1  Disk2  Disk3  Disk4  Disk5   Reconstructed Disk2
1234 5678 aaa2

then parity is written, and the
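[Editorial note: the interrupted stripe update in this example can be played out numerically. A minimal sketch, assuming single-byte strips and XOR parity over the four data strips; the concrete values are invented for the demonstration.]

```python
# Write-hole sketch for the 5-disk layout above (Disk3 holds parity):
# a transaction rewrites Data3 and Data4 in place (RMW). If the crash
# happens before the parity strip is updated, a later loss of Disk2
# reconstructs garbage for Data2 -- a block that was never written.
from functools import reduce

def xor_blocks(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def rebuild_data2(data1, data3, data4, parity):
    # For XOR parity: Data2 = Data1 ^ Data3 ^ Data4 ^ Parity
    return reduce(xor_blocks, [data3, data4, parity], data1)

old = {"d1": bytes([1]), "d2": bytes([2]), "d3": bytes([3]), "d4": bytes([4])}
parity = reduce(xor_blocks, [old["d2"], old["d3"], old["d4"]], old["d1"])

new_d3, new_d4 = bytes([7]), bytes([8])
# Crash window: the data strips are on disk but the parity strip is old.
torn = rebuild_data2(old["d1"], new_d3, new_d4, parity)
# Once the parity write completes, reconstruction is correct again.
new_parity = reduce(xor_blocks, [old["d2"], new_d3, new_d4], old["d1"])
healed = rebuild_data2(old["d1"], new_d3, new_d4, new_parity)
```

`torn` differs from the original Data2 even though Data2's disk was never touched by the transaction, which is exactly the temporary corruption of unrelated committed data being described.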
Re: RAID system with adaption to changed number of disks
On Tue, 11 Oct 2016 17:58:22 -0600 Chris Murphy wrote: > But consider the identical scenario with md or LVM raid5, or any > conventional hardware raid5. A scrub check simply reports a mismatch. > It's unknown whether data or parity is bad, so the bad data strip is > propagated upward to user space without error. On a scrub repair, the > data strip is assumed to be good, and good parity is overwritten with > bad. That's why I love to use Btrfs on top of mdadm RAID5/6 -- combining a mature and stable RAID implementation with Btrfs anti-corruption checksumming "watchdog". In the case that you described, no silent corruption will occur, as Btrfs will report an uncorrectable read error -- and I can just restore the file in question from backups. On Wed, 12 Oct 2016 00:37:19 -0400 Zygo Blaxell wrote: > A btrfs -dsingle -mdup array on a mdadm raid[56] device might have a > snowball's chance in hell of surviving a disk failure on a live array > with only data losses. This would work if mdadm and btrfs successfully > arrange to have each dup copy of metadata updated separately, and one > of the copies survives the raid5 write hole. I've never tested this > configuration, and I'd test the heck out of it before considering > using it. Not sure what you mean here, a non-fatal disk failure (i.e. within being compensated by redundancy) is invisible to the upper layers on mdadm arrays. They do not need to "arrange" anything, on such failure from the point of view of Btrfs nothing whatsoever has happened to the /dev/mdX block device, it's still perfectly and correctly readable and writable. -- With respect, Roman
Re: RAID system with adaption to changed number of disks
> Missing device is the _only_ thing the current design handles.

Right. The patches below in the ML added two more device states, offline and failed. It is tested with raid1.

[PATCH 11/13] btrfs: introduce device dynamic state transition to offline or failed
[PATCH 12/13] btrfs: check device for critical errors and mark failed

Thanks, Anand
Re: RAID system with adaption to changed number of disks
At 10/12/2016 12:37 PM, Zygo Blaxell wrote:
> On Wed, Oct 12, 2016 at 09:32:17AM +0800, Qu Wenruo wrote:
> > > But consider the identical scenario with md or LVM raid5, or any conventional hardware raid5. A scrub check simply reports a mismatch. It's unknown whether data or parity is bad, so the bad data strip is propagated upward to user space without error. On a scrub repair, the data strip is assumed to be good, and good parity is overwritten with bad.
> >
> > Totally true. Original RAID5/6 design is only to handle missing device, not rotted bits.
>
> Missing device is the _only_ thing the current design handles. i.e. you umount the filesystem cleanly, remove a disk, and mount it again degraded, and then the only thing you can safely do with the filesystem is delete or replace a device. There is also a probability of being able to repair bitrot under some circumstances. If your disk failure looks any different from this, btrfs can't handle it. If a disk fails while the array is running and the filesystem is writing, the filesystem is likely to be severely damaged, possibly unrecoverably.
>
> A btrfs -dsingle -mdup array on a mdadm raid[56] device might have a snowball's chance in hell of surviving a disk failure on a live array with only data losses. This would work if mdadm and btrfs successfully arrange to have each dup copy of metadata updated separately, and one of the copies survives the raid5 write hole. I've never tested this configuration, and I'd test the heck out of it before considering using it.
>
> > > So while I agree in total that Btrfs raid56 isn't mature or tested enough to consider it production ready, I think that's because of the UNKNOWN causes for problems we've seen with raid56. Not the parity scrub bug which - yeah NOT good, not least of which is the data integrity guarantees Btrfs is purported to make are substantially negated by this bug. I think the bark is worse than the bite. It is not the bark we'd like Btrfs to have though, for sure.
> >
> > Current btrfs RAID5/6 scrub problem is, we don't take full usage of tree and data checksum.
> [snip]
> This leads directly to a variety of problems with the diagnostic tools, e.g. scrub reports errors randomly across devices, and cannot report the path of files containing corrupted blocks if it's the parity block that gets corrupted.

At least better than screwing up good stripes. The tool is just used to let user know if there is any corrupted stripes like kernel scrub, but with better behavior, like won't reconstruct stripes ignoring checksum. For human readable report, it's not that hard (compared to the complex csum and parity check) to implement and can be added later. For parity report, there is no way to output any human readable result anyway.

> btrfs also doesn't avoid the raid5 write hole properly. After a crash, a btrfs filesystem (like mdadm raid[56]) _must_ be scrubbed (resynced) to reconstruct any parity that was damaged by an incomplete data stripe update. As long as all disks are working, the parity can be reconstructed from the data disks. If a disk fails prior to the completion of the scrub, any data stripes that were written during previous crashes may be destroyed. And all that assumes the scrub bugs are fixed first.

This is true. I didn't take this into account.

But this is not a *single* problem, but 2 problems: 1) Power loss 2) Device crash. Before making things complex, why not focusing on single problem. Not to mention the possibility is much smaller than single problem.

> If writes occur after a disk fails, they all temporarily corrupt small amounts of data in the filesystem. btrfs cannot tolerate any metadata corruption (it relies on redundant metadata to self-repair), so when a write to metadata is interrupted, the filesystem is instantly doomed (damaged beyond the current tools' ability to repair and mount read-write).

That's why we used higher duplication level for metadata by default. And considering metadata size, it's much acceptable to use RAID1 for metadata other than RAID5/6.

> Currently the upper layers of the filesystem assume that once data blocks are written to disk, they are stable. This is not true in raid5/6 because the parity and data blocks within each stripe cannot be updated atomically.

True, but if we ignore parity, we'd find that, RAID5 is just RAID0. COW ensures (cowed) data and metadata are all safe and checksum will ensure they are OK, so even for RAID0, it's not a problem for case like power loss. So we should follow csum first and then parity. If we follow this principle, RAID5 should be a raid0 with a little higher possibility to recover some cases, like missing one device. So, I'd like to fix RAID5 scrub to make it at least better than RAID0, not worse than RAID0.

> btrfs doesn't avoid writing new data in the same RAID stripe as old data (it provides a rmw function for raid56, which is simply a bug in a CoW filesystem), so previously committed data can be
Re: RAID system with adaption to changed number of disks
On Wed, Oct 12, 2016 at 09:32:17AM +0800, Qu Wenruo wrote: > >But consider the identical scenario with md or LVM raid5, or any > >conventional hardware raid5. A scrub check simply reports a mismatch. > >It's unknown whether data or parity is bad, so the bad data strip is > >propagated upward to user space without error. On a scrub repair, the > >data strip is assumed to be good, and good parity is overwritten with > >bad. > > Totally true. > > Original RAID5/6 design is only to handle missing device, not rotted bits. Missing device is the _only_ thing the current design handles. i.e. you umount the filesystem cleanly, remove a disk, and mount it again degraded, and then the only thing you can safely do with the filesystem is delete or replace a device. There is also a probability of being able to repair bitrot under some circumstances. If your disk failure looks any different from this, btrfs can't handle it. If a disk fails while the array is running and the filesystem is writing, the filesystem is likely to be severely damaged, possibly unrecoverably. A btrfs -dsingle -mdup array on a mdadm raid[56] device might have a snowball's chance in hell of surviving a disk failure on a live array with only data losses. This would work if mdadm and btrfs successfully arrange to have each dup copy of metadata updated separately, and one of the copies survives the raid5 write hole. I've never tested this configuration, and I'd test the heck out of it before considering using it. > >So while I agree in total that Btrfs raid56 isn't mature or tested > >enough to consider it production ready, I think that's because of the > >UNKNOWN causes for problems we've seen with raid56. Not the parity > >scrub bug which - yeah NOT good, not least of which is the data > >integrity guarantees Btrfs is purported to make are substantially > >negated by this bug. I think the bark is worse than the bite. It is > >not the bark we'd like Btrfs to have though, for sure. 
> > > > Current btrfs RAID5/6 scrub problem is, we don't take full usage of tree and > data checksum. [snip] This leads directly to a variety of problems with the diagnostic tools, e.g. scrub reports errors randomly across devices, and cannot report the path of files containing corrupted blocks if it's the parity block that gets corrupted. btrfs also doesn't avoid the raid5 write hole properly. After a crash, a btrfs filesystem (like mdadm raid[56]) _must_ be scrubbed (resynced) to reconstruct any parity that was damaged by an incomplete data stripe update. As long as all disks are working, the parity can be reconstructed from the data disks. If a disk fails prior to the completion of the scrub, any data stripes that were written during previous crashes may be destroyed. And all that assumes the scrub bugs are fixed first. If writes occur after a disk fails, they all temporarily corrupt small amounts of data in the filesystem. btrfs cannot tolerate any metadata corruption (it relies on redundant metadata to self-repair), so when a write to metadata is interrupted, the filesystem is instantly doomed (damaged beyond the current tools' ability to repair and mount read-write). Currently the upper layers of the filesystem assume that once data blocks are written to disk, they are stable. This is not true in raid5/6 because the parity and data blocks within each stripe cannot be updated atomically. btrfs doesn't avoid writing new data in the same RAID stripe as old data (it provides a rmw function for raid56, which is simply a bug in a CoW filesystem), so previously committed data can be lost. If the previously committed data is part of the metadata tree, the filesystem is doomed; for ordinary data blocks there are just a few dozen to a few thousand corrupted files for the admin to clean up after each crash. 
It might be possible to hack up the allocator to pack writes into empty
stripes to avoid the write hole, but every time I think about this it
looks insanely hard to do (or insanely wasteful of space) for data
stripes.
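To make the space tradeoff concrete, here is a hypothetical sketch of such
a stripe-packing allocator (class and field names are mine; nothing here
is the actual btrfs allocator):

```python
# Sketch: an allocator that avoids the write hole by placing new CoW
# writes only in stripes containing no committed data, so no live
# stripe is ever RMW'd.  Entirely hypothetical, not btrfs code.
STRIPE_BLOCKS = 4   # data blocks per raid5 stripe in this toy

class StripeAllocator:
    def __init__(self, total_stripes):
        self.free_stripes = list(range(total_stripes))
        self.open_stripe = None   # (stripe_id, next_free_slot)

    def alloc_block(self):
        """Return (stripe, slot); never touches a committed stripe."""
        if self.open_stripe is None:
            if not self.free_stripes:
                raise RuntimeError("no empty stripes left")
            self.open_stripe = (self.free_stripes.pop(0), 0)
        stripe, slot = self.open_stripe
        nxt = slot + 1
        self.open_stripe = (stripe, nxt) if nxt < STRIPE_BLOCKS else None
        return stripe, slot

    def commit(self):
        """End of transaction: abandon the partially filled stripe so
        the next commit starts fresh.  Returns blocks wasted as padding,
        which is the space cost of this scheme."""
        wasted = 0
        if self.open_stripe is not None:
            wasted = STRIPE_BLOCKS - self.open_stripe[1]
            self.open_stripe = None
        return wasted
```

With four-block stripes, a commit that fills only one block of a stripe
pads the other three; for workloads with many small commits, that padding
is exactly the "insanely wasteful of space" cost mentioned above.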
Re: RAID system with adaption to changed number of disks
Ignoring the RAID56 bugs for a moment, if you have mismatched drives,
BtrFS RAID1 is a pretty good way of utilising available space and having
redundancy.

My home array is BtrFS with a cobbled-together collection of disks ranging
from 500GB to 3TB (and 5 of them, so it's not an even number). I have a
grand total of 8TB of linear space, and with BtrFS RAID1 I can use exactly
50% of this (4TB) even with the weird combination of disks. That's
something other RAID1 implementations can't do (they're limited to the
size of the smallest disk of any pair, and need an even number of disks
all up), and I get free compression and snapshotting, so yay for that.

As drives die of natural old age, I replace them ad hoc with bigger drives
(whatever is the sane price-point at the time). A replace followed by a
rebalance later, and I'm back to using all available space (which grows
every time I throw a bigger drive in the mix), which again is incredibly
handy when you're a home user looking for sane long-term storage that
doesn't require complete rebuilds of your array.

-Dan

Dan Mons - VFX Sysadmin
Cutting Edge
http://cuttingedge.com.au

On 12 October 2016 at 01:14, Philip Louis Moetteli wrote:
> Hello,
>
> I have to build a RAID 6 with the following 3 requirements:
>
> • Use different kinds of disks with different sizes.
> • When a disk fails and there's enough space, the RAID should be able
> to reconstruct itself out of the degraded state. Meaning, if I have e. g. a
> RAID with 8 disks and 1 fails, I should be able to chose to transform this in
> a non-degraded (!) RAID with 7 disks.
> • Also the other way round: If I add a disk of what size ever, it
> should redistribute the data, so that it becomes a RAID with 9 disks.
>
> I don’t care, if I have to do it manually.
> I don’t care so much about speed either.
>
> Is BTrFS capable of doing that?
>
> Thanks a lot for your help!
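Dan's "exactly 50%" figure falls out of how btrfs raid1 allocates chunks:
each chunk is mirrored onto the two devices with the most free space, so
usable space is limited either by half the total or by what the largest
disk can pair with. A rough sketch of that rule (the function name is
mine; this is a simplification of the carfax.org.uk calculator's logic):

```python
def btrfs_raid1_usable(disk_sizes):
    """Approximate usable space of a btrfs raid1 array with mixed
    disk sizes.  Chunks are mirrored onto the two devices with the
    most free space, so capacity is capped either at total/2 or at
    everything the largest disk can find a partner for."""
    total = sum(disk_sizes)
    largest = max(disk_sizes)
    return min(total // 2, total - largest)

# A plausible mix like Dan's: five disks, 500GB to 3TB, 8TB total.
disks_gb = [3000, 2000, 1500, 1000, 500]
print(btrfs_raid1_usable(disks_gb))   # 4000 GB, i.e. 50% of 8TB

# Contrast: one 5TB disk plus two 1TB disks.  The 5TB disk runs out
# of partners, so only 2TB is usable despite 7TB of raw space.
print(btrfs_raid1_usable([5000, 1000, 1000]))   # 2000 GB
```

Conventional RAID1, by comparison, gives min(pair) per mirrored pair, which
is why mismatched disks waste so much space there.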
> -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID system with adaption to changed number of disks
At 10/12/2016 07:58 AM, Chris Murphy wrote:
> https://btrfs.wiki.kernel.org/index.php/Status
>
> Scrub + RAID56 Unstable will verify but not repair
>
> This doesn't seem quite accurate. It does repair the vast majority of
> the time. On scrub though, there's maybe a 1 in 3 or 1 in 4 chance bad
> data strip results in
> a.) fixed up data strip from parity
> b.) wrong recomputation of replacement parity
> c.) good parity is overwritten with bad, silently,
> d.) if parity reconstruction is needed in the future e.g. device or
> sector failure, it results in EIO, a kind of data loss.
>
> Bad bug. For sure.
>
> But consider the identical scenario with md or LVM raid5, or any
> conventional hardware raid5. A scrub check simply reports a mismatch.
> It's unknown whether data or parity is bad, so the bad data strip is
> propagated upward to user space without error. On a scrub repair, the
> data strip is assumed to be good, and good parity is overwritten with
> bad.

Totally true.

Original RAID5/6 design is only to handle missing device, not rotted bits.

> So while I agree in total that Btrfs raid56 isn't mature or tested
> enough to consider it production ready, I think that's because of the
> UNKNOWN causes for problems we've seen with raid56. Not the parity
> scrub bug which - yeah NOT good, not least of which is the data
> integrity guarantees Btrfs is purported to make are substantially
> negated by this bug. I think the bark is worse than the bite. It is
> not the bark we'd like Btrfs to have though, for sure.

The problem with current btrfs RAID5/6 scrub is that we don't take full
advantage of the tree and data checksums.

In the ideal situation, btrfs should detect which stripe is corrupted, and
only try to recover data/parity if the recovered data's checksum matches.

For example, for a very traditional RAID5 layout like the following:

  Disk 1 | Disk 2 | Disk 3
  -------+--------+-------
  Data 1 | Data 2 | Parity

Scrub should check data stripes 1 and 2 against their checksums first.

[All data extents have csums]

1) All csums match
   Good, then check parity.
   1.1) Parity matches
        Nothing wrong at all.

   1.2) Parity mismatches
        Just recalculate parity. The corruption may have happened in
        unused data space or in parity; either way, recalculating parity
        is good enough.

2) One data stripe's csum mismatches (or is missing), and parity
   mismatches too
   We only know that one data stripe mismatches, not whether parity is
   OK. Try to recover that data stripe from parity, and recheck its csum.

   2.1) Recovered data stripe matches csum
        That data stripe was corrupted and parity is OK. Recoverable.

   2.2) Recovered data stripe mismatches csum
        Both that data stripe and parity are corrupted.

3) Two data stripes' csums mismatch, no matter whether parity matches
   At least 2 stripes are screwed up. No fix anyway.

[Some data extents have no csum (nodatasum)]

4) Existing csums (or no csum at all) match, parity matches
   Good, nothing to worry about.

5) Existing csum mismatches for one data stripe, parity mismatches
   Like 2), try to recover that data stripe, and re-check its csum.

   5.1) Recovered data stripe matches csum
        At least we can recover the data covered by csum. Corrupted
        no-csum data is not our concern.

   5.2) Recovered data stripe mismatches csum
        Screwed up.

6) No csum at all, parity mismatches
   We are screwed up, just like traditional RAID5.

And I'm coding for the above cases in btrfs-progs to implement an off-line
scrub tool. Currently it looks good, and can already handle cases 1) to
3). And I tend to ignore any full stripe that lacks checksums and whose
parity mismatches.

But as you can see, there are so many things involved in btrfs RAID5
(csums existing, csums matching, parity matching, missing devices; RAID6
will be more complex) that it's already much more complex than traditional
RAID5/6 or the current scrub implementation.

So what the current kernel scrub lacks is:
1) Detection of good/bad stripes
2) Recheck of recovery attempts

But that's all traditional RAID5/6 lacks too, unless they have some hidden
checksum like btrfs's that they can use.
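Cases 1) through 3) above amount to a small decision procedure. A toy
sketch (XOR parity, every data block csum-covered; names are illustrative,
not the btrfs-progs implementation):

```python
# Sketch of the csum-aware raid5 scrub decision for one full stripe.
# Toy model only: XOR parity, all data blocks have checksums.
from functools import reduce

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def scrub_stripe(data, parity, csums, csum=hash):
    """Return an action for cases 1) to 3) of the scheme above."""
    bad = [i for i, blk in enumerate(data) if csum(blk) != csums[i]]
    if not bad:
        # Case 1: all data csums match, so the data is authoritative;
        # a parity mismatch just means parity must be rewritten (1.2).
        return "ok" if parity == reduce(xor, data) else "rewrite-parity"
    if len(bad) == 1:
        # Case 2: rebuild the bad block from parity, then recheck csum.
        i = bad[0]
        recovered = reduce(xor, data[:i] + data[i + 1:], parity)
        if csum(recovered) == csums[i]:
            return "recovered"      # case 2.1: data bad, parity good
        return "unrecoverable"      # case 2.2: data AND parity bad
    return "unrecoverable"          # case 3: >= 2 bad data stripes

# Example: one corrupted data block, healthy parity.
data = [b'AAAA', b'BBBB', b'CCCC']
parity = reduce(xor, data)
csums = [hash(b) for b in data]
print(scrub_stripe([b'AAAA', b'ZZZZ', b'CCCC'], parity, csums))
```

The key difference from a traditional md-style scrub is the recheck after
recovery: a reconstruction is only trusted if the rebuilt block's checksum
matches, which is exactly the "recheck of recovery attempts" the kernel
scrub currently lacks.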
Thanks,
Qu
Re: RAID system with adaption to changed number of disks
https://btrfs.wiki.kernel.org/index.php/Status

> Scrub + RAID56 Unstable will verify but not repair

This doesn't seem quite accurate. It does repair the vast majority of the
time. On scrub though, there's maybe a 1 in 3 or 1 in 4 chance bad data
strip results in
a.) fixed up data strip from parity
b.) wrong recomputation of replacement parity
c.) good parity is overwritten with bad, silently,
d.) if parity reconstruction is needed in the future e.g. device or
sector failure, it results in EIO, a kind of data loss.

Bad bug. For sure.

But consider the identical scenario with md or LVM raid5, or any
conventional hardware raid5. A scrub check simply reports a mismatch.
It's unknown whether data or parity is bad, so the bad data strip is
propagated upward to user space without error. On a scrub repair, the
data strip is assumed to be good, and good parity is overwritten with
bad.

So while I agree in total that Btrfs raid56 isn't mature or tested enough
to consider it production ready, I think that's because of the UNKNOWN
causes for problems we've seen with raid56. Not the parity scrub bug
which - yeah NOT good, not least of which is the data integrity guarantees
Btrfs is purported to make are substantially negated by this bug. I think
the bark is worse than the bite. It is not the bark we'd like Btrfs to
have though, for sure.

--
Chris Murphy
Re: RAID system with adaption to changed number of disks
On Tue, Oct 11, 2016 at 8:14 AM, Philip Louis Moetteli wrote:
> Hello,
>
> I have to build a RAID 6 with the following 3 requirements:

You should under no circumstances use RAID5/6 for anything other than test
and throw-away data. It has several known issues that will eat your data.
Total data loss is a real possibility. (The capability to even create
raid5/6 filesystems should imho be removed from btrfs until this changes.)

> • Use different kinds of disks with different sizes.
> • When a disk fails and there's enough space, the RAID should be able
> to reconstruct itself out of the degraded state. Meaning, if I have e. g. a
> RAID with 8 disks and 1 fails, I should be able to chose to transform this in
> a non-degraded (!) RAID with 7 disks.
> • Also the other way round: If I add a disk of what size ever, it
> should redistribute the data, so that it becomes a RAID with 9 disks.
>
> I don’t care, if I have to do it manually.
> I don’t care so much about speed either.
>
> Is BTrFS capable of doing that?
>
> Thanks a lot for your help!
Re: RAID system with adaption to changed number of disks
I think you just described all the benefits of btrfs in that type of
configuration. Unfortunately, after btrfs RAID 5 & 6 was marked as OK, it
got marked as "it will eat your data" (and there is a ton of people in
random places popping up with raid 5 & 6 that just killed their data).

On 11 October 2016 at 16:14, Philip Louis Moetteli wrote:
> Hello,
>
> I have to build a RAID 6 with the following 3 requirements:
>
> • Use different kinds of disks with different sizes.
> • When a disk fails and there's enough space, the RAID should be able
> to reconstruct itself out of the degraded state. Meaning, if I have e. g. a
> RAID with 8 disks and 1 fails, I should be able to chose to transform this in
> a non-degraded (!) RAID with 7 disks.
> • Also the other way round: If I add a disk of what size ever, it
> should redistribute the data, so that it becomes a RAID with 9 disks.
>
> I don’t care, if I have to do it manually.
> I don’t care so much about speed either.
>
> Is BTrFS capable of doing that?
>
> Thanks a lot for your help!
Re: RAID system with adaption to changed number of disks
On Tue, Oct 11, 2016 at 03:14:30PM +, Philip Louis Moetteli wrote:
> Hello,
>
> I have to build a RAID 6 with the following 3 requirements:
>
> • Use different kinds of disks with different sizes.
> • When a disk fails and there's enough space, the RAID should be able
> to reconstruct itself out of the degraded state. Meaning, if I have e. g. a
> RAID with 8 disks and 1 fails, I should be able to chose to transform this in
> a non-degraded (!) RAID with 7 disks.
> • Also the other way round: If I add a disk of what size ever, it
> should redistribute the data, so that it becomes a RAID with 9 disks.
>
> I don’t care, if I have to do it manually.
> I don’t care so much about speed either.
>
> Is BTrFS capable of doing that?

1) Take a look at http://carfax.org.uk/btrfs-usage/ which will tell you
how much space you can get out of a btrfs array with different sized
devices.

2) Btrfs's parity RAID implementation is not in good shape right now. It
has known data corruption issues, and should not be used in production.

3) The redistribution of space is something that btrfs can do. It needs
to be triggered manually at the moment, but it definitely works.

Hugo.

--
Hugo Mills             | We are all lying in the gutter, but some of us are
hugo@... carfax.org.uk | looking at the stars.
http://carfax.org.uk/  |
PGP: E2AB1DE4          |                                        Oscar Wilde
Re: RAID system with adaption to changed number of disks
On 2016-10-11 11:14, Philip Louis Moetteli wrote:
> Hello,
>
> I have to build a RAID 6 with the following 3 requirements:
>
> • Use different kinds of disks with different sizes.
> • When a disk fails and there's enough space, the RAID should be able
> to reconstruct itself out of the degraded state. Meaning, if I have e. g. a
> RAID with 8 disks and 1 fails, I should be able to chose to transform this in
> a non-degraded (!) RAID with 7 disks.
> • Also the other way round: If I add a disk of what size ever, it
> should redistribute the data, so that it becomes a RAID with 9 disks.
>
> I don’t care, if I have to do it manually.
> I don’t care so much about speed either.
>
> Is BTrFS capable of doing that?

In theory yes. In practice, BTRFS RAID5/6 mode should not be used in
production due to a number of known serious issues relating to rebuilding
and reshaping arrays.
RAID system with adaption to changed number of disks
Hello,

I have to build a RAID 6 with the following 3 requirements:

• Use different kinds of disks with different sizes.
• When a disk fails and there's enough space, the RAID should be able to
reconstruct itself out of the degraded state. Meaning, if I have e.g. a
RAID with 8 disks and 1 fails, I should be able to choose to transform
this into a non-degraded (!) RAID with 7 disks.
• Also the other way round: if I add a disk of whatever size, it should
redistribute the data, so that it becomes a RAID with 9 disks.

I don’t care if I have to do it manually.
I don’t care so much about speed either.

Is BTrFS capable of doing that?

Thanks a lot for your help!