Re: Exactly what is wrong with RAID5/6
On 2017-06-23 13:25, Michał Sokołowski wrote: Hello group. I am confused: Can somebody please confirm/deny, which RAID subsystem is affected? BTRFS' RAID5/6 or mdadm (Linux kernel raid) RAID 5/6 ? All of the issues mentioned here are specific to BTRFS raid5/raid6 profiles, with the exception of the write-hole, which inherently affects any raid5/raid6 system that does not specifically account for it (which means that it does affect MD RAID5 and RAID6 modes if you aren't using the journaling). Are there some gotchas (in terms of broken reliability) when using kernel one? The web is full of legends, it seems that this confusion is quite common... Which brings up one of the reasons I really hate the choice to use the term 'raid' in the profile names. At a minimum, we should have gone a similar route to ZFS in naming the striped parity implementations (RAID-B1 and RAID-B2 for example), but personally I really would have preferred if they were just called what they are (namely, (n,n+1) and (n,n+2) erasure coding for raid5 and raid6 respectively, with mirroring, striping and striped mirroring for raid1, raid0, and raid10), or at least used some naming scheme that wasn't obviously going to cause such issues. On 06/21/2017 12:57 AM, waxhead wrote: I am trying to piece together the actual status of the RAID5/6 bit of BTRFS. The wiki refer to kernel 3.19 which was released in February 2015 so I assume that the information there is a tad outdated (the last update on the wiki page was July 2016) https://btrfs.wiki.kernel.org/index.php/RAID56 Now there are four problems listed 1. Parity may be inconsistent after a crash (the "write hole") Is this still true, if yes - would not this apply for RAID1 / RAID10 as well? How was it solved there , and why can't that be done for RAID5/6 2. Parity data is not checksummed Why is this a problem? Does it have to do with the design of BTRFS somehow? Parity is after all just data, BTRFS does checksum data so what is the reason this is a problem? 3. No support for discard? (possibly -- needs confirmation with cmason) Does this matter that much really?, is there an update on this? 4. The algorithm uses as many devices as are available: No support for a fixed-width stripe. What is the plan for this one? There was patches on the mailing list by the SnapRAID author to support up to 6 parity devices. Will the (re?) resign of btrfs raid5/6 support a scheme that allows for multiple parity devices? I do have a few other questions as well... 5. BTRFS does still (kernel 4.9) not seem to use the device ID to communicate with devices. If you on a multi device filesystem yank out a device, for example /dev/sdg and it reappear as /dev/sdx for example btrfs will still happily try to write to /dev/sdg even if btrfs fi sh /mnt shows the correct device ID. What is the status for getting BTRFS to properly understand that a device is missing? 6. RAID1 needs to be able to make two copies always. E.g. if you have three disks you can loose one and it should still work. What about RAID10 ? If you have for example 6 disk RAID10 array, loose one disk and reboots (due to #5 above). Will RAID10 recognize that the array now is a 5 disk array and stripe+mirror over 2 disks (or possibly 2.5 disks?) instead of 3? In other words, will it work as long as it can create a RAID10 profile that requires a minimum of four disks? 
Re: Exactly what is wrong with RAID5/6
Hello group. I am confused: can somebody please confirm/deny which RAID subsystem is affected, BTRFS' RAID5/6 or mdadm (Linux kernel RAID) RAID5/6? Are there some gotchas (in terms of broken reliability) when using the kernel one? The web is full of legends; it seems that this confusion is quite common...

On 06/21/2017 12:57 AM, waxhead wrote:
> I am trying to piece together the actual status of the RAID5/6 bit of BTRFS.
> The wiki refers to kernel 3.19, which was released in February 2015, so I assume that the information there is a tad outdated (the last update on the wiki page was July 2016) https://btrfs.wiki.kernel.org/index.php/RAID56
>
> Now there are four problems listed
>
> 1. Parity may be inconsistent after a crash (the "write hole")
> Is this still true? If yes, would not this apply to RAID1 / RAID10 as well? How was it solved there, and why can't that be done for RAID5/6?
>
> 2. Parity data is not checksummed
> Why is this a problem? Does it have to do with the design of BTRFS somehow? Parity is after all just data; BTRFS does checksum data, so what is the reason this is a problem?
>
> 3. No support for discard? (possibly -- needs confirmation with cmason)
> Does this really matter that much? Is there an update on this?
>
> 4. The algorithm uses as many devices as are available: no support for a fixed-width stripe.
> What is the plan for this one? There were patches on the mailing list by the SnapRAID author to support up to 6 parity devices. Will the (re?)design of btrfs raid5/6 support a scheme that allows for multiple parity devices?
>
> I do have a few other questions as well...
>
> 5. BTRFS still does not (as of kernel 4.9) seem to use the device ID to communicate with devices.
>
> If, on a multi-device filesystem, you yank out a device, for example /dev/sdg, and it reappears as, for example, /dev/sdx, btrfs will still happily try to write to /dev/sdg even if btrfs fi sh /mnt shows the correct device ID. What is the status for getting BTRFS to properly understand that a device is missing?
>
> 6. RAID1 needs to be able to make two copies always. E.g. if you have three disks you can lose one and it should still work. What about RAID10? If you have for example a 6-disk RAID10 array, lose one disk and reboot (due to #5 above), will RAID10 recognize that the array is now a 5-disk array and stripe+mirror over 2 disks (or possibly 2.5 disks?) instead of 3? In other words, will it work as long as it can create a RAID10 profile that requires a minimum of four disks?
Re: Exactly what is wrong with RAID5/6
On 2017-06-22 04:12, Qu Wenruo wrote:
> And in that case, even if the device of data stripe 2 is missing, btrfs doesn't really need to use parity to rebuild it, as btrfs knows there is no extent in that stripe, and the data csum matches for data stripe 1.

You are assuming that there is no data on disk2. This is likely, due to the COW nature of BTRFS, but it is not always true. Anyway, the same problem happens if you are writing data on disk2. If a) the data (disk2) is written and b) the parity is not updated (due to a power failure), then up to that point you don't lose anything; but if c) disk1 then disappears, you are not in a position to recompute valid data for disk1 using only data2 and the stale parity.

> No need to use parity at all.
>
> So that's why I think the write hole is not an urgent case to handle right now.
>
> Thanks,
> Qu

-- gpg @keyserver.linux.it: Goffredo Baroncelli Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
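To make the a/b/c sequence above concrete, here is a toy sketch in Python (not btrfs code; two data strips plus one XOR parity strip, with made-up contents): new data lands on the disk-2 strip, the power fails before the parity strip is rewritten, and a later rebuild of the lost disk-1 strip from the new data plus the stale parity produces garbage.

# Toy illustration of the write-hole sequence described above (not btrfs code).
# Two data strips plus one XOR parity strip, each 4 bytes wide for readability.

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

d1_old = b"AAAA"
d2_old = b"BBBB"
parity = xor(d1_old, d2_old)          # parity consistent with the old stripe

# (a) new data is written to the disk-2 strip ...
d2_new = b"CCCC"
# (b) ... but the power fails before the parity strip is rewritten,
#     so 'parity' still reflects d1_old ^ d2_old.

# (c) disk 1 is lost; we try to rebuild its strip from d2_new + stale parity.
d1_rebuilt = xor(d2_new, parity)

print(d1_rebuilt == d1_old)           # False: the reconstruction is garbage
print(d1_rebuilt)                     # neither the old nor any valid d1 contents

This is also why the stale-but-unused parity in Qu's earlier example only becomes a real problem once a device is actually lost.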
Re: Exactly what is wrong with RAID5/6
At 06/22/2017 10:43 AM, Chris Murphy wrote:
> On Wed, Jun 21, 2017 at 8:12 PM, Qu Wenruo wrote:
>> Well, in fact, thanks to data csum and btrfs metadata CoW, there is quite a high chance that we won't cause any data damage.
>
> But we have examples where data does not COW, we see a partial stripe overwrite. And if that is interrupted it's clear that both old and new metadata pointing to that stripe is wrong. There are way more problems where we see csum errors on Btrfs raid56 after crashes, and there are no bad devices.

First, if it's interrupted, there is no new metadata, as metadata is always updated after data. And metadata is always updated CoW, so if the data write is interrupted, we are still at the previous transaction.

And in that case, no COW means no csum. Btrfs won't check the correctness due to the lack of csum. So the case will be that, for the nodatacow case, btrfs won't detect the corruption; users take the responsibility to keep their data correct.

>> For the example I gave above, no data damage at all.
>>
>> First the data is written and power loss, and data is always written before metadata, so that's to say, after power loss, the superblock is still using the old tree roots.
>>
>> So no one is really using that newly written data.
>
> OK but that assumes that the newly written data is always COW which on Btrfs raid56 is not certain, there's a bunch of RMW code which suggests overwrites are possible.

RMW is mainly to update P/Q, as even if we only update data stripe 1, we still need data stripe 2 to calculate P/Q.

> And for raid56 metadata it suggests RMW could happen for metadata also.

As long as we have P/Q, RMW must be used.

The root problem is that we need cross-device FUA to ensure the full stripe is written correctly. Or we may take the extent allocator modification, to ensure we only write into a vertical stripe without used data.

So anyway, RAID5/6 is only designed to handle missing devices, not power loss. IIRC an mdadm RAID5/6 array needs to be scrubbed each time power loss is detected.

Thanks,
Qu

> There's fairly strong anecdotal evidence that people have less problems with Btrfs raid5 when raid5 applies to data block groups, and metadata block groups use some other non-parity based profile like raid1.
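As an aside on the RMW point above: for the P strip, the new parity can be computed either by reading the untouched data strips and recomputing, or incrementally from the old data and the old parity. The toy sketch below (plain XOR only; Q is omitted because it needs Galois-field math; all values are made up) shows the two roads agree, and also why the data write and the parity write are separate device writes that a crash can split.

# Sketch of the read-modify-write (RMW) parity update for the P strip only
# (Q needs Galois-field math and is omitted); toy bytes, not btrfs code.

def xor(*strips: bytes) -> bytes:
    out = bytes(len(strips[0]))
    for s in strips:
        out = bytes(x ^ y for x, y in zip(out, s))
    return out

d1_old, d2_old = b"\x11\x22", b"\x33\x44"
p_old = xor(d1_old, d2_old)

d1_new = b"\x55\x66"                       # only data strip 1 is being rewritten

# Full recompute: read the untouched strip(s) and recompute P from scratch.
p_full = xor(d1_new, d2_old)

# Incremental RMW: read the old data strip and the old parity instead.
p_rmw = xor(p_old, d1_old, d1_new)

assert p_full == p_rmw                     # both roads give the same parity
# Either way, d1_new and the new P land in separate device writes; a crash
# between them leaves P inconsistent with the data, i.e. the write hole.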
Re: Exactly what is wrong with RAID5/6
On Wed, Jun 21, 2017 at 8:12 PM, Qu Wenruowrote: > > Well, in fact, thanks to data csum and btrfs metadata CoW, there is quite a > high chance that we won't cause any data damage. But we have examples where data does not COW, we see a partial stripe overwrite. And if that is interrupted it's clear that both old and new metadata pointing to that stripe is wrong. There are way more problems where we see csum errors on Btrfs raid56 after crashes, and there are no bad devices. > > For the example I gave above, no data damage at all. > > First the data is written and power loss, and data is always written before > metadata, so that's to say, after power loss, superblock is still using the > old tree roots. > > So no one is really using that newly written data. OK but that assumes that the newly written data is always COW which on Btrfs raid56 is not certain, there's a bunch of RMW code which suggests overwrites are possible. And for raid56 metadata it suggests RMW could happen for metadata also. There's fairly strong anecdotal evidence that people have less problems with Btrfs raid5 when raid5 applies to data block groups, and metadata block groups use some other non-parity based profile like raid1. -- Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Exactly what is wrong with RAID5/6
At 06/22/2017 02:24 AM, Chris Murphy wrote: On Wed, Jun 21, 2017 at 2:45 AM, Qu Wenruowrote: Unlike pure stripe method, one fully functional RAID5/6 should be written in full stripe behavior, which is made up by N data stripes and correct P/Q. Given one example to show how write sequence affects the usability of RAID5/6. Existing full stripe: X = Used space (Extent allocated) O = Unused space Data 1 |XX|| Data 2 |OOO| Parity |WW|| When some new extent is allocated to data 1 stripe, if we write data directly into that region, and crashed. The result will be: Data 1 |XX|XX|O| Data 2 |OOO| Parity |WW|| Parity stripe is not updated, although it's fine since data is still correct, this reduces the usability, as in this case, if we lost device containing data 2 stripe, we can't recover correct data of data 2. Although personally I don't think it's a big problem yet. Someone has idea to modify extent allocator to handle it, but anyway I don't consider it's worthy. If there is parity corruption and there is a lost device (or bad sector causing lost data strip), that is in effect two failures and no raid5 recovers, you have to have raid6. However, I don't know whether Btrfs raid6 can even recover from it? If there is a single device failure, with a missing data strip, you have both P Typically raid6 implementations use P first, and only use Q if P is not available. Is Btrfs raid6 the same? And if reconstruction from P fails to match data csum, does Btrfs retry using Q? Probably not is my guess. Well, in fact, thanks to data csum and btrfs metadata CoW, there is quite a high chance that we won't cause any data damage. For the example I gave above, no data damage at all. First the data is written and power loss, and data is always written before metadata, so that's to say, after power loss, superblock is still using the old tree roots. So no one is really using that newly written data. And in that case even device of data stripe 2 is missing, btrfs don't really need to use parity to rebuild it, as btrfs knows there is no extent in that stripe, and data csum matches for data stripe 1. No need to use parity at all. So that's why I think the hole write is not an urgent case to handle right now. Thanks, Qu I think that is a valid problem calling for a solution on Btrfs, given its mandate. It is no worse than other raid6 implementations though which would reconstruct from bad P, and give no warning, leaving it up to application layers to deal with the problem. I have no idea how ZFS RAIDZ2 and RAIDZ3 handle this same scenario. 2. Parity data is not checksummed Why is this a problem? Does it have to do with the design of BTRFS somehow? Parity is after all just data, BTRFS does checksum data so what is the reason this is a problem? Because that's one solution to solve above problem. And no, parity is not data. Parity strip is differentiated from data strip, and by itself parity is meaningless. But parity plus n-1 data strips is an encoded form of the missing data strip, and is therefore an encoded copy of the data. We kinda have to treat the parity as fractionally important compared to data; just like each mirror copy has some fractional value. You don't have to have both of them, but you do have to have at least one of them. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Exactly what is wrong with RAID5/6
At 06/22/2017 01:03 AM, Goffredo Baroncelli wrote: Hi Qu, On 2017-06-21 10:45, Qu Wenruo wrote: At 06/21/2017 06:57 AM, waxhead wrote: I am trying to piece together the actual status of the RAID5/6 bit of BTRFS. The wiki refer to kernel 3.19 which was released in February 2015 so I assume that the information there is a tad outdated (the last update on the wiki page was July 2016) https://btrfs.wiki.kernel.org/index.php/RAID56 Now there are four problems listed 1. Parity may be inconsistent after a crash (the "write hole") Is this still true, if yes - would not this apply for RAID1 / RAID10 as well? How was it solved there , and why can't that be done for RAID5/6 Unlike pure stripe method, one fully functional RAID5/6 should be written in full stripe behavior, which is made up by N data stripes and correct P/Q. Given one example to show how write sequence affects the usability of RAID5/6. Existing full stripe: X = Used space (Extent allocated) O = Unused space Data 1 |XX|| Data 2 |OOO| Parity |WW|| When some new extent is allocated to data 1 stripe, if we write data directly into that region, and crashed. The result will be: Data 1 |XX|XX|O| Data 2 |OOO| Parity |WW|| Parity stripe is not updated, although it's fine since data is still correct, this reduces the usability, as in this case, if we lost device containing data 2 stripe, we can't recover correct data of data 2. Although personally I don't think it's a big problem yet. Someone has idea to modify extent allocator to handle it, but anyway I don't consider it's worthy. 2. Parity data is not checksummed Why is this a problem? Does it have to do with the design of BTRFS somehow? Parity is after all just data, BTRFS does checksum data so what is the reason this is a problem? Because that's one solution to solve above problem. In what it could be a solution for the write hole ? Not my idea, so I don't why this is a solution either. I prefer to lower the priority for such case as we have more work to do. Thanks, Qu If a parity is wrong AND you lost a disk, even having a checksum of the parity, you are not in position to rebuild the missing data. And if you rebuild wrong data, anyway the checksum highlights it. So adding the checksum to the parity should not solve any issue. A possible "mitigation", is to track in a "intent log" all the not "full stripe writes" during a transaction. If a power failure aborts a transaction, in the next mount a scrub process is started to correct the parities only in the stripes tracked before. A solution, is to journal all the not "full stripe writes", as MD does. BR G.Baroncelli -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
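A minimal sketch of the "intent log" mitigation quoted above, under a hypothetical in-memory layout (none of these names correspond to real btrfs structures): a partial-stripe write records its stripe number before touching the stripe, and a scrub after an unclean shutdown only has to recompute parity for the recorded stripes.

# Minimal sketch of the "intent log" idea: remember which stripes saw a
# partial (RMW) write, and after an unclean shutdown scrub only those.
# All names and structures here are hypothetical, not a btrfs interface.

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

stripes = {                      # stripe number -> [data1, data2, parity]
    0: [b"AAAA", b"BBBB", xor(b"AAAA", b"BBBB")],
    1: [b"CCCC", b"DDDD", xor(b"CCCC", b"DDDD")],
}
intent_log = set()               # would live on stable storage in a real design

def partial_write(stripe_no, strip_idx, data):
    intent_log.add(stripe_no)            # 1. log the stripe before touching it
    stripes[stripe_no][strip_idx] = data # 2. write the data strip
    # ... a crash may happen here, before the parity strip is rewritten ...

def scrub_after_unclean_shutdown():
    for stripe_no in intent_log:         # only the logged stripes are scrubbed
        d1, d2, _ = stripes[stripe_no]
        stripes[stripe_no][2] = xor(d1, d2)
    intent_log.clear()                   # equivalent to clearing it on commit

partial_write(0, 1, b"XXXX")             # parity of stripe 0 is now stale
scrub_after_unclean_shutdown()
assert stripes[0][2] == xor(b"AAAA", b"XXXX")

Journaling the partial writes themselves, as MD's write journal does, is the stronger variant: the scrub approach only repairs parity after the fact, so a device lost before the scrub completes still hits the window.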
Re: Exactly what is wrong with RAID5/6
On Wed, Jun 21, 2017 at 2:12 PM, Goffredo Baroncelli wrote:
> Generally speaking, when you write "two failures" this means two failures at the same time. But the write hole happens even if these two failures are not at the same time:
>
> Event #1: power failure between the data stripe write and the parity stripe write. The stripe is incoherent.
> Event #2: a disk is failing: if you try to read the data from the remaining data and the parity, you get wrong data.
>
> The likelihood of these two events happening together (a power failure and, on the next boot, a failing disk) is quite low. But over the life of a filesystem, these two events will likely happen.
>
> However BTRFS has an advantage: a simple scrub may (crossing fingers) recover from event #1.

Event #3: the stripe is read, missing a data strip due to event #2, and is wrongly reconstructed due to event #1. Btrfs computes crc32c on the reconstructed data and compares it to the extent csum, which then fails, and EIO happens.

Btrfs is susceptible to the write hole happening on disk. But it's still detected, and corrupt data isn't propagated upward.

-- Chris Murphy
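A toy model of event #3, using Python's zlib.crc32 as a stand-in for the crc32c checksums btrfs actually stores (an assumption made purely for the sketch); only the control flow matters: the reconstruction from stale parity fails the stored extent csum, so the caller gets EIO rather than silently wrong data.

# Toy model of "event #3": a degraded read whose parity-based reconstruction
# is checked against the stored extent checksum.  zlib.crc32 stands in for the
# crc32c btrfs really uses; the point is only the control flow.
import zlib

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

d1 = b"AAAA"
d2_old, d2_new = b"OLD!", b"BBBB"
stored_csum_d1 = zlib.crc32(d1)           # csum recorded at write time

parity = xor(d1, d2_old)                  # event #1: never updated for d2_new
# event #2: the device holding d1 is lost; rebuild it from d2_new + stale parity
d1_rebuilt = xor(d2_new, parity)

if zlib.crc32(d1_rebuilt) != stored_csum_d1:
    print("EIO: reconstructed data fails the extent csum; not returned to caller")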
Re: Exactly what is wrong with RAID5/6
On 2017-06-21 20:24, Chris Murphy wrote: > On Wed, Jun 21, 2017 at 2:45 AM, Qu Wenruowrote: > >> Unlike pure stripe method, one fully functional RAID5/6 should be written in >> full stripe behavior, which is made up by N data stripes and correct P/Q. >> >> Given one example to show how write sequence affects the usability of >> RAID5/6. >> >> Existing full stripe: >> X = Used space (Extent allocated) >> O = Unused space >> Data 1 |XX|| >> Data 2 |OOO| >> Parity |WW|| >> >> When some new extent is allocated to data 1 stripe, if we write >> data directly into that region, and crashed. >> The result will be: >> >> Data 1 |XX|XX|O| >> Data 2 |OOO| >> Parity |WW|| >> >> Parity stripe is not updated, although it's fine since data is still >> correct, this reduces the usability, as in this case, if we lost device >> containing data 2 stripe, we can't recover correct data of data 2. >> >> Although personally I don't think it's a big problem yet. >> >> Someone has idea to modify extent allocator to handle it, but anyway I don't >> consider it's worthy. > > > If there is parity corruption and there is a lost device (or bad > sector causing lost data strip), that is in effect two failures and no > raid5 recovers, you have to have raid6. Generally speaking, when you write "two failure" this means two failure at the same time. But the write hole happens even if these two failures are not at the same time: Event #1: power failure between the data stripe write and the parity stripe write. The stripe is incoherent. Event #2: a disk is failing: if you try to read the data from the remaining data and the parity you have wrong data. The likelihood of these two event at the same time (power failure and in the next boot a disk is failing) is quite low. But in the life of a filesystem, these two event likely happens. However BTRFS has an advantage: a simple scrub may (crossing finger) recover from event #1. > However, I don't know whether > Btrfs raid6 can even recover from it? If there is a single device > failure, with a missing data strip, you have both P Typically raid6 > implementations use P first, and only use Q if P is not available. Is > Btrfs raid6 the same? And if reconstruction from P fails to match data > csum, does Btrfs retry using Q? Probably not is my guess. It could, and in any case it is only an "implementation detail" :-) > > I think that is a valid problem calling for a solution on Btrfs, given > its mandate. It is no worse than other raid6 implementations though > which would reconstruct from bad P, and give no warning, leaving it up > to application layers to deal with the problem. > > I have no idea how ZFS RAIDZ2 and RAIDZ3 handle this same scenario. If I understood correctly, ZFS has a variable stripe size. In BTRFS could be easily implemented: it would be sufficient to have different block group with different number of disk. If a filesystem is composed by 5 disks, it will contain: 1 BG RAID1 for writing up-to 64k 1 BG RAID5 (3 disks) for writing up-to 128k 1 BG RAID5 (4 disks) for writing up-to 192k 1 BG RAID5 (5 disks) for all other disks Time to time the filesystem would need a re-balance in order to empty the smaller block group. Another option could be to track the stripes involved by a RWM cycle (i.e. all the writings smaller than a stripe size, which in a COW filesystem, are suppose to be few) in an "intent log", and scrubbing all these stripes if a power failure happens . > > > >> >>> >>> 2. Parity data is not checksummed >>> Why is this a problem? 
Does it have to do with the design of BTRFS >>> somehow? >>> Parity is after all just data, BTRFS does checksum data so what is the >>> reason this is a problem? >> >> >> Because that's one solution to solve above problem. >> >> And no, parity is not data. > > Parity strip is differentiated from data strip, and by itself parity > is meaningless. But parity plus n-1 data strips is an encoded form of > the missing data strip, and is therefore an encoded copy of the data. > We kinda have to treat the parity as fractionally important compared > to data; just like each mirror copy has some fractional value. You > don't have to have both of them, but you do have to have at least one > of them. > > -- gpg @keyserver.linux.it: Goffredo Baroncelli Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
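A rough sketch of the block-group layout Goffredo proposes above for a 5-disk filesystem, assuming a 64 KiB strip size (the thresholds and profile names simply mirror his list; nothing like this exists in btrfs today): each write is directed to the narrowest profile whose full stripe it can fill, so small writes never trigger a partial-stripe RMW.

# Rough sketch of the proposed layout: direct each write to the narrowest
# profile whose full stripe it can fill.  Purely illustrative, not btrfs code.

STRIP = 64 * 1024

# (profile name, data strips per full stripe), tried in order.
# RAID1 is not striped; "1" here just models its <= 64 KiB threshold.
BLOCK_GROUPS = [
    ("raid1",          1),   # up to 64 KiB
    ("raid5, 3 disks", 2),   # up to 128 KiB
    ("raid5, 4 disks", 3),   # up to 192 KiB
    ("raid5, 5 disks", 4),   # everything larger (full-width stripes)
]

def pick_block_group(write_bytes: int) -> str:
    for name, data_strips in BLOCK_GROUPS:
        if write_bytes <= data_strips * STRIP:
            return name
    return BLOCK_GROUPS[-1][0]           # large writes use full-width stripes

for size in (50 * 1024, 100 * 1024, 180 * 1024, 1024 * 1024):
    print(size // 1024, "KiB ->", pick_block_group(size))

As noted above, the narrow block groups would slowly fill with small writes, so a periodic balance would be needed to migrate that data back into full-width stripes.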
Re: Exactly what is wrong with RAID5/6
On Wed, Jun 21, 2017 at 12:51 AM, Marat Khaliliwrote: > On 21/06/17 06:48, Chris Murphy wrote: >> >> Another possibility is to ensure a new write is written to a new*not* >> full stripe, i.e. dynamic stripe size. So if the modification is a 50K >> file on a 4 disk raid5; instead of writing 3 64K data strips + 1 64K >> parity strip (a full stripe write); write out 1 64K data strip + 1 64K >> parity strip. In effect, a 4 disk raid5 would quickly get not just 3 >> data + 1 parity strip Btrfs block groups; but 1 data + 1 parity, and 2 >> data + 1 parity chunks, and direct those write to the proper chunk >> based on size. Anyway that's beyond my ability to assess how much >> allocator work that is. Balance I'd expect to rewrite everything to >> max data strips possible; the optimization would only apply to normal >> operation COW.. > This will make some filesystems mostly RAID1, negating all space savings of > RAID5, won't it? No. It'd only apply to partial stripe writes, typically small files. But small file, metadata centric workloads suck for raid5 anyway, and should use raid1. So making the implementation more like raid1 than raid5 for the RMW case I think is still better than Btrfs raid56 RMW writes in effect being no-COW. > Isn't it easier to recalculate parity block based using previous state of > two rewritten strips, parity and data? I don't understand all performance > implications, but it might scale better with number of devices. The problem is atomicity. Either the data strip or parity strip is overwritten first, and before the other is committed, the file system is not merely inconsistent, it's basically lying, there's no way to know for sure after the fact whether the data or parity were properly written. And even the metadata is inconsistent too because it can only describe the unmodified state and the successfully modified state, whereas a 3rd state "partially modified" is possible and no way to really fix it. -- Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Exactly what is wrong with RAID5/6
On Wed, Jun 21, 2017 at 2:45 AM, Qu Wenruo wrote:
> Unlike pure stripe method, one fully functional RAID5/6 should be written in full stripe behavior, which is made up by N data stripes and correct P/Q.
>
> Given one example to show how write sequence affects the usability of RAID5/6.
>
> Existing full stripe:
> X = Used space (Extent allocated)
> O = Unused space
> Data 1 |XX||
> Data 2 |OOO|
> Parity |WW||
>
> When some new extent is allocated to data 1 stripe, if we write data directly into that region, and crashed. The result will be:
>
> Data 1 |XX|XX|O|
> Data 2 |OOO|
> Parity |WW||
>
> Parity stripe is not updated, although it's fine since data is still correct, this reduces the usability, as in this case, if we lost device containing data 2 stripe, we can't recover correct data of data 2.
>
> Although personally I don't think it's a big problem yet.
>
> Someone has idea to modify extent allocator to handle it, but anyway I don't consider it's worthy.

If there is parity corruption and there is a lost device (or a bad sector causing a lost data strip), that is in effect two failures, and no raid5 recovers from that; you have to have raid6. However, I don't know whether Btrfs raid6 can even recover from it? If there is a single device failure, with a missing data strip, you have both P and Q. Typically raid6 implementations use P first, and only use Q if P is not available. Is Btrfs raid6 the same? And if reconstruction from P fails to match the data csum, does Btrfs retry using Q? Probably not is my guess.

I think that is a valid problem calling for a solution on Btrfs, given its mandate. It is no worse than other raid6 implementations though, which would reconstruct from bad P and give no warning, leaving it up to application layers to deal with the problem.

I have no idea how ZFS RAIDZ2 and RAIDZ3 handle this same scenario.

>> 2. Parity data is not checksummed
>> Why is this a problem? Does it have to do with the design of BTRFS somehow?
>> Parity is after all just data, BTRFS does checksum data so what is the reason this is a problem?
>
> Because that's one solution to solve above problem.
>
> And no, parity is not data.

Parity strip is differentiated from data strip, and by itself parity is meaningless. But parity plus n-1 data strips is an encoded form of the missing data strip, and is therefore an encoded copy of the data. We kinda have to treat the parity as fractionally important compared to data; just like each mirror copy has some fractional value. You don't have to have both of them, but you do have to have at least one of them.

-- Chris Murphy
Re: Exactly what is wrong with RAID5/6
On 2017-06-21 13:20, Andrei Borzenkov wrote: 21.06.2017 16:41, Austin S. Hemmelgarn пишет: On 2017-06-21 08:43, Christoph Anton Mitterer wrote: On Wed, 2017-06-21 at 16:45 +0800, Qu Wenruo wrote: Btrfs is always using device ID to build up its device mapping. And for any multi-device implementation (LVM,mdadam) it's never a good idea to use device path. Isn't it rather the other way round? Using the ID is bad? Don't you remember our discussion about using leaked UUIDs (or accidental collisions) for all kinds of attacks? Both are bad for different reasons. For the particular case of sanely handling transient storage failures (device disappears then reappears), you can't do it with a path in /dev (which is what most people mean when they say device path), and depending on how the hardware failed and the specifics of the firmware, you may not be able to do it with a hardware-level device path, but you can do it with a device ID assuming you sanely verify the ID. Right now, BTRFS is not sanely checking the ID (it only verifies the UUID's in the FS itself, it should also be checking hardware-level identifiers like WWN). Which is not enough too; if device dropped off array and reappeared later we need to be able to declare it stale, even if it has exactly the same UUID and WWN and whatever hardware identifier is used. So we need some generation number to be able to do it. Incidentally MD does have them and compares generation numbers to decide whether device can be assimilated. I was not disputing that aspect, just the method verifying the device that reappeared is the same one that disappeared. Outside of the requirement to properly re-sync (we would also need to do some kind of sanity check on the generation number too, otherwise we end up with the possibility of a partial write there nuking the whole FS when the device reconnects), verifying some level of hardware identification covers the security and data safety issues that Christoph is referring to sufficiently for the common cases (with the biggest being USB attached devices with BTRFS volumes on them). -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
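A sketch of the acceptance check being discussed here, with hypothetical field names (this is not an existing btrfs structure): a reappearing device has to match the recorded filesystem UUID and devid, match a hardware-level identifier such as the WWN, and not be behind on generation; otherwise it is rejected or flagged for resync before it may rejoin.

# Sketch of the re-acceptance check discussed above.  Field names are
# hypothetical; this is not an existing btrfs structure or API.
from dataclasses import dataclass

@dataclass
class KnownDevice:
    fsid: str          # filesystem UUID the device claims to belong to
    devid: int         # btrfs device ID inside that filesystem
    wwn: str           # hardware-level identifier (e.g. SCSI/NVMe WWN)
    generation: int    # last transaction generation seen on this device

def may_rejoin(recorded: KnownDevice, candidate: KnownDevice) -> str:
    if (candidate.fsid, candidate.devid) != (recorded.fsid, recorded.devid):
        return "reject: different filesystem or device ID"
    if candidate.wwn != recorded.wwn:
        return "reject: same UUIDs but different hardware (possible spoof/clone)"
    if candidate.generation < recorded.generation:
        return "stale: device missed transactions, needs resync before use"
    return "accept"

good = KnownDevice("fsid-1", 3, "wwn-0x5000c500aaaa", 1042)
print(may_rejoin(good, KnownDevice("fsid-1", 3, "wwn-0x5000c500aaaa", 1040)))
print(may_rejoin(good, KnownDevice("fsid-1", 3, "wwn-0x5000c500bbbb", 1042)))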
Re: Exactly what is wrong with RAID5/6
21.06.2017 16:41, Austin S. Hemmelgarn пишет: > On 2017-06-21 08:43, Christoph Anton Mitterer wrote: >> On Wed, 2017-06-21 at 16:45 +0800, Qu Wenruo wrote: >>> Btrfs is always using device ID to build up its device mapping. >>> And for any multi-device implementation (LVM,mdadam) it's never a >>> good >>> idea to use device path. >> >> Isn't it rather the other way round? Using the ID is bad? Don't you >> remember our discussion about using leaked UUIDs (or accidental >> collisions) for all kinds of attacks? > Both are bad for different reasons. For the particular case of sanely > handling transient storage failures (device disappears then reappears), > you can't do it with a path in /dev (which is what most people mean when > they say device path), and depending on how the hardware failed and the > specifics of the firmware, you may not be able to do it with a > hardware-level device path, but you can do it with a device ID assuming > you sanely verify the ID. Right now, BTRFS is not sanely checking the > ID (it only verifies the UUID's in the FS itself, it should also be > checking hardware-level identifiers like WWN). Which is not enough too; if device dropped off array and reappeared later we need to be able to declare it stale, even if it has exactly the same UUID and WWN and whatever hardware identifier is used. So we need some generation number to be able to do it. Incidentally MD does have them and compares generation numbers to decide whether device can be assimilated. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Exactly what is wrong with RAID5/6
21.06.2017 09:51, Marat Khalili пишет: > On 21/06/17 06:48, Chris Murphy wrote: >> Another possibility is to ensure a new write is written to a new*not* >> full stripe, i.e. dynamic stripe size. So if the modification is a 50K >> file on a 4 disk raid5; instead of writing 3 64K data strips + 1 64K >> parity strip (a full stripe write); write out 1 64K data strip + 1 64K >> parity strip. In effect, a 4 disk raid5 would quickly get not just 3 >> data + 1 parity strip Btrfs block groups; but 1 data + 1 parity, and 2 >> data + 1 parity chunks, and direct those write to the proper chunk >> based on size. Anyway that's beyond my ability to assess how much >> allocator work that is. Balance I'd expect to rewrite everything to >> max data strips possible; the optimization would only apply to normal >> operation COW. > This will make some filesystems mostly RAID1, negating all space savings > of RAID5, won't it? > > Isn't it easier to recalculate parity block based using previous state > of two rewritten strips, parity and data? I don't understand all > performance implications, but it might scale better with number of devices. > That's what it effectively does today; the problem is, RAID[56] layer is below btrfs allocator so same stripe may be shared by different transactions. This defeats the very idea of redirect on write where data on disk is assumed to never be changed by subsequent modifications. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Exactly what is wrong with RAID5/6
Hi Qu, On 2017-06-21 10:45, Qu Wenruo wrote: > At 06/21/2017 06:57 AM, waxhead wrote: >> I am trying to piece together the actual status of the RAID5/6 bit of BTRFS. >> The wiki refer to kernel 3.19 which was released in February 2015 so I assume >> that the information there is a tad outdated (the last update on the wiki >> page was July 2016) >> https://btrfs.wiki.kernel.org/index.php/RAID56 >> >> Now there are four problems listed >> >> 1. Parity may be inconsistent after a crash (the "write hole") >> Is this still true, if yes - would not this apply for RAID1 / >> RAID10 as well? How was it solved there , and why can't that be done for >> RAID5/6 > > Unlike pure stripe method, one fully functional RAID5/6 should be written in > full stripe behavior, > which is made up by N data stripes and correct P/Q. > > Given one example to show how write sequence affects the usability of RAID5/6. > > Existing full stripe: > X = Used space (Extent allocated) > O = Unused space > Data 1 |XX|| > Data 2 |OOO| > Parity |WW|| > > When some new extent is allocated to data 1 stripe, if we write > data directly into that region, and crashed. > The result will be: > > Data 1 |XX|XX|O| > Data 2 |OOO| > Parity |WW|| > > Parity stripe is not updated, although it's fine since data is still correct, > this reduces the > usability, as in this case, if we lost device containing data 2 stripe, we > can't > recover correct data of data 2. > > Although personally I don't think it's a big problem yet. > > Someone has idea to modify extent allocator to handle it, but anyway I don't > consider it's worthy. > >> >> 2. Parity data is not checksummed >> Why is this a problem? Does it have to do with the design of BTRFS somehow? >> Parity is after all just data, BTRFS does checksum data so what is the >> reason this is a problem? > > Because that's one solution to solve above problem. In what it could be a solution for the write hole ? If a parity is wrong AND you lost a disk, even having a checksum of the parity, you are not in position to rebuild the missing data. And if you rebuild wrong data, anyway the checksum highlights it. So adding the checksum to the parity should not solve any issue. A possible "mitigation", is to track in a "intent log" all the not "full stripe writes" during a transaction. If a power failure aborts a transaction, in the next mount a scrub process is started to correct the parities only in the stripes tracked before. A solution, is to journal all the not "full stripe writes", as MD does. BR G.Baroncelli -- gpg @keyserver.linux.it: Goffredo Baroncelli Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Exactly what is wrong with RAID5/6
On 2017-06-21 08:43, Christoph Anton Mitterer wrote: On Wed, 2017-06-21 at 16:45 +0800, Qu Wenruo wrote: Btrfs is always using device ID to build up its device mapping. And for any multi-device implementation (LVM,mdadam) it's never a good idea to use device path. Isn't it rather the other way round? Using the ID is bad? Don't you remember our discussion about using leaked UUIDs (or accidental collisions) for all kinds of attacks? Both are bad for different reasons. For the particular case of sanely handling transient storage failures (device disappears then reappears), you can't do it with a path in /dev (which is what most people mean when they say device path), and depending on how the hardware failed and the specifics of the firmware, you may not be able to do it with a hardware-level device path, but you can do it with a device ID assuming you sanely verify the ID. Right now, BTRFS is not sanely checking the ID (it only verifies the UUID's in the FS itself, it should also be checking hardware-level identifiers like WWN). -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Exactly what is wrong with RAID5/6
On Wed, 2017-06-21 at 16:45 +0800, Qu Wenruo wrote:
> Btrfs is always using device ID to build up its device mapping.
> And for any multi-device implementation (LVM, mdadm) it's never a good idea to use device path.

Isn't it rather the other way round? Using the ID is bad? Don't you remember our discussion about using leaked UUIDs (or accidental collisions) for all kinds of attacks?

Cheers, Chris.
Re: Exactly what is wrong with RAID5/6
At 06/21/2017 06:57 AM, waxhead wrote: I am trying to piece together the actual status of the RAID5/6 bit of BTRFS. The wiki refer to kernel 3.19 which was released in February 2015 so I assume that the information there is a tad outdated (the last update on the wiki page was July 2016) https://btrfs.wiki.kernel.org/index.php/RAID56 Now there are four problems listed 1. Parity may be inconsistent after a crash (the "write hole") Is this still true, if yes - would not this apply for RAID1 / RAID10 as well? How was it solved there , and why can't that be done for RAID5/6 Unlike pure stripe method, one fully functional RAID5/6 should be written in full stripe behavior, which is made up by N data stripes and correct P/Q. Given one example to show how write sequence affects the usability of RAID5/6. Existing full stripe: X = Used space (Extent allocated) O = Unused space Data 1 |XX|| Data 2 |OOO| Parity |WW|| When some new extent is allocated to data 1 stripe, if we write data directly into that region, and crashed. The result will be: Data 1 |XX|XX|O| Data 2 |OOO| Parity |WW|| Parity stripe is not updated, although it's fine since data is still correct, this reduces the usability, as in this case, if we lost device containing data 2 stripe, we can't recover correct data of data 2. Although personally I don't think it's a big problem yet. Someone has idea to modify extent allocator to handle it, but anyway I don't consider it's worthy. 2. Parity data is not checksummed Why is this a problem? Does it have to do with the design of BTRFS somehow? Parity is after all just data, BTRFS does checksum data so what is the reason this is a problem? Because that's one solution to solve above problem. And no, parity is not data. Parity/mirror/stripe is done in btrfs chunk level, and represents a nice, easy to understand linear logical space for higher level. For example: If in the btrfs logical space, 0~1G is mapped to a RAID5 chunk with 3 devices, higher level only needs to tell btrfs chunk layer how many bytes it wants to read and where the read starts. If one devices is missing, then try to rebuild the data using parity so that higher layer don't need to care what the profile is or if there is parity or not. So parity can't be addressed from btrfs logical space, that's to say no possible position to record csum for it in current btrfs design. 3. No support for discard? (possibly -- needs confirmation with cmason) Does this matter that much really?, is there an update on this? Not familiar with this though. 4. The algorithm uses as many devices as are available: No support for a fixed-width stripe. What is the plan for this one? There was patches on the mailing list by the SnapRAID author to support up to 6 parity devices. Will the (re?) resign of btrfs raid5/6 support a scheme that allows for multiple parity devices? Considering current maintainers seems to be focusing on bug fixes, not new features, I'm not confident with such new feature. I do have a few other questions as well... 5. BTRFS does still (kernel 4.9) not seem to use the device ID to communicate with devices. Btrfs is always using device ID to build up its device mapping. And for any multi-device implementation (LVM,mdadam) it's never a good idea to use device path. If you on a multi device filesystem yank out a device, for example /dev/sdg and it reappear as /dev/sdx for example btrfs will still happily try to write to /dev/sdg even if btrfs fi sh /mnt shows the correct device ID. 
What is the status for getting BTRFS to properly understand that a device is missing?

It's btrfs that doesn't support a runtime switch from a missing device to a re-appeared device. Most missing-device detection is done at btrfs device scan time.

Runtime detection is not that perfect, but Anand Jain is introducing some nice infrastructure as a basis to enhance it.

6. RAID1 needs to be able to make two copies always. E.g. if you have three disks you can lose one and it should still work. What about RAID10? If you have for example a 6-disk RAID10 array, lose one disk and reboot (due to #5 above), will RAID10 recognize that the array is now a 5-disk array and stripe+mirror over 2 disks (or possibly 2.5 disks?) instead of 3? In other words, will it work as long as it can create a RAID10 profile that requires a minimum of four disks?

At least, after a reboot, btrfs still knows it's a fs on 6 disks even though it has lost one, and it will still create new chunks using all 6 disks.

Thanks,
Qu
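To illustrate Qu's earlier point that parity has no logical address: the toy mapping below models a single 3-device RAID5 chunk with 64 KiB strips and a simplified parity rotation (not necessarily btrfs's exact layout). Every byte of the chunk's linear logical space resolves to a data strip on some device; no logical offset ever lands on a parity strip, which is why there is nowhere in the csum tree to hang a parity checksum.

# Toy logical-to-physical mapping for one 3-device RAID5 chunk with 64 KiB
# strips.  The parity rotation is simplified and not btrfs's exact layout.

STRIP = 64 * 1024
NDEVS = 3
DATA_PER_STRIPE = NDEVS - 1            # 2 data strips + 1 parity per full stripe

def logical_to_physical(logical: int):
    stripe_no, off = divmod(logical, DATA_PER_STRIPE * STRIP)
    data_idx, strip_off = divmod(off, STRIP)
    parity_dev = (NDEVS - 1 - stripe_no) % NDEVS      # parity rotates per stripe
    data_devs = [d for d in range(NDEVS) if d != parity_dev]
    return data_devs[data_idx], stripe_no * STRIP + strip_off

for logical in (0, 64 * 1024, 128 * 1024, 200 * 1024):
    dev, phys = logical_to_physical(logical)
    print(f"logical {logical:>7} -> device {dev}, physical offset {phys}")
# Note that no logical offset ever resolves to the parity device of a stripe,
# so parity cannot be checksummed the way data extents are.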
Re: Exactly what is wrong with RAID5/6
> [ ... ] This will make some filesystems mostly RAID1, negating > all space savings of RAID5, won't it? [ ... ] RAID5/RAID6/... don't merely save space, more precisely they trade lower resilience and a more anisotropic and smaller performance envelope to gain lower redundancy (= save space).
Re: Exactly what is wrong with RAID5/6
On 21/06/17 06:48, Chris Murphy wrote: Another possibility is to ensure a new write is written to a new*not* full stripe, i.e. dynamic stripe size. So if the modification is a 50K file on a 4 disk raid5; instead of writing 3 64K data strips + 1 64K parity strip (a full stripe write); write out 1 64K data strip + 1 64K parity strip. In effect, a 4 disk raid5 would quickly get not just 3 data + 1 parity strip Btrfs block groups; but 1 data + 1 parity, and 2 data + 1 parity chunks, and direct those write to the proper chunk based on size. Anyway that's beyond my ability to assess how much allocator work that is. Balance I'd expect to rewrite everything to max data strips possible; the optimization would only apply to normal operation COW. This will make some filesystems mostly RAID1, negating all space savings of RAID5, won't it? Isn't it easier to recalculate parity block based using previous state of two rewritten strips, parity and data? I don't understand all performance implications, but it might scale better with number of devices. -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Exactly what is wrong with RAID5/6
On Tue, Jun 20, 2017 at 5:25 PM, Hugo Millswrote: > On Wed, Jun 21, 2017 at 12:57:19AM +0200, waxhead wrote: >> I am trying to piece together the actual status of the RAID5/6 bit of BTRFS. >> The wiki refer to kernel 3.19 which was released in February 2015 so >> I assume that the information there is a tad outdated (the last >> update on the wiki page was July 2016) >> https://btrfs.wiki.kernel.org/index.php/RAID56 >> >> Now there are four problems listed >> >> 1. Parity may be inconsistent after a crash (the "write hole") >> Is this still true, if yes - would not this apply for RAID1 / RAID10 >> as well? How was it solved there , and why can't that be done for >> RAID5/6 > >Yes, it's still true, and it's specific to parity RAID, not the > other RAID levels. The issue is (I think) that if you write one block, > that block is replaced, but then the other blocks in the stripe need > to be read for the parity block to be recalculated, before the new > parity can be written. There's a read-modify-write cycle involved > which isn't inherent for the non-parity RAID levels (which would just > overwrite both copies). Yeah, there's a lwn article from Neil Brown about how the likelihood of hitting the write hole is almost impossible. But nevertheless the md devs implemented a journal to close the write hole. Also on Btrfs, while the write hole can manifest on disk, it does get detected on a subsequent read. That is, a bad reconstruction of data from parity, will not match data csum and you'll get EIO and an path to the bad file. What is really not good though I think is metadata raid56. If that gets hose, the whole fs is going to face plant. And we've seen some evidence of this. So I really think the wiki could make it more clear to just not use raid56 for metadata. > >One of the proposed solutions for dealing with the write hole in > btrfs's parity RAID is to ensure that any new writes are written to a > completely new stripe. The problem is that this introduces a whole new > level of fragmentation if the FS has lots of small writes (because > your write unit is limited to a complete stripe, even for a single > byte update). Another possibility is to ensure a new write is written to a new *not* full stripe, i.e. dynamic stripe size. So if the modification is a 50K file on a 4 disk raid5; instead of writing 3 64K data strips + 1 64K parity strip (a full stripe write); write out 1 64K data strip + 1 64K parity strip. In effect, a 4 disk raid5 would quickly get not just 3 data + 1 parity strip Btrfs block groups; but 1 data + 1 parity, and 2 data + 1 parity chunks, and direct those write to the proper chunk based on size. Anyway that's beyond my ability to assess how much allocator work that is. Balance I'd expect to rewrite everything to max data strips possible; the optimization would only apply to normal operation COW. Also, ZFS has a functional equivalent, a variable stripe size for raid, so it's always doing COW writes for raid56, no RMW. > >There are probably others here who can explain this better. :) > >> 2. Parity data is not checksummed >> Why is this a problem? Does it have to do with the design of BTRFS somehow? >> Parity is after all just data, BTRFS does checksum data so what is >> the reason this is a problem? > >It increases the number of unrecoverable (or not-guaranteed- > recoverable) cases. btrfs's csums are based on individual blocks on > individual devices -- each item of data is independently checksummed > (even if it's a copy of something else). 
On parity RAID > configurations, if you have a device failure, you've lost a piece of > the parity-protected data. To repair it, you have to recover from n-1 > data blocks (which are checksummed), and one parity block (which > isn't). This means that if the parity block happens to have an error > on it, you can't recover cleanly from the device loss, *and you can't > know that an error has happened*. Uhh no I've done quite a number of tests and absolutely if the parity is corrupt and therefore you get a bad reconstruction, you definitely get a csum mismatch and EIO. Corrupt data does not propagate upward. The csums are in the csum tree which is part of metadata block groups. If those are raid56 and there's a loss of data, now you're at pretty high risk because you can get a bad reconstruction, which btrfs will recognize but unable to recover from, and should go read only. We've seen that on the list. -- Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Exactly what is wrong with RAID5/6
On Wed, Jun 21, 2017 at 12:57:19AM +0200, waxhead wrote: > I am trying to piece together the actual status of the RAID5/6 bit of BTRFS. > The wiki refer to kernel 3.19 which was released in February 2015 so > I assume that the information there is a tad outdated (the last > update on the wiki page was July 2016) > https://btrfs.wiki.kernel.org/index.php/RAID56 > > Now there are four problems listed > > 1. Parity may be inconsistent after a crash (the "write hole") > Is this still true, if yes - would not this apply for RAID1 / RAID10 > as well? How was it solved there , and why can't that be done for > RAID5/6 Yes, it's still true, and it's specific to parity RAID, not the other RAID levels. The issue is (I think) that if you write one block, that block is replaced, but then the other blocks in the stripe need to be read for the parity block to be recalculated, before the new parity can be written. There's a read-modify-write cycle involved which isn't inherent for the non-parity RAID levels (which would just overwrite both copies). One of the proposed solutions for dealing with the write hole in btrfs's parity RAID is to ensure that any new writes are written to a completely new stripe. The problem is that this introduces a whole new level of fragmentation if the FS has lots of small writes (because your write unit is limited to a complete stripe, even for a single byte update). There are probably others here who can explain this better. :) > 2. Parity data is not checksummed > Why is this a problem? Does it have to do with the design of BTRFS somehow? > Parity is after all just data, BTRFS does checksum data so what is > the reason this is a problem? It increases the number of unrecoverable (or not-guaranteed- recoverable) cases. btrfs's csums are based on individual blocks on individual devices -- each item of data is independently checksummed (even if it's a copy of something else). On parity RAID configurations, if you have a device failure, you've lost a piece of the parity-protected data. To repair it, you have to recover from n-1 data blocks (which are checksummed), and one parity block (which isn't). This means that if the parity block happens to have an error on it, you can't recover cleanly from the device loss, *and you can't know that an error has happened*. > 3. No support for discard? (possibly -- needs confirmation with cmason) > Does this matter that much really?, is there an update on this? > > 4. The algorithm uses as many devices as are available: No support > for a fixed-width stripe. > What is the plan for this one? There was patches on the mailing list > by the SnapRAID author to support up to 6 parity devices. Will the > (re?) resign of btrfs raid5/6 support a scheme that allows for > multiple parity devices? That's a problem because it limits the practical number of devices you can use. When the stripe size gets too large, you're having to read/modify/(re)write every device on an update, even for very small updates -- as this ratio of update-size to read-size goes up, the FS has increasingly bad performance. Your personal limits of what's acceptable will vary, but I'd be surprised to find anyone with, say, 40 parity RAID devices who finds their performance acceptable. Limit the stripe width, and you can limit the performance degradation from lots of devices. Even with a limited stripe width, however, you're still looking at decreasing reliability as the number of devices increases... 
It shouldn't be *massively* hard to implement, but there's a load of opportunities around managing RAID options in general that would probably need to be addressed at the same time (e.g. per-subvol RAID settings, more general RAID parameterisation). It's going to need some fairly major properties handling, plus rewriting the chunk allocator and pushing the allocator decisions quite a way up from where they're currently made. > I do have a few other questions as well... > > 5. BTRFS does still (kernel 4.9) not seem to use the device ID to > communicate with devices. > > If you on a multi device filesystem yank out a device, for example > /dev/sdg and it reappear as /dev/sdx for example btrfs will still > happily try to write to /dev/sdg even if btrfs fi sh /mnt shows the > correct device ID. What is the status for getting BTRFS to properly > understand that a device is missing? I don't know about this one. > 6. RAID1 needs to be able to make two copies always. E.g. if you > have three disks you can loose one and it should still work. What > about RAID10 ? If you have for example 6 disk RAID10 array, loose > one disk and reboots (due to #5 above). Will RAID10 recognize that > the array now is a 5 disk array and stripe+mirror over 2 disks (or > possibly 2.5 disks?) instead of 3? In other words, will it work as > long as it can create a RAID10 profile that requires a minimum of > four disks? Yes. RAID-10 will work on any number of devices (>=4), not just an
Exactly what is wrong with RAID5/6
I am trying to piece together the actual status of the RAID5/6 bit of BTRFS. The wiki refers to kernel 3.19, which was released in February 2015, so I assume that the information there is a tad outdated (the last update on the wiki page was July 2016) https://btrfs.wiki.kernel.org/index.php/RAID56

Now there are four problems listed

1. Parity may be inconsistent after a crash (the "write hole")
Is this still true? If yes, would not this apply to RAID1 / RAID10 as well? How was it solved there, and why can't that be done for RAID5/6?

2. Parity data is not checksummed
Why is this a problem? Does it have to do with the design of BTRFS somehow? Parity is after all just data; BTRFS does checksum data, so what is the reason this is a problem?

3. No support for discard? (possibly -- needs confirmation with cmason)
Does this really matter that much? Is there an update on this?

4. The algorithm uses as many devices as are available: no support for a fixed-width stripe.
What is the plan for this one? There were patches on the mailing list by the SnapRAID author to support up to 6 parity devices. Will the (re?)design of btrfs raid5/6 support a scheme that allows for multiple parity devices?

I do have a few other questions as well...

5. BTRFS still does not (as of kernel 4.9) seem to use the device ID to communicate with devices.

If, on a multi-device filesystem, you yank out a device, for example /dev/sdg, and it reappears as, for example, /dev/sdx, btrfs will still happily try to write to /dev/sdg even if btrfs fi sh /mnt shows the correct device ID. What is the status for getting BTRFS to properly understand that a device is missing?

6. RAID1 needs to be able to make two copies always. E.g. if you have three disks you can lose one and it should still work. What about RAID10? If you have for example a 6-disk RAID10 array, lose one disk and reboot (due to #5 above), will RAID10 recognize that the array is now a 5-disk array and stripe+mirror over 2 disks (or possibly 2.5 disks?) instead of 3? In other words, will it work as long as it can create a RAID10 profile that requires a minimum of four disks?