Re: Scrub on btrfs single device only to detect errors, not correct them?
Jon Panozzo posted on Mon, 07 Dec 2015 08:43:14 -0600 as excerpted:

[On single-device dup data]

> Thanks for the additional feedback. Two follow-up questions to this
> are:
>
> Can the --mixed option only be applied when first creating the fs, or
> can you simply add this to the balance command to take an existing
> filesystem and add this to it?

Mixed-bg mode has to be set at btrfs creation. It changes the way btrfs
handles chunks, and doing that _live_, with a non-zero time during which
both modes are active, would be... complex, and an invitation to all
sorts of race bugs, to put it mildly.

> So it sounds like there are really three ways to enable scrub to
> repair errors on a btrfs single device (please confirm):

Yes.

> 1) mkfs.btrfs with the --mixed option

This would be my current preference for filesystem sizes of a quarter to
perhaps a half terabyte on spinning rust, and some people are known to
use mixed for exactly this reason, tho it's not particularly well tested
at the terabyte-scale filesystem level, where as a result you might
uncover some unusual bugs.

> 2) create two partitions on a single phys device, then present them as
> logical devices (maybe a loopback or something) and create a btrfs
> raid1 for both data/metadata

No special loopback, etc, required. Btrfs deploys just fine on pretty
much any block device as presented by the kernel, including both
partitions and LVM volumes, the two ways single physical devices are
likely to be presented as multiple logical devices.

In fact I use btrfs on partitions here, tho in my case it's two devices
partitioned up identically, with raid1 across the parallel partitions on
each device, instead of using multiple partitions on the same physical
device, which is what we're talking about here.
This option will be rather inefficient on spinning rust, as the write
head will have to write one copy to the one partition, then reposition
itself to write the second copy to the other partition, and that
repositioning is non-zero time on spinning rust. There's no such
repositioning latency on SSDs, where it might actually be faster than
mixed mode, tho I'm unaware of any benchmarking to find out.

Despite the inefficiency, both partitions and btrfs raid1 are separately
well tested, and their combined use on a single device should introduce
no race conditions that wouldn't have been found by previous separate
usage, so this would be my current preference at filesystem sizes over a
half terabyte on spinning rust, or on SSDs with their zero seek times.
But writing /will/ be slow on spinning rust, particularly with partition
sizes of a half-TiB or larger each, as that write-mode seek-time will be
/nasty/.

That said, again, there are people known to be using this mode, and it's
a viable choice in deployments such as laptops where physical
multi-device isn't an option, but the additional reliability of
pair-copy data is highly desirable.

> 3) wait for the patch in process to allow for btrfs single devices to
> support dup mode for data

This should be the preferred mode in the future, tho as with any new
btrfs feature, it'll probably take a couple kernel versions after
initial introduction for the most critical bugs in the new feature to be
found and duly exterminated, so I'd consider anyone using it in the
first kernel cycle or two after introduction to be volunteering as a
guinea pig.

That said, the individual components of this feature have been in btrfs
for some time and are well tested by now, so I'd expect the introduction
of this feature to be rather smoother than many. For the much more
disruptive raid56 mode, I suggested a guinea-pig time of a year, five
kernel cycles, for instance, and that turned out to be about right.
(Interestingly enough, that put raid56 mode feature stability at the
soon-to-be-released kernel 4.4, which is scheduled to be a
long-term-support release, so the raid56 mode stability timing worked
out rather well, tho I had no idea 4.4 would be an LTS when I originally
predicted the year's settle-time.)

> Is that about right?

=:^)

One further caveat regarding SSDs. On SSDs, many commonly deployed FTLs
do dedup. Sandforce firmware, where dedup is sold as a feature, is known
for this. If the firmware is doing dedup, then duplicated data /or/
metadata at the filesystem level is simply being deduped at the physical
device firmware level, so you end up with only one physical copy in any
case, and filesystem efforts to provide redundancy only end up costing
CPU cycles at both the filesystem and device-firmware levels, all for
naught.

This is a big reason why mkfs.btrfs on a single device defaults to
single metadata if it detects an SSD, despite the normally preferred dup
metadata default. So if you're deploying on SSDs using sandforce
firmware or otherwise known to do dedup at the FTL, don't bother with
any of the above, as the firmware will be simply defeating your efforts
at redundancy.
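To make the three options concrete, here is a minimal sketch of the mkfs
invocations involved. This is illustrative only: /dev/sdx and its
partitions are placeholder devices (mkfs destroys existing data), and
option 3 assumes the in-process dup-data patch discussed above has
landed, so it won't work with the btrfs-progs of this era.

```shell
# Option 1: mixed block groups, dup profile. Mixed mode can only be
# chosen at creation time; with --mixed, data and metadata share block
# groups and must use the same profile.
mkfs.btrfs --mixed -m dup -d dup /dev/sdx

# Option 2: two equal-sized partitions on one physical device,
# combined into a single btrfs with raid1 data and metadata.
mkfs.btrfs -m raid1 -d raid1 /dev/sdx1 /dev/sdx2

# Option 3: dup data on a single unpartitioned device -- assumes the
# patch removing the single-device dup-data restriction is merged.
mkfs.btrfs -m dup -d dup /dev/sdx
```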
Re: Scrub on btrfs single device only to detect errors, not correct them?
Austin S Hemmelgarn posted on Mon, 07 Dec 2015 10:39:05 -0500 as
excerpted:

> On 2015-12-07 10:12, Jon Panozzo wrote:
>> This is what I was thinking as well. In my particular use-case,
>> parity is only really used today to reconstruct an entire device due
>> to a device failure. I think if btrfs scrub detected errors on a
>> single device, I could do a "reverse reconstruct" where instead of
>> syncing TO the parity disk, I sync FROM the parity disk TO the btrfs
>> single device with the error, replacing physical blocks that are out
>> of sync with parity (thus repairing the scrub-found errors). The
>> downside to this approach is I would have to perform the reverse-sync
>> against the entire btrfs block device, which could be much more
>> time-consuming than if I could single out the specific block
>> addresses and just sync those. That said, I guess option A is better
>> than no option at all.
>>
>> I would be curious if any of the devs or other members of this
>> mailing list have tried to correlate btrfs internal block addresses
>> to a true block-address on the device being used. Any interesting
>> articles / links that show how to do this? Not expecting much, but if
>> someone does know, I'd be very grateful.
>
> I think there is a tool in btrfs-progs to do it, but I've never used
> it, and you would still need to get scrub to spit out actual error
> addresses for you.

btrfs-debug-tree is what you're looking for. =:^)

As I understand things, the complexity is due to btrfs' chunk
abstraction, along with the multi-device feature. On a normal
filesystem, byte or block addresses are mapped linearly to absolute
filesystem byte addresses and there's just the one device to worry
about, so there's effectively little or no translation to be done. On
btrfs by contrast, block addresses map into chunks, also known as block
groups, which are designed to be more or less arbitrarily relocatable
within the filesystem using balance (originally called the restriper).
Further, these block groups can be single, striped across multiple
devices (raid0 and the 0 side of raid10), duplicated on the same device
(dup), mirrored across multiple devices (raid1 and the 1 side of raid10;
only two copies currently, N-way-mirroring is on the roadmap), or
striped with parity (raid5 and 6).

So while block addresses can map more or less linearly into block
groups, btrfs has to maintain an entirely new layer of abstraction
mapping in addition, that tells the filesystem where to look for that
block group, that is, on what device (or across what devices if
striped), and at what absolute bytenr offset into the device.

And again, keep in mind that even with a constant single/dup/raid
mapping, and even in the simplest single mode on a single device,
balance can and does more or less arbitrarily dynamically relocate block
groups within the filesystem, so the mapping you see today may or may
not be the mapping you see tomorrow, depending on whether a balance was
run in the mean time.

Obviously the devs are going to need a tool to help them debug this
additional complexity, and that's where btrfs-debug-tree comes in. =:^)
But for "ordinary mortal admins", yes, btrfs is open source and
btrfs-debug-tree is available for those that want to use it, but once
they realize the complexity, most (including me) are going to simply be
content to treat it as a black box and not worry too much about
investigating its innards.

So while specific block and/or byte mapping can be done and there are
tools available for and appropriate to the task, it's the type of thing
most admins are very content to treat as a black box and leave well
enough alone, once they understand the complexities involved.

"Btrfs, while he might use it, it ain't your grandfather's filesystem!"
(TM) =:^)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."
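For anyone who does want to peek inside the black box, a sketch of the
sort of commands involved. The device path, mountpoint, and logical
address are placeholders, and exact output and option spellings vary
with the btrfs-progs version:

```shell
# Dump the chunk tree (tree id 3), which records where each block
# group lives: the device(s) and absolute byte offsets backing each
# logical chunk. Run against an unmounted or quiescent filesystem.
btrfs-debug-tree -t 3 /dev/sdx

# Given a logical byte address (e.g. one printed in a scrub or csum
# error in the kernel log), resolve it back to the file(s) using it.
btrfs inspect-internal logical-resolve 123456789 /mnt
```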
Richard Stallman
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Scrub on btrfs single device only to detect errors, not correct them?
And I'll throw this question out to everyone:

Let's say I have a means of providing parity for a btrfs device, but in
a way that's external to btrfs (imagine a btrfs single device as part of
a hardware or software RAID). If BTRFS detected an error during a
scrub, and parity wasn't updated as a result (say the result of bitrot
on the btrfs device), couldn't parity be used to repair the broken
bit(s)? If so, the big question is how to use scrub to determine the
sector/bit (forgive me if I'm using wrong terminology) at the block
level that needs to be fixed. I think my theory is sound in principle,
but not sure if it's possible to correlate a scrub-found uncorrectable
error to a physical location on the block device.

- Jon

On Sun, Dec 6, 2015 at 9:48 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> Chris Murphy posted on Sun, 06 Dec 2015 13:42:57 -0700 as excerpted:
>
>> On Sun, Dec 6, 2015 at 12:15 PM, Jon Panozzo wrote:
>>> Just to confirm, is the sole purpose of supporting scrub on single
>>> btrfs devices to detect errors, but not to correct them?
>>
>> If that single device metadata profile is DUP, then it will correct
>> those. If there is only one copy of anything, then it just reports.
>> Scrub works on all data, but a passive scrub happens anytime
>> something is read.
>
> ... And more to the point, expanding on that, on a single device
> btrfs, data is single mode by default, so scrub for it (as opposed to
> metadata) is error-detect-only as mentioned.
>
> However, while the default mode separates data and metadata, and in
> that mode, historically (there's a patch to change this, adding the
> missing option) data was single-only, mixed-bg mode (the mkfs.btrfs
> --mixed option) puts data and metadata both in the same shared
> block-group type, which can then be either dup or single mode.
> Obviously duplicating data as well as metadata means you can only
> store half as much data, since it's all stored twice, but that will
> let scrub correct errors in cases where only one of the two copies
> doesn't verify checksum, but the other one does.
>
> And as mentioned above, there's a patch in process now, that will
> remove the single-device restriction of data (as opposed to metadata)
> to single mode, allowing the choice of dup mode for data as well as
> metadata.
>
> Also, in addition to the mixed-mode workaround to get dup data, it's
> possible, altho rather inefficient in performance terms, to partition
> a physical device such that two equal-sized partitions are made
> available as logical devices, and then mkfs.btrfs -d raid1 -m raid1
> the two logical devices into a single btrfs, raid1 for both data and
> metadata, so btrfs creates two copies that way, again letting scrub
> correct errors when only one of the two fails to verify against
> checksum.
>
> --
> Duncan - List replies preferred. No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master." Richard Stallman
Re: Scrub on btrfs single device only to detect errors, not correct them?
On 2015-12-07 09:47, Jon Panozzo wrote:
> And I'll throw this question out to everyone:
>
> Let's say I have a means of providing parity for a btrfs device, but
> in a way that's external to btrfs (imagine a btrfs single device as
> part of a hardware or software RAID). If BTRFS detected an error
> during a scrub, and parity wasn't updated as a result (say the result
> of bitrot on the btrfs device), couldn't parity be used to repair the
> broken bit(s)? If so, the big question is how to use scrub to
> determine the sector/bit (forgive me if I'm using wrong terminology)
> at the block level that needs to be fixed. I think my theory is sound
> in principle, but not sure if it's possible to correlate a scrub-found
> uncorrectable error to a physical location on the block device.

In theory, this is possible, but it's _really_ tricky to do right.
BTRFS uses its own internal block addressing that is mostly independent
from what's done at the block device level, which makes things
non-trivial to map to actual addresses. On top of that, it's
non-trivial to get an address for a block that failed the scrub
operation. It's probably easier to just run a check on the lower-level
device if scrub reports errors. If that fails, then it's probably
fixable by the lower level directly; if it passes, then the issue is
probably a bug in BTRFS.
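As a rough sketch of how one might at least extract the btrfs-side
addresses scrub complains about (the mountpoint is a placeholder, and
the exact kernel-log wording varies by kernel version):

```shell
# Run a foreground scrub, then pull any checksum/scrub error lines
# from the kernel log; they typically include a "logical" byte
# address, which is btrfs-internal, not a raw device sector.
btrfs scrub start -B /mnt
dmesg | grep -i -e 'csum' -e 'scrub'
```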
Re: Scrub on btrfs single device only to detect errors, not correct them?
This is what I was thinking as well. In my particular use-case, parity
is only really used today to reconstruct an entire device due to a
device failure. I think if btrfs scrub detected errors on a single
device, I could do a "reverse reconstruct" where instead of syncing TO
the parity disk, I sync FROM the parity disk TO the btrfs single device
with the error, replacing physical blocks that are out of sync with
parity (thus repairing the scrub-found errors). The downside to this
approach is I would have to perform the reverse-sync against the entire
btrfs block device, which could be much more time-consuming than if I
could single out the specific block addresses and just sync those.
That said, I guess option A is better than no option at all.

I would be curious if any of the devs or other members of this mailing
list have tried to correlate btrfs internal block addresses to a true
block-address on the device being used. Any interesting articles /
links that show how to do this? Not expecting much, but if someone does
know, I'd be very grateful.

- Jon

On Mon, Dec 7, 2015 at 9:01 AM, Austin S Hemmelgarn wrote:
> On 2015-12-07 09:47, Jon Panozzo wrote:
>>
>> And I'll throw this question out to everyone:
>>
>> Let's say I have a means of providing parity for a btrfs device, but
>> in a way that's external to btrfs (imagine a btrfs single device as
>> part of a hardware or software RAID). If BTRFS detected an error
>> during a scrub, and parity wasn't updated as a result (say the result
>> of bitrot on the btrfs device), couldn't parity be used to repair the
>> broken bit(s)? If so, the big question is how to use scrub to
>> determine the sector/bit (forgive me if I'm using wrong terminology)
>> at the block level that needs to be fixed. I think my theory is sound
>> in principle, but not sure if it's possible to correlate a
>> scrub-found uncorrectable error to a physical location on the block
>> device.
> In theory, this is possible, but it's _really_ tricky to do right.
> BTRFS uses its own internal block addressing that is mostly
> independent from what's done at the block device level, which makes
> things non-trivial to map to actual addresses. On top of that, it's
> non-trivial to get an address for a block that failed the scrub
> operation. It's probably easier to just run a check on the
> lower-level device if scrub reports errors. If that fails, then it's
> probably fixable by the lower level directly; if it passes, then the
> issue is probably a bug in BTRFS.
Re: Scrub on btrfs single device only to detect errors, not correct them?
Duncan,

Thanks for the additional feedback. Two follow-up questions to this
are:

Can the --mixed option only be applied when first creating the fs, or
can you simply add this to the balance command to take an existing
filesystem and add this to it?

So it sounds like there are really three ways to enable scrub to repair
errors on a btrfs single device (please confirm):

1) mkfs.btrfs with the --mixed option
2) create two partitions on a single phys device, then present them as
logical devices (maybe a loopback or something) and create a btrfs
raid1 for both data/metadata
3) wait for the patch in process to allow for btrfs single devices to
support dup mode for data

Is that about right?

- Jon

On Sun, Dec 6, 2015 at 9:48 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> Chris Murphy posted on Sun, 06 Dec 2015 13:42:57 -0700 as excerpted:
>
>> On Sun, Dec 6, 2015 at 12:15 PM, Jon Panozzo wrote:
>>> Just to confirm, is the sole purpose of supporting scrub on single
>>> btrfs devices to detect errors, but not to correct them?
>>
>> If that single device metadata profile is DUP, then it will correct
>> those. If there is only one copy of anything, then it just reports.
>> Scrub works on all data, but a passive scrub happens anytime
>> something is read.
>
> ... And more to the point, expanding on that, on a single device
> btrfs, data is single mode by default, so scrub for it (as opposed to
> metadata) is error-detect-only as mentioned.
>
> However, while the default mode separates data and metadata, and in
> that mode, historically (there's a patch to change this, adding the
> missing option) data was single-only, mixed-bg mode (the mkfs.btrfs
> --mixed option) puts data and metadata both in the same shared
> block-group type, which can then be either dup or single mode.
> Obviously duplicating data as well as metadata means you can only
> store half as much data, since it's all stored twice, but that will
> let scrub correct errors in cases where only one of the two copies
> doesn't verify checksum, but the other one does.
>
> And as mentioned above, there's a patch in process now, that will
> remove the single-device restriction of data (as opposed to metadata)
> to single mode, allowing the choice of dup mode for data as well as
> metadata.
>
> Also, in addition to the mixed-mode workaround to get dup data, it's
> possible, altho rather inefficient in performance terms, to partition
> a physical device such that two equal-sized partitions are made
> available as logical devices, and then mkfs.btrfs -d raid1 -m raid1
> the two logical devices into a single btrfs, raid1 for both data and
> metadata, so btrfs creates two copies that way, again letting scrub
> correct errors when only one of the two fails to verify against
> checksum.
>
> --
> Duncan - List replies preferred. No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master." Richard Stallman
Re: Scrub on btrfs single device only to detect errors, not correct them?
On 2015-12-07 10:12, Jon Panozzo wrote:
> This is what I was thinking as well. In my particular use-case,
> parity is only really used today to reconstruct an entire device due
> to a device failure. I think if btrfs scrub detected errors on a
> single device, I could do a "reverse reconstruct" where instead of
> syncing TO the parity disk, I sync FROM the parity disk TO the btrfs
> single device with the error, replacing physical blocks that are out
> of sync with parity (thus repairing the scrub-found errors). The
> downside to this approach is I would have to perform the reverse-sync
> against the entire btrfs block device, which could be much more
> time-consuming than if I could single out the specific block
> addresses and just sync those. That said, I guess option A is better
> than no option at all.
>
> I would be curious if any of the devs or other members of this
> mailing list have tried to correlate btrfs internal block addresses
> to a true block-address on the device being used. Any interesting
> articles / links that show how to do this? Not expecting much, but if
> someone does know, I'd be very grateful.

I think there is a tool in btrfs-progs to do it, but I've never used
it, and you would still need to get scrub to spit out actual error
addresses for you.
Re: Scrub on btrfs single device only to detect errors, not correct them?
Chris Murphy posted on Sun, 06 Dec 2015 13:42:57 -0700 as excerpted:

> On Sun, Dec 6, 2015 at 12:15 PM, Jon Panozzo wrote:
>> Just to confirm, is the sole purpose of supporting scrub on single
>> btrfs devices to detect errors, but not to correct them?
>
> If that single device metadata profile is DUP, then it will correct
> those. If there is only one copy of anything, then it just reports.
> Scrub works on all data, but a passive scrub happens anytime something
> is read.

... And more to the point, expanding on that, on a single device btrfs,
data is single mode by default, so scrub for it (as opposed to metadata)
is error-detect-only as mentioned.

However, while the default mode separates data and metadata, and in
that mode, historically (there's a patch to change this, adding the
missing option) data was single-only, mixed-bg mode (the mkfs.btrfs
--mixed option) puts data and metadata both in the same shared
block-group type, which can then be either dup or single mode.

Obviously duplicating data as well as metadata means you can only store
half as much data, since it's all stored twice, but that will let scrub
correct errors in cases where only one of the two copies doesn't verify
checksum, but the other one does.

And as mentioned above, there's a patch in process now, that will
remove the single-device restriction of data (as opposed to metadata)
to single mode, allowing the choice of dup mode for data as well as
metadata.

Also, in addition to the mixed-mode workaround to get dup data, it's
possible, altho rather inefficient in performance terms, to partition a
physical device such that two equal-sized partitions are made available
as logical devices, and then mkfs.btrfs -d raid1 -m raid1 the two
logical devices into a single btrfs, raid1 for both data and metadata,
so btrfs creates two copies that way, again letting scrub correct
errors when only one of the two fails to verify against checksum.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Scrub on btrfs single device only to detect errors, not correct them?
Just to confirm, is the sole purpose of supporting scrub on single
btrfs devices to detect errors, but not to correct them?

Best Regards,

Jonathan Panozzo
Lime Technology, Inc.
Re: Scrub on btrfs single device only to detect errors, not correct them?
On Sun, Dec 6, 2015 at 12:15 PM, Jon Panozzo wrote:
> Just to confirm, is the sole purpose of supporting scrub on single
> btrfs devices to detect errors, but not to correct them?

If that single device metadata profile is DUP, then it will correct
those. If there is only one copy of anything, then it just reports.
Scrub works on all data, but a passive scrub happens anytime something
is read.

--
Chris Murphy
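For reference, the basic scrub workflow under discussion, with /mnt as
a placeholder mountpoint:

```shell
# Start a scrub; -B keeps it in the foreground, otherwise it runs in
# the background.
btrfs scrub start -B /mnt

# For a background scrub, check progress and the counts of corrected
# and uncorrectable errors found so far.
btrfs scrub status /mnt
```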