Re: Scrub on btrfs single device only to detect errors, not correct them?

2015-12-08 Thread Duncan
Jon Panozzo posted on Mon, 07 Dec 2015 08:43:14 -0600 as excerpted:

[On single-device dup data]

> Thanks for the additional feedback.  Two follow-up questions on this:
> 
> Can the --mixed option only be applied when first creating the fs, or
> can you simply add it to a balance command to convert an existing
> filesystem?

Mixed-bg mode has to be done at btrfs creation.

It changes the way btrfs handles chunks, and doing that _live_, with a 
non-zero time during which both modes are active, would be... complex and 
an invitation to all sorts of race bugs, to put it mildly.

> So it sounds like there are really three ways to enable scrub to repair
> errors on a btrfs single device (please confirm):

Yes.

> 1) mkfs.btrfs with the --mixed option

This would be my current preference for filesystem sizes of a quarter to 
perhaps a half terabyte on spinning rust, and some people are known to 
use mixed-bg mode for exactly this reason, tho it's not particularly well 
tested at the terabyte scale, where as a result you might uncover some 
unusual bugs.
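As a concrete sketch of option 1 (the device name /dev/sdX is a placeholder, and RUN=echo keeps this a dry run, since the real command destroys existing data on the device):

```shell
#!/bin/sh
# Sketch: create a mixed-bg btrfs with duplicated data+metadata.
# /dev/sdX is a placeholder device; RUN=echo makes this a dry run.
DEV=${DEV:-/dev/sdX}
RUN=${RUN:-echo}
# --mixed puts data and metadata in shared block groups; with -d dup -m dup
# everything is stored twice, so scrub can repair a bad copy from the good one.
MKFS_CMD="mkfs.btrfs --mixed -d dup -m dup $DEV"
$RUN $MKFS_CMD
```

Drop the RUN=echo indirection to actually create the filesystem.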

> 2) create two partitions on a single phys device,
> then present them as logical devices (maybe a loopback or something)
> and create a btrfs raid1 for both data/metadata

No special loopback, etc, required.  Btrfs deploys just fine on pretty 
much any block device as presented by the kernel, including both 
partitions and LVM volumes, the two ways single physical devices are 
likely to be presented as multiple logical devices.

In fact I use btrfs on partitions here, tho in my case it's two devices 
partitioned up identically, with raid1 across the parallel partitions on 
each device, instead of using multiple partitions on the same physical 
device, which is what we're talking about here.

This option will be rather inefficient on spinning rust, as the write 
head has to write one copy to the first partition, then reposition itself 
to write the second copy to the other partition, and that repositioning 
takes non-zero time.  There's no such repositioning latency on SSDs, 
where this option might actually be faster than mixed-bg mode, tho I'm 
unaware of any benchmarking to find out.

Despite the inefficiency, both partitions and btrfs raid1 are separately 
well tested, and their combined use on a single device should introduce no 
race conditions that wouldn't have been found by previous separate usage, 
so this would be my current preference at filesystem sizes over a half 
terabyte on spinning rust, or on SSDs with their zero seek times.

But writing /will/ be slow on spinning rust, particularly with partition 
sizes of a half-TiB or larger each, as that write-mode seek-time will be 
/nasty/.

That said, again, there are people known to be using this mode, and it's 
a viable choice in deployments such as laptops where physical multi-
device isn't an option, but the additional reliability of pair-copy data 
is highly desirable.
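Option 2 can be sketched roughly like this, with /dev/sdX1 and /dev/sdX2 as placeholder names for the two equal-size partitions:

```shell
#!/bin/sh
# Sketch: btrfs raid1 across two partitions of one physical device.
# /dev/sdX1 and /dev/sdX2 are placeholder partition names;
# RUN=echo makes this a dry run.
P1=${P1:-/dev/sdX1}
P2=${P2:-/dev/sdX2}
RUN=${RUN:-echo}
# raid1 for both data and metadata gives two copies, one per partition,
# so scrub can repair whichever copy fails its checksum.
MKFS_CMD="mkfs.btrfs -d raid1 -m raid1 $P1 $P2"
$RUN $MKFS_CMD
```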

> 3) wait for the patch in process to allow for btrfs single devices to
> support dup mode for data

This should be the preferred mode in the future, tho as with any new 
btrfs feature, it'll probably take a couple kernel versions after initial 
introduction for the most critical bugs in the new feature to be found 
and duly exterminated, so I'd consider anyone using it the first kernel 
cycle or two after introduction to be volunteering as guinea pigs.  That 
said, the individual components of this feature have been in btrfs for 
some time and are well tested by now, so I'd expect the introduction of 
this feature to be rather smoother than many.  For the much more 
disruptive raid56 mode, I suggested a guinea-pig time of a year, five 
kernel cycles, for instance, and that turned out to be about right.

(Interestingly enough, that put raid56 mode feature stability at the soon 
to be released kernel 4.4, which is scheduled to be a long-term-support 
release, so the raid56 mode stability timing worked out rather well, tho 
I had no idea 4.4 would be an LTS when I originally predicted the year's 
settle-time.)

> Is that about right?

=:^)


One further caveat regarding SSDs.

On SSDs, many commonly deployed FTLs do dedup.  Sandforce firmware, where 
dedup is sold as a feature, is known for this.  If the firmware is doing 
dedup, then duplicated data /or/ metadata at the filesystem level is 
simply being deduped at the physical device firmware level, so you end up 
with only one physical copy in any case, and filesystem efforts to 
provide redundancy only end up costing CPU cycles at both the filesystem 
and device-firmware levels, all for naught.  This is a big reason why 
mkfs.btrfs on a single device defaults to single metadata if it detects 
an SSD, despite the normally preferred dup metadata default.
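If you do want dup metadata on an SSD anyway, say because the drive's FTL is known not to dedup, the detection-based default can be overridden explicitly at mkfs time; a sketch with a placeholder device:

```shell
#!/bin/sh
# Sketch: override the SSD single-metadata default, forcing dup metadata.
# Only worthwhile if the drive's FTL is not known to dedup writes.
# /dev/sdX is a placeholder; RUN=echo makes this a dry run.
DEV=${DEV:-/dev/sdX}
RUN=${RUN:-echo}
MKFS_CMD="mkfs.btrfs -m dup $DEV"
$RUN $MKFS_CMD
```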

So if you're deploying on SSDs with Sandforce firmware, or firmware 
otherwise known to do dedup at the FTL, don't bother with any of the 
above, as the firmware will simply defeat your efforts at redundancy.
Re: Scrub on btrfs single device only to detect errors, not correct them?

2015-12-08 Thread Duncan
Austin S Hemmelgarn posted on Mon, 07 Dec 2015 10:39:05 -0500 as
excerpted:

> On 2015-12-07 10:12, Jon Panozzo wrote:
>> This is what I was thinking as well.  In my particular use-case, parity
>> is only really used today to reconstruct an entire device due to a
>> device failure.  I think if btrfs scrub detected errors on a single
>> device, I could do a "reverse reconstruct" where instead of syncing TO
>> the parity disk, I sync FROM the parity disk TO the btrfs single device
>> with the error, replacing physical blocks that are out of sync with
>> parity (thus repairing the scrub-found errors).  The downside to this
>> approach is I would have to perform the reverse-sync against the entire
>> btrfs block device, which could be much more time-consuming than if I
>> could single out the specific block addresses and just sync those. 
>> That said, I guess option A is better than no option at all.
>>
>> I would be curious if any of the devs or other members of this mailing
>> list have tried to correlate btrfs internal block addresses to a true
>> block-address on the device being used.  Any interesting articles /
>> links that show how to do this?  Not expecting much, but if someone
>> does know, I'd be very grateful.

> I think there is a tool in btrfs-progs to do it, but I've never used it,
> and you would still need to get scrub to spit out actual error addresses
> for you.

btrfs-debug-tree is what you're looking for. =:^)

As I understand things, the complexity is due to btrfs' chunk 
abstraction, along with the multi-device feature.

On a normal filesystem, byte or block addresses map linearly to absolute 
filesystem byte addresses and there's just the one device to worry about, 
so there's effectively little or no translation to be done.

On btrfs, by contrast, block addresses map into chunks, also known as 
block groups, which are designed to be more or less arbitrarily 
relocatable within the filesystem using balance (originally called the 
restriper).  Further, these block groups can be single, striped across 
multiple devices (raid0 and the 0 side of raid10), duplicated on the same 
device (dup) or across multiple devices (raid1 and the 1 side of raid10; 
currently limited to two copies, with N-way-mirroring on the roadmap), or 
striped with parity (raid5 and raid6).

So while block addresses can map more or less linearly into block groups, 
btrfs has to maintain an additional layer of abstraction mapping that 
tells the filesystem where to look for each block group, that is, on what 
device (or across what devices, if striped), and at what absolute bytenr 
offset into the device.

And again, keep in mind that even with a constant single/dup/raid mapping 
and even in the simplest single mode on single device, balance can and 
does more or less arbitrarily dynamically relocate block groups within 
the filesystem, so the mapping you see today may or may not be the 
mapping you see tomorrow, depending on whether a balance was run in the 
mean time.

Obviously the devs are going to need a tool to help them debug this 
additional complexity, and that's where btrfs-debug-tree comes in. =:^)

But for "ordinary mortal admins": yes, btrfs is open source and
btrfs-debug-tree is available to those who want to use it, but once 
they realize the complexity, most (including me) will simply be 
content to treat it as a black box and not worry too much about 
investigating its innards.

So while specific block and/or byte mapping can be done and there's tools 
available for and appropriate to the task, it's the type of thing most 
admins are very content to treat as a black box and leave well enough 
alone, once they understand the complexities involved.
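For the curious, such a dump can be had along these lines. The exact invocation varies by btrfs-progs version (newer versions expose this as `btrfs inspect-internal dump-tree`), and both the device name and the tree id flag here should be treated as a sketch rather than gospel:

```shell
#!/bin/sh
# Sketch: dump the chunk tree, which records where each block group
# (logical address range) lives, on which device and at what physical
# offset. /dev/sdX is a placeholder; -t 3 selects the chunk tree by id
# (an assumption to verify against your btrfs-progs version).
# RUN=echo makes this a dry run.
DEV=${DEV:-/dev/sdX}
RUN=${RUN:-echo}
DUMP_CMD="btrfs-debug-tree -t 3 $DEV"
$RUN $DUMP_CMD
```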

"Btrfs, while he might use it, it ain't your grandfather's 
filesystem!" (TM) =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Scrub on btrfs single device only to detect errors, not correct them?

2015-12-07 Thread Jon Panozzo
And I'll throw this question out to everyone:

Let's say I have a means of providing parity for a btrfs device, but
in a way that's external to btrfs (imagine a btrfs single device as
part of a hardware or software RAID).  If BTRFS detected an error
during a scrub, and parity wasn't updated as a result (say the result
of bitrot on the btrfs device), couldn't parity be used to repair the
broken bit(s)?  If so, the big question is how to use scrub to
determine the sector/bit (forgive me if I'm using wrong terminology)
at the block level that needs to be fixed.  I think my theory is sound
in principle, but I'm not sure it's possible to correlate a scrub-found
uncorrectable error to a physical location on the block device.

- Jon

On Sun, Dec 6, 2015 at 9:48 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> Chris Murphy posted on Sun, 06 Dec 2015 13:42:57 -0700 as excerpted:
>
>> On Sun, Dec 6, 2015 at 12:15 PM, Jon Panozzo 
>> wrote:
>>> Just to confirm, is the sole purpose of supporting scrub on single
>>> btrfs devices to detect errors, but not to correct them?
>>
>> If that single device metadata profile is DUP, then it will correct
>> those. If there is only one copy of anything, then it just reports.
>> Scrub works on all data, but a passive scrub happens anytime something
>> is read.
>
> ... And more to the point, expanding on that, on a single device btrfs,
> data is single mode by default, so scrub for it (as opposed to metadata)
> is error-detect-only as mentioned.
>
> However, while the default mode separates data and metadata, and in that
> mode, historically (there's a patch to change this, adding the missing
> option) data was single-only, mixed-bg mode (the mkfs.btrfs --mixed
> option) puts data and metadata both in the same shared block-group type,
> which can then be either dup or single mode.
>
> Obviously duplicating data as well as metadata means you can only store
> half as much data, since it's all stored twice, but that will let scrub
> correct errors in cases where only one of the two copies doesn't verify
> checksum, but the other one does.
>
> And as mentioned above, there's a patch in process now, that will remove
> the single-device restriction of data (as opposed to metadata) to single
> mode, allowing the choice of dup mode for data as well as metadata.
>
>
> Also, in addition to the mixed-mode workaround to get dup data, it's
> possible, altho rather inefficient in performance terms, to partition a
> physical device such that two equal sized partitions are made available
> as logical devices, and then mkfs.btrfs -d raid1 -m raid1 the two logical
> devices into a single btrfs, raid1 for both data and metadata, so btrfs
> creates two copies that way, again letting scrub correct errors when only
> one of the two fails to verify against checksum.
>


Re: Scrub on btrfs single device only to detect errors, not correct them?

2015-12-07 Thread Austin S Hemmelgarn

On 2015-12-07 09:47, Jon Panozzo wrote:

> And I'll throw this question out to everyone:
> 
> Let's say I have a means of providing parity for a btrfs device, but
> in a way that's external to btrfs (imagine a btrfs single device as
> part of a hardware or software RAID).  If BTRFS detected an error
> during a scrub, and parity wasn't updated as a result (say the result
> of bitrot on the btrfs device), couldn't parity be used to repair the
> broken bit(s)?  If so, the big question is how to use scrub to
> determine the sector/bit (forgive me if I'm using wrong terminology)
> at the block level that needs to be fixed.  I think my theory is sound
> in principle, but not sure if it's possible to correlate a scrub-found
> uncorrectable error to a physical location on the block device.

In theory, this is possible, but it's _really_ tricky to do right. 
BTRFS uses its own internal block addressing that is mostly independent 
from what's done at the block device level, which makes things 
non-trivial to map to actual addresses.  On top of that, it's 
non-trivial to get an address for a block that failed the scrub 
operation.  It's probably easier to just run a check on the lower-level 
device if scrub reports errors.  If that fails, the problem is probably 
fixable by the lower level directly; if it passes, the issue is 
probably a bug in BTRFS.
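That flow might look roughly like the following, assuming (purely for illustration) that the lower-level device is an md array named md0:

```shell
#!/bin/sh
# Sketch of the flow above: btrfs scrub first; if it reports errors, run a
# consistency check on the lower-level device. The mount point and md array
# name are placeholders; RUN=echo keeps this a dry run.
MNT=${MNT:-/mnt/btrfs}
MD=${MD:-md0}
RUN=${RUN:-echo}
SCRUB_CMD="btrfs scrub start -B $MNT"                      # -B: foreground, print stats on exit
MD_CHECK_CMD="echo check > /sys/block/$MD/md/sync_action"  # start an md-level check
MD_RESULT_CMD="cat /sys/block/$MD/md/mismatch_cnt"         # nonzero suggests a lower-level problem
$RUN $SCRUB_CMD
$RUN "$MD_CHECK_CMD"
$RUN "$MD_RESULT_CMD"
```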







Re: Scrub on btrfs single device only to detect errors, not correct them?

2015-12-07 Thread Jon Panozzo
This is what I was thinking as well.  In my particular use-case,
parity is only really used today to reconstruct an entire device due
to a device failure.  I think if btrfs scrub detected errors on a
single device, I could do a "reverse reconstruct" where instead of
syncing TO the parity disk, I sync FROM the parity disk TO the btrfs
single device with the error, replacing physical blocks that are out
of sync with parity (thus repairing the scrub-found errors).  The
downside to this approach is I would have to perform the reverse-sync
against the entire btrfs block device, which could be much more
time-consuming than if I could single out the specific block addresses
and just sync those.  That said, I guess option A is better than no
option at all.

I would be curious if any of the devs or other members of this mailing
list have tried to correlate btrfs internal block addresses to a true
block-address on the device being used.  Any interesting articles /
links that show how to do this?  Not expecting much, but if someone
does know, I'd be very grateful.

- Jon

On Mon, Dec 7, 2015 at 9:01 AM, Austin S Hemmelgarn
 wrote:
> On 2015-12-07 09:47, Jon Panozzo wrote:
>>
>> And I'll throw this question out to everyone:
>>
>> Let's say I have a means of providing parity for a btrfs device, but
>> in a way that's external to btrfs (imagine a btrfs single device as
>> part of a hardware or software RAID).  If BTRFS detected an error
>> during a scrub, and parity wasn't updated as a result (say the result
>> of bitrot on the btrfs device), couldn't parity be used to repair the
>> broken bit(s)?  If so, the big question is how to use scrub to
>> determine the sector/bit (forgive me if I'm using wrong terminology)
>> at the block level that needs to be fixed.  I think my theory is sound
>> in principle, but not sure if it's possible to correlate a scrub-found
>> uncorrectable error to a physical location on the block device.
>>
> In theory, this is possible, but it's _really_ tricky to do right. BTRFS
> uses its own internal block addressing that is mostly independent from
> what's done at the block device level, which makes things non-trivial to map
> to actual addresses.  On top of that, it's non-trivial to get an address for
> a block that failed the scrub operation.  It's probably easier to just run a
> check on the lower-level device if scrub reports errors.  If that fails, the
> problem is probably fixable by the lower level directly; if it passes, the
> issue is probably a bug in BTRFS.
>
>


Re: Scrub on btrfs single device only to detect errors, not correct them?

2015-12-07 Thread Jon Panozzo
Duncan,

Thanks for the additional feedback.  Two follow-up questions on this:

Can the --mixed option only be applied when first creating the fs, or
can you simply add it to a balance command to convert an existing
filesystem?

So it sounds like there are really three ways to enable scrub to
repair errors on a btrfs single device (please confirm):

1) mkfs.btrfs with the --mixed option
2) create two partitions on a single phys device, then present them as
logical devices (maybe a loopback or something) and create a btrfs
raid1 for both data/metadata
3) wait for the patch in process to allow for btrfs single devices to
support dup mode for data

Is that about right?

- Jon

On Sun, Dec 6, 2015 at 9:48 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> Chris Murphy posted on Sun, 06 Dec 2015 13:42:57 -0700 as excerpted:
>
>> On Sun, Dec 6, 2015 at 12:15 PM, Jon Panozzo 
>> wrote:
>>> Just to confirm, is the sole purpose of supporting scrub on single
>>> btrfs devices to detect errors, but not to correct them?
>>
>> If that single device metadata profile is DUP, then it will correct
>> those. If there is only one copy of anything, then it just reports.
>> Scrub works on all data, but a passive scrub happens anytime something
>> is read.
>
> ... And more to the point, expanding on that, on a single device btrfs,
> data is single mode by default, so scrub for it (as opposed to metadata)
> is error-detect-only as mentioned.
>
> However, while the default mode separates data and metadata, and in that
> mode, historically (there's a patch to change this, adding the missing
> option) data was single-only, mixed-bg mode (the mkfs.btrfs --mixed
> option) puts data and metadata both in the same shared block-group type,
> which can then be either dup or single mode.
>
> Obviously duplicating data as well as metadata means you can only store
> half as much data, since it's all stored twice, but that will let scrub
> correct errors in cases where only one of the two copies doesn't verify
> checksum, but the other one does.
>
> And as mentioned above, there's a patch in process now, that will remove
> the single-device restriction of data (as opposed to metadata) to single
> mode, allowing the choice of dup mode for data as well as metadata.
>
>
> Also, in addition to the mixed-mode workaround to get dup data, it's
> possible, altho rather inefficient in performance terms, to partition a
> physical device such that two equal sized partitions are made available
> as logical devices, and then mkfs.btrfs -d raid1 -m raid1 the two logical
> devices into a single btrfs, raid1 for both data and metadata, so btrfs
> creates two copies that way, again letting scrub correct errors when only
> one of the two fails to verify against checksum.
>


Re: Scrub on btrfs single device only to detect errors, not correct them?

2015-12-07 Thread Austin S Hemmelgarn

On 2015-12-07 10:12, Jon Panozzo wrote:

> This is what I was thinking as well.  In my particular use-case,
> parity is only really used today to reconstruct an entire device due
> to a device failure.  I think if btrfs scrub detected errors on a
> single device, I could do a "reverse reconstruct" where instead of
> syncing TO the parity disk, I sync FROM the parity disk TO the btrfs
> single device with the error, replacing physical blocks that are out
> of sync with parity (thus repairing the scrub-found errors).  The
> downside to this approach is I would have to perform the reverse-sync
> against the entire btrfs block device, which could be much more
> time-consuming than if I could single out the specific block addresses
> and just sync those.  That said, I guess option A is better than no
> option at all.
> 
> I would be curious if any of the devs or other members of this mailing
> list have tried to correlate btrfs internal block addresses to a true
> block-address on the device being used.  Any interesting articles /
> links that show how to do this?  Not expecting much, but if someone
> does know, I'd be very grateful.
I think there is a tool in btrfs-progs to do it, but I've never used it, 
and you would still need to get scrub to spit out actual error addresses 
for you.







Re: Scrub on btrfs single device only to detect errors, not correct them?

2015-12-06 Thread Duncan
Chris Murphy posted on Sun, 06 Dec 2015 13:42:57 -0700 as excerpted:

> On Sun, Dec 6, 2015 at 12:15 PM, Jon Panozzo 
> wrote:
>> Just to confirm, is the sole purpose of supporting scrub on single
>> btrfs devices to detect errors, but not to correct them?
> 
> If that single device metadata profile is DUP, then it will correct
> those. If there is only one copy of anything, then it just reports.
> Scrub works on all data, but a passive scrub happens anytime something
> is read.

... And more to the point, expanding on that, on a single device btrfs, 
data is single mode by default, so scrub for it (as opposed to metadata) 
is error-detect-only as mentioned.

However, while the default mode separates data and metadata, and in that 
mode, historically (there's a patch to change this, adding the missing 
option) data was single-only, mixed-bg mode (the mkfs.btrfs --mixed 
option) puts data and metadata both in the same shared block-group type, 
which can then be either dup or single mode.

Obviously duplicating data as well as metadata means you can only store 
half as much data, since it's all stored twice, but that will let scrub 
correct errors in cases where only one of the two copies doesn't verify 
checksum, but the other one does.

And as mentioned above, there's a patch in process now, that will remove 
the single-device restriction of data (as opposed to metadata) to single 
mode, allowing the choice of dup mode for data as well as metadata.


Also, in addition to the mixed-mode workaround to get dup data, it's 
possible, altho rather inefficient in performance terms, to partition a 
physical device such that two equal sized partitions are made available 
as logical devices, and then mkfs.btrfs -d raid1 -m raid1 the two logical 
devices into a single btrfs, raid1 for both data and metadata, so btrfs 
creates two copies that way, again letting scrub correct errors when only 
one of the two fails to verify against checksum.




Scrub on btrfs single device only to detect errors, not correct them?

2015-12-06 Thread Jon Panozzo
Just to confirm, is the sole purpose of supporting scrub on single btrfs
devices to detect errors, but not to correct them?

Best Regards,

Jonathan Panozzo
Lime Technology, Inc.


Re: Scrub on btrfs single device only to detect errors, not correct them?

2015-12-06 Thread Chris Murphy
On Sun, Dec 6, 2015 at 12:15 PM, Jon Panozzo  wrote:
> Just to confirm, is the sole purpose of supporting scrub on single btrfs
> devices to detect errors, but not to correct them?

If that single device metadata profile is DUP, then it will correct
those. If there is only one copy of anything, then it just reports.
Scrub works on all data, but a passive scrub happens anytime something
is read.
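In userspace terms, the active scrub looks roughly like the following, with the mount point as a placeholder:

```shell
#!/bin/sh
# Sketch: an explicit (active) scrub of a mounted btrfs. With DUP or raid1
# profiles a copy failing checksum is rewritten from the good copy; with
# single profiles errors are only reported. MNT is a placeholder mount
# point and RUN=echo keeps this a dry run.
MNT=${MNT:-/mnt/btrfs}
RUN=${RUN:-echo}
SCRUB_CMD="btrfs scrub start -B $MNT"   # -B: stay in foreground, print stats
STATUS_CMD="btrfs scrub status $MNT"    # per-filesystem error counters
$RUN $SCRUB_CMD
$RUN $STATUS_CMD
```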


-- 
Chris Murphy