Re: BTRFS Mount Delay Time Graph

2018-12-03 Thread Hans van Kranenburg
The relevant code is the loop that reads all block groups in extent-tree.c:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/extent-tree.c#n9982

What the code is doing here is starting at the beginning of the extent
tree, searching forward until it sees the first BLOCK_GROUP_ITEM (which
is not that far away), and then based on the information in it, computes
where the next one will be (just after the end of the vaddr+length of
it), and then jumps over all normal extent items and searches again near
where the next block group item has to be. So, yes, that means that they
depend on each other.

Two possible ways to improve this:

1. Instead, walk the chunk tree (which has all related items packed
together) instead to find out at which locations in the extent tree the
block group items are located and then start getting items in parallel.
If you have storage with a lot of rotating rust that can deliver much
more random reads if you ask for more of them at the same time, then
this can already cause a massive speedup.

2. Move the block group items somewhere else, where they can nicely be
grouped together, so that the number of metadata pages that have to be
looked up is minimal. Quoting from the link below, "slightly tricky
[...] but there are no fundamental obstacles".

https://www.spinics.net/lists/linux-btrfs/msg71766.html

I think the main obstacle here is finding a developer with enough
experience and time to do it. :)

For fun, you can also just read the block group metadata after dropping
caches each time, which should give similar relative timing results as
mounting the filesystem again. (Well, if disk IO wait is the main
slowdown of course.)
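
For example, something along these lines (untested sketch, and writing to
/proc/sys/vm/drop_caches needs root):

#!/usr/bin/python3

import os
import sys
import time
import btrfs

if len(sys.argv) < 2:
    print("Usage: {} <mountpoint>".format(sys.argv[0]))
    sys.exit(1)

# Flush dirty data and drop page cache, dentries and inodes, so that the
# next metadata walk has to hit the disk again, like a fresh mount would.
os.sync()
with open('/proc/sys/vm/drop_caches', 'w') as f:
    f.write('3\n')

fs = btrfs.FileSystem(sys.argv[1])
start = time.monotonic()
for chunk in fs.chunks():
    fs.block_group(chunk.vaddr, chunk.length)
print("reading all block group items took {:.2f}s".format(
    time.monotonic() - start))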

Attached are two example programs, using python-btrfs.

* bg_after_another.py does the same thing as the kernel code I just linked.
* bg_via_chunks.py looks them up based on chunk tree info.

The time that it would take once option 2 above is implemented should be
very similar to the time for just reading the chunk tree (remove the
block group lookup from bg_via_chunks and run that).

Now what's still missing is changing the bg_via_chunks one to start
kicking off the block group searches in parallel, and then you can
predict how long it would take if option 1 were implemented.
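
Something like this, maybe (untested sketch; it assumes that doing the
block group ioctls from a handful of threads on the same FileSystem
object is safe, otherwise give every worker its own FileSystem):

#!/usr/bin/python3

from concurrent.futures import ThreadPoolExecutor
import btrfs
import sys

if len(sys.argv) < 2:
    print("Usage: {} <mountpoint>".format(sys.argv[0]))
    sys.exit(1)

fs = btrfs.FileSystem(sys.argv[1])

# Cheap part: walk the chunk tree to learn where all block groups live.
todo = [(chunk.vaddr, chunk.length) for chunk in fs.chunks()]


def lookup(args):
    # Expensive part: one extent tree search per block group item.
    vaddr, length = args
    fs.block_group(vaddr, length)
    print('.', end='', flush=True)


# Kick off the block group searches in parallel, so the storage gets to
# see multiple outstanding reads at the same time.
with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(lookup, todo))

print()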

\:D/

-- 
Hans van Kranenburg
#!/usr/bin/python3

import btrfs
import sys

if len(sys.argv) < 2:
    print("Usage: {} <mountpoint>".format(sys.argv[0]))
    sys.exit(1)


tree = btrfs.ctree.EXTENT_TREE_OBJECTID
min_key = btrfs.ctree.Key(0, 0, 0)
bufsize = btrfs.utils.SZ_4K


def first_block_group_after(fs, key):
    for header, data in btrfs.ioctl.search_v2(fs.fd, tree, key, buf_size=bufsize):
        if header.type == btrfs.ctree.BLOCK_GROUP_ITEM_KEY:
            return header


fs = btrfs.FileSystem(sys.argv[1])
while True:
    header = first_block_group_after(fs, min_key)
    if header is None:
        break
    min_key = btrfs.ctree.Key(header.objectid + header.offset,
                              btrfs.ctree.BLOCK_GROUP_ITEM_KEY, 0)
    print('.', end='', flush=True)

print()
#!/usr/bin/python3

import btrfs
import sys

if len(sys.argv) < 2:
    print("Usage: {} <mountpoint>".format(sys.argv[0]))
    sys.exit(1)

fs = btrfs.FileSystem(sys.argv[1])
for chunk in fs.chunks():
    fs.block_group(chunk.vaddr, chunk.length)
    print('.', end='', flush=True)

print()


Re: [PATCH 1/2] btrfs: catch cow on deleting snapshots

2018-11-30 Thread Hans van Kranenburg
On 12/1/18 1:19 AM, Qu Wenruo wrote:
> 
> 
> On 2018/12/1 12:52 AM, Josef Bacik wrote:
>> From: Josef Bacik 
>>
>> When debugging some weird extent reference bug I suspected that we were
>> changing a snapshot while we were deleting it, which could explain my
>> bug.
> 
> May I ask under which case we're going to modify an unlinked snapshot?
> 
> Maybe metadata relocation?

I have some old logging from a filesystem that got corrupted when doing
a subvolume delete while doing metadata balance:

https://syrinx.knorrie.org/~knorrie/btrfs/domokun-balance-skinny-lockup.txt

This is one of the very few moments when I really lost a btrfs
filesystem and had to mkfs again to move on.

I'm wondering if this one looks related. I never found an explanation
for it.

Hans

> 
> Thanks,
> Qu
> 
>> This was indeed what was happening, and this patch helped me
>> verify my theory.  It is never correct to modify the snapshot once it's
>> being deleted, so mark the root when we are deleting it and make sure we
>> complain about it when it happens.
>>
>> Signed-off-by: Josef Bacik 
>> ---
>>  fs/btrfs/ctree.c   | 3 +++
>>  fs/btrfs/ctree.h   | 1 +
>>  fs/btrfs/extent-tree.c | 9 +
>>  3 files changed, 13 insertions(+)
>>
>> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
>> index 5912a97b07a6..5f82f86085e8 100644
>> --- a/fs/btrfs/ctree.c
>> +++ b/fs/btrfs/ctree.c
>> @@ -1440,6 +1440,9 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle 
>> *trans,
>>  u64 search_start;
>>  int ret;
>>  
>> +if (test_bit(BTRFS_ROOT_DELETING, &root->state))
>> +WARN(1, KERN_CRIT "cow'ing blocks on a fs root thats being dropped\n");
>> +
>>  if (trans->transaction != fs_info->running_transaction)
>>  WARN(1, KERN_CRIT "trans %llu running %llu\n",
>> trans->transid,
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index facde70c15ed..5a3a94ccb65c 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -1199,6 +1199,7 @@ enum {
>>  BTRFS_ROOT_FORCE_COW,
>>  BTRFS_ROOT_MULTI_LOG_TASKS,
>>  BTRFS_ROOT_DIRTY,
>> +BTRFS_ROOT_DELETING,
>>  };
>>  
>>  /*
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 581c2a0b2945..dcb699dd57f3 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -9333,6 +9333,15 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
>>  if (block_rsv)
>>  trans->block_rsv = block_rsv;
>>  
>> +/*
>> + * This will help us catch people modifying the fs tree while we're
>> + * dropping it.  It is unsafe to mess with the fs tree while it's being
>> + * dropped as we unlock the root node and parent nodes as we walk down
>> + * the tree, assuming nothing will change.  If something does change
>> + * then we'll have stale information and drop references to blocks we've
>> + * already dropped.
>> + */
>> +set_bit(BTRFS_ROOT_DELETING, &root->state);
>> +if (btrfs_disk_key_objectid(&root_item->drop_progress) == 0) {
>>  level = btrfs_header_level(root->node);
>>  path->nodes[level] = btrfs_lock_root_node(root);
>>
> 


-- 
Hans van Kranenburg





Re: [PATCH 0/6] Chunk allocator DUP fix and cleanups

2018-11-13 Thread Hans van Kranenburg
On 11/13/18 4:03 PM, David Sterba wrote:
> On Thu, Oct 11, 2018 at 07:40:22PM +0000, Hans van Kranenburg wrote:
>> On 10/11/2018 05:13 PM, David Sterba wrote:
>>> On Thu, Oct 04, 2018 at 11:24:37PM +0200, Hans van Kranenburg wrote:
>>>> This patch set contains an additional fix for a newly exposed bug after
>>>> the previous attempt to fix a chunk allocator bug for new DUP chunks:
>>>>
>>>> https://lore.kernel.org/linux-btrfs/782f6000-30c0-0085-abd2-74ec5827c...@mendix.com/T/#m609ccb5d32998e8ba5cfa9901c1ab56a38a6f374
>>>>
>>>> The DUP fix is "fix more DUP stripe size handling". I did that one
>>>> before starting to change more things so it can be applied to earlier
>>>> LTS kernels.
>>>>
>>>> Besides that patch, which is fixing the bug in a way that is least
>>>> intrusive, I added a bunch of other patches to help getting the chunk
>>>> allocator code in a state that is a bit less error-prone and
>>>> bug-attracting.
>>>>
>>>> When running this and trying the reproduction scenario, I can now see
>>>> that the created DUP device extent is 827326464 bytes long, which is
>>>> good.
>>>>
>>>> I wrote and tested this on top of linus 4.19-rc5. I still need to create
>>>> a list of related use cases and test more things to at least walk
>>>> through a bunch of obvious use cases to see if there's nothing exploding
>>>> too quickly with these changes. However, I'm quite confident about it,
>>>> so I'm sharing all of it already.
>>>>
>>>> Any feedback and review is appreciated. Be gentle and keep in mind that
>>>> I'm still very much in a learning stage regarding kernel development.
>>>
>>> The patches look good, thanks. Problem is explained, preparatory work is
>>> separated from the fix itself.
>>
>> \o/
>>
>>>> The stable patches handling workflow is not 100% clear to me yet. I
>>>> guess I have to add a Fixes: in the DUP patch which points to the
>>>> previous commit 92e222df7b.
>>>
>>> Almost nobody does it right, no worries. If you can identify a single
>>> patch that introduces a bug then it's for Fixes:, otherwise a CC: stable
>>> with version where it makes sense & applies is enough. I do that check
>>> myself regardless of what's in the patch.
>>
>> It's 92e222df7b and the thing I'm not sure about is if it also will
>> catch the same patch which was already backported to LTS kernels since
>> 92e222df7b also has Fixes in it... So by now the new bug is in 4.19,
>> 4.14, 4.9, 4.4, 3.16...
>>
>>> I ran the patches in a VM and hit a division-by-zero in test
>>> fstests/btrfs/011, stacktrace below. First guess is that it's caused by
>>> patch 3/6.
>>
>> Ah interesting, dev replace.
>>
>> I'll play around with replace and find out how to run the tests properly
>> and then reproduce this.
>>
>> The code introduced in patch 3 is removed again in patch 6, so I don't
>> suspect that one. :)
>>
>> But, I'll find out.
>>
>> Thanks for testing.
> 
> I've merged patches 1-5 to misc-next as they seem to be safe and pass
> fstests, please let me know when you have updates to the last one.
> Thanks.

I'll definitely follow up on that. It has not happened because something
about todo lists and ordering work and not doing too many things at the
same time.

Thanks for already confirming that it was not patch 3 but patch 6 that has
the problem; I already suspected that.

For patch 3, the actual fix, how do we get that on top of old stable
kernels where the previous fix (92e222df7b) is in? Because of the Fixes:
line, that one was picked up again in 4.14, 4.9, 4.4 and even 3.16. How does
this work? Does it need all the other commit ids from those branches in
Fixes: lines? Or is there magic somewhere that does this?

From your "development cycle open" mails, I understand that for testing
this myself, I should test based on misc-next, for-next or for-x.y, in that
order, depending on where the first 5 patches have ended up at that moment?

Hans


Re: [PATCH 1/5] btrfs: Ensure at least 1g is free for balance

2018-10-26 Thread Hans van Kranenburg
On 10/26/18 2:16 PM, Nikolay Borisov wrote:
> 
> (Adding Chris to CC since he is the original author of the code)
> 
> On 26.10.2018 15:09, Hans van Kranenburg wrote:
>> On 10/26/18 1:43 PM, Nikolay Borisov wrote:
>>> The first part of balance operation is to shrink every constituting
>>> device to ensure there is free space for chunk allocation.
>>
>> A very useful thing to be able to do if there's no unallocated raw disk
>> space left, is use balance to rewrite some blockgroup into the free
>> space of other ones.
>>
>> This does not need any raw space for a chunk allocation. Requiring so
>> makes it unnecessarily more complicated for users to fix the situation
>> they got in.
> 
> The current logic of the code doesn't preclude this use case since if we
> can't shrink we just proceed to relocation. So even if 1g is replaced
> with 5eb and shrink_device always fails balancing will still proceed.
> 
> However, all of this really makes me wonder do we even need this "first
> stage" code in balancing (Chris, any thoughts?).

Having something that tries to free up 1G on each device can be very
useful when converting to another profile. I guess that's why it is there.

And shrink device + grow device is a nice way to try packing some data
into other existing block groups. Only... if shrink+grow is done for
each device one after the other, and there's one with 1G unallocated left,
and it can't easily move it into existing free space, then it may just
be moving data around to the next disk all the time I guess, ending up
with the same situation as before. :D

>>> However, the code
>>> has been buggy ever since its introduction since calculating the space to
>>> shrink the device by was bounded by 1 mb. Most likely the original intention
>>> was to have an upper bound of 1g and not 1m, since the largest chunk size is
>>> 1g. This means the first stage in __btrfs_balance so far has been a null op
>>> since it effectively freed just a single megabyte.
>>>
>>> Fix this by setting an upper bound of size_to_free of 1g. 
>>>
>>> Signed-off-by: Nikolay Borisov 
>>> ---
>>>  fs/btrfs/volumes.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index f435d397019e..8b0fd7bf3447 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -3467,7 +3467,7 @@ static int __btrfs_balance(struct btrfs_fs_info 
>>> *fs_info)
>>> list_for_each_entry(device, devices, dev_list) {
>>> old_size = btrfs_device_get_total_bytes(device);
>>> size_to_free = div_factor(old_size, 1);
>>> -       size_to_free = min_t(u64, size_to_free, SZ_1M);
>>> +   size_to_free = min_t(u64, size_to_free, SZ_1G);
>>> if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) ||
>>> btrfs_device_get_total_bytes(device) -
>>> btrfs_device_get_bytes_used(device) > size_to_free ||
>>>
>>
>>


-- 
Hans van Kranenburg


Re: [PATCH 1/5] btrfs: Ensure at least 1g is free for balance

2018-10-26 Thread Hans van Kranenburg
On 10/26/18 1:43 PM, Nikolay Borisov wrote:
> The first part of balance operation is to shrink every constituting
> device to ensure there is free space for chunk allocation.

A very useful thing to be able to do if there's no unallocated raw disk
space left, is use balance to rewrite some blockgroup into the free
space of other ones.

This does not need any raw space for a chunk allocation. Requiring so
makes it unnecessarily more complicated for users to fix the situation
they got in.

> However, the code
> has been buggy ever since its introduction since calculating the space to
> shrink the device by was bounded by 1 mb. Most likely the original intention
> was to have an upper bound of 1g and not 1m, since the largest chunk size is
> 1g. This means the first stage in __btrfs_balance so far has been a null op
> since it effectively freed just a single megabyte.
> 
> Fix this by setting an upper bound of size_to_free of 1g. 
> 
> Signed-off-by: Nikolay Borisov 
> ---
>  fs/btrfs/volumes.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index f435d397019e..8b0fd7bf3447 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -3467,7 +3467,7 @@ static int __btrfs_balance(struct btrfs_fs_info 
> *fs_info)
>   list_for_each_entry(device, devices, dev_list) {
>   old_size = btrfs_device_get_total_bytes(device);
>   size_to_free = div_factor(old_size, 1);
> - size_to_free = min_t(u64, size_to_free, SZ_1M);
> + size_to_free = min_t(u64, size_to_free, SZ_1G);
>   if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) ||
>   btrfs_device_get_total_bytes(device) -
>   btrfs_device_get_bytes_used(device) > size_to_free ||
> 


-- 
Hans van Kranenburg


Re: [PATCH 0/6] Chunk allocator DUP fix and cleanups

2018-10-11 Thread Hans van Kranenburg
On 10/11/2018 09:40 PM, Hans van Kranenburg wrote:
> On 10/11/2018 05:13 PM, David Sterba wrote:
>> On Thu, Oct 04, 2018 at 11:24:37PM +0200, Hans van Kranenburg wrote:
>>> This patch set contains an additional fix for a newly exposed bug after
>>> the previous attempt to fix a chunk allocator bug for new DUP chunks:
>>>
>>> https://lore.kernel.org/linux-btrfs/782f6000-30c0-0085-abd2-74ec5827c...@mendix.com/T/#m609ccb5d32998e8ba5cfa9901c1ab56a38a6f374
>>>
>>> The DUP fix is "fix more DUP stripe size handling". I did that one
>>> before starting to change more things so it can be applied to earlier
>>> LTS kernels.
>>>
>>> Besides that patch, which is fixing the bug in a way that is least
>>> intrusive, I added a bunch of other patches to help getting the chunk
>>> allocator code in a state that is a bit less error-prone and
>>> bug-attracting.
>>>
>>> When running this and trying the reproduction scenario, I can now see
>>> that the created DUP device extent is 827326464 bytes long, which is
>>> good.
>>>
>>> I wrote and tested this on top of linus 4.19-rc5. I still need to create
>>> a list of related use cases and test more things to at least walk
>>> through a bunch of obvious use cases to see if there's nothing exploding
>>> too quickly with these changes. However, I'm quite confident about it,
>>> so I'm sharing all of it already.
>>>
>>> Any feedback and review is appreciated. Be gentle and keep in mind that
>>> I'm still very much in a learning stage regarding kernel development.
>>
>> The patches look good, thanks. Problem is explained, preparatory work is
>> separated from the fix itself.
> 
> \o/
> 
>>> The stable patches handling workflow is not 100% clear to me yet. I
>>> guess I have to add a Fixes: in the DUP patch which points to the
>>> previous commit 92e222df7b.
>>
>> Almost nobody does it right, no worries. If you can identify a single
>> patch that introduces a bug then it's for Fixes:, otherwise a CC: stable
>> with version where it makes sense & applies is enough. I do that check
>> myself regardless of what's in the patch.
> 
> It's 92e222df7b and the thing I'm not sure about is if it also will
> catch the same patch which was already backported to LTS kernels since
> 92e222df7b also has Fixes in it... So by now the new bug is in 4.19,
> 4.14, 4.9, 4.4, 3.16...
> 
>> I ran the patches in a VM and hit a division-by-zero in test
>> fstests/btrfs/011, stacktrace below. First guess is that it's caused by
>> patch 3/6.
> 
> Ah interesting, dev replace.
> 
> I'll play around with replace and find out how to run the tests properly
> and then reproduce this.
> 
> The code introduced in patch 3 is removed again in patch 6, so I don't
> suspect that one. :)

Actually, while writing this I realize that this means it should be
tested separately (like, older kernel with only 3), because, well,
obvious I guess.

> But, I'll find out.
> 
> Thanks for testing.
> 
> Hans
> 
>> [ 3116.065595] BTRFS: device fsid e3bd8db5-304f-4b1a-8488-7791ea94088f devid 
>> 1 transid 5 /dev/vdb
>> [ 3116.071274] BTRFS: device fsid e3bd8db5-304f-4b1a-8488-7791ea94088f devid 
>> 2 transid 5 /dev/vdc
>> [ 3116.087086] BTRFS info (device vdb): disk space caching is enabled
>> [ 3116.088644] BTRFS info (device vdb): has skinny extents
>> [ 3116.089796] BTRFS info (device vdb): flagging fs with big metadata feature
>> [ 3116.093971] BTRFS info (device vdb): checking UUID tree
>> [ 3125.853755] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) 
>> to /dev/vdd started
>> [ 3125.860269] divide error:  [#1] PREEMPT SMP
>> [ 3125.861264] CPU: 1 PID: 6477 Comm: btrfs Not tainted 4.19.0-rc7-default+ 
>> #288
>> [ 3125.862841] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>> 1.0.0-prebuilt.qemu-project.org 04/01/2014
>> [ 3125.865385] RIP: 0010:__btrfs_alloc_chunk+0x368/0xa70 [btrfs]
>> [ 3125.870541] RSP: 0018:a4ea0409fa48 EFLAGS: 00010206
>> [ 3125.871862] RAX: 0400 RBX: 94e074374508 RCX: 
>> 0002
>> [ 3125.873587] RDX:  RSI: 94e017818c80 RDI: 
>> 0200
>> [ 3125.874677] RBP: 8080 R08:  R09: 
>> 0002
>> [ 3125.875816] R10: 0003 R11: 8090 R12: 
>> 
>> [ 3125.876742] R13: 0001 R14: 0001 R15: 
>> 0002
>> [ 3125.877657] FS:  7f6de3

Re: [PATCH 0/6] Chunk allocator DUP fix and cleanups

2018-10-11 Thread Hans van Kranenburg
On 10/11/2018 05:13 PM, David Sterba wrote:
> On Thu, Oct 04, 2018 at 11:24:37PM +0200, Hans van Kranenburg wrote:
>> This patch set contains an additional fix for a newly exposed bug after
>> the previous attempt to fix a chunk allocator bug for new DUP chunks:
>>
>> https://lore.kernel.org/linux-btrfs/782f6000-30c0-0085-abd2-74ec5827c...@mendix.com/T/#m609ccb5d32998e8ba5cfa9901c1ab56a38a6f374
>>
>> The DUP fix is "fix more DUP stripe size handling". I did that one
>> before starting to change more things so it can be applied to earlier
>> LTS kernels.
>>
>> Besides that patch, which is fixing the bug in a way that is least
>> intrusive, I added a bunch of other patches to help getting the chunk
>> allocator code in a state that is a bit less error-prone and
>> bug-attracting.
>>
>> When running this and trying the reproduction scenario, I can now see
>> that the created DUP device extent is 827326464 bytes long, which is
>> good.
>>
>> I wrote and tested this on top of linus 4.19-rc5. I still need to create
>> a list of related use cases and test more things to at least walk
>> through a bunch of obvious use cases to see if there's nothing exploding
>> too quickly with these changes. However, I'm quite confident about it,
>> so I'm sharing all of it already.
>>
>> Any feedback and review is appreciated. Be gentle and keep in mind that
>> I'm still very much in a learning stage regarding kernel development.
> 
> The patches look good, thanks. Problem is explained, preparatory work is
> separated from the fix itself.

\o/

>> The stable patches handling workflow is not 100% clear to me yet. I
>> guess I have to add a Fixes: in the DUP patch which points to the
>> previous commit 92e222df7b.
> 
> Almost nobody does it right, no worries. If you can identify a single
> patch that introduces a bug then it's for Fixes:, otherwise a CC: stable
> with version where it makes sense & applies is enough. I do that check
> myself regardless of what's in the patch.

It's 92e222df7b and the thing I'm not sure about is if it also will
catch the same patch which was already backported to LTS kernels since
92e222df7b also has Fixes in it... So by now the new bug is in 4.19,
4.14, 4.9, 4.4, 3.16...

> I ran the patches in a VM and hit a division-by-zero in test
> fstests/btrfs/011, stacktrace below. First guess is that it's caused by
> patch 3/6.

Ah interesting, dev replace.

I'll play around with replace and find out how to run the tests properly
and then reproduce this.

The code introduced in patch 3 is removed again in patch 6, so I don't
suspect that one. :)

But, I'll find out.

Thanks for testing.

Hans

> [ 3116.065595] BTRFS: device fsid e3bd8db5-304f-4b1a-8488-7791ea94088f devid 
> 1 transid 5 /dev/vdb
> [ 3116.071274] BTRFS: device fsid e3bd8db5-304f-4b1a-8488-7791ea94088f devid 
> 2 transid 5 /dev/vdc
> [ 3116.087086] BTRFS info (device vdb): disk space caching is enabled
> [ 3116.088644] BTRFS info (device vdb): has skinny extents
> [ 3116.089796] BTRFS info (device vdb): flagging fs with big metadata feature
> [ 3116.093971] BTRFS info (device vdb): checking UUID tree
> [ 3125.853755] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) 
> to /dev/vdd started
> [ 3125.860269] divide error:  [#1] PREEMPT SMP
> [ 3125.861264] CPU: 1 PID: 6477 Comm: btrfs Not tainted 4.19.0-rc7-default+ 
> #288
> [ 3125.862841] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.0.0-prebuilt.qemu-project.org 04/01/2014
> [ 3125.865385] RIP: 0010:__btrfs_alloc_chunk+0x368/0xa70 [btrfs]
> [ 3125.870541] RSP: 0018:a4ea0409fa48 EFLAGS: 00010206
> [ 3125.871862] RAX: 0400 RBX: 94e074374508 RCX: 
> 0002
> [ 3125.873587] RDX:  RSI: 94e017818c80 RDI: 
> 0200
> [ 3125.874677] RBP: 8080 R08:  R09: 
> 0002
> [ 3125.875816] R10: 0003 R11: 8090 R12: 
> 
> [ 3125.876742] R13: 0001 R14: 0001 R15: 
> 0002
> [ 3125.877657] FS:  7f6de34208c0() GS:94e07d60() 
> knlGS:
> [ 3125.878862] CS:  0010 DS:  ES:  CR0: 80050033
> [ 3125.880080] CR2: 7ffe963d5ce8 CR3: 7659b000 CR4: 
> 06e0
> [ 3125.881485] Call Trace:
> [ 3125.882105]  do_chunk_alloc+0x266/0x3e0 [btrfs]
> [ 3125.882841]  btrfs_inc_block_group_ro+0x10e/0x160 [btrfs]
> [ 3125.883875]  scrub_enumerate_chunks+0x18b/0x5d0 [btrfs]
> [ 3125.884658]  ? is_module_address+0x11/0x30
> [ 3125.885271]  ? wait_for_completion+0x160/0x190
> [ 3125.885928]  btrfs_scrub_dev+0x1b8/0x5a0 [btrfs]
>

Re: Scrub aborts due to corrupt leaf

2018-10-10 Thread Hans van Kranenburg
On 10/10/2018 07:44 PM, Chris Murphy wrote:
> On Wed, Oct 10, 2018 at 10:04 AM, Holger Hoffstätte
>  wrote:
>> On 10/10/18 17:44, Larkin Lowrey wrote:
>> (..)
>>>
>>> About once a week, or so, I'm running into the above situation where
>>> FS seems to deadlock. All IO to the FS blocks, there is no IO
>>> activity at all. I have to hard reboot the system to recover. There
>>> are no error indications except for the following which occurs well
>>> before the FS freezes up:
>>>
>>> BTRFS warning (device dm-3): block group 78691883286528 has wrong amount
>>> of free space
>>> BTRFS warning (device dm-3): failed to load free space cache for block
>>> group 78691883286528, rebuilding it now
>>>
>>> Do I have any options other the nuking the FS and starting over?
>>
>>
>> Unmount cleanly & mount again with -o space_cache=v2.
> 
> I'm pretty sure you have to umount, and then clear the space_cache
> with 'btrfs check --clear-space-cache=v1' and then do a one time mount
> with -o space_cache=v2.

The --clear-space-cache=v1 is optional, but recommended if you are
someone who does not like to keep accumulated cruft.

The v2 mount (rw mount!!!) does not remove the v1 cache. If you just
mount with v2, the v1 data just stays there, doing nothing any more.

> But anyway, to me that seems premature because we don't even know
> what's causing the problem.
> 
> a. Freezing means there's a kernel bug. Hands down.
> b. Is it freezing on the rebuild? Or something else?
> c. I think the devs would like to see the output from btrfs-progs
> v4.17.1, 'btrfs check --mode=lowmem' and see if it finds anything, in
> particular something not related to free space cache.
> 
> Rebuilding either version of space cache requires successfully reading
> (and parsing) the extent tree.
> 
> 


-- 
Hans van Kranenburg


Re: [PATCH v3 4/6] btrfs-progs: lowmem check: Add dev_item check for used bytes and total bytes

2018-10-09 Thread Hans van Kranenburg
On 10/09/2018 03:14 AM, Qu Wenruo wrote:
> 
> 
>> On 2018/10/9 6:20 AM, Hans van Kranenburg wrote:
>> On 10/08/2018 02:30 PM, Qu Wenruo wrote:
>>> Obviously, used bytes can't be larger than total bytes.
>>>
>>> Signed-off-by: Qu Wenruo 
>>> ---
>>>  check/mode-lowmem.c | 5 +
>>>  1 file changed, 5 insertions(+)
>>>
>>> diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
>>> index 07c03cad77af..1173b963b8f3 100644
>>> --- a/check/mode-lowmem.c
>>> +++ b/check/mode-lowmem.c
>>> @@ -4074,6 +4074,11 @@ static int check_dev_item(struct btrfs_fs_info 
>>> *fs_info,
>>> used = btrfs_device_bytes_used(eb, dev_item);
>>> total_bytes = btrfs_device_total_bytes(eb, dev_item);
>>>  
>>> +   if (used > total_bytes) {
>>> +   error("device %llu has incorrect used bytes %llu > total bytes 
>>> %llu",
>>> +   dev_id, used, total_bytes);
>>> +   return ACCOUNTING_MISMATCH;
>>
>> The message and return code point at an error in accounting logic.
> 
> One of the biggest problem in lowmem is we don't always have the error
> code we really want.
> 
> And that's the case for this error message.
> It's indeed not an accounting error, as in that case (just like that
> test case image) the used bytes is correct accounted.
> 
> The good news is, the return value is never really used to classify the
> error.
> So as long as the error message makes sense, it's not a big problem.

Aha. Clear, thanks for explaining.

> 
> Thanks,
> Qu
> 
>>
>> However, if you have a fully allocated device and a DUP chunk ending
>> beyond the end of the device, then having used > total_bytes is expected...
>>
>> So maybe there's two possibilities... There's an error in the accounting
>> logic, or there's an "over-allocation", which is another type of issue
>> which produces used > total with correct accounting logic.
>>
>>> +   }
>>> key.objectid = dev_id;
>>> key.type = BTRFS_DEV_EXTENT_KEY;
>>> key.offset = 0;
>>>
>>
>>
> 


-- 
Hans van Kranenburg


Re: merge status of per-chunk degradable check [was Re: Which device is missing ?]

2018-10-08 Thread Hans van Kranenburg
On 10/09/2018 02:08 AM, Nicholas D Steeves wrote:
> On Mon, Oct 08, 2018 at 04:10:55PM +, Hugo Mills wrote:
>> On Mon, Oct 08, 2018 at 03:49:53PM +0200, Pierre Couderc wrote:
>>> I ma trying to make a "RAID1" with /dev/sda2 ans /dev/sdb (or similar).
>>>
>>> But I have stranges status or errors  about "missing devices" and I
>>> do not understand the current situation :
> [...]
>>Note that, since the main FS is missing a device, it will probably
>> need to be mounted in degraded mode (-o degraded), and that on kernels
>> earlier than (IIRC) 4.14, this can only be done *once* without the FS
>> becoming more or less permanently read-only. On recent kernels, it
>> _should_ be OK.
>>
>> *WARNING ENDS*
> 
> I think this was the patch that addressed this?:
>   https://www.spinics.net/lists/linux-btrfs/msg47283.html
>   https://patchwork.kernel.org/patch/7226931/
> 
> In my notes it wasn't present in <= 4.14.15, but my notes might be
> wrong.  Does this patch resolve the one-shot -o degraded, reboot,
> forever read-only behaviour, or is something else required?  When was
> it merged?  Has it been or will it be backported to 4.14.x?  I'm
> guessing 4.9.x is too far back, but it would be really nice to see it
> there too :-)
> 
> Also, will this issue be resolved for linux-4.19?  If so I'd like to
> update the Debian btrfs wiki with this good news :-)

[...]

> P.S. Please let me know if you'd prefer for me to shift this
> documentation effort to btrfs.wiki.kernel.org.

Yes, absolutely. This is not specific to how we do things for Debian.
Upstream documentation can help all distros.

-- 
Hans van Kranenburg





Re: [PATCH 0/6] Chunk allocator DUP fix and cleanups

2018-10-08 Thread Hans van Kranenburg
On 10/08/2018 03:19 PM, Hans van Kranenburg wrote:
> On 10/08/2018 08:43 AM, Qu Wenruo wrote:
>>
>>
>> On 2018/10/5 6:58 PM, Hans van Kranenburg wrote:
>>> On 10/05/2018 09:51 AM, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2018/10/5 5:24 AM, Hans van Kranenburg wrote:
>>>>> This patch set contains an additional fix for a newly exposed bug after
>>>>> the previous attempt to fix a chunk allocator bug for new DUP chunks:
>>>>>
>>>>> https://lore.kernel.org/linux-btrfs/782f6000-30c0-0085-abd2-74ec5827c...@mendix.com/T/#m609ccb5d32998e8ba5cfa9901c1ab56a38a6f374
>>>>
>>>> For that bug, did you succeeded in reproducing the bug?
>>>
>>> Yes, open the above link and scroll to "Steps to reproduce".
>>
>> That's beyond device boundary one. Also reproduced here.
>> And hand-crafted a super small image as test case for btrfs-progs.
>>
>> But I'm a little curious about the dev extent overlapping case.
>> Have you got one?
> 
> Ah ok, I see. No, I didn't do that yet.
> 
> By using unmodified tooling, I think this can be done by a combination
> of a few resizings and using very specific balance to cause a fs of
> exactly 7880MiB again with a single 1578MiB gap inside...
> 
> I'll try later today to see if I can come up with a recipe for this.

Ok, this is actually pretty simple to do:

 >8 

-# mkdir bork
-# cd bork
-# dd if=/dev/zero of=image bs=1 count=0 seek=1024M
0+0 records in
0+0 records out
0 bytes copied, 0.000183343 s, 0.0 kB/s
-# mkfs.btrfs -d dup -m dup image
-# losetup -f image
-# mount -o space_cache=v2 /dev/loop0 mountpoint
-# dd if=/dev/zero of=mountpoint/flapsie bs=1M
dd: error writing 'mountpoint/flapsie': No space left on device
453+0 records in
452+0 records out
474185728 bytes (474 MB, 452 MiB) copied, 4.07663 s, 116 MB/s

 >8 

-# ./show_usage.py /bork/mountpoint/
Target profile for SYSTEM (chunk tree): DUP
Target profile for METADATA: DUP
Target profile for DATA: DUP
Mixed groups: False

Virtual space usage by block group type:
|
| type             total          used
| --------------   ----------     ----------
| Data             452.31MiB      452.22MiB
| System           8.00MiB        16.00KiB
| Metadata         51.19MiB       656.00KiB

Total raw filesystem size: 1.00GiB
Total raw allocated bytes: 1023.00MiB
Allocatable bytes remaining: 1.00MiB
Unallocatable raw bytes: 0.00B
Unallocatable bytes that can be reclaimed by balancing: 0.00B

Estimated virtual space left to use for metadata: 51.05MiB
Estimated virtual space left to use for data: 96.00KiB

Allocated raw disk bytes by chunk type. Parity is a reserved part of the
allocated bytes, limiting the amount that can be used for data or metadata:
|
| flags            allocated      used          parity
| --------------   ----------     ----------    ------
| DATA|DUP         904.62MiB      904.44MiB     0.00B
| SYSTEM|DUP       16.00MiB       32.00KiB      0.00B
| METADATA|DUP     102.38MiB      1.28MiB       0.00B

Allocated bytes per device:
|
| devid   total size    allocated      path
| -----   ----------    ----------     ----------
| 1       1.00GiB       1023.00MiB     /dev/loop0

Allocated bytes per device, split up per chunk type. Parity bytes are again
part of the total amount of allocated bytes.
|
| Device ID: 1
| | flags            allocated      parity
| | --------------   ----------     ------
| | DATA|DUP         904.62MiB      0.00B
| | SYSTEM|DUP       16.00MiB       0.00B
| | METADATA|DUP     102.38MiB      0.00B

Unallocatable raw bytes per device:
|
| devid   unallocatable
| -----   -------------
| 1       0.00B

 >8 

Now we're gonna write a neat 1578MiB of data that we're going to free up
later to reproduce the failure:

-# dd if=/dev/zero of=image bs=1 count=0 seek=2602M
0+0 records in
0+0 records out
0 bytes copied, 0.000188621 s, 0.0 kB/s
-# losetup -c /dev/loop0
-# btrfs fi resize max mountpoint/
Resize 'mountpoint/' of 'max'
-# dd if=/dev/zero of=mountpoint/1578MiB bs=1M
dd: error writing 'mountpoint/1578MiB': No space left on device
790+0 records in
789+0 records out
827326464 bytes (827 MB, 789 MiB) copied, 12.2452 s, 67.6 MB/s

 >8 

-# python3
import btrfs
fs = btrfs.FileSystem('/bork/mountpoint/')
for d in fs.dev_extents():
  print("start {} end {} vaddr {}".format(d.paddr, d.paddr + d.length,
d.vaddr))

start 1048576 end 11534336 vaddr 547880960
start 11534336 end 22020096 vaddr 547880960
start 22020096 end 30408704 vaddr 22020096
start 30408704 end 38797312 vaddr 22020096
start 38797312 end 92471296 vaddr 30408704
start 92471296 end 146145280 vaddr 30408704
start 146145280 end 213254144 vaddr 84082688
start 213254144 end 280363008 vaddr 84082688
start 280363008 end 397803520 vaddr

Re: [PATCH v3 4/6] btrfs-progs: lowmem check: Add dev_item check for used bytes and total bytes

2018-10-08 Thread Hans van Kranenburg
On 10/08/2018 02:30 PM, Qu Wenruo wrote:
> Obviously, used bytes can't be larger than total bytes.
> 
> Signed-off-by: Qu Wenruo 
> ---
>  check/mode-lowmem.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
> index 07c03cad77af..1173b963b8f3 100644
> --- a/check/mode-lowmem.c
> +++ b/check/mode-lowmem.c
> @@ -4074,6 +4074,11 @@ static int check_dev_item(struct btrfs_fs_info 
> *fs_info,
>   used = btrfs_device_bytes_used(eb, dev_item);
>   total_bytes = btrfs_device_total_bytes(eb, dev_item);
>  
> + if (used > total_bytes) {
> + error("device %llu has incorrect used bytes %llu > total bytes 
> %llu",
> + dev_id, used, total_bytes);
> + return ACCOUNTING_MISMATCH;

The message and return code point at an error in accounting logic.

However, if you have a fully allocated device and a DUP chunk ending
beyond the end of the device, then having used > total_bytes is expected...

So maybe there's two possibilities... There's an error in the accounting
logic, or there's an "over-allocation", which is another type of issue
which produces used > total with correct accounting logic.

> + }
>   key.objectid = dev_id;
>   key.type = BTRFS_DEV_EXTENT_KEY;
>   key.offset = 0;
> 


-- 
Hans van Kranenburg


Re: Curious problem: btrfs device stats & unpriviliged access

2018-10-08 Thread Hans van Kranenburg
On 10/08/2018 06:37 PM, Holger Hoffstätte wrote:
> On 10/08/18 17:46, Hans van Kranenburg wrote:
> 
>> fs.devices() also looks for dev_items in the chunk tree:
>>
>> https://github.com/knorrie/python-btrfs/blob/master/btrfs/ctree.py#L481
>>
>> So, BOOM! you need root.
>>
>> Or just start at 0, ignore errors and keep trying devids until you have
>> found num_devices of them that work, yolo.
> 
> Since I need to walk /sys/fs/btrfs/ anyway I *think* I can just look
> at the entries in /sys/fs/btrfs/<fsid>/devices/ and query them all
> directly.

But, you still need root for that right? The progs code does a RO open
directly on the block device.

-$ btrfs dev stats /dev/xvdb
ERROR: cannot open /dev/xvdb: Permission denied
ERROR: '/dev/xvdb' is not a mounted btrfs device

stat("/dev/loop0", {st_mode=S_IFBLK|0660, st_rdev=makedev(7, 0), ...}) = 0
stat("/dev/loop0", {st_mode=S_IFBLK|0660, st_rdev=makedev(7, 0), ...}) = 0
open("/dev/loop0", O_RDONLY)= -1 EACCES (Permission denied)

But:

-# btrfs dev stats /dev/xvdb
[/dev/xvdb].write_io_errs0
[/dev/xvdb].read_io_errs 0
[/dev/xvdb].flush_io_errs0
[/dev/xvdb].corruption_errs  0
[/dev/xvdb].generation_errs  0

> The skeleton btrfs_exporter already responds to http requests and
> returns dummy metrics, using the official python client library.
> I've found a nice little python sysfs scraper and now only need to
> figure out how best to map the btrfs info in sysfs to useful metrics.
> The privileged access issue would only have come into play much later,
> but it seems I can avoid it after all, which is great.
> I'm also (re-)learning python in the process, so that's the actual
> thing slowing me down..

:)

-- 
Hans van Kranenburg


Re: Curious problem: btrfs device stats & unpriviliged access

2018-10-08 Thread Hans van Kranenburg
On 10/08/2018 05:29 PM, Holger Hoffstätte wrote:
> On 10/08/18 16:40, Hans van Kranenburg wrote:
>>> Looking at the kernel side of things in fs/btrfs/ioctl.c I see both
>>> BTRFS_IOC_TREE_SEARCH[_V2] unconditionally require CAP_SYS_ADMIN.
>>
>> That's the tree search ioctl, for reading arbitrary metadata.
>>
>> The device stats ioctl is IOC_GET_DEV_STATS...
> 
> Yeah..OK, that is clear and gave me the hint to ask: why is it
> calling this in the first place? And as it turns out [1] is where
> it seems to go wrong, as is_block_device() returns 0 (as it should),
> fi_args.num_devices is never set (remains 0) and it proceeds to call
> everything below, eventually calling the BTRFS_IOC_FS_INFO ioctl in
> #1712. And that works fine:
> 
> 1711 if (fi_args->num_devices != 1) {
> (gdb) s
> 1712    ret = ioctl(fd, BTRFS_IOC_FS_INFO, fi_args);
> (gdb) s
> 1713    if (ret < 0) {
> (gdb) p ret
> $28 = 0
> (gdb) p *fi_args
> $30 = {
>   max_id = 1,
>   num_devices = 1,
>   fsid = "z%:\371\315\033A\203\267.\200\255;FH\221",
>   nodesize = 16384,
>   sectorsize = 4096,
>   clone_alignment = 4096,
>   reserved32 = 0,
>   reserved = {0 }
> }
> 
> It's only when it goes into search_chunk_tree_for_fs_info()
> where things fail due to CAP_SYS_ADMIN.
> 
> And all this explains the actual bug: when I call btrfs device stats
> not on the mountpoint (as I've been trying all this time) but rather
> on the device, it works properly right away as regular user:
> 
> (gdb) set args device stats /dev/loop0
> (gdb) r
> Breakpoint 1, get_fs_info (path=path@entry=0x7fffe527 "/dev/loop0",
> fi_args=fi_args@entry=0x7fffd400,
>     di_ret=di_ret@entry=0x7fffd3f0) at utils.c:1652
> 1652    {
> (gdb) c
> Continuing.
> [/dev/loop0].write_io_errs    0
> [/dev/loop0].read_io_errs 0
> [/dev/loop0].flush_io_errs    0
> [/dev/loop0].corruption_errs  0
> [/dev/loop0].generation_errs  0
> [Inferior 1 (process 2805) exited normally]
> 
> So this is simply a discrepancy in handling a device vs. the device(s)
> for a mountpoint.

Apparently based on what you point it at, it does a different thing.

If you point it at a block device, it will try opening the block device
to find out which devid it has (from the superblock), find out where it
is mounted and then only ask for stats for this one device.

-# btrfs dev stats /dev/xvdc
[/dev/xvdc].write_io_errs0
[/dev/xvdc].read_io_errs 0
[/dev/xvdc].flush_io_errs0
[/dev/xvdc].corruption_errs  0
[/dev/xvdc].generation_errs  0

If you point it at a mounted filesystem, it lists stats for all devices.
Since there's no way to get a list of devids, it searches the chunk tree
directly for dev_item objects.

-# btrfs dev stats /mountpoint
[/dev/xvdb].write_io_errs0
[/dev/xvdb].read_io_errs 0
[/dev/xvdb].flush_io_errs0
[/dev/xvdb].corruption_errs  0
[/dev/xvdb].generation_errs  0
[/dev/xvdc].write_io_errs0
[/dev/xvdc].read_io_errs 0
[/dev/xvdc].flush_io_errs0
[/dev/xvdc].corruption_errs  0
[/dev/xvdc].generation_errs  0
[/dev/xvdd].write_io_errs0
[/dev/xvdd].read_io_errs 0
[/dev/xvdd].flush_io_errs0
[/dev/xvdd].corruption_errs  0
[/dev/xvdd].generation_errs  0

I do the same thing in the nagios plugin:

https://github.com/knorrie/python-btrfs/blob/master/examples/nagios/plugins/check_btrfs#L131

fs.devices() also looks for dev_items in the chunk tree:

https://github.com/knorrie/python-btrfs/blob/master/btrfs/ctree.py#L481

So, BOOM! you need root.

Or just start at 0, ignore errors and keep trying devids until you have
found num_devices of them that work, yolo.
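
Something like this, roughly (untested sketch; it assumes that asking
dev_info for a devid that does not exist makes it fail with an OSError,
and that max_id from fs_info is indeed the highest devid in use):

import btrfs

fs = btrfs.FileSystem('/')
info = fs.fs_info()

# fs_info gives us num_devices and max_id, but not the devids themselves,
# so probe all candidates and skip the ones that fail.
found = []
for devid in range(1, info.max_id + 1):
    try:
        fs.dev_info(devid)
    except OSError:
        # no device with this devid; gaps can exist after device removal
        continue
    found.append(devid)
    if len(found) == info.num_devices:
        break

for devid in found:
    btrfs.utils.pretty_print(fs.dev_stats(devid))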

>> I can do the device stats ioctl as normal user:
>>
>> import btrfs
>> fs = btrfs.FileSystem('/')
>> btrfs.utils.pretty_print(fs.dev_stats(1))
>>
>> 
>> devid: 1
>> nr_items: 5
>> flags: 0
>> write_errs: 0
>> read_errs: 0
>> flush_errs: 0
>> generation_errs: 0
>> corruption_errs: 0
> 
> That works for me too, and that's actually the important part. \o/
> Glad we talked about it. :}
> 
> -h
> 
> [1]
> https://github.com/kdave/btrfs-progs/blob/7faaca0d9f78f7162ae603231f693dd8e1af2a41/utils.c#L1666
> 
> 


-- 
Hans van Kranenburg


Re: Curious problem: btrfs device stats & unpriviliged access

2018-10-08 Thread Hans van Kranenburg
On 10/08/2018 04:40 PM, Hans van Kranenburg wrote:
> On 10/08/2018 04:27 PM, Holger Hoffstätte wrote:
>> (moving the discussion here from GH [1])
>>
>> Apparently there is something weird going on with the device stats
>> ioctls. I cannot get them to work as regular user, while they work
>> for David. A friend confirms the same issue on his system - no access
>> as non-root.
>>
>> So I made a new empty fs, mounted it, built btrfs-progs-4.17.1 with
>> debug symbols and stepped into search_chunk_tree_for_fs_info().
>> Everything is fine, all args are correct, right until:
>>
>> (gdb) s
>> 1614    ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &search_args);
>> (gdb) s
>> 1615    if (ret < 0)
>> (gdb) p ret
>> $4 = -1
>> (gdb) p search_args
>> $5 = {key = {tree_id = 3, min_objectid = 1, max_objectid = 1, min_offset
>> = 1,
>> max_offset = 18446744073709551615, min_transid = 0, max_transid =
>> 18446744073709551615,
>> min_type = 216, max_type = 216, nr_items = 30, unused = 0, unused1 = 0,
>> unused2 = 0,
>> unused3 = 0, unused4 = 0}, buf = '\000' }
>>
>> Looking at the kernel side of things in fs/btrfs/ioctl.c I see both
>> BTRFS_IOC_TREE_SEARCH[_V2] unconditionally require CAP_SYS_ADMIN.
> 
> That's the tree search ioctl, for reading arbitrary metadata.
> 
> The device stats ioctl is IOC_GET_DEV_STATS...
> 
> I can do the device stats ioctl as normal user:
> 
> import btrfs
> fs = btrfs.FileSystem('/')
> btrfs.utils.pretty_print(fs.dev_stats(1))
> 
> 
> devid: 1
> nr_items: 5
> flags: 0
> write_errs: 0
> read_errs: 0
> flush_errs: 0
> generation_errs: 0
> corruption_errs: 0

By the way, I can also do BTRFS_IOC_FS_INFO, BTRFS_IOC_DEV_INFO and
BTRFS_IOC_SPACE_INFO as normal user.

However, while fs_info tells me that there are num_devices devices,
there's no place where you can actually get which devids these are, and
you need to provide them one by one to dev_info and dev_stats... :

btrfs.utils.pretty_print(fs.fs_info())


max_id: 1
num_devices: 1
nodesize: 4096
sectorsize: 4096
clone_alignment: 4096
fsid: 91077ca5-6559-4a90-9d03-912d3a33412e

btrfs.utils.pretty_print(fs.dev_info(1))

devid: 1
bytes_used: 60699967488
total_bytes: 107374182400
path: /dev/xvda
uuid: 7e998baa-b533-4476-9132-d7d748d28044


btrfs.utils.pretty_print(fs.space_info())
-
  
  flags: Data, single
  total_bytes: 54.00GiB
  used_bytes: 53.27GiB
-
  
  flags: System, single
  total_bytes: 32.00MiB
  used_bytes: 12.00KiB
-
  
  flags: Metadata, single
  total_bytes: 2.50GiB
  used_bytes: 1.30GiB
-
  
  flags: GlobalReserve, single
  total_bytes: 181.02MiB
  used_bytes: 0.00B


> 
>> So why can Dave get his dev stats as unprivileged user?
>> Does this work for anybody else? And why? :)
>>
>> cheers
>> Holger
>>
>> [1]
>> https://github.com/prometheus/node_exporter/issues/1100#issuecomment-427823190
>>
> 
> 


-- 
Hans van Kranenburg


Re: Curious problem: btrfs device stats & unpriviliged access

2018-10-08 Thread Hans van Kranenburg
On 10/08/2018 04:27 PM, Holger Hoffstätte wrote:
> (moving the discussion here from GH [1])
> 
> Apparently there is something weird going on with the device stats
> ioctls. I cannot get them to work as regular user, while they work
> for David. A friend confirms the same issue on his system - no access
> as non-root.
> 
> So I made a new empty fs, mounted it, built btrfs-progs-4.17.1 with
> debug symbols and stepped into search_chunk_tree_for_fs_info().
> Everything is fine, all args are correct, right until:
> 
> (gdb) s
> 1614    ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &search_args);
> (gdb) s
> 1615    if (ret < 0)
> (gdb) p ret
> $4 = -1
> (gdb) p search_args
> $5 = {key = {tree_id = 3, min_objectid = 1, max_objectid = 1, min_offset
> = 1,
> max_offset = 18446744073709551615, min_transid = 0, max_transid =
> 18446744073709551615,
> min_type = 216, max_type = 216, nr_items = 30, unused = 0, unused1 = 0,
> unused2 = 0,
> unused3 = 0, unused4 = 0}, buf = '\000' }
> 
> Looking at the kernel side of things in fs/btrfs/ioctl.c I see both
> BTRFS_IOC_TREE_SEARCH[_V2] unconditionally require CAP_SYS_ADMIN.

That's the tree search ioctl, for reading arbitrary metadata.

The device stats ioctl is IOC_GET_DEV_STATS...

I can do the device stats ioctl as normal user:

import btrfs
fs = btrfs.FileSystem('/')
btrfs.utils.pretty_print(fs.dev_stats(1))


devid: 1
nr_items: 5
flags: 0
write_errs: 0
read_errs: 0
flush_errs: 0
generation_errs: 0
corruption_errs: 0

> So why can Dave get his dev stats as unprivileged user?
> Does this work for anybody else? And why? :)
> 
> cheers
> Holger
> 
> [1]
> https://github.com/prometheus/node_exporter/issues/1100#issuecomment-427823190
> 


-- 
Hans van Kranenburg


Re: python-btrfs v10 preview... detailed usage reporting and a tutorial

2018-10-07 Thread Hans van Kranenburg
Hi,

On 09/24/2018 01:19 AM, Adam Borowski wrote:
> On Sun, Sep 23, 2018 at 11:54:12PM +0200, Hans van Kranenburg wrote:
>> Two examples have been added, which use the new code. I would appreciate
>> extra testing. Please try them and see if the reported numbers make sense:
>>
>> space_calculator.py
>> ---
>> Best to be initially described as a CLI version of the well-known
>> webbased btrfs space calculator by Hugo. ;] Throw a few disk sizes at
>> it, choose data and metadata profile and see how much space you would
>> get to store actual data.
>>
>> See commit message "Add example to calculate usable and wasted space"
>> for example output.
>>
>> show_usage.py
>> -
>> The contents of the old show_usage.py example that simply showed a list
>> of block groups are replaced with a detailed usage report of an existing
>> filesystem.
> 
> I wonder, perhaps at least some of the examples could be elevated to
> commands meant to be run by end-user?  Ie, installing them to /usr/bin/,
> dropping the extension?  They'd probably need less generic names, though.

Some of the examples are very useful, and I keep using them frequently.
That's actually also the reason that I for now just have copied the
examples/ to /usr/share/doc/python3-btrfs/examples for the Debian
package, so that they're easily available on all systems that I work on.

Currently the examples collection is serving a few purposes. It's my
poor mans testing framework, which covers all functionality of the lib.
It displays all the things that you can do. There's a rich git commit
message history on them, which I plan to transform into documentation
and tutorial stuff later.

So, yes, a bunch of the things are quite useful actually. The new
show_usage and space_calculator are examples of what is possible, and they
start to rise above the level of small debugging thingies.

So what would be candidates to be promoted to 'official' utils?

0) Ah, btrfs-heatmap

Yeah, that's the thing it all started with. I started writing all of the
code to be able to debug why my filesystems were allocating raw disk
space all the time and not reusing the already allocated free space.
But, that one is already done.

https://github.com/knorrie/btrfs-heatmap/

1) Custom btrfs balance

If really needed (and luckily, the need for it is mostly removed after
solving the -o ssd issues) I always use balance_least_used.py instead of
regular btrfs balance. I think it totally makes sense to do the analysis
of what blockgroups to feed to balance in what order in user space.

I also used another custom script to feed block groups with highly
fragmented free space to balance to try repairing filesystems that had
been using the cluster data extent allocator. That's not in examples,
but when you combine show_free_space_fragmentation with parts of
balance_least_used, you get the idea.

The best example I can think of here is a program that uses the new
usage information to find out how to feed block groups to balance to
actually get a balanced filesystem with minimal amount of wasted raw
space, and then do exactly that in the quickest way possible while
providing interesting progress information, instead of just brute force
rewriting all of the data and having no idea what's actually happening.

2) Advanced usage reporting

Something like the new show_usage, but hey, when using python with some
batteries included, I guess we can relatively easily do a nice html or
pdf output with pie and bar charts which provide the user with
information about the filesystem. Just having users run that when
they're asking for help on IRC and share the result would be nice. :o)

3) The space calculator

Yup, obviously.

4) Maybe show_orphan_cleaner_progress

I use that one now and then to get a live view on mass-removal of
subvolumes (backup snapshot expiry), but it's very close to a debug
tool. Or maybe I'm already spoiled and used to it now, and I don't
realize any more how frustrating it must be to see disk IO and cpu go
all over the place and have no idea about what btrfs is doing.

5) So much more...

So... the examples are just basic test coverage. There is so much more
that can be done.

And yes, to be able to write a small thingie that uses the lib, you
already have to know a lot about btrfs. -> That's why I started writing
the tutorial.

And yes, when promoting things like the new show_usage example to
programs that are easily available, users will probably start parsing
the output of them with sed and awk which is a total abomination and the
absolute opposite of the purpose of the library. So be it. Let it go. :D
"The code never bothered me any way".

The interesting question that remains is where the result should go.

btrfs-heatmap is a thing of its own now, but it's a bit of the "showcase"
example using the lib, with it

Re: [PATCH 5/6] btrfs: introduce nparity raid_attr

2018-10-05 Thread Hans van Kranenburg
On 10/05/2018 04:42 PM, Nikolay Borisov wrote:
> 
> 
> On  5.10.2018 00:24, Hans van Kranenburg wrote:
>> Instead of hardcoding exceptions for RAID5 and RAID6 in the code, use an
>> nparity field in raid_attr.
>>
>> Signed-off-by: Hans van Kranenburg 
>> ---
>>  fs/btrfs/volumes.c | 18 +++---
>>  fs/btrfs/volumes.h |  2 ++
>>  2 files changed, 13 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index d82b3d735ebe..453046497ac8 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -37,6 +37,7 @@ const struct btrfs_raid_attr 
>> btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
>>  .tolerated_failures = 1,
>>  .devs_increment = 2,
>>  .ncopies= 2,
>> +.nparity= 0,
>>  .raid_name  = "raid10",
>>  .bg_flag= BTRFS_BLOCK_GROUP_RAID10,
>>  .mindev_error   = BTRFS_ERROR_DEV_RAID10_MIN_NOT_MET,
>> @@ -49,6 +50,7 @@ const struct btrfs_raid_attr 
>> btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
>>  .tolerated_failures = 1,
>>  .devs_increment = 2,
>>  .ncopies= 2,
>> +.nparity= 0,
>>  .raid_name  = "raid1",
>>  .bg_flag= BTRFS_BLOCK_GROUP_RAID1,
>>  .mindev_error   = BTRFS_ERROR_DEV_RAID1_MIN_NOT_MET,
>> @@ -61,6 +63,7 @@ const struct btrfs_raid_attr 
>> btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
>>  .tolerated_failures = 0,
>>  .devs_increment = 1,
>>  .ncopies= 2,
>> +.nparity= 0,
>>  .raid_name  = "dup",
>>  .bg_flag= BTRFS_BLOCK_GROUP_DUP,
>>  .mindev_error   = 0,
>> @@ -73,6 +76,7 @@ const struct btrfs_raid_attr 
>> btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
>>  .tolerated_failures = 0,
>>  .devs_increment = 1,
>>  .ncopies= 1,
>> +.nparity= 0,
>>  .raid_name  = "raid0",
>>  .bg_flag= BTRFS_BLOCK_GROUP_RAID0,
>>  .mindev_error   = 0,
>> @@ -85,6 +89,7 @@ const struct btrfs_raid_attr 
>> btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
>>  .tolerated_failures = 0,
>>  .devs_increment = 1,
>>  .ncopies= 1,
>> +.nparity= 0,
>>  .raid_name  = "single",
>>  .bg_flag= 0,
>>  .mindev_error   = 0,
>> @@ -97,6 +102,7 @@ const struct btrfs_raid_attr 
>> btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
>>  .tolerated_failures = 1,
>>  .devs_increment = 1,
>>  .ncopies= 1,
>> +.nparity= 2,
>>  .raid_name  = "raid5",
>>  .bg_flag= BTRFS_BLOCK_GROUP_RAID5,
>>  .mindev_error   = BTRFS_ERROR_DEV_RAID5_MIN_NOT_MET,
>> @@ -109,6 +115,7 @@ const struct btrfs_raid_attr 
>> btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
>>  .tolerated_failures = 2,
>>  .devs_increment = 1,
>>  .ncopies= 1,
>> +.nparity= 2,
>>  .raid_name  = "raid6",
>>  .bg_flag= BTRFS_BLOCK_GROUP_RAID6,
>>  .mindev_error   = BTRFS_ERROR_DEV_RAID6_MIN_NOT_MET,
>> @@ -4597,6 +4604,8 @@ static int __btrfs_alloc_chunk(struct 
>> btrfs_trans_handle *trans,
>>  int devs_min;   /* min devs needed */
>>  int devs_increment; /* ndevs has to be a multiple of this */
>>  int ncopies;/* how many copies to data has */
>> +int nparity;/* number of stripes worth of bytes to
>> +   store parity information */
>>  int ret;
>>  u64 max_stripe_size;
>>  u64 max_chunk_size;
>> @@ -4623,6 +4632,7 @@ static int __btrfs_alloc_chunk(struct 
>> btrfs_trans_handle *trans,
>>  devs_min = btrfs_raid_array[index].devs_min;
>>  devs_increment = btrfs_raid_array[index].devs_increment;
>>  ncopies = btrfs_raid_array[index].ncopies;
>> +nparity = btrfs_raid_array[index].nparity;
>>  
>>  if (type & BTRFS_BLOCK_GROUP_DATA) {
>>  max_stripe_size = SZ_1G;
>> @@ -4752,13 +4762,7 @@ static int __btrfs_alloc_chunk(struct 
>

Re: [PATCH 0/6] Chunk allocator DUP fix and cleanups

2018-10-05 Thread Hans van Kranenburg
On 10/05/2018 09:51 AM, Qu Wenruo wrote:
> 
> 
> On 2018/10/5 上午5:24, Hans van Kranenburg wrote:
>> This patch set contains an additional fix for a newly exposed bug after
>> the previous attempt to fix a chunk allocator bug for new DUP chunks:
>>
>> https://lore.kernel.org/linux-btrfs/782f6000-30c0-0085-abd2-74ec5827c...@mendix.com/T/#m609ccb5d32998e8ba5cfa9901c1ab56a38a6f374
> 
> For that bug, did you succeeded in reproducing the bug?

Yes, open the above link and scroll to "Steps to reproduce".

o/

> I'm adding dev extent overlap checking in btrfs_verify_dev_extents() and
> btrfs-progs.
> I think it could help to detect such problem.
> 
> Thanks,
> Qu
> 
>>
>> The DUP fix is "fix more DUP stripe size handling". I did that one
>> before starting to change more things so it can be applied to earlier
>> LTS kernels.
>>
>> Besides that patch, which is fixing the bug in a way that is least
>> intrusive, I added a bunch of other patches to help getting the chunk
>> allocator code in a state that is a bit less error-prone and
>> bug-attracting.
>>
>> When running this and trying the reproduction scenario, I can now see
>> that the created DUP device extent is 827326464 bytes long, which is
>> good.
>>
>> I wrote and tested this on top of linus 4.19-rc5. I still need to create
>> a list of related use cases and test more things to at least walk
>> through a bunch of obvious use cases to see if there's nothing exploding
>> too quickly with these changes. However, I'm quite confident about it,
>> so I'm sharing all of it already.
>>
>> Any feedback and review is appreciated. Be gentle and keep in mind that
>> I'm still very much in a learning stage regarding kernel development.
>>
>> The stable patches handling workflow is not 100% clear to me yet. I
>> guess I have to add a Fixes: in the DUP patch which points to the
>> previous commit 92e222df7b.
>>
>> Moo!,
>> Knorrie
>>
>> Hans van Kranenburg (6):
>>   btrfs: alloc_chunk: do not refurbish num_bytes
>>   btrfs: alloc_chunk: improve chunk size variable name
>>   btrfs: alloc_chunk: fix more DUP stripe size handling
>>   btrfs: fix ncopies raid_attr for RAID56
>>   btrfs: introduce nparity raid_attr
>>   btrfs: alloc_chunk: rework chunk/stripe calculations
>>
>>  fs/btrfs/volumes.c | 84 +++---
>>  fs/btrfs/volumes.h |  4 ++-
>>  2 files changed, 45 insertions(+), 43 deletions(-)
>>
> 


-- 
Hans van Kranenburg





Re: [PATCH 5/6] btrfs: introduce nparity raid_attr

2018-10-04 Thread Hans van Kranenburg
On 10/04/2018 11:24 PM, Hans van Kranenburg wrote:
> Instead of hardcoding exceptions for RAID5 and RAID6 in the code, use an
> nparity field in raid_attr.
> 
> Signed-off-by: Hans van Kranenburg 
> ---
>  fs/btrfs/volumes.c | 18 +++---
>  fs/btrfs/volumes.h |  2 ++
>  2 files changed, 13 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index d82b3d735ebe..453046497ac8 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -37,6 +37,7 @@ const struct btrfs_raid_attr 
> btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
>   .tolerated_failures = 1,
>   .devs_increment = 2,
>   .ncopies= 2,
> + .nparity= 0,
>   .raid_name  = "raid10",
>   .bg_flag= BTRFS_BLOCK_GROUP_RAID10,
>   .mindev_error   = BTRFS_ERROR_DEV_RAID10_MIN_NOT_MET,
> @@ -49,6 +50,7 @@ const struct btrfs_raid_attr 
> btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
>   .tolerated_failures = 1,
>   .devs_increment = 2,
>   .ncopies= 2,
> + .nparity= 0,
>   .raid_name  = "raid1",
>   .bg_flag= BTRFS_BLOCK_GROUP_RAID1,
>   .mindev_error   = BTRFS_ERROR_DEV_RAID1_MIN_NOT_MET,
> @@ -61,6 +63,7 @@ const struct btrfs_raid_attr 
> btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
>   .tolerated_failures = 0,
>   .devs_increment = 1,
>   .ncopies= 2,
> + .nparity= 0,
>   .raid_name  = "dup",
>   .bg_flag= BTRFS_BLOCK_GROUP_DUP,
>   .mindev_error   = 0,
> @@ -73,6 +76,7 @@ const struct btrfs_raid_attr 
> btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
>   .tolerated_failures = 0,
>   .devs_increment = 1,
>   .ncopies= 1,
> + .nparity= 0,
>   .raid_name  = "raid0",
>   .bg_flag= BTRFS_BLOCK_GROUP_RAID0,
>   .mindev_error   = 0,
> @@ -85,6 +89,7 @@ const struct btrfs_raid_attr 
> btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
>   .tolerated_failures = 0,
>   .devs_increment = 1,
>   .ncopies= 1,
> + .nparity= 0,
>   .raid_name  = "single",
>   .bg_flag= 0,
>   .mindev_error   = 0,
> @@ -97,6 +102,7 @@ const struct btrfs_raid_attr 
> btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
>   .tolerated_failures = 1,
>   .devs_increment = 1,
>   .ncopies= 1,
> + .nparity= 2,

Ahem, this obviously needs to be 1. Another reorganize/rebase-fail.
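
For the record, with nparity set to 1 for raid5, the patch's generic
calculation data_stripes = (num_stripes - nparity) / ncopies gives the
expected result for every profile. A quick sanity-check sketch (plain
Python, not kernel code; the stripe counts are just example values):

#!/usr/bin/python3

# (ncopies, nparity) per profile, with the raid5 nparity corrected to 1
raid_attr = {
    'single': (1, 0), 'dup': (2, 0), 'raid0': (1, 0), 'raid1': (2, 0),
    'raid10': (2, 0), 'raid5': (1, 1), 'raid6': (1, 2),
}
# arbitrary example num_stripes values, e.g. for a 4 device filesystem
example_num_stripes = {
    'single': 1, 'dup': 2, 'raid0': 4, 'raid1': 2,
    'raid10': 4, 'raid5': 4, 'raid6': 4,
}

for name, (ncopies, nparity) in raid_attr.items():
    num_stripes = example_num_stripes[name]
    data_stripes = (num_stripes - nparity) // ncopies
    print("{}: {} stripes -> {} data stripes".format(
        name, num_stripes, data_stripes))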

>   .raid_name  = "raid5",
>   .bg_flag= BTRFS_BLOCK_GROUP_RAID5,
>   .mindev_error   = BTRFS_ERROR_DEV_RAID5_MIN_NOT_MET,
> @@ -109,6 +115,7 @@ const struct btrfs_raid_attr 
> btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
>   .tolerated_failures = 2,
>   .devs_increment = 1,
>   .ncopies= 1,
> + .nparity= 2,
>   .raid_name  = "raid6",
>   .bg_flag= BTRFS_BLOCK_GROUP_RAID6,
>   .mindev_error   = BTRFS_ERROR_DEV_RAID6_MIN_NOT_MET,
> @@ -4597,6 +4604,8 @@ static int __btrfs_alloc_chunk(struct 
> btrfs_trans_handle *trans,
>   int devs_min;   /* min devs needed */
>   int devs_increment; /* ndevs has to be a multiple of this */
>   int ncopies;/* how many copies to data has */
> + int nparity;/* number of stripes worth of bytes to
> +store parity information */
>   int ret;
>   u64 max_stripe_size;
>   u64 max_chunk_size;
> @@ -4623,6 +4632,7 @@ static int __btrfs_alloc_chunk(struct 
> btrfs_trans_handle *trans,
>   devs_min = btrfs_raid_array[index].devs_min;
>   devs_increment = btrfs_raid_array[index].devs_increment;
>   ncopies = btrfs_raid_array[index].ncopies;
> + nparity = btrfs_raid_array[index].nparity;
>  
>   if (type & BTRFS_BLOCK_GROUP_DATA) {
>   max_stripe_size = SZ_1G;
> @@ -4752,13 +4762,7 @@ static int __btrfs_alloc_chunk(struct 
> btrfs_trans_handle *trans,
>* this will have to be fixed for RAID1 and RAID10 over
>* more drives
>*/
> - data_stripes = num_stripes / ncopies;
> -
> - if (type & BTRFS_BLOCK_GROUP_RAID5)
> - data_stripes = num_stripes - 1;
> -
> -   

Re: [PATCH 2/6] btrfs: alloc_chunk: improve chunk size variable name

2018-10-04 Thread Hans van Kranenburg
On 10/04/2018 11:24 PM, Hans van Kranenburg wrote:
> num_bytes is really a way too generic name for a variable in this
> function. There are a dozen other variables that hold a number of bytes
> as value.
> 
> Give it a name that actually describes what it does, which is holding
> the size of the chunk that we're allocating.
> 
> Signed-off-by: Hans van Kranenburg 
> ---
>  fs/btrfs/volumes.c | 12 +++-
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 9c9bb127..40fa85e68b1f 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -123,6 +123,8 @@ const char *get_raid_name(enum btrfs_raid_types type)
>   return btrfs_raid_array[type].raid_name;
>  }
>  
> +
> +

Meh, missed that one, will remove.

>  static int init_first_rw_device(struct btrfs_trans_handle *trans,
>   struct btrfs_fs_info *fs_info);
>  static int btrfs_relocate_sys_chunks(struct btrfs_fs_info *fs_info);
> @@ -4599,7 +4601,7 @@ static int __btrfs_alloc_chunk(struct 
> btrfs_trans_handle *trans,
>   u64 max_stripe_size;
>   u64 max_chunk_size;
>   u64 stripe_size;
> - u64 num_bytes;
> + u64 chunk_size;
>   int ndevs;
>   int i;
>   int j;
> @@ -4801,9 +4803,9 @@ static int __btrfs_alloc_chunk(struct 
> btrfs_trans_handle *trans,
>   map->type = type;
>   map->sub_stripes = sub_stripes;
>  
> - num_bytes = stripe_size * data_stripes;
> + chunk_size = stripe_size * data_stripes;
>  
> - trace_btrfs_chunk_alloc(info, map, start, num_bytes);
> + trace_btrfs_chunk_alloc(info, map, start, chunk_size);
>  
>   em = alloc_extent_map();
>   if (!em) {
> @@ -4814,7 +4816,7 @@ static int __btrfs_alloc_chunk(struct 
> btrfs_trans_handle *trans,
>   set_bit(EXTENT_FLAG_FS_MAPPING, >flags);
>   em->map_lookup = map;
>   em->start = start;
> - em->len = num_bytes;
> + em->len = chunk_size;
>   em->block_start = 0;
>   em->block_len = em->len;
>   em->orig_block_len = stripe_size;
> @@ -4832,7 +4834,7 @@ static int __btrfs_alloc_chunk(struct 
> btrfs_trans_handle *trans,
>   refcount_inc(>refs);
>   write_unlock(_tree->lock);
>  
> - ret = btrfs_make_block_group(trans, 0, type, start, num_bytes);
> + ret = btrfs_make_block_group(trans, 0, type, start, chunk_size);
>   if (ret)
>   goto error_del_extent;
>  
> 


-- 
Hans van Kranenburg


[PATCH 6/6] btrfs: alloc_chunk: rework chunk/stripe calculations

2018-10-04 Thread Hans van Kranenburg
Previously, the stripe_size variable was modified too many times in the
__btrfs_alloc_chunk function. The most problematic place was the if
block dealing with a chunk bigger than max_chunk_size, which would throw
away (overwrite) the value of stripe_size, only to realize a few lines
later that the previous value was actually better, and then execute a
copy of the former logic to try to get it back to the previous state.

Instead of on-the-fly calculating the target chunk size, adjust the
max_stripe_size variable based on the max_chunk_size that was set
before, and simply compare stripe_size against it later on.
This removes the whole problematic if block.
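
To illustrate, this is how the reworked calculation plays out for the
DUP reproduction case from the "fix more DUP stripe size handling"
patch. A plain Python back-of-the-envelope sketch, not kernel code; the
helpers only mimic the kernel macros and the numbers are the example
values:

#!/usr/bin/python3

SZ_1M = 1024 * 1024
SZ_16M = 16 * SZ_1M
BTRFS_STRIPE_LEN = 64 * 1024

def round_up(x, align):
    return (x + align - 1) & ~(align - 1)

def round_down(x, align):
    return x & ~(align - 1)

# DUP on a single 7880MiB device
ncopies, nparity, dev_stripes, ndevs = 2, 0, 2, 1
max_stripe_size = 1024 * SZ_1M
max_chunk_size = 788 * SZ_1M       # 10% of writeable space
max_avail = 1578 * SZ_1M           # largest contiguous free hole found

num_stripes = ndevs * dev_stripes            # 2
num_data_stripes = num_stripes - nparity     # 2
stripe_size = max_avail // dev_stripes       # 789MiB

max_stripe_size = min(round_up(max_chunk_size * ncopies // num_data_stripes,
                               SZ_16M),
                      max_stripe_size)       # 800MiB
stripe_size = round_down(min(max_stripe_size, stripe_size), BTRFS_STRIPE_LEN)

chunk_size = stripe_size * num_data_stripes // ncopies
print(stripe_size)             # 827326464: 789MiB, 2 x 789MiB fits in 1578MiB
print(chunk_size // SZ_1M)     # 789: a bit over max_chunk_size, which is ok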

Signed-off-by: Hans van Kranenburg 
---
 fs/btrfs/volumes.c | 46 +-
 fs/btrfs/volumes.h |  2 +-
 2 files changed, 22 insertions(+), 26 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 453046497ac8..862ee17ee0e5 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4596,14 +4596,15 @@ static int __btrfs_alloc_chunk(struct 
btrfs_trans_handle *trans,
struct btrfs_device_info *devices_info = NULL;
u64 total_avail;
int num_stripes;/* total number of stripes to allocate */
-   int data_stripes;   /* number of stripes that count for
-  block group size */
+   int num_data_stripes;   /* number of stripes worth of bytes to
+  store data including copies */
int sub_stripes;/* sub_stripes info for map */
int dev_stripes;/* stripes per dev */
int devs_max;   /* max devs to use */
int devs_min;   /* min devs needed */
int devs_increment; /* ndevs has to be a multiple of this */
-   int ncopies;/* how many copies to data has */
+   int ncopies;/* how many times actual data is duplicated
+  inside num_data_stripes */
int nparity;/* number of stripes worth of bytes to
   store parity information */
int ret;
@@ -4747,6 +4748,8 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle 
*trans,
}
 
ndevs = min(ndevs, devs_max);
+   num_stripes = ndevs * dev_stripes;
+   num_data_stripes = num_stripes - nparity;
 
/*
 * The primary goal is to maximize the number of stripes, so use as
@@ -4756,31 +4759,24 @@ static int __btrfs_alloc_chunk(struct 
btrfs_trans_handle *trans,
 * max_avail is the total size so we have to adjust.
 */
stripe_size = div_u64(devices_info[ndevs - 1].max_avail, dev_stripes);
-   num_stripes = ndevs * dev_stripes;
-
-   /*
-* this will have to be fixed for RAID1 and RAID10 over
-* more drives
-*/
-   data_stripes = (num_stripes - nparity) / ncopies;
 
/*
-* Use the number of data stripes to figure out how big this chunk
-* is really going to be in terms of logical address space,
-* and compare that answer with the max chunk size. If it's higher,
-* we try to reduce stripe_size.
+* Now that we know how many stripes we're going to use, we can adjust
+* down max_stripe_size if needed, paying attention to max_chunk_size.
+*
+* By multiplying chunk size with ncopies, we get the total amount of
+* bytes that need to fit into all the non-parity stripes.
+*
+* A chunk is allowed to end up being a bit bigger than max_chunk_size
+* when rounding up the stripe_size to a 16MiB boundary makes it so.
+* Unless... it ends up being bigger than the amount of physical free
+* space we can use for it.
 */
-   if (stripe_size * data_stripes > max_chunk_size) {
-   /* Reduce stripe_size, round it up to a 16MB boundary
-* again and then use it, unless it ends up being even
-* bigger than the previous value we had already.
-*/
-   stripe_size = min(round_up(div_u64(max_chunk_size,
-  data_stripes), SZ_16M),
- stripe_size);
-   }
+   max_stripe_size = min(round_up((max_chunk_size * ncopies) /
+  num_data_stripes, SZ_16M),
+ max_stripe_size);
 
-   /* align to BTRFS_STRIPE_LEN */
+   stripe_size = min(max_stripe_size, stripe_size);
stripe_size = round_down(stripe_size, BTRFS_STRIPE_LEN);
 
map = kmalloc(map_lookup_size(num_stripes), GFP_NOFS);
@@ -4804,7 +4800,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle 
*trans,
map->type = type;
map->sub_stripes = sub_stripes;
 
-   chunk_size = stripe_size * data_stripes;
+   chunk_size = div_u64(stripe_size * num_data_stripes

[PATCH 0/6] Chunk allocator DUP fix and cleanups

2018-10-04 Thread Hans van Kranenburg
This patch set contains an additional fix for a newly exposed bug after
the previous attempt to fix a chunk allocator bug for new DUP chunks:

https://lore.kernel.org/linux-btrfs/782f6000-30c0-0085-abd2-74ec5827c...@mendix.com/T/#m609ccb5d32998e8ba5cfa9901c1ab56a38a6f374

The DUP fix is "fix more DUP stripe size handling". I did that one
before starting to change more things so it can be applied to earlier
LTS kernels.

Besides that patch, which is fixing the bug in a way that is least
intrusive, I added a bunch of other patches to help getting the chunk
allocator code in a state that is a bit less error-prone and
bug-attracting.

When running this and trying the reproduction scenario, I can now see
that the created DUP device extent is 827326464 bytes long, which is
good.
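
For reference, that is exactly 789MiB, half of the 1578MiB hole from
the reproduction scenario:

>>> 827326464 / (1024 * 1024)
789.0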

I wrote and tested this on top of linus 4.19-rc5. I still need to create
a list of related use cases and test more things to at least walk
through a bunch of obvious use cases to see if there's nothing exploding
too quickly with these changes. However, I'm quite confident about it,
so I'm sharing all of it already.

Any feedback and review is appreciated. Be gentle and keep in mind that
I'm still very much in a learning stage regarding kernel development.

The stable patches handling workflow is not 100% clear to me yet. I
guess I have to add a Fixes: in the DUP patch which points to the
previous commit 92e222df7b.

Moo!,
Knorrie

Hans van Kranenburg (6):
  btrfs: alloc_chunk: do not refurbish num_bytes
  btrfs: alloc_chunk: improve chunk size variable name
  btrfs: alloc_chunk: fix more DUP stripe size handling
  btrfs: fix ncopies raid_attr for RAID56
  btrfs: introduce nparity raid_attr
  btrfs: alloc_chunk: rework chunk/stripe calculations

 fs/btrfs/volumes.c | 84 +++---
 fs/btrfs/volumes.h |  4 ++-
 2 files changed, 45 insertions(+), 43 deletions(-)

-- 
2.19.0.329.g76f2f5c1e3



[PATCH 4/6] btrfs: fix ncopies raid_attr for RAID56

2018-10-04 Thread Hans van Kranenburg
RAID5 and RAID6 profiles store one copy of the data, not 2 or 3. These
values are not used anywhere by the way.

Signed-off-by: Hans van Kranenburg 
---
 fs/btrfs/volumes.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 7045814fc98d..d82b3d735ebe 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -96,7 +96,7 @@ const struct btrfs_raid_attr 
btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
.devs_min   = 2,
.tolerated_failures = 1,
.devs_increment = 1,
-   .ncopies= 2,
+   .ncopies= 1,
.raid_name  = "raid5",
.bg_flag= BTRFS_BLOCK_GROUP_RAID5,
.mindev_error   = BTRFS_ERROR_DEV_RAID5_MIN_NOT_MET,
@@ -108,7 +108,7 @@ const struct btrfs_raid_attr 
btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
.devs_min   = 3,
.tolerated_failures = 2,
.devs_increment = 1,
-   .ncopies= 3,
+   .ncopies= 1,
.raid_name  = "raid6",
.bg_flag= BTRFS_BLOCK_GROUP_RAID6,
.mindev_error   = BTRFS_ERROR_DEV_RAID6_MIN_NOT_MET,
-- 
2.19.0.329.g76f2f5c1e3



[PATCH 3/6] btrfs: alloc_chunk: fix more DUP stripe size handling

2018-10-04 Thread Hans van Kranenburg
Commit 92e222df7b "btrfs: alloc_chunk: fix DUP stripe size handling"
fixed calculating the stripe_size for a new DUP chunk.

However, the same calculation reappears a bit later, and that one was
not changed yet. The resulting bug that is exposed is that the newly
allocated device extents ('stripes') can have a few MiB overlap with the
next thing stored after them, which is another device extent or the end
of the disk.

The scenario in which this can happen is:
* The block device for the filesystem is less than 10GiB in size.
* The amount of contiguous free unallocated disk space chosen to use for
  chunk allocation is 20% of the total device size, or a few MiB more or
  less.

An example:
- The filesystem device is 7880MiB (max_chunk_size gets set to 788MiB)
- There's 1578MiB unallocated raw disk space left in one contiguous
  piece.

In this case stripe_size is first calculated as 789MiB (half of
1578MiB).

Since 789MiB (stripe_size * data_stripes) > 788MiB (max_chunk_size), we
enter the if block. Now stripe_size value is immediately overwritten
while calculating an adjusted value based on max_chunk_size, which ends
up as 788MiB.

Next, the value is rounded up to a 16MiB boundary, 800MiB, which is
actually more than the value we had before. However, the last comparison
fails to detect this, because it's comparing the value with the total
amount of free space, which is about twice the size of stripe_size.

In the example above, this means that the resulting raw disk space being
allocated is 1600MiB, while only a gap of 1578MiB has been found. The
second device extent object for this DUP chunk will overlap for 22MiB
with whatever comes next.

The underlying problem here is that the stripe_size is reused all the
time for different things. So, when entering the code in the if block,
stripe_size is immediately overwritten with something else. If we later
decide we want the previous value back, the logic to compute it has to
be copy-pasted in again.

With this change, the value in stripe_size is not unnecessarily
destroyed, so the duplicated calculation is not needed any more.
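
Plugging the example numbers into the old and the fixed expression makes
the difference visible (a plain Python sketch, not the kernel code):

#!/usr/bin/python3

SZ_1M = 1024 * 1024
SZ_16M = 16 * SZ_1M

def round_up(x, align):
    return (x + align - 1) & ~(align - 1)

max_chunk_size = 788 * SZ_1M   # 10% of the 7880MiB device
max_avail = 1578 * SZ_1M       # free space found; 2 dev extents must fit here
stripe_size = 789 * SZ_1M      # max_avail / dev_stripes, calculated earlier
data_stripes = 1

# old: overwrite stripe_size, round up, then compare against max_avail,
# which is twice as big as what one device extent may actually use
old = min(max_avail, round_up(max_chunk_size // data_stripes, SZ_16M))
# new: never let the rounded-up value exceed the stripe_size we already had
new = min(round_up(max_chunk_size // data_stripes, SZ_16M), stripe_size)

print(old // SZ_1M, new // SZ_1M)   # 800 vs 789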

Signed-off-by: Hans van Kranenburg 
---
 fs/btrfs/volumes.c | 17 +++--
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 40fa85e68b1f..7045814fc98d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4763,19 +4763,16 @@ static int __btrfs_alloc_chunk(struct 
btrfs_trans_handle *trans,
/*
 * Use the number of data stripes to figure out how big this chunk
 * is really going to be in terms of logical address space,
-* and compare that answer with the max chunk size
+* and compare that answer with the max chunk size. If it's higher,
+* we try to reduce stripe_size.
 */
if (stripe_size * data_stripes > max_chunk_size) {
-   stripe_size = div_u64(max_chunk_size, data_stripes);
-
-   /* bump the answer up to a 16MB boundary */
-   stripe_size = round_up(stripe_size, SZ_16M);
-
-   /*
-* But don't go higher than the limits we found while searching
-* for free extents
+   /* Reduce stripe_size, round it up to a 16MB boundary
+* again and then use it, unless it ends up being even
+* bigger than the previous value we had already.
 */
-   stripe_size = min(devices_info[ndevs - 1].max_avail,
+   stripe_size = min(round_up(div_u64(max_chunk_size,
+  data_stripes), SZ_16M),
  stripe_size);
}
 
-- 
2.19.0.329.g76f2f5c1e3



[PATCH 2/6] btrfs: alloc_chunk: improve chunk size variable name

2018-10-04 Thread Hans van Kranenburg
num_bytes is really a way too generic name for a variable in this
function. There are a dozen other variables that hold a number of bytes
as value.

Give it a name that actually describes what it does, which is holding
the size of the chunk that we're allocating.

Signed-off-by: Hans van Kranenburg 
---
 fs/btrfs/volumes.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 9c9bb127..40fa85e68b1f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -123,6 +123,8 @@ const char *get_raid_name(enum btrfs_raid_types type)
return btrfs_raid_array[type].raid_name;
 }
 
+
+
 static int init_first_rw_device(struct btrfs_trans_handle *trans,
struct btrfs_fs_info *fs_info);
 static int btrfs_relocate_sys_chunks(struct btrfs_fs_info *fs_info);
@@ -4599,7 +4601,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle 
*trans,
u64 max_stripe_size;
u64 max_chunk_size;
u64 stripe_size;
-   u64 num_bytes;
+   u64 chunk_size;
int ndevs;
int i;
int j;
@@ -4801,9 +4803,9 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle 
*trans,
map->type = type;
map->sub_stripes = sub_stripes;
 
-   num_bytes = stripe_size * data_stripes;
+   chunk_size = stripe_size * data_stripes;
 
-   trace_btrfs_chunk_alloc(info, map, start, num_bytes);
+   trace_btrfs_chunk_alloc(info, map, start, chunk_size);
 
em = alloc_extent_map();
if (!em) {
@@ -4814,7 +4816,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle 
*trans,
set_bit(EXTENT_FLAG_FS_MAPPING, >flags);
em->map_lookup = map;
em->start = start;
-   em->len = num_bytes;
+   em->len = chunk_size;
em->block_start = 0;
em->block_len = em->len;
em->orig_block_len = stripe_size;
@@ -4832,7 +4834,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle 
*trans,
refcount_inc(>refs);
write_unlock(_tree->lock);
 
-   ret = btrfs_make_block_group(trans, 0, type, start, num_bytes);
+   ret = btrfs_make_block_group(trans, 0, type, start, chunk_size);
if (ret)
goto error_del_extent;
 
-- 
2.19.0.329.g76f2f5c1e3



[PATCH 1/6] btrfs: alloc_chunk: do not refurbish num_bytes

2018-10-04 Thread Hans van Kranenburg
num_bytes is used to store the length of the chunk that we're
allocating. Do not reuse it for something entirely different in the same
function.

Signed-off-by: Hans van Kranenburg 
---
 fs/btrfs/volumes.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f4405e430da6..9c9bb127 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4837,8 +4837,9 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle 
*trans,
goto error_del_extent;
 
for (i = 0; i < map->num_stripes; i++) {
-   num_bytes = map->stripes[i].dev->bytes_used + stripe_size;
-   btrfs_device_set_bytes_used(map->stripes[i].dev, num_bytes);
+   btrfs_device_set_bytes_used(map->stripes[i].dev,
+   map->stripes[i].dev->bytes_used +
+   stripe_size);
}
 
atomic64_sub(stripe_size * map->num_stripes, >free_chunk_space);
-- 
2.19.0.329.g76f2f5c1e3



[PATCH 5/6] btrfs: introduce nparity raid_attr

2018-10-04 Thread Hans van Kranenburg
Instead of hardcoding exceptions for RAID5 and RAID6 in the code, use an
nparity field in raid_attr.

Signed-off-by: Hans van Kranenburg 
---
 fs/btrfs/volumes.c | 18 +++---
 fs/btrfs/volumes.h |  2 ++
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d82b3d735ebe..453046497ac8 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -37,6 +37,7 @@ const struct btrfs_raid_attr 
btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
.tolerated_failures = 1,
.devs_increment = 2,
.ncopies= 2,
+   .nparity= 0,
.raid_name  = "raid10",
.bg_flag= BTRFS_BLOCK_GROUP_RAID10,
.mindev_error   = BTRFS_ERROR_DEV_RAID10_MIN_NOT_MET,
@@ -49,6 +50,7 @@ const struct btrfs_raid_attr 
btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
.tolerated_failures = 1,
.devs_increment = 2,
.ncopies= 2,
+   .nparity= 0,
.raid_name  = "raid1",
.bg_flag= BTRFS_BLOCK_GROUP_RAID1,
.mindev_error   = BTRFS_ERROR_DEV_RAID1_MIN_NOT_MET,
@@ -61,6 +63,7 @@ const struct btrfs_raid_attr 
btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
.tolerated_failures = 0,
.devs_increment = 1,
.ncopies= 2,
+   .nparity= 0,
.raid_name  = "dup",
.bg_flag= BTRFS_BLOCK_GROUP_DUP,
.mindev_error   = 0,
@@ -73,6 +76,7 @@ const struct btrfs_raid_attr 
btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
.tolerated_failures = 0,
.devs_increment = 1,
.ncopies= 1,
+   .nparity= 0,
.raid_name  = "raid0",
.bg_flag= BTRFS_BLOCK_GROUP_RAID0,
.mindev_error   = 0,
@@ -85,6 +89,7 @@ const struct btrfs_raid_attr 
btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
.tolerated_failures = 0,
.devs_increment = 1,
.ncopies= 1,
+   .nparity= 0,
.raid_name  = "single",
.bg_flag= 0,
.mindev_error   = 0,
@@ -97,6 +102,7 @@ const struct btrfs_raid_attr 
btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
.tolerated_failures = 1,
.devs_increment = 1,
.ncopies= 1,
+   .nparity= 2,
.raid_name  = "raid5",
.bg_flag= BTRFS_BLOCK_GROUP_RAID5,
.mindev_error   = BTRFS_ERROR_DEV_RAID5_MIN_NOT_MET,
@@ -109,6 +115,7 @@ const struct btrfs_raid_attr 
btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
.tolerated_failures = 2,
.devs_increment = 1,
.ncopies= 1,
+   .nparity= 2,
.raid_name  = "raid6",
.bg_flag= BTRFS_BLOCK_GROUP_RAID6,
.mindev_error   = BTRFS_ERROR_DEV_RAID6_MIN_NOT_MET,
@@ -4597,6 +4604,8 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle 
*trans,
int devs_min;   /* min devs needed */
int devs_increment; /* ndevs has to be a multiple of this */
int ncopies;/* how many copies to data has */
+   int nparity;/* number of stripes worth of bytes to
+  store parity information */
int ret;
u64 max_stripe_size;
u64 max_chunk_size;
@@ -4623,6 +4632,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle 
*trans,
devs_min = btrfs_raid_array[index].devs_min;
devs_increment = btrfs_raid_array[index].devs_increment;
ncopies = btrfs_raid_array[index].ncopies;
+   nparity = btrfs_raid_array[index].nparity;
 
if (type & BTRFS_BLOCK_GROUP_DATA) {
max_stripe_size = SZ_1G;
@@ -4752,13 +4762,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle 
*trans,
 * this will have to be fixed for RAID1 and RAID10 over
 * more drives
 */
-   data_stripes = num_stripes / ncopies;
-
-   if (type & BTRFS_BLOCK_GROUP_RAID5)
-   data_stripes = num_stripes - 1;
-
-   if (type & BTRFS_BLOCK_GROUP_RAID6)
-   data_stripes = num_stripes - 2;
+   data_stripes = (num_stripes - nparity) / ncopies;
 
/*
 * Use the number of data stripes to figure out how big this chunk
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 23e9285d88de..0fe005b4295a 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -331,6 +331,8 @@ struct btrfs_raid_attr {
int tolerated_failures; /* max tolerated fail devs */
in

Re: DUP dev_extent might overlap something next to it

2018-09-28 Thread Hans van Kranenburg
On 09/29/2018 01:30 AM, Hans van Kranenburg wrote:
> [...]
> 
> I didn't try filling it up and see what happens yet. Also, this can
> probably be done with a DUP chunk, but it's a bit harder to quickly prove.

DUP metadata chunk   ^^


-- 
Hans van Kranenburg


Re: DUP dev_extent might overlap something next to it

2018-09-28 Thread Hans van Kranenburg
On 09/25/2018 02:05 AM, Hans van Kranenburg wrote:
> (I'm using v4.19-rc5 code here.)
> 
> Imagine allocating a DATA|DUP chunk.
> 
> [blub, see previous message]

Steps to reproduce DUP chunk beyond end of device:

First create a 6302M block device and fill it up.

mkdir bork
cd bork
dd if=/dev/zero of=image bs=1 count=0 seek=6302M
mkfs.btrfs -d dup -m dup image
losetup -f image
mkdir mountpoint
mount -o space_cache=v2 /dev/loop0 mountpoint
cp -a /usr mountpoint

After a while, this starts throwing: No space left on device

Now we extend the size of the image to 7880MiB so that the next DUP data
chunk allocation will try to use exactly the 1578MiB of free raw disk
space and trigger the bug.

dd if=/dev/zero of=image bs=1 count=0 seek=7880M
losetup -c /dev/loop0
btrfs fi resize max mountpoint/

Now we trigger the new DUP data chunk allocation:

cp /vmlinuz mountpoint/

Now I have a dev extent starting at 7446986752 on devid 1, with length
838860800. This means it ends at 8285847552, which is 7902MiB: exactly
22MiB beyond the 7880MiB size of the device.
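
A trivial check of those numbers, in Python:

>>> MiB = 1024 * 1024
>>> dev_extent_end = 7446986752 + 838860800   # start + length from above
>>> (dev_extent_end - 7880 * MiB) // MiB      # overshoot beyond the device
22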

The only nice thing about this is that df shows me I still have more
than 8 exabyte of space in this image file.

Label: none  uuid: e711eea6-5332-44cf-9704-998a7a939970
    Total devices 1 FS bytes used 2.81GiB
    devid    1 size 7.70GiB used 7.72GiB path /dev/loop0

I didn't try filling it up and see what happens yet. Also, this can
probably be done with a DUP chunk, but it's a bit harder to quickly prove.
And making this happen in the middle of a block device instead of at the
end is also a bit harder.

-- 
Hans van Kranenburg


Re: python-btrfs v10 preview... detailed usage reporting and a tutorial

2018-09-28 Thread Hans van Kranenburg
On 09/24/2018 10:08 AM, Nikolay Borisov wrote:
>>
>> The bugs are all related to repeated kernel code all over the place
>> containing a lot of if statements dealing with different kind of
>> allocation profiles and their exceptions. What I ended up doing is
>> making a few helper functions instead, see the commit "Add volumes.py,
>> handling device / chunk logic". It would probably be nice to do the same
>> in the kernel code, which would also solve the mentioned bugs and
>> prevent new similar ones from happening.
> 
> Would you care to report each bug separately so they can be triaged and
> fixed?

In case of the RAID10 5GiB thing I think I was mixing up things. When
doing mkfs you end up with a RAID10 chunk of 5GiB (dunno why, didn't
research), but when mounting and pointing balance at it, I get a 10GiB
one back, so that's ok.

For the DUP thing, I sent an explanation ("DUP dev_extent might overlap
something next to it"), which doesn't seem to attract much attention
yet. I'm preparing a pile of patches to volumes.[ch] to fix this, clean
up things that I ran into and make the logic a bit less convoluted.

-- 
Hans van Kranenburg


DUP dev_extent might overlap something next to it

2018-09-24 Thread Hans van Kranenburg
(I'm using v4.19-rc5 code here.)

Imagine allocating a DATA|DUP chunk.

In the chunk allocator, we first set...
  max_stripe_size = SZ_1G;
  max_chunk_size = BTRFS_MAX_DATA_CHUNK_SIZE
... which is 10GiB.

Then...
  /* we don't want a chunk larger than 10% of writeable space */
  max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
   max_chunk_size);

Imagine we only have one 7880MiB block device in this filesystem. Now
max_chunk_size is down to 788MiB.

The next step in the code is to search for max_stripe_size * dev_stripes
amount of free space on the device, which is in our example 1GiB * 2 =
2GiB. Imagine the device has exactly 1578MiB free in one contiguous
piece. This amount of bytes will be put in devices_info[ndevs - 1].max_avail

Next we recalculate the stripe_size (which is actually the device extent
length), based on the actual maximum amount of available raw disk space:
  stripe_size = div_u64(devices_info[ndevs - 1].max_avail, dev_stripes);

stripe_size is now 789MiB

Next we do...
  data_stripes = num_stripes / ncopies
...where data_stripes ends up as 1, because num_stripes is 2 (the amount
of device extents we're going to have), and DUP has ncopies 2.

Next there's a check...
  if (stripe_size * data_stripes > max_chunk_size)
...which matches because 789MiB * 1 > 788MiB.

We go into the if code, and next is...
  stripe_size = div_u64(max_chunk_size, data_stripes);
...which resets stripe_size to max_chunk_size: 788MiB

Next is a fun one...
  /* bump the answer up to a 16MB boundary */
  stripe_size = round_up(stripe_size, SZ_16M);
...which changes stripe_size from 788MiB to 800MiB.

We're not done changing stripe_size yet...
  /* But don't go higher than the limits we found while searching
   * for free extents
   */
  stripe_size = min(devices_info[ndevs - 1].max_avail,
stripe_size);

This is bad. max_avail is twice the stripe_size (we need to fit 2 device
extents on the same device for DUP).

The result here is that 800MiB < 1578MiB, so it's unchanged. However,
the resulting DUP chunk will need 1600MiB disk space, which isn't there,
and the second dev_extent might extend into the next thing (next
dev_extent? end of device?) for 22MiB.
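
To make the above arithmetic easy to replay, here it is as a plain
Python back-of-the-envelope sketch (the round_up helper only mimics the
kernel macro, and the numbers are the example values from above):

#!/usr/bin/python3

SZ_1M = 1024 * 1024
SZ_16M = 16 * SZ_1M

def round_up(x, align):
    return (x + align - 1) & ~(align - 1)

dev_stripes = 2                  # DUP: two stripes on the same device
data_stripes = 1                 # num_stripes / ncopies = 2 / 2
max_chunk_size = 788 * SZ_1M     # 10% of the 7880MiB device
max_avail = 1578 * SZ_1M         # contiguous free raw space found

stripe_size = max_avail // dev_stripes               # 789MiB
if stripe_size * data_stripes > max_chunk_size:
    stripe_size = max_chunk_size // data_stripes     # 788MiB
    stripe_size = round_up(stripe_size, SZ_16M)      # 800MiB
    stripe_size = min(max_avail, stripe_size)        # still 800MiB, too big

print(stripe_size // SZ_1M)                  # 800
print(stripe_size * dev_stripes // SZ_1M)    # 1600, while only 1578 is free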

The last shown line of code relies on the variable stripe_size still
holding twice the intended stripe_size value in the DUP case. This was
actually the case before commit 92e222df7b
"btrfs: alloc_chunk: fix DUP stripe size handling", from which I quote:
  "[...] in the meantime there's a check to see if the stripe_size does
not exceed max_chunk_size. Since during this check stripe_size is twice
the amount as intended, the check will reduce the stripe_size to
max_chunk_size if the actual correct to be used stripe_size is more than
half the amount of max_chunk_size."

In the previous version of the code, the 16MiB alignment (why is this
done, by the way?) would result in a 50% chance that it would actually
do an 8MiB alignment for the individual dev_extents, since it was
operating on double the size. Does this matter?

Does it matter that stripe_size can be set to anything which is not
16MiB aligned because of the amount of remaining available disk space
which is just taken?

What is the main purpose of this round_up?

The most straightforward thing to do seems to be something like...
  stripe_size = min(div_u64(devices_info[ndevs - 1].max_avail, dev_stripes),
                    stripe_size);
...just putting half of the max_avail into stripe_size.

I hope the above analysis is correct...

If this is fixed, then it will be the 5th commit in btrfs history which
tries to fix the same thing in this code again, where all the previous
ones were breaking things while fixing things.

The real problem is that in the very first version of this code, the
stripe_size variable was already reused for very different purposes
throughout the function. And it still is, all the time.

So, while it may seem a very logical fix (again), I guess this needs more
eyes, since we missed this line the previous time. D:

-- 
Hans van Kranenburg


python-btrfs v10 preview... detailed usage reporting and a tutorial

2018-09-23 Thread Hans van Kranenburg
Hi all,

I'm planning for a python-btrfs release to happen in a about week.

All new changes are in the develop branch:
https://github.com/knorrie/python-btrfs/commits/develop

tl;dr: check out the two new examples added in the latest git commits
and see if they provide correct info!

## Detailed usage reporting

The new FsUsage object provides information on different levels
(physical allocated bytes on devices, virtual space usage, etc) and also
contains code to estimate how much space is still actually really
available before ENOSPC happens. (e.g. the values that would ideally
show up in df output). It works for all allocation profiles!

Two examples have been added, which use the new code. I would appreciate
extra testing. Please try them and see if the reported numbers make sense:

space_calculator.py
---
Best to be initially described as a CLI version of the well-known
web-based btrfs space calculator by Hugo. ;] Throw a few disk sizes at
it, choose data and metadata profile and see how much space you would
get to store actual data.

See commit message "Add example to calculate usable and wasted space"
for example output.

show_usage.py
-
The contents of the old show_usage.py example that simply showed a list
of block groups are replaced with a detailed usage report of an existing
filesystem.

See commit message "A new show usage example!" for example output.

## A btrfs tutorial!

A while ago I started creating documentation for python-btrfs in
tutorial style. By playing around with an example filesystem we learn
where btrfs puts our data on the disks, what a chunk, block group and an
extent is, how we can quickly look up interesting things in metadata and
how cows climb trees, moo.

https://github.com/knorrie/python-btrfs/issues/11
https://github.com/knorrie/python-btrfs/blob/tutorial/tutorial/README.md

I'm not sure yet if I'm going to 'ship' the first few pages already,
since it's still very much a work in progress, but in any case feedback
/ ideas are welcome. Have a look!

## Other changes

Other changes are the addition of the sync, fideduperange and
get_features ioctl calls and a workaround for python 3.7 which breaks
the struct module api.

## P.S.

And finally, when doing the above, I discovered a few extra unintended
features and bugs in the btrfs chunk allocator (Did you know RAID10
block groups are limited to 5GiB in size? Did you know that when the
last chunk added on a disk is of DUP type, it could end up having an end
beyond the limit of a device?).

I still have to actually test the second one, causing it to happen.
If anyone is interested to help with that, please ask about it.

The bugs are all related to repeated kernel code all over the place
containing a lot of if statements dealing with different kind of
allocation profiles and their exceptions. What I ended up doing is
making a few helper functions instead, see the commit "Add volumes.py,
handling device / chunk logic". It would probably be nice to do the same
in the kernel code, which would also solve the mentioned bugs and
prevent new similar ones from happening.

Have fun,
-- 
Hans van Kranenburg


Re: very poor performance / a lot of writes to disk with space_cache (but not with space_cache=v2)

2018-09-19 Thread Hans van Kranenburg
On 09/19/2018 10:04 PM, Martin Steigerwald wrote:
> Hans van Kranenburg - 19.09.18, 19:58:
>> However, as soon as we remount the filesystem with space_cache=v2 -
>>
>>> writes drop to just around 3-10 MB/s to each disk. If we remount to
>>> space_cache - lots of writes, system unresponsive. Again remount to
>>> space_cache=v2 - low writes, system responsive.
>>>
>>> That's a huuge, 10x overhead! Is it expected? Especially that
>>> space_cache=v1 is still the default mount option?
>>
>> Yes, that does not surprise me.
>>
>> https://events.static.linuxfound.org/sites/events/files/slides/vault20
>> 16_0.pdf
>>
>> Free space cache v1 is the default because of issues with btrfs-progs,
>> not because it's unwise to use the kernel code. I can totally
>> recommend using it. The linked presentation above gives some good
>> background information.
> 
> What issues in btrfs-progs are that?

Missing code to make offline changes to a filesystem that has a free
space tree. So when using btrfstune / repair / whatever you first need
to remove the whole free space tree with a command, and then add it back
on the next mount.

For me personally that's not a problem (I don't have to make offline
changes), but I understand that having that situation out of the box for
every new user would be a bit awkward.

> I am wondering whether to switch to freespace tree v2. Would it provide 
> benefit for a regular / and /home filesystems as dual SSD BTRFS RAID-1 
> on a laptop?

As shown in the linked presentation, it provides benefit on a largeish
filesystem and if your writes are touching a lot of different block
groups (since v1 writes out the full space cache for all of them on
every transaction commit). I'd say, it provides benefit as soon as you
encounter filesystem delays because of it, and as soon as you see using
it eases the pain a lot. So, yes, that's your case.

-- 
Hans van Kranenburg


Re: very poor performance / a lot of writes to disk with space_cache (but not with space_cache=v2)

2018-09-19 Thread Hans van Kranenburg
Hi,

On 09/19/2018 10:43 AM, Tomasz Chmielewski wrote:
> I have a mysql slave which writes to a RAID-1 btrfs filesystem (with
> 4.17.14 kernel) on 3 x ~1.9 TB SSD disks; filesystem is around 40% full.
> 
> The slave receives around 0.5-1 MB/s of data from the master over the
> network, which is then saved to MySQL's relay log and executed. In ideal
> conditions (i.e. no filesystem overhead) we should expect some 1-3 MB/s
> of data written to disk.
> 
> MySQL directory and files in it are chattr +C (since the directory was
> created, so all files are really +C); there are no snapshots.
> 
> 
> Now, an interesting thing.
> 
> When the filesystem is mounted with these options in fstab:
> 
> defaults,noatime,discard
> 
> We can see a *constant* write of 25-100 MB/s to each disk. The system is
> generally unresponsive and it sometimes takes long seconds for a simple
> command executed in bash to return.

Did you already test the difference with/without 'discard'? Also, I
think that depending on the tooling that you use to view disk IO,
discards will also show up as disk write statistics.

> However, as soon as we remount the filesystem with space_cache=v2 -
> writes drop to just around 3-10 MB/s to each disk. If we remount to
> space_cache - lots of writes, system unresponsive. Again remount to
> space_cache=v2 - low writes, system responsive.
> 
> That's a huuge, 10x overhead! Is it expected? Especially that
> space_cache=v1 is still the default mount option?

Yes, that does not surprise me.

https://events.static.linuxfound.org/sites/events/files/slides/vault2016_0.pdf

Free space cache v1 is the default because of issues with btrfs-progs,
not because it's unwise to use the kernel code. I can totally recommend
using it. The linked presentation above gives some good background
information.

Another thing that's interesting is finding out what kind of things
btrfs is writing if it's writing that many MB/s to disk. Doing this is
not trivial.

I've been spending quite some time researching these kind of issues.

Here's what I found out:
https://www.spinics.net/lists/linux-btrfs/msg70624.html (oh wow, that's
almost a year ago already)

There are a bunch of tracepoints in the kernel code that could help
debugging all of this more, but I've not yet gotten around to writing
something that conveniently uses them to show live what's happening.

I'm still using the "Thanks to a bug, solved in [2]" way of combining
extent allocators in btrfs (described in the above mailing list post) to
keep things workable on the larger filesystem.

-- 
Hans van Kranenburg


Re: Move data and mount point to subvolume

2018-09-18 Thread Hans van Kranenburg
On 09/18/2018 08:10 PM, Marc Joliet wrote:
> Am Sonntag, 16. September 2018, 14:50:04 CEST schrieb Hans van Kranenburg:
>> The last example, where you make a subvolume and move everything into
>> it, will not do what you want. Since a subvolume is a separate new
>> directory/file hierarchy, mv will turn into a cp and rm operation
>> (without warning you) probably destroying information about data shared
>> between files.
> 
> I thought that wasn't true anymore.  The NEWS file to coreutils contains this 
> (for version 8.24):
> 
>   mv will try a reflink before falling back to a standard copy, which is
>   more efficient when moving files across BTRFS subvolume boundaries.

Indeed, I was wrong when saying that mv will copy/rm between subvolumes,
because you can reflink files between subvolumes, as long as they're
under the same mount point. (You still can NOT between mount points
iirc, even when they are mounts from the same btrfs.)

But still, mv silently does one or the other, which is also confusing,
because if it starts copying/removing things while that was not your
intention, at first you're like "hm, takes longer than I thought" and
then "oh wait, no", and then it's too late already.

-- 
Hans van Kranenburg


Re: can anyone receive this message

2018-09-17 Thread Hans van Kranenburg
On 09/18/2018 02:13 AM, sunny.s.zhang wrote:
> 
> Can anyone receive this message? I found i can receive message from the
> mail list, but can't receive message that send by myself.

Yes.

You can also check public list archives to see if your message shows up,
e.g.:

https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg80664.html

If it does, then the message reached the list, but your own incoming
mail server might be throwing away email that has your own address in
the From but was forwarded via the mailing list?

-- 
Hans van Kranenburg


Re: Move data and mount point to subvolume

2018-09-16 Thread Hans van Kranenburg
On 09/16/2018 02:37 PM, Hans van Kranenburg wrote:
> On 09/16/2018 01:14 PM, Rory Campbell-Lange wrote:
>> Hi
>>
>> We have a backup machine that has been happily running its backup
>> partitions on btrfs (on top of a luks encrypted disks) for a few years.
>>
>> Our backup partition is on /bkp which is a top level subvolume. 
>> Data, RAID1: total=2.52TiB, used=1.36TiB
>> There are no other subvolumes.
>>
>> We have /bkp with /bkp/backup in it. We would like to mount /bkp/backup
>> at /bkp instead. Note that /bkp/backup has a lot of hardlinked files.
>>
>> I guess I could do 
>>
>> cd /bkp/backup
>> mv * ../
>> rmdir backup
> 
> Doing it the other way around is easier, since you don't have to think
> about hardlinked/reflinked/etc/anything while copying data:

Oh wait, I'm stupid, I was reading too quickly. Let me finish my coffee
first.

> btrfs sub snap /bkp /bkp/backup
> 
> Now you have the exact identical thing in /backup, and you can start
> throwing away everything outside of /backup.
> 
> To reduce chance for accidental errors, you can snapshot with -r to make
> the new /backup read-only first. Then after removing everything outside
> it, make it rw again with:
> 
> btrfs property set -ts /bkp/backup ro false
> 
>> But would it also be possible to do something like
>>
>> cd /bkp
>> btrfs subvol create backup-subvol
>> mv /bkp/backup/* /bkp/backup-subvol
>> ... then mount /bkp/backup-subvol at /bkp
>>
>> Would this second approach work, and preserve hardlinks?

You essentially want to turn a subdirectory into a subvolume.

The last example, where you make a subvolume and move everything into
it, will not do what you want. Since a subvolume is a separate new
directory/file hierarchy, mv will turn into a cp and rm operation
(without warning you) probably destroying information about data shared
between files.

So, what you can do is (similar to what I earlier typed):

* You now have subvol 5 (at /bkp)
* btrfs sub snap /bkp /bkp/backup-subvol
* Now you have a new subvol (256 or higher number) which already
contains everything
* Then inside /bkp/backup-subvol you will again see
/bkp/backup-subvol/backup (since you snapshotted the toplevel which also
contains it)
* Now mv /bkp/backup-subvol/backup/* /bkp/backup-subvol (so the mv
operation stays within the same subvolume)
* Then remove everything outside /bkp/backup-subvol and mv
/bkp/backup-subvol /bkp/backup, and then voila... you can now use
subvol=/backup to mount it.

>> The machine is btrfs-progs v4.7.3 Linux 4.9.0-8-amd64 on Debian. The
>> coreutils version is 8.26-3.
> 
> Another note: since it seems you have about 100% more data space
> allocated (set apart for data purpose) than you're actually using, or in
> other words, having the 1GiB chunks on average just 50% filled...
> 
>Data, RAID1: total=2.52TiB, used=1.36TiB
> 
> ...in combination with using Linux 4.9, I suspect there's also 'ssd' in
> your mount options (not in fstab, but enabled by btrfs while mounting,
> see /proc/mounts or mount command output)?
> 
> If so, this is a nice starting point for more info about what might also
> be happening to your filesystem:
> https://www.spinics.net/lists/linux-btrfs/msg70622.html


Have fun,


-- 
Hans van Kranenburg


Re: Move data and mount point to subvolume

2018-09-16 Thread Hans van Kranenburg
On 09/16/2018 01:14 PM, Rory Campbell-Lange wrote:
> Hi
> 
> We have a backup machine that has been happily running its backup
> partitions on btrfs (on top of a luks encrypted disks) for a few years.
> 
> Our backup partition is on /bkp which is a top level subvolume. 
> Data, RAID1: total=2.52TiB, used=1.36TiB
> There are no other subvolumes.
> 
> We have /bkp with /bkp/backup in it. We would like to mount /bkp/backup
> at /bkp instead. Note that /bkp/backup has a lot of hardlinked files.
> 
> I guess I could do 
> 
> cd /bkp/backup
> mv * ../
> rmdir backup

Doing it the other way around is easier, since you don't have to think
about hardlinked/reflinked/etc/anything while copying data:

btrfs sub snap /bkp /bkp/backup

Now you have the exact identical thing in /backup, and you can start
throwing away everything outside of /backup.

To reduce chance for accidental errors, you can snapshot with -r to make
the new /backup read-only first. Then after removing everything outside
it, make it rw again with:

btrfs property set -ts /bkp/backup ro false

> But would it also be possible to do something like
> 
> cd /bkp
> btrfs subvol create backup-subvol
> mv /bkp/backup/* /bkp/backup-subvol
> ... then mount /bkp/backup-subvol at /bkp
> 
> Would this second approach work, and preserve hardlinks?
> 
> The machine is btrfs-progs v4.7.3 Linux 4.9.0-8-amd64 on Debian. The
> coreutils version is 8.26-3.

Another note: since it seems you have about 100% more data space
allocated (set apart for data purpose) than you're actually using, or in
other words, having the 1GiB chunks on average just 50% filled...

   Data, RAID1: total=2.52TiB, used=1.36TiB

...in combination with using Linux 4.9, I suspect there's also 'ssd' in
your mount options (not in fstab, but enabled by btrfs while mounting,
see /proc/mounts or mount command output)?

If so, this is a nice starting point for more info about what might also
be happening to your filesystem:
https://www.spinics.net/lists/linux-btrfs/msg70622.html

-- 
Hans van Kranenburg


Re: state of btrfs snapshot limitations?

2018-09-15 Thread Hans van Kranenburg
On 09/15/2018 05:56 AM, James A. Robinson wrote:
> [...]
> 
> I've got to read up a bit more on subvolumes, I am missing some
> context from the warnings given by Chris regarding per-subvolume
> options.

Btrfs lets you mount the filesystem multiple times, e.g. with a
different subvolume id, so you can mount part of the filesystem somewhere.

Some of the mount options (many btrfs specific ones, like space_cache*,
autodefrag etc) get a value when doing the first mount, and subsequent
ones cannot change them any more, because they're filesystem-wide behavior.

Some others can be changed on each individual mount (like the atime
options), and when omitting them you get the non-optimal default again.

-- 
Hans van Kranenburg


Re: state of btrfs snapshot limitations?

2018-09-14 Thread Hans van Kranenburg
Hi,

On 09/14/2018 11:05 PM, James A. Robinson wrote:
> The mail archive seems to indicate this list is appropriate
> for not only the technical coding issues, but also for user
> questions, so I wanted to pose a question here.  If I'm
> wrong about that, I apologize in advance.

It's fine. Your observation is correct.

> The page
> 
> https://btrfs.wiki.kernel.org/index.php/Incremental_Backup
> 
> talks about the basic snapshot capabilities of btrfs and led
> me to look up what, if any, limits might apply.  I find some
> threads from a few years ago that talk about limiting the
> number of snapshots for a volume to 100.
> 
> The reason I'm curious is I wanted to try and use the
> snapshot capability as a way of keeping a 'history' of a
> backup volume I maintain.  The backup doesn't change a
> lot overtime, but small changes are made to files within
> it daily.

The 100 above is just a number because users ask "ok, but *how* many?".

As far as I know, the real thing that is causing complexity for the
filesystem is how many changes keep being made to the subvolume after it
has been snapshotted.

Creating btrfs snapshots is cheap. Only as soon as you start making
modifications, the subvolume in which you make the changes is going to
diverge from the other ones which share the same history. And changes
mean not only changes to data (changing, adding, removing files), but
also pure metadata changes (e.g. using the touch command on a file).

When just using the snapshots, opening and reading files etc, this
should however not be a big problem.

But other btrfs-specific actions are affected, like balance, device
remove, and using quota.

In any case, make sure:
- you are not using quota / qgroups (highly affected by this sort of
complexity)
- you *always* mount with noatime (which is not the default, and yes,
noatime, not relatime or anything else) to prevent unnecessary metadata
changes which cause exactly this kind of complexity to happen.

When doing this, and not having to use btrfs balance and add/remove
disks, and if the data doesn't change much over time (especially if it's
just adding new stuff all the time), you are likely able to have far
more snapshots of the thing.

> The Plan 9 OS has a nice archival filesystem that lets you
> easily maintain snapshots, and has various tools that make
> it simple to keep a /snapshot/yyyy/mmdd snapshot going back
> for the life of the filesystem.
> 
> I wanted to try and replicate the basic functionality of
> that history using a non-plan-9 filesystem.  At first I
> tried rsnapshot but I find its technique of rotating and
> deleting backups is thrashing the disks to the point that it
> can't keep up with the rotations (the cp -al is fast, but
> the periodic rm -rf of older snapshots kills the disk).

Yes, btrfs snapshots are already a huge improvement compared to that.
(Also, with cp -l, modifications end up in the "snapshots" as well,
because it's still the same file.)

> With btrfs I was thinking perhaps I could more efficiently
> maintain the archive of changes over time using a snapshot.
> If this is an awful thought and I should just go away,
> please let me know.
> 
> If the limit is 100 or less I'd need use a more complicated
> rotation scheme.  For example with a layout like the
> following:
> 
> min/
> hour/
> day/
> month/
> year/
> 
> The idea being each bucket, min, hour, day, month, would
> be capped and older snapshots would be removed and replaced
> with newer ones over time.
> 
> so with a 15-minute snapshot cycle I'd end up with
> 
> min/[00,15,30,45]
> hour/[00-23]
> day/[01-31]
> month/[01-12]
> year/[2018,2019,...]
> 
> (72+ snapshots with room for a few years worth of yearly's).
> 
> But if things have changed with btrfs over the past few
> years and number of snapshots scales much higher, I would
> use the easier scheme:
> 
> /min/[00,15,30,45]
> /hourly/[00-23]
> /daily//
> 
> with 365 snapshots added per additional year.

There are tools available that can do this for you. The one I use is
btrbk, https://github.com/digint/btrbk (probably packaged in your
favorite linux distro).

I'd say, just try it. Add a snapshot schedule in your btrbk config, and
set it to never expire older ones. Then, just see what happens, and only
if you start seeing things slow down a lot, start worrying about what to
do, and let us know how far you got.

Have fun,

P.S. Here's an unfinished page from a tutorial that I'm writing that is
still heavily under construction, which touches the subject of
snapshotting data and metadata. Maybe it might help to explain
"complexity starts when changing things" more:

https://github.com/knorrie/python-btrfs/blob/tutorial/tutorial/cows.md

-- 
Hans van Kranenburg


Re: How to ensure that a snapshot is not corrupted?

2018-08-14 Thread Hans van Kranenburg
On 08/10/2018 12:07 PM, Cerem Cem ASLAN wrote:
> Original question is here: https://superuser.com/questions/1347843
> 
> How can we sure that a readonly snapshot is not corrupted due to a disk 
> failure?
> 
> Is the only way calculating the checksums one on another and store it
> for further examination, or does BTRFS handle that on its own?

It's no different than any other data stored in your filesystem.

So when just reading things from the snapshot, or when using the btrfs
scrub functionality, it will tell you if data that is read back matches
the checksums.

-- 
Hans van Kranenburg


Re: trouble mounting btrfs filesystem....

2018-08-14 Thread Hans van Kranenburg
On 08/14/2018 07:09 PM, Andrei Borzenkov wrote:
> 14.08.2018 18:16, Hans van Kranenburg wrote:
>> On 08/14/2018 03:00 PM, Dmitrii Tcvetkov wrote:
>>>> Scott E. Blomquist writes:
>>>>  > Hi All,
>>>>  > 
>>>>  > [...]
>>>
>>> I'm not a dev, just user.
>>> btrfs-zero-log is for very specific case[1], not for transid errors.
>>> Transid errors mean that some metadata writes are missing, if
>>> they prevent you from mounting filesystem it's pretty much fatal. If
>>> btrfs could recover metadata from good copy it'd have done that.
>>>
>>> "wanted 2159304 found 2159295" means that some metadata is stale by 
>>> 9 commits. You could try to mount it with "ro,usebackuproot" mount
>>> options as readonly mount is less strict. If that works you can try
>>> "usebackuproot" without ro option. But 9 commits is probably too much
>>> and there isn't enough data to rollback so far.
>>
>> Keep in mind that a successful mount with usebackuproot does not mean
>> you're looking at a consistent filesystem. After each transaction
>> commit, disk space that is no longer referenced is immediately freed for
>> reuse.
>>
> 
> With all respect - in this case btrfs should not even pretend it can do
> "usebackuproot". What this option is for then?

I have no idea. As the Dutch say: "schiet mij maar in de kerstboom"
(roughly: "beats me"). The commit that introduces it contains no
explanation at all.

The only thing I can think of is when you're a developer and writing
buggy code that writes out corrupted tree root blocks, which makes the
filesystem crash in some way directly after a transaction ends while no
other new writes have been done yet, combined with getting fed up with
having to mkfs and put your testdata back all the time.

>> So, even if you can mount with usebackuproot, you have to hope that none
>> of the metadata blocks that were used back then have been overwritten
>> already, even the ones in distant corners of trees. A full check / scrub
>> / etc would be needed to find out.

-- 
Hans van Kranenburg


Re: [PATCH] Btrfs: correctly caculate item size used when item key collision happends

2018-08-14 Thread Hans van Kranenburg
ndex 452 namelen 26 name: abcdefghijklmnopqrstuv0451
> 
> Signed-off-by: ethanwu 
> ---
>  fs/btrfs/ctree.c   | 15 +--
>  fs/btrfs/extent-tree.c |  5 +++--
>  fs/btrfs/file-item.c   |  2 +-
>  3 files changed, 17 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 4bc326d..3614dd9 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -2678,8 +2678,8 @@ static struct extent_buffer 
> *btrfs_search_slot_get_root(struct btrfs_root *root,
>   * @p:   Holds all btree nodes along the search path
>   * @root:The root node of the tree
>   * @key: The key we are looking for
> - * @ins_len: Indicates purpose of search, for inserts it is 1, for
> - *   deletions it's -1. 0 for plain searches
> + * @ins_len: Indicates purpose of search, for inserts it is item size
> + *   including btrfs_item, for deletions it's -1. 0 for plain 
> searches
>   * @cow: boolean should CoW operations be performed. Must always be 1
>   *   when modifying the tree.
>   *
> @@ -2893,6 +2893,17 @@ int btrfs_search_slot(struct btrfs_trans_handle 
> *trans, struct btrfs_root *root,
>   }
>   } else {
>   p->slots[level] = slot;
> + /*
> +  * Item key collision happens. In this case, if we are allowed
> +  * to insert the item (for example, in the dir_item case, item
> +  * key collision is allowed), it will be merged with the
> +  * original item. Only the item size grows, no new btrfs item
> +  * will be added. Since ins_len already accounts for the size of
> +  * struct btrfs_item, this value is counted twice. Deduct it
> +  * here so the leaf space check will be correct.
> +  */
> + if (ret == 0 && ins_len > 0)
> + ins_len -= sizeof(struct btrfs_item);
>   if (ins_len > 0 &&
>   btrfs_leaf_free_space(fs_info, b) < ins_len) {
>   if (write_lock_level < 1) {
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 3d9fe58..4e3aa09 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -1094,7 +1094,7 @@ static int convert_extent_item_v0(struct 
> btrfs_trans_handle *trans,
>  
>   new_size -= sizeof(*ei0);
>   ret = btrfs_search_slot(trans, root, , path,
> - new_size + extra_size, 1);
> + new_size + extra_size + sizeof(struct 
> btrfs_item), 1);
>   if (ret < 0)
>   return ret;
>   BUG_ON(ret); /* Corruption */
> @@ -1644,7 +1644,8 @@ int lookup_inline_extent_backref(struct 
> btrfs_trans_handle *trans,
>   }
>  
>  again:
> - ret = btrfs_search_slot(trans, root, , path, extra_size, 1);
> + ret = btrfs_search_slot(trans, root, , path,
> + extra_size + sizeof(struct btrfs_item), 1);
>   if (ret < 0) {
>   err = ret;
>   goto out;
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index f9dd6d1..0b6c581 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -804,7 +804,7 @@ int btrfs_csum_file_blocks(struct btrfs_trans_handle 
> *trans,
>*/
>   btrfs_release_path(path);
>   ret = btrfs_search_slot(trans, root, _key, path,
> - csum_size, 1);
> + csum_size + sizeof(struct btrfs_item), 1);
>   if (ret < 0)
>   goto fail_unlock;
>  
> 


-- 
Hans van Kranenburg


Re: trouble mounting btrfs filesystem....

2018-08-14 Thread Hans van Kranenburg
On 08/14/2018 03:00 PM, Dmitrii Tcvetkov wrote:
>> Scott E. Blomquist writes:
>>  > Hi All,
>>  > 
>>  > [...]
> 
> I'm not a dev, just user.
> btrfs-zero-log is for very specific case[1], not for transid errors.
> Transid errors mean that some metadata writes are missing, if
> they prevent you from mounting filesystem it's pretty much fatal. If
> btrfs could recover metadata from good copy it'd have done that.
> 
> "wanted 2159304 found 2159295" means that some metadata is stale by 
> 9 commits. You could try to mount it with "ro,usebackuproot" mount
> options as readonly mount is less strict. If that works you can try
> "usebackuproot" without ro option. But 9 commits is probably too much
> and there isn't enough data to rollback so far.

Keep in mind that a successful mount with usebackuproot does not mean
you're looking at a consistent filesystem. After each transaction
commit, disk space that is no longer referenced is immediately freed for
reuse.

So, even if you can mount with usebackuproot, you have to hope that none
of the metadata blocks that were used back then have been overwritten
already, even the ones in distant corners of trees. A full check / scrub
/ etc would be needed to find out.

-- 
Hans van Kranenburg


Re: trouble mounting btrfs filesystem....

2018-08-14 Thread Hans van Kranenburg
On 08/12/2018 09:19 PM, Scott E. Blomquist wrote:
> 
> Hi All,
> 
> Early this morning there was a power glitch that affected our system.
> 
> The second enclosure went offline but the file system stayed up for a
> bit before rebooting and recovering the 2 missing arrays sdb1 and
> sdc1.
> 
> When mounting we get
> 
> Aug 12 14:52:43 localhost kernel: [ 8536.649270] BTRFS info (device 
> sda1): has skinny extents
> Aug 12 14:54:52 localhost kernel: [ 8665.900321] BTRFS error (device 
> sda1): parent transid verify failed on 177443463479296 wanted 2159304 found 
> 2159295
> Aug 12 14:54:52 localhost kernel: [ 8665.985512] BTRFS error (device 
> sda1): parent transid verify failed on 177443463479296 wanted 2159304 found 
> 2159295
> Aug 12 14:54:52 localhost kernel: [ 8666.056845] BTRFS error (device 
> sda1): failed to read block groups: -5
> Aug 12 14:54:52 localhost kernel: [ 8666.254178] BTRFS error (device 
> sda1): open_ctree failed
> 
> We are here...
> 
> # uname -a
> Linux localhost 4.17.14-custom #1 SMP Sun Aug 12 11:54:00 EDT 2018 x86_64 
> x86_64 x86_64 GNU/Linux
> 
> # btrfs --version
> btrfs-progs v4.17.1
> 
> # btrfs filesystem show
> Label: none  uuid: 8337c837-58cb-430a-a929-7f6d2f50bdbb
> Total devices 3 FS bytes used 75.05TiB
> devid1 size 47.30TiB used 42.07TiB path /dev/sda1
> devid2 size 21.83TiB used 16.61TiB path /dev/sdb1
> devid3 size 21.83TiB used 16.61TiB path /dev/sdc1

What kind of devices are these? You say enclosure... is it a bunch of
disks doing its own RAID, with btrfs on top?

Do you have RAID1 metadata on top of that, or single?

At least if you go the mkfs route (I read the other replies) then also
find out what happened. If your storage is losing data in situations
like this while it told btrfs that the data was safe, you're running a
dangerous operation.

-- 
Hans van Kranenburg


Re: [RFC PATCH 00/17] btrfs zoned block device support

2018-08-10 Thread Hans van Kranenburg
On 08/10/2018 09:28 AM, Qu Wenruo wrote:
> 
> 
> On 8/10/18 2:04 AM, Naohiro Aota wrote:
>> This series adds zoned block device support to btrfs.
>>
>> [...]
> 
> And this is the patch modifying extent allocator.
> 
> Despite that, the support zoned storage looks pretty interesting and
> have something in common with planned priority-aware extent allocator.

Priority-aware allocator? Is someone actually working on that, or is it
planned like everything is 'planned' (i.e. nice idea, and might happen
or might as well not happen ever, SIYH)?

-- 
Hans van Kranenburg





Re: Expected Received UUID

2018-07-31 Thread Hans van Kranenburg
Hi,

Can you please report the kernel versions you are using on every system
(uname -a)?

I would indeed expect the Received UUID value for C to have the same
uuid as the original UUID of A, so the b00e8ba1 one.

The send part, generating the send stream, is done by the running kernel.
The receive part is done by the userspace btrfs-progs. The btrfs-progs
receive code just sets the Received UUID to whatever it reads from the
send stream information.

So, your sending kernel version is important here.

When looking up the lines that send the UUID, in fs/btrfs/send.c, I can
see there was a problem like this in the past:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b96b1db039ebc584d03a9933b279e0d3e704c528

It was introduced between Linux 4.1 and 4.2 and solved in 2015 with that
commit, which ended up somewhere in 4.3, I think.

From the btrfs-progs versions you list, I suspect this is a Debian
Jessie and 2x Debian Stretch, but the Debian 3.16 and 4.9 kernels should
not have this issue.
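
A minimal set of things to collect on each machine to rule that out (the
snapshot path below is just the one from your example):

  $ uname -r      <- the sending kernel is what builds the send stream
  $ btrfs version
  $ btrfs subvolume show /home/test/snaps/snapshot1 | grep -i uuid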

On 07/31/2018 07:42 PM, Gaurav Sanghavi wrote:
> Hello everyone, 
> 
> While researching BTRFS for a project that involves backing up snapshots
> from device A to device B 
> before consolidating backups from device B to device C, I noticed that
> received UUID on snapshot on 
> device C is not the same as received UUID for the same snapshot on
> device B. Here are my steps:
> 
> 1. Device A
> BTRFS version: v3.17
> 
> btrfs su sn -r /home/test/snapshot1 /home/test/snaps/snapshot1
> btrfs su show /home/test/snaps/snapshot1
> Name: snapshot1
> uuid: b00e8ba1-5aaa-3641-9c4c-e168eee5c296
> Parent uuid: cb570dec-e9fd-1f40-99d2-2070c8ee2466
>         Received UUID:      ---
> Creation time: 2018-07-30 18:32:37
> Flags: readonly
> 
> 2. Send to Device B
> btrfs send /home/test/snaps/snapshot1 | ssh  'btrfs receive
> /home/backups/'
> 
> After send completes, on Device B
> BTRFS version: v4.7.3
> btrfs su show /home/backups/snapshot1
> Name: snapshot1
> UUID: 7c13d189-7fee-584e-ac90-e68cb0012f5c
> Parent UUID: a2314f7c-4b11-ed40-901f-f1acb5ebf802
> Received UUID: b00e8ba1-5aaa-3641-9c4c-e168eee5c296
> Creation time: 2018-07-30 18:42:37 -0700
> Flags: readonly
> 
> 
> 3. Send to Device C
> btrfs send /home/backups/snapshot1 | ssh  'btrfs receive
> /home/backups2/'
> 
> After send completes, on Device C
> BTRFS version: v4.7.3
> btrfs su show /home/backups2/snapshot1
> Name: snapshot1
> UUID: 8a13aab5-8e44-2541-9082-bc583933b964
> Parent UUID: 54e9b4ff-46dc-534e-b70f-69eb7bb21028
> Received UUID: 7c13d189-7fee-584e-ac90-e68cb0012f5c
> Creation time: 2018-07-30 18:58:32 -0700
> Flags: readonly
> 
> 1. I have gone through some of the archived emails and have noticed
> people mentioning that
> if received UUID is set, btrfs send propogates this 'received UUID'. But
> in above case, 
> it's different for the same snapshot on all three devices. Is this the
> expected behavior ?
> 
> 2. We want to be able to start backing up from Device A to C, in case B
> goes down or needs 
> to be replaced. If received UUID is expected to differ for the snapshot
> on device B and C, incremental
> backups will not work from A to C without setting received UUID. I have
> seen python-btrfs
> mentioned in a couple of emails; but have anyone of you used it in a
> production environment ?
> 
> This is my first post to this email. Please let me know if I am missing
> any details.
> 
> Thanks,
> Gaurav
> 

-- 
Hans van Kranenburg


Re: Unmountable root partition

2018-07-31 Thread Hans van Kranenburg
Hi,

On 07/31/2018 08:03 PM, Cerem Cem ASLAN wrote:
> Hi, 
> 
> I'm having trouble with my server setup, which contains a BTRFS root
> partition on top of LVM on top of LUKS partition. 
> 
> Yesterday server was shut down unexpectedly. I booted the system with a
> pendrive which contains Debian 4.9.18 and tried to mount the BTRFS root
> partition manually. 
> 
> 1. cryptsetup luksOpen /dev/sda5 
> 
> Seems to decrypt the partition (there are no errors) 
> 
> 2. /dev/mapper/foo--vg-root and /dev/mapper/foo--vg-swap_1 is created
> automatically, so I suppose LVM works correctly. 
> 
> 3. mount -t btrfs /dev/mapper/foo--vg-root /mnt/foo 
> Gives the following error: 
> 
> mount: wrong fs type, bad option, bad superblock on ...
> 
> 4. dmesg | tail 
> Outputs the following: 
> 
> 
> [17755.840916] sd 3:0:0:0: [sda] tag#0 FAILED Result:
> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> [17755.840919] sd 3:0:0:0: [sda] tag#0 CDB: Read(10) 28 00 00 07 c0
> 02 00 00 02 00
> [17755.840921] blk_update_request: I/O error, dev sda, sector 507906
> [17755.840941] EXT4-fs (dm-4): unable to read superblock
> [18140.052300] sd 3:0:0:0: [sda] tag#0 FAILED Result:
> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> [18140.052305] sd 3:0:0:0: [sda] tag#0 CDB: ATA command pass
> through(16) 85 06 20 00 00 00 00 00 00 00 00 00 00 00 e5 00
> [18142.991851] sd 3:0:0:0: [sda] tag#0 FAILED Result:
> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> [18142.991856] sd 3:0:0:0: [sda] tag#0 CDB: Read(10) 28 00 00 07 c0
> 80 00 00 08 00
> [18142.991860] blk_update_request: I/O error, dev sda, sector 508032
> [18142.991869] Buffer I/O error on dev dm-4, logical block 16, async
> page read

This points at hardware level failure, regardless if the disk would hold
a btrfs or ext4 or whatever other kind of filesystem. The disk itself
cannot read data back when we ask for it.
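
To confirm that, I'd look at the drive directly, outside of
LUKS/LVM/btrfs (device name taken from your dmesg output above):

  # smartctl -a /dev/sda
  # dd if=/dev/sda of=/dev/null bs=1M count=1024 status=progress

If SMART shows reallocated/pending sectors, or the dd aborts with I/O
errors, then no filesystem level tool is going to help here.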

> 4. 
> 
> # btrfs restore -i -D /dev/mapper/foo--vg-root /dev/null
> No valid Btrfs found on /dev/mapper/foo--vg-root
> Could not open root, trying backup super
> No valid Btrfs found on /dev/mapper/foo--vg-root
> Could not open root, trying backup super
> No valid Btrfs found on /dev/mapper/foo--vg-root
> Could not open root, trying backup super
> 
> We are pretty sure that no unexpected electric cuts has been happened. 
> 
> At this point I don't know what information I should supply. 
> 


-- 
Hans van Kranenburg


Re: python-btrfs & btrfs-heatmap ebuilds now available for Gentoo

2018-07-23 Thread Hans van Kranenburg
On 07/23/2018 04:25 PM, Holger Hoffstätte wrote:
> 
> I wanted to migrate my collection of questionable shell scripts for btrfs
> maintenance/inspection to a more stable foundation and therefore created
> Gentoo ebuilds for Hans van Kranenburg's excellent python-btrfs [1] and
> btrfs-heatmap [2] packages.
> 
> They can be found in my overlay at:
> 
> https://github.com/hhoffstaette/portage/tree/master/sys-fs/python-btrfs
> https://github.com/hhoffstaette/portage/tree/master/sys-fs/btrfs-heatmap

Thanks for doing this!

And now you're making us curious about those
not-so-questionable-any-more scripts. :D

> Both are curently only available for python-3.6 since a change in 3.7
> broke the existing python API [3]. As soon as that is fixed I'll add 3.7
> to the PYTHON_COMPAT entries.

Yes, thanks for testing, I'll have a look at it and will probably also
ask you to help testing the fix.

> btrfs-heatmap properly uses the python-exec wrapper and therefore works
> regardless of currently selected default python version.  :)
> 
> I hope this is useful to someone.

I bet it will. ;-]

-- 
Hans van Kranenburg


Re: btrfs balance did not progress after 12H, hang on reboot, btrfs check --repair kills the system still

2018-06-25 Thread Hans van Kranenburg
extent [147895111680 12345344]
> ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, 
> owner: 374857, offset: 192200704) wanted: 172, have: 251
> Delete backref in extent [147895111680 12345344]
> ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, 
> owner: 374857, offset: 217653248) wanted: 348, have: 418
> Delete backref in extent [150850146304 17522688]
> ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, 
> owner: 374857, offset: 235175936) wanted: 555, have: 1449
> Deleted root 2 item[156909494272, 178, 5476627808561673095]
> ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, 
> owner: 374857, offset: 235175936) wanted: 556, have: 1452
> Deleted root 2 item[156909494272, 178, 7338474132555182983]
> 
> At the rate it's going, it'll probably take days though, it's already been 36H

-- 
Hans van Kranenburg


Re: [PATCH] btrfs: Add more details while checking tree block

2018-06-22 Thread Hans van Kranenburg
On 06/22/2018 06:25 PM, Nikolay Borisov wrote:
> 
> 
> On 22.06.2018 19:17, Su Yue wrote:
>>
>>
>>
>>> Sent: Friday, June 22, 2018 at 11:26 PM
>>> From: "Hans van Kranenburg" 
>>> To: "Nikolay Borisov" , "Su Yue" 
>>> , linux-btrfs@vger.kernel.org
>>> Subject: Re: [PATCH] btrfs: Add more details while checking tree block
>>>
>>> On 06/22/2018 01:48 PM, Nikolay Borisov wrote:
>>>>
>>>>
>>>> On 22.06.2018 04:52, Su Yue wrote:
>>>>> For easier debug, print eb->start if level is invalid.
>>>>> Also make print clear if bytenr found is not expected.
>>>>>
>>>>> Signed-off-by: Su Yue 
>>>>> ---
>>>>>  fs/btrfs/disk-io.c | 8 
>>>>>  1 file changed, 4 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>>>> index c3504b4d281b..a90dab84f41b 100644
>>>>> --- a/fs/btrfs/disk-io.c
>>>>> +++ b/fs/btrfs/disk-io.c
>>>>> @@ -615,8 +615,8 @@ static int btree_readpage_end_io_hook(struct 
>>>>> btrfs_io_bio *io_bio,
>>>>>  
>>>>>   found_start = btrfs_header_bytenr(eb);
>>>>>   if (found_start != eb->start) {
>>>>> - btrfs_err_rl(fs_info, "bad tree block start %llu %llu",
>>>>> -  found_start, eb->start);
>>>>> + btrfs_err_rl(fs_info, "bad tree block start want %llu have 
>>>>> %llu",
>>>>
>>>> nit: I'd rather have the want/have in brackets (want %llu have% llu)
>>>
>>> From a user support point of view, this text should really be improved.
>>> There are a few places where 'want' and 'have' are reported in error
>>> strings, and it's totally unclear what they mean.
>>>
>> Yes. The strings are too concise for users to understand errors.
>> Developers always prefer to avoid "unnecessary" words in logs.
>>
>>> Intuitively I'd say when checking a csum, the "want" would be what's on
>>> disk now, since you want that to be correct, and the "have" would be
>>> what you have calculated, but it's actually the other way round, or
>>> wasn't it? Or was it?
>> For csum, you are right at all.
>>
>> In this situation, IIRC bytenr of "have" is read from disk.
>> Bytenr of "want" is assigned while constructing eb, from parent ptr
>> , root item and other things which are also read from disk.
>>>
>>> Every time someone pastes such a message when we help on IRC for
>>> example, there's confusion, and I have to look up the source again,
>>> because I always forget.
>>
>> Me too.
>>>
>>> What about (%llu stored on disk, %llu calculated now) or something similar?
> 
> Wha tabout expected got

No, that's just as horrible as want and found.

Did you 'expect' the same checksum to be on disk as what you 'get'
when calculating one? Or did you 'expect' the same thing after
calculation as what you 'got' when reading from disk?

/o\

>>>
>> "(%llu stored on disk " part sounds good to me. But I don't know how to 
>> describe the second part. May the good sleep help me.
>>
>> Thanks,
>> Su
>>>>
>>>>> +  eb->start, found_start);
>>>>>   ret = -EIO;
>>>>>   goto err;
>>>>>   }
>>>>> @@ -628,8 +628,8 @@ static int btree_readpage_end_io_hook(struct 
>>>>> btrfs_io_bio *io_bio,
>>>>>   }
>>>>>   found_level = btrfs_header_level(eb);
>>>>>   if (found_level >= BTRFS_MAX_LEVEL) {
>>>>> - btrfs_err(fs_info, "bad tree block level %d",
>>>>> -   (int)btrfs_header_level(eb));
>>>>> + btrfs_err(fs_info, "bad tree block level %d on %llu",
>>>>> +   (int)btrfs_header_level(eb), eb->start);
>>>>>   ret = -EIO;
>>>>>   goto err;
>>>>>   }
>>>>>
>>>
>>> -- 
>>> Hans van Kranenburg
>>>
>>


-- 
Hans van Kranenburg


Re: [PATCH] btrfs: Add more details while checking tree block

2018-06-22 Thread Hans van Kranenburg
On 06/22/2018 01:48 PM, Nikolay Borisov wrote:
> 
> 
> On 22.06.2018 04:52, Su Yue wrote:
>> For easier debug, print eb->start if level is invalid.
>> Also make print clear if bytenr found is not expected.
>>
>> Signed-off-by: Su Yue 
>> ---
>>  fs/btrfs/disk-io.c | 8 
>>  1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index c3504b4d281b..a90dab84f41b 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -615,8 +615,8 @@ static int btree_readpage_end_io_hook(struct 
>> btrfs_io_bio *io_bio,
>>  
>>  found_start = btrfs_header_bytenr(eb);
>>  if (found_start != eb->start) {
>> -btrfs_err_rl(fs_info, "bad tree block start %llu %llu",
>> - found_start, eb->start);
>> +btrfs_err_rl(fs_info, "bad tree block start want %llu have 
>> %llu",
> 
> nit: I'd rather have the want/have in brackets (want %llu have% llu)

From a user support point of view, this text should really be improved.
There are a few places where 'want' and 'have' are reported in error
strings, and it's totally unclear what they mean.

Intuitively I'd say when checking a csum, the "want" would be what's on
disk now, since you want that to be correct, and the "have" would be
what you have calculated, but it's actually the other way round, or
wasn't it? Or was it?

Every time someone pastes such a message when we help on IRC for
example, there's confusion, and I have to look up the source again,
because I always forget.

What about (%llu stored on disk, %llu calculated now) or something similar?

> 
>> + eb->start, found_start);
>>  ret = -EIO;
>>  goto err;
>>  }
>> @@ -628,8 +628,8 @@ static int btree_readpage_end_io_hook(struct 
>> btrfs_io_bio *io_bio,
>>  }
>>  found_level = btrfs_header_level(eb);
>>  if (found_level >= BTRFS_MAX_LEVEL) {
>> -btrfs_err(fs_info, "bad tree block level %d",
>> -  (int)btrfs_header_level(eb));
>> +btrfs_err(fs_info, "bad tree block level %d on %llu",
>> +  (int)btrfs_header_level(eb), eb->start);
>>  ret = -EIO;
>>  goto err;
>>  }
>>

-- 
Hans van Kranenburg


Re: About more loose parameter sequence requirement

2018-06-19 Thread Hans van Kranenburg
On 06/18/2018 03:05 PM, Qu Wenruo wrote:
> 
> 
> On 2018年06月18日 20:02, David Sterba wrote:
>> On Mon, Jun 18, 2018 at 07:40:44PM +0800, Qu Wenruo wrote:
>>>
>>>
>>> On 2018年06月18日 19:34, David Sterba wrote:
>>>> On Thu, Jun 14, 2018 at 03:17:45PM +0800, Qu Wenruo wrote:
>>>>> I understand that btrfs-progs introduced restrict parameter/option order
>>>>> to distinguish global and sub-command parameter/option.
>>>>>
>>>>> However it's really annoying if one just want to append some new options
>>>>> to previous command:
>>>>>
>>>>> E.g.
>>>>> # btrfs check /dev/data/btrfs
>>>>> # !! --check-data-csum
>>>>>
>>>>> The last command will fail as current btrfs-progs doesn't allow any
>>>>> option after parameter.
>>>>>
>>>>>
>>>>> Despite the requirement to distinguish global and subcommand
>>>>> option/parameter, is there any other requirement for such restrict
>>>>> option-first-parameter-last policy?
>>>>
>>>> I'd say that it's a common and recommended pattern. Getopt is able to
>>>> reorder the parameters so mixed options and non-options are accepted,
>>>> unless POSIXLY_CORRECT (see man getopt(3)) is not set. With the more
>>>> strict requirement, 'btrfs' option parser works the same regardless of
>>>> that.
>>>
>>> But currently it doesn't work.
>>> Just as the example, btrfs check /dev/data/btrfs --check-data-sum won't
>>> work.
>>> It's different from a lot of other common programs.
>>
>> With POSIXLY_CORRECT set, it expectedly won't work. With POSIXLY_CORRECT
>> unset, it would work in general, but not for 'btrfs'.
>>
>> As this is per-application decision I find it ok, besides that I also
>> find the 'options anywhere' pattern bad. Does it really save you that
>> much time while typing the commands with new arguments?
> 
> Personally speaking, yes.
> 
>> There are
>> movement shortcuts to jump by words, you get the previous command by
>> up-arrow. About the same number of keystrokes.
> 
> Not really, normally just "!! ", where I don't even need to
> move my fingers to arrow keys.
> (I'm all ears about extra bash shortcuts on this)

It's called `set -o vi` ;-]

If you remap capslock to be escape, there's not much moving around of
fingers going on. CAPS-k-b-b-a and type your new option. No reaching to
shift and 1 necessary at all.

:]
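
For the record, that's just:

  $ set -o vi                                  <- current bash session
  $ echo 'set editing-mode vi' >> ~/.inputrc   <- all readline programs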

>>
>> New code needs to be tested, documented and maintained, that's the cost
>> I find too high for something that's convenience for people who are used
>> to some shell builtins.
> 
> The biggest problem is, the behavior isn't even consistent across
> btrfs-progs.
> mkfs.btrfs accept such out-of-order parameters while btrfs not.
> 
> And most common tools, like commands provided by coretutil, they don't
> care about the order.
> The only daily exception is 'scp', which I found it pretty unhandy.
> 
> And just as Paul and Hugo, I think there are quite some users preferring
> out-of-order parameter/options.
> 
> 
> I also understand the maintenance burden, but at least let's try if
> there are better solution for this.
> 
> Thanks,
> Qu
> 


-- 
Hans van Kranenburg


Re: BTRFS losing SE Linux labels on power failure or "reboot -nffd"

2018-06-04 Thread Hans van Kranenburg
Hi,

On 06/04/2018 03:14 PM, Russell Coker wrote:
> The command "reboot -nffd" (kernel reboot without flushing kernel buffers or 
> writing status) when run on a BTRFS system with SE Linux will often result in 
> /var/log/audit/audit.log being unlabeled.

This recent fix might be what you're looking for:

https://www.spinics.net/lists/linux-btrfs/msg77927.html

>  It also results in some systemd-
> journald files like /var/log/journal/c195779d29154ed8bcb4e8444c4a1728/
> system.journal being unlabeled but that is rarer.  I think that the same 
> problem afflicts both systemd-journald and auditd but it's a race condition 
> that on my systems (both production and test) is more likely to affect auditd.
> 
> root@stretch:/# xattr -l /var/log/audit/audit.log 
> security.selinux: 
>    73 79 73 74 65 6D 5F 75 3A 6F 62 6A 65 63 74 5Fsystem_u:object_ 
> 0010   72 3A 61 75 64 69 74 64 5F 6C 6F 67 5F 74 3A 73r:auditd_log_t:s 
> 0020   30 00  0.
> 
> SE Linux uses the xattr "security.selinux", you can see what it's doing with 
> xattr(1) but generally using "ls -Z" is easiest.
> 
> If this issue just affected "reboot -nffd" then a solution might be to just 
> not run that command.  However this affects systems after a power outage.
>  
> I have reproduced this bug with kernel 4.9.0-6-amd64 (the latest security 
> update for Debian/Stretch which is the latest supported release of Debian).  
> I 
> have also reproduced it in an identical manner with kernel 4.16.0-1-amd64 
> (the 
> latest from Debian/Unstable).  For testing I reproduced this with a 4G 
> filesystem in a VM, but in production it has happened on BTRFS RAID-1 arrays, 
> both SSD and HDD.
>  
> #!/bin/bash 
> set -e 
> COUNT=$(ps aux|grep [s]bin/auditd|wc -l) 
> date 
> if [ "$COUNT" = "1" ]; then 
>  echo "all good" 
> else 
>  echo "failed" 
>  exit 1 
> fi
> 
> Firstly the above is the script /usr/local/sbin/testit, I test for auditd 
> running because it aborts if the context on it's log file is wrong.  When SE 
> Linux is in enforcing mode an incorrect/missing label on the audit.log file 
> causes auditd to abort.
>  
> root@stretch:~# ls -liZ /var/log/audit/audit.log 
> 37952 -rw---. 1 root root system_u:object_r:auditd_log_t:s0 4385230 Jun  
> 1 
> 12:23 /var/log/audit/audit.log
> Above is before I do the tests.
>  
> while ssh stretch /usr/local/sbin/testit ; do 
>  ssh btrfs-local "reboot -nffd" > /dev/null 2>&1 & 
>  sleep 20 
> done
> Above is the shell code I run to do the tests.  Note that the VM in question 
> runs on SSD storage which is why it can consistently boot in less than 20 
> seconds.
>  
> Fri  1 Jun 12:26:13 UTC 2018 
> all good 
> Fri  1 Jun 12:26:33 UTC 2018 
> failed
> Above is the output from the shell code in question.  After the first reboot 
> it fails.  The probability of failure on my test system is greater than 50%.
>  
> root@stretch:~# ls -liZ /var/log/audit/audit.log  
> 37952 -rw---. 1 root root system_u:object_r:unlabeled_t:s0 4396803 Jun  1 
> 12:26 /var/log/audit/audit.log
> Now the result.  Note that the Inode has not changed.  I could understand a 
> newly created file missing an xattr, but this is an existing file which 
> shouldn't have had it's xattr changed.  But somehow it gets corrupted.
>  
> The first possibility I considered was that SE Linux code might be at fault.  
> I asked on the SE Linux mailing list (I haven't been involved in SE Linux 
> kernel code for about 15 years) and was informed that this isn't likely at 
> all.  There have been no problems like this reported with other filesystems.
>  
> Does anyone have any ideas of other tests I should run?  Anyone want me to 
> try 
> a different kernel?  I can give root on a VM to anyone who wants to poke at 
> it.
> 


-- 
Hans van Kranenburg


[PATCH] btrfs-progs: fi-usage: fix RAID10 raw disk usage

2018-06-02 Thread Hans van Kranenburg
In case of RAID10, fi usage is reporting half the amount of allocated
space and twice the amount of unallocated disk space. Let's fix this.

For example, a RAID10 chunk of 3GiB with num_stripes 6, half of the
stripes (dev extents) are a mirror of the other half, so the 3GiB of
actual data has to be divided by 3, and not 6 to get the size of each of
those device extents.

Signed-off-by: Hans van Kranenburg 
---
 cmds-fi-usage.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cmds-fi-usage.c b/cmds-fi-usage.c
index b9a2b1c8..3bd2ccdf 100644
--- a/cmds-fi-usage.c
+++ b/cmds-fi-usage.c
@@ -660,7 +660,7 @@ static u64 calc_chunk_size(struct chunk_info *ci)
else if (ci->type & BTRFS_BLOCK_GROUP_RAID6)
return ci->size / (ci->num_stripes -2);
else if (ci->type & BTRFS_BLOCK_GROUP_RAID10)
-   return ci->size / ci->num_stripes;
+   return ci->size / (ci->num_stripes / 2);
return ci->size;
 }
 
-- 
2.17.1
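
As a quick sanity check of the example from the commit message (plain
shell arithmetic, 3GiB = 3072MiB, num_stripes = 6):

  $ echo $(( 3072 / (6 / 2) ))   <- 1024 MiB per dev extent, with this fix
  $ echo $(( 3072 / 6 ))         <- 512 MiB, what the old calculation gave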



Re: off-by-one uncompressed invalid ram_bytes corruptions

2018-05-29 Thread Hans van Kranenburg
On 05/21/2018 03:07 AM, Qu Wenruo wrote:
> 
> [...]
> 
> And the problem of wrong "root=" output also makes me pretty curious.

Yeah, there are a bunch of error messages in the kernel which always say
root=1, regardless of the actual root the tree block is in.

I think I once tried to find out why, but apparently it wasn't a really
obvious error somewhere.

-- 
Hans van Kranenburg


Re: How to change/fix 'Received UUID'

2018-03-06 Thread Hans van Kranenburg
On 05/03/2018 20:47, Marc MERLIN wrote:
> On Mon, Mar 05, 2018 at 10:38:16PM +0300, Andrei Borzenkov wrote:
>>> If I absolutely know that the data is the same on both sides, how do I
>>> either
>>> 1) force back in a 'Received UUID' value on the destination
>>
>> I suppose the most simple is to write small program that does it using
>> BTRFS_IOC_SET_RECEIVED_SUBVOL.
> 
> Understdood.
> Given that I have not worked with the code at all, what is the best 
> tool in btrfs progs, to add this to?
> 
> btrfstune?
> btrfs propery set?
> other?
> 
> David, is this something you'd be willing to add support for?
> (to be honest, it'll be quicker for someone who knows the code to add than
> for me, but if no one has the time, I'l see if I can have a shot at it)

If you want something right now that works, so you can continue doing
your backups, python-btrfs also has the ioctl, since v9, together with
an example of using it:

https://github.com/knorrie/python-btrfs/commit/1ace623f95300ecf581b1182780fd6432a46b24d

P.S. even when coding it in C, the documentation in that commit message
might be useful. :)
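
A rough sketch of how that fits into the workflow (paths made up): after
setting the Received UUID on the destination snapshot with that ioctl,
verify it and resume incremental sends as usual:

  # btrfs subvolume show /backup/snap-20180305 | grep -i 'received uuid'
  # btrfs send -p /data/snap-20180305 /data/snap-20180306 | \
        btrfs receive /backup/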

fwiw,
Hans


Re: incremental send/receive ERROR: cannot find parent subvolume

2018-02-26 Thread Hans van Kranenburg
On 02/26/2018 11:54 PM, Emil.s wrote:
> Hello,
> 
> I'm trying to restore a subvolume from a backup, but I'm failing when
> I try to setup the replication chain again.
> 
> Previously I had disk A and B, where I was sending snapshots from A to
> B using "send -c /disk_a/1 /disk_a/2 | receive /disk_b" and so on.
> Now disk A failed, so I got a new disk: C.
> 
> First I sent the last backup from B to C by running "send disk_b/2 |
> receive /disk_c/".

I can't read all of it right now, my time is a bit limited, but I have a
quick suggestion to at least make the original goal work, I think.

The list of snapshots that were on your A was organized differently than
the list with the same numbers 1, 2, 3 on B.

On A, you always snapshot the main r/w subvolume.

A
\_ A1
\_ A2
\_ A3

On B, every time you receive based on a previous one, it snapshots the
previous one and then puts changes on top:

B1 (received from A1)
\_ B2 (snap from B1, with changes from A1->A2)
   \_ B3 (snap from B2, with changes from A2->A3)
  \_ etc...

If A is gone, and you have new empty C:

> Then my plan was to use disk B as the new main disk, so on disk B, I
> created a new snapshot by running "snapshot -r disk_b/2
> disk_b/my_volume".
> Now I want to send "disk_b/my_volume" to disk C, so that I can set
> disk_b/2 to RW and start using it again, but the last send fails with
> "ERROR: cannot find parent subvolume".

Do it like this:

-# btrfs snap B2 my_volume (so no -r)

now my_volume is a writable clone of B2

-# btrfs snap -r my_volume my_volume1

now you start a new line of snapshots where my_volume has the same role
as A before.

my_volume
\_ my_volume1
\_ my_volume2
\_ my_volume3

Now you can send my_volume3 -p my_volume2 etc...

Then you discard all old Bx whenever you want.
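
Spelled out with full commands, one way to do it (just a sketch, adjust
the paths to your actual layout):

  # btrfs subvolume snapshot /disk_b/2 /disk_b/my_volume      <- writable, no -r
  # btrfs subvolume snapshot -r /disk_b/my_volume /disk_b/my_volume1
  # btrfs send /disk_b/my_volume1 | btrfs receive /disk_c/    <- full send once
  ...later...
  # btrfs subvolume snapshot -r /disk_b/my_volume /disk_b/my_volume2
  # btrfs send -p /disk_b/my_volume1 /disk_b/my_volume2 | btrfs receive /disk_c/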

In the following, disk_b/2 and your other example names are suddenly
replaced by real names, which makes it a bit hard to follow quickly.

> Disk B:
> root@tillberga: /backup #> btrfs subvol sh snapshots/user_data_2018-02-17/
> snapshots/user_data_2018-02-17
> Name: user_data_2018-02-17
> UUID: 93433261-d954-9f4b-8319-d450ae079a9a
> Parent UUID: 51180286-c202-c94c-b8f9-2ecc8d2b5b7c
> Received UUID: 014fc004-ae04-0148-9525-1bf556fd4d10
> Flags: readonly
> Snapshot(s):
> user_data_test
> 
> Disk C:
> root@hedemora: /btrfs #> btrfs subvol sh user_data_2018-02-17/
> user_data_2018-02-17
> Name: user_data_2018-02-17
> UUID: 35361429-a42c-594b-b390-51ffb9725324
> Parent UUID: -
> Received UUID: 93433261-d954-9f4b-8319-d450ae079a9a
> Flags: readonly
> Snapshot(s):
> 
> Disk B has UUID 93433261-d954-9f4b-8319-d450ae079a9a, and disk C has
> "Received UUID: 93433261-d954-9f4b-8319-d450ae079a9a". I think that's
> alright?
> 
> The new snapshot on disk B looks as following:
> root@tillberga: /backup #> btrfs subvol sh user_data_test
> user_data_test
> Name: user_data_test
> UUID: bf88000c-e3a6-434b-8f69-f5ce2174227e
> Parent UUID: 93433261-d954-9f4b-8319-d450ae079a9a
> Received UUID: 014fc004-ae04-0148-9525-1bf556fd4d10
> Flags: readonly
> Snapshot(s):
> 
> But when I'm trying to send it, I'm getting the following:
> root@tillberga: /backup #> btrfs send -v -c
> snapshots/user_data_2018-02-17 user_data_test | ssh user@hedemora
> "sudo btrfs receive -v /btrfs/"
> At subvol user_data_test
> BTRFS_IOC_SEND returned 0
> joining genl thread
> receiving snapshot user_data_test
> uuid=014fc004-ae04-0148-9525-1bf556fd4d10, ctransid=52373
> parent_uuid=014fc004-ae04-0148-9525-1bf556fd4d10,
> parent_ctransid=52373
> At snapshot user_data_test
> ERROR: cannot find parent subvolume
> 
> Note that the receiver says
> "parent_uuid=014fc004-ae04-0148-9525-1bf556fd4d10". Not really sure
> where that comes from, but disk B has the same, so maybe that's the
> UUID of the original snapshot on disk A?
> 
> Is it possible to continue to send incremental snapshots between these
> two file systems, or must I do a full sync?


-- 
Hans van Kranenburg


Re: [PATCH] btrfs: Add nossd_spread mount option

2018-02-21 Thread Hans van Kranenburg
On 02/22/2018 12:31 AM, Howard McLauchlan wrote:
> Btrfs has two mount options for SSD optimizations: ssd and ssd_spread.
> Presently there is an option to disable all SSD optimizations, but there
> isn't an option to disable just ssd_spread.
> 
> This patch adds a mount option nossd_spread that disables ssd_spread
> only.
> 
> Signed-off-by: Howard McLauchlan <hmclauch...@fb.com>
> ---
>  fs/btrfs/super.c | 13 +
>  1 file changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 6e71a2a78363..4c0fcf5b3e7e 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -310,10 +310,10 @@ static void btrfs_put_super(struct super_block *sb)
>  enum {
>   Opt_degraded, Opt_subvol, Opt_subvolid, Opt_device, Opt_nodatasum,
>   Opt_nodatacow, Opt_max_inline, Opt_alloc_start, Opt_nobarrier, Opt_ssd,
> - Opt_nossd, Opt_ssd_spread, Opt_thread_pool, Opt_noacl, Opt_compress,
> - Opt_compress_type, Opt_compress_force, Opt_compress_force_type,
> - Opt_notreelog, Opt_ratio, Opt_flushoncommit, Opt_discard,
> - Opt_space_cache, Opt_space_cache_version, Opt_clear_cache,
> + Opt_nossd, Opt_ssd_spread, Opt_nossd_spread, Opt_thread_pool, Opt_noacl,
> + Opt_compress, Opt_compress_type, Opt_compress_force,
> + Opt_compress_force_type, Opt_notreelog, Opt_ratio, Opt_flushoncommit,
> + Opt_discard, Opt_space_cache, Opt_space_cache_version, Opt_clear_cache,
>   Opt_user_subvol_rm_allowed, Opt_enospc_debug, Opt_subvolrootid,
>   Opt_defrag, Opt_inode_cache, Opt_no_space_cache, Opt_recovery,
>   Opt_skip_balance, Opt_check_integrity,
> @@ -353,6 +353,7 @@ static const match_table_t tokens = {
>   {Opt_ssd, "ssd"},
>   {Opt_ssd_spread, "ssd_spread"},
>   {Opt_nossd, "nossd"},
> + {Opt_nossd_spread, "nossd_spread"},
>   {Opt_acl, "acl"},
>   {Opt_noacl, "noacl"},
>   {Opt_notreelog, "notreelog"},

.oO(Why doesn't the enum just have one option per line, so the changelog
is less invasive?)

> @@ -582,6 +583,10 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char 
> *options,
>   btrfs_clear_and_info(info, SSD_SPREAD,
>"not using spread ssd allocation 
> scheme");
>   break;
> + case Opt_nossd_spread:
> + btrfs_clear_and_info(info, SSD_SPREAD,
> +  "not using spread ssd allocation 
> scheme");
> + break;
>   case Opt_barrier:
>   btrfs_clear_and_info(info, NOBARRIER,
>"turning on barriers");
> 

Related:
* https://www.spinics.net/lists/linux-btrfs/msg64247.html
* https://www.spinics.net/lists/linux-btrfs/msg64277.html
* https://www.spinics.net/lists/linux-btrfs/msg64499.html

Apparently that discussion never resulted in actual changes, so thanks
for continuing it now.

The mount options are a bit weird, because ssd_spread also includes ssd,
but doesn't show it.

Personally, I don't like any of them. I should really finish and send my
proposal to get them replaced by options that choose the extent allocator
for data and metadata individually (instead of one setting that changes
both at the same time), because there are real life situations that e.g.
benefit from the nossd style 'tetris' data allocator combined with the
ssd_spread style 'contiguous' metadata extent allocator.

-- 
Hans van Kranenburg


Re: Status of FST and mount times

2018-02-21 Thread Hans van Kranenburg
On 02/21/2018 04:19 PM, Ellis H. Wilson III wrote:
> On 02/21/2018 10:03 AM, Hans van Kranenburg wrote:
>> On 02/21/2018 03:49 PM, Ellis H. Wilson III wrote:
>>> On 02/20/2018 08:49 PM, Qu Wenruo wrote:
>>>> My suggestion is to use balance to reduce number of block groups, so we
>>>> could do less search at mount time.
>>>>
>>>> It's more like reason 2.
>>>>
>>>> But it only works for case where there are a lot of fragments so a lot
>>>> of chunks are not fully utilized.
>>>> Unfortunately, that's not the case for OP, so my suggestion doesn't
>>>> make
>>>> sense here.
>>>
>>> I ran the balance all the same, and the number of chunks has not
>>> changed.  Before 3454, and after 3454:
>>>   $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
>>> 3454
>>>
>>> HOWEVER, the time to mount has gone up somewhat significantly, from
>>> 11.537s to 16.553s, which was very unexpected.  Output from previously
>>> run commands shows the extent tree metadata grew about 25% due to the
>>> balance.  Everything else stayed roughly the same, and no additional
>>> data was added to the system (nor snapshots taken, nor additional
>>> volumes added, etc):
>>>
>>> Before balance:
>>> $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/
>>> ROOT_TREE   1.14MiB 0(    72) 1( 1)
>>> EXTENT_TREE   644.27MiB 0( 41101) 1(   131) 2( 1)
>>> CHUNK_TREE    384.00KiB 0(    23) 1( 1)
>>> DEV_TREE  272.00KiB 0(    16) 1( 1)
>>> FS_TREE    11.55GiB 0(754442) 1(  2179) 2( 5) 3( 2)
>>> CSUM_TREE   3.50GiB 0(228593) 1(   791) 2( 2) 3( 1)
>>> QUOTA_TREE    0.00B
>>> UUID_TREE  16.00KiB 0( 1)
>>> FREE_SPACE_TREE   0.00B
>>> DATA_RELOC_TREE    16.00KiB 0( 1)
>>>
>>> After balance:
>>> $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/
>>> ROOT_TREE   1.16MiB 0(    73) 1( 1)
>>> EXTENT_TREE   806.50MiB 0( 51419) 1(   196) 2( 1)
>>> CHUNK_TREE    384.00KiB 0(    23) 1( 1)
>>> DEV_TREE  272.00KiB 0(    16) 1( 1)
>>> FS_TREE    11.55GiB 0(754442) 1(  2179) 2( 5) 3( 2)
>>> CSUM_TREE   3.49GiB 0(227920) 1(   804) 2( 2) 3( 1)
>>> QUOTA_TREE    0.00B
>>> UUID_TREE  16.00KiB 0( 1)
>>> FREE_SPACE_TREE   0.00B
>>> DATA_RELOC_TREE    16.00KiB 0( 1)
>>
>> Heu, interesting.
>>
>> What's the output of `btrfs fi df /mountpoint` and `grep btrfs
>> /proc/self/mounts` (does it contain 'ssd') and which kernel version is
>> this? (I get a bit lost in the many messages and subthreads in this
>> thread) I also can't find in the threads which command "the balance"
>> means.
> 
> Short recap:
> - I found long mount time for 1.65TB of home dir data at ~4s
> - Doubling this data on the same btrfs fs to 3.3TB increased mount time
> to 11s
> - Qu et. al. suggested balance might reduce chunks, which came in around
> 3400, and the chunk walk on mount was the driving factor in terms of time
> - I ran balance
> - Mount time went up to 16s, and all else remains the same except the
> extent tree.
> 
> $ sudo btrfs fi df /mnt/btrfs
> Data, single: total=3.32TiB, used=3.32TiB
> System, DUP: total=8.00MiB, used=384.00KiB
> Metadata, DUP: total=16.50GiB, used=15.82GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B

Ah, so allocated data space is 100% filled with data. That's very good
yes. And it explains why you can't lower the amount of chunks by
balancing. You're just moving around data and replacing full chunks with
new full chunks. :]
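
For reference, the case in which balance does help is when chunks are
only partially filled; then something like this repacks the data into
fewer chunks, which is not applicable here since everything is already
100% used:

  # btrfs balance start -dusage=50 /mnt/btrfs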

Doesn't explain why it blows up the size of the extent tree though. I
have no idea why that is.

> $ sudo grep btrfs /proc/self/mounts
> /dev/sdb /mnt/btrfs btrfs rw,relatime,space_cache,subvolid=5,subvol=/ 0 0

Ok, no 'ssd', good.

>  $ uname -a
> Linux  4.5.5-300.fc24.x86_64 #1 SMP Thu May 19 13:05:32 UTC 2016
> x86_64 x86_64 x86_64 GNU/Linux
> 
> I plan to rerun this on a newer kernel, but haven't had time to spin up
> another machine with a modern kernel yet, and this machine is also being
> used for other things right now so I can't just upgrade it.
> 
>> And what does this tell you?
>>
>> https://github.com/knorrie/python-btrfs/blob/develop/examples/show_free_space_fragmentation.py
>>
> 
> $ sudo ./show_free_space_fragmentation.py /mnt/btrfs
> No Free Space Tree (space_cache=v2) found!
> Falling back to using the extent tree to determine free space extents.
> vaddr 6529453391872 length 1073741824 used_pct 27 free space fragments 1
> score 0
> Skipped because of usage > 90%: 3397 chunks

Good.

-- 
Hans van Kranenburg


Re: Status of FST and mount times

2018-02-21 Thread Hans van Kranenburg
On 02/21/2018 03:49 PM, Ellis H. Wilson III wrote:
> On 02/20/2018 08:49 PM, Qu Wenruo wrote:
>>>>> On 2018年02月16日 22:12, Ellis H. Wilson III wrote:
>>>>>> $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
>>>>>> 3454
>>>>>
>> Increasing node size may reduce extent tree size. Although at most
>> reduce one level AFAIK.
>>
>> But considering that the higher the node is, the more chance it's
>> cached, reducing tree height wouldn't bring much performance impact
>> AFAIK.
>>
>> If one could do real world benchmark to beat or prove my assumption, it
>> would be much better though.
> 
> I'm willing to try this if you tell me exactly what you'd like me to do.
>  I've not mucked with nodesize before, so I'd like to avoid changing it
> to something absurd.
> 
>>> Qu's suggestion is actually independent of all the above reasons, but
>>> does kind of fit in with the fourth as another case of preventative
>>> maintenance.
>>
>> My suggestion is to use balance to reduce number of block groups, so we
>> could do less search at mount time.
>>
>> It's more like reason 2.
>>
>> But it only works for case where there are a lot of fragments so a lot
>> of chunks are not fully utilized.
>> Unfortunately, that's not the case for OP, so my suggestion doesn't make
>> sense here.
> 
> I ran the balance all the same, and the number of chunks has not
> changed.  Before 3454, and after 3454:
>  $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
> 3454
> 
> HOWEVER, the time to mount has gone up somewhat significantly, from
> 11.537s to 16.553s, which was very unexpected.  Output from previously
> run commands shows the extent tree metadata grew about 25% due to the
> balance.  Everything else stayed roughly the same, and no additional
> data was added to the system (nor snapshots taken, nor additional
> volumes added, etc):
> 
> Before balance:
> $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/
> ROOT_TREE   1.14MiB 0(    72) 1( 1)
> EXTENT_TREE   644.27MiB 0( 41101) 1(   131) 2( 1)
> CHUNK_TREE    384.00KiB 0(    23) 1( 1)
> DEV_TREE  272.00KiB 0(    16) 1( 1)
> FS_TREE    11.55GiB 0(754442) 1(  2179) 2( 5) 3( 2)
> CSUM_TREE   3.50GiB 0(228593) 1(   791) 2( 2) 3( 1)
> QUOTA_TREE    0.00B
> UUID_TREE  16.00KiB 0( 1)
> FREE_SPACE_TREE   0.00B
> DATA_RELOC_TREE    16.00KiB 0( 1)
> 
> After balance:
> $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/
> ROOT_TREE   1.16MiB 0(    73) 1( 1)
> EXTENT_TREE   806.50MiB 0( 51419) 1(   196) 2( 1)
> CHUNK_TREE    384.00KiB 0(    23) 1( 1)
> DEV_TREE  272.00KiB 0(    16) 1( 1)
> FS_TREE    11.55GiB 0(754442) 1(  2179) 2( 5) 3( 2)
> CSUM_TREE   3.49GiB 0(227920) 1(   804) 2( 2) 3( 1)
> QUOTA_TREE    0.00B
> UUID_TREE  16.00KiB 0( 1)
> FREE_SPACE_TREE   0.00B
> DATA_RELOC_TREE    16.00KiB 0( 1)

Heu, interesting.

What's the output of `btrfs fi df /mountpoint` and `grep btrfs
/proc/self/mounts` (does it contain 'ssd') and which kernel version is
this? (I get a bit lost in the many messages and subthreads in this
thread) I also can't find in the threads which command "the balance" means.

And what does this tell you?

https://github.com/knorrie/python-btrfs/blob/develop/examples/show_free_space_fragmentation.py

Just to make sure you're not pointlessly shovelling data around on a
filesystem that is already in bad shape.

>> BTW, if OP still wants to try something to possibly to reduce mount time
>> with same the fs, I could try some modification to current block group
>> iteration code to see if it makes sense.
> 
> I'm glad to try anything if it's helpful to improving BTRFS.  Just let
> me know.
> 
> Best,
> 
> ellis


-- 
Hans van Kranenburg


Re: Ongoing Btrfs stability issues

2018-02-17 Thread Hans van Kranenburg
On 02/17/2018 05:34 AM, Shehbaz Jaffer wrote:
>> It's hosted on an EBS volume; we don't use ephemeral storage at all. The EBS 
>> volumes are all SSD
> 
> I have recently done some SSD corruption experiments on small set of
> workloads, so I thought I would share my experience.
> 
> While creating btrfs using mkfs.btrfs command for SSDs, by default the
> metadata duplication option is disabled. this renders btrfs-scrubbing
> ineffective, as there are no redundant metadata to restore corrupted
> metadata from.
> So if there are any errors during read operation on SSD, unlike HDD
> where the corruptions would be handled by btrfs scrub on the fly while
> detecting checksum error, for SSD the read would fail as uncorrectable
> error.

First of all, the ssd mount option does not have anything to do with
having single or DUP metadata.

Well, both the things that happen by default (mkfs using single, mount
enabling the ssd option) are happening because of the lookup result on
the rotational flag, but that's all.
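
That flag is simply what the kernel exposes per block device:

  $ cat /sys/block/sda/queue/rotational   <- 0 means non-rotational, 1 rotational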

> Could you confirm if metadata DUP is enabled for your system by
> running the following cmd:
> 
> $btrfs fi df /mnt # mount is the mount point
> Data, single: total=8.00MiB, used=64.00KiB
> System, single: total=4.00MiB, used=16.00KiB
> Metadata, single: total=168.00MiB, used=112.00KiB
> GlobalReserve, single: total=16.00MiB, used=0.00B
> 
> If metadata is single in your case as well (and not DUP), that may be
> the problem for btrfs-scrub not working effectively on the fly
> (mid-stream bit-rot correction), causing reliability issues. A couple
> of such bugs that are observed specifically for SSDs is reported here:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=198463
> https://bugzilla.kernel.org/show_bug.cgi?id=198807

Here you show that when you have 'single' metadata, there's no copy to
recover from. This is expected.

Also, instead of physically damaging flash cells inside your SSD, you
are writing data to a perfectly working one. This is a different failure
scenario.

One of the reasons to turn off DUP for metadata by default on SSD is
(from man mkfs.btrfs):

"The controllers may put data written in a short timespan into the
same physical storage unit (cell, block etc). In case this unit dies,
both copies are lost. BTRFS does not add any artificial delay between
metadata writes." .. "The traditional rotational hard drives usually
fail at the sector level."

And, of course, in case of EBS, you don't have any idea at all where the
data actually ends up, since it's talking to a black box service, and
not an SSD.

In any case, using DUP instead of single obviously increases the chance
of recovering from failures that corrupt one copy of the data while it's
travelling between system memory and disk, since two copies are written
right after each other, so you're totally right that it's better to
enable it.

> These do not occur for HDD, and I believe should not occur when
> filesystem is mounted with nossd mode.

So to reiterate, mounting nossd does not make your metadata writes DUP.
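
For completeness: if you do want DUP metadata on an existing
single-device filesystem, converting it afterwards is just (sketch,
mountpoint made up):

  # btrfs balance start -mconvert=dup /mnt
  # btrfs fi df /mnt   <- Metadata should now show DUP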

> On Fri, Feb 16, 2018 at 10:03 PM, Duncan <1i5t5.dun...@cox.net> wrote:
>> Austin S. Hemmelgarn posted on Fri, 16 Feb 2018 14:44:07 -0500 as
>> excerpted:
>>
>>> This will probably sound like an odd question, but does BTRFS think your
>>> storage devices are SSD's or not?  Based on what you're saying, it
>>> sounds like you're running into issues resulting from the
>>> over-aggressive SSD 'optimizations' that were done by BTRFS until very
>>> recently.
>>>
>>> You can verify if this is what's causing your problems or not by either
>>> upgrading to a recent mainline kernel version (I know the changes are in
>>> 4.15, I don't remember for certain if they're in 4.14 or not, but I
>>> think they are), or by adding 'nossd' to your mount options, and then
>>> seeing if you still have the problems or not (I suspect this is only
>>> part of it, and thus changing this will reduce the issues, but not
>>> completely eliminate them).  Make sure and run a full balance after
>>> changing either item, as the aforementioned 'optimizations' have an
>>> impact on how data is organized on-disk (which is ultimately what causes
>>> the issues), so they will have a lingering effect if you don't balance
>>> everything.
>>
>> According to the wiki, 4.14 does indeed have the ssd changes.
>>
>> According to the bug, he's running 4.13.x on one server and 4.14.x on
>> two.  So upgrading the one to 4.14.x should mean all will have that fix.
>>
>> However, without a full balance it /will/ take some time to settle down
>> (again, assuming btrfs was using ssd mode), so the lingering effect could
>> still be creating problems on the 4.14 kernel servers for the moment.

-- 
Hans van Kranenburg


Re: Status of FST and mount times

2018-02-16 Thread Hans van Kranenburg
On 02/16/2018 03:12 PM, Ellis H. Wilson III wrote:
> On 02/15/2018 08:55 PM, Qu Wenruo wrote:
>> On 2018年02月16日 00:30, Ellis H. Wilson III wrote:
>>> Very helpful information.  Thank you Qu and Hans!
>>>
>>> I have about 1.7TB of homedir data newly rsync'd data on a single
>>> enterprise 7200rpm HDD and the following output for btrfs-debug:
>>>
>>> extent tree key (EXTENT_TREE ROOT_ITEM 0) 543384862720 level 2
>>> total bytes 6001175126016
>>> bytes used 1832557875200
>>>
>>> Hans' (very cool) tool reports:
>>> ROOT_TREE 624.00KiB 0(    38) 1( 1)
>>> EXTENT_TREE   327.31MiB 0( 20881) 1(    66) 2( 1)
>>
>> Extent tree is not so large, a little unexpected to see such slow mount.
>>
>> BTW, how many chunks do you have?
>>
>> It could be checked by:
>>
>> # btrfs-debug-tree -t chunk  | grep CHUNK_ITEM | wc -l
> 
> Since yesterday I've doubled the size by copying the homdir dataset in
> again.  Here are new stats:
> 
> extent tree key (EXTENT_TREE ROOT_ITEM 0) 385990656 level 2
> total bytes 6001175126016
> bytes used 3663525969920
> 
> $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
> 3454
> 
> $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/
> ROOT_TREE   1.14MiB 0(    72) 1( 1)
> EXTENT_TREE   644.27MiB 0( 41101) 1(   131) 2( 1)
> CHUNK_TREE    384.00KiB 0(    23) 1( 1)
> DEV_TREE  272.00KiB 0(    16) 1( 1)
> FS_TREE    11.55GiB 0(754442) 1(  2179) 2( 5) 3( 2)
> CSUM_TREE   3.50GiB 0(228593) 1(   791) 2( 2) 3( 1)
> QUOTA_TREE    0.00B
> UUID_TREE  16.00KiB 0( 1)
> FREE_SPACE_TREE   0.00B
> DATA_RELOC_TREE    16.00KiB 0( 1)
> 
> The old mean mount time was 4.319s.  It now takes 11.537s for the
> doubled dataset.  Again please realize this is on an old version of
> BTRFS (4.5.5), so perhaps newer ones will perform better, but I'd still
> like to understand this delay more.  Should I expect this to scale in
> this way all the way up to my proposed 60-80TB filesystem so long as the
> file size distribution stays roughly similar?  That would definitely be
> in terms of multiple minutes at that point.

Well, imagine you have a big tree (an actual real life tree outside) and
you need to pick things (e.g. apples) which are hanging everywhere.

So, what you need to do is climb the tree, climb on a branch all the way
to the end where the first apple is... climb back, climb up a bit, go
onto the next branch to the end for the next apple... etc etc

The bigger the tree is, the longer it keeps you busy, because the apples
will be semi-evenly distributed around the full tree, and they're always
hanging at the end of the branch. The speed with which you can climb
around (random read disk access IO speed for btrfs, because your disk
cache is empty when first mounting) determines how quickly you're done.

So, yes.
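
If you want to measure just that cost, timing a cold-cache mount is the
simplest way (using the same device/mountpoint as elsewhere in this
thread):

  # umount /mnt/btrfs
  # echo 3 > /proc/sys/vm/drop_caches
  # time mount /dev/sdb /mnt/btrfs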

>>> Taking 100 snapshots (no changes between snapshots however) of the above
>>> subvolume doesn't appear to impact mount/umount time.
>>
>> 100 unmodified snapshots won't affect mount time.
>>
>> It needs new extents, which can be created by overwriting extents in
>> snapshots.
>> So it won't really cause much difference if all these snapshots are all
>> unmodified.
> 
> Good to know, thanks!
> 
>>> Snapshot creation
>>> and deletion both operate at between 0.25s to 0.5s.
>>
>> IIRC snapshot deletion is delayed, so the real work doesn't happen when
>> "btrfs sub del" returns.
> 
> I was using btrfs sub del -C for the deletions, so I believe (if that
> command truly waits for the subvolume to be utterly gone) it captures
> the entirety of the snapshot.
> 
> Best,
> 
> ellis


-- 
Hans van Kranenburg


Re: Status of FST and mount times

2018-02-15 Thread Hans van Kranenburg
(  7876) 1(28) 2( 1)
CHUNK_TREE112.00KiB 0( 6) 1( 1)
DEV_TREE   80.00KiB 0( 4) 1( 1)
FS_TREE  1016.34MiB 0( 64113) 1(   881) 2(52)
CSUM_TREE 777.42MiB 0( 49571) 1(   183) 2( 1)
QUOTA_TREE0.00B
UUID_TREE  16.00KiB 0( 1)
FREE_SPACE_TREE   336.00KiB 0(20) 1( 1)
DATA_RELOC_TREE16.00KiB 0( 1)

> 
> Thanks,
> Qu
> 
>>
>>
>>>
>>> Note that I'm not sensitive to multi-second mount delays.  I am
>>> sensitive to multi-minute mount delays, hence why I'm bringing this up.
>>>
>>> FWIW: I am currently populating a machine we have with 6TB drives in it
>>> with real-world home dir data to see if I can replicate the mount issue.
>>>
>>> Thanks,
>>>
>>> ellis
> 


-- 
Hans van Kranenburg


Re: [PATCH] btrfs: alloc_chunk: fix DUP stripe size handling

2018-02-14 Thread Hans van Kranenburg
On 02/14/2018 03:49 PM, David Sterba wrote:
> On Mon, Feb 05, 2018 at 05:45:11PM +0100, Hans van Kranenburg wrote:
>> In case of using DUP, we search for enough unallocated disk space on a
>> device to hold two stripes.
>>
>> The devices_info[ndevs-1].max_avail that holds the amount of unallocated
>> space found is directly assigned to stripe_size, while it's actually
>> twice the stripe size.
>>
>> Later on in the code, an unconditional division of stripe_size by
>> dev_stripes corrects the value, but in the meantime there's a check to
>> see if the stripe_size does not exceed max_chunk_size. Since during this
>> check stripe_size is twice the amount as intended, the check will reduce
>> the stripe_size to max_chunk_size if the actual correct to be used
>> stripe_size is more than half the amount of max_chunk_size.
>>
>> The unconditional division later tries to correct stripe_size, but will
>> actually make sure we can't allocate more than half the max_chunk_size.
>>
>> Fix this by moving the division by dev_stripes before the max chunk size
>> check, so it always contains the right value, instead of putting a duct
>> tape division in further on to get it fixed again.
>>
>> Since in all other cases than DUP, dev_stripes is 1, this change only
>> affects DUP.
>>
>> Other attempts in the past were made to fix this:
>> * 37db63a400 "Btrfs: fix max chunk size check in chunk allocator" tried
>> to fix the same problem, but still resulted in part of the code acting
>> on a wrongly doubled stripe_size value.
>> * 86db25785a "Btrfs: fix max chunk size on raid5/6" unintentionally
>> broke this fix again.
>>
>> The real problem was already introduced with the rest of the code in
>> 73c5de0051.
>>
>> The user visible result however will be that the max chunk size for DUP
>> will suddenly double, while it's actually acting according to the limits
>> in the code again like it was 5 years ago.
>>
>> Reported-by: Naohiro Aota <naohiro.a...@wdc.com>
>> Link: https://www.spinics.net/lists/linux-btrfs/msg69752.html
>> Fixes: 73c5de0051 ("btrfs: quasi-round-robin for chunk allocation")
>> Fixes: 86db25785a ("Btrfs: fix max chunk size on raid5/6")
>> Signed-off-by: Hans van Kranenburg <hans.van.kranenb...@mendix.com>
>> Cc: Naohiro Aota <naohiro.a...@wdc.com>
>> Cc: Arne Jansen <sensi...@gmx.net>
>> Cc: Chris Mason <chris.ma...@fusionio.com>
> 
> I guess half of the addresses have bounced :) Have you used the
> get_maintainer.pl script?

Yes, indeed, :s I was copy/pasting addresses too quickly from old
commits which touched the relevant pieces of the code in the past and
only looked at the names. :|

And no, I didn't use get_maintainer.pl, thanks for pointing at it.

Instead of sending out even more spam when I saw it shortly after, I
waited for first feedback.

I corrected (I think) addresses in the To of this mail now. Can you fix
them up at the end of the commit message? Your other comments indicate
you're not asking for a v2.

> The fix is short, I had to read the allocator code again so it took me
> longer to review it. Your description in the changelog was really
> helpful.

Thanks. Yes, it's been a while since the other mail thread about this
was going on, and it didn't end up with a final patch yet.

I started digging into the issue back then because it matched some
behaviour I'd seen before: 512MB dup metadata chunks being replaced with
the same number of 1024MB chunks when doing a conversion to single,
which was a bit weird... And this explained it.

>> ---
>>  fs/btrfs/volumes.c | 4 +---
>>  1 file changed, 1 insertion(+), 3 deletions(-)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 4006b2a1233d..a50bd02b7ada 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -4737,7 +4737,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
>>   * the primary goal is to maximize the number of stripes, so use as many
>>   * devices as possible, even if the stripes are not maximum sized.
>>   */
>> -stripe_size = devices_info[ndevs-1].max_avail;
>> +stripe_size = div_u64(devices_info[ndevs-1].max_avail, dev_stripes);
> 
> I'll enhance the comment above with more explanation why do it here,

Because it  *is* the actual stripe_size. :) Feel free to enhance!

> otherwise consider this
> 
> Reviewed-by: David Sterba <dste...@suse.com>

Yay!

>>  num_stripes = ndevs * dev_stripes;
>>  
>>  /*
>> @@ -4772,8 +4772,6 @@ static int __btrfs_alloc_chunk(

Re: btrfs-cleaner / snapshot performance analysis

2018-02-12 Thread Hans van Kranenburg
On 02/12/2018 03:45 PM, Ellis H. Wilson III wrote:
> On 02/11/2018 01:03 PM, Hans van Kranenburg wrote:
>>> 3. I need to look at the code to understand the interplay between
>>> qgroups, snapshots, and foreground I/O performance as there isn't
>>> existing architecture documentation to point me to that covers this
>>
>> Well, the excellent write-up of Qu this morning shows some explanation
>> from the design point of view.
> 
> Sorry, I may have missed this email.  Or perhaps you are referring to a
> wiki or blog post of some kind I'm not following actively?  Either way,
> if you can forward me the link, I'd greatly appreciate it.

You are in the To: of it:

https://www.spinics.net/lists/linux-btrfs/msg74737.html

>> nocow only keeps the cows at a distance as long as you don't start
>> snapshotting (or cp --reflink) those files... If you take a snapshot,
>> then you force btrfs to keep the data around that is referenced by the
>> snapshot. So, that means that every next write will be cowed once again,
>> moo, so small writes will be redirected to a new location, causing
>> fragmentation again. The second and third write can go in the same (new)
>> location of the first new write, but as soon as you snapshot again, this
>> happens again.
> 
> Ah, very interesting.  Thank you for clarifying!

-- 
Hans van Kranenburg


Re: btrfs-cleaner / snapshot performance analysis

2018-02-11 Thread Hans van Kranenburg
On 02/11/2018 04:59 PM, Ellis H. Wilson III wrote:
> Thanks Tomasz,
> 
> Comments in-line:
> 
> On 02/10/2018 05:05 PM, Tomasz Pala wrote:
>> You won't have anything close to "accurate" in btrfs - quotas don't
>> include space wasted by fragmentation, which can end up allocating from
>> tens to thousands of times (sic!) more space than the files themselves.
>> Not in some worst-case scenarios, but in real life situations...
>> I got a 10 MB db-file which was eating 10 GB of space after a week of
>> regular updates - withOUT snapshotting it. All described here.
> 
> The underlying filesystem this is replacing was an in-house developed
> COW filesystem, so we're aware of the difficulties of fragmentation. I'm
> more interested in an approximate space consumed across snapshots when
> considering CoW.  I realize it will be approximate.  Approximate is ok
> for us -- no accounting for snapshot space consumed is not.

If your goal is to have an approximate idea for accounting, and you
don't need to be able to actually enforce limits, and if the filesystems
that you are using are as small as the 40GiB example you gave...

Why not just use `btrfs fi du <subvolumes and snapshots>` now and then
and update your administration with the results? Instead of putting the
burden of keeping track of all that accounting on every tiny change all
day long?
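
For example (the -s flag summarizes per argument; paths here are made
up):

  $ sudo btrfs fi du -s /mnt/data/subvol /mnt/data/.snapshots/*

That prints total, exclusive and "set shared" numbers per subvolume or
snapshot, which you can store in your accounting from a cron job instead
of having qgroups track every small change as it happens.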

> Also, I don't see the thread you mentioned.  Perhaps you forgot to
> mention it, or an html link didn't come through properly?
> 
>>> course) or how many subvolumes/snapshots there are.  If I know that
>>> above N snapshots per subvolume performance tanks by M%, I can apply
>>> limits on the use-case in the field, but I am not aware of those kinds
>>> of performance implications yet.
>>
>> It doesn't work like that. It all depends on the data that is subject
>> to snapshots, especially how it is updated - how exactly, including
>> write patterns.
>>
>> I think you expect answers that can't be formulated - with a filesystem
>> architecture as advanced as ZFS or btrfs, its behavior can't be reduced
>> to simple answers like 'keep fewer than N snapshots'.
> 
> I was using an extremely simple heuristic to drive at what I was looking
> to get out of this.  I should have been more explicit that the example
> was not to be taken literally.
> 
>> This is the one exception with an easy answer: btrfs doesn't handle
>> databases with CoW. Period. It doesn't matter if they are snapshotted or
>> not, ANY database files (systemd-journal, PostgreSQL, sqlite, db) are
>> not handled well at all. They slow down the entire system to the speed
>> of a cheap SD card.
> 
> I will keep this in mind, thank you.  We do have a higher level above
> BTRFS that stages data.  I will consider implementing an algorithm to
> add the nocow flag to the file if it has been written to sufficiently to
> indicate it will be a bad fit for the BTRFS COW algorithm.

Adding the nocow attribute to a file only works when it has just been
created and not yet written to, or when you set it on the containing
directory and let new files inherit it. You can't just turn it on for
existing files that already have content.

https://btrfs.wiki.kernel.org/index.php/FAQ#Can_copy-on-write_be_turned_off_for_data_blocks.3F
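
A minimal sketch of the working approach (directory and file names are
just examples): set the attribute on an empty directory, then copy the
existing file in as a fresh copy, so the new file inherits nocow:

  # mkdir /mnt/data/nocow
  # chattr +C /mnt/data/nocow
  # cp --reflink=never /mnt/data/big.db /mnt/data/nocow/big.db
  # lsattr /mnt/data/nocow/big.db

The --reflink=never matters: a reflink copy would still share the old
cow extents instead of writing fresh nocow ones.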

>> Actually, if you do not use compression and don't need checksums of data
>> blocks, you may want to mount all the btrfs with nocow by default.
>> This way the quotas would be more accurate (no fragmentation _between_
>> snapshots) and you'll have some decent performance with snapshots.
>> If that is all you care.
> 
> CoW is still valuable for us as we're shooting to support on the order
> of hundreds of snapshots per subvolume,

Hundreds will get you into trouble even without qgroups.

> and without it (if BTRFS COW
> works the same as our old COW FS) that's going to be quite expensive to
> keep snapshots around.  So some hybrid solution is required here.

-- 
Hans van Kranenburg


Re: btrfs-cleaner / snapshot performance analysis

2018-02-11 Thread Hans van Kranenburg
On 02/11/2018 05:15 PM, Ellis H. Wilson III wrote:
> Thanks Hans.  Sorry for the top-post, but I'm boiling things down here
> so I don't have a clear line-item to respond to.  The take-aways I see
> here to my original queries are:
> 
> 1. Nobody has done a thorough analysis of the impact of snapshot
> manipulation WITHOUT qgroups enabled on foreground I/O performance
> 2. Nobody has done a thorough analysis of the impact of snapshot
> manipulation WITH qgroups enabled on foreground I/O performance

It's more that there is no simple list of clear-cut answers that apply
to every possible situation and type/pattern of work that you can throw
at a btrfs filesystem.

> 3. I need to look at the code to understand the interplay between
> qgroups, snapshots, and foreground I/O performance as there isn't
> existing architecture documentation to point me to that covers this

Well, the excellent write-up of Qu this morning shows some explanation
from the design point of view.

> 4. I should be cautioned that CoW in BTRFS can exhibit pathological (if
> expected) capacity consumption for very random-write-oriented datasets
> with or without snapshots, and nocow (or in my case transparently
> absorbing and coalescing writes at a higher tier) is my friend.

nocow only keeps the cows at a distance as long as you don't start
snapshotting (or cp --reflink) those files... If you take a snapshot,
then you force btrfs to keep the data around that is referenced by the
snapshot. So, that means that every next write will be cowed once again,
moo, so small writes will be redirected to a new location, causing
fragmentation again. The second and third write can go in the same (new)
location of the first new write, but as soon as you snapshot again, this
happens again.
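
A rough way to watch this happen (paths made up, big.db assumed to be a
nocow file in subvolume /mnt/sv) is to look at the extent count with
filefrag before and after:

  # filefrag /mnt/sv/big.db                 # some small number of extents
  # btrfs subvolume snapshot /mnt/sv /mnt/sv_snap
  # dd if=/dev/urandom of=/mnt/sv/big.db bs=4k count=100 seek=1000 conv=notrunc
  # filefrag /mnt/sv/big.db                 # extent count goes up

Without the snapshot the small overwrites would have gone back into the
same blocks; with the snapshot holding on to the old data, they get
cowed to new locations.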

> 5. I should be cautioned that CoW is broken across snapshots when
> defragmentation is run.
> 
> I will update a test system to the most recent kernel and will perform
> tests to answer #1 and #2.  I will plan to share it when I'm done.  If I
> have time to write-up my findings for #3 I will similarly share that.
> 
> Thanks to all for your input on this issue.

Have fun!

-- 
Hans van Kranenburg


Re: btrfs-cleaner / snapshot performance analysis

2018-02-10 Thread Hans van Kranenburg
Hey,

On 02/10/2018 07:29 PM, Ellis H. Wilson III wrote:
> Thank you very much for your response Hans.  Comments in-line, but I did
> want to handle one miscommunication straight-away:
> 
> I'm a huge fan of BTRFS.  If I came off like I was complaining, my
> sincere apologies.   To be completely transparent we are using BTRFS in
> a very large project at my company, which I am lead architect on, and
> while I have read the academic papers, perused a subset of the source
> code, and been following its development in the background, I now need
> to deeply understand where there might be performance hiccups.

I'd suggest just trying to do what you want to do for real, finding out
what the problems are and then finding out what to do about them, but I
think that's already almost exactly what you've started doing now. :)

If you ask 100 different btrfs users about your specific situation, you
probably get 100 different answers. So, I'll just throw some of my own
thoughts in here, which may or may not make sense for you.

> All of
> our foreground I/O testing with BTRFS in RAID0/RAID1/single across
> different SSDs and HDDs has been stellar, but we haven't dug too far
> into snapshot performance, balancing, and other more background-oriented
> performance.  Hence my interest in finding documentation and analysis I
> can read and grok myself on the implications of snapshot operations on
> foreground I/O if such exists.

> More in-line below:
> 
> On 02/09/2018 03:36 PM, Hans van Kranenburg wrote:
>>> This has proven thus far less than helpful, as
>>> the response tends to be "use less snapshots," or "disable quotas," both
>>> of which strike me as intellectually unsatisfying answers
>>
>> Well, sometimes those answers help. :) "Oh, yes, I disabled qgroups, I
>> didn't even realize I had those, and now the problem is gone."
> 
> I meant less than helpful for me, since for my project I need detailed
> and fairly accurate capacity information per sub-volume, and the
> relationship between qgroups and subvolume performance wasn't being
> spelled out in the responses.  Please correct me if I am wrong about
> needing qgroups enabled to see detailed capacity information
> per-subvolume (including snapshots).

Aha, so you actually want to use qgroups.
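
For reference, the commands involved are nothing exotic (mount point
made up):

  # btrfs quota enable /mnt/data
  # btrfs quota rescan -w /mnt/data
  # btrfs qgroup show /mnt/data

The qgroup show output lists referenced and exclusive bytes per
subvolume/snapshot, which is the per-subvolume accounting being asked
for here; the rest of this discussion is about what keeping those
numbers up to date costs.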

>>> the former in a filesystem where snapshots are supposed to be
>>> "first-class citizens."

They are. But if you put extra optional features X, Y and Z on top
which kill your performance, then snapshots are still supposed to be
first-class citizens; features X, Y and Z just start blurring the
picture a bit.

The problem is that qgroups, quota etc. are still in development, and if
you ask the developers, they will probably be honest about the fact that
you cannot just enable that part of the functionality without some
expected and unexpected performance side effects.

>> Throwing complaints around is also not helpful.
> 
> Sorry about this.  It wasn't directed in any way at BTRFS developers,
> but rather some of the suggestions for solution proposed in random
> forums online.
> As mentioned I'm a fan of BTRFS, especially as my
> project requires the snapshots to truly be first-class citizens in that
> they are writable and one can roll back to them at will, unlike in ZFS
> and other filesystems.  I was just saying it seemed backwards to suggest
> having fewer snapshots was a solution in a filesystem where the
> architecture appears to treat them as a core part of the design.

And I was just saying that subvolumes and snapshots are fine, and that
you shouldn't blame them while your problems might be more likely
qgroups/quota related.

>> The "performance implications" are highly dependent on your specific
>> setup, kernel version, etc, so it really makes sense to share:
>>
>> * kernel version
>> * mount options (from /proc/mounts|grep btrfs)
>> * is it ssd? hdd? iscsi lun?
>> * how big is the FS
>> * how many subvolumes/snapshots? (how many snapshots per subvolume)
> 
> I will answer the above, but would like to reiterate my previous comment
> that I still would like to understand the fundamental relationships here
> as in my project kernel version is very likely to change (to more
> recent), along with mount options and underlying device media.  Once
> this project hits the field I will additionally have limited control
> over how large the FS gets (until physical media space is exhausted of
> course) or how many subvolumes/snapshots there are.  If I know that
> above N snapshots per subvolume performance tanks by M%, I can apply
> limits on the use-case in the field, but I am not aware of those kinds
> of performance implications yet.
> 
> My present situation is the following

[PATCH] btrfs: alloc_chunk: fix DUP stripe size handling

2018-02-05 Thread Hans van Kranenburg
In case of using DUP, we search for enough unallocated disk space on a
device to hold two stripes.

The devices_info[ndevs-1].max_avail that holds the amount of unallocated
space found is directly assigned to stripe_size, while it's actually
twice the stripe size.

Later on in the code, an unconditional division of stripe_size by
dev_stripes corrects the value, but in the meantime there's a check to
see if the stripe_size does not exceed max_chunk_size. Since during this
check stripe_size is twice the amount as intended, the check will reduce
the stripe_size to max_chunk_size if the actual correct to be used
stripe_size is more than half the amount of max_chunk_size.

The unconditional division later tries to correct stripe_size, but will
actually make sure we can't allocate more than half the max_chunk_size.

Fix this by moving the division by dev_stripes before the max chunk size
check, so it always contains the right value, instead of putting a duct
tape division in further on to get it fixed again.

Since in all other cases than DUP, dev_stripes is 1, this change only
affects DUP.

Other attempts in the past were made to fix this:
* 37db63a400 "Btrfs: fix max chunk size check in chunk allocator" tried
to fix the same problem, but still resulted in part of the code acting
on a wrongly doubled stripe_size value.
* 86db25785a "Btrfs: fix max chunk size on raid5/6" unintentionally
broke this fix again.

The real problem was already introduced with the rest of the code in
73c5de0051.

The user visible result however will be that the max chunk size for DUP
will suddenly double, while it's actually acting according to the limits
in the code again like it was 5 years ago.

Reported-by: Naohiro Aota <naohiro.a...@wdc.com>
Link: https://www.spinics.net/lists/linux-btrfs/msg69752.html
Fixes: 73c5de0051 ("btrfs: quasi-round-robin for chunk allocation")
Fixes: 86db25785a ("Btrfs: fix max chunk size on raid5/6")
Signed-off-by: Hans van Kranenburg <hans.van.kranenb...@mendix.com>
Cc: Naohiro Aota <naohiro.a...@wdc.com>
Cc: Arne Jansen <sensi...@gmx.net>
Cc: Chris Mason <chris.ma...@fusionio.com>
---
 fs/btrfs/volumes.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 4006b2a1233d..a50bd02b7ada 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4737,7 +4737,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 * the primary goal is to maximize the number of stripes, so use as many
 * devices as possible, even if the stripes are not maximum sized.
 */
-   stripe_size = devices_info[ndevs-1].max_avail;
+   stripe_size = div_u64(devices_info[ndevs-1].max_avail, dev_stripes);
num_stripes = ndevs * dev_stripes;
 
/*
@@ -4772,8 +4772,6 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
stripe_size = devices_info[ndevs-1].max_avail;
}
 
-   stripe_size = div_u64(stripe_size, dev_stripes);
-
/* align to BTRFS_STRIPE_LEN */
stripe_size = round_down(stripe_size, BTRFS_STRIPE_LEN);
 
-- 
2.11.0



Re: [PATCH 21/26] btrfs-progs: use libbtrfsutil for subvol show

2018-02-02 Thread Hans van Kranenburg
On 01/26/2018 07:41 PM, Omar Sandoval wrote:
> From: Omar Sandoval <osan...@fb.com>
> 
> Now implemented with btrfs_util_subvolume_path(),
> btrfs_util_subvolume_info(), and subvolume iterators.
> 
> Signed-off-by: Omar Sandoval <osan...@fb.com>
> ---
>  cmds-subvolume.c | 150 
> ---
>  utils.c  | 118 ---
>  utils.h  |   5 --
>  3 files changed, 99 insertions(+), 174 deletions(-)
> 
> diff --git a/cmds-subvolume.c b/cmds-subvolume.c
> index c5e03011..b969fc88 100644
> --- a/cmds-subvolume.c
> +++ b/cmds-subvolume.c
>
>
> [...]
>   } else
>   strcpy(tstr, "-");
>   printf("\tCreation time: \t\t%s\n", tstr);
>  
> - printf("\tSubvolume ID: \t\t%llu\n", get_ri.root_id);
> - printf("\tGeneration: \t\t%llu\n", get_ri.gen);
> - printf("\tGen at creation: \t%llu\n", get_ri.ogen);
> - printf("\tParent ID: \t\t%llu\n", get_ri.ref_tree);
> - printf("\tTop level ID: \t\t%llu\n", get_ri.top_id);
> + printf("\tSubvolume ID: \t\t%" PRIu64 "\n", subvol.id);
> + printf("\tGeneration: \t\t%" PRIu64 "\n", subvol.generation);
> + printf("\tGen at creation: \t%" PRIu64 "\n", subvol.otransid);
> + printf("\tParent ID: \t\t%" PRIu64 "\n", subvol.parent_id);
> + printf("\tTop level ID: \t\t%" PRIu64 "\n", subvol.parent_id);
>  
> - if (get_ri.flags & BTRFS_ROOT_SUBVOL_RDONLY)
> + if (subvol.flags & BTRFS_ROOT_SUBVOL_RDONLY)
>   printf("\tFlags: \t\t\treadonly\n");
>   else
>   printf("\tFlags: \t\t\t-\n");
>  
>   /* print the snapshots of the given subvol if any*/
>   printf("\tSnapshot(s):\n");
> - filter_set = btrfs_list_alloc_filter_set();
> - btrfs_list_setup_filter(_set, BTRFS_LIST_FILTER_BY_PARENT,
> - (u64)(unsigned long)get_ri.uuid);
> - btrfs_list_setup_print_column(BTRFS_LIST_PATH);
>  
> - fd = open_file_or_dir(fullpath, );
> - if (fd < 0) {
> - fprintf(stderr, "ERROR: can't access '%s'\n", fullpath);
> - goto out;
> + err = btrfs_util_f_create_subvolume_iterator(fd,
> +  BTRFS_FS_TREE_OBJECTID,
> +  0, );
> +
> [...]
When you have enough subvolumes in a filesystem, let's say 10 (yes,
that sometimes happens), the current btrfs sub list is quite unusable,
which is kind of expected. But, currently, sub show is also unusable
because it still starts loading a list of all subvolumes to be able to
print the 'snapshots, if any'.

So I guess that this new sub show will at least print the info first and
then use the iterator and start crawling through the list, which can be
interrupted? At least you get the relevant info first then. :-)

-- 
Hans van Kranenburg


Re: [PATCH 00/26] btrfs-progs: introduce libbtrfsutil, "btrfs-progs as a library"

2018-02-02 Thread Hans van Kranenburg
Hi,

On 01/26/2018 07:51 PM, Hugo Mills wrote:
> On Fri, Jan 26, 2018 at 10:40:48AM -0800, Omar Sandoval wrote:
>> From: Omar Sandoval <osan...@fb.com>
>>
>> One of the features requests I get most often is a library to do the
>> sorts of operations that we do with btrfs-progs. We can shell out to
>> btrfs-progs, but the output format isn't always easily parsasble, and
>> shelling out isn't always ideal. There's libbtrfs, but it's very
>> incomplete, doesn't have a well thought out API, and is licensed under
>> the GPL, making it hard to use for many projects.
>>
>> libbtrfsutil is a new library written from scratch to address these
>> issues. The implementation is completely independent of the existing
>> btrfs-progs code, including kerncompat.h, and has a clean API and naming
>> scheme. It is licensed under the LGPL. It also includes Python bindings
>> by default. I will maintain the library code.

Insert mandatory picture here:

https://memegenerator.net/img/instances/12076366/python-all-the-things.jpg

>*speechless*
> 
>That's awesome, Omar (although with the python bindings, you've
> probably just ruined Hans' day ;) ).

Not quite, or should I say, quite the opposite, actually.

I can think of a few different python libs related to btrfs right now...

1. This one, which works on the same level as the btrfs-progs command
line utils and provides the same set of functionality, with the same
input and output, only way easier to handle when automating things. The
audience is the same audience that would otherwise try to 'shell out' to
btrfs-progs and then do painful parsing of the output.

2. A lib for advanced btrfs users (more of a niche audience) who want to
get a peek behind the curtains of what is going on in their online
filesystem. Directly do searches in metadata, directly call an ioctl
etc, either for educational purposes, or to e.g. do things like write a
small special purpose dedupe program, visualize things etc. This is the
python3-btrfs lib that I've been working on for some time now.

3. A lib for btrfs developers who want to script working on offline
filesystems. This one does not exist yet. Well, it's a bit of an unborn
baby now, my first brain dump:
  https://github.com/knorrie/btrfs-progs/blob/python/python3/README.md
This would be the lib that exposes the (to be revised and improved)
internal libbtrfs (I guess).

All of these are not competitors. They serve different purposes and
different audiences. So I'm happy to see number 1 also happening. :-)

>> Patch 1 is a preparation cleanup which can go in independently. Patch 2
>> adds the build system stuff for the library, and patch 3 does the same
>> for the Python bindings. Patches 4-14 implement the library helpers,
>> currently subvolume helpers and the sync ioctls. Patches 15-26 replace
>> the btrfs-progs and libbtrfs code to use libbtrfsutil instead. I took
>> care to preserve backwards-compatibility. `btrfs subvol list` in
>> particular had some buggy behaviors for -o and -a that I emulated in the
>> new code, see the comments in the code.

+1 for starting to clean up that code. Especially the whole subvolume
list part looks like a great improvement.

>> These patches are also available on my GitHub:
>> https://github.com/osandov/btrfs-progs/tree/libbtrfsutil. That branch
>> will rebase as I update this series.
>>
>> Please share feedback regarding the API, implementation, or anything
>> else.

Great work, and I'm of course also interested in what you think about
the lib nr. 3 idea, since I hadn't thrown it around on the list
yet. :-)

-- 
Hans van Kranenburg


Re: Superblock update: Is there really any benefits of updating synchronously?

2018-01-24 Thread Hans van Kranenburg
On 01/24/2018 07:54 PM, waxhead wrote:
> Hans van Kranenburg wrote:
>> On 01/23/2018 08:51 PM, waxhead wrote:
>>> Nikolay Borisov wrote:
>>>> On 23.01.2018 16:20, Hans van Kranenburg wrote:
>>
>> [...]
>>
>>>>>
>>>>> We also had a discussion about the "backup roots" that are stored
>>>>> besides the superblock, and that they are "better than nothing" to help
>>>>> maybe recover something from a borken fs, but never ever guarantee you
>>>>> will get a working filesystem back.
>>>>>
>>>>> The same holds for superblocks from a previous generation. As soon as
>>>>> the transaction for generation X successfully hits the disk, all space
>>>>> that was occupied in generation X-1 but no longer in X is available to
>>>>> be overwritten immediately.
>>>>>
>>> Ok so this means that superblocks with an older generation are utterly
>>> useless and will lead to corruption (effectively making my argument
>>> above useless as that would in fact assist corruption then).
>>
>> Mostly, yes.
>>
>>> Does this means that if disk space was allocated in X-1 and is freed in
>>> X it will unallocated if you roll back to X-1 e.g. writing to
>>> unallocated storage.
>>
>> Can you reword that? I can't follow that sentence.
> Sure why not. I'll give it a go:
> 
> Does this mean that if...
> * Superblock generation N-1 has range 1234-2345 allocated and used.
> 
> and
> 
> * Superblock generation N-0 (the current) has range 1234-2345 free
> because someone deleted a file or something

Ok, so I assume that with current you mean the one on disk now.

> Then
> 
> There is no point in rolling back to generation N-1 because that refers
> to what is now essentially free "memory", which may or may not have been
> written over by generation N-0.

If space that was used in N-1 turned into free space during N-0, then
N-0 will never reuse that space itself: if writing out N-0 crashes
halfway, the superblock seen when mounting is still N-1, and then N-1
still needs to be fully usable.

That space can however be reused immediately by N+1, after the N-0
superblock is safe on disk.

> And therefore N-1 which still thinks 
> range 1234-2345 is allocated may point to the wrong data.

So, at least for disk space used by metadata blocks:

1234-2345 - N-1 - in use
1234-2345 - N-0 - not in use, but can't be overwritten yet
1234-2345 - N+1 - can start writing whatever it wants in that disk
location any time

> I hope that was easier to follow - if not, don't hold back on the
> expletives! :)
> 
>>
>>> I was under the impression that a superblock was like a "snapshot" of
>>> the entire filesystem and that rollbacks via pre-gen superblocks were
>>> possible. Am I mistaken?
>>
>> Yes. The first fundamental thing in Btrfs is COW which makes sure that
>> everything referenced from transaction X, from the superblock all the
>> way down to metadata trees and actual data space is never overwritten by
>> changes done in transaction X+1.
>>
> Perhaps a tad off topic, but assuming the (hopefully) better explanation 
> above clears things up a bit. What happens if a block is freed?! in X+1
> --- which must mean that it can be overwritten in transaction X+1 (which 
> I assume means a new superblock generation). After all without freeing 
> and overwriting data there is no way to re-use space.

Freed in X you mean? Or not? But you write "freed?! in X+1".

For actual data disk space, it's the same pattern as above (so space
freed up during a transaction can only be reused in the next one), but
implemented a bit differently.

For metadata trees which do not have reference counting (e.g. the
extent tree), there's the pinned extents list (of metadata block disk
locations) I mentioned already.

For data, we have the filesystem (subvolume) trees, which reference all
files and the data extents they use, and via the links to the extent
tree they keep all disk locations holding actual data marked as
occupied.

Now comes the different part. Because the filesystem trees already
implement the extra reference counting functionality, this is being used
to prevent freed up data space from already being overwritten in the
same transaction.

How does this work? Well, that's the rest of the wiki section I linked
below. :-D So you're asking exactly the right next question here I guess.

When making changes to a subvolume tree (normal file create, write
content, rename delete etc), btrfs is secretly just cloning the tree
into a new subvolume with the same subvolume ID. Wait, what? Whoa

Re: Superblock update: Is there really any benefits of updating synchronously?

2018-01-23 Thread Hans van Kranenburg
On 01/23/2018 08:51 PM, waxhead wrote:
> Nikolay Borisov wrote:
>> On 23.01.2018 16:20, Hans van Kranenburg wrote:

[...]

>>>
>>> We also had a discussion about the "backup roots" that are stored
>>> besides the superblock, and that they are "better than nothing" to help
>>> maybe recover something from a borken fs, but never ever guarantee you
>>> will get a working filesystem back.
>>>
>>> The same holds for superblocks from a previous generation. As soon as
>>> the transaction for generation X succesfully hits the disk, all space
>>> that was occupied in generation X-1 but no longer in X is available to
>>> be overwritten immediately.
>>>
> Ok so this means that superblocks with an older generation are utterly
> useless and will lead to corruption (effectively making my argument
> above useless as that would in fact assist corruption then).

Mostly, yes.

> Does this means that if disk space was allocated in X-1 and is freed in
> X it will unallocated if you roll back to X-1 e.g. writing to
> unallocated storage.

Can you reword that? I can't follow that sentence.

> I was under the impression that a superblock was like a "snapshot" of
> the entire filesystem and that rollbacks via pre-gen superblocks were
> possible. Am I mistaken?

Yes. The first fundamental thing in Btrfs is COW which makes sure that
everything referenced from transaction X, from the superblock all the
way down to metadata trees and actual data space is never overwritten by
changes done in transaction X+1.

For metadata trees that are NOT filesystem trees a.k.a. subvolumes, the
way this is done is actually quite simple. If a block is cowed, the old
location is added to a 'pinned extents' list (in memory), which is used
as a blacklist for choosing space to put new writes in. After a
transaction is completed on disk, that list with pinned extents is
emptied and all that space is available for immediate reuse. This way we
make sure that if the transaction that is ongoing is aborted, the
previous one (latest one that is completely on disk) is always still
there. If the computer crashes and the in memory list is lost, no big
deal, we just continue from the latest completed transaction again after
a reboot. (ignoring extra log things for simplicity)

So, the only situation in which you can fully use an X-1 superblock is
when none of that previously pinned space has actually been overwritten
yet afterwards.

And if any of the space was overwritten already, you can go play around
with using an older superblock and your filesystem mounts and everything
might look fine, until you hit that distant corner and BOOM!

 >8  Extra!! Moar!!  >8 

But, doing so does not give you snapshot functionality yet! It's more
like a poor man's snapshot that can only prevent you from messing up the
current version.

Snapshot functionality is implemented only for filesystem trees
(subvolumes) by adding reference counting (which does end up on disk) to
the metadata blocks, and then COW trees as a whole.

If you make a snapshot of a filesystem tree, the snapshot gets a whole
new tree ID! It's not a previous version of the same subvolume you're
looking at, it's a clone!

This is a big difference. The extent tree is always tree 2. The chunk
tree is always tree 3. But your subvolume snapshot gets a new tree number.
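
You can see this from userspace, for example (IDs will differ, paths
made up):

  # btrfs subvolume create /mnt/sv
  # btrfs subvolume snapshot /mnt/sv /mnt/sv_snap
  # btrfs inspect-internal rootid /mnt/sv
  256
  # btrfs inspect-internal rootid /mnt/sv_snap
  257

The snapshot is a separate tree with its own ID, not an older generation
of the tree it was taken from.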

Technically, it would maybe be possible to implement reference counting
and snapshots for all of the metadata trees, but it would probably mean
that the whole filesystem would get stuck in rewriting itself all day
instead of doing any useful work. The current extent tree already has
such an amount of rumination problems that the added work of keeping track
of reference counts would make it completely unusable.

In the wiki, it's here:
https://btrfs.wiki.kernel.org/index.php/Btrfs_design#Copy_on_Write_Logging

Actually, I just paraphrased the first two of those six paragraphs... The
subvolume trees actually having a previous version of themselves again
(wh!) is another thing... ;]

-- 
Hans van Kranenburg


Re: Superblock update: Is there really any benefits of updating synchronously?

2018-01-23 Thread Hans van Kranenburg
On 01/23/2018 10:03 AM, Nikolay Borisov wrote:
> 
> On 23.01.2018 09:03, waxhead wrote:
>> Note: This have been mentioned before, but since I see some issues
>> related to superblocks I think it would be good to bring up the question
>> again.
>>
>> [...]
>> https://btrfs.wiki.kernel.org/index.php/On-disk_Format#Superblock
>>
>> The superblocks are updated synchronously on HDD's and one after each
>> other on SSD's.
> 
> There is currently no distinction in the code whether we are writing to
> SSD or HDD.

So what does that line in the wiki mean, and why is it there? "btrfs
normally updates all superblocks, but in SSD mode it will update only
one at a time."

> Also, what do you mean by synchronously? If you inspect the
> code in write_all_supers you will see that for every device we issue
> writes for every available copy of the superblock and then wait for all
> of them to be finished via 'wait_dev_supers'. In that regard sb
> writeout is asynchronous.
> 
>> Superblocks are also (to my knowledge) not protected by copy-on-write
>> and are read-modify-update.
>>
>> On a storage device with >256GB there will be three superblocks.
>>
>> BTRFS will always prefer the superblock with the highest generation
>> number providing that the checksum is good.
> 
> Wrong. On mount btrfs will only ever read the first superblock at 64k.
> If that one is corrupted it will refuse to mount, then it's expected the
> user will initiate recovery procedure with btrfs-progs which reads all
> supers and replaces them with the "newest" one (as decided by the
> generation number)

So again, the line "The superblock with the highest generation is used
when reading." in the wiki needs to go away then?
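
For checking this by hand, dump-super can show all copies of the
superblock and their generation numbers, e.g.:

  # btrfs inspect-internal dump-super -a /dev/sdb | grep -E '^(superblock:|generation)'

On a healthy filesystem all copies end up with the same generation; the
kernel itself, as Nikolay describes, only reads the first one at mount
time.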

>> On the list there seem to be a few incidents where the superblocks have
>> gone toast and I am pondering what (if any) benefits there is by
>> updating the superblocks synchronously.
>>
>> The superblock is checkpoint'ed every 30 seconds by default and if
>> someone pulls the plug (poweroutage) on HDD's then a synchronous write
>> depending on (the quality of) your hardware may perhaps ruin all the
>> superblock copies in one go. E.g. Copy A,B and C will all be updated at
>> 30s.
>>
>> On SSD's, since one superblock is updated after other it would mean that
>> using the default 30 second checkpoint Copy A=30s, Copy B=1m, Copy C=1m30s
>
> As explained previously there is no notion of "SSD vs HDD" modes.

We also had a discussion about the "backup roots" that are stored
besides the superblock, and that they are "better than nothing" to help
maybe recover something from a borken fs, but never ever guarantee you
will get a working filesystem back.

The same holds for superblocks from a previous generation. As soon as
the transaction for generation X successfully hits the disk, all space
that was occupied in generation X-1 but no longer in X is available to
be overwritten immediately.

-- 
Hans van Kranenburg


Re: Recommendations for balancing as part of regular maintenance?

2018-01-11 Thread Hans van Kranenburg
On 01/10/2018 05:38 AM, Duncan wrote:
> [...]
> 
> And I've definitely noticed an effect since the ssd option stopped using 
> the 2 MiB spreading algorithm in 4.14.

Glad to hear. :-)

> In particular, while chunk usage 
> was generally stable before that and I only occasionally needed to run 
> balance to clear out empty chunks, now, balance with the usage filter 
> will apparently actively fill in empty space in existing chunks, so while 
> previously a usage-filtered balance that only rewrote one chunk didn't 
> actually free anything, simply allocating a new chunk to replace the one 
> it freed, so at least two chunks needed rewritten to actually free space 
> back to unallocated...
> 
> Now, usage-filtered rewrites of only a single chunk routinely frees the 
> allocated space, because it writes that small bit of data in the freed 
> chunk into existing free space in other chunks.

And that back-filling the existing chunks indeed also means a decrease
in total work that needs to be done. But this probably also means that
the free space gaps you have/had were rather small so they got ignored
in the past. Large free gaps would always get data written in them by
balance already. It probably also means that now you're on 4.14, far
fewer of these small free space gaps will be left behind, because they're
already immediately reused by new small writes.
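
For reference, the kind of command being discussed here is a
usage-filtered balance, e.g.:

  # btrfs balance start -dusage=10 /mountpoint

which only rewrites data block groups that are at most 10% used, so a
single nearly-empty chunk can be compacted away without touching the
rest of the filesystem.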

> At least I /presume/ that new balance-usage behavior is due to the ssd 
> changes.

Most probably, yes.

>  Maybe it's due to other patches.  Either way, it's an 
> interesting and useful change. =:^)

-- 
Hans van Kranenburg


Re: Recommendations for balancing as part of regular maintenance?

2018-01-11 Thread Hans van Kranenburg
On 01/10/2018 10:37 PM, waxhead wrote:
> As just a regular user I would think that the first thing you would need
> is an analysis that can tell you if it is a good idea to balance or not
> in the first place.

Tooling to create that is available. Btrfs allows you to read a lot of
different data to analyze, and then you can experiment with your own
algorithms to find out which blockgroup you're going to feed to balance
next.

There's two language options...
* C -> in this case you're extending btrfs-progs to build new tools
* Python -> python-btrfs has everything in it to quickly throw together
things like this, and examples are available with the source.

For example:
- balance_least_used.py -> balance starting from the least used chunk
and work up towards a max of X% used.
- show_free_space_fragmentation.py -> find out which chunks have badly
fragmented free space. Remember: if you have a 1GiB chunk with usage
50%, that doesn't tell you if it has only a handful of extents, filling
up the first 500MiB, with the rest empty, or if it's thousands of
alternating pieces of 4KiB used space and 4KiB free space. ;-] [0]

In the same way you can program something new, like a balance algorithm
that cleans up blocks with high free space fragmentation first.

Or, another thing you could do is first count the number of extents in a
block group and add it to the algorithm. Balance of a block group with a
few extents is much faster than thousands of extents with a lot of
reflinks, like highly deduped data.

Or... look at generation of metadata to find out which parts of data on
your disk have been touched recently, and which weren't... Too many fun
things to play around with. \:D/

As always, the first thing to do is to make sure you're on 4.14, or
otherwise use nossd; otherwise you might keep shoveling data around
forever.

And if your filesystem has been treated badly by <4.14 kernel in ssd
mode for a long time, then first get that cleaned up:

https://www.spinics.net/lists/linux-btrfs/msg70622.html
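
For completeness, a hedged sketch of the nossd part (mount point is
just an example):

  # mount -o remount,nossd /mountpoint

or put nossd in the mount options in /etc/fstab, so the old ssd
allocation behaviour stays off across reboots on kernels before 4.14.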

> Scrub seems like a great place to start - e.g. scrub could auto-analyze
> and report back need to balance. I also think that scrub should
> optionally autobalance if needed.
> 
> Balance may not be needed, but if one can determine that balancing would
> speed up things a bit I don't see why this as an option can't be
> scheduled automatically. Ideally there should be a "scrub and polish"
> option that would scrub, balance and perhaps even defragment in one go.
> 
> In fact, the way I see it btrfs should ideally by itself keep track of
> each data/metadata chunk: it should know when this chunk was last
> affected by a scrub, balance, defrag etc and perform the required
> operations by itself based on a configuration or similar. Some may
> disagree for good reasons, but for me this is my wishlist for a
> filesystem :) e.g. a pool that just works and only annoys you with the
> need of replacing a bad disk every now and then :)

I don't think these kind of things will ever end up in kernel code.

[0] There's a version in the devel branch in git that also works without
free space tree, taking a slower detour via the extent tree.

-- 
Hans van Kranenburg


Re: btrfs balance problems

2017-12-29 Thread Hans van Kranenburg
On 12/28/2017 12:15 PM, Nikolay Borisov wrote:
> 
> On 23.12.2017 13:19, James Courtier-Dutton wrote:
>>
>> During a btrfs balance, the process hogs all CPU.
>> Or, to be exact, any other program that wishes to use the SSD during a
>> btrfs balance is blocked for long periods. Long periods being more
>> than 5 seconds.
>> Is there any way to multiplex SSD access while btrfs balance is
>> operating, so that other applications can still access the SSD with
>> relatively low latency?
>>
>> My guess is that btrfs is doing a transaction with a large number of
>> SSD blocks at a time, and thus blocking other applications.
>>
>> This makes for atrocious user interactivity as well as applications
>> failing because they cannot access the disk in a relatively low-latency
>> manner.
>> For, example, this is causing a High Definition network CCTV
>> application to fail.
>>
>> What I would really like, is for some way to limit SSD bandwidths to
>> applications.
>> For example the CCTV app always gets the bandwidth it needs, and all
>> other applications can still access the SSD, but are rate limited.
>> This would fix my particular problem.
>> We have rate limiting for network applications, why not disk access also?
> 
> So how are you running btrfs balance?

Or, to again take one step further back...

*Why* are you running btrfs balance at all?

:)

> Are you using any filters
> whatsoever? The documentation
> [https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs-balance] has the
> following warning:
> 
> Warning: running balance without filters will take a lot of time as it
> basically rewrites the entire filesystem and needs to update all block
> pointers.
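
If a nightly balance really is wanted, a hedged example of what "with
filters" looks like in practice:

  # btrfs balance start -dusage=25,limit=10 -musage=25,limit=5 /mountpoint

This only rewrites data/metadata block groups that are at most 25% used,
and stops after a bounded number of them, instead of rewriting all block
pointers in the filesystem in one go.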

-- 
Hans van Kranenburg


[PATCH 1/2] btrfs-progs: Fix progs_extra build dependencies

2017-12-23 Thread Hans van Kranenburg
The Makefile does not have a dependency path that builds dependencies
for tools listed in progs_extra.

E.g. doing make btrfs-show-super in a clean build environment results in:
gcc: error: cmds-inspect-dump-super.o: No such file or directory
Makefile:389: recipe for target 'btrfs-show-super' failed

Signed-off-by: Hans van Kranenburg <h...@knorrie.org>
---
 Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index 30a0ee22..390b138f 100644
--- a/Makefile
+++ b/Makefile
 cmds_restore_cflags = -DBTRFSRESTORE_ZSTD=$(BTRFSRESTORE_ZSTD)
 CHECKER_FLAGS += $(btrfs_convert_cflags)
 
 # collect values of the variables above
-standalone_deps = $(foreach dep,$(patsubst %,%_objects,$(subst -,_,$(filter btrfs-%, $(progs)))),$($(dep)))
+standalone_deps = $(foreach dep,$(patsubst %,%_objects,$(subst -,_,$(filter btrfs-%, $(progs) $(progs_extra)))),$($(dep)))
 
 SUBDIRS =
 BUILDDIRS = $(patsubst %,build-%,$(SUBDIRS))
-- 
2.11.0



[PATCH 2/2] btrfs-progs: Fix build of btrfs-calc-size

2017-12-23 Thread Hans van Kranenburg
Build would fail because it couldn't find the usage function.

Signed-off-by: Hans van Kranenburg <h...@knorrie.org>
---
 btrfs-calc-size.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/btrfs-calc-size.c b/btrfs-calc-size.c
index 1ac7c785..d2d68ab2 100644
--- a/btrfs-calc-size.c
+++ b/btrfs-calc-size.c
@@ -19,6 +19,7 @@
 #include "volumes.h"
 #include "utils.h"
 #include "commands.h"
+#include "help.h"
 
 int main(int argc, char **argv)
 {
-- 
2.11.0



[PATCH 0/2] btrfs-progs: Fix progs_extra build failures

2017-12-23 Thread Hans van Kranenburg
Alternatively, it might be a better idea to just remove the deprecated source
files, since this is not the first time build failures in them went unnoticed.

Hans van Kranenburg (2):
  btrfs-progs: Fix progs_extra build dependencies
  btrfs-progs: Fix build of btrfs-calc-size

 Makefile  | 2 +-
 btrfs-calc-size.c | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

-- 
2.11.0



Re: kernel hangs during balance

2017-12-19 Thread Hans van Kranenburg
On 12/19/2017 06:08 PM, Rich Rauenzahn wrote:
> What's also confusing is I just ran a manual balance on the fs using
> defaults (which are aggressive) and it completed with no problems.
> It smells more like a race condition than a particular corruption.

Just a wild first guess... are you also using btrfs send/receive
functionality where the system having problems is the sending part?

> On Tue, Dec 19, 2017 at 8:09 AM, Rich Rauenzahn <rraue...@gmail.com> wrote:
>> I'm running 4.4.106-1.el7.elrepo.x86_64 and I do a btrfs balance every night.
>>
>> Every night I'm getting a kernel hang, sometimes caught by my
>> watchdog, sometimes not.  Last night's hang was on the balance of DATA
>> on / at 70.
>>
>> I'm not sure how to further trace this down to help you -- the console
>> by the time I notice just has lots of messages on it without the
>> initial ones.

Capturing more logs is definitely the first thing to do.

Check if the output of `dmesg` still shows the btrfs errors. Otherwise,
if something is spamming there, turn that off; and if you don't have the
errors in a log file because the log files are on the same btrfs, then
you have to find another way to capture them. E.g. make the kernel
buffer for messages bigger, use netconsole, or just do pragmatic things
like ssh in from another server, run `dmesg -w` and store the output on
the other machine.
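
A couple of concrete ways to do that (hostnames etc. are placeholders):

  remote$ ssh root@thebox dmesg -w | tee btrfs-hang.log

or boot the affected machine with a bigger kernel message buffer, e.g.
log_buf_len=16M on the kernel command line, so the start of the trace is
still there when you get to look at the console.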

>>   The last items in /var/log/message aren't helpful, but
>> I'm pretty sure it is the nightly balance.
>>
>> I've run btrfs check on / with no issues recently.

-- 
Hans van Kranenburg


Re: [4.14.3] btrfs out of space error

2017-12-15 Thread Hans van Kranenburg
On 12/15/2017 09:53 AM, Qu Wenruo wrote:
> 
> 
> On 2017年12月15日 16:36, Ian Kumlien wrote:
>> On Fri, Dec 15, 2017 at 6:34 AM, Roman Mamedov <r...@romanrm.net> wrote:
>>> On Fri, 15 Dec 2017 01:39:03 +0100
>>> Ian Kumlien <ian.kuml...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Running a 4.14.3 kernel, this just happened, but there should have
>>>> been another 20 gigs or so available.
>>>>
>>>> The filesystem seems fine after a reboot though
>>>
>>> What are your mount options, and can you show the output of "btrfs fi
>>> df" and "btrfs fi us" for the filesystem? And what does
>>> "cat /sys/block/sdb/queue/rotational" return.
>>
>> It's a btrfs raid1 mirror of two ssd:s
>>
>> mount options was:
>> defaults,acl,noatime,space_cache,autodefrag,compress=lzo
>>
>> btrfs fi df /
>> Data, RAID1: total=459.25GiB, used=372.42GiB
>> Data, single: total=8.00MiB, used=0.00B
>> System, RAID1: total=8.00MiB, used=80.00KiB
>> System, single: total=4.00MiB, used=0.00B
>> Metadata, RAID1: total=6.00GiB, used=3.69GiB
> 
> Both meta and data have a lot of space left.

The real question is how fragmented that free space is.

Here's a good starting point to read up about the -o ssd behavior pre 4.14:

https://www.spinics.net/lists/linux-btrfs/msg70622.html

>> Metadata, single: total=8.00MiB, used=0.00B
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>> btrfs fi us /
>> Overall:
>> Device size: 930.54GiB
>> Device allocated: 930.53GiB
>> Device unallocated:   20.05MiB
>> Device missing:  0.00B
>> Used: 752.22GiB
>> Free (estimated):   86.84GiB (min: 86.84GiB)
>> Data ratio:   2.00
>> Metadata ratio:   2.00
>> Global reserve: 512.00MiB (used: 0.00B)
>>
>> Data,single: Size:8.00MiB, Used:0.00B
>>/dev/sdb28.00MiB
>>
>> Data,RAID1: Size:459.25GiB, Used:372.42GiB
>>/dev/sdb2 459.25GiB
>>/dev/sdc2 459.25GiB
>>
>> Metadata,single: Size:8.00MiB, Used:0.00B
>>/dev/sdb28.00MiB
>>
>> Metadata,RAID1: Size:6.00GiB, Used:3.69GiB
>>/dev/sdb26.00GiB
>>/dev/sdc26.00GiB
>>
>> System,single: Size:4.00MiB, Used:0.00B
>>/dev/sdb24.00MiB
>>
>> System,RAID1: Size:8.00MiB, Used:80.00KiB
>>/dev/sdb28.00MiB
>>/dev/sdc28.00MiB
>>
>> Unallocated:
>>/dev/sdb2   24.00KiB
>>/dev/sdc2   20.02MiB
> 
> Well, at least no new chunk can be allocated.

Exactly.

> In v4.15, Josef introduced a new inode reservation system, which I think
> would enhance metadata-related reservation.
> 
> If you're hitting the problem quite frequently, please consider to try
> v4.15-rc* to see if it solves the problem.
> 
> Despite that, there should be no damage to your fs.
> (except some unwritten data in buffer)
> 
> Thanks,
> Qu
> 
>>
>> And as expected:
>> cat /sys/block/sdb/queue/rotational
>> 0
>>
>>> I wonder if it's the same old "ssd allocation scheme" problem, and no
>>> balancing done in a long time or at all.

Looks like it. So, yay, you're on 4.14 already. Now just do a full
balance of your entire filesystem, only once (data only, metadata not
needed) and then you can forget about this again.
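
In command form, that one-time cleanup is a data-only balance without
any usage filter, e.g. for the / filesystem shown above:

  # btrfs balance start -d /

Depending on the amount of data this can take a while, but it only has
to be done once.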

>> I had something similar happen on a laptop a while ago - took a while
>> before i could get it back in order
>> (in that case i think it was actually a oops --- it kept saying "no
>> space left" and switched to read only even
>> if you removed a lot of data, invalidated the space cache and so on)

-- 
Hans van Kranenburg


Re: [PATCH RFC] btrfs: self heal from SB fail

2017-12-08 Thread Hans van Kranenburg
On 12/08/2017 09:17 AM, Qu Wenruo wrote:
> 
> 
> On 2017年12月08日 15:57, Anand Jain wrote:
>> -EXPERIMENTAL-
>> As of now when primary SB fails we won't self heal and would fail mount,
>> this is an experimental patch which thinks why not go and read backup
>> copy.
> 
> Just curious about in which real world case that backup super block can
> help.
> At least from what I see in mail list, only few cases where backup super
> helps.

That's about using 'backup roots', which contain old info pointing at
previous generations instead of the current one on disk. That is a
different thing than using one of the other copies of the super block
itself.

> Despite that self super heal seems good, although I agree with David, we
> need a weaker but necessary check (magic and fsid from primary super?)
> to ensure it's a valid btrfs before we use the backup supers.
> 
> Thanks,
> Qu
> 
> 
>>
>> Signed-off-by: Anand Jain <anand.j...@oracle.com>
>> ---
>>  fs/btrfs/disk-io.c |  8 +++-
>>  fs/btrfs/volumes.c | 10 +++---
>>  2 files changed, 14 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 9b20c1f3563b..a791b8dfe8a8 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -3190,7 +3190,7 @@ struct buffer_head *btrfs_read_dev_super(struct 
>> block_device *bdev)
>>   * So, we need to add a special mount option to scan for
>>   * later supers, using BTRFS_SUPER_MIRROR_MAX instead
>>   */
>> -for (i = 0; i < 1; i++) {
>> +for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
>>  ret = btrfs_read_dev_one_super(bdev, i, &bh);
>>  if (ret)
>>  continue;
>> @@ -4015,11 +4015,17 @@ static int btrfs_check_super_valid(struct 
>> btrfs_fs_info *fs_info)
>>  ret = -EINVAL;
>>  }
>>  
>> +#if 0
>> +/*
>> + * Need a way to check for any copy of SB, as its not a
>> + * strong check, just ignore this for now.
>> + */
>>  if (btrfs_super_bytenr(sb) != BTRFS_SUPER_INFO_OFFSET) {
>>  btrfs_err(fs_info, "super offset mismatch %llu != %u",
>>btrfs_super_bytenr(sb), BTRFS_SUPER_INFO_OFFSET);
>>  ret = -EINVAL;
>>  }
>> +#endif
>>  
>>  /*
>>   * Obvious sys_chunk_array corruptions, it must hold at least one key
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 9fa2539a8493..f368db94d62b 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -1369,7 +1369,7 @@ int btrfs_scan_one_device(const char *path, fmode_t 
>> flags, void *holder,
>>  {
>>  struct btrfs_super_block *disk_super;
>>  struct block_device *bdev;
>> -struct page *page;
>> +struct buffer_head *sb_bh;
>>  int ret = -EINVAL;
>>  u64 devid;
>>  u64 transid;
>> @@ -1392,8 +1392,12 @@ int btrfs_scan_one_device(const char *path, fmode_t 
>> flags, void *holder,
>>  goto error;
>>  }
>>  
>> -if (btrfs_read_disk_super(bdev, bytenr, &page, &disk_super))
>> +sb_bh = btrfs_read_dev_super(bdev);
>> +if (IS_ERR(sb_bh)) {
>> +ret = PTR_ERR(sb_bh);
>>  goto error_bdev_put;
>> +}
>> +disk_super = (struct btrfs_super_block *) sb_bh->b_data;
>>  
>>  devid = btrfs_stack_device_id(&disk_super->dev_item);
>>  transid = btrfs_super_generation(disk_super);
>> @@ -1413,7 +1417,7 @@ int btrfs_scan_one_device(const char *path, fmode_t 
>> flags, void *holder,
>>  if (!ret && fs_devices_ret)
>>  (*fs_devices_ret)->total_devices = total_devices;
>>  
>> -btrfs_release_disk_super(page);
>> +brelse(sb_bh);
>>  
>>  error_bdev_put:
>>  blkdev_put(bdev, flags);
>>
> 


-- 
Hans van Kranenburg
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs-transacti hammering the system

2017-12-01 Thread Hans van Kranenburg
On 12/01/2017 06:57 PM, Holger Hoffstätte wrote:
> On 12/01/17 18:34, Matt McKinnon wrote:
>> Thanks, I'll give space_cache=v2 a shot.
> 
> Yes, very much recommended.
> 
>> My mount options are: rw,relatime,space_cache,autodefrag,subvolid=5,subvol=/
> 
> Turn autodefrag off and use noatime instead of relatime.
> 
> Your filesystem also seems very full,

We don't know. btrfs fi df only displays allocated space. And the
allocated space being full is good; it means there are not too many free
space fragments everywhere.

> that's bad with every filesystem but
> *especially* with btrfs because the allocator has to work really hard to find
> free space for COWing. Really consider deleting stuff or adding more space.

-- 
Hans van Kranenburg
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs-transacti hammering the system

2017-12-01 Thread Hans van Kranenburg
On 12/01/2017 05:31 PM, Matt McKinnon wrote:
> Sorry, I missed your in-line reply:
> 
> 
>> 1) The one right above, btrfs_write_out_cache, is the write-out of the
>> free space cache v1. Do you see this for multiple seconds going on, and
>> does it match the time when it's writing X MB/s to disk?
>>
> 
> It seems to only last until the next watch update.
> 
> [] io_schedule+0x16/0x40
> [] get_request+0x23e/0x720
> [] blk_queue_bio+0xc1/0x3a0
> [] generic_make_request+0xf8/0x2a0
> [] submit_bio+0x75/0x150
> [] btrfs_map_bio+0xe5/0x2f0 [btrfs]
> [] btree_submit_bio_hook+0x8c/0xe0 [btrfs]
> [] submit_one_bio+0x63/0xa0 [btrfs]
> [] flush_epd_write_bio+0x3b/0x50 [btrfs]
> [] flush_write_bio+0xe/0x10 [btrfs]
> [] btree_write_cache_pages+0x379/0x450 [btrfs]
> [] btree_writepages+0x5d/0x70 [btrfs]
> [] do_writepages+0x1c/0x70
> [] __filemap_fdatawrite_range+0xaa/0xe0
> [] filemap_fdatawrite_range+0x13/0x20
> [] btrfs_write_marked_extents+0xe9/0x110 [btrfs]
> [] btrfs_write_and_wait_transaction.isra.22+0x3d/0x80
> [btrfs]
> [] btrfs_commit_transaction+0x665/0x900 [btrfs]
> [] transaction_kthread+0x18a/0x1c0 [btrfs]
> [] kthread+0x109/0x140
> [] ret_from_fork+0x25/0x30
> 
> The last three lines will stick around for a while.  Is switching to
> space cache v2 something that everyone should be doing?  Something that
> would be a good test at least?

Yes. Read on.

>> 2) How big is this filesystem? What does your `btrfs fi df
>> /mountpoint` say?
>>
> 
> # btrfs fi df /export/
> Data, single: total=30.45TiB, used=30.25TiB
> System, DUP: total=32.00MiB, used=3.62MiB
> Metadata, DUP: total=66.50GiB, used=65.08GiB
> GlobalReserve, single: total=512.00MiB, used=53.69MiB

Multi-TiB filesystem, check. total/used ratio looks healthy.

>> 3) What kind of workload are you running? E.g. how can you describe it
>> within a range from "big files which just sit there" to "small writes
>> and deletes all over the place all the time"?
> 
> It's a pretty light workload most of the time.  It's a file system that
> exports two NFS shares to a small lab group.  I believe it is more small
> reads all over a large file (MRI imaging) rather than small writes.

Ok.

>> 4) What kernel version is this? `uname -a` output?
> 
> # uname -a
> Linux machine_name 4.12.8-custom #1 SMP Tue Aug 22 10:15:01 EDT 2017
> x86_64 x86_64 x86_64 GNU/Linux
> 

Yes, I'd recommend switching to space_cache v2, which stores the free
space information in a tree instead of separate blobs, and does not
block the transaction while writing out all info of all touched parts of
the filesystem again.

Here's of course the famous presentation with all kinds of info why:

http://events.linuxfoundation.org/sites/events/files/slides/vault2016_0.pdf

How:

* umount the filesystem
* btrfsck --clear-space-cache v1 /block/device
* do a rw mount with the space_cache=v2 option added (only needed
explicitly once)

During that mount, it will generate the free space tree by reading the
extent tree and writing the inverse of it. This will take some time,
depending on how fast your storage can do random reads with a cold disk
cache.
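
Spelled out for this filesystem, that would look something like this (with
/dev/sdX standing in for whatever block device /export lives on):

-# umount /export
-# btrfsck --clear-space-cache v1 /dev/sdX
-# mount -o space_cache=v2 /dev/sdX /export

After that first mount, the free space tree just keeps being used; the
option is only needed explicitly this once.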

For x86_64, using the free space cache v2 is fine since linux 4.5. Up to
4.9, there was a bug for big-endian systems. So, with your kernel it's
absolutely fine.

Why isn't this the default yet? It's because btrfs-progs doesn't have
support for updating the free space tree when doing offline modifications
(like check --repair or btrfstune, which you hopefully don't need often
anyway). So, until that's fully added, you need to do a `btrfsck
--clear-space-cache v2` first, then do the offline r/w action, and then the
tree is generated again on the next mount.

Additional tips (forgot to ask for your /proc/mounts before):
* Use the noatime mount option, so that merely accessing files does not
lead to changes in metadata, which lead to writes, which lead to cowing
and writes in a new place, which lead to updates of the free space
administration, etc...

-- 
Hans van Kranenburg
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs-transacti hammering the system

2017-12-01 Thread Hans van Kranenburg
On 12/01/2017 04:24 PM, Matt McKinnon wrote:
> Thanks for this.  Here's what I get:

Ok, and which one is displayed most of the time?

> [...]
> 
> [] io_schedule+0x16/0x40
> [] get_request+0x23e/0x720
> [] blk_queue_bio+0xc1/0x3a0
> [] generic_make_request+0xf8/0x2a0
> [] submit_bio+0x75/0x150
> [] btrfs_map_bio+0xe5/0x2f0 [btrfs]
> [] btree_submit_bio_hook+0x8c/0xe0 [btrfs]
> [] submit_one_bio+0x63/0xa0 [btrfs]
> [] flush_epd_write_bio+0x3b/0x50 [btrfs]
> [] flush_write_bio+0xe/0x10 [btrfs]
> [] btree_write_cache_pages+0x379/0x450 [btrfs]
> [] btree_writepages+0x5d/0x70 [btrfs]
> [] do_writepages+0x1c/0x70
> [] __filemap_fdatawrite_range+0xaa/0xe0
> [] filemap_fdatawrite_range+0x13/0x20
> [] btrfs_write_marked_extents+0xe9/0x110 [btrfs]
> [] btrfs_write_and_wait_transaction.isra.22+0x3d/0x80
> [btrfs]
> [] btrfs_commit_transaction+0x665/0x900 [btrfs]
> 
> [...]
> 
> [] io_schedule+0x16/0x40
> [] wait_on_page_bit+0xe8/0x120
> [] read_extent_buffer_pages+0x1cd/0x2e0 [btrfs]
> [] btree_read_extent_buffer_pages+0x9f/0x100 [btrfs]
> [] read_tree_block+0x32/0x50 [btrfs]
> [] read_block_for_search.isra.32+0x120/0x2e0 [btrfs]
> [] btrfs_next_old_leaf+0x215/0x400 [btrfs]
> [] btrfs_next_leaf+0x10/0x20 [btrfs]
> [] btrfs_lookup_csums_range+0x12e/0x410 [btrfs]
> [] csum_exist_in_range.isra.49+0x2a/0x81 [btrfs]
> [] run_delalloc_nocow+0x9b2/0xa10 [btrfs]
> [] run_delalloc_range+0x68/0x340 [btrfs]
> [] writepage_delalloc.isra.47+0xf0/0x140 [btrfs]
> [] __extent_writepage+0xc7/0x290 [btrfs]
> [] extent_write_cache_pages.constprop.53+0x2b5/0x450
> [btrfs]
> [] extent_writepages+0x4d/0x70 [btrfs]
> [] btrfs_writepages+0x28/0x30 [btrfs]
> [] do_writepages+0x1c/0x70
> [] __filemap_fdatawrite_range+0xaa/0xe0
> [] filemap_fdatawrite_range+0x13/0x20
> [] btrfs_fdatawrite_range+0x20/0x50 [btrfs]
> [] __btrfs_write_out_cache+0x3d9/0x420 [btrfs]
> [] btrfs_write_out_cache+0x86/0x100 [btrfs]
> [] btrfs_write_dirty_block_groups+0x261/0x390 [btrfs]
> [] commit_cowonly_roots+0x1fb/0x290 [btrfs]
> [] btrfs_commit_transaction+0x434/0x900 [btrfs]

1) The one right above, btrfs_write_out_cache, is the write-out of the
free space cache v1. Do you see this going on for multiple seconds, and
does it match the time when it's writing X MB/s to disk?

2) How big is this filesystem? What does your `btrfs fi df /mountpoint` say?

3) What kind of workload are you running? E.g. how can you describe it
within a range from "big files which just sit there" to "small writes
and deletes all over the place all the time"?

4) What kernel version is this? `uname -a` output?


-- 
Hans van Kranenburg
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs-transacti hammering the system

2017-12-01 Thread Hans van Kranenburg
On 12/01/2017 03:25 PM, Matt McKinnon wrote:
> 
> Is there any way to figure out what exactly btrfs-transacti is chugging
> on?  I have a few file systems that seem to get wedged for days on end
> with this process pegged around 100%.  I've stopped all snapshots, made
> sure no quotas were enabled, turned on autodefrag in the mount options,
> tried manual defragging, kernel upgrades, yet still this brings my
> system to a crawl.
> 
> Network I/O to the system seems very tiny.  The only I/O I see to the
> disk is btrfs-transacti writing a couple M/s.
> 
> # time touch foo
> 
> real    2m54.303s
> user    0m0.000s
> sys 0m0.002s
> 
> # uname -r
> 4.12.8-custom
> 
> # btrfs --version
> btrfs-progs v4.13.3
> 
> Yes, I know I'm a bit behind there...

One of the simple things you can do is watch the stack traces of the
kernel thread.

watch 'cat /proc/<pid>/stack'

where <pid> is the pid of the btrfs-transaction process.

In there, you will see a pattern of recurring things, like: it's
searching for free space, it's writing out the free space cache, or other
things. Correlate this with the disk write traffic and see if we get a
step further.

-- 
Hans van Kranenburg
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: About 'key type for persistent [...] items'

2017-11-28 Thread Hans van Kranenburg
Ok, just want to add one more thing :)

On 11/28/2017 08:12 PM, David Sterba wrote:
> On Tue, Nov 28, 2017 at 07:00:28PM +0100, Hans van Kranenburg wrote:
>> On 11/28/2017 06:34 PM, David Sterba wrote:
>>> On Fri, Nov 24, 2017 at 08:16:05PM +0100, Hans van Kranenburg wrote:
>>>> Last week, when implementing the automatic classifier to dynamically
>>>> create tree item data objects by key type in python-btrfs, I ran into
>>>> the following commits in btrfs-progs:
>>>>
>>>>   commit 8609c8bad68528f668d9ce564b868aa4828107a0
>>>>   btrfs-progs: print-tree: factor out temporary_item dump
>>>> and
>>>>   commit a4b65f00d53deb1b495728dd58253af44fcf70df
>>>>   btrfs-progs: print-tree: factor out persistent_item dump
>>>>
>>>> ...which are related to kernel...
>>>>
>>>>   commit 50c2d5abe64c1726b48d292a2ab04f60e8238933
>>>>   btrfs: introduce key type for persistent permanent items
>>>> and
>>>>   commit 0bbbccb17fea86818e1a058faf5903aefd20b31a
>>>>   btrfs: introduce key type for persistent temporary items
>>>>
>>>> Afaics the goal is to overload types because there can be only 256 in
>>>> total. However, I'm missing the design decisions behind the
>>>> implementation of it. It's not in the commit messages, and it hasn't
>>>> been on the mailing list.
>>>
>>> The reason is avoid wasting key types but still allow to store new types
>>> of data to the btrees, if they were not part of the on-disk format.
>>>
>>> I'm not sure if this has been discussed or mentioned under some patches
>>> or maybe unrelated patches. I do remember that I discussed that with
>>> Chris in private on IRC and have the logs, dated 2015-09-02.
>>>
>>> At that time the balance item and dev stats item were introduced, maybe
>>> also the qgroup status item type. This had me alarmed enough to
>>> reconsider how the keys are allocated.
>>>
>>>> Before, there was a 1:1 mapping from key types to data structures. Now,
>>>> with the new PERSISTENT_ITEM_KEY and TEMPORARY_ITEM_KEY, it seems items
>>>> which use this type can be using any data structure they want, so it's
>>>> some kind of YOLO_ITEM_KEY.
>>>
>>> In some sense it is, so it's key+objectid to determine the structure.
>>>
>>>> The print-tree code in progs 8609c8b and a4b65f0 seems incomplete. For
>>>> example, for the PERSISTENT_ITEM_KEY, there's a switch (objectid) with
>>>> case BTRFS_DEV_STATS_OBJECTID.
>>>>
>>>> However, BTRFS_DEV_STATS_OBJECTID is just the value 0. So, that means
>>>> that if I want to have another tree where BTRFS_MOUTON_OBJECTID is also
>>>> 0, and I'm storing a btrfs_kebab_item struct indexed at
>>>> (BTRFS_MOUTON_OBJECTID, PERSISTENT_ITEM_KEY, 31337), then print_tree.c
>>>> will try to parse the data by calling print_dev_stats?
>>>
>>> As answered by Qu, you can't use 0 for BTRFS_MOUTON_OBJECTID in that
>>> case.
>>
>> (I'm just thinking out loud here, if you think I'm wasting your time
>> just say.)
>>
>> Yes, so the objectid numbers have to be "registered" / "reserved" in the
>> documentation, and they have to be unique over all trees.
> 
> Right.
> 
>> Maybe the information I was looking for is... in what cases should or
>> shouldn't this be used? Because that limits the possible usage quite a
>> bit. Or is it only for very very specific things.
> 
> The keys are not free for use, they need to have a defined meaning and
> are sort of part of the on-disk format. There must be some usecase and
> reasoning why it's necessary to be done that way, and not in another.
> Like xattr.
> 
> I don't have an example now, but I could imagine some per-tree
> information that can be tracked and updated at commit time.
> 
>> E.g. if I wanted to (just a random idea) add per device statistics, and
>> use this, I'd need to use the key also with objectid 1, 2, 3, etc... if
>> I have multiple devices. That's already a no go if there's anyone in any
>> other tree that is doing anything with any objectid in the range of
>> valid device numbers.
> 
> In that case there would be a new objectid PER_DEVICE_OBJECTID, with the
> persistent key, and all the device ids can go to the offset field. The
> objectid field should not be dynamic, by design.

Ok, didn't think of that yet, clear.
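
Side note, thinking out loud again: iirc that's exactly how the existing
device statistics already ended up being stored: in the device tree, keyed
as (BTRFS_DEV_STATS_OBJECTID, PERSISTENT_ITEM_KEY, devid), so the devid
lives in the offset field. A little python-btrfs sketch of such a key
(numeric values copied from ctree.h, devid 1 just as an example):

#!/usr/bin/python3

import btrfs

BTRFS_DEV_STATS_OBJECTID = 0     # from ctree.h
BTRFS_PERSISTENT_ITEM_KEY = 249  # from ctree.h

# The key under which the statistics for device 1 live in the device tree.
devid = 1
key = btrfs.ctree.Key(BTRFS_DEV_STATS_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, devid)
print(key)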

So... just a random idea...

How cool would it be to have the b

Re: About 'key type for persistent [...] items'

2017-11-28 Thread Hans van Kranenburg
On 11/28/2017 06:34 PM, David Sterba wrote:
> On Fri, Nov 24, 2017 at 08:16:05PM +0100, Hans van Kranenburg wrote:
>> Last week, when implementing the automatic classifier to dynamically
>> create tree item data objects by key type in python-btrfs, I ran into
>> the following commits in btrfs-progs:
>>
>>   commit 8609c8bad68528f668d9ce564b868aa4828107a0
>>   btrfs-progs: print-tree: factor out temporary_item dump
>> and
>>   commit a4b65f00d53deb1b495728dd58253af44fcf70df
>>   btrfs-progs: print-tree: factor out persistent_item dump
>>
>> ...which are related to kernel...
>>
>>   commit 50c2d5abe64c1726b48d292a2ab04f60e8238933
>>   btrfs: introduce key type for persistent permanent items
>> and
>>   commit 0bbbccb17fea86818e1a058faf5903aefd20b31a
>>   btrfs: introduce key type for persistent temporary items
>>
>> Afaics the goal is to overload types because there can be only 256 in
>> total. However, I'm missing the design decisions behind the
>> implementation of it. It's not in the commit messages, and it hasn't
>> been on the mailing list.
> 
> The reason is avoid wasting key types but still allow to store new types
> of data to the btrees, if they were not part of the on-disk format.
> 
> I'm not sure if this has been discussed or mentioned under some patches
> or maybe unrelated patches. I do remember that I discussed that with
> Chris in private on IRC and have the logs, dated 2015-09-02.
> 
> At that time the balance item and dev stats item were introduced, maybe
> also the qgroup status item type. This had me alarmed enough to
> reconsider how the keys are allocated.
> 
>> Before, there was a 1:1 mapping from key types to data structures. Now,
>> with the new PERSISTENT_ITEM_KEY and TEMPORARY_ITEM_KEY, it seems items
>> which use this type can be using any data structure they want, so it's
>> some kind of YOLO_ITEM_KEY.
> 
> In some sense it is, so it's key+objectid to determine the structure.
> 
>> The print-tree code in progs 8609c8b and a4b65f0 seems incomplete. For
>> example, for the PERSISTENT_ITEM_KEY, there's a switch (objectid) with
>> case BTRFS_DEV_STATS_OBJECTID.
>>
>> However, BTRFS_DEV_STATS_OBJECTID is just the value 0. So, that means
>> that if I want to have another tree where BTRFS_MOUTON_OBJECTID is also
>> 0, and I'm storing a btrfs_kebab_item struct indexed at
>> (BTRFS_MOUTON_OBJECTID, PERSISTENT_ITEM_KEY, 31337), then print_tree.c
>> will try to parse the data by calling print_dev_stats?
> 
> As answered by Qu, you can't use 0 for BTRFS_MOUTON_OBJECTID in that
> case.

(I'm just thinking out loud here, if you think I'm wasting your time
just say.)

Yes, so the objectid numbers have to be "registered" / "reserved" in the
documentation, and they have to be unique over all trees.

Maybe the information I was looking for is... in what cases should or
shouldn't this be used? Because that limits the possible usage quite a
bit. Or is it only for very very specific things.

E.g. if I wanted to (just a random idea) add per device statistics, and
use this, I'd need to use the key also with objectid 1, 2, 3, etc... if
I have multiple devices. That's already a no go if there's anyone in any
other tree that is doing anything with any objectid in the range of
valid device numbers.

>> What's the idea behind that? Instead of having the key type field define
>> the struct and meaning, we now suddenly need the tuple (tree, objectid,
>> type), and we need all three to determine what's inside the item data?
>> So, the code in print_tree.c would also need to know about the tree
>> number and pass that into the different functions.
> 
> No, all key types, even the persistent/temporary are independent of the
> tree type. So it's only type <-> structure mapping, besides
> persistent/temporary types.

Yeah, I wasn't explicit about that, I meant only for the
persistent/temporary case yes.

-- 
Hans van Kranenburg
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs-progs: dump_tree: remove superfluous _TREE

2017-11-24 Thread Hans van Kranenburg
-# btrfs inspect-internal dump-tree -t fs /dev/block/device
ERROR: unrecognized tree id: fs

Without this fix I can't dump-tree fs, but I can dump-tree fs_tree and
also fs_tree_tree, which is a bit silly.
---
 cmds-inspect-dump-tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cmds-inspect-dump-tree.c b/cmds-inspect-dump-tree.c
index 4e72c08a..df44bb63 100644
--- a/cmds-inspect-dump-tree.c
+++ b/cmds-inspect-dump-tree.c
@@ -143,7 +143,7 @@ static u64 treeid_from_string(const char *str, const char 
**end)
{ "CHUNK", BTRFS_CHUNK_TREE_OBJECTID },
{ "DEVICE", BTRFS_DEV_TREE_OBJECTID },
{ "DEV", BTRFS_DEV_TREE_OBJECTID },
-   { "FS_TREE", BTRFS_FS_TREE_OBJECTID },
+   { "FS", BTRFS_FS_TREE_OBJECTID },
{ "CSUM", BTRFS_CSUM_TREE_OBJECTID },
{ "CHECKSUM", BTRFS_CSUM_TREE_OBJECTID },
{ "QUOTA", BTRFS_QUOTA_TREE_OBJECTID },
-- 
2.11.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


About 'key type for persistent [...] items'

2017-11-24 Thread Hans van Kranenburg
Hi,

Last week, when implementing the automatic classifier to dynamically
create tree item data objects by key type in python-btrfs, I ran into
the following commits in btrfs-progs:

  commit 8609c8bad68528f668d9ce564b868aa4828107a0
  btrfs-progs: print-tree: factor out temporary_item dump
and
  commit a4b65f00d53deb1b495728dd58253af44fcf70df
  btrfs-progs: print-tree: factor out persistent_item dump

...which are related to kernel...

  commit 50c2d5abe64c1726b48d292a2ab04f60e8238933
  btrfs: introduce key type for persistent permanent items
and
  commit 0bbbccb17fea86818e1a058faf5903aefd20b31a
  btrfs: introduce key type for persistent temporary items

Afaics the goal is to overload types because there can be only 256 in
total. However, I'm missing the design decisions behind the
implementation of it. It's not in the commit messages, and it hasn't
been on the mailing list.

Before, there was a 1:1 mapping from key types to data structures. Now,
with the new PERSISTENT_ITEM_KEY and TEMPORARY_ITEM_KEY, it seems items
which use this type can be using any data structure they want, so it's
some kind of YOLO_ITEM_KEY.

The print-tree code in progs 8609c8b and a4b65f0 seems incomplete. For
example, for the PERSISTENT_ITEM_KEY, there's a switch (objectid) with
case BTRFS_DEV_STATS_OBJECTID.

However, BTRFS_DEV_STATS_OBJECTID is just the value 0. So, that means
that if I want to have another tree where BTRFS_MOUTON_OBJECTID is also
0, and I'm storing a btrfs_kebab_item struct indexed at
(BTRFS_MOUTON_OBJECTID, PERSISTENT_ITEM_KEY, 31337), then print_tree.c
will try to parse the data by calling print_dev_stats?

What's the idea behind that? Instead of having the key type field define
the struct and meaning, we now suddenly need the tuple (tree, objectid,
type), and we need all three to determine what's inside the item data?
So, the code in print_tree.c would also need to know about the tree
number and pass that into the different functions.

Am I missing something, or is my observation correct?

Thanks,

-- 
Hans van Kranenburg
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


python-btrfs release v9

2017-11-23 Thread Hans van Kranenburg
I just tagged v9 of the python btrfs library.

https://github.com/knorrie/python-btrfs

Q: What is python-btrfs?
A: python-btrfs is a python3 implementation of the data structures for
btrfs metadata, and an implementation of the kernel API (via ioctl
calls) that btrfs provides to userland.

Q: Ok, whatever, dude, but what would I need that for?
A: You probably won't need this.

Q: I'm a btrfs developer, what is this?
A: It provides quick hackable ways to inspect a running filesystem. You
can look up things in the current kernel metadata memory, play around
with ioctls, and quickly cobble up scripts for that.

Here's the summary of changes:

python-btrfs v9, Nov 23 2017
  * IOCTLs: IOC_SET_RECEIVED_SUBVOL, IOC_LOGICAL_INO_V2
  * Add crc32c checksum function for data, thanks Nikolay Borisov
  * Recognize zstd compression type
  * Add a tree object pretty printer to dump data for any tree item
  * Examples added:
- Show default subvolume id
- Lookup a block checksum in the checksum tree
- Determine bad free space fragmentation score for block groups
- Set the received_uuid e.a. of a subvolume
- Dump an entire metadata tree using the pretty printer
  * Fixes:
- crc32c: add implicit seed truncation to 32 bits value
  * Small fixes and cosmetic changes

Note: the IOC_SET_RECEIVED_SUBVOL is added 'because we can'. It's not
added because it's a very good idea to mess around with it, except for
educational purposes in a test environment.

As usual, detailed explanations of all changes are in the git commit
messages.

Feedback from Nikolay Borisov pushed me to have a look at how the
checksum storage is done.

My favorite one for this release is the tree object pretty printer,
which allows us to dump a single tree object displaying all its
attributes, but also allows us to stream a search query into it to
output a full dump of a metadata tree. See the examples/dump_tree.py
about this!
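
A minimal sketch of what that looks like (check the bundled examples for
the exact helper names):

#!/usr/bin/python3

import btrfs
import sys

# Pretty print all chunk items of the filesystem mounted at sys.argv[1].
fs = btrfs.FileSystem(sys.argv[1])
for chunk in fs.chunks():
    btrfs.utils.pretty_print(chunk)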

Tomorrow I'll update pypi and will prepare debian packages for unstable
and stretch-backports.

-- 
Have fun,
Hans van Kranenburg
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 4.14 balance: kernel BUG at /home/kernel/COD/linux/fs/btrfs/ctree.c:1856!

2017-11-18 Thread Hans van Kranenburg
On 11/18/2017 12:48 PM, Hans van Kranenburg wrote:
> 
> So, who wants to help?
> 
> 1. Find a test system that you can crash.
> 2. Create a test filesystem with some data.
> 3. Run with 4.14? (makes the most sense I think)
> 4. Continuously feed the data to balance and send everything to /dev/null
> 5. Collect stack traces and borken filesystem images.

I moved it to a ramdisk:

-# modprobe brd rd_nr=1 rd_size=4194304
-# mkfs.btrfs -m single -d single /dev/ram0
-# mount -o noatime /dev/ram0 /btrfs
-# cd /btrfs
-# btrfs sub create moo
-# rsync -av /usr/ moo/
-# btrfs sub snap -r moo/ moo-ro

Now remove part of the files, yolo style
-# rm $(find moo -type f | shuf | head -n 5000)
Put them back, so we have some differences for incremental send
-# rsync -av /usr/ moo/
-# btrfs sub snap -r moo/ moo-ro2

Now again:

-# while true; do btrfs balance start --full-balance /btrfs; done
and
-# while true; do btrfs send --no-data /btrfs/moo-ro/ | wc -c;
   btrfs send --no-data -p /btrfs/moo-ro/ /btrfs/moo-ro4/ |wc -c;
   date; done

Now I got rid of the disk traffic, and kernel cpu time goes to >300%.

The error seen before is easily triggered. It happens both on normal and
on incremental send. It happens both when using --no-data and when not
using that option. send aborts with "ERROR: send ioctl failed with -5:
Input/output error", and dmesg shows:

[...]
[17094.578876] BTRFS error (device ram0): did not find backref in
send_root. inode=28151, offset=0, disk_byte=8769369178112 found
extent=8769369178112
[17328.368458] BTRFS error (device ram0): did not find backref in
send_root. inode=23264, offset=0, disk_byte=8902861979648 found
extent=8902861979648
[17352.779099] BTRFS error (device ram0): did not find backref in
send_root. inode=17392, offset=0, disk_byte=8917230010368 found
extent=8917230010368
[18012.009357] BTRFS error (device ram0): did not find backref in
send_root. inode=29245, offset=0, disk_byte=9295300538368 found
extent=9295300538368
[18193.218649] BTRFS error (device ram0): did not find backref in
send_root. inode=16437, offset=0, disk_byte=9400309366784 found
extent=9400309366784
[18604.697898] BTRFS error (device ram0): did not find backref in
send_root. inode=24508, offset=0, disk_byte=9635165790208 found
extent=9635165790208
[18621.053722] BTRFS error (device ram0): did not find backref in
send_root. inode=10039, offset=0, disk_byte=9644150468608 found
extent=9644150468608
[19039.051399] BTRFS error (device ram0): did not find backref in
send_root. inode=29411, offset=0, disk_byte=9883807432704 found
extent=9883807432704
[19373.297701] BTRFS error (device ram0): did not find backref in
send_root. inode=7946, offset=0, disk_byte=10074868215808 found
extent=10074868215808
[19573.432255] BTRFS error (device ram0): did not find backref in
send_root. inode=26743, offset=0, disk_byte=10190374899712 found
extent=10190374899712
[19682.305240] BTRFS error (device ram0): did not find backref in
send_root. inode=24823, offset=0, disk_byte=10252750929920 found
extent=10252750929920
[20012.420346] BTRFS error (device ram0): did not find backref in
send_root. inode=25763, offset=0, disk_byte=10441684029440 found
extent=10441684029440
[20430.100411] BTRFS error (device ram0): did not find backref in
send_root. inode=14572, offset=0, disk_byte=10680836050944 found
extent=10680836050944
[21328.821766] BTRFS error (device ram0): did not find backref in
send_root. inode=11756, offset=0, disk_byte=11195322470400 found
extent=11195322470400
[...]

After a few hours I have a long list of those, but that's all so far. No
other big explosions.

So, does anyone have an idea of what to try next? Maybe it needs more than
1 block group each for data and metadata? Maybe speeding it up (with a
small amount of data and the ramdisk) does not increase the chance of
triggering something, but just decreases it?

Welcome to the world of trying to reproduce errors... :D

-- 
Hans van Kranenburg
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 4.14 balance: kernel BUG at /home/kernel/COD/linux/fs/btrfs/ctree.c:1856!

2017-11-18 Thread Hans van Kranenburg
On 11/18/2017 12:48 PM, Hans van Kranenburg wrote:
> 
> So, who wants to help?
> 
> 1. Find a test system that you can crash.
> 2. Create a test filesystem with some data.
> 3. Run with 4.14? (makes the most sense I think)
> 4. Continuously feed the data to balance and send everything to /dev/null
> 5. Collect stack traces and borken filesystem images.

Ok, what I just did:

-# mkfs.btrfs --nodiscard /dev/xvdb
-# mount -o noatime /dev/xvdb /btrfs
-# cd btrfs/
-# btrfs sub create usr
-# rsync -av /usr/ usr/
-# btrfs sub snap -r usr/ usro

And then, both at the same time:

-# while true; do btrfs send usro/ > /dev/null; date; echo 3 >
/proc/sys/vm/drop_caches; done

and

-# while true; do btrfs balance start --full-balance /btrfs; done

After a few minutes we have the first error:


[ 1605.484627] BTRFS error (device xvdb): did not find backref in
send_root. inode=4260, offset=0, disk_byte=130793844736 found
extent=130793844736

Sat Nov 18 13:25:28 CET 2017
At subvol usro/
ERROR: send ioctl failed with -5: Input/output error

No big stack trace this time.

-- 
Hans van Kranenburg
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 4.14 balance: kernel BUG at /home/kernel/COD/linux/fs/btrfs/ctree.c:1856!

2017-11-17 Thread Hans van Kranenburg
On 11/18/2017 01:49 AM, Tomasz Chmielewski wrote:
> I'm getting the following BUG when running balance on one of my systems:
> 
> 
> [ 3458.698704] BTRFS info (device sdb3): relocating block group
> 306045779968 flags data|raid1
> [ 3466.892933] BTRFS info (device sdb3): found 2405 extents
> [ 3495.408630] BTRFS info (device sdb3): found 2405 extents
> [ 3498.161144] [ cut here ]
> [ 3498.161150] kernel BUG at /home/kernel/COD/linux/fs/btrfs/ctree.c:1856!
> [ 3498.161264] invalid opcode:  [#1] SMP
> [ 3498.161363] Modules linked in: nf_log_ipv6 nf_log_ipv4 nf_log_common
> xt_LOG xt_multiport xt_conntrack xt_nat binfmt_misc veth ip6table_filter
> xt_CHECKSUM iptable_mangle xt_tcpudp ip6t_MASQUERADE
> nf_nat_masquerade_ipv6 ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6
> nf_nat_ipv6 ip6_tables ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_comment
> iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat
> nf_conntrack iptable_filter ip_tables x_tables bridge stp llc intel_rapl
> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass
> crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel
> aes_x86_64 crypto_simd glue_helper cryptd intel_cstate hci_uart
> intel_rapl_perf btbcm input_leds serdev serio_raw btqca btintel
> bluetooth intel_pch_thermal intel_lpss_acpi intel_lpss mac_hid acpi_pad
> [ 3498.162060]  ecdh_generic acpi_als kfifo_buf industrialio autofs4
> btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy
> async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath
> linear raid1 e1000e psmouse ptp ahci pps_core libahci wmi
> pinctrl_sunrisepoint i2c_hid video pinctrl_intel hid
> [ 3498.162386] CPU: 7 PID: 29041 Comm: btrfs Not tainted
> 4.14.0-041400-generic #201711122031
> [ 3498.162545] Hardware name: FUJITSU  /D3401-H2, BIOS V5.0.0.12 R1.5.0
> for D3401-H2x 02/27/2017
> [ 3498.162723] task: 8d7858e82f00 task.stack: b4ee47d5c000
> [ 3498.162890] RIP: 0010:read_node_slot+0xd7/0xe0 [btrfs]
> [ 3498.163027] RSP: 0018:b4ee47d5fb88 EFLAGS: 00010246
> [ 3498.163156] RAX: 8d78c8bb7000 RBX: 8d8124abd380 RCX:
> 0001
> [ 3498.163290] RDX: 0048 RSI: 8d7ae1fef6f8 RDI:
> 8d8124aa
> [ 3498.163422] RBP: b4ee47d5fba8 R08: 0001 R09:
> 8d8124abd384
> [ 3498.163555] R10: 0001 R11: 00114000 R12:
> 0002
> [ 3498.163689] R13: b4ee47d5fc66 R14: b4ee47d5fc50 R15:
> 
> [ 3498.163825] FS:  7fa4c9a998c0() GS:8d816e5c()
> knlGS:
> [ 3498.163990] CS:  0010 DS:  ES:  CR0: 80050033
> [ 3498.164120] CR2: 56410155a028 CR3: 0009c194c002 CR4:
> 003606e0
> [ 3498.164255] DR0:  DR1:  DR2:
> 
> [ 3498.164390] DR3:  DR6: fffe0ff0 DR7:
> 0400
> [ 3498.164523] Call Trace:
> [ 3498.164694]  tree_advance+0x16e/0x1d0 [btrfs]
> [ 3498.164874]  btrfs_compare_trees+0x2da/0x6a0 [btrfs]
> [ 3498.165078]  ? process_extent+0x1580/0x1580 [btrfs]
> [ 3498.165264]  btrfs_ioctl_send+0xe94/0x1120 [btrfs]

It's using send + balance at the same time. There's something that makes
btrfs explode when you do that.

It's not new in 4.14; I have seen it on 4.7 and 4.9 as well, with various
different explosions in the kernel log. Since that happened, I made sure I
never did those two things at the same time.

> [ 3498.165450]  btrfs_ioctl+0x93c/0x1f00 [btrfs]
> [ 3498.165587]  ? enqueue_task_fair+0xa8/0x6c0
> [ 3498.165724]  do_vfs_ioctl+0xa5/0x600
> [ 3498.165854]  ? do_vfs_ioctl+0xa5/0x600
> [ 3498.165979]  ? _do_fork+0x144/0x3a0
> [ 3498.166103]  SyS_ioctl+0x79/0x90
> [ 3498.166234]  entry_SYSCALL_64_fastpath+0x1e/0xa9
> [ 3498.166368] RIP: 0033:0x7fa4c8b17f07
> [ 3498.166488] RSP: 002b:7ffd33644e38 EFLAGS: 0202 ORIG_RAX:
> 0010
> [ 3498.166653] RAX: ffda RBX: 7fa4c8a1a700 RCX:
> 7fa4c8b17f07
> [ 3498.166787] RDX: 7ffd33644f30 RSI: 40489426 RDI:
> 0004
> [ 3498.166921] RBP: 7ffd33644dc0 R08:  R09:
> 7fa4c8a1a700
> [ 3498.167055] R10: 7fa4c8a1a9d0 R11: 0202 R12:
> 
> [ 3498.167190] R13: 7ffd33644dbf R14: 7fa4c8a1a9c0 R15:
> 0129f020
> [ 3498.167326] Code: 48 c7 c3 fb ff ff ff e8 f8 5c 05 00 48 89 d8 5b 41
> 5c 41 5d 41 5e 5d c3 48 c7 c3 fe ff ff ff 48 89 d8 5b 41 5c 41 5d 41 5e
> 5d c3 <0f> 0b 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 57 41
> [ 3498.167690] RIP: read_node_slot+0xd7/0xe0 [btrfs] RSP: b4ee47d5fb88
> [ 3498.167892] ---[ end trace 6a751a3020dd3086 ]---
> [ 3499.572729] BTRFS info

Re: Theoretical Question about commit=n

2017-11-12 Thread Hans van Kranenburg
On 11/13/2017 01:41 AM, Qu Wenruo wrote:
> 
> On 2017年11月13日 06:01, Hans van Kranenburg wrote:
>> On 11/12/2017 09:58 PM, Robert White wrote:
>>> Is the commit interval monotonic, or is it seconds after sync?
>>>
>>> What I mean is that if I manually call sync(2) does the commit timer
>>> reset? I'm thinking it does not, but I can imagine a workload where it
>>> ideally would.
>>
>> The magic happens inside the transaction kernel thread:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/disk-io.c?h=v4.14#n1925
>>
>> You can see the delay being computed:
>> delay = HZ * fs_info->commit_interval;
>>
>> Almost at the end of the function, you see:
>> schedule_timeout(delay)
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/time/timer.c?h=v4.14#n1676
>>
>> This schedule_timeout function sets a timer and then the thread goes to
>> sleep. If nothing happens, the kernel will wake up the thread after the
>> timer expires (can be later, but not earlier) and then it will redo the
>> loop.
>>
>> If something else wakes up the transaction thread, the timer is
>> discarded if it's not expired yet.
> 
> So far so good.
> 
>>
>> So it works like you would want.
> 
> Not exactly.

Ah, interesting.

> Sync or commit_transaction won't wake up transaction_kthread.
> 
> transaction_kthread will mostly be woken by trans error, remount or
> under certain case of btrfs_end_transaction.
> 
> So manually sync will not (at least not always) interrupt commit interval.

The fun thing is, when I just do sync, I see that the time it takes for
a next generation bump to happen is reset (while doing something simple
like touch x in a loop in another terminal).
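
To spell that out, the experiment was roughly (using the same little test
filesystem as in my previous mail):

-# while true; do touch x; sleep 1; done

...in one terminal, with this in another:

-# watch -n 1 "btrfs inspect dump-super /dev/dorothy/mekker | grep ^generation"

...and then typing sync and watching how long it takes before the
generation goes up again.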

> And even more, transaction_kthread will only commit transaction, which
> means it will only ensure metadata consistent.
> 
> It won't ensure buffered write to reach disk if its extent is not
> allocated yet (delalloc).

Hm, I have seen things like that in BTRFS_IOC_SYNC...

Actually, I first responded to the timer reset question, because that
one was easy to answer. I don't know if I want to descend the path
further into (f)sync. I heard it can get really messy down there. :]

> 
> Thanks,
> Qu
>>
>> You can test this yourself by looking at the "generation" number of your
>> filesystem. It's in the output of btrfs inspect dump-super:
>>
>> This is the little test filesystem I just used:
>>
>> -# btrfs inspect dump-super /dev/dorothy/mekker | grep ^generation
>> generation   35
>>
>> If you print the number in a loop, like every second, you can see it
>> going up after a transaction happened. Now play around with other things
>> and see when it changes.
>>
>>> (Again, this is purely theoretical, I have no such workload as I am
>>> about to describe.)
>>>
>>> So suppose I have some sort of system, like a database, that I know will
>>> do scattered writes and extends through some files and then call some
>>> variant of sync(2). And I know that those sync() calls will be every
>>> forty-to-sixty seconds because of reasons. It would be "neat" to be able
>>> to set the commit=n to some high value, like 90, and then "normally" the
>>> sync() behaviours would follow the application instead of the larger
>>> commit interval.
>>>
>>> The value would be that the file system would tend _not_ to go into sync
>>> while the application was still skittering about in the various files.
>>>
>>> Of course any other applications could call sync from their own contexts
>>> for their own reasons. And there's an implicit fsync() on just about any
>>> close() (at least if everything is doing its business "correctly")
>>>
>>> It may be a strange idea but I can think of some near realtime
>>> applications might be able to leverage a modicum of control over the
>>> sync event. There is no API, and not strong reason to desire one, for
>>> controlling the commit via (low privelege) applications.
>>>
>>> But if the plumbing exists, then having a mode where sync() or fsync()
>>> (which I think causes a general sync because of the journal) resets the
>>> commit timer could be really interesting.
>>>
>>> With any kind of delayed block choice/mapping it could actually reduce
>>> the entropy of the individual files for repeated random small writes.
>>> The application would have to be reasonably aware, of course.
>>>
>>> Since something is causing a sync() the commit=N guarantee is still
>>> being met for the whole system for any N, but applications could tend to
>>> avoid mid-write commits by planing their sync()s.
>>>
>>> Just a thought.
>>
>>


-- 
Hans van Kranenburg
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Theoretical Question about commit=n

2017-11-12 Thread Hans van Kranenburg
On 11/12/2017 09:58 PM, Robert White wrote:
> Is the commit interval monotonic, or is it seconds after sync?
> 
> What I mean is that if I manually call sync(2) does the commit timer
> reset? I'm thinking it does not, but I can imagine a workload where it
> ideally would.

The magic happens inside the transaction kernel thread:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/disk-io.c?h=v4.14#n1925

You can see the delay being computed:
delay = HZ * fs_info->commit_interval;

Almost at the end of the function, you see:
schedule_timeout(delay)

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/time/timer.c?h=v4.14#n1676

This schedule_timeout function sets a timer and then the thread goes to
sleep. If nothing happens, the kernel will wake up the thread after the
timer expires (can be later, but not earlier) and then it will redo the
loop.

If something else wakes up the transaction thread, the timer is
discarded if it's not expired yet.

So it works like you would want.

You can test this yourself by looking at the "generation" number of your
filesystem. It's in the output of btrfs inspect dump-super:

This is the little test filesystem I just used:

-# btrfs inspect dump-super /dev/dorothy/mekker | grep ^generation
generation  35

If you print the number in a loop, like every second, you can see it
going up after a transaction happened. Now play around with other things
and see when it changes.
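
For example, quick and dirty:

-# while true; do btrfs inspect dump-super /dev/dorothy/mekker | grep ^generation; sleep 1; done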

> (Again, this is purely theoretical, I have no such workload as I am
> about to describe.)
> 
> So suppose I have some sort of system, like a database, that I know will
> do scattered writes and extends through some files and then call some
> variant of sync(2). And I know that those sync() calls will be every
> forty-to-sixty seconds because of reasons. It would be "neat" to be able
> to set the commit=n to some high value, like 90, and then "normally" the
> sync() behaviours would follow the application instead of the larger
> commit interval.
> 
> The value would be that the file system would tend _not_ to go into sync
> while the application was still skittering about in the various files.
> 
> Of course any other applications could call sync from their own contexts
> for their own reasons. And there's an implicit fsync() on just about any
> close() (at least if everything is doing its business "correctly")
> 
> It may be a strange idea but I can think of some near realtime
> applications might be able to leverage a modicum of control over the
> sync event. There is no API, and not strong reason to desire one, for
> controlling the commit via (low privelege) applications.
> 
> But if the plumbing exists, then having a mode where sync() or fsync()
> (which I think causes a general sync because of the journal) resets the
> commit timer could be really interesting.
> 
> With any kind of delayed block choice/mapping it could actually reduce
> the entropy of the individual files for repeated random small writes.
> The application would have to be reasonably aware, of course.
> 
> Since something is causing a sync() the commit=N guarantee is still
> being met for the whole system for any N, but applications could tend to
> avoid mid-write commits by planing their sync()s.
> 
> Just a thought.


-- 
Hans van Kranenburg
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4] btrfs: Remove received_uuid during received snapshot ro->rw switch

2017-11-12 Thread Hans van Kranenburg
On 10/05/2017 10:22 AM, Nikolay Borisov wrote:
> Currently when a read-only snapshot is received and subsequently its ro 
> property
> is set to false i.e. switched to rw-mode the received_uuid of that subvol 
> remains
> intact. However, once the received volume is switched to RW mode we cannot
> guaranteee that it contains the same data, so it makes sense to remove the
> received uuid. The presence of the received_uuid can also cause problems when
> the volume is being send.
> 
> Signed-off-by: Nikolay Borisov <nbori...@suse.com>
> Suggested-by: David Sterba <dste...@suse.cz>
> ---
> 
> v4: 
>  * Put the uuid tree removal code after the lightweight 'send in progress' 
>  check and don't move btrfs_start_transaction as suggested by David
>  
> v3:
>  * Rework the patch considering latest feedback from David Sterba i.e. 
>   explicitly use btrfs_end_transaction 
> 
>  fs/btrfs/ioctl.c | 20 
>  1 file changed, 20 insertions(+)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index ee4ee7cbba72..9328c091854b 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -1775,6 +1775,7 @@ static noinline int btrfs_ioctl_subvol_setflags(struct 
> file *file,
>   struct btrfs_trans_handle *trans;
>   u64 root_flags;
>   u64 flags;
> + bool clear_received_uuid = false;
>   int ret = 0;
>  
>   if (!inode_owner_or_capable(inode))
> @@ -1824,6 +1825,7 @@ static noinline int btrfs_ioctl_subvol_setflags(struct 
> file *file,
>   btrfs_set_root_flags(&root->root_item,
>        root_flags & ~BTRFS_ROOT_SUBVOL_RDONLY);
>   spin_unlock(&root->root_item_lock);
> + clear_received_uuid = true;
>   } else {
>   spin_unlock(&root->root_item_lock);
>   btrfs_warn(fs_info,
> @@ -1840,6 +1842,24 @@ static noinline int btrfs_ioctl_subvol_setflags(struct 
> file *file,
>   goto out_reset;
>   }
>  
> + if (clear_received_uuid) {
> + if (!btrfs_is_empty_uuid(root->root_item.received_uuid)) {
> + ret = btrfs_uuid_tree_rem(trans, fs_info,
> + root->root_item.received_uuid,
> + BTRFS_UUID_KEY_RECEIVED_SUBVOL,
> + root->root_key.objectid);
> +
> + if (ret && ret != -ENOENT) {
> + btrfs_abort_transaction(trans, ret);
> + btrfs_end_transaction(trans);
> + goto out_reset;
> + }
> +
> + memset(root->root_item.received_uuid, 0,
> + BTRFS_UUID_SIZE);

Shouldn't we also wipe the other related fields here, like stime, rtime,
stransid, rtransid?

> + }
> + }
> +
>   ret = btrfs_update_root(trans, fs_info->tree_root,
>   &root->root_key, &root->root_item);
>   if (ret < 0) {
> 


-- 
Hans van Kranenburg
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

