Re: BTRFS Mount Delay Time Graph
roups in extent-tree.c: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/extent-tree.c#n9982

What the code is doing here is starting at the beginning of the extent tree, searching forward until it sees the first BLOCK_GROUP_ITEM (which is not that far away), and then, based on the information in it, computing where the next one will be (just after the end of its vaddr+length). It then jumps over all normal extent items and searches again near where the next block group item has to be. So, yes, that means that they depend on each other.

Two possible ways to improve this:

1. Walk the chunk tree (which has all related items packed together) instead, to find out at which locations in the extent tree the block group items are located, and then start getting those items in parallel. If you have storage with a lot of rotating rust that can deliver many more random reads if you ask for more of them at the same time, then this alone can cause a massive speedup.

2. Move the block group items somewhere else, where they can nicely be grouped together, so that the number of metadata pages that has to be looked up is minimal. Quoting from the link below, this is "slightly tricky [...] but there are no fundamental obstacles":

https://www.spinics.net/lists/linux-btrfs/msg71766.html

I think the main obstacle here is finding a developer with enough experience and time to do it. :)

For fun, you can also just read the block group metadata after dropping caches each time, which should give similar relative timing results as mounting the filesystem again. (Well, if disk IO wait is the main slowdown, of course.)

Attached are two example programs, using python-btrfs:

* bg_after_another.py does the same thing as the kernel code I just linked.
* bg_via_chunks.py looks them up based on chunk tree info.

The time that it takes after option 2 above would be implemented should be very similar to just reading the chunk tree
(remove the block group lookup from bg_via_chunks and run that).

Now what's still missing is changing the bg_via_chunks one to start kicking off the block group searches in parallel, and then you can predict how long it would take if 1 would be implemented. \:D/

-- 
Hans van Kranenburg

bg_after_another.py:

#!/usr/bin/python3
import btrfs
import sys

if len(sys.argv) < 2:
    print("Usage: {} <mountpoint>".format(sys.argv[0]))
    sys.exit(1)

tree = btrfs.ctree.EXTENT_TREE_OBJECTID
min_key = btrfs.ctree.Key(0, 0, 0)
bufsize = btrfs.utils.SZ_4K


def first_block_group_after(fs, key):
    for header, data in btrfs.ioctl.search_v2(fs.fd, tree, key, buf_size=bufsize):
        if header.type == btrfs.ctree.BLOCK_GROUP_ITEM_KEY:
            return header


fs = btrfs.FileSystem(sys.argv[1])
while True:
    header = first_block_group_after(fs, min_key)
    if header is None:
        break
    min_key = btrfs.ctree.Key(header.objectid + header.offset,
                              btrfs.ctree.BLOCK_GROUP_ITEM_KEY, 0)
    print('.', end='', flush=True)
print()

bg_via_chunks.py:

#!/usr/bin/python3
import btrfs
import sys

if len(sys.argv) < 2:
    print("Usage: {} <mountpoint>".format(sys.argv[0]))
    sys.exit(1)

fs = btrfs.FileSystem(sys.argv[1])
for chunk in fs.chunks():
    fs.block_group(chunk.vaddr, chunk.length)
    print('.', end='', flush=True)
print()
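To sketch what that parallel version could look like: first collect all (vaddr, length) pairs from the chunk tree walk, then hand the lookups to a thread pool so the block device sees many outstanding random reads at once. This is a hypothetical helper, not part of python-btrfs; `lookup` stands in for `fs.block_group` so the pattern can be shown without a mounted filesystem:

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_block_groups(lookup, chunks, workers=16):
    """Issue lookup(vaddr, length) for all chunks concurrently.

    lookup would be fs.block_group when used with python-btrfs;
    chunks is an iterable of (vaddr, length) pairs gathered from
    the chunk tree beforehand.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(lookup, vaddr, length)
                   for vaddr, length in chunks]
        # Collect results in submission order; exceptions from the
        # individual lookups propagate out of result().
        return [fut.result() for fut in futures]
```

With rotating disks, the speedup would come entirely from the kernel being able to reorder and merge the queued reads.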
Re: [PATCH 1/2] btrfs: catch cow on deleting snapshots
On 12/1/18 1:19 AM, Qu Wenruo wrote:
>
> On 2018/12/1 上午12:52, Josef Bacik wrote:
>> From: Josef Bacik
>>
>> When debugging some weird extent reference bug I suspected that we were
>> changing a snapshot while we were deleting it, which could explain my
>> bug.
>
> May I ask under which case we're going to modify an unlinked snapshot?
>
> Maybe metadata relocation?

I have some old logging from a filesystem that got corrupted when doing a subvolume delete while doing metadata balance:

https://syrinx.knorrie.org/~knorrie/btrfs/domokun-balance-skinny-lockup.txt

This is one of the very few moments when I really lost a btrfs filesystem and had to mkfs again to move on. I'm wondering if this one looks related. I never found an explanation for it.

Hans

> Thanks,
> Qu
>
>> This was indeed what was happening, and this patch helped me
>> verify my theory. It is never correct to modify the snapshot once it's
>> being deleted, so mark the root when we are deleting it and make sure we
>> complain about it when it happens.
>>
>> Signed-off-by: Josef Bacik
>> ---
>>  fs/btrfs/ctree.c       | 3 +++
>>  fs/btrfs/ctree.h       | 1 +
>>  fs/btrfs/extent-tree.c | 9 +++++++++
>>  3 files changed, 13 insertions(+)
>>
>> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
>> index 5912a97b07a6..5f82f86085e8 100644
>> --- a/fs/btrfs/ctree.c
>> +++ b/fs/btrfs/ctree.c
>> @@ -1440,6 +1440,9 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle *trans,
>>  	u64 search_start;
>>  	int ret;
>>
>> +	if (test_bit(BTRFS_ROOT_DELETING, &root->state))
>> +		WARN(1, KERN_CRIT "cow'ing blocks on a fs root thats being dropped\n");
>> +
>>  	if (trans->transaction != fs_info->running_transaction)
>>  		WARN(1, KERN_CRIT "trans %llu running %llu\n",
>>  		     trans->transid,
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index facde70c15ed..5a3a94ccb65c 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -1199,6 +1199,7 @@ enum {
>>  	BTRFS_ROOT_FORCE_COW,
>>  	BTRFS_ROOT_MULTI_LOG_TASKS,
>>  	BTRFS_ROOT_DIRTY,
>> +	BTRFS_ROOT_DELETING,
>>  };
>>
>>  /*
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 581c2a0b2945..dcb699dd57f3 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -9333,6 +9333,15 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
>>  	if (block_rsv)
>>  		trans->block_rsv = block_rsv;
>>
>> +	/*
>> +	 * This will help us catch people modifying the fs tree while we're
>> +	 * dropping it. It is unsafe to mess with the fs tree while it's being
>> +	 * dropped as we unlock the root node and parent nodes as we walk down
>> +	 * the tree, assuming nothing will change. If something does change
>> +	 * then we'll have stale information and drop references to blocks we've
>> +	 * already dropped.
>> +	 */
>> +	set_bit(BTRFS_ROOT_DELETING, &root->state);
>>  	if (btrfs_disk_key_objectid(&root_item->drop_progress) == 0) {
>>  		level = btrfs_header_level(root->node);
>>  		path->nodes[level] = btrfs_lock_root_node(root);
>>
>
-- 
Hans van Kranenburg
Re: [PATCH 0/6] Chunk allocator DUP fix and cleanups
On 11/13/18 4:03 PM, David Sterba wrote: > On Thu, Oct 11, 2018 at 07:40:22PM +0000, Hans van Kranenburg wrote: >> On 10/11/2018 05:13 PM, David Sterba wrote: >>> On Thu, Oct 04, 2018 at 11:24:37PM +0200, Hans van Kranenburg wrote: >>>> This patch set contains an additional fix for a newly exposed bug after >>>> the previous attempt to fix a chunk allocator bug for new DUP chunks: >>>> >>>> https://lore.kernel.org/linux-btrfs/782f6000-30c0-0085-abd2-74ec5827c...@mendix.com/T/#m609ccb5d32998e8ba5cfa9901c1ab56a38a6f374 >>>> >>>> The DUP fix is "fix more DUP stripe size handling". I did that one >>>> before starting to change more things so it can be applied to earlier >>>> LTS kernels. >>>> >>>> Besides that patch, which is fixing the bug in a way that is least >>>> intrusive, I added a bunch of other patches to help getting the chunk >>>> allocator code in a state that is a bit less error-prone and >>>> bug-attracting. >>>> >>>> When running this and trying the reproduction scenario, I can now see >>>> that the created DUP device extent is 827326464 bytes long, which is >>>> good. >>>> >>>> I wrote and tested this on top of linus 4.19-rc5. I still need to create >>>> a list of related use cases and test more things to at least walk >>>> through a bunch of obvious use cases to see if there's nothing exploding >>>> too quickly with these changes. However, I'm quite confident about it, >>>> so I'm sharing all of it already. >>>> >>>> Any feedback and review is appreciated. Be gentle and keep in mind that >>>> I'm still very much in a learning stage regarding kernel development. >>> >>> The patches look good, thanks. Problem is explained, preparatory work is >>> separated from the fix itself. >> >> \o/ >> >>>> The stable patches handling workflow is not 100% clear to me yet. I >>>> guess I have to add a Fixes: in the DUP patch which points to the >>>> previous commit 92e222df7b. >>> >>> Almost nobody does it right, no worries. 
>>> If you can identify a single
>>> patch that introduces a bug then it's for Fixes:, otherwise a CC: stable
>>> with version where it makes sense & applies is enough. I do that check
>>> myself regardless of what's in the patch.
>>
>> It's 92e222df7b and the thing I'm not sure about is if it also will
>> catch the same patch which was already backported to LTS kernels since
>> 92e222df7b also has Fixes in it... So by now the new bug is in 4.19,
>> 4.14, 4.9, 4.4, 3.16...
>>
>>> I ran the patches in a VM and hit a division-by-zero in test
>>> fstests/btrfs/011, stacktrace below. First guess is that it's caused by
>>> patch 3/6.
>>
>> Ah interesting, dev replace.
>>
>> I'll play around with replace and find out how to run the tests properly
>> and then reproduce this.
>>
>> The code introduced in patch 3 is removed again in patch 6, so I don't
>> suspect that one. :)
>>
>> But, I'll find out.
>>
>> Thanks for testing.
>
> I've merged patches 1-5 to misc-next as they seem to be safe and pass
> fstests, please let me know when you have updates to the last one.
> Thanks.

I'll definitely follow up on that. It has not happened yet because of something about todo lists, ordering work, and not doing too many things at the same time.

Thanks for already confirming it was not patch 3 but 6 that has the problem, I already suspected that.

For patch 3, the actual fix, how do we get that on top of old stable kernels where the previous fix (92e222df7b) is in? Because of the Fixes: line, that one was picked again in 4.14, 4.9, 4.4 and even 3.16. How does this work? Does it need all the other commit ids from those branches in Fixes lines? Or is there magic somewhere that does this?

From your "development cycle open" mails, I understand that for testing myself, I would test based on misc-next, for-next or for-x.y, in that order, depending on where the first 5 are at that moment?

Hans
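For reference, the trailer form this discussion boils down to looks like the fragment below. As far as I understand the stable workflow, you always reference the mainline commit id (the stable maintainers map that to their backported copies themselves), so no per-branch commit ids should be needed; the subject line shown is the mainline one for 92e222df7b:

```
Fixes: 92e222df7b ("btrfs: alloc_chunk: fix DUP stripe size handling")
CC: stable@vger.kernel.org # 4.4+
```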
Re: [PATCH 1/5] btrfs: Ensure at least 1g is free for balance
On 10/26/18 2:16 PM, Nikolay Borisov wrote:
>
> (Adding Chris to CC since he is the original author of the code)
>
> On 26.10.2018 15:09, Hans van Kranenburg wrote:
>> On 10/26/18 1:43 PM, Nikolay Borisov wrote:
>>> The first part of balance operation is to shrink every constituting
>>> device to ensure there is free space for chunk allocation.
>>
>> A very useful thing to be able to do if there's no unallocated raw disk
>> space left, is use balance to rewrite some blockgroup into the free
>> space of other ones.
>>
>> This does not need any raw space for a chunk allocation. Requiring so
>> makes it unneccesarily more complicated for users to fix the situation
>> they got in.
>
> The current logic of the code doesn't preclude this use case since if we
> can't shrink we just proceed to relocation. So even if 1g is replaced
> with 5eb and shrink_device always fails balancing will still proceed.
>
> However, all of this really makes me wonder do we even need this "first
> stage" code in balancing (Chris, any thoughts?).

Having something that tries to free up 1G on each device can be very useful when converting to another profile. I guess that's why it is there.

And shrink device + grow device is a nice way to try packing some data into other existing block groups.

Only... if shrink+grow is done for each device after each other, and there's one with 1G unallocated left, and it can't easily move it into existing free space, then it may just be moving data around to the next disk all the time I guess, ending up with the same situation as before. :D

>>> However, the code
>>> has been buggy ever since its introduction since calculating the space
>>> to shrink the device by was bounded by 1 mb. Most likely the original
>>> intention was to have an upper bound of 1g and not 1m, since the
>>> largest chunk size is 1g. This means the first stage in
>>> __btrfs_balance so far has been a null op since it effectively freed
>>> just a single megabyte.
>>>
>>> Fix this by setting an upper bound of size_to_free of 1g.
>>>
>>> Signed-off-by: Nikolay Borisov
>>> ---
>>>  fs/btrfs/volumes.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index f435d397019e..8b0fd7bf3447 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -3467,7 +3467,7 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
>>>  	list_for_each_entry(device, devices, dev_list) {
>>>  		old_size = btrfs_device_get_total_bytes(device);
>>>  		size_to_free = div_factor(old_size, 1);
>>> -		size_to_free = min_t(u64, size_to_free, SZ_1M);
>>> +		size_to_free = min_t(u64, size_to_free, SZ_1G);
>>>  		if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) ||
>>>  		    btrfs_device_get_total_bytes(device) -
>>>  		    btrfs_device_get_bytes_used(device) > size_to_free ||
>>
-- 
Hans van Kranenburg
Re: [PATCH 1/5] btrfs: Ensure at least 1g is free for balance
On 10/26/18 1:43 PM, Nikolay Borisov wrote:
> The first part of balance operation is to shrink every constituting
> device to ensure there is free space for chunk allocation.

A very useful thing to be able to do if there's no unallocated raw disk space left, is use balance to rewrite some blockgroup into the free space of other ones.

This does not need any raw space for a chunk allocation. Requiring so makes it unnecessarily more complicated for users to fix the situation they got in.

> However, the code has been buggy ever since its introduction since
> calculating the space to shrink the device by was bounded by 1 mb. Most
> likely the original intention was to have an upper bound of 1g and not
> 1m, since the largest chunk size is 1g. This means the first stage in
> __btrfs_balance so far has been a null op since it effectively freed
> just a single megabyte.
>
> Fix this by setting an upper bound of size_to_free of 1g.
>
> Signed-off-by: Nikolay Borisov
> ---
>  fs/btrfs/volumes.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index f435d397019e..8b0fd7bf3447 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -3467,7 +3467,7 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
>  	list_for_each_entry(device, devices, dev_list) {
>  		old_size = btrfs_device_get_total_bytes(device);
>  		size_to_free = div_factor(old_size, 1);
> -		size_to_free = min_t(u64, size_to_free, SZ_1M);
> +		size_to_free = min_t(u64, size_to_free, SZ_1G);
>  		if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) ||
>  		    btrfs_device_get_total_bytes(device) -
>  		    btrfs_device_get_bytes_used(device) > size_to_free ||

-- 
Hans van Kranenburg
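To make the null op concrete, here is the arithmetic in plain Python. The div_factor below mirrors, as far as I can tell, the kernel helper of the same name (factor is in tenths of the total); the device size is a made-up example:

```python
SZ_1M = 1024 * 1024
SZ_1G = 1024 * SZ_1M


def div_factor(num, factor):
    # Mirrors the kernel helper: returns num * factor / 10.
    if factor == 10:
        return num
    return num * factor // 10


old_size = 100 * SZ_1G                   # hypothetical 100GiB device
size_to_free = div_factor(old_size, 1)   # 10% of the device: 10GiB

buggy = min(size_to_free, SZ_1M)  # old upper bound: shrink by at most 1MiB
fixed = min(size_to_free, SZ_1G)  # patched upper bound: at most 1GiB
```

So with the 1M bound, the "shrink first" stage asked for a single megabyte regardless of device size, which is why it effectively did nothing.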
Re: [PATCH 0/6] Chunk allocator DUP fix and cleanups
On 10/11/2018 09:40 PM, Hans van Kranenburg wrote: > On 10/11/2018 05:13 PM, David Sterba wrote: >> On Thu, Oct 04, 2018 at 11:24:37PM +0200, Hans van Kranenburg wrote: >>> This patch set contains an additional fix for a newly exposed bug after >>> the previous attempt to fix a chunk allocator bug for new DUP chunks: >>> >>> https://lore.kernel.org/linux-btrfs/782f6000-30c0-0085-abd2-74ec5827c...@mendix.com/T/#m609ccb5d32998e8ba5cfa9901c1ab56a38a6f374 >>> >>> The DUP fix is "fix more DUP stripe size handling". I did that one >>> before starting to change more things so it can be applied to earlier >>> LTS kernels. >>> >>> Besides that patch, which is fixing the bug in a way that is least >>> intrusive, I added a bunch of other patches to help getting the chunk >>> allocator code in a state that is a bit less error-prone and >>> bug-attracting. >>> >>> When running this and trying the reproduction scenario, I can now see >>> that the created DUP device extent is 827326464 bytes long, which is >>> good. >>> >>> I wrote and tested this on top of linus 4.19-rc5. I still need to create >>> a list of related use cases and test more things to at least walk >>> through a bunch of obvious use cases to see if there's nothing exploding >>> too quickly with these changes. However, I'm quite confident about it, >>> so I'm sharing all of it already. >>> >>> Any feedback and review is appreciated. Be gentle and keep in mind that >>> I'm still very much in a learning stage regarding kernel development. >> >> The patches look good, thanks. Problem is explained, preparatory work is >> separated from the fix itself. > > \o/ > >>> The stable patches handling workflow is not 100% clear to me yet. I >>> guess I have to add a Fixes: in the DUP patch which points to the >>> previous commit 92e222df7b. >> >> Almost nobody does it right, no worries. 
If you can identify a single >> patch that introduces a bug then it's for Fixes:, otherwise a CC: stable >> with version where it makes sense & applies is enough. I do that check >> myself regardless of what's in the patch. > > It's 92e222df7b and the thing I'm not sure about is if it also will > catch the same patch which was already backported to LTS kernels since > 92e222df7b also has Fixes in it... So by now the new bug is in 4.19, > 4.14, 4.9, 4.4, 3.16... > >> I ran the patches in a VM and hit a division-by-zero in test >> fstests/btrfs/011, stacktrace below. First guess is that it's caused by >> patch 3/6. > > Ah interesting, dev replace. > > I'll play around with replace and find out how to run the tests properly > and then reproduce this. > > The code introduced in patch 3 is removed again in patch 6, so I don't > suspect that one. :) Actually, while writing this I realize that this means it should be tested separately (like, older kernel with only 3), because, well, obvious I guess. > But, I'll find out. > > Thanks for testing. 
> > Hans > >> [ 3116.065595] BTRFS: device fsid e3bd8db5-304f-4b1a-8488-7791ea94088f devid >> 1 transid 5 /dev/vdb >> [ 3116.071274] BTRFS: device fsid e3bd8db5-304f-4b1a-8488-7791ea94088f devid >> 2 transid 5 /dev/vdc >> [ 3116.087086] BTRFS info (device vdb): disk space caching is enabled >> [ 3116.088644] BTRFS info (device vdb): has skinny extents >> [ 3116.089796] BTRFS info (device vdb): flagging fs with big metadata feature >> [ 3116.093971] BTRFS info (device vdb): checking UUID tree >> [ 3125.853755] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) >> to /dev/vdd started >> [ 3125.860269] divide error: [#1] PREEMPT SMP >> [ 3125.861264] CPU: 1 PID: 6477 Comm: btrfs Not tainted 4.19.0-rc7-default+ >> #288 >> [ 3125.862841] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS >> 1.0.0-prebuilt.qemu-project.org 04/01/2014 >> [ 3125.865385] RIP: 0010:__btrfs_alloc_chunk+0x368/0xa70 [btrfs] >> [ 3125.870541] RSP: 0018:a4ea0409fa48 EFLAGS: 00010206 >> [ 3125.871862] RAX: 0400 RBX: 94e074374508 RCX: >> 0002 >> [ 3125.873587] RDX: RSI: 94e017818c80 RDI: >> 0200 >> [ 3125.874677] RBP: 8080 R08: R09: >> 0002 >> [ 3125.875816] R10: 0003 R11: 8090 R12: >> >> [ 3125.876742] R13: 0001 R14: 0001 R15: >> 0002 >> [ 3125.877657] FS: 7f6de3
Re: [PATCH 0/6] Chunk allocator DUP fix and cleanups
On 10/11/2018 05:13 PM, David Sterba wrote: > On Thu, Oct 04, 2018 at 11:24:37PM +0200, Hans van Kranenburg wrote: >> This patch set contains an additional fix for a newly exposed bug after >> the previous attempt to fix a chunk allocator bug for new DUP chunks: >> >> https://lore.kernel.org/linux-btrfs/782f6000-30c0-0085-abd2-74ec5827c...@mendix.com/T/#m609ccb5d32998e8ba5cfa9901c1ab56a38a6f374 >> >> The DUP fix is "fix more DUP stripe size handling". I did that one >> before starting to change more things so it can be applied to earlier >> LTS kernels. >> >> Besides that patch, which is fixing the bug in a way that is least >> intrusive, I added a bunch of other patches to help getting the chunk >> allocator code in a state that is a bit less error-prone and >> bug-attracting. >> >> When running this and trying the reproduction scenario, I can now see >> that the created DUP device extent is 827326464 bytes long, which is >> good. >> >> I wrote and tested this on top of linus 4.19-rc5. I still need to create >> a list of related use cases and test more things to at least walk >> through a bunch of obvious use cases to see if there's nothing exploding >> too quickly with these changes. However, I'm quite confident about it, >> so I'm sharing all of it already. >> >> Any feedback and review is appreciated. Be gentle and keep in mind that >> I'm still very much in a learning stage regarding kernel development. > > The patches look good, thanks. Problem is explained, preparatory work is > separated from the fix itself. \o/ >> The stable patches handling workflow is not 100% clear to me yet. I >> guess I have to add a Fixes: in the DUP patch which points to the >> previous commit 92e222df7b. > > Almost nobody does it right, no worries. If you can identify a single > patch that introduces a bug then it's for Fixes:, otherwise a CC: stable > with version where it makes sense & applies is enough. I do that check > myself regardless of what's in the patch. 
It's 92e222df7b and the thing I'm not sure about is if it also will catch the same patch which was already backported to LTS kernels, since 92e222df7b also has Fixes in it... So by now the new bug is in 4.19, 4.14, 4.9, 4.4, 3.16...

> I ran the patches in a VM and hit a division-by-zero in test
> fstests/btrfs/011, stacktrace below. First guess is that it's caused by
> patch 3/6.

Ah interesting, dev replace.

I'll play around with replace and find out how to run the tests properly and then reproduce this.

The code introduced in patch 3 is removed again in patch 6, so I don't suspect that one. :)

But, I'll find out.

Thanks for testing.

Hans

> [ 3116.065595] BTRFS: device fsid e3bd8db5-304f-4b1a-8488-7791ea94088f devid 1 transid 5 /dev/vdb
> [ 3116.071274] BTRFS: device fsid e3bd8db5-304f-4b1a-8488-7791ea94088f devid 2 transid 5 /dev/vdc
> [ 3116.087086] BTRFS info (device vdb): disk space caching is enabled
> [ 3116.088644] BTRFS info (device vdb): has skinny extents
> [ 3116.089796] BTRFS info (device vdb): flagging fs with big metadata feature
> [ 3116.093971] BTRFS info (device vdb): checking UUID tree
> [ 3125.853755] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd started
> [ 3125.860269] divide error: [#1] PREEMPT SMP
> [ 3125.861264] CPU: 1 PID: 6477 Comm: btrfs Not tainted 4.19.0-rc7-default+ #288
> [ 3125.862841] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
> [ 3125.865385] RIP: 0010:__btrfs_alloc_chunk+0x368/0xa70 [btrfs]
> [ 3125.870541] RSP: 0018:a4ea0409fa48 EFLAGS: 00010206
> [ 3125.871862] RAX: 0400 RBX: 94e074374508 RCX: 0002
> [ 3125.873587] RDX: RSI: 94e017818c80 RDI: 0200
> [ 3125.874677] RBP: 8080 R08: R09: 0002
> [ 3125.875816] R10: 0003 R11: 8090 R12:
> [ 3125.876742] R13: 0001 R14: 0001 R15: 0002
> [ 3125.877657] FS: 7f6de34208c0() GS:94e07d60() knlGS:
> [ 3125.878862] CS: 0010 DS: ES: CR0: 80050033
> [ 3125.880080] CR2: 7ffe963d5ce8 CR3: 7659b000 CR4: 06e0
> [ 3125.881485] Call Trace:
> [ 3125.882105] do_chunk_alloc+0x266/0x3e0 [btrfs]
> [ 3125.882841] btrfs_inc_block_group_ro+0x10e/0x160 [btrfs]
> [ 3125.883875] scrub_enumerate_chunks+0x18b/0x5d0 [btrfs]
> [ 3125.884658] ? is_module_address+0x11/0x30
> [ 3125.885271] ? wait_for_completion+0x160/0x190
> [ 3125.885928] btrfs_scrub_dev+0x1b8/0x5a0 [btrfs]
>
Re: Scrub aborts due to corrupt leaf
On 10/10/2018 07:44 PM, Chris Murphy wrote:
> On Wed, Oct 10, 2018 at 10:04 AM, Holger Hoffstätte wrote:
>> On 10/10/18 17:44, Larkin Lowrey wrote:
>> (..)
>>>
>>> About once a week, or so, I'm running into the above situation where the
>>> FS seems to deadlock. All IO to the FS blocks, there is no IO
>>> activity at all. I have to hard reboot the system to recover. There
>>> are no error indications except for the following, which occurs well
>>> before the FS freezes up:
>>>
>>> BTRFS warning (device dm-3): block group 78691883286528 has wrong amount of free space
>>> BTRFS warning (device dm-3): failed to load free space cache for block group 78691883286528, rebuilding it now
>>>
>>> Do I have any options other than nuking the FS and starting over?
>>
>> Unmount cleanly & mount again with -o space_cache=v2.
>
> I'm pretty sure you have to umount, and then clear the space_cache
> with 'btrfs check --clear-space-cache=v1' and then do a one time mount
> with -o space_cache=v2.

The --clear-space-cache=v1 is optional, but recommended if you are someone who does not like to keep accumulated cruft around. The v2 mount (rw mount!!!) does not remove the v1 cache. If you just mount with v2, the v1 data stays there, doing nothing any more.

> But anyway, to me that seems premature because we don't even know
> what's causing the problem.
>
> a. Freezing means there's a kernel bug. Hands down.
> b. Is it freezing on the rebuild? Or something else?
> c. I think the devs would like to see the output from btrfs-progs
> v4.17.1, 'btrfs check --mode=lowmem' and see if it finds anything, in
> particular something not related to free space cache.
>
> Rebuilding either version of space cache requires successfully reading
> (and parsing) the extent tree.

-- 
Hans van Kranenburg
Re: [PATCH v3 4/6] btrfs-progs: lowmem check: Add dev_item check for used bytes and total bytes
On 10/09/2018 03:14 AM, Qu Wenruo wrote: > > > On 2018/10/9 上午6:20, Hans van Kranenburg wrote: >> On 10/08/2018 02:30 PM, Qu Wenruo wrote: >>> Obviously, used bytes can't be larger than total bytes. >>> >>> Signed-off-by: Qu Wenruo >>> --- >>> check/mode-lowmem.c | 5 + >>> 1 file changed, 5 insertions(+) >>> >>> diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c >>> index 07c03cad77af..1173b963b8f3 100644 >>> --- a/check/mode-lowmem.c >>> +++ b/check/mode-lowmem.c >>> @@ -4074,6 +4074,11 @@ static int check_dev_item(struct btrfs_fs_info >>> *fs_info, >>> used = btrfs_device_bytes_used(eb, dev_item); >>> total_bytes = btrfs_device_total_bytes(eb, dev_item); >>> >>> + if (used > total_bytes) { >>> + error("device %llu has incorrect used bytes %llu > total bytes >>> %llu", >>> + dev_id, used, total_bytes); >>> + return ACCOUNTING_MISMATCH; >> >> The message and return code point at an error in accounting logic. > > One of the biggest problem in lowmem is we don't always have the error > code we really want. > > And that's the case for this error message. > It's indeed not an accounting error, as in that case (just like that > test case image) the used bytes is correct accounted. > > The good news is, the return value is never really used to classify the > error. > So as long as the error message makes sense, it's not a big problem. Aha. Clear, thanks for explaining. > > Thanks, > Qu > >> >> However, if you have a fully allocated device and a DUP chunk ending >> beyond device, then having used > total_bytes is expected... >> >> So maybe there's two possibilities... There's an error in the accounting >> logic, or there's an "over-allocation", which is another type of issue >> which produces used > total with correct accounting logic. >> >>> + } >>> key.objectid = dev_id; >>> key.type = BTRFS_DEV_EXTENT_KEY; >>> key.offset = 0; >>> >> >> > -- Hans van Kranenburg
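The two possibilities discussed here can be told apart mechanically by comparing the dev_item values against the device extents themselves. A plain-Python sketch of that distinction (the helper name and return values are made up for illustration, this is not btrfs-progs API):

```python
def classify_dev_item(total_bytes, used, dev_extents):
    """Classify a dev_item with suspicious used/total values.

    dev_extents is a list of (paddr, length) pairs on one device.
    If the extents don't sum to 'used', the accounting itself is
    wrong; if they do, but one extent ends past total_bytes, it's
    an over-allocation (e.g. a DUP chunk whose second stripe ends
    beyond the end of the device).
    """
    if sum(length for _, length in dev_extents) != used:
        return 'accounting_mismatch'
    if any(paddr + length > total_bytes for paddr, length in dev_extents):
        return 'over_allocation'
    return 'ok'
```

With a split like this, the lowmem check could report "over-allocation" for the fully-allocated-device case instead of blaming the accounting.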
Re: merge status of per-chunk degradable check [was Re: Which device is missing ?]
On 10/09/2018 02:08 AM, Nicholas D Steeves wrote:
> On Mon, Oct 08, 2018 at 04:10:55PM +0000, Hugo Mills wrote:
>> On Mon, Oct 08, 2018 at 03:49:53PM +0200, Pierre Couderc wrote:
>>> I am trying to make a "RAID1" with /dev/sda2 and /dev/sdb (or similar).
>>>
>>> But I have strange statuses or errors about "missing devices" and I
>>> do not understand the current situation:
> [...]
>> Note that, since the main FS is missing a device, it will probably
>> need to be mounted in degraded mode (-o degraded), and that on kernels
>> earlier than (IIRC) 4.14, this can only be done *once* without the FS
>> becoming more or less permanently read-only. On recent kernels, it
>> _should_ be OK.
>>
>> *WARNING ENDS*
>
> I think this was the patch that addressed this?:
> https://www.spinics.net/lists/linux-btrfs/msg47283.html
> https://patchwork.kernel.org/patch/7226931/
>
> In my notes it wasn't present in <= 4.14.15, but my notes might be
> wrong. Does this patch resolve the one-shot -o degraded, reboot,
> forever read-only behaviour, or is something else required? When was
> it merged? Has it been or will it be backported to 4.14.x? I'm
> guessing 4.9.x is too far back, but it would be really nice to see it
> there too :-)
>
> Also, will this issue be resolved for linux-4.19? If so I'd like to
> update the Debian btrfs wiki with this good news :-)
[...]
> P.S. Please let me know if you'd prefer for me to shift this
> documentation effort to btrfs.wiki.kernel.org.

Yes, absolutely. This is not specific to how we do things for Debian. Upstream documentation can help all distros.

-- 
Hans van Kranenburg
Re: [PATCH 0/6] Chunk allocator DUP fix and cleanups
On 10/08/2018 03:19 PM, Hans van Kranenburg wrote: > On 10/08/2018 08:43 AM, Qu Wenruo wrote: >> >> >> On 2018/10/5 下午6:58, Hans van Kranenburg wrote: >>> On 10/05/2018 09:51 AM, Qu Wenruo wrote: >>>> >>>> >>>> On 2018/10/5 上午5:24, Hans van Kranenburg wrote: >>>>> This patch set contains an additional fix for a newly exposed bug after >>>>> the previous attempt to fix a chunk allocator bug for new DUP chunks: >>>>> >>>>> https://lore.kernel.org/linux-btrfs/782f6000-30c0-0085-abd2-74ec5827c...@mendix.com/T/#m609ccb5d32998e8ba5cfa9901c1ab56a38a6f374 >>>> >>>> For that bug, did you succeeded in reproducing the bug? >>> >>> Yes, open the above link and scroll to "Steps to reproduce". >> >> That's beyond device boundary one. Also reproduced here. >> And hand-crafted a super small image as test case for btrfs-progs. >> >> But I'm a little curious about the dev extent overlapping case. >> Have you got one? > > Ah ok, I see. No, I didn't do that yet. > > By using unmodified tooling, I think this can be done by a combination > of a few resizings and using very specific balance to cause a fs of > exactly 7880MiB again with a single 1578MiB gap inside... > > I'll try later today to see if I can come up with a recipe for this. 
Ok, this is actually pretty simple to do:

>8

-# mkdir bork
-# cd bork
-# dd if=/dev/zero of=image bs=1 count=0 seek=1024M
0+0 records in
0+0 records out
0 bytes copied, 0.000183343 s, 0.0 kB/s
-# mkfs.btrfs -d dup -m dup image
-# losetup -f image
-# mount -o space_cache=v2 /dev/loop0 mountpoint
-# dd if=/dev/zero of=mountpoint/flapsie bs=1M
dd: error writing 'mountpoint/flapsie': No space left on device
453+0 records in
452+0 records out
474185728 bytes (474 MB, 452 MiB) copied, 4.07663 s, 116 MB/s

>8

-# ./show_usage.py /bork/mountpoint/
Target profile for SYSTEM (chunk tree): DUP
Target profile for METADATA: DUP
Target profile for DATA: DUP
Mixed groups: False

Virtual space usage by block group type:
|
| type       total      used
| --------   ---------  ---------
| Data       452.31MiB  452.22MiB
| System     8.00MiB    16.00KiB
| Metadata   51.19MiB   656.00KiB

Total raw filesystem size: 1.00GiB
Total raw allocated bytes: 1023.00MiB
Allocatable bytes remaining: 1.00MiB
Unallocatable raw bytes: 0.00B
Unallocatable bytes that can be reclaimed by balancing: 0.00B
Estimated virtual space left to use for metadata: 51.05MiB
Estimated virtual space left to use for data: 96.00KiB

Allocated raw disk bytes by chunk type. Parity is a reserved part of the allocated bytes, limiting the amount that can be used for data or metadata:
|
| flags          allocated  used       parity
| ------------   ---------  ---------  ------
| DATA|DUP       904.62MiB  904.44MiB  0.00B
| SYSTEM|DUP     16.00MiB   32.00KiB   0.00B
| METADATA|DUP   102.38MiB  1.28MiB    0.00B

Allocated bytes per device:
|
| devid  total size  allocated   path
| -----  ----------  ----------  ----------
| 1      1.00GiB     1023.00MiB  /dev/loop0

Allocated bytes per device, split up per chunk type. Parity bytes are again part of the total amount of allocated bytes.
|
| Device ID: 1
| | flags          allocated   parity
| | -----------------------------------
| | DATA|DUP       904.62MiB    0.00B
| | SYSTEM|DUP      16.00MiB    0.00B
| | METADATA|DUP   102.38MiB    0.00B

Unallocatable raw bytes per device:
| devid  unallocatable
| ---------------------
|     1          0.00B

>8

Now we're gonna cause some neat 1578MiB to happen that we're going to
free up later to reproduce the failure:

-# dd if=/dev/zero of=image bs=1 count=0 seek=2602M
0+0 records in
0+0 records out
0 bytes copied, 0.000188621 s, 0.0 kB/s
-# losetup -c /dev/loop0
-# btrfs fi resize max mountpoint/
Resize 'mountpoint/' of 'max'
-# dd if=/dev/zero of=mountpoint/1578MiB bs=1M
dd: error writing 'mountpoint/1578MiB': No space left on device
790+0 records in
789+0 records out
827326464 bytes (827 MB, 789 MiB) copied, 12.2452 s, 67.6 MB/s

>8

-# python3
import btrfs
fs = btrfs.FileSystem('/bork/mountpoint/')
for d in fs.dev_extents():
    print("start {} end {} vaddr {}".format(d.paddr, d.paddr + d.length, d.vaddr))

start 1048576 end 11534336 vaddr 547880960
start 11534336 end 22020096 vaddr 547880960
start 22020096 end 30408704 vaddr 22020096
start 30408704 end 38797312 vaddr 22020096
start 38797312 end 92471296 vaddr 30408704
start 92471296 end 146145280 vaddr 30408704
start 146145280 end 213254144 vaddr 84082688
start 213254144 end 280363008 vaddr 84082688
start 280363008 end 397803520 vaddr
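Qu mentioned adding dev extent overlap checking to btrfs_verify_dev_extents() and btrfs-progs; the listing above can be turned into a small user space detector with python-btrfs. This is a hedged sketch: `find_overlaps` is a made-up helper name, it only looks at physical (paddr) ranges on a single device, like the example filesystem above, and the live part at the bottom assumes python-btrfs is installed.

```python
# Sketch: detect overlapping device extents from (paddr, length) pairs,
# as printed by the fs.dev_extents() loop above. The helper is pure
# Python, so it can be tried without a btrfs filesystem at hand.

try:
    import btrfs  # python-btrfs; only needed for the live part below
except ImportError:
    btrfs = None


def find_overlaps(extents):
    """Return pairs of neighbouring (after sorting) extent ranges that
    overlap. Non-empty if and only if any two extents overlap.

    extents: iterable of (paddr, length) tuples for one device.
    """
    ranges = sorted((paddr, paddr + length) for paddr, length in extents)
    overlaps = []
    for (s1, e1), (s2, e2) in zip(ranges, ranges[1:]):
        if s2 < e1:  # next extent starts before the previous one ends
            overlaps.append(((s1, e1), (s2, e2)))
    return overlaps


if __name__ == '__main__' and btrfs is not None:
    # Needs root, since this reads the device tree via the search ioctl.
    fs = btrfs.FileSystem('/bork/mountpoint/')
    extents = [(d.paddr, d.length) for d in fs.dev_extents()]
    for a, b in find_overlaps(extents):
        print("dev extent {} overlaps {}".format(a, b))
```

On a healthy filesystem this prints nothing; on the hand-crafted broken image it should point at the extent pair that collides.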
Re: [PATCH v3 4/6] btrfs-progs: lowmem check: Add dev_item check for used bytes and total bytes
On 10/08/2018 02:30 PM, Qu Wenruo wrote:
> Obviously, used bytes can't be larger than total bytes.
>
> Signed-off-by: Qu Wenruo
> ---
> check/mode-lowmem.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
> index 07c03cad77af..1173b963b8f3 100644
> --- a/check/mode-lowmem.c
> +++ b/check/mode-lowmem.c
> @@ -4074,6 +4074,11 @@ static int check_dev_item(struct btrfs_fs_info *fs_info,
> used = btrfs_device_bytes_used(eb, dev_item);
> total_bytes = btrfs_device_total_bytes(eb, dev_item);
>
> + if (used > total_bytes) {
> + error("device %llu has incorrect used bytes %llu > total bytes %llu",
> + dev_id, used, total_bytes);
> + return ACCOUNTING_MISMATCH;

The message and return code point at an error in the accounting logic.
However, if you have a fully allocated device and a DUP chunk ending
beyond the device boundary, then having used > total_bytes is
expected... So maybe there are two possibilities: there's an error in
the accounting logic, or there's an "over-allocation", which is another
type of issue which produces used > total with correct accounting logic.

> + }
> key.objectid = dev_id;
> key.type = BTRFS_DEV_EXTENT_KEY;
> key.offset = 0;

-- 
Hans van Kranenburg
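The two possibilities above can be told apart by also summing the device's dev extents: if the dev_item used value agrees with the sum of the extent lengths but still exceeds total bytes, it is over-allocation rather than broken accounting. A hedged sketch of that distinction; `classify_dev_item` is a made-up name, not anything in btrfs-progs, and the rule is my reading of the thread, not established check logic.

```python
def classify_dev_item(used, total_bytes, dev_extent_sum):
    """Distinguish the two used > total_bytes cases discussed above.

    used, total_bytes: values from the dev_item.
    dev_extent_sum: sum of the lengths of this device's dev extents.
    """
    if used <= total_bytes:
        return 'ok'
    if used == dev_extent_sum:
        # Accounting agrees with the actual dev extents, so an extent
        # simply ends beyond the device boundary: over-allocation.
        return 'over-allocation'
    # Accounting does not even match the extents themselves.
    return 'accounting mismatch'
```

With the reproducer from earlier in the thread, the DUP chunk past the device end would land in the 'over-allocation' bucket rather than triggering an ACCOUNTING_MISMATCH-style message.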
Re: Curious problem: btrfs device stats & unpriviliged access
On 10/08/2018 06:37 PM, Holger Hoffstätte wrote:
> On 10/08/18 17:46, Hans van Kranenburg wrote:
>
>> fs.devices() also looks for dev_items in the chunk tree:
>>
>> https://github.com/knorrie/python-btrfs/blob/master/btrfs/ctree.py#L481
>>
>> So, BOOM! you need root.
>>
>> Or just start at 0, ignore errors and keep trying devids until you've
>> found num_devices of them that work, yolo.
>
> Since I need to walk /sys/fs/btrfs/<UUID> anyway I *think* I can just look
> at the entries in /sys/fs/btrfs/<UUID>/devices/ and query them all
> directly.

But, you still need root for that, right? The progs code does a
read-only open directly on the block device.

-$ btrfs dev stats /dev/xvdb
ERROR: cannot open /dev/xvdb: Permission denied
ERROR: '/dev/xvdb' is not a mounted btrfs device

stat("/dev/loop0", {st_mode=S_IFBLK|0660, st_rdev=makedev(7, 0), ...}) = 0
stat("/dev/loop0", {st_mode=S_IFBLK|0660, st_rdev=makedev(7, 0), ...}) = 0
open("/dev/loop0", O_RDONLY) = -1 EACCES (Permission denied)

But:

-# btrfs dev stats /dev/xvdb
[/dev/xvdb].write_io_errs    0
[/dev/xvdb].read_io_errs     0
[/dev/xvdb].flush_io_errs    0
[/dev/xvdb].corruption_errs  0
[/dev/xvdb].generation_errs  0

> The skeleton btrfs_exporter already responds to http requests and
> returns dummy metrics, using the official python client library.
> I've found a nice little python sysfs scraper and now only need to
> figure out how best to map the btrfs info in sysfs to useful metrics.
> The privileged access issue would only have come into play much later,
> but it seems I can avoid it after all, which is great.
> I'm also (re-)learning python in the process, so that's the actual
> thing slowing me down.. :)

-- 
Hans van Kranenburg
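Holger's sysfs idea can be sketched in a few lines: the devices/ directory under each filesystem's entry in /sys/fs/btrfs is world-readable, so listing the device names there needs no root at all. Hypothetical helper, not part of any existing exporter; the exact set of entries under /sys/fs/btrfs may vary per kernel version.

```python
import os

SYSFS_BTRFS = '/sys/fs/btrfs'


def list_fs_devices(sysfs_root=SYSFS_BTRFS):
    """Map each mounted filesystem's fsid to the device names that
    appear under /sys/fs/btrfs/<fsid>/devices/.

    Entries without a devices/ subdirectory (like the global features/
    directory) are skipped.
    """
    result = {}
    if not os.path.isdir(sysfs_root):
        return result
    for fsid in os.listdir(sysfs_root):
        devdir = os.path.join(sysfs_root, fsid, 'devices')
        if os.path.isdir(devdir):
            result[fsid] = sorted(os.listdir(devdir))
    return result


if __name__ == '__main__':
    for fsid, devices in list_fs_devices().items():
        print(fsid, devices)
```

Note this only yields device *names*; opening the block devices themselves to read superblocks would still require root, as shown above.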
Re: Curious problem: btrfs device stats & unpriviliged access
On 10/08/2018 05:29 PM, Holger Hoffstätte wrote: > On 10/08/18 16:40, Hans van Kranenburg wrote: >>> Looking at the kernel side of things in fs/btrfs/ioctl.c I see both >>> BTRFS_IOC_TREE_SEARCH[_V2} unconditionally require CAP_SYS_ADMIN. >> >> That's the tree search ioctl, for reading arbitrary metadata. >> >> The device stats ioctl is IOC_GET_DEV_STATS... > > Yeah..OK, that is clear and gave me the hint to ask: why is it > calling this in the first place? And as it turns out [1] is where > it seems to go wrong, as is_block_device() returns 0 (as it should), > fi_args.num_devices is never set (remains 0) and it proceeds to call > everything below, eventually calling the BTRFS_IOC_FS_INFO ioctl in > #1712. And that works fine: > > 1711 if (fi_args->num_devices != 1) { > (gdb) s > 1712 ret = ioctl(fd, BTRFS_IOC_FS_INFO, fi_args); > (gdb) s > 1713 if (ret < 0) { > (gdb) p ret > $28 = 0 > (gdb) p *fi_args > $30 = { > max_id = 1, > num_devices = 1, > fsid = "z%:\371\315\033A\203\267.\200\255;FH\221", > nodesize = 16384, > sectorsize = 4096, > clone_alignment = 4096, > reserved32 = 0, > reserved = {0 } > } > > It's only when it goes into search_chunk_tree_for_fs_info() > where things fail due to CAP_SYS_ADMIN. > > And all this explains the actual bug: when I call btrfs device stats > not on the mountpoint (as I've been trying all this time) but rather > on the device, it works properly right away as regular user: > > (gdb) set args device stats /dev/loop0 > (gdb) r > Breakpoint 1, get_fs_info (path=path@entry=0x7fffe527 "/dev/loop0", > fi_args=fi_args@entry=0x7fffd400, > di_ret=di_ret@entry=0x7fffd3f0) at utils.c:1652 > 1652 { > (gdb) c > Continuing. > [/dev/loop0].write_io_errs 0 > [/dev/loop0].read_io_errs 0 > [/dev/loop0].flush_io_errs 0 > [/dev/loop0].corruption_errs 0 > [/dev/loop0].generation_errs 0 > [Inferior 1 (process 2805) exited normally] > > So this is simply a discrepancy in handling a device vs. the device(s) > for a mountpoint. 
Apparently, based on what you point it at, it does a different thing.

If you point it at a block device, it will try opening the block device
to find out which devid it has (from the superblock), find out where it
is mounted and then only ask for stats for this one device.

-# btrfs dev stats /dev/xvdc
[/dev/xvdc].write_io_errs    0
[/dev/xvdc].read_io_errs     0
[/dev/xvdc].flush_io_errs    0
[/dev/xvdc].corruption_errs  0
[/dev/xvdc].generation_errs  0

If you point it at a mounted filesystem, it lists stats for all devices.
Since there's no way to get a list of devids, it directly searches the
chunk tree for dev_item objects.

-# btrfs dev stats /mountpoint
[/dev/xvdb].write_io_errs    0
[/dev/xvdb].read_io_errs     0
[/dev/xvdb].flush_io_errs    0
[/dev/xvdb].corruption_errs  0
[/dev/xvdb].generation_errs  0
[/dev/xvdc].write_io_errs    0
[/dev/xvdc].read_io_errs     0
[/dev/xvdc].flush_io_errs    0
[/dev/xvdc].corruption_errs  0
[/dev/xvdc].generation_errs  0
[/dev/xvdd].write_io_errs    0
[/dev/xvdd].read_io_errs     0
[/dev/xvdd].flush_io_errs    0
[/dev/xvdd].corruption_errs  0
[/dev/xvdd].generation_errs  0

I do the same thing in the nagios plugin:

https://github.com/knorrie/python-btrfs/blob/master/examples/nagios/plugins/check_btrfs#L131

fs.devices() also looks for dev_items in the chunk tree:

https://github.com/knorrie/python-btrfs/blob/master/btrfs/ctree.py#L481

So, BOOM! you need root.

Or just start at 0, ignore errors and keep trying devids until you've
found num_devices of them that work, yolo.

>> I can do the device stats ioctl as normal user:
>>
>> import btrfs
>> fs = btrfs.FileSystem('/')
>> btrfs.utils.pretty_print(fs.dev_stats(1))
>>
>> devid: 1
>> nr_items: 5
>> flags: 0
>> write_errs: 0
>> read_errs: 0
>> flush_errs: 0
>> generation_errs: 0
>> corruption_errs: 0
>
> That works for me too, and that's actually the important part. \o/
> Glad we talked about it.
:} > > -h > > [1] > https://github.com/kdave/btrfs-progs/blob/7faaca0d9f78f7162ae603231f693dd8e1af2a41/utils.c#L1666 > > -- Hans van Kranenburg
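The "yolo" fallback from earlier in the thread can be written down as a pure function: probe devids starting at 0, swallow errors, and stop once num_devices of them have answered. `probe_devids` is a hypothetical name; the live part assumes the python-btrfs calls shown elsewhere in this thread (fs.fs_info().num_devices, fs.dev_stats), which work without CAP_SYS_ADMIN.

```python
try:
    import btrfs  # python-btrfs; only needed for the live part below
except ImportError:
    btrfs = None


def probe_devids(num_devices, get_stats, max_devid=4096):
    """Try devids starting at 0, ignoring failures, until num_devices
    of them answered.

    get_stats(devid) must raise for unknown devids, like the
    BTRFS_IOC_GET_DEV_STATS ioctl wrapper does. The max_devid cap is an
    arbitrary safety net against filesystems with huge devid holes.
    """
    found = {}
    for devid in range(max_devid + 1):
        if len(found) == num_devices:
            break
        try:
            found[devid] = get_stats(devid)
        except Exception:
            continue  # hole in the devid numbering, keep going
    return found


if __name__ == '__main__' and btrfs is not None:
    fs = btrfs.FileSystem('/')
    for devid, stats in probe_devids(fs.fs_info().num_devices,
                                     fs.dev_stats).items():
        btrfs.utils.pretty_print(stats)
```

Devids on a filesystem that has seen device removals are not contiguous, which is exactly why the loop keeps going past failures instead of stopping at the first error.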
Re: Curious problem: btrfs device stats & unpriviliged access
On 10/08/2018 04:40 PM, Hans van Kranenburg wrote:
> On 10/08/2018 04:27 PM, Holger Hoffstätte wrote:
>> (moving the discussion here from GH [1])
>>
>> Apparently there is something weird going on with the device stats
>> ioctls. I cannot get them to work as regular user, while they work
>> for David. A friend confirms the same issue on his system - no access
>> as non-root.
>>
>> So I made a new empty fs, mounted it, built btrfs-progs-4.17.1 with
>> debug symbols and stepped into search_chunk_tree_for_fs_info().
>> Everything is fine, all args are correct, right until:
>>
>> (gdb) s
>> 1614 ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &search_args);
>> (gdb) s
>> 1615 if (ret < 0)
>> (gdb) p ret
>> $4 = -1
>> (gdb) p search_args
>> $5 = {key = {tree_id = 3, min_objectid = 1, max_objectid = 1, min_offset = 1,
>> max_offset = 18446744073709551615, min_transid = 0,
>> max_transid = 18446744073709551615,
>> min_type = 216, max_type = 216, nr_items = 30, unused = 0, unused1 = 0,
>> unused2 = 0, unused3 = 0, unused4 = 0}, buf = '\000' }
>>
>> Looking at the kernel side of things in fs/btrfs/ioctl.c I see both
>> BTRFS_IOC_TREE_SEARCH[_V2] unconditionally require CAP_SYS_ADMIN.
>
> That's the tree search ioctl, for reading arbitrary metadata.
>
> The device stats ioctl is IOC_GET_DEV_STATS...
>
> I can do the device stats ioctl as normal user:
>
> import btrfs
> fs = btrfs.FileSystem('/')
> btrfs.utils.pretty_print(fs.dev_stats(1))
>
> devid: 1
> nr_items: 5
> flags: 0
> write_errs: 0
> read_errs: 0
> flush_errs: 0
> generation_errs: 0
> corruption_errs: 0

By the way, I can also do BTRFS_IOC_FS_INFO, BTRFS_IOC_DEV_INFO and
BTRFS_IOC_SPACE_INFO as normal user.

However, while fs_info tells me that there are num_devices devices,
there's no place where you can actually get which devids these are, and
you need to provide them one by one to dev_info and dev_stats...
: btrfs.utils.pretty_print(fs.fs_info()) max_id: 1 num_devices: 1 nodesize: 4096 sectorsize: 4096 clone_alignment: 4096 fsid: 91077ca5-6559-4a90-9d03-912d3a33412e btrfs.utils.pretty_print(fs.dev_info(1)) devid: 1 bytes_used: 60699967488 total_bytes: 107374182400 path: /dev/xvda uuid: 7e998baa-b533-4476-9132-d7d748d28044 btrfs.utils.pretty_print(fs.space_info()) - flags: Data, single total_bytes: 54.00GiB used_bytes: 53.27GiB - flags: System, single total_bytes: 32.00MiB used_bytes: 12.00KiB - flags: Metadata, single total_bytes: 2.50GiB used_bytes: 1.30GiB - flags: GlobalReserve, single total_bytes: 181.02MiB used_bytes: 0.00B > >> So why can Dave get his dev stats as unprivileged user? >> Does this work for anybody else? And why? :) >> >> cheers >> Holger >> >> [1] >> https://github.com/prometheus/node_exporter/issues/1100#issuecomment-427823190 >> > > -- Hans van Kranenburg
Re: Curious problem: btrfs device stats & unpriviliged access
On 10/08/2018 04:27 PM, Holger Hoffstätte wrote:
> (moving the discussion here from GH [1])
>
> Apparently there is something weird going on with the device stats
> ioctls. I cannot get them to work as regular user, while they work
> for David. A friend confirms the same issue on his system - no access
> as non-root.
>
> So I made a new empty fs, mounted it, built btrfs-progs-4.17.1 with
> debug symbols and stepped into search_chunk_tree_for_fs_info().
> Everything is fine, all args are correct, right until:
>
> (gdb) s
> 1614 ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &search_args);
> (gdb) s
> 1615 if (ret < 0)
> (gdb) p ret
> $4 = -1
> (gdb) p search_args
> $5 = {key = {tree_id = 3, min_objectid = 1, max_objectid = 1, min_offset = 1,
> max_offset = 18446744073709551615, min_transid = 0,
> max_transid = 18446744073709551615,
> min_type = 216, max_type = 216, nr_items = 30, unused = 0, unused1 = 0,
> unused2 = 0, unused3 = 0, unused4 = 0}, buf = '\000' }
>
> Looking at the kernel side of things in fs/btrfs/ioctl.c I see both
> BTRFS_IOC_TREE_SEARCH[_V2] unconditionally require CAP_SYS_ADMIN.

That's the tree search ioctl, for reading arbitrary metadata.

The device stats ioctl is IOC_GET_DEV_STATS...

I can do the device stats ioctl as normal user:

import btrfs
fs = btrfs.FileSystem('/')
btrfs.utils.pretty_print(fs.dev_stats(1))

devid: 1
nr_items: 5
flags: 0
write_errs: 0
read_errs: 0
flush_errs: 0
generation_errs: 0
corruption_errs: 0

> So why can Dave get his dev stats as unprivileged user?
> Does this work for anybody else? And why? :)
>
> cheers
> Holger
>
> [1]
> https://github.com/prometheus/node_exporter/issues/1100#issuecomment-427823190

-- 
Hans van Kranenburg
Re: python-btrfs v10 preview... detailed usage reporting and a tutorial
Hi, On 09/24/2018 01:19 AM, Adam Borowski wrote: > On Sun, Sep 23, 2018 at 11:54:12PM +0200, Hans van Kranenburg wrote: >> Two examples have been added, which use the new code. I would appreciate >> extra testing. Please try them and see if the reported numbers make sense: >> >> space_calculator.py >> --- >> Best to be initially described as a CLI version of the well-known >> webbased btrfs space calculator by Hugo. ;] Throw a few disk sizes at >> it, choose data and metadata profile and see how much space you would >> get to store actual data. >> >> See commit message "Add example to calculate usable and wasted space" >> for example output. >> >> show_usage.py >> - >> The contents of the old show_usage.py example that simply showed a list >> of block groups are replaced with a detailed usage report of an existing >> filesystem. > > I wonder, perhaps at least some of the examples could be elevated to > commands meant to be run by end-user? Ie, installing them to /usr/bin/, > dropping the extension? They'd probably need less generic names, though. Some of the examples are very useful, and I keep using them frequently. That's actually also the reason that I for now just have copied the examples/ to /usr/share/doc/python3-btrfs/examples for the Debian package, so that they're easily available on all systems that I work on. Currently the examples collection is serving a few purposes. It's my poor mans testing framework, which covers all functionality of the lib. It displays all the things that you can do. There's a rich git commit message history on them, which I plan to transform into documentation and tutorial stuff later. So, yes, a bunch of the things are quite useful actually. The new show_usage and space_calculator are examples of things that are possible which start to ascend the small thingies on debugging level. So what would be candidates to be promoted to 'official' utils? 0) Ah, btrfs-heatmap Yeah, that's the thing it all started with. 
I started writing all of the code to be able to debug why my filesystems were allocating raw disk space all the time and not reusing the free already allocated space. But, that one is already done. https://github.com/knorrie/btrfs-heatmap/ 1) Custom btrfs balance If really needed (and luckily, the need for it is mostly removed after solving the -o ssd issues) I always use balance_least_used.py instead of regular btrfs balance. I think it totally makes sense to do the analysis of what blockgroups to feed to balance in what order in user space. I also used another custom script to feed block groups with highly fragmented free space to balance to try repairing filesystems that had been using the cluster data extent allocator. That's not in examples, but when you combine show_free_space_fragmentation with parts of balance_least_used, you get the idea. The best example I can think of here is a program that uses the new usage information to find out how to feed block groups to balance to actually get a balanced filesystem with minimal amount of wasted raw space, and then do exactly that in the quickest way possible while providing interesting progress information, instead of just brute force rewriting all of the data and having no idea what's actually happening. 2) Advanced usage reporting Something like the new show_usage, but hey, when using python with some batteries included, I guess we can relatively easily do a nice html or pdf output with pie and bar charts which provide the user with information about the filesystem. Just having users run that when they're asking for help on IRC and share the result would be nice. :o) 3) The space calculator Yup, obviously. 4) Maybe show_orphan_cleaner_progress I use that one now and then to get a live view on mass-removal of subvolumes (backup snapshot expiry), but it's very close to a debug tool. 
Or maybe I'm already spoiled and used to it now, and I don't realize any more how frustrating it must be to see disk IO and cpu go all places and have no idea about what btrfs is doing. 5) So much more... So... the examples are just basic test coverage. There is so much more that can be done. And yes, to be able to write a small thingie that uses the lib, you already have to know a lot about btrfs. -> That's why I started writing the tutorial. And yes, when promoting things like the new show_usage example to programs that are easily available, users will probably start parsing the output of them with sed and awk which is a total abomination and the absolute opposite of the purpose of the library. So be it. Let it go. :D "The code never bothered me any way". The interesting question that remains is where the result should go. btrfs-heatmap is a thing of its own now, but it's a bit of the "show case" example using the lib, with it
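The custom balance idea under 1) can be sketched with the primitives the library already exposes: order block groups so the least used ones get rewritten first, since those free the most raw space per byte moved. Hedged sketch; `balance_candidates` is a made-up name, the bg objects are assumed to carry the used and length attributes of python-btrfs block group items, and this only selects candidates, it does not invoke balance.

```python
try:
    import btrfs  # python-btrfs; only needed for the live part below
except ImportError:
    btrfs = None


def balance_candidates(block_groups, max_used_pct=50):
    """Pick block groups at most max_used_pct full, least used first:
    rewriting those frees the most raw space per byte moved, which is
    the idea behind the balance_least_used.py example."""
    picked = [bg for bg in block_groups
              if bg.length > 0 and bg.used * 100 // bg.length <= max_used_pct]
    return sorted(picked, key=lambda bg: bg.used)


if __name__ == '__main__' and btrfs is not None:
    # Find the block groups via the chunk tree walk shown earlier in
    # this thread; needs root for the tree search ioctl.
    fs = btrfs.FileSystem('/')
    bgs = [fs.block_group(chunk.vaddr, chunk.length)
           for chunk in fs.chunks()]
    for bg in balance_candidates(bgs):
        print(bg.vaddr, bg.used, bg.length)
```

Feeding exactly this list to balance, one block group at a time, is what turns "brute force rewrite everything" into a targeted compaction with visible progress.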
Re: [PATCH 5/6] btrfs: introduce nparity raid_attr
On 10/05/2018 04:42 PM, Nikolay Borisov wrote: > > > On 5.10.2018 00:24, Hans van Kranenburg wrote: >> Instead of hardcoding exceptions for RAID5 and RAID6 in the code, use an >> nparity field in raid_attr. >> >> Signed-off-by: Hans van Kranenburg >> --- >> fs/btrfs/volumes.c | 18 +++--- >> fs/btrfs/volumes.h | 2 ++ >> 2 files changed, 13 insertions(+), 7 deletions(-) >> >> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c >> index d82b3d735ebe..453046497ac8 100644 >> --- a/fs/btrfs/volumes.c >> +++ b/fs/btrfs/volumes.c >> @@ -37,6 +37,7 @@ const struct btrfs_raid_attr >> btrfs_raid_array[BTRFS_NR_RAID_TYPES] = { >> .tolerated_failures = 1, >> .devs_increment = 2, >> .ncopies= 2, >> +.nparity= 0, >> .raid_name = "raid10", >> .bg_flag= BTRFS_BLOCK_GROUP_RAID10, >> .mindev_error = BTRFS_ERROR_DEV_RAID10_MIN_NOT_MET, >> @@ -49,6 +50,7 @@ const struct btrfs_raid_attr >> btrfs_raid_array[BTRFS_NR_RAID_TYPES] = { >> .tolerated_failures = 1, >> .devs_increment = 2, >> .ncopies= 2, >> +.nparity= 0, >> .raid_name = "raid1", >> .bg_flag= BTRFS_BLOCK_GROUP_RAID1, >> .mindev_error = BTRFS_ERROR_DEV_RAID1_MIN_NOT_MET, >> @@ -61,6 +63,7 @@ const struct btrfs_raid_attr >> btrfs_raid_array[BTRFS_NR_RAID_TYPES] = { >> .tolerated_failures = 0, >> .devs_increment = 1, >> .ncopies= 2, >> +.nparity= 0, >> .raid_name = "dup", >> .bg_flag= BTRFS_BLOCK_GROUP_DUP, >> .mindev_error = 0, >> @@ -73,6 +76,7 @@ const struct btrfs_raid_attr >> btrfs_raid_array[BTRFS_NR_RAID_TYPES] = { >> .tolerated_failures = 0, >> .devs_increment = 1, >> .ncopies= 1, >> +.nparity= 0, >> .raid_name = "raid0", >> .bg_flag= BTRFS_BLOCK_GROUP_RAID0, >> .mindev_error = 0, >> @@ -85,6 +89,7 @@ const struct btrfs_raid_attr >> btrfs_raid_array[BTRFS_NR_RAID_TYPES] = { >> .tolerated_failures = 0, >> .devs_increment = 1, >> .ncopies= 1, >> +.nparity= 0, >> .raid_name = "single", >> .bg_flag= 0, >> .mindev_error = 0, >> @@ -97,6 +102,7 @@ const struct btrfs_raid_attr >> btrfs_raid_array[BTRFS_NR_RAID_TYPES] = { >> 
.tolerated_failures = 1, >> .devs_increment = 1, >> .ncopies= 1, >> +.nparity= 2, >> .raid_name = "raid5", >> .bg_flag= BTRFS_BLOCK_GROUP_RAID5, >> .mindev_error = BTRFS_ERROR_DEV_RAID5_MIN_NOT_MET, >> @@ -109,6 +115,7 @@ const struct btrfs_raid_attr >> btrfs_raid_array[BTRFS_NR_RAID_TYPES] = { >> .tolerated_failures = 2, >> .devs_increment = 1, >> .ncopies= 1, >> +.nparity= 2, >> .raid_name = "raid6", >> .bg_flag= BTRFS_BLOCK_GROUP_RAID6, >> .mindev_error = BTRFS_ERROR_DEV_RAID6_MIN_NOT_MET, >> @@ -4597,6 +4604,8 @@ static int __btrfs_alloc_chunk(struct >> btrfs_trans_handle *trans, >> int devs_min; /* min devs needed */ >> int devs_increment; /* ndevs has to be a multiple of this */ >> int ncopies;/* how many copies to data has */ >> +int nparity;/* number of stripes worth of bytes to >> + store parity information */ >> int ret; >> u64 max_stripe_size; >> u64 max_chunk_size; >> @@ -4623,6 +4632,7 @@ static int __btrfs_alloc_chunk(struct >> btrfs_trans_handle *trans, >> devs_min = btrfs_raid_array[index].devs_min; >> devs_increment = btrfs_raid_array[index].devs_increment; >> ncopies = btrfs_raid_array[index].ncopies; >> +nparity = btrfs_raid_array[index].nparity; >> >> if (type & BTRFS_BLOCK_GROUP_DATA) { >> max_stripe_size = SZ_1G; >> @@ -4752,13 +4762,7 @@ static int __btrfs_alloc_chunk(struct >
Re: [PATCH 0/6] Chunk allocator DUP fix and cleanups
On 10/05/2018 09:51 AM, Qu Wenruo wrote: > > > On 2018/10/5 上午5:24, Hans van Kranenburg wrote: >> This patch set contains an additional fix for a newly exposed bug after >> the previous attempt to fix a chunk allocator bug for new DUP chunks: >> >> https://lore.kernel.org/linux-btrfs/782f6000-30c0-0085-abd2-74ec5827c...@mendix.com/T/#m609ccb5d32998e8ba5cfa9901c1ab56a38a6f374 > > For that bug, did you succeeded in reproducing the bug? Yes, open the above link and scroll to "Steps to reproduce". o/ > I'm adding dev extent overlap checking in btrfs_verify_dev_extents() and > btrfs-progs. > I think it could help to detect such problem. > > Thanks, > Qu > >> >> The DUP fix is "fix more DUP stripe size handling". I did that one >> before starting to change more things so it can be applied to earlier >> LTS kernels. >> >> Besides that patch, which is fixing the bug in a way that is least >> intrusive, I added a bunch of other patches to help getting the chunk >> allocator code in a state that is a bit less error-prone and >> bug-attracting. >> >> When running this and trying the reproduction scenario, I can now see >> that the created DUP device extent is 827326464 bytes long, which is >> good. >> >> I wrote and tested this on top of linus 4.19-rc5. I still need to create >> a list of related use cases and test more things to at least walk >> through a bunch of obvious use cases to see if there's nothing exploding >> too quickly with these changes. However, I'm quite confident about it, >> so I'm sharing all of it already. >> >> Any feedback and review is appreciated. Be gentle and keep in mind that >> I'm still very much in a learning stage regarding kernel development. >> >> The stable patches handling workflow is not 100% clear to me yet. I >> guess I have to add a Fixes: in the DUP patch which points to the >> previous commit 92e222df7b. 
>>
>> Moo!,
>> Knorrie
>>
>> Hans van Kranenburg (6):
>> btrfs: alloc_chunk: do not refurbish num_bytes
>> btrfs: alloc_chunk: improve chunk size variable name
>> btrfs: alloc_chunk: fix more DUP stripe size handling
>> btrfs: fix ncopies raid_attr for RAID56
>> btrfs: introduce nparity raid_attr
>> btrfs: alloc_chunk: rework chunk/stripe calculations
>>
>> fs/btrfs/volumes.c | 84 +++---
>> fs/btrfs/volumes.h | 4 ++-
>> 2 files changed, 45 insertions(+), 43 deletions(-)
>>
>

-- 
Hans van Kranenburg
Re: [PATCH 5/6] btrfs: introduce nparity raid_attr
On 10/04/2018 11:24 PM, Hans van Kranenburg wrote: > Instead of hardcoding exceptions for RAID5 and RAID6 in the code, use an > nparity field in raid_attr. > > Signed-off-by: Hans van Kranenburg > --- > fs/btrfs/volumes.c | 18 +++--- > fs/btrfs/volumes.h | 2 ++ > 2 files changed, 13 insertions(+), 7 deletions(-) > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index d82b3d735ebe..453046497ac8 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -37,6 +37,7 @@ const struct btrfs_raid_attr > btrfs_raid_array[BTRFS_NR_RAID_TYPES] = { > .tolerated_failures = 1, > .devs_increment = 2, > .ncopies= 2, > + .nparity= 0, > .raid_name = "raid10", > .bg_flag= BTRFS_BLOCK_GROUP_RAID10, > .mindev_error = BTRFS_ERROR_DEV_RAID10_MIN_NOT_MET, > @@ -49,6 +50,7 @@ const struct btrfs_raid_attr > btrfs_raid_array[BTRFS_NR_RAID_TYPES] = { > .tolerated_failures = 1, > .devs_increment = 2, > .ncopies= 2, > + .nparity= 0, > .raid_name = "raid1", > .bg_flag= BTRFS_BLOCK_GROUP_RAID1, > .mindev_error = BTRFS_ERROR_DEV_RAID1_MIN_NOT_MET, > @@ -61,6 +63,7 @@ const struct btrfs_raid_attr > btrfs_raid_array[BTRFS_NR_RAID_TYPES] = { > .tolerated_failures = 0, > .devs_increment = 1, > .ncopies= 2, > + .nparity= 0, > .raid_name = "dup", > .bg_flag= BTRFS_BLOCK_GROUP_DUP, > .mindev_error = 0, > @@ -73,6 +76,7 @@ const struct btrfs_raid_attr > btrfs_raid_array[BTRFS_NR_RAID_TYPES] = { > .tolerated_failures = 0, > .devs_increment = 1, > .ncopies= 1, > + .nparity= 0, > .raid_name = "raid0", > .bg_flag= BTRFS_BLOCK_GROUP_RAID0, > .mindev_error = 0, > @@ -85,6 +89,7 @@ const struct btrfs_raid_attr > btrfs_raid_array[BTRFS_NR_RAID_TYPES] = { > .tolerated_failures = 0, > .devs_increment = 1, > .ncopies= 1, > + .nparity= 0, > .raid_name = "single", > .bg_flag= 0, > .mindev_error = 0, > @@ -97,6 +102,7 @@ const struct btrfs_raid_attr > btrfs_raid_array[BTRFS_NR_RAID_TYPES] = { > .tolerated_failures = 1, > .devs_increment = 1, > .ncopies= 1, > + .nparity= 2, Ahem, this obviously needs to 
be 1. Another reorganize/rebase-fail. > .raid_name = "raid5", > .bg_flag= BTRFS_BLOCK_GROUP_RAID5, > .mindev_error = BTRFS_ERROR_DEV_RAID5_MIN_NOT_MET, > @@ -109,6 +115,7 @@ const struct btrfs_raid_attr > btrfs_raid_array[BTRFS_NR_RAID_TYPES] = { > .tolerated_failures = 2, > .devs_increment = 1, > .ncopies= 1, > + .nparity= 2, > .raid_name = "raid6", > .bg_flag= BTRFS_BLOCK_GROUP_RAID6, > .mindev_error = BTRFS_ERROR_DEV_RAID6_MIN_NOT_MET, > @@ -4597,6 +4604,8 @@ static int __btrfs_alloc_chunk(struct > btrfs_trans_handle *trans, > int devs_min; /* min devs needed */ > int devs_increment; /* ndevs has to be a multiple of this */ > int ncopies;/* how many copies to data has */ > + int nparity;/* number of stripes worth of bytes to > +store parity information */ > int ret; > u64 max_stripe_size; > u64 max_chunk_size; > @@ -4623,6 +4632,7 @@ static int __btrfs_alloc_chunk(struct > btrfs_trans_handle *trans, > devs_min = btrfs_raid_array[index].devs_min; > devs_increment = btrfs_raid_array[index].devs_increment; > ncopies = btrfs_raid_array[index].ncopies; > + nparity = btrfs_raid_array[index].nparity; > > if (type & BTRFS_BLOCK_GROUP_DATA) { > max_stripe_size = SZ_1G; > @@ -4752,13 +4762,7 @@ static int __btrfs_alloc_chunk(struct > btrfs_trans_handle *trans, >* this will have to be fixed for RAID1 and RAID10 over >* more drives >*/ > - data_stripes = num_stripes / ncopies; > - > - if (type & BTRFS_BLOCK_GROUP_RAID5) > - data_stripes = num_stripes - 1; > - > -
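With the corrected table (nparity = 1 for RAID5, 2 for RAID6), the number of stripes that count towards block group size becomes a single expression instead of per-profile special cases. A sketch of that arithmetic; the function and table names are illustrative, with the (ncopies, nparity) values taken from the patched btrfs_raid_array including the fix noted above.

```python
def data_stripes(num_stripes, ncopies, nparity):
    """Stripes that count for block group size: parity stripes come off
    the top, then duplication divides out. This replaces the hardcoded
    RAID5/RAID6 branches in __btrfs_alloc_chunk."""
    return (num_stripes - nparity) // ncopies


# (ncopies, nparity) per profile, per the patched btrfs_raid_array:
PROFILES = {
    'single': (1, 0),
    'dup':    (2, 0),
    'raid0':  (1, 0),
    'raid1':  (2, 0),
    'raid10': (2, 0),
    'raid5':  (1, 1),
    'raid6':  (1, 2),
}

if __name__ == '__main__':
    for name, (ncopies, nparity) in sorted(PROFILES.items()):
        # illustrative: a chunk striped over 6 device stripes
        print(name, data_stripes(6, ncopies, nparity))
```

For example, RAID6 over 6 stripes gives (6 - 2) / 1 = 4 data stripes, and DUP over its 2 stripes gives (2 - 0) / 2 = 1, matching the old special-cased code paths.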
Re: [PATCH 2/6] btrfs: alloc_chunk: improve chunk size variable name
On 10/04/2018 11:24 PM, Hans van Kranenburg wrote:
> num_bytes is really a way too generic name for a variable in this
> function. There are a dozen other variables that hold a number of bytes
> as value.
>
> Give it a name that actually describes what it does, which is holding
> the size of the chunk that we're allocating.
>
> Signed-off-by: Hans van Kranenburg
> ---
> fs/btrfs/volumes.c | 12 +++++++-----
> 1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 9c9bb127..40fa85e68b1f 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -123,6 +123,8 @@ const char *get_raid_name(enum btrfs_raid_types type)
> return btrfs_raid_array[type].raid_name;
> }
>
> +
> +

Meh, missed that one, will remove.

> static int init_first_rw_device(struct btrfs_trans_handle *trans,
> struct btrfs_fs_info *fs_info);
> static int btrfs_relocate_sys_chunks(struct btrfs_fs_info *fs_info);
> @@ -4599,7 +4601,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
> u64 max_stripe_size;
> u64 max_chunk_size;
> u64 stripe_size;
> - u64 num_bytes;
> + u64 chunk_size;
> int ndevs;
> int i;
> int j;
> @@ -4801,9 +4803,9 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
> map->type = type;
> map->sub_stripes = sub_stripes;
>
> - num_bytes = stripe_size * data_stripes;
> + chunk_size = stripe_size * data_stripes;
>
> - trace_btrfs_chunk_alloc(info, map, start, num_bytes);
> + trace_btrfs_chunk_alloc(info, map, start, chunk_size);
>
> em = alloc_extent_map();
> if (!em) {
> @@ -4814,7 +4816,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
> set_bit(EXTENT_FLAG_FS_MAPPING, &em->flags);
> em->map_lookup = map;
> em->start = start;
> - em->len = num_bytes;
> + em->len = chunk_size;
> em->block_start = 0;
> em->block_len = em->len;
> em->orig_block_len = stripe_size;
> @@ -4832,7 +4834,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
> refcount_inc(&em->refs);
> write_unlock(&em_tree->lock);
>
> - ret = btrfs_make_block_group(trans, 0, type, start, num_bytes);
> + ret = btrfs_make_block_group(trans, 0, type, start, chunk_size);
> if (ret)
> goto error_del_extent;
>

-- 
Hans van Kranenburg
[PATCH 6/6] btrfs: alloc_chunk: rework chunk/stripe calculations
Previously, the stripe_size variable was modified too many times in the
__btrfs_alloc_chunk function. The most problematic place was the if
block dealing with a chunk bigger than max_chunk_size, which would throw
away (overwrite) the value of stripe_size, maybe realizing a few lines
later that the previous value was actually better and executing a copy
of former logic to try to get it back in the previous state.

Instead of on-the-fly calculating the target chunk size, adjust the
max_stripe_size variable based on the max_chunk_size that was set
before, and use that to simply compare it to stripe_size at some point.
This removes the whole problematic if block.

Signed-off-by: Hans van Kranenburg
---
 fs/btrfs/volumes.c | 46 +-
 fs/btrfs/volumes.h |  2 +-
 2 files changed, 22 insertions(+), 26 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 453046497ac8..862ee17ee0e5 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4596,14 +4596,15 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	struct btrfs_device_info *devices_info = NULL;
 	u64 total_avail;
 	int num_stripes;	/* total number of stripes to allocate */
-	int data_stripes;	/* number of stripes that count for
-				   block group size */
+	int num_data_stripes;	/* number of stripes worth of bytes to
+				   store data including copies */
 	int sub_stripes;	/* sub_stripes info for map */
 	int dev_stripes;	/* stripes per dev */
 	int devs_max;		/* max devs to use */
 	int devs_min;		/* min devs needed */
 	int devs_increment;	/* ndevs has to be a multiple of this */
-	int ncopies;		/* how many copies to data has */
+	int ncopies;		/* how many times actual data is duplicated
+				   inside num_data_stripes */
 	int nparity;		/* number of stripes worth of bytes to
 				   store parity information */
 	int ret;
@@ -4747,6 +4748,8 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	}
 
 	ndevs = min(ndevs, devs_max);
+	num_stripes = ndevs * dev_stripes;
+	num_data_stripes = num_stripes - nparity;
 
 	/*
 	 * The primary goal is to maximize the number of stripes, so use as
@@ -4756,31 +4759,24 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	 * max_avail is the total size so we have to adjust.
 	 */
 	stripe_size = div_u64(devices_info[ndevs - 1].max_avail, dev_stripes);
-	num_stripes = ndevs * dev_stripes;
-
-	/*
-	 * this will have to be fixed for RAID1 and RAID10 over
-	 * more drives
-	 */
-	data_stripes = (num_stripes - nparity) / ncopies;
 
 	/*
-	 * Use the number of data stripes to figure out how big this chunk
-	 * is really going to be in terms of logical address space,
-	 * and compare that answer with the max chunk size. If it's higher,
-	 * we try to reduce stripe_size.
+	 * Now that we know how many stripes we're going to use, we can adjust
+	 * down max_stripe_size if needed, paying attention to max_chunk_size.
+	 *
+	 * By multiplying chunk size with ncopies, we get the total amount of
+	 * bytes that need to fit into all the non-parity stripes.
+	 *
+	 * A chunk is allowed to end up being a bit bigger than max_chunk_size
+	 * when rounding up the stripe_size to a 16MiB boundary makes it so.
+	 * Unless... it ends up being bigger than the amount of physical free
+	 * space we can use for it.
 	 */
-	if (stripe_size * data_stripes > max_chunk_size) {
-		/* Reduce stripe_size, round it up to a 16MB boundary
-		 * again and then use it, unless it ends up being even
-		 * bigger than the previous value we had already.
-		 */
-		stripe_size = min(round_up(div_u64(max_chunk_size,
-						   data_stripes), SZ_16M),
-				  stripe_size);
-	}
+	max_stripe_size = min(round_up((max_chunk_size * ncopies) /
+					num_data_stripes, SZ_16M),
+			      max_stripe_size);
 
-	/* align to BTRFS_STRIPE_LEN */
+	stripe_size = min(max_stripe_size, stripe_size);
 	stripe_size = round_down(stripe_size, BTRFS_STRIPE_LEN);
 
 	map = kmalloc(map_lookup_size(num_stripes), GFP_NOFS);
@@ -4804,7 +4800,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	map->type = type;
 	map->sub_stripes = sub_stripes;
 
-	chunk_size = stripe_size * data_stripes;
+	chunk_size = div_u64(stripe_size * num_data_stripes, ncopies);
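Condensed into plain Python, the reworked flow looks like this (a standalone sketch of the calculation only, not kernel code, using the DUP example numbers from patch 3/6's commit message; variable names mirror the patch):

```python
SZ_16M = 16 * 1024 * 1024
MiB = 1024 * 1024

def round_up(x, align):
    # Same semantics as the kernel's round_up() for power-of-two alignment.
    return (x + align - 1) // align * align

# DUP data chunk on the 7880MiB device from patch 3/6's commit message.
max_stripe_size = 1024 * MiB            # SZ_1G for a data chunk
max_chunk_size = 788 * MiB              # 10% of the device size
ndevs, dev_stripes, nparity, ncopies = 1, 2, 0, 2

num_stripes = ndevs * dev_stripes
num_data_stripes = num_stripes - nparity

stripe_size = (1578 * MiB) // dev_stripes   # from max_avail: 789MiB

# Fold max_chunk_size into max_stripe_size once, replacing the old if block.
max_stripe_size = min(round_up(max_chunk_size * ncopies // num_data_stripes,
                               SZ_16M), max_stripe_size)
stripe_size = min(max_stripe_size, stripe_size)

chunk_size = stripe_size * num_data_stripes // ncopies
print(stripe_size // MiB, chunk_size // MiB)  # 789 789
```

The result is the same 789MiB device extent that the "fix more DUP stripe size handling" patch produces, but without ever overwriting stripe_size with a temporary value.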
[PATCH 0/6] Chunk allocator DUP fix and cleanups
This patch set contains an additional fix for a newly exposed bug after the previous attempt to fix a chunk allocator bug for new DUP chunks: https://lore.kernel.org/linux-btrfs/782f6000-30c0-0085-abd2-74ec5827c...@mendix.com/T/#m609ccb5d32998e8ba5cfa9901c1ab56a38a6f374 The DUP fix is "fix more DUP stripe size handling". I did that one before starting to change more things so it can be applied to earlier LTS kernels. Besides that patch, which is fixing the bug in a way that is least intrusive, I added a bunch of other patches to help getting the chunk allocator code in a state that is a bit less error-prone and bug-attracting. When running this and trying the reproduction scenario, I can now see that the created DUP device extent is 827326464 bytes long, which is good. I wrote and tested this on top of linus 4.19-rc5. I still need to create a list of related use cases and test more things to at least walk through a bunch of obvious use cases to see if there's nothing exploding too quickly with these changes. However, I'm quite confident about it, so I'm sharing all of it already. Any feedback and review is appreciated. Be gentle and keep in mind that I'm still very much in a learning stage regarding kernel development. The stable patches handling workflow is not 100% clear to me yet. I guess I have to add a Fixes: in the DUP patch which points to the previous commit 92e222df7b. Moo!, Knorrie Hans van Kranenburg (6): btrfs: alloc_chunk: do not refurbish num_bytes btrfs: alloc_chunk: improve chunk size variable name btrfs: alloc_chunk: fix more DUP stripe size handling btrfs: fix ncopies raid_attr for RAID56 btrfs: introduce nparity raid_attr btrfs: alloc_chunk: rework chunk/stripe calculations fs/btrfs/volumes.c | 84 +++--- fs/btrfs/volumes.h | 4 ++- 2 files changed, 45 insertions(+), 43 deletions(-) -- 2.19.0.329.g76f2f5c1e3
[PATCH 4/6] btrfs: fix ncopies raid_attr for RAID56
RAID5 and RAID6 profiles store one copy of the data, not 2 or 3. These
values are not used anywhere by the way.

Signed-off-by: Hans van Kranenburg
---
 fs/btrfs/volumes.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 7045814fc98d..d82b3d735ebe 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -96,7 +96,7 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 		.devs_min	= 2,
 		.tolerated_failures = 1,
 		.devs_increment	= 1,
-		.ncopies	= 2,
+		.ncopies	= 1,
 		.raid_name	= "raid5",
 		.bg_flag	= BTRFS_BLOCK_GROUP_RAID5,
 		.mindev_error	= BTRFS_ERROR_DEV_RAID5_MIN_NOT_MET,
@@ -108,7 +108,7 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 		.devs_min	= 3,
 		.tolerated_failures = 2,
 		.devs_increment	= 1,
-		.ncopies	= 3,
+		.ncopies	= 1,
 		.raid_name	= "raid6",
 		.bg_flag	= BTRFS_BLOCK_GROUP_RAID6,
 		.mindev_error	= BTRFS_ERROR_DEV_RAID6_MIN_NOT_MET,
-- 
2.19.0.329.g76f2f5c1e3
[PATCH 3/6] btrfs: alloc_chunk: fix more DUP stripe size handling
Commit 92e222df7b "btrfs: alloc_chunk: fix DUP stripe size handling" fixed calculating the stripe_size for a new DUP chunk. However, the same calculation reappears a bit later, and that one was not changed yet. The resulting bug that is exposed is that the newly allocated device extents ('stripes') can have a few MiB overlap with the next thing stored after them, which is another device extent or the end of the disk. The scenario in which this can happen is: * The block device for the filesystem is less than 10GiB in size. * The amount of contiguous free unallocated disk space chosen to use for chunk allocation is 20% of the total device size, or a few MiB more or less. An example: - The filesystem device is 7880MiB (max_chunk_size gets set to 788MiB) - There's 1578MiB unallocated raw disk space left in one contiguous piece. In this case stripe_size is first calculated as 789MiB, (half of 1578MiB). Since 789MiB (stripe_size * data_stripes) > 788MiB (max_chunk_size), we enter the if block. Now stripe_size value is immediately overwritten while calculating an adjusted value based on max_chunk_size, which ends up as 788MiB. Next, the value is rounded up to a 16MiB boundary, 800MiB, which is actually more than the value we had before. However, the last comparison fails to detect this, because it's comparing the value with the total amount of free space, which is about twice the size of stripe_size. In the example above, this means that the resulting raw disk space being allocated is 1600MiB, while only a gap of 1578MiB has been found. The second device extent object for this DUP chunk will overlap for 22MiB with whatever comes next. The underlying problem here is that the stripe_size is reused all the time for different things. So, when entering the code in the if block, stripe_size is immediately overwritten with something else. If later we decide we want to have the previous value back, then the logic to compute it was copy pasted in again. 
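The arithmetic in this scenario is easy to replay outside the kernel. The following standalone sketch (plain Python, not kernel code; the names mirror the variables discussed above) reproduces the overshoot:

```python
SZ_16M = 16 * 1024 * 1024
MiB = 1024 * 1024

def round_up(x, align):
    # Same semantics as the kernel's round_up() for power-of-two alignment.
    return (x + align - 1) // align * align

# Numbers from the example: 7880MiB device, 1578MiB contiguous free space.
max_chunk_size = 788 * MiB              # 10% of the device size
max_avail = 1578 * MiB                  # largest free hole found on the device
dev_stripes = 2                         # DUP puts 2 stripes on one device
data_stripes = 1

stripe_size = max_avail // dev_stripes  # 789MiB
if stripe_size * data_stripes > max_chunk_size:
    # The old value of stripe_size is thrown away here...
    stripe_size = max_chunk_size // data_stripes      # 788MiB
    # ...and the 16MiB round_up pushes it past the value we started with.
    stripe_size = round_up(stripe_size, SZ_16M)       # 800MiB
    # This check cannot save us: max_avail is twice the per-stripe budget.
    stripe_size = min(max_avail, stripe_size)         # still 800MiB

print(stripe_size)                            # 838860800, i.e. 800MiB
print(stripe_size * dev_stripes - max_avail)  # 23068672 bytes (22MiB) too much
```

The 838860800 figure is exactly the too-long device extent length that shows up in the reproduction in "DUP dev_extent might overlap something next to it" below.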
With this change, the value in stripe_size is not unnecessarily
destroyed, so the duplicated calculation is not needed any more.

Signed-off-by: Hans van Kranenburg
---
 fs/btrfs/volumes.c | 17 +++--
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 40fa85e68b1f..7045814fc98d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4763,19 +4763,16 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	/*
 	 * Use the number of data stripes to figure out how big this chunk
 	 * is really going to be in terms of logical address space,
-	 * and compare that answer with the max chunk size
+	 * and compare that answer with the max chunk size. If it's higher,
+	 * we try to reduce stripe_size.
 	 */
 	if (stripe_size * data_stripes > max_chunk_size) {
-		stripe_size = div_u64(max_chunk_size, data_stripes);
-
-		/* bump the answer up to a 16MB boundary */
-		stripe_size = round_up(stripe_size, SZ_16M);
-
-		/*
-		 * But don't go higher than the limits we found while searching
-		 * for free extents
+		/* Reduce stripe_size, round it up to a 16MB boundary
+		 * again and then use it, unless it ends up being even
+		 * bigger than the previous value we had already.
 		 */
-		stripe_size = min(devices_info[ndevs - 1].max_avail,
+		stripe_size = min(round_up(div_u64(max_chunk_size,
+						   data_stripes), SZ_16M),
 				  stripe_size);
 	}
-- 
2.19.0.329.g76f2f5c1e3
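Replaying the same example numbers with the patched logic (again a standalone Python sketch, not kernel code) shows stripe_size staying at 827326464 bytes, the device extent length mentioned in the cover letter:

```python
SZ_16M = 16 * 1024 * 1024
MiB = 1024 * 1024

def round_up(x, align):
    # Same semantics as the kernel's round_up() for power-of-two alignment.
    return (x + align - 1) // align * align

max_chunk_size = 788 * MiB
max_avail = 1578 * MiB
dev_stripes = 2
data_stripes = 1

stripe_size = max_avail // dev_stripes  # 789MiB
if stripe_size * data_stripes > max_chunk_size:
    # Patched: round up first, then never exceed the value we already had.
    stripe_size = min(round_up(max_chunk_size // data_stripes, SZ_16M),
                      stripe_size)

print(stripe_size)                             # 827326464, i.e. 789MiB
print(stripe_size * dev_stripes <= max_avail)  # True: the DUP chunk fits
```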
[PATCH 2/6] btrfs: alloc_chunk: improve chunk size variable name
num_bytes is really a way too generic name for a variable in this
function. There are a dozen other variables that hold a number of bytes
as value.

Give it a name that actually describes what it does, which is holding
the size of the chunk that we're allocating.

Signed-off-by: Hans van Kranenburg
---
 fs/btrfs/volumes.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 9c9bb127..40fa85e68b1f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -123,6 +123,8 @@ const char *get_raid_name(enum btrfs_raid_types type)
 	return btrfs_raid_array[type].raid_name;
 }
 
+
+
 static int init_first_rw_device(struct btrfs_trans_handle *trans,
 				struct btrfs_fs_info *fs_info);
 static int btrfs_relocate_sys_chunks(struct btrfs_fs_info *fs_info);
@@ -4599,7 +4601,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	u64 max_stripe_size;
 	u64 max_chunk_size;
 	u64 stripe_size;
-	u64 num_bytes;
+	u64 chunk_size;
 	int ndevs;
 	int i;
 	int j;
@@ -4801,9 +4803,9 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	map->type = type;
 	map->sub_stripes = sub_stripes;
 
-	num_bytes = stripe_size * data_stripes;
+	chunk_size = stripe_size * data_stripes;
 
-	trace_btrfs_chunk_alloc(info, map, start, num_bytes);
+	trace_btrfs_chunk_alloc(info, map, start, chunk_size);
 
 	em = alloc_extent_map();
 	if (!em) {
@@ -4814,7 +4816,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	set_bit(EXTENT_FLAG_FS_MAPPING, &em->flags);
 	em->map_lookup = map;
 	em->start = start;
-	em->len = num_bytes;
+	em->len = chunk_size;
 	em->block_start = 0;
 	em->block_len = em->len;
 	em->orig_block_len = stripe_size;
@@ -4832,7 +4834,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	refcount_inc(&em->refs);
 	write_unlock(&em_tree->lock);
 
-	ret = btrfs_make_block_group(trans, 0, type, start, num_bytes);
+	ret = btrfs_make_block_group(trans, 0, type, start, chunk_size);
 	if (ret)
 		goto error_del_extent;
-- 
2.19.0.329.g76f2f5c1e3
2.19.0.329.g76f2f5c1e3
[PATCH 1/6] btrfs: alloc_chunk: do not refurbish num_bytes
num_bytes is used to store the chunk length of the chunk that we're
allocating. Do not reuse it for something really different in the same
function.

Signed-off-by: Hans van Kranenburg
---
 fs/btrfs/volumes.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f4405e430da6..9c9bb127 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4837,8 +4837,9 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 		goto error_del_extent;
 
 	for (i = 0; i < map->num_stripes; i++) {
-		num_bytes = map->stripes[i].dev->bytes_used + stripe_size;
-		btrfs_device_set_bytes_used(map->stripes[i].dev, num_bytes);
+		btrfs_device_set_bytes_used(map->stripes[i].dev,
+					    map->stripes[i].dev->bytes_used +
+					    stripe_size);
 	}
 
 	atomic64_sub(stripe_size * map->num_stripes, &info->free_chunk_space);
-- 
2.19.0.329.g76f2f5c1e3
[PATCH 5/6] btrfs: introduce nparity raid_attr
Instead of hardcoding exceptions for RAID5 and RAID6 in the code, use an
nparity field in raid_attr.

Signed-off-by: Hans van Kranenburg
---
 fs/btrfs/volumes.c | 18 +++---
 fs/btrfs/volumes.h |  2 ++
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d82b3d735ebe..453046497ac8 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -37,6 +37,7 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 		.tolerated_failures = 1,
 		.devs_increment	= 2,
 		.ncopies	= 2,
+		.nparity	= 0,
 		.raid_name	= "raid10",
 		.bg_flag	= BTRFS_BLOCK_GROUP_RAID10,
 		.mindev_error	= BTRFS_ERROR_DEV_RAID10_MIN_NOT_MET,
@@ -49,6 +50,7 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 		.tolerated_failures = 1,
 		.devs_increment	= 2,
 		.ncopies	= 2,
+		.nparity	= 0,
 		.raid_name	= "raid1",
 		.bg_flag	= BTRFS_BLOCK_GROUP_RAID1,
 		.mindev_error	= BTRFS_ERROR_DEV_RAID1_MIN_NOT_MET,
@@ -61,6 +63,7 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 		.tolerated_failures = 0,
 		.devs_increment	= 1,
 		.ncopies	= 2,
+		.nparity	= 0,
 		.raid_name	= "dup",
 		.bg_flag	= BTRFS_BLOCK_GROUP_DUP,
 		.mindev_error	= 0,
@@ -73,6 +76,7 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 		.tolerated_failures = 0,
 		.devs_increment	= 1,
 		.ncopies	= 1,
+		.nparity	= 0,
 		.raid_name	= "raid0",
 		.bg_flag	= BTRFS_BLOCK_GROUP_RAID0,
 		.mindev_error	= 0,
@@ -85,6 +89,7 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 		.tolerated_failures = 0,
 		.devs_increment	= 1,
 		.ncopies	= 1,
+		.nparity	= 0,
 		.raid_name	= "single",
 		.bg_flag	= 0,
 		.mindev_error	= 0,
@@ -97,6 +102,7 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 		.tolerated_failures = 1,
 		.devs_increment	= 1,
 		.ncopies	= 1,
+		.nparity	= 1,
 		.raid_name	= "raid5",
 		.bg_flag	= BTRFS_BLOCK_GROUP_RAID5,
 		.mindev_error	= BTRFS_ERROR_DEV_RAID5_MIN_NOT_MET,
@@ -109,6 +115,7 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 		.tolerated_failures = 2,
 		.devs_increment	= 1,
 		.ncopies	= 1,
+		.nparity	= 2,
 		.raid_name	= "raid6",
 		.bg_flag	= BTRFS_BLOCK_GROUP_RAID6,
 		.mindev_error	= BTRFS_ERROR_DEV_RAID6_MIN_NOT_MET,
@@ -4597,6 +4604,8 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	int devs_min;		/* min devs needed */
 	int devs_increment;	/* ndevs has to be a multiple of this */
 	int ncopies;		/* how many copies to data has */
+	int nparity;		/* number of stripes worth of bytes to
+				   store parity information */
 	int ret;
 	u64 max_stripe_size;
 	u64 max_chunk_size;
@@ -4623,6 +4632,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	devs_min = btrfs_raid_array[index].devs_min;
 	devs_increment = btrfs_raid_array[index].devs_increment;
 	ncopies = btrfs_raid_array[index].ncopies;
+	nparity = btrfs_raid_array[index].nparity;
 
 	if (type & BTRFS_BLOCK_GROUP_DATA) {
 		max_stripe_size = SZ_1G;
@@ -4752,13 +4762,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	 * this will have to be fixed for RAID1 and RAID10 over
 	 * more drives
 	 */
-	data_stripes = num_stripes / ncopies;
-
-	if (type & BTRFS_BLOCK_GROUP_RAID5)
-		data_stripes = num_stripes - 1;
-
-	if (type & BTRFS_BLOCK_GROUP_RAID6)
-		data_stripes = num_stripes - 2;
+	data_stripes = (num_stripes - nparity) / ncopies;
 
 	/*
 	 * Use the number of data stripes to figure out how big this chunk
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 23e9285d88de..0fe005b4295a 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -331,6 +331,8 @@ struct btrfs_raid_attr {
 	int tolerated_failures; /* max tolerated fail devs */
 	int devs_increment;	/* ndevs has to be a multiple of this */
 	int ncopies;		/* how many copies to data has */
+	int nparity;		/* number of stripes worth of bytes to
+				   store parity information */
-- 
2.19.0.329.g76f2f5c1e3
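The effect of the nparity field can be illustrated with a small table sketch (standalone Python, not kernel code; the ncopies values assume the fix from "btrfs: fix ncopies raid_attr for RAID56" is applied, and nparity is the number of parity stripes per profile):

```python
# A sketch of the relevant raid_attr fields after patches 4/6 and 5/6.
raid_attr = {
    'raid10': dict(ncopies=2, nparity=0),
    'raid1':  dict(ncopies=2, nparity=0),
    'dup':    dict(ncopies=2, nparity=0),
    'raid0':  dict(ncopies=1, nparity=0),
    'single': dict(ncopies=1, nparity=0),
    'raid5':  dict(ncopies=1, nparity=1),
    'raid6':  dict(ncopies=1, nparity=2),
}

def data_stripes(profile, num_stripes):
    # The single formula replacing the hardcoded RAID5/RAID6 exceptions.
    attr = raid_attr[profile]
    return (num_stripes - attr['nparity']) // attr['ncopies']

print(data_stripes('raid5', 4))   # 3: one stripe holds parity
print(data_stripes('raid6', 6))   # 4: two stripes hold parity
print(data_stripes('dup', 2))     # 1: two copies of the same data
```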
Re: DUP dev_extent might overlap something next to it
On 09/29/2018 01:30 AM, Hans van Kranenburg wrote: > [...] > > I didn't try filling it up and see what happens yet. Also, this can > probably done with a DUP chunk, but it's a bit harder to quickly prove. DUP metadata chunk ^^ -- Hans van Kranenburg
Re: DUP dev_extent might overlap something next to it
On 09/25/2018 02:05 AM, Hans van Kranenburg wrote:
> (I'm using v4.19-rc5 code here.)
>
> Imagine allocating a DATA|DUP chunk.
>
> [blub, see previous message]

Steps to reproduce DUP chunk beyond end of device:

First create a 6302M block device and fill it up.

  mkdir bork
  cd bork
  dd if=/dev/zero of=image bs=1 count=0 seek=6302M
  mkfs.btrfs -d dup -m dup image
  losetup -f image
  mkdir mountpoint
  mount -o space_cache=v2 /dev/loop0 mountpoint
  cp -a /usr mountpoint

After a while, this starts throwing: No space left on device

Now we extend the size of the image to 7880MiB so that the next dup data
chunk allocation will exactly try to use the 1578MiB free raw disk space
to trigger the bug.

  dd if=/dev/zero of=image bs=1 count=0 seek=7880M
  losetup -c /dev/loop0
  btrfs fi resize max mountpoint/

Now we trigger the new DUP data chunk allocation:

  cp /vmlinuz mountpoint/

Now I have a dev extent starting at 7446986752 on devid 1, with length
838860800. This means it ends at 8285847552, which is 7902MiB, exactly
22MiB beyond the size of the device.

The only nice thing about this is that df shows me I still have more
than 8 exabyte of space in this image file.

Label: none  uuid: e711eea6-5332-44cf-9704-998a7a939970
	Total devices 1 FS bytes used 2.81GiB
	devid    1 size 7.70GiB used 7.72GiB path /dev/loop0

I didn't try filling it up and see what happens yet. Also, this can
probably be done with a DUP chunk, but it's a bit harder to quickly
prove. And making this happen in the middle of a block device instead of
at the end is also a bit harder.

-- 
Hans van Kranenburg
Re: python-btrfs v10 preview... detailed usage reporting and a tutorial
On 09/24/2018 10:08 AM, Nikolay Borisov wrote: >> >> The bugs are all related to repeated kernel code all over the place >> containing a lot of if statements dealing with different kind of >> allocation profiles and their exceptions. What I ended up doing is >> making a few helper functions instead, see the commit "Add volumes.py, >> handling device / chunk logic". It would probably be nice to do the same >> in the kernel code, which would also solve the mentioned bugs and >> prevent new similar ones from happening. > > Would you care to report each bug separately so they can be triaged and > fixed? In case of the RAID10 5GiB thing I think I was mixing up things. When doing mkfs you end up with a RAID10 chunk of 5GiB (dunno why, didn't research), when mounting and pointing balance at it, I get a 10GiB for it back, so that's ok. For the DUP thing, I sent an explanation ("DUP dev_extent might overlap something next to it"), which doesn't seem to attract much attention yet. I'm preparing a pile of patches to volumes.[ch] to fix this, clean up things that I ran into and make the logic a bit less convoluted. -- Hans van Kranenburg
DUP dev_extent might overlap something next to it
(I'm using v4.19-rc5 code here.) Imagine allocating a DATA|DUP chunk. In the chunk allocator, we first set... max_stripe_size = SZ_1G; max_chunk_size = BTRFS_MAX_DATA_CHUNK_SIZE ... which is 10GiB. Then... /* we don't want a chunk larger than 10% of writeable space */ max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1), max_chunk_size); Imagine we only have one 7880MiB block device in this filesystem. Now max_chunk_size is down to 788MiB. The next step in the code is to search for max_stripe_size * dev_stripes amount of free space on the device, which is in our example 1GiB * 2 = 2GiB. Imagine the device has exactly 1578MiB free in one contiguous piece. This amount of bytes will be put in devices_info[ndevs - 1].max_avail Next we recalculate the stripe_size (which is actually the device extent length), based on the actual maximum amount of available raw disk space: stripe_size = div_u64(devices_info[ndevs - 1].max_avail, dev_stripes); stripe_size is now 789MiB Next we do... data_stripes = num_stripes / ncopies ...where data_stripes ends up as 1, because num_stripes is 2 (the amount of device extents we're going to have), and DUP has ncopies 2. Next there's a check... if (stripe_size * data_stripes > max_chunk_size) ...which matches because 789MiB * 1 > 788MiB. We go into the if code, and next is... stripe_size = div_u64(max_chunk_size, data_stripes); ...which resets stripe_size to max_chunk_size: 788MiB Next is a fun one... /* bump the answer up to a 16MB boundary */ stripe_size = round_up(stripe_size, SZ_16M); ...which changes stripe_size from 788MiB to 800MiB. We're not done changing stripe_size yet... /* But don't go higher than the limits we found while searching * for free extents */ stripe_size = min(devices_info[ndevs - 1].max_avail, stripe_size); This is bad. max_avail is twice the stripe_size (we need to fit 2 device extents on the same device for DUP). The result here is that 800MiB < 1578MiB, so it's unchanged. 
However, the resulting DUP chunk will need 1600MiB disk space, which isn't there, and the second dev_extent might extend into the next thing (next dev_extent? end of device?) for 22MiB. The last shown line of code relies on a situation where there's twice the value of stripe_size present as value for the variable stripe_size when it's DUP. This was actually the case before commit 92e222df7b "btrfs: alloc_chunk: fix DUP stripe size handling", from which I quote: "[...] in the meantime there's a check to see if the stripe_size does not exceed max_chunk_size. Since during this check stripe_size is twice the amount as intended, the check will reduce the stripe_size to max_chunk_size if the actual correct to be used stripe_size is more than half the amount of max_chunk_size." In the previous version of the code, the 16MiB alignment (why is this done, by the way?) would result in a 50% chance that it would actually do an 8MiB alignment for the individual dev_extents, since it was operating on double the size. Does this matter? Does it matter that stripe_size can be set to anything which is not 16MiB aligned because of the amount of remaining available disk space which is just taken? What is the main purpose of this round_up? The most straightforward thing to do seems something like... stripe_size = min( div_u64(devices_info[ndevs - 1].max_avail, dev_stripes), stripe_size ) ..just putting half of the max_avail into stripe_size. I hope the above analysis is correct... If this is fixed, then it will be the 5th commit in btrfs history which tries to fix the same thing in this code again, where all the previous ones were breaking things while fixing things. The real problem is that in the very first version of this code, the stripe_size variable was already reused for very different purposes throughout the function. And it still is, all the time. So, while it may seem very logical fix (again), I guess this needs more eyes, since we missed this line the previous time. 
D: -- Hans van Kranenburg
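The "most straightforward thing" suggested above can be checked with the example numbers (a standalone Python sketch, not kernel code):

```python
MiB = 1024 * 1024

max_avail = 1578 * MiB   # the free hole found on the device
dev_stripes = 2          # DUP: two device extents on the same device
stripe_size = 800 * MiB  # value after the 16MiB round_up in the current code

# Proposed: cap at the per-stripe share of max_avail, not at max_avail itself.
stripe_size = min(max_avail // dev_stripes, stripe_size)

print(stripe_size // MiB)                      # 789
print(stripe_size * dev_stripes <= max_avail)  # True: no overlap any more
```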
python-btrfs v10 preview... detailed usage reporting and a tutorial
Hi all,

I'm planning for a python-btrfs release to happen in about a week. All
new changes are in the develop branch:
https://github.com/knorrie/python-btrfs/commits/develop

tl;dr: check out the two new examples added in the latest git commits
and see if they provide correct info!

## Detailed usage reporting

The new FsUsage object provides information on different levels
(physical allocated bytes on devices, virtual space usage, etc) and also
contains code to estimate how much space is still actually really
available before ENOSPC happens (e.g. the values that would ideally show
up in df output). It works for all allocation profiles!

Two examples have been added, which use the new code. I would appreciate
extra testing. Please try them and see if the reported numbers make
sense:

space_calculator.py
-------------------

Best to be initially described as a CLI version of the well-known
web-based btrfs space calculator by Hugo. ;]

Throw a few disk sizes at it, choose data and metadata profile and see
how much space you would get to store actual data. See commit message
"Add example to calculate usable and wasted space" for example output.

show_usage.py
-------------

The contents of the old show_usage.py example that simply showed a list
of block groups are replaced with a detailed usage report of an existing
filesystem. See commit message "A new show usage example!" for example
output.

## A btrfs tutorial!

A while ago I started creating documentation for python-btrfs in
tutorial style. By playing around with an example filesystem we learn
where btrfs puts our data on the disks, what a chunk, block group and an
extent is, how we can quickly look up interesting things in metadata and
how cows climb trees, moo.

https://github.com/knorrie/python-btrfs/issues/11
https://github.com/knorrie/python-btrfs/blob/tutorial/tutorial/README.md

I'm not sure yet if I'm going to 'ship' the first few pages already,
since it's still very much a work in progress, but in any case feedback
/ ideas are welcome.
Have a look!

## Other changes

Other changes are the addition of the sync, fideduperange and
get_features ioctl calls and a workaround for python 3.7 which breaks
the struct module api.

## P.S.

And finally, when doing the above, I discovered a few extra unintended
features and bugs in the btrfs chunk allocator (Did you know RAID10
block groups are limited to 5GiB in size? Did you know that when the
last chunk added on a disk is of DUP type, it could end up having an end
beyond the limit of a device?). I still have to actually test the second
one, causing it to happen. If anyone is interested to help with that,
please ask about it.

The bugs are all related to repeated kernel code all over the place
containing a lot of if statements dealing with different kind of
allocation profiles and their exceptions. What I ended up doing is
making a few helper functions instead, see the commit "Add volumes.py,
handling device / chunk logic". It would probably be nice to do the same
in the kernel code, which would also solve the mentioned bugs and
prevent new similar ones from happening.

Have fun,

-- 
Hans van Kranenburg
Re: very poor performance / a lot of writes to disk with space_cache (but not with space_cache=v2)
On 09/19/2018 10:04 PM, Martin Steigerwald wrote:
> Hans van Kranenburg - 19.09.18, 19:58:
>>> However, as soon as we remount the filesystem with space_cache=v2 -
>>> writes drop to just around 3-10 MB/s to each disk. If we remount to
>>> space_cache - lots of writes, system unresponsive. Again remount to
>>> space_cache=v2 - low writes, system responsive.
>>>
>>> That's a huuge, 10x overhead! Is it expected? Especially that
>>> space_cache=v1 is still the default mount option?
>>
>> Yes, that does not surprise me.
>>
>> https://events.static.linuxfound.org/sites/events/files/slides/vault2016_0.pdf
>>
>> Free space cache v1 is the default because of issues with btrfs-progs,
>> not because it's unwise to use the kernel code. I can totally
>> recommend using it. The linked presentation above gives some good
>> background information.
>
> What issues in btrfs-progs are that?

Missing code to make offline changes to a filesystem that has a free
space tree. So when using btrfstune / repair / whatever you first need
to remove the whole free space tree with a command, and then add it back
on the next mount.

For me personally that's not a problem (I don't have to make offline
changes), but I understand that having that situation out of the box for
every new user would be a bit awkward.

> I am wondering whether to switch to freespace tree v2. Would it provide
> benefit for a regular / and /home filesystems as dual SSD BTRFS RAID-1
> on a laptop?

As shown in the linked presentation, it provides benefit on a largeish
filesystem and if your writes are touching a lot of different block
groups (since v1 writes out the full space cache for all of them on
every transaction commit).

I'd say, it provides benefit as soon as you encounter filesystem delays
because of it, and as soon as you see using it eases the pain a lot. So,
yes, that's your case.

-- 
Hans van Kranenburg
Re: very poor performance / a lot of writes to disk with space_cache (but not with space_cache=v2)
Hi, On 09/19/2018 10:43 AM, Tomasz Chmielewski wrote: > I have a mysql slave which writes to a RAID-1 btrfs filesystem (with > 4.17.14 kernel) on 3 x ~1.9 TB SSD disks; filesystem is around 40% full. > > The slave receives around 0.5-1 MB/s of data from the master over the > network, which is then saved to MySQL's relay log and executed. In ideal > conditions (i.e. no filesystem overhead) we should expect some 1-3 MB/s > of data written to disk. > > MySQL directory and files in it are chattr +C (since the directory was > created, so all files are really +C); there are no snapshots. > > > Now, an interesting thing. > > When the filesystem is mounted with these options in fstab: > > defaults,noatime,discard > > We can see a *constant* write of 25-100 MB/s to each disk. The system is > generally unresponsive and it sometimes takes long seconds for a simple > command executed in bash to return. Did you already test the difference with/without 'discard'? Also, I think that depending on the tooling that you use to view disk IO, discards will also show up as disk write statistics. > However, as soon as we remount the filesystem with space_cache=v2 - > writes drop to just around 3-10 MB/s to each disk. If we remount to > space_cache - lots of writes, system unresponsive. Again remount to > space_cache=v2 - low writes, system responsive. > > That's a huuge, 10x overhead! Is it expected? Especially that > space_cache=v1 is still the default mount option? Yes, that does not surprise me. https://events.static.linuxfound.org/sites/events/files/slides/vault2016_0.pdf Free space cache v1 is the default because of issues with btrfs-progs, not because it's unwise to use the kernel code. I can totally recommend using it. The linked presentation above gives some good background information. Another thing that's interesting is finding out what kind of things btrfs is writing if it's writing that much MB/s to disk. Doing this is not very trivial. 
I've been spending quite some time researching these kind of issues. Here's what I found out: https://www.spinics.net/lists/linux-btrfs/msg70624.html (oh wow, that's almost a year ago already) There are a bunch of tracepoints in the kernel code that could help debugging all of this more, but I've not yet gotten around to writing something to conveniently to use them to live show what's happening. I'm still using the "Thanks to a bug, solved in [2]" in the above mailing list post way of combining extent allocators in btrfs now to keep things workable on the larger filesystem. -- Hans van Kranenburg
Re: Move data and mount point to subvolume
On 09/18/2018 08:10 PM, Marc Joliet wrote:
> Am Sonntag, 16. September 2018, 14:50:04 CEST schrieb Hans van Kranenburg:
>> The last example, where you make a subvolume and move everything into
>> it, will not do what you want. Since a subvolume is a separate new
>> directory/file hierarchy, mv will turn into a cp and rm operation
>> (without warning you) probably destroying information about data shared
>> between files.
>
> I thought that wasn't true anymore. The NEWS file to coreutils contains
> this (for version 8.24):
>
>   mv will try a reflink before falling back to a standard copy, which is
>   more efficient when moving files across BTRFS subvolume boundaries.

I was wrong when saying that mv will copy/rm between subvolumes indeed,
because you can reflink files between subvolumes, as long as they're
under the same mount point. (You still can NOT between mount points
iirc, even when they are mounts from the same btrfs.)

But still, mv silently does one or the other, which is also confusing,
because if it starts copying/removing things while that was not your
intention, at first you're like "hm, takes longer than I thought" and
then "oh wait, n", and then it's too late already.

-- 
Hans van Kranenburg
Re: can anyone receive this message
On 09/18/2018 02:13 AM, sunny.s.zhang wrote: > > Can anyone receive this message? I found i can receive message from the > mail list, but can't receive message that send by myself. Yes. You can also check public list archives to see if your message shows up, e.g.: https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg80664.html If it does, then the message reached the list, but your own incoming mail server might be throwing away email with your own address in the From, but forwarded via a mailing list? -- Hans van Kranenburg
Re: Move data and mount point to subvolume
On 09/16/2018 02:37 PM, Hans van Kranenburg wrote:
> On 09/16/2018 01:14 PM, Rory Campbell-Lange wrote:
>> Hi
>>
>> We have a backup machine that has been happily running its backup
>> partitions on btrfs (on top of a luks encrypted disks) for a few years.
>>
>> Our backup partition is on /bkp which is a top level subvolume.
>> Data, RAID1: total=2.52TiB, used=1.36TiB
>> There are no other subvolumes.
>>
>> We have /bkp with /bkp/backup in it. We would like to mount /bkp/backup
>> at /bkp instead. Note that /bkp/backup has a lot of hardlinked files.
>>
>> I guess I could do
>>
>> cd /bkp/backup
>> mv * ../
>> rmdir backup
>
> Doing it the other way around is easier, since you don't have to think
> about hardlinked/reflinked/etc/anything while copying data:

Oh wait, I'm stupid, I was reading too quickly. Let me finish my coffee
first.

> btrfs sub snap /bkp /bkp/backup
>
> Now you have the exact identical thing in /backup, and you can start
> throwing away everything outside of /backup.
>
> To reduce chance for accidental errors, you can snapshot with -r to make
> the new /backup read-only first. Then after removing everything outside
> it, make it rw again with:
>
> btrfs property set -ts /bkp/backup ro false
>
>> But would it also be possible to do something like
>>
>> cd /bkp
>> btrfs subvol create backup-subvol
>> mv /bkp/backup/* /bkp/backup-subvol
>> ... then mount /bkp/backup-subvol at /bkp
>>
>> Would this second approach work, and preserve hardlinks?

You essentially want to turn a subdirectory into a subvolume.

The last example, where you make a subvolume and move everything into
it, will not do what you want. Since a subvolume is a separate new
directory/file hierarchy, mv will turn into a cp and rm operation
(without warning you) probably destroying information about data shared
between files.
So, what you can do is (similar to what I earlier typed): * You now have subvol 5 (at /bkp) * btrfs sub snap /bkp /bkp/backup-subvol * Now you have a new subvol (256 or higher number) which already contains everything * Then inside /bkp/backup-subvol you will again see /bkp/backup-subvol/backup (since you snapshotted the toplevel which also contains it) * Now mv /bkp/backup-subvol/backup/* /bkp/backup-subvol (so the mv operation stays within the same subvolume) * Then remove everything outside /bkp/backup-subvol and mv /bkp/backup-subvol /bkp/backup, and then voila... you can now use subvol=/backup to mount it. >> The machine is btrfs-progs v4.7.3 Linux 4.9.0-8-amd64 on Debian. The >> coreutils version is 8.26-3. > > Another note: since it seems you have about 100% more data space > allocated (set apart for data purpose) than you're actually using, or in > other words, having the 1GiB chunks on average just for 50% filled... > >Data, RAID1: total=2.52TiB, used=1.36TiB > > ...in combination with using Linux 4.9, I suspect there's also 'ssd' in > your mount options (not in fstab, but enabled by btrfs while mounting, > see /proc/mounts or mount command output)? > > If so, this is a nice starting point for more info about what might also > be happening to your filesystem: > https://www.spinics.net/lists/linux-btrfs/msg70622.html Have fun, -- Hans van Kranenburg
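[Editorial illustration] The conversion steps above can be written down as a dry-run plan. This sketch only prints the commands and executes nothing; the paths match this thread (/bkp with a backup/ subdirectory inside), so adapt them to your own layout before actually running anything.

```python
#!/usr/bin/python3
# Dry-run sketch of the directory-to-subvolume conversion described above.
# Nothing is executed; the commands are only printed for review.

steps = [
    # snapshot the toplevel subvolume; the snapshot contains backup/ too
    "btrfs sub snap /bkp /bkp/backup-subvol",
    # move contents up one level, staying inside the new subvolume,
    # so mv stays a cheap rename instead of becoming cp + rm
    "mv /bkp/backup-subvol/backup/* /bkp/backup-subvol/",
    "rmdir /bkp/backup-subvol/backup",
    # remove the old copy outside the snapshot, then rename
    "rm -rf /bkp/backup",
    "mv /bkp/backup-subvol /bkp/backup",
]

for step in steps:
    print(step)
```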
Re: Move data and mount point to subvolume
On 09/16/2018 01:14 PM, Rory Campbell-Lange wrote: > Hi > > We have a backup machine that has been happily running its backup > partitions on btrfs (on top of a luks encrypted disks) for a few years. > > Our backup partition is on /bkp which is a top level subvolume. > Data, RAID1: total=2.52TiB, used=1.36TiB > There are no other subvolumes. > > We have /bkp with /bkp/backup in it. We would like to mount /bkp/backup > at /bkp instead. Note that /bkp/backup has a lot of hardlinked files. > > I guess I could do > > cd /bkp/backup > mv * ../ > rmdir backup Doing it the other way around is easier, since you don't have to think about hardlinked/reflinked/etc/anything while copying data: btrfs sub snap /bkp /bkp/backup Now you have the exact identical thing in /backup, and you can start throwing away everything outside of /backup. To reduce chance for accidental errors, you can snapshot with -r to make the new /backup read-only first. Then after removing everything outside it, make it rw again with: btrfs property set -ts /bkp/backup ro false > But would it also be possible to do something like > > cd /bkp > btrfs subvol create backup-subvol > mv /bkp/backup/* /bkp/backup-subvol > ... then mount /bkp/backup-subvol at /bkp > > Would this second approach work, and preserve hardlinks? > > The machine is btrfs-progs v4.7.3 Linux 4.9.0-8-amd64 on Debian. The > coreutils version is 8.26-3. Another note: since it seems you have about 100% more data space allocated (set apart for data purpose) than you're actually using, or in other words, having the 1GiB chunks on average just for 50% filled... Data, RAID1: total=2.52TiB, used=1.36TiB ...in combination with using Linux 4.9, I suspect there's also 'ssd' in your mount options (not in fstab, but enabled by btrfs while mounting, see /proc/mounts or mount command output)? 
If so, this is a nice starting point for more info about what might also be happening to your filesystem: https://www.spinics.net/lists/linux-btrfs/msg70622.html -- Hans van Kranenburg
Re: state of btrfs snapshot limitations?
On 09/15/2018 05:56 AM, James A. Robinson wrote: > [...] > > I've got to read up a bit more on subvolumes, I am missing some > context from the warnings given by Chris regarding per-subvolume > options. Btrfs lets you mount the filesystem multiple times, e.g. with a different subvolume id, so you can mount part of the filesystem somewhere. Some of the mount options (many btrfs-specific ones, like space_cache*, autodefrag, etc.) get a value when doing the first mount, and subsequent ones cannot change them any more, because they're filesystem-wide behavior. Some others can be changed on each individual mount (like the atime options), and when omitting them you get the non-optimal default again. -- Hans van Kranenburg
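[Editorial illustration] Because btrfs may add or keep options you did not ask for, /proc/mounts (not fstab) shows what is actually in effect. A small sketch of checking that; the sample string stands in for reading the real /proc/mounts, and the device and mountpoint are made-up examples.

```python
#!/usr/bin/python3
# Sketch: report the btrfs mount options actually in effect for a given
# mountpoint, parsing text in /proc/mounts format.

def effective_options(mounts_text, mountpoint):
    """Return the option list for a btrfs mount, or None if not found."""
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mountpoint and fields[2] == 'btrfs':
            return fields[3].split(',')
    return None

# In real use: sample = open('/proc/mounts').read()
sample = "/dev/sda1 /bkp btrfs rw,noatime,ssd,space_cache,subvolid=5,subvol=/ 0 0"
print(effective_options(sample, '/bkp'))
```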
Re: state of btrfs snapshot limitations?
Hi, On 09/14/2018 11:05 PM, James A. Robinson wrote: > The mail archive seems to indicate this list is appropriate > for not only the technical coding issues, but also for user > questions, so I wanted to pose a question here. If I'm > wrong about that, I apologize in advance. It's fine. Your observation is correct. > The page > > https://btrfs.wiki.kernel.org/index.php/Incremental_Backup > > talks about the basic snapshot capabilities of btrfs and led > me to look up what, if any, limits might apply. I find some > threads from a few years ago that talk about limiting the > number of snapshots for a volume to 100. > > The reason I'm curious is I wanted to try and use the > snapshot capability as a way of keeping a 'history' of a > backup volume I maintain. The backup doesn't change a > lot over time, but small changes are made to files within > it daily. The 100 above is just a number because users ask "ok, but *how* many?". As far as I know, the real thing that causes complexity for the filesystem is how many changes are being made to the subvolume all the time after it has been snapshotted. Creating btrfs snapshots is cheap. As soon as you start making modifications, the subvolume in which you make the changes is going to diverge from the other ones which share the same history. And changes mean not only changes to data (changing, adding, removing files), but also pure metadata changes (e.g. using the touch command on a file). When just using the snapshots, opening and reading files etc., this should not be a big problem. But other btrfs-specific actions are affected, like balance, device remove, and using quota. In any case, make sure: - you are not using quota / qgroups (highly affected by this sort of complexity) - you *always* mount with noatime (which is not the default, and yes, noatime, not relatime or anything else) to prevent unnecessary changes to metadata which cause exactly this kind of complexity to happen.
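[Editorial illustration] A hypothetical fstab entry following the noatime advice above; the UUID, mountpoint and subvolume name are placeholders, not taken from this thread.

```
# /etc/fstab -- hypothetical example, all values are placeholders
# noatime prevents metadata writes on every read of a snapshotted file
UUID=<filesystem-uuid>  /mnt/backups  btrfs  noatime,subvol=backups  0  0
```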
When doing this, and not having to use btrfs balance and add/remove disks, and if the data doesn't change much over time (especially if it's just adding new stuff all the time), you are likely able to have far more snapshots of the thing. > The Plan 9 OS has a nice archival filesystem that lets you > easily maintain snapshots, and has various tools that make > it simple to keep a /snapshot//mmdd snapshot going back > for the life of the filesystem. > > I wanted to try and replicate the basic functionality of > that history using a non-plan-9 filesystem. At first I > tried rsnapshot but I find its technique of rotating and > deleting backups is thrashing the disks to the point that it > can't keep up with the rotations (the cp -al is fast, but > the periodic rm -rf of older snapshots kills the disk). Yes, btrfs snapshots are already a huge improvement compared to that. (Also, cp -l causes modifications to also be done in the "snapshots", because it's still the same file.) > With btrfs I was thinking perhaps I could more efficiently > maintain the archive of changes over time using a snapshot. > If this is an awful thought and I should just go away, > please let me know. > > If the limit is 100 or less I'd need to use a more complicated > rotation scheme. For example with a layout like the > following: > > min/ > hour/ > day/ > month/ > year/ > > The idea being each bucket, min, hour, day, month, would > be capped and older snapshots would be removed and replaced > with newer ones over time. > > so with a 15-minute snapshot cycle I'd end up with > > min/[00,15,30,45] > hour/[00-23] > day/[01-31] > month/[01-12] > year/[2018,2019,...] > > (72+ snapshots with room for a few years' worth of yearlies). > > But if things have changed with btrfs over the past few > years and number of snapshots scales much higher, I would > use the easier scheme: > > /min/[00,15,30,45] > /hourly/[00-23] > /daily// > > with 365 snapshots added per additional year.
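[Editorial illustration] The min/hour/day/month/year rotation described above can be modeled as a pure function that, given snapshot timestamps, decides which ones a rotation pass keeps. This is only an illustration of the bucket idea, not how btrbk or any particular tool implements retention.

```python
#!/usr/bin/python3
# Illustrative model of the min/hour/day/month/year rotation above: keep
# the newest snapshot per time slot, limited per bucket, expire the rest.

from datetime import datetime, timedelta

def rotation_keep(snapshots, minutes=4, hours=24, days=31, months=12):
    """Return the set of snapshot datetimes a rotation pass would keep."""
    buckets = [
        (minutes, lambda t: (t.year, t.month, t.day, t.hour, t.minute // 15)),
        (hours,   lambda t: (t.year, t.month, t.day, t.hour)),
        (days,    lambda t: (t.year, t.month, t.day)),
        (months,  lambda t: (t.year, t.month)),
        (None,    lambda t: (t.year,)),   # yearly snapshots never expire
    ]
    keep = set()
    for limit, slot in buckets:
        newest = {}
        for snap in sorted(snapshots, reverse=True):
            newest.setdefault(slot(snap), snap)   # newest snapshot per slot
        slots = sorted(newest, reverse=True)
        keep.update(newest[s] for s in (slots if limit is None else slots[:limit]))
    return keep

# Two days of snapshots every 15 minutes:
snaps = [datetime(2018, 9, 15) + timedelta(minutes=15 * i) for i in range(192)]
kept = rotation_keep(snaps)
print(len(snaps), len(kept))
```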
There are tools available that can do this for you. The one I use is btrbk, https://github.com/digint/btrbk (probably packaged in your favorite linux distro). I'd say, just try it. Add a snapshot schedule in your btrbk config, and set it to never expire older ones. Then, just see what happens, and only if you start seeing things slow down a lot, start worrying about what to do, and let us know how far you got. Have fun, P.S. Here's an unfinished page from a tutorial that I'm writing that is still heavily under construction, which touches the subject of snapshotting data and metadata. Maybe it might help to explain "complexity starts when changing things" more: https://github.com/knorrie/python-btrfs/blob/tutorial/tutorial/cows.md -- Hans van Kranenburg
Re: How to ensure that a snapshot is not corrupted?
On 08/10/2018 12:07 PM, Cerem Cem ASLAN wrote: > Original question is here: https://superuser.com/questions/1347843 > > How can we sure that a readonly snapshot is not corrupted due to a disk > failure? > > Is the only way calculating the checksums one on another and store it > for further examination, or does BTRFS handle that on its own? It's no different than any other data stored in your filesystem. So when just reading things from the snapshot, or when using the btrfs scrub functionality, it will tell you if data that is read back matches the checksums. -- Hans van Kranenburg
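[Editorial illustration] What scrub does amounts to recomputing checksums for data read back from disk and comparing them to the stored ones. A toy model of that check; zlib.crc32 here is only a stand-in for the crc32c checksum btrfs actually uses, and none of this is btrfs code.

```python
#!/usr/bin/python3
# Toy model of checksum verification: compare stored checksums against
# checksums recomputed from the data read back from disk.

import zlib

def scrub(blocks, stored_csums):
    """Yield indexes of blocks whose recomputed checksum mismatches."""
    for i, block in enumerate(blocks):
        if zlib.crc32(block) != stored_csums[i]:
            yield i

blocks = [b'good data', b'also good', b'bit rot here']
stored = [zlib.crc32(b) for b in (b'good data', b'also good', b'original data')]
print(list(scrub(blocks, stored)))  # block 2 no longer matches its checksum
```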
Re: trouble mounting btrfs filesystem....
On 08/14/2018 07:09 PM, Andrei Borzenkov wrote: > 14.08.2018 18:16, Hans van Kranenburg writes: >> On 08/14/2018 03:00 PM, Dmitrii Tcvetkov wrote: >>>> Scott E. Blomquist writes: >>>> > Hi All, >>>> > >>>> > [...] >>> >>> I'm not a dev, just user. >>> btrfs-zero-log is for very specific case[1], not for transid errors. >>> Transid errors mean that some metadata writes are missing, if >>> they prevent you from mounting filesystem it's pretty much fatal. If >>> btrfs could recover metadata from good copy it'd have done that. >>> >>> "wanted 2159304 found 2159295" means that some metadata is stale by >>> 9 commits. You could try to mount it with "ro,usebackuproot" mount >>> options as readonly mount is less strict. If that works you can try >>> "usebackuproot" without ro option. But 9 commits is probably too much >>> and there isn't enough data to rollback so far. >> >> Keep in mind that a successful mount with usebackuproot does not mean >> you're looking at a consistent filesystem. After each transaction >> commit, disk space that is no longer referenced is immediately freed for >> reuse. >> > > With all respect - in this case btrfs should not even pretend it can do > "usebackuproot". What is this option for then? I have no idea. As the Dutch say: "schiet mij maar in de kerstboom" (roughly: "beats me"). The commit that introduces it contains no explanation at all. The only thing I can think of is when you're a developer and writing buggy code that writes out corrupted tree root blocks, which makes the filesystem crash in some way directly after a transaction ends while no other new writes are done again yet, combined with getting fed up with having to mkfs and put your testdata back all the time. >> So, even if you can mount with usebackuproot, you have to hope that none >> of the metadata blocks that were used back then have been overwritten >> already, even the ones in distant corners of trees. A full check / scrub >> / etc would be needed to find out. -- Hans van Kranenburg
Re: [PATCH] Btrfs: correctly calculate item size used when item key collision happens
ndex 452 namelen 26 name: abcdefghijklmnopqrstuv0451 > > Signed-off-by: ethanwu > --- > fs/btrfs/ctree.c | 15 +-- > fs/btrfs/extent-tree.c | 5 +++-- > fs/btrfs/file-item.c | 2 +- > 3 files changed, 17 insertions(+), 5 deletions(-) > > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c > index 4bc326d..3614dd9 100644 > --- a/fs/btrfs/ctree.c > +++ b/fs/btrfs/ctree.c > @@ -2678,8 +2678,8 @@ static struct extent_buffer > *btrfs_search_slot_get_root(struct btrfs_root *root, > * @p: Holds all btree nodes along the search path > * @root:The root node of the tree > * @key: The key we are looking for > - * @ins_len: Indicates purpose of search, for inserts it is 1, for > - * deletions it's -1. 0 for plain searches > + * @ins_len: Indicates purpose of search, for inserts it is item size > + * including btrfs_item, for deletions it's -1. 0 for plain > searches > * @cow: boolean should CoW operations be performed. Must always be 1 > * when modifying the tree. > * > @@ -2893,6 +2893,17 @@ int btrfs_search_slot(struct btrfs_trans_handle > *trans, struct btrfs_root *root, > } > } else { > p->slots[level] = slot; > + /* > + * item key collision happens. In this case, if we are > allow > + * to insert the item(for example, in dir_item case, > item key > + * collision is allowed), it will be merged with the > original > + * item. Only the item size grows, no new btrfs item > will be > + * added. Since the ins_len already accounts the size > btrfs_item, > + * this value is counted twice. Duduct this value here > so the > + * leaf space check will be correct. 
> + */ > + if (ret == 0 && ins_len > 0) > + ins_len -= sizeof(struct btrfs_item); > if (ins_len > 0 && > btrfs_leaf_free_space(fs_info, b) < ins_len) { > if (write_lock_level < 1) { > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 3d9fe58..4e3aa09 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -1094,7 +1094,7 @@ static int convert_extent_item_v0(struct > btrfs_trans_handle *trans, > > new_size -= sizeof(*ei0); > ret = btrfs_search_slot(trans, root, , path, > - new_size + extra_size, 1); > + new_size + extra_size + sizeof(struct > btrfs_item), 1); > if (ret < 0) > return ret; > BUG_ON(ret); /* Corruption */ > @@ -1644,7 +1644,8 @@ int lookup_inline_extent_backref(struct > btrfs_trans_handle *trans, > } > > again: > - ret = btrfs_search_slot(trans, root, , path, extra_size, 1); > + ret = btrfs_search_slot(trans, root, , path, > + extra_size + sizeof(struct btrfs_item), 1); > if (ret < 0) { > err = ret; > goto out; > diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c > index f9dd6d1..0b6c581 100644 > --- a/fs/btrfs/file-item.c > +++ b/fs/btrfs/file-item.c > @@ -804,7 +804,7 @@ int btrfs_csum_file_blocks(struct btrfs_trans_handle > *trans, >*/ > btrfs_release_path(path); > ret = btrfs_search_slot(trans, root, _key, path, > - csum_size, 1); > + csum_size + sizeof(struct btrfs_item), 1); > if (ret < 0) > goto fail_unlock; > > -- Hans van Kranenburg
Re: trouble mounting btrfs filesystem....
On 08/14/2018 03:00 PM, Dmitrii Tcvetkov wrote: >> Scott E. Blomquist writes: >> > Hi All, >> > >> > [...] > > I'm not a dev, just user. > btrfs-zero-log is for very specific case[1], not for transid errors. > Transid errors mean that some metadata writes are missing, if > they prevent you from mounting filesystem it's pretty much fatal. If > btrfs could recover metadata from good copy it'd have done that. > > "wanted 2159304 found 2159295" means that some metadata is stale by > 9 commits. You could try to mount it with "ro,usebackuproot" mount > options as readonly mount is less strict. If that works you can try > "usebackuproot" without ro option. But 9 commits is probably too much > and there isn't enough data to rollback so far. Keep in mind that a successful mount with usebackuproot does not mean you're looking at a consistent filesystem. After each transaction commit, disk space that is no longer referenced is immediately freed for reuse. So, even if you can mount with usebackuproot, you have to hope that none of the metadata blocks that were used back then have been overwritten already, even the ones in distant corners of trees. A full check / scrub / etc would be needed to find out. -- Hans van Kranenburg
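[Editorial illustration] The "stale by 9 commits" reading of the error above comes from subtracting the found transid from the wanted one. A trivial sketch of pulling that number out of a dmesg line:

```python
#!/usr/bin/python3
# Parse a "parent transid verify failed" line and report how many
# transaction commits the metadata copy is behind, as discussed above.

import re

def transid_lag(logline):
    """Return wanted - found from a transid error line, or None."""
    m = re.search(r'wanted (\d+) found (\d+)', logline)
    if m is None:
        return None
    return int(m.group(1)) - int(m.group(2))

line = ("BTRFS error (device sda1): parent transid verify failed on "
        "177443463479296 wanted 2159304 found 2159295")
print(transid_lag(line))  # -> 9
```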
Re: trouble mounting btrfs filesystem....
On 08/12/2018 09:19 PM, Scott E. Blomquist wrote: > > Hi All, > > Early this morning there was a power glitch that affected our system. > > The second enclosure went offline but the file system stayed up for a > bit before rebooting and recovering the 2 missing arrays sdb1 and > sdc1. > > When mounting we get > > Aug 12 14:52:43 localhost kernel: [ 8536.649270] BTRFS info (device > sda1): has skinny extents > Aug 12 14:54:52 localhost kernel: [ 8665.900321] BTRFS error (device > sda1): parent transid verify failed on 177443463479296 wanted 2159304 found > 2159295 > Aug 12 14:54:52 localhost kernel: [ 8665.985512] BTRFS error (device > sda1): parent transid verify failed on 177443463479296 wanted 2159304 found > 2159295 > Aug 12 14:54:52 localhost kernel: [ 8666.056845] BTRFS error (device > sda1): failed to read block groups: -5 > Aug 12 14:54:52 localhost kernel: [ 8666.254178] BTRFS error (device > sda1): open_ctree failed > > We are here... > > # uname -a > Linux localhost 4.17.14-custom #1 SMP Sun Aug 12 11:54:00 EDT 2018 x86_64 > x86_64 x86_64 GNU/Linux > > # btrfs --version > btrfs-progs v4.17.1 > > # btrfs filesystem show > Label: none uuid: 8337c837-58cb-430a-a929-7f6d2f50bdbb > Total devices 3 FS bytes used 75.05TiB > devid1 size 47.30TiB used 42.07TiB path /dev/sda1 > devid2 size 21.83TiB used 16.61TiB path /dev/sdb1 > devid3 size 21.83TiB used 16.61TiB path /dev/sdc1 What kind of devices are these? You say enclosure... is it a bunch of disks doing its own RAID, with btrfs on top? Do you have RAID1 metadata on top of that, or single? At least if you go the mkfs route (I read the other replies) then also find out what happened. If your storage is losing data in situations like this while it told btrfs that the data was safe, you're running a dangerous operation. -- Hans van Kranenburg
Re: [RFC PATCH 00/17] btrfs zoned block device support
On 08/10/2018 09:28 AM, Qu Wenruo wrote: > > > On 8/10/18 2:04 AM, Naohiro Aota wrote: >> This series adds zoned block device support to btrfs. >> >> [...] > > And this is the patch modifying extent allocator. > > Despite that, the support zoned storage looks pretty interesting and > have something in common with planned priority-aware extent allocator. Priority-aware allocator? Is someone actually working on that, or is it planned like everything is 'planned' (i.e. nice idea, and might happen or might as well not happen ever, SIYH)? -- Hans van Kranenburg
Re: Expected Received UUID
Hi, Can you please report the kernel versions you are using on every system (uname -a)? I would indeed expect the Received UUID value for C to have the same uuid as the original UUID of A, so the b00e8ba1 one. The send part, generating the send stream, is done by the running kernel. The receive part is done by the userspace btrfs-progs. The btrfs-progs receive code just sets the Received UUID to whatever it reads from the send stream information. So, your sending kernel version is important here. When looking up the lines that send the UUID, in fs/btrfs/send.c, I can see there was a problem like this in the past: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b96b1db039ebc584d03a9933b279e0d3e704c528 It was introduced between Linux 4.1 and 4.2 and solved in 2015 with that commit, which ended up somewhere in 4.3 I think. From the btrfs-progs versions you list, I suspect this is a Debian Jessie and 2x Debian Stretch, but the Debian 3.16 and 4.9 kernels should not have this issue. On 07/31/2018 07:42 PM, Gaurav Sanghavi wrote: > Hello everyone, > > While researching BTRFS for a project that involves backing up snapshots > from device A to device B > before consolidating backups from device B to device C, I noticed that > received UUID on snapshot on > device C is not the same as received UUID for the same snapshot on > device B. Here are my steps: > > 1. Device A > BTRFS version: v3.17 > > btrfs su sn -r /home/test/snapshot1 /home/test/snaps/snapshot1 > btrfs su show /home/test/snaps/snapshot1 > Name: snapshot1 > uuid: b00e8ba1-5aaa-3641-9c4c-e168eee5c296 > Parent uuid: cb570dec-e9fd-1f40-99d2-2070c8ee2466 > Received UUID: --- > Creation time: 2018-07-30 18:32:37 > Flags: readonly > > 2.
Send to Device B > btrfs send /home/test/snaps/snapshot1 | ssh 'btrfs receive > /home/backups/' > > After send completes, on Device B > BTRFS version: v4.7.3 > btrfs su show /home/backups/snapshot1 > Name: snapshot1 > UUID: 7c13d189-7fee-584e-ac90-e68cb0012f5c > Parent UUID: a2314f7c-4b11-ed40-901f-f1acb5ebf802 > Received UUID: b00e8ba1-5aaa-3641-9c4c-e168eee5c296 > Creation time: 2018-07-30 18:42:37 -0700 > Flags: readonly > > > 3. Send to Device C > btrfs send /home/backups/snapshot1 | ssh 'btrfs receive > /home/backups2/' > > After send completes, on Device C > BTRFS version: v4.7.3 > btrfs su show /home/backups2/snapshot1 > Name: snapshot1 > UUID: 8a13aab5-8e44-2541-9082-bc583933b964 > Parent UUID: 54e9b4ff-46dc-534e-b70f-69eb7bb21028 > Received UUID: 7c13d189-7fee-584e-ac90-e68cb0012f5c > Creation time: 2018-07-30 18:58:32 -0700 > Flags: readonly > > 1. I have gone through some of the archived emails and have noticed > people mentioning that > if received UUID is set, btrfs send propagates this 'received UUID'. But > in the above case, > it's different for the same snapshot on all three devices. Is this the > expected behavior ? > > 2. We want to be able to start backing up from Device A to C, in case B > goes down or needs > to be replaced. If received UUID is expected to differ for the snapshot > on device B and C, incremental > backups will not work from A to C without setting received UUID. I have > seen python-btrfs > mentioned in a couple of emails; but has any of you used it in a > production environment ? > > This is my first post to this email. Please let me know if I am missing > any details. > > Thanks, > Gaurav > -- Hans van Kranenburg -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
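[Editorial illustration] The propagation question in this thread can be modeled in a few lines. The assumption here (based on the reply above and the referenced commit) is that a fixed sender advertises the source's received_uuid when it is set, so every copy in a chain points back to the original, while the buggy behavior always advertises the subvolume's own uuid. The uuids are shortened forms of the ones in the thread; the class and functions are hypothetical, not btrfs code.

```python
#!/usr/bin/python3
# Model of received_uuid propagation through a send/receive chain A -> B -> C.

class Subvol:
    def __init__(self, uuid, received_uuid=None):
        self.uuid = uuid
        self.received_uuid = received_uuid

def send(src, fixed=True):
    """Return the uuid the send stream advertises for this subvolume."""
    if fixed and src.received_uuid is not None:
        return src.received_uuid   # fixed kernels point back to the original
    return src.uuid                # buggy behavior: always the own uuid

def receive(stream_uuid, new_uuid):
    return Subvol(new_uuid, received_uuid=stream_uuid)

a = Subvol('b00e8ba1')                        # original snapshot on device A
b = receive(send(a), '7c13d189')              # A -> B
c = receive(send(b), '8a13aab5')              # B -> C
print(b.received_uuid, c.received_uuid)       # both point back to A

c_buggy = receive(send(b, fixed=False), '8a13aab5')
print(c_buggy.received_uuid)                  # B's own uuid, as in the thread
```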
Re: Unmountable root partition
Hi, On 07/31/2018 08:03 PM, Cerem Cem ASLAN wrote: > Hi, > > I'm having trouble with my server setup, which contains a BTRFS root > partition on top of LVM on top of LUKS partition. > > Yesterday server was shut down unexpectedly. I booted the system with a > pendrive which contains Debian 4.9.18 and tried to mount the BTRFS root > partition manually. > > 1. cryptsetup luksOpen /dev/sda5 > > Seems to decrypt the partition (there are no errors) > > 2. /dev/mapper/foo--vg-root and /dev/mapper/foo--vg-swap_1 is created > automatically, so I suppose LVM works correctly. > > 3. mount -t btrfs /dev/mapper/foo--vg-root /mnt/foo > Gives the following error: > > mount: wrong fs type, bad option, bad superblock on ... > > 4. dmesg | tail > Outputs the following: > > > [17755.840916] sd 3:0:0:0: [sda] tag#0 FAILED Result: > hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK > [17755.840919] sd 3:0:0:0: [sda] tag#0 CDB: Read(10) 28 00 00 07 c0 > 02 00 00 02 00 > [17755.840921] blk_update_request: I/O error, dev sda, sector 507906 > [17755.840941] EXT4-fs (dm-4): unable to read superblock > [18140.052300] sd 3:0:0:0: [sda] tag#0 FAILED Result: > hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK > [18140.052305] sd 3:0:0:0: [sda] tag#0 CDB: ATA command pass > through(16) 85 06 20 00 00 00 00 00 00 00 00 00 00 00 e5 00 > [18142.991851] sd 3:0:0:0: [sda] tag#0 FAILED Result: > hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK > [18142.991856] sd 3:0:0:0: [sda] tag#0 CDB: Read(10) 28 00 00 07 c0 > 80 00 00 08 00 > [18142.991860] blk_update_request: I/O error, dev sda, sector 508032 > [18142.991869] Buffer I/O error on dev dm-4, logical block 16, async > page read This points at hardware level failure, regardless if the disk would hold a btrfs or ext4 or whatever other kind of filesystem. The disk itself cannot read data back when we ask for it. > 4. 
> > # btrfs restore -i -D /dev/mapper/foo--vg-root /dev/null > No valid Btrfs found on /dev/mapper/foo--vg-root > Could not open root, trying backup super > No valid Btrfs found on /dev/mapper/foo--vg-root > Could not open root, trying backup super > No valid Btrfs found on /dev/mapper/foo--vg-root > Could not open root, trying backup super > > We are pretty sure that no unexpected electric cuts has been happened. > > At this point I don't know what information I should supply. > -- Hans van Kranenburg
Re: python-btrfs & btrfs-heatmap ebuilds now available for Gentoo
On 07/23/2018 04:25 PM, Holger Hoffstätte wrote: > > I wanted to migrate my collection of questionable shell scripts for btrfs > maintenance/inspection to a more stable foundation and therefore created > Gentoo ebuilds for Hans van Kranenburg's excellent python-btrfs [1] and > btrfs-heatmap [2] packages. > > They can be found in my overlay at: > > https://github.com/hhoffstaette/portage/tree/master/sys-fs/python-btrfs > https://github.com/hhoffstaette/portage/tree/master/sys-fs/btrfs-heatmap Thanks for doing this! And now you're making us curious about those not-so-questionable-any-more scripts. :D > Both are currently only available for python-3.6 since a change in 3.7 > broke the existing python API [3]. As soon as that is fixed I'll add 3.7 > to the PYTHON_COMPAT entries. Yes, thanks for testing, I'll have a look at it and will probably also ask you to help testing the fix. > btrfs-heatmap properly uses the python-exec wrapper and therefore works > regardless of currently selected default python version. :) > > I hope this is useful to someone. I bet it will. ;-] -- Hans van Kranenburg
Re: btrfs balance did not progress after 12H, hang on reboot, btrfs check --repair kills the system still
extent [147895111680 12345344] > ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, > owner: 374857, offset: 192200704) wanted: 172, have: 251 > Delete backref in extent [147895111680 12345344] > ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, > owner: 374857, offset: 217653248) wanted: 348, have: 418 > Delete backref in extent [150850146304 17522688] > ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, > owner: 374857, offset: 235175936) wanted: 555, have: 1449 > Deleted root 2 item[156909494272, 178, 5476627808561673095] > ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, > owner: 374857, offset: 235175936) wanted: 556, have: 1452 > Deleted root 2 item[156909494272, 178, 7338474132555182983] > > At the rate it's going, it'll probably take days though, it's already been 36H -- Hans van Kranenburg
Re: [PATCH] btrfs: Add more details while checking tree block
On 06/22/2018 06:25 PM, Nikolay Borisov wrote: > > > On 22.06.2018 19:17, Su Yue wrote: >> >> >> >>> Sent: Friday, June 22, 2018 at 11:26 PM >>> From: "Hans van Kranenburg" >>> To: "Nikolay Borisov" , "Su Yue" >>> , linux-btrfs@vger.kernel.org >>> Subject: Re: [PATCH] btrfs: Add more details while checking tree block >>> >>> On 06/22/2018 01:48 PM, Nikolay Borisov wrote: >>>> >>>> >>>> On 22.06.2018 04:52, Su Yue wrote: >>>>> For easier debug, print eb->start if level is invalid. >>>>> Also make print clear if bytenr found is not expected. >>>>> >>>>> Signed-off-by: Su Yue >>>>> --- >>>>> fs/btrfs/disk-io.c | 8 >>>>> 1 file changed, 4 insertions(+), 4 deletions(-) >>>>> >>>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c >>>>> index c3504b4d281b..a90dab84f41b 100644 >>>>> --- a/fs/btrfs/disk-io.c >>>>> +++ b/fs/btrfs/disk-io.c >>>>> @@ -615,8 +615,8 @@ static int btree_readpage_end_io_hook(struct >>>>> btrfs_io_bio *io_bio, >>>>> >>>>> found_start = btrfs_header_bytenr(eb); >>>>> if (found_start != eb->start) { >>>>> - btrfs_err_rl(fs_info, "bad tree block start %llu %llu", >>>>> - found_start, eb->start); >>>>> + btrfs_err_rl(fs_info, "bad tree block start want %llu have >>>>> %llu", >>>> >>>> nit: I'd rather have the want/have in brackets (want %llu have% llu) >>> >>> From a user support point of view, this text should really be improved. >>> There are a few places where 'want' and 'have' are reported in error >>> strings, and it's totally unclear what they mean. >>> >> Yes. The strings are too concise for users to understand errors. >> Developers always prefer to avoid "unnecessary" words in logs. >> >>> Intuitively I'd say when checking a csum, the "want" would be what's on >>> disk now, since you want that to be correct, and the "have" would be >>> what you have calculated, but it's actually the other way round, or >>> wasn't it? Or was it? >> For csum, you are right at all. >> >> In this situation, IIRC bytenr of "have" is read from disk. 
>> Bytenr of "want" is assigned while constructing eb, from parent ptr > , root item and other things which are also read from disk. >>> >>> Every time someone pastes such a message when we help on IRC for >>> example, there's confusion, and I have to look up the source again, >>> because I always forget. >> >> Me too. >>> >>> What about (%llu stored on disk, %llu calculated now) or something similar? > > What about expected/got? No, that's just as horrible as want and found. Did you 'expect' the same checksum to be on disk as what you 'get' when calculating one? Or did you 'expect' the same thing after calculation as what you 'got' when reading from disk? /o\ >>> >> "(%llu stored on disk " part sounds good to me. But I don't know how to >> describe the second part. May the good sleep help me. >> >> Thanks, >> Su >>>> >>>>> + eb->start, found_start); >>>>> ret = -EIO; >>>>> goto err; >>>>> } >>>>> @@ -628,8 +628,8 @@ static int btree_readpage_end_io_hook(struct >>>>> btrfs_io_bio *io_bio, >>>>> } >>>>> found_level = btrfs_header_level(eb); >>>>> if (found_level >= BTRFS_MAX_LEVEL) { >>>>> - btrfs_err(fs_info, "bad tree block level %d", >>>>> - (int)btrfs_header_level(eb)); >>>>> + btrfs_err(fs_info, "bad tree block level %d on %llu", >>>>> + (int)btrfs_header_level(eb), eb->start); >>>>> ret = -EIO; >>>>> goto err; >>>>> } >>>>> >>> >>> -- >>> Hans van Kranenburg >>> >> -- Hans van Kranenburg
Re: [PATCH] btrfs: Add more details while checking tree block
On 06/22/2018 01:48 PM, Nikolay Borisov wrote: > > > On 22.06.2018 04:52, Su Yue wrote: >> For easier debug, print eb->start if level is invalid. >> Also make print clear if bytenr found is not expected. >> >> Signed-off-by: Su Yue >> --- >> fs/btrfs/disk-io.c | 8 >> 1 file changed, 4 insertions(+), 4 deletions(-) >> >> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c >> index c3504b4d281b..a90dab84f41b 100644 >> --- a/fs/btrfs/disk-io.c >> +++ b/fs/btrfs/disk-io.c >> @@ -615,8 +615,8 @@ static int btree_readpage_end_io_hook(struct >> btrfs_io_bio *io_bio, >> >> found_start = btrfs_header_bytenr(eb); >> if (found_start != eb->start) { >> -btrfs_err_rl(fs_info, "bad tree block start %llu %llu", >> - found_start, eb->start); >> +btrfs_err_rl(fs_info, "bad tree block start want %llu have >> %llu", > > nit: I'd rather have the want/have in brackets (want %llu have% llu) From a user support point of view, this text should really be improved. There are a few places where 'want' and 'have' are reported in error strings, and it's totally unclear what they mean. Intuitively I'd say when checking a csum, the "want" would be what's on disk now, since you want that to be correct, and the "have" would be what you have calculated, but it's actually the other way round, or wasn't it? Or was it? Every time someone pastes such a message when we help on IRC for example, there's confusion, and I have to look up the source again, because I always forget. What about (%llu stored on disk, %llu calculated now) or something similar?
> >> + eb->start, found_start); >> ret = -EIO; >> goto err; >> } >> @@ -628,8 +628,8 @@ static int btree_readpage_end_io_hook(struct >> btrfs_io_bio *io_bio, >> } >> found_level = btrfs_header_level(eb); >> if (found_level >= BTRFS_MAX_LEVEL) { >> -btrfs_err(fs_info, "bad tree block level %d", >> - (int)btrfs_header_level(eb)); >> +btrfs_err(fs_info, "bad tree block level %d on %llu", >> + (int)btrfs_header_level(eb), eb->start); >> ret = -EIO; >> goto err; >> } >> -- Hans van Kranenburg
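For illustration, the rewording proposed in this thread could look like the sketch below (a toy Python model, not the kernel code, which would use btrfs_err_rl with %llu; the sample numbers are made up):

```python
def bad_tree_block_start_msg(stored_on_disk, calculated_now):
    # The proposal from this thread: name the origin of each value
    # instead of the ambiguous "want %llu have %llu".
    return ("bad tree block start (%d stored on disk, %d calculated now)"
            % (stored_on_disk, calculated_now))

# found_start (read from the header on disk) vs eb->start (where the
# block was expected to be):
print(bad_tree_block_start_msg(30408704, 30425088))
```

Whichever value ends up printed first, naming the source of each number removes the need to look up the kernel source to interpret the message.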
Re: About more loose parameter sequence requirement
On 06/18/2018 03:05 PM, Qu Wenruo wrote: > > > On 2018年06月18日 20:02, David Sterba wrote: >> On Mon, Jun 18, 2018 at 07:40:44PM +0800, Qu Wenruo wrote: >>> >>> >>> On 2018年06月18日 19:34, David Sterba wrote: >>>> On Thu, Jun 14, 2018 at 03:17:45PM +0800, Qu Wenruo wrote: >>>>> I understand that btrfs-progs introduced a strict parameter/option order >>>>> to distinguish global and sub-command parameter/option. >>>>> >>>>> However it's really annoying if one just wants to append some new options >>>>> to a previous command: >>>>> >>>>> E.g. >>>>> # btrfs check /dev/data/btrfs >>>>> # !! --check-data-csum >>>>> >>>>> The last command will fail as current btrfs-progs doesn't allow any >>>>> option after a parameter. >>>>> >>>>> >>>>> Despite the requirement to distinguish global and subcommand >>>>> option/parameter, is there any other requirement for such a strict >>>>> option-first-parameter-last policy? >>>> >>>> I'd say that it's a common and recommended pattern. Getopt is able to >>>> reorder the parameters so mixed options and non-options are accepted, >>>> unless POSIXLY_CORRECT (see man getopt(3)) is set. With the more >>>> strict requirement, the 'btrfs' option parser works the same regardless of >>>> that. >>> >>> But currently it doesn't work. >>> Just as the example, btrfs check /dev/data/btrfs --check-data-csum won't >>> work. >>> It's different from a lot of other common programs. >> >> With POSIXLY_CORRECT set, it expectedly won't work. With POSIXLY_CORRECT >> unset, it would work in general, but not for 'btrfs'. >> >> As this is a per-application decision I find it ok, besides that I also >> find the 'options anywhere' pattern bad. Does it really save you that >> much time while typing the commands with new arguments? > > Personally speaking, yes. > >> There are >> movement shortcuts to jump by words, you get the previous command by >> up-arrow. About the same number of keystrokes. > > Not really, normally just "!! 
", where I don't even need to > move my fingers to arrow keys. > (I'm all ears about extra bash shortcuts on this) It's called `set -o vi` ;-] If you remap capslock to be escape, there's not much moving around of fingers going on. CAPS-k-b-b-a and type your new option. No reaching to shift and 1 necessary at all. :] >> >> New code needs to be tested, documented and maintained, that's the cost >> I find too high for something that's convenience for people who are used >> to some shell builtins. > > The biggest problem is, the behavior isn't even consistent across > btrfs-progs. > mkfs.btrfs accepts such out-of-order parameters while btrfs does not. > > And most common tools, like commands provided by coreutils, don't > care about the order. > The only daily exception is 'scp', which I find pretty unhandy. > > And just as Paul and Hugo, I think there are quite some users preferring > out-of-order parameters/options. > > > I also understand the maintenance burden, but at least let's see if > there is a better solution for this. > > Thanks, > Qu > -- Hans van Kranenburg
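The getopt reordering David refers to can be demonstrated with Python's getopt module, which exposes both behaviors (the option names here just mirror the example from the thread):

```python
import getopt
import os

# Make sure GNU-style behavior is active; setting POSIXLY_CORRECT
# would disable the reordering in gnu_getopt too.
os.environ.pop('POSIXLY_CORRECT', None)

argv = ['check', '/dev/data/btrfs', '--check-data-csum']

# GNU-style parsing reorders arguments: the option is still
# recognized even though it comes after positional arguments.
opts, args = getopt.gnu_getopt(argv, '', ['check-data-csum'])
print(opts, args)   # [('--check-data-csum', '')] ['check', '/dev/data/btrfs']

# POSIX-style parsing stops at the first non-option ('check'), so
# the trailing option is treated as a positional argument. This is
# the strict behavior the 'btrfs' tool implements.
opts, args = getopt.getopt(argv, '', ['check-data-csum'])
print(opts, args)   # [] ['check', '/dev/data/btrfs', '--check-data-csum']
```

This is why `btrfs check /dev/data/btrfs --check-data-csum` fails while the same pattern works with most coreutils commands.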
Re: BTRFS losing SE Linux labels on power failure or "reboot -nffd"
Hi, On 06/04/2018 03:14 PM, Russell Coker wrote: > The command "reboot -nffd" (kernel reboot without flushing kernel buffers or > writing status) when run on a BTRFS system with SE Linux will often result in > /var/log/audit/audit.log being unlabeled. This recent fix might be what you're looking for: https://www.spinics.net/lists/linux-btrfs/msg77927.html > It also results in some systemd- > journald files like /var/log/journal/c195779d29154ed8bcb4e8444c4a1728/ > system.journal being unlabeled but that is rarer. I think that the same > problem afflicts both systemd-journald and auditd but it's a race condition > that on my systems (both production and test) is more likely to affect auditd. > > root@stretch:/# xattr -l /var/log/audit/audit.log > security.selinux: > 73 79 73 74 65 6D 5F 75 3A 6F 62 6A 65 63 74 5Fsystem_u:object_ > 0010 72 3A 61 75 64 69 74 64 5F 6C 6F 67 5F 74 3A 73r:auditd_log_t:s > 0020 30 00 0. > > SE Linux uses the xattr "security.selinux", you can see what it's doing with > xattr(1) but generally using "ls -Z" is easiest. > > If this issue just affected "reboot -nffd" then a solution might be to just > not run that command. However this affects systems after a power outage. > > I have reproduced this bug with kernel 4.9.0-6-amd64 (the latest security > update for Debian/Stretch which is the latest supported release of Debian). > I > have also reproduced it in an identical manner with kernel 4.16.0-1-amd64 > (the > latest from Debian/Unstable). For testing I reproduced this with a 4G > filesystem in a VM, but in production it has happened on BTRFS RAID-1 arrays, > both SSD and HDD. > > #!/bin/bash > set -e > COUNT=$(ps aux|grep [s]bin/auditd|wc -l) > date > if [ "$COUNT" = "1" ]; then > echo "all good" > else > echo "failed" > exit 1 > fi > > Firstly the above is the script /usr/local/sbin/testit, I test for auditd > running because it aborts if the context on it's log file is wrong. 
When SE > Linux is in enforcing mode an incorrect/missing label on the audit.log file > causes auditd to abort. > > root@stretch:~# ls -liZ /var/log/audit/audit.log > 37952 -rw---. 1 root root system_u:object_r:auditd_log_t:s0 4385230 Jun > 1 > 12:23 /var/log/audit/audit.log > Above is before I do the tests. > > while ssh stretch /usr/local/sbin/testit ; do > ssh btrfs-local "reboot -nffd" > /dev/null 2>&1 & > sleep 20 > done > Above is the shell code I run to do the tests. Note that the VM in question > runs on SSD storage which is why it can consistently boot in less than 20 > seconds. > > Fri 1 Jun 12:26:13 UTC 2018 > all good > Fri 1 Jun 12:26:33 UTC 2018 > failed > Above is the output from the shell code in question. After the first reboot > it fails. The probability of failure on my test system is greater than 50%. > > root@stretch:~# ls -liZ /var/log/audit/audit.log > 37952 -rw---. 1 root root system_u:object_r:unlabeled_t:s0 4396803 Jun 1 > 12:26 /var/log/audit/audit.log > Now the result. Note that the Inode has not changed. I could understand a > newly created file missing an xattr, but this is an existing file which > shouldn't have had it's xattr changed. But somehow it gets corrupted. > > The first possibility I considered was that SE Linux code might be at fault. > I asked on the SE Linux mailing list (I haven't been involved in SE Linux > kernel code for about 15 years) and was informed that this isn't likely at > all. There have been no problems like this reported with other filesystems. > > Does anyone have any ideas of other tests I should run? Anyone want me to > try > a different kernel? I can give root on a VM to anyone who wants to poke at > it. > -- Hans van Kranenburg -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] btrfs-progs: fi-usage: fix RAID10 raw disk usage
In case of RAID10, fi usage is reporting half the amount of allocated space and twice the amount of unallocated disk space. Let's fix this. For example, in a RAID10 chunk of 3GiB with num_stripes 6, half of the stripes (dev extents) are a mirror of the other half, so the 3GiB of actual data has to be divided by 3, not 6, to get the size of each of those device extents. Signed-off-by: Hans van Kranenburg --- cmds-fi-usage.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/cmds-fi-usage.c b/cmds-fi-usage.c index b9a2b1c8..3bd2ccdf 100644 --- a/cmds-fi-usage.c +++ b/cmds-fi-usage.c @@ -660,7 +660,7 @@ static u64 calc_chunk_size(struct chunk_info *ci) else if (ci->type & BTRFS_BLOCK_GROUP_RAID6) return ci->size / (ci->num_stripes -2); else if (ci->type & BTRFS_BLOCK_GROUP_RAID10) - return ci->size / ci->num_stripes; + return ci->size / (ci->num_stripes / 2); return ci->size; } -- 2.17.1
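The arithmetic in the fix can be sketched in a few lines (a standalone Python model mirroring the patched calc_chunk_size() logic, not the actual btrfs-progs C code):

```python
def dev_extent_size(chunk_size, num_stripes, profile):
    """Raw bytes each device extent occupies for a chunk of this profile."""
    if profile == 'RAID5':
        return chunk_size // (num_stripes - 1)   # one stripe of parity
    if profile == 'RAID6':
        return chunk_size // (num_stripes - 2)   # two stripes of parity
    if profile == 'RAID10':
        # Half of the stripes mirror the other half, so the data is
        # spread over num_stripes / 2 distinct stripes, not num_stripes.
        return chunk_size // (num_stripes // 2)
    return chunk_size

GiB = 1 << 30
# The example from the commit message: a 3GiB RAID10 chunk with
# num_stripes 6 consists of six 1GiB device extents.
print(dev_extent_size(3 * GiB, 6, 'RAID10') // GiB)  # 1
```

With the old `chunk_size / num_stripes` division, the same call would report 512MiB per device extent, which is exactly the halved allocation the commit message describes.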
Re: off-by-one uncompressed invalid ram_bytes corruptions
On 05/21/2018 03:07 AM, Qu Wenruo wrote: > > [...] > > And the problem of wrong "root=" output also makes me pretty curious. Yeah, there are a bunch of error messages in the kernel which always say root=1, regardless of the actual root the tree block is in. I think I once tried to find out why, but apparently it wasn't a really obvious error somewhere. -- Hans van Kranenburg
Re: How to change/fix 'Received UUID'
On 05/03/2018 20:47, Marc MERLIN wrote: > On Mon, Mar 05, 2018 at 10:38:16PM +0300, Andrei Borzenkov wrote: >>> If I absolutely know that the data is the same on both sides, how do I >>> either >>> 1) force back in a 'Received UUID' value on the destination >> >> I suppose the most simple is to write a small program that does it using >> BTRFS_IOC_SET_RECEIVED_SUBVOL. > > Understood. > Given that I have not worked with the code at all, what is the best > tool in btrfs progs, to add this to? > > btrfstune? > btrfs property set? > other? > > David, is this something you'd be willing to add support for? > (to be honest, it'll be quicker for someone who knows the code to add than > for me, but if no one has the time, I'll see if I can have a shot at it) If you want something right now that works, so you can continue doing your backups, python-btrfs also has the ioctl, since v9, together with an example of using it: https://github.com/knorrie/python-btrfs/commit/1ace623f95300ecf581b1182780fd6432a46b24d P.S. even when coding it in C, the documentation in that commit message might be useful. :) fwiw, Hans
Re: incremental send/receive ERROR: cannot find parent subvolume
On 02/26/2018 11:54 PM, Emil.s wrote: > Hello, > > I'm trying to restore a subvolume from a backup, but I'm failing when > I try to setup the replication chain again. > > Previously I had disk A and B, where I was sending snapshots from A to > B using "send -c /disk_a/1 /disk_a/2 | receive /disk_b" and so on. > Now disk A failed, so I got a new disk: C. > > First I sent the last backup from B to C by running "send disk_b/2 | > receive /disk_c/". I can't read all of it right now, my time is a bit limited, but I have a quick suggestion to at least make the original goal work, I think. The list of snapshots that were on your A was organized differently than the list with the same numbers 1, 2, 3 on B. On A, you always snapshot the main r/w subvolume. A \_ A1 \_ A2 \_ A3 On B, every time you receive based on a previous one, it snapshots the previous one and then puts changes on top: B1 (received from A1) \_ B2 (snap from B1, with changes from A1->A2) \_ B3 (snap from B2, with changes from A2->A3) \_ etc... If A is gone, and you have new empty C: > Then my plan was to use disk B as the new main disk, so on disk B, I > created a new snapshot by running "snapshot -r disk_b/2 > disk_b/my_volume". > Now I want to send "disk_b/my_volume" to disk C, so that I can set > disk_b/2 to RW and start using it again, but the last send fails with > "ERROR: cannot find parent subvolume". Do it like this: -# btrfs snap B2 my_volume (so no -r) now my_volume is a writable clone of B2 -# btrfs snap -r my_volume my_volume1 now you start a new line of snapshots where my_volume has the same role as A before. my_volume \_ my_volume1 \_ my_volume2 \_ my_volume3 Now you can send my_volume3 -p my_volume2 etc... Then you discard all old Bx whenever you want. In the following, disk_b/2 and your other example names are suddenly replaced by real names, which makes it a bit hard to follow quickly. 
> Disk B: > root@tillberga: /backup #> btrfs subvol sh snapshots/user_data_2018-02-17/ > snapshots/user_data_2018-02-17 > Name: user_data_2018-02-17 > UUID: 93433261-d954-9f4b-8319-d450ae079a9a > Parent UUID: 51180286-c202-c94c-b8f9-2ecc8d2b5b7c > Received UUID: 014fc004-ae04-0148-9525-1bf556fd4d10 > Flags: readonly > Snapshot(s): > user_data_test > > Disk C: > root@hedemora: /btrfs #> btrfs subvol sh user_data_2018-02-17/ > user_data_2018-02-17 > Name: user_data_2018-02-17 > UUID: 35361429-a42c-594b-b390-51ffb9725324 > Parent UUID: - > Received UUID: 93433261-d954-9f4b-8319-d450ae079a9a > Flags: readonly > Snapshot(s): > > Disk B has UUID 93433261-d954-9f4b-8319-d450ae079a9a, and disk C has > "Received UUID: 93433261-d954-9f4b-8319-d450ae079a9a". I think that's > alright? > > The new snapshot on disk B looks as following: > root@tillberga: /backup #> btrfs subvol sh user_data_test > user_data_test > Name: user_data_test > UUID: bf88000c-e3a6-434b-8f69-f5ce2174227e > Parent UUID: 93433261-d954-9f4b-8319-d450ae079a9a > Received UUID: 014fc004-ae04-0148-9525-1bf556fd4d10 > Flags: readonly > Snapshot(s): > > But when I'm trying to send it, I'm getting the following: > root@tillberga: /backup #> btrfs send -v -c > snapshots/user_data_2018-02-17 user_data_test | ssh user@hedemora > "sudo btrfs receive -v /btrfs/" > At subvol user_data_test > BTRFS_IOC_SEND returned 0 > joining genl thread > receiving snapshot user_data_test > uuid=014fc004-ae04-0148-9525-1bf556fd4d10, ctransid=52373 > parent_uuid=014fc004-ae04-0148-9525-1bf556fd4d10, > parent_ctransid=52373 > At snapshot user_data_test > ERROR: cannot find parent subvolume > > Note that the receiver says > "parent_uuid=014fc004-ae04-0148-9525-1bf556fd4d10". Not really sure > where that comes from, but disk B has the same, so maybe that's the > UUID of the original snapshot on disk A? > > Is it possible to continue to send incremental snapshots between these > two file systems, or must I do a full sync? 
-- Hans van Kranenburg
Re: [PATCH] btrfs: Add nossd_spread mount option
On 02/22/2018 12:31 AM, Howard McLauchlan wrote: > Btrfs has two mount options for SSD optimizations: ssd and ssd_spread. > Presently there is an option to disable all SSD optimizations, but there > isn't an option to disable just ssd_spread. > > This patch adds a mount option nossd_spread that disables ssd_spread > only. > > Signed-off-by: Howard McLauchlan <hmclauch...@fb.com> > --- > fs/btrfs/super.c | 13 + > 1 file changed, 9 insertions(+), 4 deletions(-) > > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c > index 6e71a2a78363..4c0fcf5b3e7e 100644 > --- a/fs/btrfs/super.c > +++ b/fs/btrfs/super.c > @@ -310,10 +310,10 @@ static void btrfs_put_super(struct super_block *sb) > enum { > Opt_degraded, Opt_subvol, Opt_subvolid, Opt_device, Opt_nodatasum, > Opt_nodatacow, Opt_max_inline, Opt_alloc_start, Opt_nobarrier, Opt_ssd, > - Opt_nossd, Opt_ssd_spread, Opt_thread_pool, Opt_noacl, Opt_compress, > - Opt_compress_type, Opt_compress_force, Opt_compress_force_type, > - Opt_notreelog, Opt_ratio, Opt_flushoncommit, Opt_discard, > - Opt_space_cache, Opt_space_cache_version, Opt_clear_cache, > + Opt_nossd, Opt_ssd_spread, Opt_nossd_spread, Opt_thread_pool, Opt_noacl, > + Opt_compress, Opt_compress_type, Opt_compress_force, > + Opt_compress_force_type, Opt_notreelog, Opt_ratio, Opt_flushoncommit, > + Opt_discard, Opt_space_cache, Opt_space_cache_version, Opt_clear_cache, > Opt_user_subvol_rm_allowed, Opt_enospc_debug, Opt_subvolrootid, > Opt_defrag, Opt_inode_cache, Opt_no_space_cache, Opt_recovery, > Opt_skip_balance, Opt_check_integrity, > @@ -353,6 +353,7 @@ static const match_table_t tokens = { > {Opt_ssd, "ssd"}, > {Opt_ssd_spread, "ssd_spread"}, > {Opt_nossd, "nossd"}, > + {Opt_nossd_spread, "nossd_spread"}, > {Opt_acl, "acl"}, > {Opt_noacl, "noacl"}, > {Opt_notreelog, "notreelog"}, .oO(Why doesn't the enum just have one option per line, so the changelog is less invasive?) 
> @@ -582,6 +583,10 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char > *options, > btrfs_clear_and_info(info, SSD_SPREAD, >"not using spread ssd allocation > scheme"); > break; > + case Opt_nossd_spread: > + btrfs_clear_and_info(info, SSD_SPREAD, > + "not using spread ssd allocation > scheme"); > + break; > case Opt_barrier: > btrfs_clear_and_info(info, NOBARRIER, >"turning on barriers"); > Related: * https://www.spinics.net/lists/linux-btrfs/msg64247.html * https://www.spinics.net/lists/linux-btrfs/msg64277.html * https://www.spinics.net/lists/linux-btrfs/msg64499.html Apparently that discussion never resulted in actual changes, so thanks for continuing it now. The mount options are a bit weird, because ssd_spread also includes ssd, but doesn't show it. I personally don't like all of them at all, and I should really finish and send my proposal to get them replaced by options that can choose extent allocator for data and metadata individually (instead of some setting that changes them both at the same time) because there are proper real life situations that e.g. benefit from nossd style 'tetris' data allocator with ssd_spread style 'contiguous' metadata extent allocator. -- Hans van Kranenburg -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
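The interaction between these four options can be modeled as bit flags. This is a simplified sketch, not btrfs_parse_options itself; the flag names mirror the kernel's SSD/SSD_SPREAD bits, and the ssd_spread-implies-ssd behavior is the one described above:

```python
SSD        = 1 << 0
SSD_SPREAD = 1 << 1

def apply_mount_options(options, flags=0):
    """Fold a comma-separated mount option string into ssd flags."""
    for opt in options.split(','):
        if opt == 'ssd':
            flags |= SSD
        elif opt == 'ssd_spread':
            # ssd_spread also includes ssd, even though mount output
            # doesn't show the implied 'ssd'.
            flags |= SSD | SSD_SPREAD
        elif opt == 'nossd':
            # nossd disables all ssd optimizations.
            flags &= ~(SSD | SSD_SPREAD)
        elif opt == 'nossd_spread':
            # The new option from this patch: only the spread
            # allocation scheme is disabled.
            flags &= ~SSD_SPREAD
    return flags

print(apply_mount_options('ssd_spread,nossd_spread') == SSD)  # True
```

The last line shows the gap the patch fills: before nossd_spread, there was no way to end up with plain SSD behavior once ssd_spread was set, short of disabling everything with nossd.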
Re: Status of FST and mount times
On 02/21/2018 04:19 PM, Ellis H. Wilson III wrote: > On 02/21/2018 10:03 AM, Hans van Kranenburg wrote: >> On 02/21/2018 03:49 PM, Ellis H. Wilson III wrote: >>> On 02/20/2018 08:49 PM, Qu Wenruo wrote: >>>> My suggestion is to use balance to reduce number of block groups, so we >>>> could do less search at mount time. >>>> >>>> It's more like reason 2. >>>> >>>> But it only works for case where there are a lot of fragments so a lot >>>> of chunks are not fully utilized. >>>> Unfortunately, that's not the case for OP, so my suggestion doesn't >>>> make >>>> sense here. >>> >>> I ran the balance all the same, and the number of chunks has not >>> changed. Before 3454, and after 3454: >>> $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l >>> 3454 >>> >>> HOWEVER, the time to mount has gone up somewhat significantly, from >>> 11.537s to 16.553s, which was very unexpected. Output from previously >>> run commands shows the extent tree metadata grew about 25% due to the >>> balance. 
Everything else stayed roughly the same, and no additional >>> data was added to the system (nor snapshots taken, nor additional >>> volumes added, etc): >>> >>> Before balance: >>> $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/ >>> ROOT_TREE 1.14MiB 0( 72) 1( 1) >>> EXTENT_TREE 644.27MiB 0( 41101) 1( 131) 2( 1) >>> CHUNK_TREE 384.00KiB 0( 23) 1( 1) >>> DEV_TREE 272.00KiB 0( 16) 1( 1) >>> FS_TREE 11.55GiB 0(754442) 1( 2179) 2( 5) 3( 2) >>> CSUM_TREE 3.50GiB 0(228593) 1( 791) 2( 2) 3( 1) >>> QUOTA_TREE 0.00B >>> UUID_TREE 16.00KiB 0( 1) >>> FREE_SPACE_TREE 0.00B >>> DATA_RELOC_TREE 16.00KiB 0( 1) >>> >>> After balance: >>> $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/ >>> ROOT_TREE 1.16MiB 0( 73) 1( 1) >>> EXTENT_TREE 806.50MiB 0( 51419) 1( 196) 2( 1) >>> CHUNK_TREE 384.00KiB 0( 23) 1( 1) >>> DEV_TREE 272.00KiB 0( 16) 1( 1) >>> FS_TREE 11.55GiB 0(754442) 1( 2179) 2( 5) 3( 2) >>> CSUM_TREE 3.49GiB 0(227920) 1( 804) 2( 2) 3( 1) >>> QUOTA_TREE 0.00B >>> UUID_TREE 16.00KiB 0( 1) >>> FREE_SPACE_TREE 0.00B >>> DATA_RELOC_TREE 16.00KiB 0( 1) >> >> Heu, interesting. >> >> What's the output of `btrfs fi df /mountpoint` and `grep btrfs >> /proc/self/mounts` (does it contain 'ssd') and which kernel version is >> this? (I get a bit lost in the many messages and subthreads in this >> thread) I also can't find in the threads which command "the balance" >> means. > > Short recap: > - I found long mount time for 1.65TB of home dir data at ~4s > - Doubling this data on the same btrfs fs to 3.3TB increased mount time > to 11s > - Qu et. al. suggested balance might reduce chunks, which came in around > 3400, and the chunk walk on mount was the driving factor in terms of time > - I ran balance > - Mount time went up to 16s, and all else remains the same except the > extent tree. 
> > $ sudo btrfs fi df /mnt/btrfs > Data, single: total=3.32TiB, used=3.32TiB > System, DUP: total=8.00MiB, used=384.00KiB > Metadata, DUP: total=16.50GiB, used=15.82GiB > GlobalReserve, single: total=512.00MiB, used=0.00B Ah, so allocated data space is 100% filled with data. That's very good yes. And it explains why you can't lower the amount of chunks by balancing. You're just moving around data and replacing full chunks with new full chunks. :] Doesn't explain why it blows up the size of the extent tree though. I have no idea why that is. > $ sudo grep btrfs /proc/self/mounts > /dev/sdb /mnt/btrfs btrfs rw,relatime,space_cache,subvolid=5,subvol=/ 0 0 Ok, no 'ssd', good. > $ uname -a > Linux 4.5.5-300.fc24.x86_64 #1 SMP Thu May 19 13:05:32 UTC 2016 > x86_64 x86_64 x86_64 GNU/Linux > > I plan to rerun this on a newer kernel, but haven't had time to spin up > another machine with a modern kernel yet, and this machine is also being > used for other things right now so I can't just upgrade it. > >> And what does this tell you? >> >> https://github.com/knorrie/python-btrfs/blob/develop/examples/show_free_space_fragmentation.py >> > > $ sudo ./show_free_space_fragmentation.py /mnt/btrfs > No Free Space Tree (space_cache=v2) found! > Falling back to using the extent tree to determine free space extents. > vaddr 6529453391872 length 1073741824 used_pct 27 free space fragments 1 > score 0 > Skipped because of usage > 90%: 3397 chunks Good. -- Hans van Kranenburg -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Status of FST and mount times
On 02/21/2018 03:49 PM, Ellis H. Wilson III wrote: > On 02/20/2018 08:49 PM, Qu Wenruo wrote: >>>>> On 2018年02月16日 22:12, Ellis H. Wilson III wrote: >>>>>> $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l >>>>>> 3454 >>>>> >> Increasing node size may reduce extent tree size. Although at most >> reduce one level AFAIK. >> >> But considering that the higher the node is, the more chance it's >> cached, reducing tree height wouldn't bring much performance impact >> AFAIK. >> >> If one could do real world benchmark to beat or prove my assumption, it >> would be much better though. > > I'm willing to try this if you tell me exactly what you'd like me to do. > I've not mucked with nodesize before, so I'd like to avoid changing it > to something absurd. > >>> Qu's suggestion is actually independent of all the above reasons, but >>> does kind of fit in with the fourth as another case of preventative >>> maintenance. >> >> My suggestion is to use balance to reduce number of block groups, so we >> could do less search at mount time. >> >> It's more like reason 2. >> >> But it only works for case where there are a lot of fragments so a lot >> of chunks are not fully utilized. >> Unfortunately, that's not the case for OP, so my suggestion doesn't make >> sense here. > > I ran the balance all the same, and the number of chunks has not > changed. Before 3454, and after 3454: > $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l > 3454 > > HOWEVER, the time to mount has gone up somewhat significantly, from > 11.537s to 16.553s, which was very unexpected. Output from previously > run commands shows the extent tree metadata grew about 25% due to the > balance. 
Everything else stayed roughly the same, and no additional > data was added to the system (nor snapshots taken, nor additional > volumes added, etc): > > Before balance: > $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/ > ROOT_TREE 1.14MiB 0( 72) 1( 1) > EXTENT_TREE 644.27MiB 0( 41101) 1( 131) 2( 1) > CHUNK_TREE 384.00KiB 0( 23) 1( 1) > DEV_TREE 272.00KiB 0( 16) 1( 1) > FS_TREE 11.55GiB 0(754442) 1( 2179) 2( 5) 3( 2) > CSUM_TREE 3.50GiB 0(228593) 1( 791) 2( 2) 3( 1) > QUOTA_TREE 0.00B > UUID_TREE 16.00KiB 0( 1) > FREE_SPACE_TREE 0.00B > DATA_RELOC_TREE 16.00KiB 0( 1) > > After balance: > $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/ > ROOT_TREE 1.16MiB 0( 73) 1( 1) > EXTENT_TREE 806.50MiB 0( 51419) 1( 196) 2( 1) > CHUNK_TREE 384.00KiB 0( 23) 1( 1) > DEV_TREE 272.00KiB 0( 16) 1( 1) > FS_TREE 11.55GiB 0(754442) 1( 2179) 2( 5) 3( 2) > CSUM_TREE 3.49GiB 0(227920) 1( 804) 2( 2) 3( 1) > QUOTA_TREE 0.00B > UUID_TREE 16.00KiB 0( 1) > FREE_SPACE_TREE 0.00B > DATA_RELOC_TREE 16.00KiB 0( 1) Heu, interesting. What's the output of `btrfs fi df /mountpoint` and `grep btrfs /proc/self/mounts` (does it contain 'ssd') and which kernel version is this? (I get a bit lost in the many messages and subthreads in this thread) I also can't find in the threads which command "the balance" means. And what does this tell you? https://github.com/knorrie/python-btrfs/blob/develop/examples/show_free_space_fragmentation.py Just to make sure you're not pointlessly shovelling data around on a filesystem that is already in bad shape. >> BTW, if OP still wants to try something to possibly to reduce mount time >> with same the fs, I could try some modification to current block group >> iteration code to see if it makes sense. > > I'm glad to try anything if it's helpful to improving BTRFS. Just let > me know. 
> > Best, > > ellis -- Hans van Kranenburg
Re: Ongoing Btrfs stability issues
On 02/17/2018 05:34 AM, Shehbaz Jaffer wrote: >> It's hosted on an EBS volume; we don't use ephemeral storage at all. The EBS >> volumes are all SSD > > I have recently done some SSD corruption experiments on small set of > workloads, so I thought I would share my experience. > > While creating btrfs using mkfs.btrfs command for SSDs, by default the > metadata duplication option is disabled. this renders btrfs-scrubbing > ineffective, as there are no redundant metadata to restore corrupted > metadata from. > So if there are any errors during read operation on SSD, unlike HDD > where the corruptions would be handled by btrfs scrub on the fly while > detecting checksum error, for SSD the read would fail as uncorrectable > error. First of all, the ssd mount option does not have anything to do with having single or DUP metadata. Well, both the things that happen by default (mkfs using single, mount enabling the ssd option) are happening because of the lookup result on the rotational flag, but that's all. > Could you confirm if metadata DUP is enabled for your system by > running the following cmd: > > $btrfs fi df /mnt # mount is the mount point > Data, single: total=8.00MiB, used=64.00KiB > System, single: total=4.00MiB, used=16.00KiB > Metadata, single: total=168.00MiB, used=112.00KiB > GlobalReserve, single: total=16.00MiB, used=0.00B > > If metadata is single in your case as well (and not DUP), that may be > the problem for btrfs-scrub not working effectively on the fly > (mid-stream bit-rot correction), causing reliability issues. A couple > of such bugs that are observed specifically for SSDs is reported here: > > https://bugzilla.kernel.org/show_bug.cgi?id=198463 > https://bugzilla.kernel.org/show_bug.cgi?id=198807 Here you show that when you have 'single' metadata, there's no copy to recover from. This is expected. Also, instead of physically damaging flash cells inside your SSD, you are writing data to a perfectly working one. 
This is a different failure scenario. One of the reasons to turn off DUP for metadata by default on SSD is (from man mkfs.btrfs): "The controllers may put data written in a short timespan into the same physical storage unit (cell, block etc). In case this unit dies, both copies are lost. BTRFS does not add any artificial delay between metadata writes." .. "The traditional rotational hard drives usually fail at the sector level." And, of course, in case of EBS, you don't have any idea at all where the data actually ends up, since it's talking to a black box service, and not an SSD. In any case, using DUP instead of single obviously increases the chance of recovery in case of failures that corrupt one copy of the data when it's travelling between system memory and disk, while you're sending two of them right after each other, so you're totally right that it's better to enable. > These do not occur for HDD, and I believe should not occur when > filesystem is mounted with nossd mode. So to reiterate, mounting nossd does not make your metadata writes DUP. > On Fri, Feb 16, 2018 at 10:03 PM, Duncan <1i5t5.dun...@cox.net> wrote: >> Austin S. Hemmelgarn posted on Fri, 16 Feb 2018 14:44:07 -0500 as >> excerpted: >> >>> This will probably sound like an odd question, but does BTRFS think your >>> storage devices are SSD's or not? Based on what you're saying, it >>> sounds like you're running into issues resulting from the >>> over-aggressive SSD 'optimizations' that were done by BTRFS until very >>> recently. 
>>> >>> You can verify if this is what's causing your problems or not by either >>> upgrading to a recent mainline kernel version (I know the changes are in >>> 4.15, I don't remember for certain if they're in 4.14 or not, but I >>> think they are), or by adding 'nossd' to your mount options, and then >>> seeing if you still have the problems or not (I suspect this is only >>> part of it, and thus changing this will reduce the issues, but not >>> completely eliminate them). Make sure and run a full balance after >>> changing either item, as the aforementioned 'optimizations' have an >>> impact on how data is organized on-disk (which is ultimately what causes >>> the issues), so they will have a lingering effect if you don't balance >>> everything. >> >> According to the wiki, 4.14 does indeed have the ssd changes. >> >> According to the bug, he's running 4.13.x on one server and 4.14.x on >> two. So upgrading the one to 4.14.x should mean all will have that fix. >> >> However, without a full balance it /will/ take some time to settle down >> (again, assuming btrfs was using ssd mode), so the lingering effect could >> still be creating problems on the 4.14 kernel servers for the moment. -- Hans van Kranenburg
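The value of DUP metadata for scrub can be shown with a minimal checksum-and-repair model. This is a generic illustration using CRC32, not the btrfs implementation (btrfs uses crc32c and its own on-disk layout):

```python
import zlib

def scrub_read(copies, expected_crc):
    """Return the first copy whose checksum matches. With a single
    copy, any corruption becomes an uncorrectable error."""
    for data in copies:
        if zlib.crc32(data) == expected_crc:
            return data
    raise IOError("uncorrectable checksum error")

good = b"block group item"
crc = zlib.crc32(good)
corrupt = b"blocX group item"

# DUP metadata: a corrupted copy is detected and the read is
# satisfied (and repaired) from the other copy.
assert scrub_read([corrupt, good], crc) == good

# single metadata: the same corruption is fatal.
try:
    scrub_read([corrupt], crc)
except IOError as e:
    print(e)  # uncorrectable checksum error
```

Note this only models corruption between memory and disk of one copy; as quoted from the mkfs.btrfs man page above, an SSD controller that stores both copies in the same physical cell can still lose both at once.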
Re: Status of FST and mount times
On 02/16/2018 03:12 PM, Ellis H. Wilson III wrote: > On 02/15/2018 08:55 PM, Qu Wenruo wrote: >> On 2018年02月16日 00:30, Ellis H. Wilson III wrote: >>> Very helpful information. Thank you Qu and Hans! >>> >>> I have about 1.7TB of homedir data newly rsync'd data on a single >>> enterprise 7200rpm HDD and the following output for btrfs-debug: >>> >>> extent tree key (EXTENT_TREE ROOT_ITEM 0) 543384862720 level 2 >>> total bytes 6001175126016 >>> bytes used 1832557875200 >>> >>> Hans' (very cool) tool reports: >>> ROOT_TREE 624.00KiB 0( 38) 1( 1) >>> EXTENT_TREE 327.31MiB 0( 20881) 1( 66) 2( 1) >> >> Extent tree is not so large, a little unexpected to see such slow mount. >> >> BTW, how many chunks do you have? >> >> It could be checked by: >> >> # btrfs-debug-tree -t chunk | grep CHUNK_ITEM | wc -l > > Since yesterday I've doubled the size by copying the homdir dataset in > again. Here are new stats: > > extent tree key (EXTENT_TREE ROOT_ITEM 0) 385990656 level 2 > total bytes 6001175126016 > bytes used 3663525969920 > > $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l > 3454 > > $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/ > ROOT_TREE 1.14MiB 0( 72) 1( 1) > EXTENT_TREE 644.27MiB 0( 41101) 1( 131) 2( 1) > CHUNK_TREE 384.00KiB 0( 23) 1( 1) > DEV_TREE 272.00KiB 0( 16) 1( 1) > FS_TREE 11.55GiB 0(754442) 1( 2179) 2( 5) 3( 2) > CSUM_TREE 3.50GiB 0(228593) 1( 791) 2( 2) 3( 1) > QUOTA_TREE 0.00B > UUID_TREE 16.00KiB 0( 1) > FREE_SPACE_TREE 0.00B > DATA_RELOC_TREE 16.00KiB 0( 1) > > The old mean mount time was 4.319s. It now takes 11.537s for the > doubled dataset. Again please realize this is on an old version of > BTRFS (4.5.5), so perhaps newer ones will perform better, but I'd still > like to understand this delay more. Should I expect this to scale in > this way all the way up to my proposed 60-80TB filesystem so long as the > file size distribution stays roughly similar? That would definitely be > in terms of multiple minutes at that point. 
Well, imagine you have a big tree (an actual real life tree outside) and you need to pick things (e.g. apples) which are hanging everywhere. So, what you need to do is climb the tree, climb on a branch all the way to the end where the first apple is... climb back, climb up a bit, go onto the next branch to the end for the next apple... etc etc The bigger the tree is, the longer it keeps you busy, because the apples will be semi-evenly distributed around the full tree, and they're always hanging at the end of the branch. The speed with which you can climb around (random read disk access IO speed for btrfs, because your disk cache is empty when first mounting) determines how quickly you're done. So, yes. >>> Taking 100 snapshots (no changes between snapshots however) of the above >>> subvolume doesn't appear to impact mount/umount time. >> >> 100 unmodified snapshots won't affect mount time. >> >> It needs new extents, which can be created by overwriting extents in >> snapshots. >> So it won't really cause much difference if all these snapshots are all >> unmodified. > > Good to know, thanks! > >>> Snapshot creation >>> and deletion both operate at between 0.25s to 0.5s. >> >> IIRC snapshot deletion is delayed, so the real work doesn't happen when >> "btrfs sub del" returns. > > I was using btrfs sub del -C for the deletions, so I believe (if that > command truly waits for the subvolume to be utterly gone) it captures > the entirety of the snapshot. > > Best, > > ellis -- Hans van Kranenburg -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
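For fun, the scaling question in this thread can be put into a few lines of Python. This is purely a hypothetical back-of-the-envelope model (not derived from btrfs internals): assume mount time grows linearly with bytes used, fit a line through the two measurements reported above, and extrapolate to the proposed 60-80TB. Real mount time also depends on seek latency, chunk size and kernel version, so treat the output as an order-of-magnitude guess only.

```python
# Hypothetical linear model through the two data points from this thread.

def extrapolate_mount_time(measurements, bytes_used):
    """Fit a straight line through two (bytes_used, seconds) points."""
    (b1, t1), (b2, t2) = measurements
    slope = (t2 - t1) / (b2 - b1)
    return t1 + (bytes_used - b1) * slope

# The two measurements reported in this thread:
measured = [
    (1832557875200, 4.319),   # ~1.8TB used -> 4.319s mount
    (3663525969920, 11.537),  # ~3.7TB used -> 11.537s mount
]

for tb in (60, 70, 80):
    t = extrapolate_mount_time(measured, tb * 10**12)
    print(f"{tb}TB used: ~{t / 60:.1f} minutes")
```

The extrapolation lands in the "multiple minutes" range mentioned above, which is consistent with the concern raised in the thread.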
Re: Status of FST and mount times
( 7876) 1( 28) 2( 1) CHUNK_TREE 112.00KiB 0( 6) 1( 1) DEV_TREE 80.00KiB 0( 4) 1( 1) FS_TREE 1016.34MiB 0( 64113) 1( 881) 2( 52) CSUM_TREE 777.42MiB 0( 49571) 1( 183) 2( 1) QUOTA_TREE 0.00B UUID_TREE 16.00KiB 0( 1) FREE_SPACE_TREE 336.00KiB 0( 20) 1( 1) DATA_RELOC_TREE 16.00KiB 0( 1) > > Thanks, > Qu > >> >> >>> >>> Note that I'm not sensitive to multi-second mount delays. I am >>> sensitive to multi-minute mount delays, hence why I'm bringing this up. >>> >>> FWIW: I am currently populating a machine we have with 6TB drives in it >>> with real-world home dir data to see if I can replicate the mount issue. >>> >>> Thanks, >>> >>> ellis -- Hans van Kranenburg
Re: [PATCH] btrfs: alloc_chunk: fix DUP stripe size handling
On 02/14/2018 03:49 PM, David Sterba wrote: > On Mon, Feb 05, 2018 at 05:45:11PM +0100, Hans van Kranenburg wrote: >> In case of using DUP, we search for enough unallocated disk space on a >> device to hold two stripes. >> >> The devices_info[ndevs-1].max_avail that holds the amount of unallocated >> space found is directly assigned to stripe_size, while it's actually >> twice the stripe size. >> >> Later on in the code, an unconditional division of stripe_size by >> dev_stripes corrects the value, but in the meantime there's a check to >> see if the stripe_size does not exceed max_chunk_size. Since during this >> check stripe_size is twice the amount as intended, the check will reduce >> the stripe_size to max_chunk_size if the actual correct to be used >> stripe_size is more than half the amount of max_chunk_size. >> >> The unconditional division later tries to correct stripe_size, but will >> actually make sure we can't allocate more than half the max_chunk_size. >> >> Fix this by moving the division by dev_stripes before the max chunk size >> check, so it always contains the right value, instead of putting a duct >> tape division in further on to get it fixed again. >> >> Since in all other cases than DUP, dev_stripes is 1, this change only >> affects DUP. >> >> Other attempts in the past were made to fix this: >> * 37db63a400 "Btrfs: fix max chunk size check in chunk allocator" tried >> to fix the same problem, but still resulted in part of the code acting >> on a wrongly doubled stripe_size value. >> * 86db25785a "Btrfs: fix max chunk size on raid5/6" unintentionally >> broke this fix again. >> >> The real problem was already introduced with the rest of the code in >> 73c5de0051. >> >> The user visible result however will be that the max chunk size for DUP >> will suddenly double, while it's actually acting according to the limits >> in the code again like it was 5 years ago. 
>> >> Reported-by: Naohiro Aota <naohiro.a...@wdc.com> >> Link: https://www.spinics.net/lists/linux-btrfs/msg69752.html >> Fixes: 73c5de0051 ("btrfs: quasi-round-robin for chunk allocation") >> Fixes: 86db25785a ("Btrfs: fix max chunk size on raid5/6") >> Signed-off-by: Hans van Kranenburg <hans.van.kranenb...@mendix.com> >> Cc: Naohiro Aota <naohiro.a...@wdc.com> >> Cc: Arne Jansen <sensi...@gmx.net> >> Cc: Chris Mason <chris.ma...@fusionio.com> > > I guess half of the addresses have bounced :) Have you used the > get_maintainer.pl script? Yes, indeed, :s I was copy/pasting addresses too quickly from old commits which touched the relevant pieces of the code in the past and only looked at the names. :| And no, I didn't use get_maintainer.pl, thanks for pointing at it. Instead of sending out even more spam when I saw it shortly after, I waited for first feedback. I corrected (I think) addresses in the To of this mail now. Can you fix them up at the end of the commit message? Your other comments indicate you're not asking for a v2. > The fix is short, I had to read the allocator code again so it took me > longer to review it. Your description in the changelog was really > helpful. Thanks. Yes, it's already a while ago the other mail thread about this was going on, and it didn't end up with a final patch yet. I started digging into the issue back then because it matched some behaviour I'd seen before: 512MB dup metadata chunks being replaced with the same number of 1024MB chunks when doing a conversion to single, which was a bit weird... And this explained it. 
>> --- >> fs/btrfs/volumes.c | 4 +--- >> 1 file changed, 1 insertion(+), 3 deletions(-) >> >> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c >> index 4006b2a1233d..a50bd02b7ada 100644 >> --- a/fs/btrfs/volumes.c >> +++ b/fs/btrfs/volumes.c >> @@ -4737,7 +4737,7 @@ static int __btrfs_alloc_chunk(struct >> btrfs_trans_handle *trans, >> * the primary goal is to maximize the number of stripes, so use as many >> * devices as possible, even if the stripes are not maximum sized. >> */ >> -stripe_size = devices_info[ndevs-1].max_avail; >> +stripe_size = div_u64(devices_info[ndevs-1].max_avail, dev_stripes); > > I'll enhance the comment above with more explanation why do it here, Because it *is* the actual stripe_size. :) Feel free to enhance! > otherwise consider this > > Reviewed-by: David Sterba <dste...@suse.com> Yay! >> num_stripes = ndevs * dev_stripes; >> >> /* >> @@ -4772,8 +4772,6 @@ static int __btrfs_alloc_chunk(
Re: btrfs-cleaner / snapshot performance analysis
On 02/12/2018 03:45 PM, Ellis H. Wilson III wrote: > On 02/11/2018 01:03 PM, Hans van Kranenburg wrote: >>> 3. I need to look at the code to understand the interplay between >>> qgroups, snapshots, and foreground I/O performance as there isn't >>> existing architecture documentation to point me to that covers this >> >> Well, the excellent write-up of Qu this morning shows some explanation >> from the design point of view. > > Sorry, I may have missed this email. Or perhaps you are referring to a > wiki or blog post of some kind I'm not following actively? Either way, > if you can forward me the link, I'd greatly appreciate it. You are in the To: of it: https://www.spinics.net/lists/linux-btrfs/msg74737.html >> nocow only keeps the cows on a distance as long as you don't start >> snapshotting (or cp --reflink) those files... If you take a snapshot, >> then you force btrfs to keep the data around that is referenced by the >> snapshot. So, that means that every next write will be cowed once again, >> moo, so small writes will be redirected to a new location, causing >> fragmentation again. The second and third write can go in the same (new) >> location of the first new write, but as soon as you snapshot again, this >> happens again. > > Ah, very interesting. Thank you for clarifying! -- Hans van Kranenburg
Re: btrfs-cleaner / snapshot performance analysis
On 02/11/2018 04:59 PM, Ellis H. Wilson III wrote: > Thanks Tomasz, > > Comments in-line: > > On 02/10/2018 05:05 PM, Tomasz Pala wrote: >> You won't have anything close to "accurate" in btrfs - quotas don't >> include space wasted by fragmentation, which happens to allocate from >> tens >> to thousands times (sic!) more space than the files itself. >> Not in some worst-case scenarios, but in real life situations... >> I got 10 MB db-file which was eating 10 GB of space after a week of >> regular updates - withOUT snapshotting it. All described here. > > The underlying filesystem this is replacing was an in-house developed > COW filesystem, so we're aware of the difficulties of fragmentation. I'm > more interested in an approximate space consumed across snapshots when > considering CoW. I realize it will be approximate. Approximate is ok > for us -- no accounting for snapshot space consumed is not. If your goal is to have an approximate idea for accounting, and you don't need to be able to actually enforce limits, and if the filesystems that you are using are as small as the 40GiB example you gave... Why not just use `btrfs fi du ` now and then and update your administration with the results? .. Instead of putting the burden of keeping track of all administration during every tiny change all day long? > Also, I don't see the thread you mentioned. Perhaps you forgot to > mention it, or an html link didn't come through properly? > >>> course) or how many subvolumes/snapshots there are. If I know that >>> above N snapshots per subvolume performance tanks by M%, I can apply >>> limits on the use-case in the field, but I am not aware of those kinds >>> of performance implications yet. >> >> This doesn't work like this. It all depends on data that are subject of >> snapshots, especially how they are updated. How exactly, including write >> patterns. 
>> >> I think you expect answers that can't be formulated - with fs >> architecture so >> advanced as ZFS or btrfs it's behavior can't be analyzed for simple >> answers like 'keep less than N snapshots'. > > I was using an extremely simple heuristic to drive at what I was looking > to get out of this. I should have been more explicit that the example > was not to be taken literally. > >> This is an exception of easy-answer: btrfs doesn't handle databases with >> CoW. Period. Doesn't matter if snapshotted or not, ANY database files >> (systemd-journal, PostgreSQL, sqlite, db) are not handled at all. They >> slow down entire system to the speed of cheap SD card. > > I will keep this in mind, thank you. We do have a higher level above > BTRFS that stages data. I will consider implementing an algorithm to > add the nocow flag to the file if it has been written to sufficiently to > indicate it will be a bad fit for the BTRFS COW algorithm. Adding nocow attribute to a file only works when it's just created and not written to yet or when setting it on the containing directory and letting it inherit for new files. You can't just turn it on for existing files with content. https://btrfs.wiki.kernel.org/index.php/FAQ#Can_copy-on-write_be_turned_off_for_data_blocks.3F >> Actually, if you do not use compression and don't need checksums of data >> blocks, you may want to mount all the btrfs with nocow by default. >> This way the quotas would be more accurate (no fragmentation _between_ >> snapshots) and you'll have some decent performance with snapshots. >> If that is all you care. > > CoW is still valuable for us as we're shooting to support on the order > of hundreds of snapshots per subvolume, Hundreds will get you into trouble even without qgroups. > and without it (if BTRFS COW > works the same as our old COW FS) that's going to be quite expensive to > keep snapshots around. So some hybrid solution is required here. 
-- Hans van Kranenburg
Re: btrfs-cleaner / snapshot performance analysis
On 02/11/2018 05:15 PM, Ellis H. Wilson III wrote: > Thanks Hans. Sorry for the top-post, but I'm boiling things down here > so I don't have a clear line-item to respond to. The take-aways I see > here to my original queries are: > > 1. Nobody has done a thorough analysis of the impact of snapshot > manipulation WITHOUT qgroups enabled on foreground I/O performance > 2. Nobody has done a thorough analysis of the impact of snapshot > manipulation WITH qgroups enabled on foreground I/O performance It's more that there is no simple list of clear-cut answers that apply to every possible situation and type/pattern of work that you can throw at a btrfs filesystem. > 3. I need to look at the code to understand the interplay between > qgroups, snapshots, and foreground I/O performance as there isn't > existing architecture documentation to point me to that covers this Well, the excellent write-up of Qu this morning shows some explanation from the design point of view. > 4. I should be cautioned that CoW in BTRFS can exhibit pathological (if > expected) capacity consumption for very random-write-oriented datasets > with or without snapshots, and nocow (or in my case transparently > absorbing and coalescing writes at a higher tier) is my friend. nocow only keeps the cows on a distance as long as you don't start snapshotting (or cp --reflink) those files... If you take a snapshot, then you force btrfs to keep the data around that is referenced by the snapshot. So, that means that every next write will be cowed once again, moo, so small writes will be redirected to a new location, causing fragmentation again. The second and third write can go in the same (new) location of the first new write, but as soon as you snapshot again, this happens again. > 5. I should be cautioned that CoW is broken across snapshots when > defragmentation is run. > > I will update a test system to the most recent kernel and will perform > tests to answer #1 and #2. 
> I will plan to share it when I'm done. If I have time to write-up my findings for #3 I will similarly share that. > > Thanks to all for your input on this issue. Have fun! -- Hans van Kranenburg
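The snapshot-forces-cow behaviour explained earlier in this thread (a snapshot makes the next write to a nocow file cow once, after which writes go in place again until the next snapshot) can be modelled in a few lines of Python. This is a toy refcount model, not btrfs code: one file with a single extent, where an extent shared with a snapshot can no longer be overwritten in place.

```python
# Toy model of why a snapshot forces one more CoW round for nocow files:
# an extent shared with a snapshot must be preserved, so the next write
# is redirected to a new location.

class ToyFs:
    def __init__(self):
        self.next_addr = 0
        self.refs = {}            # extent address -> reference count
        self.file_extent = None   # the one extent of our one nocow file

    def _alloc(self):
        addr = self.next_addr
        self.next_addr += 1
        self.refs[addr] = 1
        return addr

    def create_file(self):
        self.file_extent = self._alloc()

    def snapshot(self):
        # the snapshot now also references the file's current extent
        self.refs[self.file_extent] += 1

    def write(self):
        if self.refs[self.file_extent] == 1:
            return "overwritten in place"   # nocow behaviour
        # extent is shared with a snapshot: must redirect ("moo")
        self.refs[self.file_extent] -= 1
        self.file_extent = self._alloc()
        return "cowed to new location"

fs = ToyFs()
fs.create_file()
print(fs.write())   # in place: nobody else references the extent
fs.snapshot()
print(fs.write())   # cowed: the snapshot pinned the old extent
print(fs.write())   # in place again: the new extent is not shared
```

This matches the description above: the second and third write after a snapshot can reuse the new location, until the next snapshot pins it again.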
Re: btrfs-cleaner / snapshot performance analysis
Hey, On 02/10/2018 07:29 PM, Ellis H. Wilson III wrote: > Thank you very much for your response Hans. Comments in-line, but I did > want to handle one miscommunication straight-away: > > I'm a huge fan of BTRFS. If I came off like I was complaining, my > sincere apologies. To be completely transparent we are using BTRFS in > a very large project at my company, which I am lead architect on, and > while I have read the academic papers, perused a subset of the source > code, and been following its development in the background, I now need > to deeply understand where there might be performance hiccups. I'd suggest just trying to do what you want to do for real, finding out what the problems are and then finding out what to do about them, but I think that's already almost exactly what you've started doing now. :) If you ask 100 different btrfs users about your specific situation, you probably get 100 different answers. So, I'll just throw some of my own thoughts in here, which may or may not make sense for you. > All of > our foreground I/O testing with BTRFS in RAID0/RAID1/single across > different SSDs and HDDs has been stellar, but we haven't dug too far > into snapshot performance, balancing, and other more background-oriented > performance. Hence my interest in finding documentation and analysis I > can read and grok myself on the implications of snapshot operations on > foreground I/O if such exists. > More in-line below: > > On 02/09/2018 03:36 PM, Hans van Kranenburg wrote: >>> This has proven thus far less than helpful, as >>> the response tends to be "use less snapshots," or "disable quotas," both >>> of which strike me as intellectually unsatisfying answers >> >> Well, sometimes those answers help. :) "Oh, yes, I disabled qgroups, I >> didn't even realize I had those, and now the problem is gone."
> > I meant less than helpful for me, since for my project I need detailed > and fairly accurate capacity information per sub-volume, and the > relationship between qgroups and subvolume performance wasn't being > spelled out in the responses. Please correct me if I am wrong about > needing qgroups enabled to see detailed capacity information > per-subvolume (including snapshots). Aha, so you actually want to use qgroups. >>> the former in a filesystem where snapshots are supposed to be >>> "first-class citizens." They are. But if you put extra optional feature X Y and Z on top which kill your performance, then they are still supposed to be first-class citizens, but feature X Y and Z might start blurring it a bit. The problem is that qgroups and quota etc is still in development and if you ask the developers, they are probably honest about the fact that you cannot just enable that part of the functionality without some expected and unexpected performance side effects. >> Throwing complaints around is also not helpful. > > Sorry about this. It wasn't directed in any way at BTRFS developers, > but rather some of the suggestions for solution proposed in random > forums online. > As mentioned I'm a fan of BTRFS, especially as my > project requires the snapshots to truly be first-class citizens in that > they are writable and one can roll-back to them at-will, unlike in ZFS > and other filesystems. I was just saying it seemed backwards to suggest > having less snapshots was a solution in a filesystem where the > architecture appears to treat them as a core part of the design. And I was just saying that subvolumes and snapshots are fine, and that you shouldn't blame them while your problems might be more likely qgroups/quota related. >> The "performance implications" are highly dependent on your specific >> setup, kernel version, etc, so it really makes sense to share: >> >> * kernel version >> * mount options (from /proc/mounts|grep btrfs) >> * is it ssd? hdd? 
iscsi lun? >> * how big is the FS >> * how many subvolumes/snapshots? (how many snapshots per subvolume) > > I will answer the above, but would like to reiterate my previous comment > that I still would like to understand the fundamental relationships here > as in my project kernel version is very likely to change (to more > recent), along with mount options and underlying device media. Once > this project hits the field I will additionally have limited control > over how large the FS gets (until physical media space is exhausted of > course) or how many subvolumes/snapshots there are. If I know that > above N snapshots per subvolume performance tanks by M%, I can apply > limits on the use-case in the field, but I am not aware of those kinds > of performance implications yet. > > My present situation is the following
[PATCH] btrfs: alloc_chunk: fix DUP stripe size handling
In case of using DUP, we search for enough unallocated disk space on a device to hold two stripes. The devices_info[ndevs-1].max_avail that holds the amount of unallocated space found is directly assigned to stripe_size, while it's actually twice the stripe size. Later on in the code, an unconditional division of stripe_size by dev_stripes corrects the value, but in the meantime there's a check to see if the stripe_size does not exceed max_chunk_size. Since during this check stripe_size is twice the amount as intended, the check will reduce the stripe_size to max_chunk_size if the actual correct to be used stripe_size is more than half the amount of max_chunk_size. The unconditional division later tries to correct stripe_size, but will actually make sure we can't allocate more than half the max_chunk_size. Fix this by moving the division by dev_stripes before the max chunk size check, so it always contains the right value, instead of putting a duct tape division in further on to get it fixed again. Since in all other cases than DUP, dev_stripes is 1, this change only affects DUP. Other attempts in the past were made to fix this: * 37db63a400 "Btrfs: fix max chunk size check in chunk allocator" tried to fix the same problem, but still resulted in part of the code acting on a wrongly doubled stripe_size value. * 86db25785a "Btrfs: fix max chunk size on raid5/6" unintentionally broke this fix again. The real problem was already introduced with the rest of the code in 73c5de0051. The user visible result however will be that the max chunk size for DUP will suddenly double, while it's actually acting according to the limits in the code again like it was 5 years ago. 
Reported-by: Naohiro Aota <naohiro.a...@wdc.com> Link: https://www.spinics.net/lists/linux-btrfs/msg69752.html Fixes: 73c5de0051 ("btrfs: quasi-round-robin for chunk allocation") Fixes: 86db25785a ("Btrfs: fix max chunk size on raid5/6") Signed-off-by: Hans van Kranenburg <hans.van.kranenb...@mendix.com> Cc: Naohiro Aota <naohiro.a...@wdc.com> Cc: Arne Jansen <sensi...@gmx.net> Cc: Chris Mason <chris.ma...@fusionio.com> --- fs/btrfs/volumes.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 4006b2a1233d..a50bd02b7ada 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -4737,7 +4737,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, * the primary goal is to maximize the number of stripes, so use as many * devices as possible, even if the stripes are not maximum sized. */ - stripe_size = devices_info[ndevs-1].max_avail; + stripe_size = div_u64(devices_info[ndevs-1].max_avail, dev_stripes); num_stripes = ndevs * dev_stripes; /* @@ -4772,8 +4772,6 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, stripe_size = devices_info[ndevs-1].max_avail; } - stripe_size = div_u64(stripe_size, dev_stripes); - /* align to BTRFS_STRIPE_LEN */ stripe_size = round_down(stripe_size, BTRFS_STRIPE_LEN); -- 2.11.0
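The ordering bug this patch fixes can be shown with plain arithmetic. The sketch below is a simplification of the logic in __btrfs_alloc_chunk() (it ignores the data_stripes factor in the real max_chunk_size check), using DUP metadata as an example: dev_stripes = 2, a hypothetical 256MiB chunk size limit, and 1GiB of unallocated space on the device.

```python
# Illustration of the ordering bug described in the commit message above;
# plain arithmetic, not kernel code.

SZ_1M = 1024 * 1024

def old_alloc(max_avail, max_chunk_size, dev_stripes):
    stripe_size = max_avail                   # actually twice the stripe size
    if stripe_size > max_chunk_size:
        stripe_size = max_chunk_size          # clamps the doubled value
    return stripe_size // dev_stripes         # the "duct tape" division

def new_alloc(max_avail, max_chunk_size, dev_stripes):
    stripe_size = max_avail // dev_stripes    # divide first: correct value
    if stripe_size > max_chunk_size:
        stripe_size = max_chunk_size          # clamp now works as intended
    return stripe_size

max_avail = 1024 * SZ_1M    # 1GiB unallocated
max_chunk = 256 * SZ_1M     # hypothetical chunk size limit

print(old_alloc(max_avail, max_chunk, 2) // SZ_1M)  # 128: half the limit
print(new_alloc(max_avail, max_chunk, 2) // SZ_1M)  # 256: the full limit
```

With the old ordering, the clamp acts on the doubled value and the later division halves the limit itself, so DUP chunks could never exceed half of max_chunk_size: exactly the "max chunk size for DUP will suddenly double" effect described in the commit message.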
Re: [PATCH 21/26] btrfs-progs: use libbtrfsutil for subvol show
On 01/26/2018 07:41 PM, Omar Sandoval wrote: > From: Omar Sandoval <osan...@fb.com> > > Now implemented with btrfs_util_subvolume_path(), > btrfs_util_subvolume_info(), and subvolume iterators. > > Signed-off-by: Omar Sandoval <osan...@fb.com> > --- > cmds-subvolume.c | 150 > --- > utils.c | 118 --- > utils.h | 5 -- > 3 files changed, 99 insertions(+), 174 deletions(-) > > diff --git a/cmds-subvolume.c b/cmds-subvolume.c > index c5e03011..b969fc88 100644 > --- a/cmds-subvolume.c > +++ b/cmds-subvolume.c > > > [...] > } else > strcpy(tstr, "-"); > printf("\tCreation time: \t\t%s\n", tstr); > > - printf("\tSubvolume ID: \t\t%llu\n", get_ri.root_id); > - printf("\tGeneration: \t\t%llu\n", get_ri.gen); > - printf("\tGen at creation: \t%llu\n", get_ri.ogen); > - printf("\tParent ID: \t\t%llu\n", get_ri.ref_tree); > - printf("\tTop level ID: \t\t%llu\n", get_ri.top_id); > + printf("\tSubvolume ID: \t\t%" PRIu64 "\n", subvol.id); > + printf("\tGeneration: \t\t%" PRIu64 "\n", subvol.generation); > + printf("\tGen at creation: \t%" PRIu64 "\n", subvol.otransid); > + printf("\tParent ID: \t\t%" PRIu64 "\n", subvol.parent_id); > + printf("\tTop level ID: \t\t%" PRIu64 "\n", subvol.parent_id); > > - if (get_ri.flags & BTRFS_ROOT_SUBVOL_RDONLY) > + if (subvol.flags & BTRFS_ROOT_SUBVOL_RDONLY) > printf("\tFlags: \t\t\treadonly\n"); > else > printf("\tFlags: \t\t\t-\n"); > > /* print the snapshots of the given subvol if any*/ > printf("\tSnapshot(s):\n"); > - filter_set = btrfs_list_alloc_filter_set(); > - btrfs_list_setup_filter(_set, BTRFS_LIST_FILTER_BY_PARENT, > - (u64)(unsigned long)get_ri.uuid); > - btrfs_list_setup_print_column(BTRFS_LIST_PATH); > > - fd = open_file_or_dir(fullpath, ); > - if (fd < 0) { > - fprintf(stderr, "ERROR: can't access '%s'\n", fullpath); > - goto out; > + err = btrfs_util_f_create_subvolume_iterator(fd, > + BTRFS_FS_TREE_OBJECTID, > + 0, ); > + > [...] 
When you have enough subvolumes in a filesystem, let's say 10 (yes, that sometimes happens), the current btrfs sub list is quite unusable, which is kind of expected. But, currently, sub show is also unusable because it still starts loading a list of all subvolumes to be able to print the 'snapshots, if any'. So I guess that this new sub show will at least print the info first and then use the iterator and start crawling through the list, which can be interrupted? At least you get the relevant info first then. :-) -- Hans van Kranenburg
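That interruptible behaviour is essentially the eager-list versus lazy-iterator distinction. A generic Python illustration of the pattern (not libbtrfsutil's actual API — names here are made up): with a generator, each lookup only happens when the caller asks for the next item, so a caller that only needs the first few results never pays for the rest.

```python
# Eager vs lazy subvolume listing; generic generator pattern, with
# hypothetical names standing in for the real lookups.

def load_all_subvolumes(n):
    # eager: all n lookups happen before the caller sees anything
    return [f"subvol-{i}" for i in range(n)]

def iter_subvolumes(n):
    # lazy: each lookup happens only when the caller asks for it
    for i in range(n):
        yield f"subvol-{i}"

it = iter_subvolumes(100000)
first_three = [next(it) for _ in range(3)]   # only 3 lookups performed
print(first_three)
```

So a command like sub show can print its own info immediately and then start crawling lazily; interrupting it simply abandons the generator.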
Re: [PATCH 00/26] btrfs-progs: introduce libbtrfsutil, "btrfs-progs as a library"
Hi, On 01/26/2018 07:51 PM, Hugo Mills wrote: > On Fri, Jan 26, 2018 at 10:40:48AM -0800, Omar Sandoval wrote: >> From: Omar Sandoval <osan...@fb.com> >> >> One of the feature requests I get most often is a library to do the >> sorts of operations that we do with btrfs-progs. We can shell out to >> btrfs-progs, but the output format isn't always easily parsable, and >> shelling out isn't always ideal. There's libbtrfs, but it's very >> incomplete, doesn't have a well thought out API, and is licensed under >> the GPL, making it hard to use for many projects. >> >> libbtrfsutil is a new library written from scratch to address these >> issues. The implementation is completely independent of the existing >> btrfs-progs code, including kerncompat.h, and has a clean API and naming >> scheme. It is licensed under the LGPL. It also includes Python bindings >> by default. I will maintain the library code. Insert mandatory picture here: https://memegenerator.net/img/instances/12076366/python-all-the-things.jpg > *speechless* > > That's awesome, Omar (although with the python bindings, you've > probably just ruined Hans' day ;) ). Not quite, or should I say, quite the opposite, actually. I can think of a few different python libs related to btrfs right now... 1. This one, which works on the same level as the btrfs-progs command line utils and provides the same set of functionality, with the same input and output, only way easier to handle when automating things. The audience is the same audience as those who would otherwise try to 'shell out' to btrfs-progs and then do painful parsing of the output. 2. A lib for advanced btrfs users (more of a niche audience) who want to get a peek behind the curtains of what is going on in their online filesystem. Directly do searches in metadata, directly call an ioctl etc, either for educational purposes, or to e.g. do things like write a small special purpose dedupe program, visualize things etc.
This is the python3-btrfs lib that I've been working on for some time now. 3. A lib for btrfs developers who want to script working on offline filesystems. This one does not exist yet. Well, it's a bit of an unborn baby now, my first brain dump: https://github.com/knorrie/btrfs-progs/blob/python/python3/README.md This would be the lib that exposes the (to be revised and improved) internal libbtrfs (I guess). All of these are not competitors. They serve a different purpose and a different audience. So I'm happy to see number 1 also happening. :-) >> Patch 1 is a preparation cleanup which can go in independently. Patch 2 >> adds the build system stuff for the library, and patch 3 does the same >> for the Python bindings. Patches 4-14 implement the library helpers, >> currently subvolume helpers and the sync ioctls. Patches 15-26 replace >> the btrfs-progs and libbtrfs code to use libbtrfsutil instead. I took >> care to preserve backwards-compatibility. `btrfs subvol list` in >> particular had some buggy behaviors for -o and -a that I emulated in the >> new code, see the comments in the code. +1 for starting to clean up that code. Especially the whole subvolume list part looks like a great improvement. >> These patches are also available on my GitHub: >> https://github.com/osandov/btrfs-progs/tree/libbtrfsutil. That branch >> will rebase as I update this series. >> >> Please share feedback regarding the API, implementation, or anything >> else. Great work, and I'm of course also interested in what you think about the lib nr.3 idea, since I didn't throw it around on the list before yet. :-) -- Hans van Kranenburg
Re: Superblock update: Is there really any benefits of updating synchronously?
On 01/24/2018 07:54 PM, waxhead wrote: > Hans van Kranenburg wrote: >> On 01/23/2018 08:51 PM, waxhead wrote: >>> Nikolay Borisov wrote: >>>> On 23.01.2018 16:20, Hans van Kranenburg wrote: >> >> [...] >> >>>>> >>>>> We also had a discussion about the "backup roots" that are stored >>>>> besides the superblock, and that they are "better than nothing" to help >>>>> maybe recover something from a broken fs, but never ever guarantee you >>>>> will get a working filesystem back. >>>>> >>>>> The same holds for superblocks from a previous generation. As soon as >>>>> the transaction for generation X successfully hits the disk, all space >>>>> that was occupied in generation X-1 but no longer in X is available to >>>>> be overwritten immediately. >>>>> >>> Ok so this means that superblocks with an older generation are utterly >>> useless and will lead to corruption (effectively making my argument >>> above useless as that would in fact assist corruption then). >> >> Mostly, yes. >> >>> Does this means that if disk space was allocated in X-1 and is freed in >>> X it will unallocated if you roll back to X-1 e.g. writing to >>> unallocated storage. >> >> Can you reword that? I can't follow that sentence. > Sure why not. I'll give it a go: > > Does this mean that if... > * Superblock generation N-1 have range 1234-2345 allocated and used. > > and > > * Superblock generation N-0 (the current) have range 1234-2345 free > because someone deleted a file or something Ok, so I assume that with current you mean the one on disk now. > Then > > There is no point in rolling back to generation N-1 because that refers to > what is now essentially free "memory" which may or may not have been > written over by generation N-0. If space that was used in N-1 turned into free space during N-0, then N-0 will never have reused that space, because if writing out N-0 had crashed halfway, the superblock as seen when mounting would still be N-1, and then you need to be able to fully use N-1.
It can, however, be used immediately by N+1, once the N-0 superblock is safe on disk. > And therefore N-1 which still thinks > range 1234-2345 is allocated may point to the wrong data. So, at least for disk space used by metadata blocks:

1234-2345 - N-1 - in use
1234-2345 - N-0 - not in use, but can't be overwritten yet
1234-2345 - N+1 - can start writing whatever it wants in that disk location any time

> I hope that was easier to follow - if not don't hold back on the > expletives! :) > >> >>> I was under the impression that a superblock was like a "snapshot" of >>> the entire filesystem and that rollbacks via pre-gen superblocks was >>> possible. Am I mistaken? >> >> Yes. The first fundamental thing in Btrfs is COW which makes sure that >> everything referenced from transaction X, from the superblock all the >> way down to metadata trees and actual data space is never overwritten by >> changes done in transaction X+1. >> > Perhaps a tad off topic, but assuming the (hopefully) better explanation > above clears things up a bit. What happens if a block is freed?! in X+1 > --- which must mean that it can be overwritten in transaction X+1 (which > I assume means a new superblock generation). After all without freeing > and overwriting data there is no way to re-use space. Freed in X you mean? Or not? But you write "freed?! in X+1". For actual data disk space, it's the same pattern as above (so space freed up during a transaction can only be reused in the next one), but implemented a bit differently. For metadata trees which do not have reference counting (e.g. the extent tree), there's the pinned extent (metadata block disk locations) list I mentioned already. For data, we have the filesystem (subvolume) trees which reference all files and the data extents they use, and via the links to the extent tree they keep all locations where actual data is on disk marked as occupied. Now comes the different part.
Because the filesystem trees already implement the extra reference counting functionality, this is being used to prevent freed up data space from already being overwritten in the same transaction. How does this work? Well, that's the rest of the wiki section I linked below. :-D So you're asking exactly the right next question here I guess. When making changes to a subvolume tree (normal file create, write content, rename delete etc), btrfs is secretly just cloning the tree into a new subvolume with the same subvolume ID. Wait, what? Whoa
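The "secretly cloning the tree" trick above can be sketched with a toy model. This is an illustration only, not btrfs code: the node layout and function names are invented for the example. A snapshot is just a new root that shares all children of the old root, and reference counts decide when a write has to clone ("cow") a shared node first.

```python
# Toy model of reference-counted COW tree snapshots, as used for btrfs
# subvolume trees. All names here are made up for the illustration.

class Node:
    def __init__(self, data, children=()):
        self.data = data
        self.children = list(children)
        self.refs = 1

def snapshot(root):
    # A snapshot is a new root that shares all children of the old
    # root; the shared subtrees just get their reference count bumped.
    for child in root.children:
        child.refs += 1
    return Node(root.data, root.children)

def cow_write(root, index, data):
    # Writing into a shared node first clones it, so the other tree
    # that references it keeps seeing the old version.
    child = root.children[index]
    if child.refs > 1:
        child.refs -= 1
        child = Node(child.data, child.children)
        for grandchild in child.children:
            grandchild.refs += 1
        root.children[index] = child
    child.data = data

leaf = Node("v1")
orig = Node("root", [leaf])
snap = snapshot(orig)
cow_write(orig, 0, "v2")
print(orig.children[0].data, snap.children[0].data)  # v2 v1
```

After the write, the original tree sees "v2" while the snapshot still points at the untouched "v1" leaf, whose reference count dropped back to one.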
Re: Superblock update: Is there really any benefits of updating synchronously?
On 01/23/2018 08:51 PM, waxhead wrote: > Nikolay Borisov wrote: >> On 23.01.2018 16:20, Hans van Kranenburg wrote: [...] >>> >>> We also had a discussion about the "backup roots" that are stored >>> besides the superblock, and that they are "better than nothing" to help >>> maybe recover something from a borken fs, but never ever guarantee you >>> will get a working filesystem back. >>> >>> The same holds for superblocks from a previous generation. As soon as >>> the transaction for generation X succesfully hits the disk, all space >>> that was occupied in generation X-1 but no longer in X is available to >>> be overwritten immediately. >>> > Ok so this means that superblocks with a older generation is utterly > useless and will lead to corruption (effectively making my argument > above useless as that would in fact assist corruption then). Mostly, yes. > Does this means that if disk space was allocated in X-1 and is freed in > X it will unallocated if you roll back to X-1 e.g. writing to > unallocated storage. Can you reword that? I can't follow that sentence. > I was under the impression that a superblock was like a "snapshot" of > the entire filesystem and that rollbacks via pre-gen superblocks was > possible. Am I mistaking? Yes. The first fundamental thing in Btrfs is COW which makes sure that everything referenced from transaction X, from the superblock all the way down to metadata trees and actual data space is never overwritten by changes done in transaction X+1. For metadata trees that are NOT filesystem trees a.k.a. subvolumes, the way this is done is actually quite simple. If a block is cowed, the old location is added to a 'pinned extents' list (in memory), which is used as a blacklist for choosing space to put new writes in. After a transaction is completed on disk, that list with pinned extents is emptied and all that space is available for immediate reuse. 
This way we make sure that if the transaction that is ongoing is aborted, the previous one (the latest one that is completely on disk) is always still there. If the computer crashes and the in-memory list is lost, no big deal, we just continue from the latest completed transaction again after a reboot. (ignoring extra log things for simplicity) So, the only situation in which you can fully use an X-1 superblock is when none of that previously pinned space has actually been overwritten yet afterwards. And if any of the space was overwritten already, you can go play around with using an older superblock and your filesystem mounts and everything might look fine, until you hit that distant corner and BOOM! >8 Extra!! Moar!! >8 But, doing so does not give you snapshot functionality yet! It's more like a poor man's snapshot that can only prevent messing up the current version. Snapshot functionality is implemented only for filesystem trees (subvolumes) by adding reference counting (which does end up on disk) to the metadata blocks, and then COWing trees as a whole. If you make a snapshot of a filesystem tree, the snapshot gets a whole new tree ID! It's not a previous version of the same subvolume you're looking at, it's a clone! This is a big difference. The extent tree is always tree 2. The chunk tree is always tree 3. But your subvolume snapshot gets a new tree number. Technically, it would maybe be possible to implement reference counting and snapshots for all of the metadata trees, but it would probably mean that the whole filesystem would get stuck in rewriting itself all day instead of doing any useful work. The current extent tree already has such an amount of rumination problems that the added work of keeping track of reference counts would make it completely unusable. In the wiki, it's here: https://btrfs.wiki.kernel.org/index.php/Btrfs_design#Copy_on_Write_Logging Actually, I just paraphrased the first two of those six paragraphs... 
The subvolume trees actually having a previous version of themselves again (wh!) is another thing... ;] -- Hans van Kranenburg -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
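The pinned-extents mechanism described in this message can be modeled in a few lines. This is a sketch with invented names, not the btrfs implementation: space freed during a transaction goes on an in-memory blacklist and only becomes allocatable again after the commit.

```python
# Minimal model of the 'pinned extents' list: space freed during
# transaction N may only be handed out again after the superblock for
# N is safely on disk. All names are made up for the illustration.

class Allocator:
    def __init__(self, free):
        self.free = sorted(free)   # locations available for new writes
        self.pinned = []           # freed this transaction, not reusable yet

    def allocate(self):
        # Pinned space is never handed out, so if the current
        # transaction is aborted, the previous one is still intact.
        return self.free.pop(0)

    def free_extent(self, location):
        self.pinned.append(location)

    def commit(self):
        # Transaction is safely on disk: the in-memory pinned list is
        # emptied and that space is available for immediate reuse.
        self.free += self.pinned
        self.pinned = []

a = Allocator([100])
a.free_extent(300)      # freed during this transaction
print(a.allocate())     # 100: the pinned location 300 is skipped
a.commit()
print(a.allocate())     # 300: reusable in the next transaction
```

If the crash happens before `commit()`, the pinned list simply evaporates with the rest of memory, and nothing referenced by the last completed transaction was ever overwritten.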
Re: Superblock update: Is there really any benefits of updating synchronously?
On 01/23/2018 10:03 AM, Nikolay Borisov wrote: > > On 23.01.2018 09:03, waxhead wrote: >> Note: This have been mentioned before, but since I see some issues >> related to superblocks I think it would be good to bring up the question >> again. >> >> [...] >> https://btrfs.wiki.kernel.org/index.php/On-disk_Format#Superblock >> >> The superblocks are updated synchronously on HDD's and one after each >> other on SSD's. > > There is currently no distinction in the code whether we are writing to > SSD or HDD. So what does that line in the wiki mean, and why is it there? "btrfs normally updates all superblocks, but in SSD mode it will update only one at a time." > Also what do you mean by synchronously, if you inspect the > code in write_all_supers you will see what for every device we issue > writes for every available copy of the superblock and then wait for all > of them to be finished via the 'wait_dev_supers'. In that regard sb > writeout is asynchronous. > >> Superblocks are also (to my knowledge) not protected by copy-on-write >> and are read-modify-update. >> >> On a storage device with >256GB there will be three superblocks. >> >> BTRFS will always prefer the superblock with the highest generation >> number providing that the checksum is good. > > Wrong. On mount btrfs will only ever read the first superblock at 64k. > If that one is corrupted it will refuse to mount, then it's expected the > user will initiate recovery procedure with btrfs-progs which reads all > supers and replaces them with the "newest" one (as decided by the > generation number) So again, the line "The superblock with the highest generation is used when reading." in the wiki needs to go away then? >> On the list there seem to be a few incidents where the superblocks have >> gone toast and I am pondering what (if any) benefits there is by >> updating the superblocks synchronously. 
>> >> The superblock is checkpoint'ed every 30 seconds by default and if >> someone pulls the plug (power outage) on HDD's then a synchronous write >> depending on (the quality of) your hardware may perhaps ruin all the >> superblock copies in one go. E.g. Copy A, B and C will all be updated at >> 30s. >> >> On SSD's, since one superblock is updated after the other it would mean that >> using the default 30 second checkpoint Copy A=30s, Copy B=1m, Copy C=1m30s > > As explained previously there is no notion of "SSD vs HDD" modes. We also had a discussion about the "backup roots" that are stored besides the superblock, and that they are "better than nothing" to help maybe recover something from a borken fs, but never ever guarantee you will get a working filesystem back. The same holds for superblocks from a previous generation. As soon as the transaction for generation X successfully hits the disk, all space that was occupied in generation X-1 but no longer in X is available to be overwritten immediately. -- Hans van Kranenburg
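The ">256GB means three superblocks" detail mentioned earlier in this thread follows from the fixed mirror offsets: the primary copy sits at 64KiB, and further mirrors at 16KiB shifted left by 12 bits per mirror, i.e. 64MiB and 256GiB. A copy only exists if it fits on the device. A small sketch of that computation (mirroring the kernel's btrfs_sb_offset(); the 4KiB superblock size check is simplified here):

```python
# Compute which btrfs superblock copies fit on a device of a given
# size, using the on-disk format constants.

SUPER_INFO_OFFSET = 64 * 1024     # primary superblock at 64KiB
SUPER_MIRROR_MAX = 3
SUPER_MIRROR_SHIFT = 12
SUPER_INFO_SIZE = 4096            # a superblock is 4KiB

def sb_offset(mirror):
    if mirror:
        return (16 * 1024) << (SUPER_MIRROR_SHIFT * mirror)
    return SUPER_INFO_OFFSET

def superblock_locations(device_size):
    # Only keep copies that actually fit on the device.
    return [sb_offset(i) for i in range(SUPER_MIRROR_MAX)
            if sb_offset(i) + SUPER_INFO_SIZE <= device_size]

print(superblock_locations(300 * 1024**3))   # 64KiB, 64MiB, 256GiB
print(len(superblock_locations(100 * 1024**3)))  # 2: third needs >256GiB
```

So a 100GiB device carries two copies, and only devices larger than 256GiB get all three.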
Re: Recommendations for balancing as part of regular maintenance?
On 01/10/2018 05:38 AM, Duncan wrote: > [...] > > And I've definitely noticed an effect since the ssd option stopped using > the 2 MiB spreading algorithm in 4.14. Glad to hear. :-) > In particular, while chunk usage > was generally stable before that and I only occasionally needed to run > balance to clear out empty chunks, now, balance with the usage filter > will apparently actively fill in empty space in existing chunks, so while > previously a usage-filtered balance that only rewrote one chunk didn't > actually free anything, simply allocating a new chunk to replace the one > it freed, so at least two chunks needed rewritten to actually free space > back to unallocated... > > Now, usage-filtered rewrites of only a single chunk routinely frees the > allocated space, because it writes that small bit of data in the freed > chunk into existing free space in other chunks. And that back-filling the existing chunks indeed also means a decrease in total work that needs to be done. But this probably also means that the free space gaps you have/had were rather small so they got ignored in the past. Large free gaps would always get data written in them by balance already. It probably also means that now you're on 4.14, much less of these small free space gaps will be left behind, because they're already immediately reused by new small writes. > At least I /presume/ that new balance-usage behavior is due to the ssd > changes. Most probably, yes. > Maybe it's due to other patches. Either way, it's an > interesting and useful change. =:^) -- Hans van Kranenburg -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Recommendations for balancing as part of regular maintenance?
On 01/10/2018 10:37 PM, waxhead wrote: > As just a regular user I would think that the first thing you would need > is an analyze that can tell you if it is a good idea to balance or not > in the first place. Tooling to create that is available. Btrfs allows you to read a lot of different data to analyze, and then you can experiment with your own algorithms to find out which blockgroup you're going to feed to balance next. There's two language options... * C -> in this case you're extending btrfs-progs to build new tools * Python -> python-btrfs has everything in it to quickly throw together things like this, and examples are available with the source. For example: - balance_least_used.py -> balance starting from the least used chunk and work up towards a max of X% used. - show_free_space_fragmentation.py -> find out which chunks have badly fragmented free space. Remember: if you have a 1GiB chunk with usage 50%, that doesn't tell you if it has only a handful of extents, filling up the first 500MiB, with the rest empty, or if it's thousands of alternating pieces of 4KiB used space and 4KiB free space. ;-] [0] In the same way you can program something new, like a balance algorithm that cleans up blocks with high free space fragmentation first. Or, another thing you could do is first count the number of extents in a block group and add it to the algorithm. Balance of a block group with a few extents is much faster than thousands of extents with a lot of reflinks, like highly deduped data. Or... look at generation of metadata to find out which parts of data on your disk have been touched recently, and which weren't... Too many fun things to play around with. \:D/ As always, first thing to do is make sure you're on 4.14 or otherwise use nossd, otherwise you might keep shoveling data around forever. 
And if your filesystem has been treated badly by <4.14 kernel in ssd mode for a long time, then first get that cleaned up: https://www.spinics.net/lists/linux-btrfs/msg70622.html > Scrub seems like a great place to start - e.g. scrub could auto-analyze > and report back need to balance. I also think that scrub should > optionally autobalance if needed. > > Balance may not be needed, but if one can determine that balancing would > speed up things a bit I don't see why this as an option can't be > scheduled automatically. Ideally there should be a "scrub and polish" > option that would scrub, balance and perhaps even defragment in one go. > > In fact, the way I see it btrfs should idealy by itself keep track on > each data/metadata chunk and it should know , when was this chunk last > affected by a scrub, balance, defrag etc and perform the required > operations by itself based on a configuration or similar. Some may > disagree for good reasons , but for me this is my wishlist for a > filesystem :) e.g. a pool that just works and only annoys you with the > need of replacing a bad disk every now and then :) I don't think these kind of things will ever end up in kernel code. [0] There's a version in the devel branch in git that also works without free space tree, taking a slower detour via the extent tree. -- Hans van Kranenburg -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
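The "usage 50% tells you nothing" remark above can be made concrete with a small sketch. The extent lists here are fabricated for the example; in practice python-btrfs can supply real extent data for a block group.

```python
# Given the used extents inside a 1GiB chunk, derive the free-space
# fragments. Both chunks below are exactly 50% used, but one has a
# single 512MiB hole while the other has alternating 4KiB holes.

SZ_1G = 1024 ** 3
SZ_4K = 4096

def free_fragments(chunk_length, used_extents):
    """used_extents: sorted (offset, length) tuples inside the chunk."""
    frags = []
    pos = 0
    for offset, length in used_extents:
        if offset > pos:
            frags.append((pos, offset - pos))
        pos = offset + length
    if pos < chunk_length:
        frags.append((pos, chunk_length - pos))
    return frags

one_big_extent = [(0, SZ_1G // 2)]
alternating = [(off, SZ_4K) for off in range(0, SZ_1G, 2 * SZ_4K)]

print(len(free_fragments(SZ_1G, one_big_extent)))  # 1 fragment
print(len(free_fragments(SZ_1G, alternating)))     # 131072 fragments
```

A balance algorithm that prefers the second kind of chunk over the first is exactly the sort of experiment the examples above invite.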
Re: btrfs balance problems
On 12/28/2017 12:15 PM, Nikolay Borisov wrote: > > On 23.12.2017 13:19, James Courtier-Dutton wrote: >> >> During a btrfs balance, the process hogs all CPU. >> Or, to be exact, any other program that wishes to use the SSD during a >> btrfs balance is blocked for long periods. Long periods being more >> than 5 seconds. >> Is there any way to multiplex SSD access while btrfs balance is >> operating, so that other applications can still access the SSD with >> relatively low latency? >> >> My guess is that btrfs is doing a transaction with a large number of >> SSD blocks at a time, and thus blocking other applications. >> >> This makes for atrocious user interactivity as well as applications >> failing because they cannot access the disk in a relatively low latent >> manner. >> For, example, this is causing a High Definition network CCTV >> application to fail. >> >> What I would really like, is for some way to limit SSD bandwidths to >> applications. >> For example the CCTV app always gets the bandwidth it needs, and all >> other applications can still access the SSD, but are rate limited. >> This would fix my particular problem. >> We have rate limiting for network applications, why not disk access also? > > So how are you running btrfs balance? Or, to again take one step further back... *Why* are you running btrfs balance at all? :) > Are you using any filters > whatsoever? The documentation > [https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs-balance] has the > following warning: > > Warning: running balance without filters will take a lot of time as it > basically rewrites the entire filesystem and needs to update all block > pointers. -- Hans van Kranenburg -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] btrfs-progs: Fix progs_extra build dependencies
The Makefile does not have a dependency path that builds dependencies for tools listed in progs_extra. E.g. doing make btrfs-show-super in a clean build environment results in:

gcc: error: cmds-inspect-dump-super.o: No such file or directory
Makefile:389: recipe for target 'btrfs-show-super' failed

Signed-off-by: Hans van Kranenburg <h...@knorrie.org>
---
 Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index 30a0ee22..390b138f 100644
--- a/Makefile
+++ b/Makefile
@@ -220,7 +220,7 @@ cmds_restore_cflags = -DBTRFSRESTORE_ZSTD=$(BTRFSRESTORE_ZSTD)
 CHECKER_FLAGS += $(btrfs_convert_cflags)
 
 # collect values of the variables above
-standalone_deps = $(foreach dep,$(patsubst %,%_objects,$(subst -,_,$(filter btrfs-%, $(progs)))),$($(dep)))
+standalone_deps = $(foreach dep,$(patsubst %,%_objects,$(subst -,_,$(filter btrfs-%, $(progs) $(progs_extra)))),$($(dep)))
 
 SUBDIRS =
 BUILDDIRS = $(patsubst %,build-%,$(SUBDIRS))
-- 
2.11.0
[PATCH 2/2] btrfs-progs: Fix build of btrfs-calc-size
Build would fail because it couldn't find the usage function.

Signed-off-by: Hans van Kranenburg <h...@knorrie.org>
---
 btrfs-calc-size.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/btrfs-calc-size.c b/btrfs-calc-size.c
index 1ac7c785..d2d68ab2 100644
--- a/btrfs-calc-size.c
+++ b/btrfs-calc-size.c
@@ -19,6 +19,7 @@
 #include "volumes.h"
 #include "utils.h"
 #include "commands.h"
+#include "help.h"
 
 int main(int argc, char **argv)
 {
-- 
2.11.0
[PATCH 0/2] btrfs-progs: Fix progs_extra build failures
Alternatively, it might be a better idea to just remove the deprecated source files, since this is not the first time build failures in them went unnoticed.

Hans van Kranenburg (2):
  btrfs-progs: Fix progs_extra build dependencies
  btrfs-progs: Fix build of btrfs-calc-size

 Makefile          | 2 +-
 btrfs-calc-size.c | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

-- 
2.11.0
Re: kernel hangs during balance
On 12/19/2017 06:08 PM, Rich Rauenzahn wrote: > What's also confusing is I just ran a manual balance on the fs using > defaults (which are aggressive) and it completed with no problems. > It smells more like a race condition than a particular corruption. Just wild first guess... are you also using btrfs send/receive functionality where the system having problems is the sending part? > On Tue, Dec 19, 2017 at 8:09 AM, Rich Rauenzahn <rraue...@gmail.com> wrote: >> I'm running 4.4.106-1.el7.elrepo.x86_64 and I do a btrfs balance everynight. >> >> Every night I'm getting a kernel hang, sometimes caught by my >> watchdog, sometimes not. Last night's hang was on the balance of DATA >> on / at 70. >> >> I'm not sure how to further trace this down to help you -- the console >> by the time I notice just has lots of messages on it without the >> initial ones. Capturing more logs is definitely the first thing to do. Look if the output of `dmesg` still shows the btrfs errors. Otherwise, if something is spamming there, turn that off, or if you don't have the errors in a log file because the log files are on the same btrfs, then you have to find out another way to capture them. E.g. make the kernel buffer for messages bigger, use netconsole or just pragmatic things like ssh from another server and `dmesg -w` and store it on the other machine. >> The last items in /var/log/message aren't helpful, but >> I'm pretty sure it is the nightly balance. >> >> I've run btrfs check on / with no issues recently. -- Hans van Kranenburg -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [4.14.3] btrfs out of space error
On 12/15/2017 09:53 AM, Qu Wenruo wrote: > > > On 2017年12月15日 16:36, Ian Kumlien wrote: >> On Fri, Dec 15, 2017 at 6:34 AM, Roman Mamedov <r...@romanrm.net> wrote: >>> On Fri, 15 Dec 2017 01:39:03 +0100 >>> Ian Kumlien <ian.kuml...@gmail.com> wrote: >>> >>>> Hi, >>>> >>>> Running a 4.14.3 kernel, this just happened, but there should have >>>> been another 20 gigs or so available. >>>> >>>> The filesystem seems fine after a reboot though >>> >>> What are your mount options, and can you show the output of "btrfs fi >>> df" and "btrfs fi us" for the filesystem? And what does >>> "cat /sys/block/sdb/queue/rotational" return. >> >> It's a btrfs raid1 mirror of two ssd:s >> >> mount options was: >> defaults,acl,noatime,space_cache,autodefrag,compress=lzo >> >> btrfs fi df / >> Data, RAID1: total=459.25GiB, used=372.42GiB >> Data, single: total=8.00MiB, used=0.00B >> System, RAID1: total=8.00MiB, used=80.00KiB >> System, single: total=4.00MiB, used=0.00B >> Metadata, RAID1: total=6.00GiB, used=3.69GiB > > Both meta and data has a lot of space let. The real question is how fragmented that free space is. 
Here's a good starting point to read up about the -o ssd behavior pre 4.14: https://www.spinics.net/lists/linux-btrfs/msg70622.html

>> Metadata, single: total=8.00MiB, used=0.00B
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>> btrfs fi us /
>> Overall:
>>     Device size:          930.54GiB
>>     Device allocated:     930.53GiB
>>     Device unallocated:    20.05MiB
>>     Device missing:           0.00B
>>     Used:                 752.22GiB
>>     Free (estimated):      86.84GiB (min: 86.84GiB)
>>     Data ratio:                2.00
>>     Metadata ratio:            2.00
>>     Global reserve:       512.00MiB (used: 0.00B)
>>
>> Data,single: Size:8.00MiB, Used:0.00B
>>    /dev/sdb2    8.00MiB
>>
>> Data,RAID1: Size:459.25GiB, Used:372.42GiB
>>    /dev/sdb2  459.25GiB
>>    /dev/sdc2  459.25GiB
>>
>> Metadata,single: Size:8.00MiB, Used:0.00B
>>    /dev/sdb2    8.00MiB
>>
>> Metadata,RAID1: Size:6.00GiB, Used:3.69GiB
>>    /dev/sdb2    6.00GiB
>>    /dev/sdc2    6.00GiB
>>
>> System,single: Size:4.00MiB, Used:0.00B
>>    /dev/sdb2    4.00MiB
>>
>> System,RAID1: Size:8.00MiB, Used:80.00KiB
>>    /dev/sdb2    8.00MiB
>>    /dev/sdc2    8.00MiB
>>
>> Unallocated:
>>    /dev/sdb2   24.00KiB
>>    /dev/sdc2   20.02MiB

> Well, at least no new chunk can be allocated.

Exactly.

> In v4.15, Josef introduced a new inode reservation system, which I think
> it would enhance metadata related reservation.
>
> If you're hitting the problem quite frequently, please consider to try
> v4.15-rc* to see if it solves the problem.
>
> Despite of that, there should be no damage to your fs.
> (except some unwritten data in buffer)
>
> Thanks,
> Qu
>
>> And as expected:
>> cat /sys/block/sdb/queue/rotational
>> 0
>>
>>> I wonder if it's the same old "ssd allocation scheme" problem, and no
>>> balancing done in a long time or at all.

Looks like it. So, yay, you're on 4.14 already. Now just do a full balance of your entire filesystem, only once (data only, metadata not needed) and then you can forget about this again. 
>> I had something similar happen on a laptop a while ago - took a while >> before i could get it back in order >> (in that case i think it was actually a oops --- it kept saying "no >> space left" and switched to read only even >> if you removed a lot of data, invalidated the space cache and so on) -- Hans van Kranenburg
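For completeness, the arithmetic behind "no new chunk can be allocated" in this thread, using the rounded numbers from the report. This is a rough sketch; exact chunk sizes vary, but 1GiB data and 256MiB metadata chunks are the typical defaults.

```python
# Why a filesystem with ~86GiB "free" still hits ENOSPC: all raw disk
# space is already allocated to chunks, so nothing that needs a new
# chunk can succeed. Numbers taken from the btrfs fi usage output
# above (rounded, so the result differs slightly from the reported
# 20.05MiB unallocated).

GiB = 1024 ** 3
MiB = 1024 ** 2

device_size = 930.54 * GiB   # "Device size"
allocated = 930.53 * GiB     # "Device allocated"
unallocated = device_size - allocated

# Roughly 10MiB left unallocated: far less than even a single new
# 256MiB metadata chunk, let alone a 1GiB data chunk.
print(round(unallocated / MiB, 2))
print(unallocated >= 256 * MiB)  # False: no new chunk possible
```

Balance fixes this by compacting existing chunks and returning the emptied ones to the unallocated pool.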
Re: [PATCH RFC] btrfs: self heal from SB fail
On 12/08/2017 09:17 AM, Qu Wenruo wrote: > > > On 2017年12月08日 15:57, Anand Jain wrote: >> -EXPERIMENTAL- >> As of now when primary SB fails we won't self heal and would fail mount, >> this is an experimental patch which thinks why not go and read backup >> copy. > > Just curious about in which real world case that backup super block can > help. > At least from what I see in mail list, only few cases where backup super > helps. That's about using 'backup roots' with old info about previous generations than the current one on disk, which is a different thing than using one of the other copies of the super block itself. > Despite that self super heal seems good, although I agree with David, we > need a weaker but necessary check (magic and fsid from primary super?) > to ensure it's a valid btrfs before we use the backup supers. > > Thanks, > Qu > > >> >> Signed-off-by: Anand Jain <anand.j...@oracle.com> >> --- >> fs/btrfs/disk-io.c | 8 +++- >> fs/btrfs/volumes.c | 10 +++--- >> 2 files changed, 14 insertions(+), 4 deletions(-) >> >> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c >> index 9b20c1f3563b..a791b8dfe8a8 100644 >> --- a/fs/btrfs/disk-io.c >> +++ b/fs/btrfs/disk-io.c >> @@ -3190,7 +3190,7 @@ struct buffer_head *btrfs_read_dev_super(struct >> block_device *bdev) >> * So, we need to add a special mount option to scan for >> * later supers, using BTRFS_SUPER_MIRROR_MAX instead >> */ >> -for (i = 0; i < 1; i++) { >> +for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) { >> ret = btrfs_read_dev_one_super(bdev, i, ); >> if (ret) >> continue; >> @@ -4015,11 +4015,17 @@ static int btrfs_check_super_valid(struct >> btrfs_fs_info *fs_info) >> ret = -EINVAL; >> } >> >> +#if 0 >> +/* >> + * Need a way to check for any copy of SB, as its not a >> + * strong check, just ignore this for now. 
>> + */ >> if (btrfs_super_bytenr(sb) != BTRFS_SUPER_INFO_OFFSET) { >> btrfs_err(fs_info, "super offset mismatch %llu != %u", >>btrfs_super_bytenr(sb), BTRFS_SUPER_INFO_OFFSET); >> ret = -EINVAL; >> } >> +#endif >> >> /* >> * Obvious sys_chunk_array corruptions, it must hold at least one key >> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c >> index 9fa2539a8493..f368db94d62b 100644 >> --- a/fs/btrfs/volumes.c >> +++ b/fs/btrfs/volumes.c >> @@ -1369,7 +1369,7 @@ int btrfs_scan_one_device(const char *path, fmode_t >> flags, void *holder, >> { >> struct btrfs_super_block *disk_super; >> struct block_device *bdev; >> -struct page *page; >> +struct buffer_head *sb_bh; >> int ret = -EINVAL; >> u64 devid; >> u64 transid; >> @@ -1392,8 +1392,12 @@ int btrfs_scan_one_device(const char *path, fmode_t >> flags, void *holder, >> goto error; >> } >> >> -if (btrfs_read_disk_super(bdev, bytenr, , _super)) >> +sb_bh = btrfs_read_dev_super(bdev); >> +if (IS_ERR(sb_bh)) { >> +ret = PTR_ERR(sb_bh); >> goto error_bdev_put; >> +} >> +disk_super = (struct btrfs_super_block *) sb_bh->b_data; >> >> devid = btrfs_stack_device_id(_super->dev_item); >> transid = btrfs_super_generation(disk_super); >> @@ -1413,7 +1417,7 @@ int btrfs_scan_one_device(const char *path, fmode_t >> flags, void *holder, >> if (!ret && fs_devices_ret) >> (*fs_devices_ret)->total_devices = total_devices; >> >> -btrfs_release_disk_super(page); >> +brelse(sb_bh); >> >> error_bdev_put: >> blkdev_put(bdev, flags); >> > -- Hans van Kranenburg -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs-transacti hammering the system
On 12/01/2017 06:57 PM, Holger Hoffstätte wrote: > On 12/01/17 18:34, Matt McKinnon wrote: >> Thanks, I'll give space_cache=v2 a shot. > > Yes, very much recommended. > >> My mount options are: rw,relatime,space_cache,autodefrag,subvolid=5,subvol=/ > > Turn autodefrag off and use noatime instead of relatime. > > Your filesystem also seems very full, We don't know. btrfs fi df only displays allocated space. And that being full is good, it means not too much free space fragments everywhere. > that's bad with every filesystem but > *especially* with btrfs because the allocator has to work really hard to find > free space for COWing. Really consider deleting stuff or adding more space. -- Hans van Kranenburg
Re: btrfs-transacti hammering the system
On 12/01/2017 05:31 PM, Matt McKinnon wrote: > Sorry, I missed your in-line reply: > > >> 1) The one right above, btrfs_write_out_cache, is the write-out of the >> free space cache v1. Do you see this for multiple seconds going on, and >> does it match the time when it's writing X MB/s to disk? >> > > It seems to only last until the next watch update. > > [] io_schedule+0x16/0x40 > [] get_request+0x23e/0x720 > [] blk_queue_bio+0xc1/0x3a0 > [] generic_make_request+0xf8/0x2a0 > [] submit_bio+0x75/0x150 > [] btrfs_map_bio+0xe5/0x2f0 [btrfs] > [] btree_submit_bio_hook+0x8c/0xe0 [btrfs] > [] submit_one_bio+0x63/0xa0 [btrfs] > [] flush_epd_write_bio+0x3b/0x50 [btrfs] > [] flush_write_bio+0xe/0x10 [btrfs] > [] btree_write_cache_pages+0x379/0x450 [btrfs] > [] btree_writepages+0x5d/0x70 [btrfs] > [] do_writepages+0x1c/0x70 > [] __filemap_fdatawrite_range+0xaa/0xe0 > [] filemap_fdatawrite_range+0x13/0x20 > [] btrfs_write_marked_extents+0xe9/0x110 [btrfs] > [] btrfs_write_and_wait_transaction.isra.22+0x3d/0x80 > [btrfs] > [] btrfs_commit_transaction+0x665/0x900 [btrfs] > [] transaction_kthread+0x18a/0x1c0 [btrfs] > [] kthread+0x109/0x140 > [] ret_from_fork+0x25/0x30 > > The last three lines will stick around for a while. Is switching to > space cache v2 something that everyone should be doing? Something that > would be a good test at least? Yes. Read on. >> 2) How big is this filesystem? What does your `btrfs fi df >> /mountpoint` say? >> > > # btrfs fi df /export/ > Data, single: total=30.45TiB, used=30.25TiB > System, DUP: total=32.00MiB, used=3.62MiB > Metadata, DUP: total=66.50GiB, used=65.08GiB > GlobalReserve, single: total=512.00MiB, used=53.69MiB Multi-TiB filesystem, check. total/used ratio looks healthy. >> 3) What kind of workload are you running? E.g. how can you describe it >> within a range from "big files which just sit there" to "small writes >> and deletes all over the place all the time"? > > It's a pretty light workload most of the time. 
It's a file system that > exports two NFS shares to a small lab group. I believe it is more small > reads all over a large file (MRI imaging) rather than small writes. Ok. >> 4) What kernel version is this? `uname -a` output? > > # uname -a > Linux machine_name 4.12.8-custom #1 SMP Tue Aug 22 10:15:01 EDT 2017 > x86_64 x86_64 x86_64 GNU/Linux > Yes, I'd recommend switching to space_cache v2, which stores the free space information in a tree instead of separate blobs, and does not block the transaction while writing out all info of all touched parts of the filesystem again. Here's of course the famous presentation with all kinds of info why: http://events.linuxfoundation.org/sites/events/files/slides/vault2016_0.pdf How: * umount the filesystem * btrfsck --clear-space-cache v1 /block/device * do a rw mount with the space_cache=v2 option added (only needed explicitly once) During that mount, it will generate the free space tree by reading the extent tree and writing the inverse of it. This will take some time, depending on how fast your storage can do random reads with a cold disk cache. For x86_64, using the free space cache v2 is fine since linux 4.5. Up to 4.9, there was a bug for big-endian systems. So, with your kernel it's absolutely fine. Why isn't this the default yet? It's because btrfs-progs don't have support to update the free space tree when doing offline modifications (like check --repair or btrfstune, which you hopefully don't need often anyway). So, until that's fully added, you need to do an `btrfsck --clear-space-cache v2`, then do the offline r/w action and then generate the tree again on next mount. Additional tips (forgot to ask for your /proc/mounts before): * Use the noatime mount option, so that only accessing files does not lead to changes in metadata, which lead to writes, which lead to cowing and writes in a new place, which lead to updates of the free space administration etc... 
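A quick way to check which of the options recommended above (noatime, space_cache=v2) a filesystem is actually mounted with is to parse /proc/mounts. A minimal sketch; the sample line below is fabricated for the example.

```python
# Parse a /proc/mounts style line into its filesystem type and mount
# option set, so the options in use can be checked programmatically.

def mount_options(procmounts_line):
    # /proc/mounts fields: device mountpoint fstype options dump pass
    device, mountpoint, fstype, options = procmounts_line.split()[:4]
    return mountpoint, fstype, set(options.split(","))

line = "/dev/sdb2 /export btrfs rw,noatime,space_cache=v2,subvol=/ 0 0"
mountpoint, fstype, opts = mount_options(line)
print(fstype, "noatime" in opts, "space_cache=v2" in opts)
# btrfs True True
```

On a live system you would loop over open('/proc/mounts') and filter on fstype == 'btrfs'.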
-- Hans van Kranenburg -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs-transacti hammering the system
On 12/01/2017 04:24 PM, Matt McKinnon wrote: > Thanks for this. Here's what I get: Ok, and which one is displaying most of the time? > [...] > > [] io_schedule+0x16/0x40 > [] get_request+0x23e/0x720 > [] blk_queue_bio+0xc1/0x3a0 > [] generic_make_request+0xf8/0x2a0 > [] submit_bio+0x75/0x150 > [] btrfs_map_bio+0xe5/0x2f0 [btrfs] > [] btree_submit_bio_hook+0x8c/0xe0 [btrfs] > [] submit_one_bio+0x63/0xa0 [btrfs] > [] flush_epd_write_bio+0x3b/0x50 [btrfs] > [] flush_write_bio+0xe/0x10 [btrfs] > [] btree_write_cache_pages+0x379/0x450 [btrfs] > [] btree_writepages+0x5d/0x70 [btrfs] > [] do_writepages+0x1c/0x70 > [] __filemap_fdatawrite_range+0xaa/0xe0 > [] filemap_fdatawrite_range+0x13/0x20 > [] btrfs_write_marked_extents+0xe9/0x110 [btrfs] > [] btrfs_write_and_wait_transaction.isra.22+0x3d/0x80 > [btrfs] > [] btrfs_commit_transaction+0x665/0x900 [btrfs] > > [...] > > [] io_schedule+0x16/0x40 > [] wait_on_page_bit+0xe8/0x120 > [] read_extent_buffer_pages+0x1cd/0x2e0 [btrfs] > [] btree_read_extent_buffer_pages+0x9f/0x100 [btrfs] > [] read_tree_block+0x32/0x50 [btrfs] > [] read_block_for_search.isra.32+0x120/0x2e0 [btrfs] > [] btrfs_next_old_leaf+0x215/0x400 [btrfs] > [] btrfs_next_leaf+0x10/0x20 [btrfs] > [] btrfs_lookup_csums_range+0x12e/0x410 [btrfs] > [] csum_exist_in_range.isra.49+0x2a/0x81 [btrfs] > [] run_delalloc_nocow+0x9b2/0xa10 [btrfs] > [] run_delalloc_range+0x68/0x340 [btrfs] > [] writepage_delalloc.isra.47+0xf0/0x140 [btrfs] > [] __extent_writepage+0xc7/0x290 [btrfs] > [] extent_write_cache_pages.constprop.53+0x2b5/0x450 > [btrfs] > [] extent_writepages+0x4d/0x70 [btrfs] > [] btrfs_writepages+0x28/0x30 [btrfs] > [] do_writepages+0x1c/0x70 > [] __filemap_fdatawrite_range+0xaa/0xe0 > [] filemap_fdatawrite_range+0x13/0x20 > [] btrfs_fdatawrite_range+0x20/0x50 [btrfs] > [] __btrfs_write_out_cache+0x3d9/0x420 [btrfs] > [] btrfs_write_out_cache+0x86/0x100 [btrfs] > [] btrfs_write_dirty_block_groups+0x261/0x390 [btrfs] > [] commit_cowonly_roots+0x1fb/0x290 [btrfs] 
> [] btrfs_commit_transaction+0x434/0x900 [btrfs] 1) The one right above, btrfs_write_out_cache, is the write-out of the free space cache v1. Do you see this for multiple seconds going on, and does it match the time when it's writing X MB/s to disk? 2) How big is this filesystem? What does your `btrfs fi df /mountpoint` say? 3) What kind of workload are you running? E.g. how can you describe it within a range from "big files which just sit there" to "small writes and deletes all over the place all the time"? 4) What kernel version is this? `uname -a` output? -- Hans van Kranenburg
Re: btrfs-transacti hammering the system
On 12/01/2017 03:25 PM, Matt McKinnon wrote: > > Is there any way to figure out what exactly btrfs-transacti is chugging > on? I have a few file systems that seem to get wedged for days on end > with this process pegged around 100%. I've stopped all snapshots, made > sure no quotas were enabled, turned on autodefrag in the mount options, > tried manual defragging, kernel upgrades, yet still this brings my > system to a crawl. > > Network I/O to the system seems very tiny. The only I/O I see to the > disk is btrfs-transacti writing a couple M/s. > > # time touch foo > > real 2m54.303s > user 0m0.000s > sys 0m0.002s > > # uname -r > 4.12.8-custom > > # btrfs --version > btrfs-progs v4.13.3 > > Yes, I know I'm a bit behind there... One of the simple things you can do is watch the stack traces of the kernel thread. watch 'cat /proc/<pid>/stack' where <pid> is the pid of the btrfs-transaction process. In there, you will see a pattern of recurring things, like, it's searching for free space, it's writing out free space cache, or other things. Correlate this with the disk write traffic and see if we get a step further. -- Hans van Kranenburg
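The watch approach above can also be automated: sample /proc/<pid>/stack repeatedly and tally which btrfs function keeps showing up, to get a rough picture of where the thread spends its time. A sketch (the sampling loop needs root; only the parsing helper is exercised below):

```python
import collections
import re
import time

def top_btrfs_frame(stack_text):
    """Return the first btrfs function name in a /proc/<pid>/stack dump
    (offsets stripped), or None if no btrfs frame is present."""
    for line in stack_text.splitlines():
        m = re.search(r'(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[btrfs\]', line)
        if m:
            return m.group(1)
    return None

def sample(pid, seconds=10, interval=0.1):
    """Poll /proc/<pid>/stack and count the topmost btrfs frame seen in
    each sample. Requires root."""
    counts = collections.Counter()
    end = time.monotonic() + seconds
    while time.monotonic() < end:
        with open('/proc/{}/stack'.format(pid)) as f:
            frame = top_btrfs_frame(f.read())
        if frame:
            counts[frame] += 1
        time.sleep(interval)
    return counts
```

A dominant btrfs_write_out_cache count would point at free space cache v1 write-out, as discussed above.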
Re: About 'key type for persistent [...] items'
Ok, just want to add one more thing :) On 11/28/2017 08:12 PM, David Sterba wrote: > On Tue, Nov 28, 2017 at 07:00:28PM +0100, Hans van Kranenburg wrote: >> On 11/28/2017 06:34 PM, David Sterba wrote: >>> On Fri, Nov 24, 2017 at 08:16:05PM +0100, Hans van Kranenburg wrote: >>>> Last week, when implementing the automatic classifier to dynamically >>>> create tree item data objects by key type in python-btrfs, I ran into >>>> the following commits in btrfs-progs: >>>> >>>> commit 8609c8bad68528f668d9ce564b868aa4828107a0 >>>> btrfs-progs: print-tree: factor out temporary_item dump >>>> and >>>> commit a4b65f00d53deb1b495728dd58253af44fcf70df >>>> btrfs-progs: print-tree: factor out persistent_item dump >>>> >>>> ...which are related to kernel... >>>> >>>> commit 50c2d5abe64c1726b48d292a2ab04f60e8238933 >>>> btrfs: introduce key type for persistent permanent items >>>> and >>>> commit 0bbbccb17fea86818e1a058faf5903aefd20b31a >>>> btrfs: introduce key type for persistent temporary items >>>> >>>> Afaics the goal is to overload types because there can be only 256 in >>>> total. However, I'm missing the design decisions behind the >>>> implementation of it. It's not in the commit messages, and it hasn't >>>> been on the mailing list. >>> >>> The reason is avoid wasting key types but still allow to store new types >>> of data to the btrees, if they were not part of the on-disk format. >>> >>> I'm not sure if this has been discussed or mentioned under some patches >>> or maybe unrelated patches. I do remember that I discussed that with >>> Chris in private on IRC and have the logs, dated 2015-09-02. >>> >>> At that time the balance item and dev stats item were introduced, maybe >>> also the qgroup status item type. This had me alarmed enough to >>> reconsider how the keys are allocated. >>> >>>> Before, there was an 1:1 mapping from key types to data structures. 
Now, >>>> with the new PERSISTENT_ITEM_KEY and TEMPORARY_ITEM_KEY, it seems items >>>> which use this type can be using any data structure they want, so it's >>>> some kind of YOLO_ITEM_KEY. >>> >>> In some sense it is, so it's key+objectid to determine the structure. >>> >>>> The print-tree code in progs 8609c8b and a4b65f0 seems incomplete. For >>>> example, for the PERSISTENT_ITEM_KEY, there's a switch (objectid) with >>>> case BTRFS_DEV_STATS_OBJECTID. >>>> >>>> However, BTRFS_DEV_STATS_OBJECTID is just the value 0. So, that means >>>> that if I want to have another tree where BTRFS_MOUTON_OBJECTID is also >>>> 0, and I'm storing a btrfs_kebab_item struct indexed at >>>> (BTRFS_MOUTON_OBJECTID, PERSISTENT_ITEM_KEY, 31337), then print_tree.c >>>> will try to parse the data by calling print_dev_stats? >>> >>> As answered by Qu, you can't use 0 for BTRFS_MOUTON_OBJECTID in that >>> case. >> >> (I'm just thinking out loud here, if you think I'm wasting your time >> just say.) >> >> Yes, so the objectid numbers have to be "registered" / "reserved" in the >> documentation, and they have to be unique over all trees. > > Right. > >> Maybe the information I was looking for is... in what cases should or >> shouldn't this be used? Because that limits the possible usage quite a >> bit. Or is it only for very very specific things. > > The keys are not free for use, they need to have a defined meaning and > are sort of part of the on-disk format. There must be some usecase and > reasoning why it's necessary to be done that way, and not in another. > Like xattr. > > I don't have an example now, but I could imagine some per-tree > information that can be tracked and updated at commit time. > >> E.g. if I wanted to (just a random idea) add per device statistics, and >> use this, I'd need to use the key also with objectid 1, 2, 3, etc... if >> I have multiple devices. 
That's already a no go if there's anyone in any >> other tree that is doing anything with any objectid in the range of >> valid device numbers. > > In that case there would be a new objectid PER_DEVICE_OBJECTID, with the > persistent key, and all the device ids can go to the offset field. The > objectid field should not be dynamic, by design. Ok, didn't think of that yet, clear. So... just a random idea... How cool would it be to have the b
Re: About 'key type for persistent [...] items'
On 11/28/2017 06:34 PM, David Sterba wrote: > On Fri, Nov 24, 2017 at 08:16:05PM +0100, Hans van Kranenburg wrote: >> Last week, when implementing the automatic classifier to dynamically >> create tree item data objects by key type in python-btrfs, I ran into >> the following commits in btrfs-progs: >> >> commit 8609c8bad68528f668d9ce564b868aa4828107a0 >> btrfs-progs: print-tree: factor out temporary_item dump >> and >> commit a4b65f00d53deb1b495728dd58253af44fcf70df >> btrfs-progs: print-tree: factor out persistent_item dump >> >> ...which are related to kernel... >> >> commit 50c2d5abe64c1726b48d292a2ab04f60e8238933 >> btrfs: introduce key type for persistent permanent items >> and >> commit 0bbbccb17fea86818e1a058faf5903aefd20b31a >> btrfs: introduce key type for persistent temporary items >> >> Afaics the goal is to overload types because there can be only 256 in >> total. However, I'm missing the design decisions behind the >> implementation of it. It's not in the commit messages, and it hasn't >> been on the mailing list. > > The reason is avoid wasting key types but still allow to store new types > of data to the btrees, if they were not part of the on-disk format. > > I'm not sure if this has been discussed or mentioned under some patches > or maybe unrelated patches. I do remember that I discussed that with > Chris in private on IRC and have the logs, dated 2015-09-02. > > At that time the balance item and dev stats item were introduced, maybe > also the qgroup status item type. This had me alarmed enough to > reconsider how the keys are allocated. > >> Before, there was an 1:1 mapping from key types to data structures. Now, >> with the new PERSISTENT_ITEM_KEY and TEMPORARY_ITEM_KEY, it seems items >> which use this type can be using any data structure they want, so it's >> some kind of YOLO_ITEM_KEY. > > In some sense it is, so it's key+objectid to determine the structure. > >> The print-tree code in progs 8609c8b and a4b65f0 seems incomplete. 
For >> example, for the PERSISTENT_ITEM_KEY, there's a switch (objectid) with >> case BTRFS_DEV_STATS_OBJECTID. >> >> However, BTRFS_DEV_STATS_OBJECTID is just the value 0. So, that means >> that if I want to have another tree where BTRFS_MOUTON_OBJECTID is also >> 0, and I'm storing a btrfs_kebab_item struct indexed at >> (BTRFS_MOUTON_OBJECTID, PERSISTENT_ITEM_KEY, 31337), then print_tree.c >> will try to parse the data by calling print_dev_stats? > > As answered by Qu, you can't use 0 for BTRFS_MOUTON_OBJECTID in that > case. (I'm just thinking out loud here, if you think I'm wasting your time just say.) Yes, so the objectid numbers have to be "registered" / "reserved" in the documentation, and they have to be unique over all trees. Maybe the information I was looking for is... in what cases should or shouldn't this be used? Because that limits the possible usage quite a bit. Or is it only for very very specific things. E.g. if I wanted to (just a random idea) add per device statistics, and use this, I'd need to use the key also with objectid 1, 2, 3, etc... if I have multiple devices. That's already a no go if there's anyone in any other tree that is doing anything with any objectid in the range of valid device numbers. >> What's the idea behind that? Instead of having the key type field define >> the struct and meaning, we now suddenly need the tuple (tree, objectid, >> type), and we need all three to determine what's inside the item data? >> So, the code in print_tree.c would also need to know about the tree >> number and pass that into the different functions. > > No, all key types, even the persistent/temporary are independent of the > tree type. So it's only type <-> structure mapping, besides > persistent/temporary types. Yeah, I wasn't explicit about that, I meant only for the persistent/temporary case yes. 
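So for ordinary key types one mapping suffices, and only the two overloaded types need the objectid as a second discriminator, independent of the tree. A sketch of that two-level lookup (the key type and objectid constants are from the on-disk format; the dict contents are just illustrative examples):

```python
PERSISTENT_ITEM_KEY = 249     # BTRFS_PERSISTENT_ITEM_KEY
TEMPORARY_ITEM_KEY = 248      # BTRFS_TEMPORARY_ITEM_KEY
DEV_STATS_OBJECTID = 0        # BTRFS_DEV_STATS_OBJECTID
BALANCE_OBJECTID = 2**64 - 4  # BTRFS_BALANCE_OBJECTID (-4ULL)

# For ordinary types, the type alone determines the structure...
STRUCT_BY_TYPE = {
    228: 'btrfs_chunk',  # CHUNK_ITEM_KEY
}
# ...but for the overloaded persistent/temporary types, the objectid
# is needed as well, regardless of which tree the item lives in.
STRUCT_BY_TYPE_AND_OBJECTID = {
    (PERSISTENT_ITEM_KEY, DEV_STATS_OBJECTID): 'btrfs_dev_stats_item',
    (TEMPORARY_ITEM_KEY, BALANCE_OBJECTID): 'btrfs_balance_item',
}

def struct_for(objectid, key_type):
    if key_type in (PERSISTENT_ITEM_KEY, TEMPORARY_ITEM_KEY):
        return STRUCT_BY_TYPE_AND_OBJECTID.get((key_type, objectid))
    return STRUCT_BY_TYPE.get(key_type)
```

A hypothetical BTRFS_MOUTON_OBJECTID reusing 0 with PERSISTENT_ITEM_KEY would collide in this table, which is exactly the registration problem discussed above.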
-- Hans van Kranenburg
[PATCH] btrfs-progs: dump_tree: remove superfluous _TREE
-# btrfs inspect-internal dump-tree -t fs /dev/block/device ERROR: unrecognized tree id: fs Without this fix I can't dump-tree fs, but I can dump-tree fs_tree and also fs_tree_tree, which is a bit silly. --- cmds-inspect-dump-tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/cmds-inspect-dump-tree.c b/cmds-inspect-dump-tree.c index 4e72c08a..df44bb63 100644 --- a/cmds-inspect-dump-tree.c +++ b/cmds-inspect-dump-tree.c @@ -143,7 +143,7 @@ static u64 treeid_from_string(const char *str, const char **end) { "CHUNK", BTRFS_CHUNK_TREE_OBJECTID }, { "DEVICE", BTRFS_DEV_TREE_OBJECTID }, { "DEV", BTRFS_DEV_TREE_OBJECTID }, - { "FS_TREE", BTRFS_FS_TREE_OBJECTID }, + { "FS", BTRFS_FS_TREE_OBJECTID }, { "CSUM", BTRFS_CSUM_TREE_OBJECTID }, { "CHECKSUM", BTRFS_CSUM_TREE_OBJECTID }, { "QUOTA", BTRFS_QUOTA_TREE_OBJECTID }, -- 2.11.0
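For context, a hypothetical Python rendering of what the alias lookup appears to do: each known name is matched as a case-insensitive prefix of the user's input, which is also why "fs_tree_tree" was accepted. The objectid values are the standard btrfs tree objectids; the table is abbreviated:

```python
# Illustrative sketch, not the actual C implementation.
ALIASES = [
    ('CHUNK', 3),     # BTRFS_CHUNK_TREE_OBJECTID
    ('DEVICE', 4),    # BTRFS_DEV_TREE_OBJECTID
    ('DEV', 4),
    ('FS', 5),        # BTRFS_FS_TREE_OBJECTID ('FS_TREE' before this patch)
    ('CSUM', 7),      # BTRFS_CSUM_TREE_OBJECTID
    ('CHECKSUM', 7),
    ('QUOTA', 8),     # BTRFS_QUOTA_TREE_OBJECTID
]

def treeid_from_string(s):
    # prefix match: 'fs' failed against 'FS_TREE', while 'fs_tree_tree'
    # happily matched it
    for name, objectid in ALIASES:
        if s.upper().startswith(name):
            return objectid
    raise ValueError('unrecognized tree id: {}'.format(s))
```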
About 'key type for persistent [...] items'
Hi, Last week, when implementing the automatic classifier to dynamically create tree item data objects by key type in python-btrfs, I ran into the following commits in btrfs-progs: commit 8609c8bad68528f668d9ce564b868aa4828107a0 btrfs-progs: print-tree: factor out temporary_item dump and commit a4b65f00d53deb1b495728dd58253af44fcf70df btrfs-progs: print-tree: factor out persistent_item dump ...which are related to kernel... commit 50c2d5abe64c1726b48d292a2ab04f60e8238933 btrfs: introduce key type for persistent permanent items and commit 0bbbccb17fea86818e1a058faf5903aefd20b31a btrfs: introduce key type for persistent temporary items Afaics the goal is to overload types because there can be only 256 in total. However, I'm missing the design decisions behind the implementation of it. It's not in the commit messages, and it hasn't been on the mailing list. Before, there was an 1:1 mapping from key types to data structures. Now, with the new PERSISTENT_ITEM_KEY and TEMPORARY_ITEM_KEY, it seems items which use this type can be using any data structure they want, so it's some kind of YOLO_ITEM_KEY. The print-tree code in progs 8609c8b and a4b65f0 seems incomplete. For example, for the PERSISTENT_ITEM_KEY, there's a switch (objectid) with case BTRFS_DEV_STATS_OBJECTID. However, BTRFS_DEV_STATS_OBJECTID is just the value 0. So, that means that if I want to have another tree where BTRFS_MOUTON_OBJECTID is also 0, and I'm storing a btrfs_kebab_item struct indexed at (BTRFS_MOUTON_OBJECTID, PERSISTENT_ITEM_KEY, 31337), then print_tree.c will try to parse the data by calling print_dev_stats? What's the idea behind that? Instead of having the key type field define the struct and meaning, we now suddenly need the tuple (tree, objectid, type), and we need all three to determine what's inside the item data? So, the code in print_tree.c would also need to know about the tree number and pass that into the different functions. Am I missing something, or is my observation correct? 
Thanks, -- Hans van Kranenburg
python-btrfs release v9
I just tagged v9 of the python btrfs library. https://github.com/knorrie/python-btrfs Q: What is python-btrfs? A: python-btrfs is a python3 implementation of the data structures for btrfs metadata, and an implementation of the kernel API (via ioctl calls) that btrfs provides to userland. Q: Ok, whatever, dude, but what would I need that for? A: You probably won't need this. Q: I'm a btrfs developer, what is this? A: It provides quick hackable ways to inspect a running filesystem. You can look up things in the current kernel metadata memory, play around with ioctls, and quickly cobble up scripts for that. Here's the summary of changes: python-btrfs v9, Nov 23 2017 * IOCTLs: IOC_SET_RECEIVED_SUBVOL, IOC_LOGICAL_INO_V2 * Add crc32c checksum function for data, thanks Nikolay Borisov * Recognize zstd compression type * Add a tree object pretty printer to dump data for any tree item * Examples added: - Show default subvolume id - Lookup a block checksum in the checksum tree - Determine bad free space fragmentation score for block groups - Set the received_uuid e.a. of a subvolume - Dump an entire metadata tree using the pretty printer * Fixes: - crc32c: add implicit seed truncation to 32 bits value * Small fixes and cosmetic changes Note: the IOC_SET_RECEIVED_SUBVOL is added 'because we can'. It's not added because it's a very good idea to mess around with it, except for educational purposes in a test environment. As usual, detailed explanations of all changes are in the git commit messages. Feedback from Nikolay Borisov pushed me to have a look at how the checksum storage is done. My favorite one for this release is the tree object pretty printer, which allows us to dump a single tree object displaying all its attributes, but also allows us to stream a search query into it to output a full dump of a metadata tree. See the examples/dump_tree.py about this! Tomorrow I'll update pypi and will prepare debian packages for unstable and stretch-backports.
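The crc32c seed-truncation fix in the changelog is easy to illustrate with a plain bit-by-bit CRC-32C (Castagnoli) sketch. This is not the library's actual implementation, just the same polynomial, showing why truncating the seed to 32 bits makes oversized seeds harmless:

```python
def crc32c(data, seed=0):
    # implicit truncation of the seed to a 32 bits value, as in the v9 fix
    crc = (seed & 0xffffffff) ^ 0xffffffff
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # reflected CRC-32C (Castagnoli) polynomial
            crc = (crc >> 1) ^ 0x82f63b78 if crc & 1 else crc >> 1
    return crc ^ 0xffffffff
```

Passing a previous result as the seed continues the checksum over more data, which is how checksums over concatenated buffers are chained.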
-- Have fun, Hans van Kranenburg
Re: 4.14 balance: kernel BUG at /home/kernel/COD/linux/fs/btrfs/ctree.c:1856!
On 11/18/2017 12:48 PM, Hans van Kranenburg wrote: > > So, who wants to help? > > 1. Find a test system that you can crash. > 2. Create a test filesystem with some data. > 3. Run with 4.14? (makes the most sense I think) > 4. Continuously feed the data to balance and send everything to /dev/null > 5. Collect stack traces and borken filesystem images. I moved it to a ramdisk: -# modprobe brd rd_nr=1 rd_size=4194304 -# mkfs.btrfs -m single -d single /dev/ram0 -# mount -o noatime /dev/ram0 /btrfs -# cd /btrfs -# btrfs sub create moo -# rsync -av /usr/ moo/ -# btrfs sub snap -r moo/ moo-ro Now remove part of the files, yolo style -# rm $(find moo -type f | shuf | head -n 5000) Put them back, so we have some differences for incremental send -# rsync -av /usr/ moo/ -# btrfs sub snap -r moo/ moo-ro2 Now again: -# while true; do btrfs balance start --full-balance /btrfs; done and -# while true; do btrfs send --no-data /btrfs/moo-ro/ | wc -c; btrfs send --no-data -p /btrfs/moo-ro/ /btrfs/moo-ro2/ |wc -c; date; done Now I got rid of the disk traffic, and kernel cpu time goes to >300%. The error seen before is easily triggered. It happens both on normal and on incremental send. It happens both when using --no-data and when not using that option. send aborts with "ERROR: send ioctl failed with -5: Input/output error", and dmesg shows: [...] [17094.578876] BTRFS error (device ram0): did not find backref in send_root. inode=28151, offset=0, disk_byte=8769369178112 found extent=8769369178112 [17328.368458] BTRFS error (device ram0): did not find backref in send_root. inode=23264, offset=0, disk_byte=8902861979648 found extent=8902861979648 [17352.779099] BTRFS error (device ram0): did not find backref in send_root. inode=17392, offset=0, disk_byte=8917230010368 found extent=8917230010368 [18012.009357] BTRFS error (device ram0): did not find backref in send_root.
inode=29245, offset=0, disk_byte=9295300538368 found extent=9295300538368 [18193.218649] BTRFS error (device ram0): did not find backref in send_root. inode=16437, offset=0, disk_byte=9400309366784 found extent=9400309366784 [18604.697898] BTRFS error (device ram0): did not find backref in send_root. inode=24508, offset=0, disk_byte=9635165790208 found extent=9635165790208 [18621.053722] BTRFS error (device ram0): did not find backref in send_root. inode=10039, offset=0, disk_byte=9644150468608 found extent=9644150468608 [19039.051399] BTRFS error (device ram0): did not find backref in send_root. inode=29411, offset=0, disk_byte=9883807432704 found extent=9883807432704 [19373.297701] BTRFS error (device ram0): did not find backref in send_root. inode=7946, offset=0, disk_byte=10074868215808 found extent=10074868215808 [19573.432255] BTRFS error (device ram0): did not find backref in send_root. inode=26743, offset=0, disk_byte=10190374899712 found extent=10190374899712 [19682.305240] BTRFS error (device ram0): did not find backref in send_root. inode=24823, offset=0, disk_byte=10252750929920 found extent=10252750929920 [20012.420346] BTRFS error (device ram0): did not find backref in send_root. inode=25763, offset=0, disk_byte=10441684029440 found extent=10441684029440 [20430.100411] BTRFS error (device ram0): did not find backref in send_root. inode=14572, offset=0, disk_byte=10680836050944 found extent=10680836050944 [21328.821766] BTRFS error (device ram0): did not find backref in send_root. inode=11756, offset=0, disk_byte=11195322470400 found extent=11195322470400 [...] After a few hours I have a long list of those, but that's all so far. No other big explosions. So, if anyone has an idea of what to try next? Maybe it needs more than 1 block group each for data and metadata? Maybe speeding it up (with a small amount of data and the ramdisk) does not increase the chance of triggering something, but just decreases it? 
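To check whether these reproduced errors keep hitting the same inodes or extents, the dmesg lines above can be parsed and tallied. A quick sketch:

```python
import re
from collections import Counter

BACKREF_RE = re.compile(
    r'did not find backref in send_root\. '
    r'inode=(\d+), offset=(\d+), disk_byte=(\d+) found extent=(\d+)')

def parse_backref_errors(dmesg_text):
    """Yield (inode, offset, disk_byte, extent) for every 'did not find
    backref' error in a dmesg capture."""
    for m in BACKREF_RE.finditer(dmesg_text):
        yield tuple(int(g) for g in m.groups())

def tally_inodes(dmesg_text):
    """Count how often each inode shows up in the errors."""
    return Counter(inode for inode, _, _, _ in parse_backref_errors(dmesg_text))
```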
Welcome to the world of trying to reproduce errors... :D -- Hans van Kranenburg
Re: 4.14 balance: kernel BUG at /home/kernel/COD/linux/fs/btrfs/ctree.c:1856!
On 11/18/2017 12:48 PM, Hans van Kranenburg wrote: > > So, who wants to help? > > 1. Find a test system that you can crash. > 2. Create a test filesystem with some data. > 3. Run with 4.14? (makes the most sense I think) > 4. Continuously feed the data to balance and send everything to /dev/null > 5. Collect stack traces and borken filesystem images. Ok, what I just did: -# mkfs.btrfs --nodiscard /dev/xvdb -# mount -o noatime /dev/xvdb /btrfs -# cd btrfs/ -# btrfs sub create usr -# rsync -av /usr/ usr/ -# btrfs sub snap -r usr/ usro And then, both at the same time: -# while true; do btrfs send usro/ > /dev/null; date; echo 3 > /proc/sys/vm/drop_caches; done and -# while true; do btrfs balance start --full-balance /btrfs; done After a few minutes we have the first error: [ 1605.484627] BTRFS error (device xvdb): did not find backref in send_root. inode=4260, offset=0, disk_byte=130793844736 found extent=130793844736 Sat Nov 18 13:25:28 CET 2017 At subvol usro/ ERROR: send ioctl failed with -5: Input/output error No big stack trace this time. -- Hans van Kranenburg
Re: 4.14 balance: kernel BUG at /home/kernel/COD/linux/fs/btrfs/ctree.c:1856!
On 11/18/2017 01:49 AM, Tomasz Chmielewski wrote: > I'm getting the following BUG when running balance on one of my systems: > > > [ 3458.698704] BTRFS info (device sdb3): relocating block group > 306045779968 flags data|raid1 > [ 3466.892933] BTRFS info (device sdb3): found 2405 extents > [ 3495.408630] BTRFS info (device sdb3): found 2405 extents > [ 3498.161144] [ cut here ] > [ 3498.161150] kernel BUG at /home/kernel/COD/linux/fs/btrfs/ctree.c:1856! > [ 3498.161264] invalid opcode: [#1] SMP > [ 3498.161363] Modules linked in: nf_log_ipv6 nf_log_ipv4 nf_log_common > xt_LOG xt_multiport xt_conntrack xt_nat binfmt_misc veth ip6table_filter > xt_CHECKSUM iptable_mangle xt_tcpudp ip6t_MASQUERADE > nf_nat_masquerade_ipv6 ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 > nf_nat_ipv6 ip6_tables ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_comment > iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat > nf_conntrack iptable_filter ip_tables x_tables bridge stp llc intel_rapl > x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass > crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel > aes_x86_64 crypto_simd glue_helper cryptd intel_cstate hci_uart > intel_rapl_perf btbcm input_leds serdev serio_raw btqca btintel > bluetooth intel_pch_thermal intel_lpss_acpi intel_lpss mac_hid acpi_pad > [ 3498.162060] ecdh_generic acpi_als kfifo_buf industrialio autofs4 > btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy > async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath > linear raid1 e1000e psmouse ptp ahci pps_core libahci wmi > pinctrl_sunrisepoint i2c_hid video pinctrl_intel hid > [ 3498.162386] CPU: 7 PID: 29041 Comm: btrfs Not tainted > 4.14.0-041400-generic #201711122031 > [ 3498.162545] Hardware name: FUJITSU /D3401-H2, BIOS V5.0.0.12 R1.5.0 > for D3401-H2x 02/27/2017 > [ 3498.162723] task: 8d7858e82f00 task.stack: b4ee47d5c000 > [ 3498.162890] RIP: 0010:read_node_slot+0xd7/0xe0 [btrfs] > [ 3498.163027] RSP: 
0018:b4ee47d5fb88 EFLAGS: 00010246 > [ 3498.163156] RAX: 8d78c8bb7000 RBX: 8d8124abd380 RCX: > 0001 > [ 3498.163290] RDX: 0048 RSI: 8d7ae1fef6f8 RDI: > 8d8124aa > [ 3498.163422] RBP: b4ee47d5fba8 R08: 0001 R09: > 8d8124abd384 > [ 3498.163555] R10: 0001 R11: 00114000 R12: > 0002 > [ 3498.163689] R13: b4ee47d5fc66 R14: b4ee47d5fc50 R15: > > [ 3498.163825] FS: 7fa4c9a998c0() GS:8d816e5c() > knlGS: > [ 3498.163990] CS: 0010 DS: ES: CR0: 80050033 > [ 3498.164120] CR2: 56410155a028 CR3: 0009c194c002 CR4: > 003606e0 > [ 3498.164255] DR0: DR1: DR2: > > [ 3498.164390] DR3: DR6: fffe0ff0 DR7: > 0400 > [ 3498.164523] Call Trace: > [ 3498.164694] tree_advance+0x16e/0x1d0 [btrfs] > [ 3498.164874] btrfs_compare_trees+0x2da/0x6a0 [btrfs] > [ 3498.165078] ? process_extent+0x1580/0x1580 [btrfs] > [ 3498.165264] btrfs_ioctl_send+0xe94/0x1120 [btrfs] It's using send + balance at the same time. There's something that makes btrfs explode when you do that. It's not new in 4.14, I have seen it in 4.7 and 4.9 also, various different explosions in kernel log. Since that happened, I made sure I never did those two things at the same time. > [ 3498.165450] btrfs_ioctl+0x93c/0x1f00 [btrfs] > [ 3498.165587] ? enqueue_task_fair+0xa8/0x6c0 > [ 3498.165724] do_vfs_ioctl+0xa5/0x600 > [ 3498.165854] ? do_vfs_ioctl+0xa5/0x600 > [ 3498.165979] ? 
_do_fork+0x144/0x3a0 > [ 3498.166103] SyS_ioctl+0x79/0x90 > [ 3498.166234] entry_SYSCALL_64_fastpath+0x1e/0xa9 > [ 3498.166368] RIP: 0033:0x7fa4c8b17f07 > [ 3498.166488] RSP: 002b:7ffd33644e38 EFLAGS: 0202 ORIG_RAX: > 0010 > [ 3498.166653] RAX: ffda RBX: 7fa4c8a1a700 RCX: > 7fa4c8b17f07 > [ 3498.166787] RDX: 7ffd33644f30 RSI: 40489426 RDI: > 0004 > [ 3498.166921] RBP: 7ffd33644dc0 R08: R09: > 7fa4c8a1a700 > [ 3498.167055] R10: 7fa4c8a1a9d0 R11: 0202 R12: > > [ 3498.167190] R13: 7ffd33644dbf R14: 7fa4c8a1a9c0 R15: > 0129f020 > [ 3498.167326] Code: 48 c7 c3 fb ff ff ff e8 f8 5c 05 00 48 89 d8 5b 41 > 5c 41 5d 41 5e 5d c3 48 c7 c3 fe ff ff ff 48 89 d8 5b 41 5c 41 5d 41 5e > 5d c3 <0f> 0b 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 > [ 3498.167690] RIP: read_node_slot+0xd7/0xe0 [btrfs] RSP: b4ee47d5fb88 > [ 3498.167892] ---[ end trace 6a751a3020dd3086 ]--- > [ 3499.572729] BTRFS info
Re: Theoretical Question about commit=n
On 11/13/2017 01:41 AM, Qu Wenruo wrote:
>
> On 2017年11月13日 06:01, Hans van Kranenburg wrote:
>> On 11/12/2017 09:58 PM, Robert White wrote:
>>> Is the commit interval monotonic, or is it seconds after sync?
>>>
>>> What I mean is that if I manually call sync(2) does the commit timer
>>> reset? I'm thinking it does not, but I can imagine a workload where
>>> it ideally would.
>>
>> The magic happens inside the transaction kernel thread:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/disk-io.c?h=v4.14#n1925
>>
>> You can see the delay being computed:
>>     delay = HZ * fs_info->commit_interval;
>>
>> Almost at the end of the function, you see:
>>     schedule_timeout(delay)
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/time/timer.c?h=v4.14#n1676
>>
>> This schedule_timeout function sets a timer and then the thread goes
>> to sleep. If nothing happens, the kernel wakes the thread up after the
>> timer expires (possibly later, but never earlier) and then it redoes
>> the loop.
>>
>> If something else wakes up the transaction thread, the timer is
>> discarded if it has not expired yet.
>
> So far so good.
>
>> So it works like you would want.
>
> Not exactly.

Ah, interesting.

> Sync or commit_transaction won't wake up transaction_kthread.
>
> transaction_kthread will mostly be woken by a transaction error, a
> remount, or certain cases of btrfs_end_transaction.
>
> So a manual sync will not (at least not always) interrupt the commit
> interval.

The fun thing is: when I just do a sync, I see that the time it takes
for the next generation bump to happen is reset (while doing something
simple like touch x in a loop in another terminal).

> And even more: transaction_kthread will only commit the transaction,
> which means it only ensures that metadata is consistent.
>
> It won't ensure that a buffered write reaches disk if its extent is
> not allocated yet (delalloc).

Hm, I have seen things like that in BTRFS_IOC_SYNC...

Actually, I first responded to the timer reset question, because that
one was easy to answer. I don't know if I want to descend further down
the path into (f)sync. I heard it can get really messy down there. :]

> Thanks,
> Qu
>
>> You can test this yourself by looking at the "generation" number of
>> your filesystem. It's in the output of btrfs inspect dump-super.
>>
>> This is the little test filesystem I just used:
>>
>> -# btrfs inspect dump-super /dev/dorothy/mekker | grep ^generation
>> generation          35
>>
>> If you print the number in a loop, like every second, you can see it
>> going up after a transaction happens. Now play around with other
>> things and see when it changes.
>>
>>> (Again, this is purely theoretical, I have no such workload as I am
>>> about to describe.)
>>>
>>> So suppose I have some sort of system, like a database, that I know
>>> will do scattered writes and extends through some files and then
>>> call some variant of sync(2). And I know that those sync() calls
>>> will happen every forty to sixty seconds, because of reasons. It
>>> would be "neat" to be able to set commit=n to some high value, like
>>> 90, and then "normally" the sync() behaviour would follow the
>>> application instead of the larger commit interval.
>>>
>>> The value would be that the file system would tend _not_ to go into
>>> sync while the application was still skittering about in the various
>>> files.
>>>
>>> Of course any other applications could call sync from their own
>>> contexts for their own reasons. And there's an implicit fsync() on
>>> just about any close() (at least if everything is doing its business
>>> "correctly").
>>>
>>> It may be a strange idea, but I can think of some near-realtime
>>> applications that might be able to leverage a modicum of control
>>> over the sync event. There is no API, and no strong reason to desire
>>> one, for controlling the commit via (low-privilege) applications.
>>>
>>> But if the plumbing exists, then having a mode where sync() or
>>> fsync() (which I think causes a general sync because of the journal)
>>> resets the commit timer could be really interesting.
>>>
>>> With any kind of delayed block choice/mapping it could actually
>>> reduce the entropy of the individual files for repeated random small
>>> writes. The application would have to be reasonably aware, of
>>> course.
>>>
>>> Since something is causing a sync(), the commit=N guarantee is still
>>> being met for the whole system for any N, but applications could
>>> tend to avoid mid-write commits by planning their sync()s.
>>>
>>> Just a thought.

-- 
Hans van Kranenburg
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
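[Editorial note] The schedule_timeout() behaviour discussed above (sleep
for the commit interval, unless something wakes the thread first, in
which case the remaining sleep is discarded) can be sketched in
userspace. The following is a toy Python model for illustration only;
the class name and structure are invented and this is not the kernel
code:

```python
import threading
import time

class TransactionKthreadModel:
    """Toy model of transaction_kthread's sleep loop: sleep for
    commit_interval seconds, unless a wakeup arrives first, in which
    case the remaining sleep time is discarded (analogous to a wakeup
    cancelling the timer set by schedule_timeout())."""

    def __init__(self, commit_interval):
        self.commit_interval = commit_interval
        self._wake = threading.Event()

    def sleep_once(self):
        # Event.wait returns True if woken early, False on timeout.
        woken = self._wake.wait(timeout=self.commit_interval)
        self._wake.clear()
        return woken

    def wake_up(self):
        self._wake.set()

kthread = TransactionKthreadModel(commit_interval=0.5)

# Case 1: nothing wakes the thread, so the full interval passes.
start = time.monotonic()
assert kthread.sleep_once() is False
assert time.monotonic() - start >= 0.5

# Case 2: a wakeup after 0.05s discards the rest of the interval.
threading.Timer(0.05, kthread.wake_up).start()
start = time.monotonic()
assert kthread.sleep_once() is True
assert time.monotonic() - start < 0.5
```

As the thread notes, in real btrfs only a few events (transaction
error, remount, some btrfs_end_transaction paths) deliver that wakeup;
a plain sync does not.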
Re: Theoretical Question about commit=n
On 11/12/2017 09:58 PM, Robert White wrote:
> Is the commit interval monotonic, or is it seconds after sync?
>
> What I mean is that if I manually call sync(2) does the commit timer
> reset? I'm thinking it does not, but I can imagine a workload where it
> ideally would.

The magic happens inside the transaction kernel thread:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/disk-io.c?h=v4.14#n1925

You can see the delay being computed:
    delay = HZ * fs_info->commit_interval;

Almost at the end of the function, you see:
    schedule_timeout(delay)

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/time/timer.c?h=v4.14#n1676

This schedule_timeout function sets a timer and then the thread goes to
sleep. If nothing happens, the kernel wakes the thread up after the
timer expires (possibly later, but never earlier) and then it redoes
the loop.

If something else wakes up the transaction thread, the timer is
discarded if it has not expired yet.

So it works like you would want.

You can test this yourself by looking at the "generation" number of
your filesystem. It's in the output of btrfs inspect dump-super.

This is the little test filesystem I just used:

-# btrfs inspect dump-super /dev/dorothy/mekker | grep ^generation
generation          35

If you print the number in a loop, like every second, you can see it
going up after a transaction happens. Now play around with other things
and see when it changes.

> (Again, this is purely theoretical, I have no such workload as I am
> about to describe.)
>
> So suppose I have some sort of system, like a database, that I know
> will do scattered writes and extends through some files and then call
> some variant of sync(2). And I know that those sync() calls will
> happen every forty to sixty seconds, because of reasons. It would be
> "neat" to be able to set commit=n to some high value, like 90, and
> then "normally" the sync() behaviour would follow the application
> instead of the larger commit interval.
>
> The value would be that the file system would tend _not_ to go into
> sync while the application was still skittering about in the various
> files.
>
> Of course any other applications could call sync from their own
> contexts for their own reasons. And there's an implicit fsync() on
> just about any close() (at least if everything is doing its business
> "correctly").
>
> It may be a strange idea, but I can think of some near-realtime
> applications that might be able to leverage a modicum of control over
> the sync event. There is no API, and no strong reason to desire one,
> for controlling the commit via (low-privilege) applications.
>
> But if the plumbing exists, then having a mode where sync() or fsync()
> (which I think causes a general sync because of the journal) resets
> the commit timer could be really interesting.
>
> With any kind of delayed block choice/mapping it could actually reduce
> the entropy of the individual files for repeated random small writes.
> The application would have to be reasonably aware, of course.
>
> Since something is causing a sync(), the commit=N guarantee is still
> being met for the whole system for any N, but applications could tend
> to avoid mid-write commits by planning their sync()s.
>
> Just a thought.

-- 
Hans van Kranenburg
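[Editorial note] Robert's scenario — an application that syncs roughly
every 50 seconds on a filesystem mounted with commit=90 — can be made
concrete with a small discrete simulation. This Python sketch is purely
illustrative (the kernel_commits helper is invented here), and as noted
later in the thread, a plain sync does not actually reset the btrfs
commit timer today; the "reset on sync" branch models only the
proposal:

```python
def kernel_commits(sync_times, commit_interval, horizon, reset_on_sync):
    """Return the times at which the timer-driven commit fires within
    [0, horizon], given the times at which the application syncs.

    With reset_on_sync=False the timer is monotonic and fires every
    commit_interval seconds regardless of syncs.  With
    reset_on_sync=True each sync pushes the next deadline out to
    sync_time + commit_interval (Robert's proposed behaviour)."""
    fires = []
    deadline = commit_interval
    events = sorted(sync_times)
    i = 0
    while deadline <= horizon:
        if i < len(events) and events[i] < deadline:
            # A sync happens before the timer would fire.
            if reset_on_sync:
                deadline = events[i] + commit_interval
            i += 1
        else:
            fires.append(deadline)
            deadline += commit_interval
    return fires

syncs = list(range(50, 500, 50))  # the app syncs every 50 seconds

# Monotonic timer: fires every 90s, often mid-write for this app.
assert kernel_commits(syncs, 90, 450, False) == [90, 180, 270, 360, 450]

# Reset-on-sync: every 50s sync pushes the 90s deadline out, so the
# timer-driven commit never fires on its own within the horizon.
assert kernel_commits(syncs, 90, 450, True) == []
```

Under the proposed semantics the commit=N guarantee still holds (data
is never older than N seconds past the last sync), which is the point
Robert makes at the end.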
Re: [PATCH v4] btrfs: Remove received_uuid during received snapshot ro->rw switch
On 10/05/2017 10:22 AM, Nikolay Borisov wrote:
> Currently when a read-only snapshot is received and subsequently its
> ro property is set to false, i.e. switched to rw mode, the
> received_uuid of that subvol remains intact. However, once the
> received volume is switched to RW mode we cannot guarantee that it
> contains the same data, so it makes sense to remove the received uuid.
> The presence of the received_uuid can also cause problems when the
> volume is being sent.
>
> Signed-off-by: Nikolay Borisov <nbori...@suse.com>
> Suggested-by: David Sterba <dste...@suse.cz>
> ---
>
> v4:
>  * Put the uuid tree removal code after the lightweight 'send in
>    progress' check and don't move btrfs_start_transaction as
>    suggested by David
>
> v3:
>  * Rework the patch considering latest feedback from David Sterba,
>    i.e. explicitly use btrfs_end_transaction
>
>  fs/btrfs/ioctl.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
>
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index ee4ee7cbba72..9328c091854b 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -1775,6 +1775,7 @@ static noinline int btrfs_ioctl_subvol_setflags(struct file *file,
>  	struct btrfs_trans_handle *trans;
>  	u64 root_flags;
>  	u64 flags;
> +	bool clear_received_uuid = false;
>  	int ret = 0;
>  
>  	if (!inode_owner_or_capable(inode))
> @@ -1824,6 +1825,7 @@ static noinline int btrfs_ioctl_subvol_setflags(struct file *file,
>  		btrfs_set_root_flags(&root->root_item,
>  				     root_flags & ~BTRFS_ROOT_SUBVOL_RDONLY);
>  		spin_unlock(&root->root_item_lock);
> +		clear_received_uuid = true;
>  	} else {
>  		spin_unlock(&root->root_item_lock);
>  		btrfs_warn(fs_info,
> @@ -1840,6 +1842,24 @@ static noinline int btrfs_ioctl_subvol_setflags(struct file *file,
>  		goto out_reset;
>  	}
>  
> +	if (clear_received_uuid) {
> +		if (!btrfs_is_empty_uuid(root->root_item.received_uuid)) {
> +			ret = btrfs_uuid_tree_rem(trans, fs_info,
> +					root->root_item.received_uuid,
> +					BTRFS_UUID_KEY_RECEIVED_SUBVOL,
> +					root->root_key.objectid);
> +
> +			if (ret && ret != -ENOENT) {
> +				btrfs_abort_transaction(trans, ret);
> +				btrfs_end_transaction(trans);
> +				goto out_reset;
> +			}
> +
> +			memset(root->root_item.received_uuid, 0,
> +			       BTRFS_UUID_SIZE);

Shouldn't we also wipe the other related fields here, like stime,
rtime, stransid and rtransid?

> +		}
> +	}
> +
>  	ret = btrfs_update_root(trans, fs_info->tree_root,
>  				&root->root_key, &root->root_item);
>  	if (ret < 0) {

-- 
Hans van Kranenburg