On 2018/9/26 下午10:35, David Sterba wrote: > On Tue, Sep 11, 2018 at 01:38:15PM +0800, Qu Wenruo wrote: >> Before this patch, for quota enabled balance, btrfs needs to mark the >> whole subtree dirty for quota. >> >> E.g. >> OO = Old tree blocks (from file tree) >> NN = New tree blocks (from reloc tree) >> >> File tree (src) Reloc tree (dst) >> OO (a) NN (a) >> / \ / \ >> (b) OO OO (c) (b) NN NN (c) >> / \ / \ / \ / \ >> OO OO OO OO (d) OO OO OO NN (d) >> >> For old balance + quota case, quota will mark the whole src and dst tree >> dirty, including all the 3 old tree blocks in reloc tree. >> >> It's doable for small file tree or new tree blocks are all >> located at lower level. >> >> But for large file tree or new tree blocks are all located at higher >> level, this will lead to mark the whole tree dirty, and be unbelievably >> slow. >> >> This patch will change how we handle such balance with quota enabled >> case. >> >> Now we will search from (b) and (c) for any new tree blocks whose generation >> is equal to @last_snapshot, and only mark them dirty. >> >> In above case, we only need to trace tree blocks NN(b), NN(c) and NN(d). >> (NN(a) will be traced when CoW happens for nodeptr modification). >> And also for tree blocks OO(b), OO(c), OO(d). (OO(a) will be traced when >> CoW happens for nodeptr modification) >> >> For above case, we could skip 3 tree blocks, but for larger tree, we can >> skip tons of unmodified tree blocks, and hugely speed up balance. >> >> This patch will introduce a new function, >> btrfs_qgroup_trace_subtree_swap(), which will do the following main >> work: >> >> 1) Read out real root eb >> And setup basic dst_path for later calls >> 2) Call qgroup_trace_new_subtree_blocks() >> To trace all new tree blocks in reloc tree and their counter >> parts in file tree. >> >> Signed-off-by: Qu Wenruo <w...@suse.com> >> --- >> fs/btrfs/qgroup.c | 94 +++++++++++++++++++++++++++++++++++++++++++ >> fs/btrfs/qgroup.h | 10 +++++ >> fs/btrfs/relocation.c | 11 ++--- >> 3 files changed, 107 insertions(+), 8 deletions(-) >> >> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c >> index 0702953d70a7..a94027b2620e 100644 >> --- a/fs/btrfs/qgroup.c >> +++ b/fs/btrfs/qgroup.c >> @@ -1953,6 +1953,100 @@ static int qgroup_trace_new_subtree_blocks(struct >> btrfs_trans_handle* trans, >> return ret; >> } >> >> +/* >> + * For balance subtree swap. >> + * >> + * Will go down the tree block pointed by @dst_eb (pointed by @dst_parent >> and >> + * @dst_slot), and find any tree blocks whose generation is at >> @last_snapshot, >> + * and then go down @src_eb (pointed by @src_parent and @src_slot) to find >> + * the conterpart of the tree block, then mark both tree blocks as qgroup >> dirty, >> + * and skip all tree blocks whose generation is smaller than last_snapshot. >> + * >> + * This would skip tons of tree blocks of original >> btrfs_qgroup_trace_subtree(), >> + * which is the culprit of super slow balance if the file tree is large. >> + * >> + * @src_parent, @src_slot: pointer to src (file tree) eb. >> + * @dst_parent, @dst_slot: pointer to dst (tree reloc tree) eb. >> + */ >> +int btrfs_qgroup_trace_subtree_swap(struct btrfs_trans_handle *trans, >> + struct extent_buffer *src_parent, int src_slot, >> + struct extent_buffer *dst_parent, int dst_slot, >> + u64 last_snapshot) >> +{ >> + struct btrfs_fs_info *fs_info = trans->fs_info; >> + struct btrfs_path *dst_path = NULL; >> + struct btrfs_key first_key; >> + struct extent_buffer *src_eb = NULL; >> + struct extent_buffer *dst_eb = NULL; >> + u64 child_gen; >> + u64 child_bytenr; >> + int level; >> + int ret; >> + >> + if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags)) >> + return 0; >> + >> + /* Wrong parameter order */ >> + BUG_ON(btrfs_node_ptr_generation(src_parent, src_slot) > >> + btrfs_node_ptr_generation(dst_parent, dst_slot)); > > If this is to catch errors development-time errors, then an ASSERT is > probably better, but it looks like this needs to be an ordinary runtime > sanity check too.
For runtime check, it's already done before entering real relocation code. Relocation will check tree generation before really doing the swap. So it is a development time error. > >> + >> + /* Read out real @src_eb, pointed by @src_parent and @src_slot */ >> + child_bytenr = btrfs_node_blockptr(src_parent, src_slot); >> + child_gen = btrfs_node_ptr_generation(src_parent, src_slot); >> + btrfs_node_key_to_cpu(src_parent, &first_key, src_slot); >> + >> + src_eb = read_tree_block(fs_info, child_bytenr, child_gen, >> + btrfs_header_level(src_parent) - 1, &first_key); >> + if (IS_ERR(src_eb)) { >> + ret = PTR_ERR(src_eb); >> + goto out; >> + } >> + >> + /* Read out real @dst_eb, pointed by @src_parent and @src_slot */ >> + child_bytenr = btrfs_node_blockptr(dst_parent, dst_slot); >> + child_gen = btrfs_node_ptr_generation(dst_parent, dst_slot); >> + btrfs_node_key_to_cpu(dst_parent, &first_key, dst_slot); >> + >> + dst_eb = read_tree_block(fs_info, child_bytenr, child_gen, >> + btrfs_header_level(dst_parent) - 1, &first_key); >> + if (IS_ERR(dst_eb)) { >> + ret = PTR_ERR(dst_eb); >> + goto out; >> + } >> + >> + if (!extent_buffer_uptodate(src_eb) || !extent_buffer_uptodate(dst_eb)) >> { >> + ret = -EINVAL; >> + goto out; >> + } >> + >> + level = btrfs_header_level(dst_eb); >> + dst_path = btrfs_alloc_path(); >> + if (!dst_path) { >> + ret = -ENOMEM; >> + goto out; >> + } >> + /* For dst_path */ >> + extent_buffer_get(dst_eb); >> + dst_path->nodes[level] = dst_eb; >> + dst_path->slots[level] = 0; >> + dst_path->locks[level] = 0; >> + >> + /* Do the generation aware breadth-first search */ >> + ret = qgroup_trace_new_subtree_blocks(trans, src_eb, dst_path, level, >> + level, last_snapshot); >> + if (ret < 0) >> + goto out; >> + ret = 0; >> + >> +out: >> + free_extent_buffer(src_eb); >> + free_extent_buffer(dst_eb); >> + btrfs_free_path(dst_path); >> + if (ret < 0) >> + fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT; >> + return ret; >> +} >> + >> int btrfs_qgroup_trace_subtree(struct btrfs_trans_handle *trans, >> struct extent_buffer *root_eb, >> u64 root_gen, int root_level) >> diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h >> index 54b8bb282c0e..9f9943dfd493 100644 >> --- a/fs/btrfs/qgroup.h >> +++ b/fs/btrfs/qgroup.h >> @@ -236,6 +236,16 @@ int btrfs_qgroup_trace_leaf_items(struct >> btrfs_trans_handle *trans, >> int btrfs_qgroup_trace_subtree(struct btrfs_trans_handle *trans, >> struct extent_buffer *root_eb, >> u64 root_gen, int root_level); >> + >> +/* >> + * Inform qgroup to trace subtree swap used in balance. >> + * Unlike btrfs_qgroup_trace_subtree(), this function will only trace >> + * new tree blocks whose generation is equal to (or larger than) >> @last_snapshot. >> + */ >> +int btrfs_qgroup_trace_subtree_swap(struct btrfs_trans_handle *trans, >> + struct extent_buffer *src_parent, int src_slot, >> + struct extent_buffer *dst_parent, int dst_slot, >> + u64 last_snapshot); > > I've noticed that qgroup.h contains the function comments, but it's > preferred to have them in the .c file, next to the code. Please move it > and feel free to send patch moving the rest. A new code style learned. Will update the patchset to reflect this comment. Thanks, Qu > >> int btrfs_qgroup_account_extent(struct btrfs_trans_handle *trans, u64 >> bytenr, >> u64 num_bytes, struct ulist *old_roots, >> struct ulist *new_roots); >> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c >> index 8783a1776540..07ab61a740ae 100644 >> --- a/fs/btrfs/relocation.c >> +++ b/fs/btrfs/relocation.c >> @@ -1879,14 +1879,9 @@ int replace_path(struct btrfs_trans_handle *trans, >> * and tree block numbers, if current trans doesn't free >> * data reloc tree inode. >> */ >> - ret = btrfs_qgroup_trace_subtree(trans, parent, >> - btrfs_header_generation(parent), >> - btrfs_header_level(parent)); >> - if (ret < 0) >> - break; >> - ret = btrfs_qgroup_trace_subtree(trans, path->nodes[level], >> - btrfs_header_generation(path->nodes[level]), >> - btrfs_header_level(path->nodes[level])); >> + ret = btrfs_qgroup_trace_subtree_swap(trans, parent, slot, >> + path->nodes[level], path->slots[level], >> + last_snapshot); >> if (ret < 0) >> break; >> >> -- >> 2.18.0
signature.asc
Description: OpenPGP digital signature