On 2018/9/26 下午10:35, David Sterba wrote:
> On Tue, Sep 11, 2018 at 01:38:15PM +0800, Qu Wenruo wrote:
>> Before this patch, for quota enabled balance, btrfs needs to mark the
>> whole subtree dirty for quota.
>>
>> E.g.
>> OO = Old tree blocks (from file tree)
>> NN = New tree blocks (from reloc tree)
>>
>>         File tree (src)                        Reloc tree (dst)
>>             OO (a)                              NN (a)
>>            /  \                                /  \
>>      (b) OO    OO (c)                    (b) NN    NN (c)
>>         /  \  /  \                          /  \  /  \
>>        OO  OO OO OO (d)                    OO  OO OO NN (d)
>>
>> For old balance + quota case, quota will mark the whole src and dst tree
>> dirty, including all the 3 old tree blocks in reloc tree.
>>
>> It's doable for small file tree or new tree blocks are all
>> located at lower level.
>>
>> But for large file tree or new tree blocks are all located at higher
>> level, this will lead to mark the whole tree dirty, and be unbelievably
>> slow.
>>
>> This patch will change how we handle such balance with quota enabled
>> case.
>>
>> Now we will search from (b) and (c) for any new tree blocks whose generation
>> is equal to @last_snapshot, and only mark them dirty.
>>
>> In above case, we only need to trace tree blocks NN(b), NN(c) and NN(d).
>> (NN(a) will be traced when CoW happens for nodeptr modification).
>> And also for tree blocks OO(b), OO(c), OO(d). (OO(a) will be traced when
>> CoW happens for nodeptr modification)
>>
>> For above case, we could skip 3 tree blocks, but for larger tree, we can
>> skip tons of unmodified tree blocks, and hugely speed up balance.
>>
>> This patch will introduce a new function,
>> btrfs_qgroup_trace_subtree_swap(), which will do the following main
>> work:
>>
>> 1) Read out real root eb
>>    And setup basic dst_path for later calls
>> 2) Call qgroup_trace_new_subtree_blocks()
>>    To trace all new tree blocks in reloc tree and their counter
>>    parts in file tree.
>>
>> Signed-off-by: Qu Wenruo <w...@suse.com>
>> ---
>>  fs/btrfs/qgroup.c     | 94 +++++++++++++++++++++++++++++++++++++++++++
>>  fs/btrfs/qgroup.h     | 10 +++++
>>  fs/btrfs/relocation.c | 11 ++---
>>  3 files changed, 107 insertions(+), 8 deletions(-)
>>
>> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
>> index 0702953d70a7..a94027b2620e 100644
>> --- a/fs/btrfs/qgroup.c
>> +++ b/fs/btrfs/qgroup.c
>> @@ -1953,6 +1953,100 @@ static int qgroup_trace_new_subtree_blocks(struct 
>> btrfs_trans_handle* trans,
>>      return ret;
>>  }
>>  
>> +/*
>> + * For balance subtree swap.
>> + *
>> + * Will go down the tree block pointed by @dst_eb (pointed by @dst_parent 
>> and
>> + * @dst_slot), and find any tree blocks whose generation is at 
>> @last_snapshot,
>> + * and then go down @src_eb (pointed by @src_parent and @src_slot) to find
>> + * the conterpart of the tree block, then mark both tree blocks as qgroup 
>> dirty,
>> + * and skip all tree blocks whose generation is smaller than last_snapshot.
>> + *
>> + * This would skip tons of tree blocks of original 
>> btrfs_qgroup_trace_subtree(),
>> + * which is the culprit of super slow balance if the file tree is large.
>> + *
>> + * @src_parent, @src_slot: pointer to src (file tree) eb.
>> + * @dst_parent, @dst_slot: pointer to dst (tree reloc tree) eb.
>> + */
>> +int btrfs_qgroup_trace_subtree_swap(struct btrfs_trans_handle *trans,
>> +                            struct extent_buffer *src_parent, int src_slot,
>> +                            struct extent_buffer *dst_parent, int dst_slot,
>> +                            u64 last_snapshot)
>> +{
>> +    struct btrfs_fs_info *fs_info = trans->fs_info;
>> +    struct btrfs_path *dst_path = NULL;
>> +    struct btrfs_key first_key;
>> +    struct extent_buffer *src_eb = NULL;
>> +    struct extent_buffer *dst_eb = NULL;
>> +    u64 child_gen;
>> +    u64 child_bytenr;
>> +    int level;
>> +    int ret;
>> +
>> +    if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
>> +            return 0;
>> +
>> +    /* Wrong parameter order */
>> +    BUG_ON(btrfs_node_ptr_generation(src_parent, src_slot) >
>> +            btrfs_node_ptr_generation(dst_parent, dst_slot));
> 
> If this is to catch errors development-time errors, then an ASSERT is
> probably better, but it looks like this needs to be an ordinary runtime
> sanity check too.

For runtime check, it's already done before entering real relocation code.

Relocation will check tree generation before really doing the swap.

So it is a development time error.

> 
>> +
>> +    /* Read out real @src_eb, pointed by @src_parent and @src_slot */
>> +    child_bytenr = btrfs_node_blockptr(src_parent, src_slot);
>> +    child_gen = btrfs_node_ptr_generation(src_parent, src_slot);
>> +    btrfs_node_key_to_cpu(src_parent, &first_key, src_slot);
>> +
>> +    src_eb = read_tree_block(fs_info, child_bytenr, child_gen,
>> +                    btrfs_header_level(src_parent) - 1, &first_key);
>> +    if (IS_ERR(src_eb)) {
>> +            ret = PTR_ERR(src_eb);
>> +            goto out;
>> +    }
>> +
>> +    /* Read out real @dst_eb, pointed by @src_parent and @src_slot */
>> +    child_bytenr = btrfs_node_blockptr(dst_parent, dst_slot);
>> +    child_gen = btrfs_node_ptr_generation(dst_parent, dst_slot);
>> +    btrfs_node_key_to_cpu(dst_parent, &first_key, dst_slot);
>> +
>> +    dst_eb = read_tree_block(fs_info, child_bytenr, child_gen,
>> +                    btrfs_header_level(dst_parent) - 1, &first_key);
>> +    if (IS_ERR(dst_eb)) {
>> +            ret = PTR_ERR(dst_eb);
>> +            goto out;
>> +    }
>> +
>> +    if (!extent_buffer_uptodate(src_eb) || !extent_buffer_uptodate(dst_eb)) 
>> {
>> +            ret = -EINVAL;
>> +            goto out;
>> +    }
>> +
>> +    level = btrfs_header_level(dst_eb);
>> +    dst_path = btrfs_alloc_path();
>> +    if (!dst_path) {
>> +            ret = -ENOMEM;
>> +            goto out;
>> +    }
>> +    /* For dst_path */
>> +    extent_buffer_get(dst_eb);
>> +    dst_path->nodes[level] = dst_eb;
>> +    dst_path->slots[level] = 0;
>> +    dst_path->locks[level] = 0;
>> +
>> +    /* Do the generation aware breadth-first search */
>> +    ret = qgroup_trace_new_subtree_blocks(trans, src_eb, dst_path, level,
>> +                                          level, last_snapshot);
>> +    if (ret < 0)
>> +            goto out;
>> +    ret = 0;
>> +
>> +out:
>> +    free_extent_buffer(src_eb);
>> +    free_extent_buffer(dst_eb);
>> +    btrfs_free_path(dst_path);
>> +    if (ret < 0)
>> +            fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT;
>> +    return ret;
>> +}
>> +
>>  int btrfs_qgroup_trace_subtree(struct btrfs_trans_handle *trans,
>>                             struct extent_buffer *root_eb,
>>                             u64 root_gen, int root_level)
>> diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
>> index 54b8bb282c0e..9f9943dfd493 100644
>> --- a/fs/btrfs/qgroup.h
>> +++ b/fs/btrfs/qgroup.h
>> @@ -236,6 +236,16 @@ int btrfs_qgroup_trace_leaf_items(struct 
>> btrfs_trans_handle *trans,
>>  int btrfs_qgroup_trace_subtree(struct btrfs_trans_handle *trans,
>>                             struct extent_buffer *root_eb,
>>                             u64 root_gen, int root_level);
>> +
>> +/*
>> + * Inform qgroup to trace subtree swap used in balance.
>> + * Unlike btrfs_qgroup_trace_subtree(), this function will only trace
>> + * new tree blocks whose generation is equal to (or larger than) 
>> @last_snapshot.
>> + */
>> +int btrfs_qgroup_trace_subtree_swap(struct btrfs_trans_handle *trans,
>> +                            struct extent_buffer *src_parent, int src_slot,
>> +                            struct extent_buffer *dst_parent, int dst_slot,
>> +                            u64 last_snapshot);
> 
> I've noticed that qgroup.h contains the function comments, but it's
> preferred to have them in the .c file, next to the code. Please move it
> and feel free to send patch moving the rest.

A new code style learned.

Will update the patchset to reflect this comment.

Thanks,
Qu

> 
>>  int btrfs_qgroup_account_extent(struct btrfs_trans_handle *trans, u64 
>> bytenr,
>>                              u64 num_bytes, struct ulist *old_roots,
>>                              struct ulist *new_roots);
>> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
>> index 8783a1776540..07ab61a740ae 100644
>> --- a/fs/btrfs/relocation.c
>> +++ b/fs/btrfs/relocation.c
>> @@ -1879,14 +1879,9 @@ int replace_path(struct btrfs_trans_handle *trans,
>>               *    and tree block numbers, if current trans doesn't free
>>               *    data reloc tree inode.
>>               */
>> -            ret = btrfs_qgroup_trace_subtree(trans, parent,
>> -                            btrfs_header_generation(parent),
>> -                            btrfs_header_level(parent));
>> -            if (ret < 0)
>> -                    break;
>> -            ret = btrfs_qgroup_trace_subtree(trans, path->nodes[level],
>> -                            btrfs_header_generation(path->nodes[level]),
>> -                            btrfs_header_level(path->nodes[level]));
>> +            ret = btrfs_qgroup_trace_subtree_swap(trans, parent, slot,
>> +                            path->nodes[level], path->slots[level],
>> +                            last_snapshot);
>>              if (ret < 0)
>>                      break;
>>  
>> -- 
>> 2.18.0

Attachment: signature.asc
Description: OpenPGP digital signature

Reply via email to