On 2018/9/26 下午10:17, Qu Wenruo wrote:
> 
> 
> On 2018/9/26 下午10:06, David Sterba wrote:
>> On Tue, Sep 11, 2018 at 01:38:11PM +0800, Qu Wenruo wrote:
>>> This patchset can be fetched from github:
>>> https://github.com/adam900710/linux/tree/qgroup_balance_skip_trees
>>> The base commit is v4.19-rc1 tag.
>>
>> I want to merge this patchset to 4.20, it's been in for-next for some
>> time and it addresses a problem that gets reported very often.
>>
>> I'm still doing some tests, there are minor issues that are not
>> blocking, more code review is always welcome.
>>
>>> There are a lot of reports of system hang for balance on quota enabled
>>> fs.
>>> It's most obvious for large fs.
>>>
>>> The hang is caused by tons of unmodified extents marked as qgroup dirty.
>>> Such unmodified/unrelated sources include:
>>> 1) Unmodified subtree
>>> 2) Subtree drop for reloc tree
>>> (BTW, other sources includes unmodified file extent items)
>>>
>>> E.g.
>>> OO = Old tree blocks from file tree
>>> NN = New tree blocks from reloc tree
>>>
>>>         file tree                              reloc tree
>>>            OO (a)                                  NN (a)
>>>           /  \                                    /  \
>>>     (b) OO    OO (c)                        (b) NN    NN (c)
>>>        / \   / \                               / \   / \
>>>      OO  OO OO  OO                           OO  OO OO  NN
>>>     (d) (e) (f) (g)                         (d) (e) (f) (g)
>>>
>>> In above case, balance will modify nodeptr in OO(a) to point NN(b) and
>>> NN(c), and modify NN(a) to point to OO(B) and OO(c).
>>>
>>> Before this patch, quota will mark the whole subtree from its parent
>>> down to the leaves as dirty.
>>> So btrfs quota need to trace all tree block from (a) to (g).
>>
>> I find the use of 'trace' a bit confusing as there are the trace events,
>> but as it's already used in the qgroup code I think it can stay as is.
> 
> I'm pretty open to rename the word "trace".
> 
> Any recommendation for a newer name?
> 
>>
>>> However tree blocks (d) (e) (f) are shared between both trees, thus
>>> there is no need to trace those 3 tree blocks.
>>>
>>> This patchset will change how this work by only tracing modified tree
>>> blocks in reloc tree, and their counter parts in file tree.
>>>
>>> With this patch, we could skip tree blocks OO(d)~OO(f) in above example,
>>> thus reduce some some overhead caused by qgroup.
>>>
>>> The improvement is mostly related to metadata relocation.
>>>
>>> Also, for metadata relocation, we don't really need to trace data
>>> extents as they're not modified at all.
>>>
>>>
>>> [[Benchmark]]
>> [...]
>>
>> Thanks for the benchmarks, it would be good to add the information to
>> the patch that's bringing the performance improvement.
> 
> Could do that.
> 
> However the current improved (from around 10 * nr_relocated to 6~7 *
> nr_relocated) is the final result of the last 4 patches.
> 
> I'm not pretty sure if any single patch could bring obvious improvement.
> 
>>
>>> Hardware:
>>>     VM 4G vRAM, 8 vCPUs,
>>>     disk is using 'unsafe' cache mode,
>>>     backing device is SAMSUNG 850 evo SSD.
>>>     Host has 16G ram.
>>>
>>> Mkfs parameter:
>>>     --nodesize 4K (To bump up tree size)
>>>
>>> Initial subvolume contents:
>>>     4G data copied from /usr and /lib.
>>>     (With enough regular small files)
>>>
>>> Snapshots:
>>>     16 snapshots of the original subvolume.
>>>     each snapshot has 3 random files modified.
>>>
>>> balance parameter:
>>>     -m
>>>
>>> So the content should be pretty similar to a real world root fs layout.
>>>
>>>                      | v4.19-rc1    | w/ patchset    | diff (*)
>>> ---------------------------------------------------------------
>>> relocated extents    | 22874        | 22856          | +0.1%
>>> qgroup dirty extents | 225251       | 140431         | -37.7%
>>> time (sys)           | 40.161s      | 20.574s        | -48.7%
>>> time (real)          | 42.163s      | 25.173s        | -40.3%
>>>
>>> *: (val_new - val_old) / val_old * 100%
>>>
>>> And the difference is becoming more and more obvious if more snapshots
>>> are created.
>>>
>>> If we increase the number of snapshots to 64 (4 times the number of
>>> references, and 64 snapshots is already not recommended to use with
>>> quota)
>>>
>>>                      | v4.19-rc1    | w/ patchset    | diff (*)
>>> ---------------------------------------------------------------
>>> relocated extents    | 22462        | 22467          | +0.0%
>>> qgroup dirty extents | 314380       | 140826         | -55.2%
>>> time (sys)           | 158.033s     | 74.292s        | -53.0%
>>> time (real)          | 197.395s     | 90.529s        | -67.6%

BTW, this result is still far from perfection.
(and in real world case, due to extra disk IO and tree read seek, the
improvement could be less obvious)

The theoretical minimal of qgroup dirty extents should be only 2 *
nr_relocated (maybe 3, at most 4 due to other overhead, but definitely
not current 6~7).

And further more, the real big part is the backref walk code.
Currently we're using btrfs_find_all_roots(), which doesn't do any cache
for backref.

We have a cached backref implementation in
relocation.c::build_backref_tree().
Jeff is working on a global version of it with even better roots result
cache.
Which should make total time linear to qgroup dirty extents.

But currently, I'm working to reduce nr_relocated to theoretical minimal
first.
(And enhance the comment for any potential reviewer)

Thanks,
Qu

>>>
>>> For larger fs the saving should be even more obvious.
>>>
>>> Changelog:
>>> v2:
>>>   Rename "tree reloc tree" to "reloc tree".
>>>   Add patch "Don't trace subtree if we're dropping reloc tree" into the
>>>   patchset.
>>>   Fix wrong btrfs_bin_search() call, which leads to unexpected ENOENT
>>>   error for btrfs_qgroup_trace_extent_swap(). Now use dst_path->slots[]
>>>   directly.
>>>   
>>> v3:
>>>   Add new patch to avoid unnecessary data extents trace for metadata
>>>   relocation.
>>>   Better delayed ref time root owner detection to avoid unnecessary tree
>>>   block tracing.
>>>   Add benchmark result for the patchset.
>>>
>>> Qu Wenruo (7):
>>>   btrfs: qgroup: Introduce trace event to analyse the number of dirty
>>>     extents accounted
>>>   btrfs: qgroup: Introduce function to trace two swaped extents
>>>   btrfs: qgroup: Introduce function to find all new tree blocks of reloc
>>>     tree
>>>   btrfs: qgroup: Use generation aware subtree swap to mark dirty extents
>>>   btrfs: qgroup: Don't trace subtree if we're dropping reloc tree
>>>   btrfs: delayed-ref: Introduce new parameter for
>>>     btrfs_add_delayed_tree_ref() to reduce unnecessary qgroup tracing
>>>   btrfs: qgroup: Only trace data extents in leaves if we're relocating
>>>     data block group
>>>
>>>  fs/btrfs/delayed-ref.c       |   5 +-
>>>  fs/btrfs/delayed-ref.h       |   1 +
>>>  fs/btrfs/extent-tree.c       |  25 ++-
>>>  fs/btrfs/qgroup.c            | 347 +++++++++++++++++++++++++++++++++++
>>>  fs/btrfs/qgroup.h            |  11 ++
>>>  fs/btrfs/relocation.c        |  15 +-
>>>  include/trace/events/btrfs.h |  21 +++
>>>  7 files changed, 405 insertions(+), 20 deletions(-)
>>>
>>> -- 
>>> 2.18.0
> 

Attachment: signature.asc
Description: OpenPGP digital signature

Reply via email to