On 2018/8/12 8:59 AM, Dan Merillat wrote:
> On Sat, Aug 11, 2018 at 8:30 PM Qu Wenruo <quwenruo.bt...@gmx.com> wrote:
>>
>> It looks a lot like qgroup, but there's too much noise.
>> The pinpoint trace event would be btrfs_find_all_roots().
> 
> I had this half-written when you replied.
> 
> Agreed: it looks like the bulk of the time is spent in qgroups.  Spent
> some time with sysrq-l and ftrace:
> 
> ? __rcu_read_unlock+0x5/0x50
> ? return_to_handler+0x15/0x36
> __rcu_read_unlock+0x5/0x50
> find_extent_buffer+0x47/0x90                    extent_io.c:4888
> read_block_for_search.isra.12+0xc8/0x350        ctree.c:2399
> btrfs_search_slot+0x3e7/0x9c0                   ctree.c:2837
> btrfs_next_old_leaf+0x1dc/0x410                 ctree.c:5702
> btrfs_next_old_item                             ctree.h:2952
> add_all_parents                                 backref.c:487
> resolve_indirect_refs+0x3f7/0x7e0               backref.c:575
> find_parent_nodes+0x42d/0x1290                  backref.c:1236
> ? find_parent_nodes+0x5/0x1290                  backref.c:1114
> btrfs_find_all_roots_safe+0x98/0x100            backref.c:1414
> btrfs_find_all_roots+0x52/0x70                  backref.c:1442
> btrfs_qgroup_trace_extent_post+0x27/0x60        qgroup.c:1503
> btrfs_qgroup_trace_leaf_items+0x104/0x130       qgroup.c:1589
> btrfs_qgroup_trace_subtree+0x26a/0x3a0          qgroup.c:1750
> do_walk_down+0x33c/0x5a0                        extent-tree.c:8883
> walk_down_tree+0xa8/0xd0                        extent-tree.c:9041
> btrfs_drop_snapshot+0x370/0x8b0                 extent-tree.c:9203
> merge_reloc_roots+0xcf/0x220
> btrfs_recover_relocation+0x26d/0x400
> ? btrfs_cleanup_fs_roots+0x16a/0x180
> btrfs_remount+0x32e/0x510
> do_remount_sb+0x67/0x1e0
> do_mount+0x712/0xc90
> 
> The mount is looping in btrfs_qgroup_trace_subtree, as evidenced by
> the following ftrace filter:
> fileserver:/sys/kernel/tracing# cat set_ftrace_filter
> btrfs_qgroup_trace_extent
> btrfs_qgroup_trace_subtree

Yep, it's quota causing the hang.
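
As a side note for anyone reproducing this, the filter shown above can be
set up roughly like this (my sketch, not from the original report; assumes
tracefs is mounted at /sys/kernel/tracing and CONFIG_FUNCTION_TRACER=y):

```shell
# Limit the function tracer to the two qgroup entry points seen in the
# stack trace, then watch the loop live.
cd /sys/kernel/tracing
echo btrfs_qgroup_trace_extent   > set_ftrace_filter
echo btrfs_qgroup_trace_subtree >> set_ftrace_filter
echo function > current_tracer
echo 1 > tracing_on
# Each pass of the looping mount shows up as a burst of these functions:
cat trace_pipe
```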

> 
[snip]
> 
> So 10-13 minutes per cycle.
> 
>> 11T, with highly deduped usage, is really the worst-case scenario for qgroup.
>> Qgroup is not good at handling highly reflinked files, nor balance.
>> When the two combine, it gets even worse.
> 
> I'm not really understanding the use-case of qgroup if it melts down
> on large systems with a shared base + individual changes.

The problem is that for balance, btrfs does a trick: it switches the tree
reloc tree with the real fs tree.
However, the tree reloc tree is not accounted in quota, while the real fs
tree does contribute to quota.

And since the owner changes as described above, btrfs needs to do a full
subtree rescan.
For a small subvolume that's not a problem, but for a large subvolume,
quota needs to rescan thousands of tree blocks, and due to the highly
deduped files, each tree block needs extra iterations for each deduped
file.

Both factors contribute to the slow mount.
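
To illustrate the two factors (my own toy model, not kernel code; the
block and reflink counts are made-up parameters), the rescan cost is
roughly tree_blocks times refs_per_extent, since the backref walk
(btrfs_find_all_roots) visits every reflink of every traced extent:

```python
# Toy cost model for the qgroup subtree rescan described above.
# Parameters are illustrative, not measured from the reporter's fs.

def rescan_cost(tree_blocks: int, refs_per_extent: int) -> int:
    """Each traced tree block triggers a backref walk that has to
    visit every reflink of the extent, so the costs multiply."""
    return tree_blocks * refs_per_extent

# A small subvolume with little dedup stays cheap:
small = rescan_cost(tree_blocks=1_000, refs_per_extent=2)

# A large, heavily deduped subvolume multiplies both factors:
large = rescan_cost(tree_blocks=100_000, refs_per_extent=200)

print(small)  # 2000
print(large)  # 20000000 -- four orders of magnitude more work
```

Neither factor alone is fatal; it is the product of the two that turns
the mount-time rescan into the 10-13 minute cycles seen in the trace.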

There are several workaround patches on the mailing list; one makes the
balance run in the background at mount time, so it won't hang the mount.
But it still makes transactions pretty slow (writes will still be blocked
for a long time).

There is also a plan to skip the subtree rescan completely, but it needs
extra review to ensure such a tree block switch won't change the quota
numbers.

Thanks,
Qu

> 
>> I'll add a new rescue subcommand, 'btrfs rescue disable-quota' for you
>> to disable quota offline.
> 
> Ok.  I was looking at just doing this to speed things up:
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 51b5e2da708c..c5bf937b79f0 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -8877,7 +8877,7 @@ static noinline int do_walk_down(struct btrfs_trans_handle *trans,
>  			parent = 0;
>  		}
>  
> -		if (need_account) {
> +		if (0) {
>  			ret = btrfs_qgroup_trace_subtree(trans, root, next,
>  							 generation, level - 1);
>  			if (ret) {
>  				btrfs_err_rl(fs_info,
>  					"Error %d accounting shared subtree. Quota is out of sync, rescan required.",
>  					ret);
>  			}
> 
> 
> If I follow, this will leave me with inconsistent qgroups and a full
> rescan is required.  That seems an acceptable tradeoff, since it seems
> like the best plan going forward is to nuke the qgroups anyway.
> 
> There's still the btrfs-transaction spin, but I'm hoping that's
> related to qgroups as well.
> 
>>
>> Thanks,
>> Qu
> 
> Appreciate it.  I was going to go with my hackjob patch to avoid any
> untested rewriting - there's already an error path for "something went
> wrong updating qgroups during walk_tree" so it seemed safest to take
> advantage of it.  I'll patch either the kernel or the btrfs programs,
> whichever you think is best.
> 
