On 2018/8/12 8:59 AM, Dan Merillat wrote:
> On Sat, Aug 11, 2018 at 8:30 PM Qu Wenruo <quwenruo.bt...@gmx.com> wrote:
>>
>> It looks pretty like qgroup, but too many noise.
>> The pin point trace event would btrfs_find_all_roots().
>
> I had this half-written when you replied.
>
> Agreed: looks like bulk of time spent resides in qgroups. Spent some
> time with sysrq-l and ftrace:
>
> ? __rcu_read_unlock+0x5/0x50
> ? return_to_handler+0x15/0x36
> __rcu_read_unlock+0x5/0x50
> find_extent_buffer+0x47/0x90 extent_io.c:4888
> read_block_for_search.isra.12+0xc8/0x350 ctree.c:2399
> btrfs_search_slot+0x3e7/0x9c0 ctree.c:2837
> btrfs_next_old_leaf+0x1dc/0x410 ctree.c:5702
> btrfs_next_old_item ctree.h:2952
> add_all_parents backref.c:487
> resolve_indirect_refs+0x3f7/0x7e0 backref.c:575
> find_parent_nodes+0x42d/0x1290 backref.c:1236
> ? find_parent_nodes+0x5/0x1290 backref.c:1114
> btrfs_find_all_roots_safe+0x98/0x100 backref.c:1414
> btrfs_find_all_roots+0x52/0x70 backref.c:1442
> btrfs_qgroup_trace_extent_post+0x27/0x60 qgroup.c:1503
> btrfs_qgroup_trace_leaf_items+0x104/0x130 qgroup.c:1589
> btrfs_qgroup_trace_subtree+0x26a/0x3a0 qgroup.c:1750
> do_walk_down+0x33c/0x5a0 extent-tree.c:8883
> walk_down_tree+0xa8/0xd0 extent-tree.c:9041
> btrfs_drop_snapshot+0x370/0x8b0 extent-tree.c:9203
> merge_reloc_roots+0xcf/0x220
> btrfs_recover_relocation+0x26d/0x400
> ? btrfs_cleanup_fs_roots+0x16a/0x180
> btrfs_remount+0x32e/0x510
> do_remount_sb+0x67/0x1e0
> do_mount+0x712/0xc90
>
> The mount is looping in btrfs_qgroup_trace_subtree, as evidenced by
> the following ftrace filter:
> fileserver:/sys/kernel/tracing# cat set_ftrace_filter
> btrfs_qgroup_trace_extent
> btrfs_qgroup_trace_subtree

Yep, it's quota causing the hang.

> [snip]
>
> So 10-13 minutes per cycle.
>
>> 11T, with highly deduped usage is really the worst scenario case for qgroup.
>> Qgroup is not really good at handle highly reflinked files, nor balance.
>> When they combines, it goes worse.
>
> I'm not really understanding the use-case of qgroup if it melts down
> on large systems with a shared base + individual changes.

The problem is that, for balance, btrfs does a trick by switching the
tree reloc tree with the real fs tree.
However, the tree reloc tree isn't accounted to quota, while the real fs
tree does contribute to quota.

And because of that owner change, btrfs needs to do a full subtree
rescan.
For a small subvolume it's not a problem, but for a large subvolume,
quota needs to rescan thousands of tree blocks, and due to the highly
deduped files, each tree block needs extra iterations for each deduped
file.

Both factors contribute to the slow mount.

There are several workaround patches on the mailing list. One makes the
balance run in the background at mount time, so it won't hang the mount.
But it still makes transactions pretty slow (writes will still be
blocked for a long time).

There is also a plan to skip the subtree rescan completely, but it needs
extra review to ensure such a tree block switch won't change the quota
numbers.

Thanks,
Qu

>
>> I'll add a new rescue subcommand, 'btrfs rescue disable-quota' for you
>> to disable quota offline.
>
> Ok. I was looking at just doing this to speed things up:
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 51b5e2da708c..c5bf937b79f0 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -8877,7 +8877,7 @@ static noinline int do_walk_down(struct btrfs_trans_handle *trans,
>                         parent = 0;
>                 }
>
> -               if (need_account) {
> +               if (0) {
>                         ret = btrfs_qgroup_trace_subtree(trans, root, next,
>                                                          generation, level - 1);
>                         if (ret) {
>
>
>                                 btrfs_err_rl(fs_info,
>                                              "Error %d accounting shared subtree. Quota is out of sync, rescan required.",
>                                              ret);
>                         }
>
>
> If I follow, this will leave me with inconsistent qgroups and a full
> rescan is required. That seems an acceptable tradeoff, since it seems
> like the best plan going forward is to nuke the qgroups anyway.
>
> There's still the btrfs-transaction spin, but I'm hoping that's
> related to qgroups as well.
>
>>
>> Thanks,
>> Qu
>
> Appreciate it. I was going to go with my hackjob patch to avoid any
> untested rewriting - there's already an error path for "something went
> wrong updating qgroups during walk_tree" so it seemed safest to take
> advantage of it. I'll patch either the kernel or the btrfs programs,
> whichever you think is best.
>
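
To put rough numbers on the rescan cost described earlier in this mail, here is a toy model in C. It is not kernel code. It only assumes that the work of a full subtree rescan grows as (tree blocks in the subtree) x (extents per block) x (files sharing each extent). The struct, the function names and the sample figures are all hypothetical and chosen for illustration only.

```c
#include <stdint.h>

/* Hypothetical workload description; none of these fields exist in btrfs. */
struct rescan_model {
	uint64_t tree_blocks;       /* tree blocks in the switched subtree */
	uint64_t extents_per_block; /* assumed average extent items per block */
	uint64_t refs_per_extent;   /* files sharing each extent (dedupe factor) */
};

/* Rough number of backref-walk steps for one full subtree rescan. */
uint64_t rescan_steps(const struct rescan_model *m)
{
	return m->tree_blocks * m->extents_per_block * m->refs_per_extent;
}

/*
 * How many times more steps the model predicts compared with the same
 * subtree where every extent has a single owner. Returns 0 for an
 * empty subtree so we never divide by zero.
 */
uint64_t dedupe_slowdown(const struct rescan_model *m)
{
	struct rescan_model unshared = *m;
	uint64_t base;

	unshared.refs_per_extent = 1;
	base = rescan_steps(&unshared);
	if (base == 0)
		return 0;
	return rescan_steps(m) / base;
}
```

With made-up inputs of 5000 tree blocks, 100 extents per block and 50 files sharing each extent, the model gives 25,000,000 steps against 500,000 for the unshared case. Real numbers depend on the filesystem layout, and the model ignores caching and overflow for very large inputs.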