On Sat, Aug 11, 2018 at 8:30 PM Qu Wenruo <quwenruo.bt...@gmx.com> wrote:
>
> It looks a lot like qgroup, but there's too much noise.
> The pinpoint trace event would be btrfs_find_all_roots().
I had this half-written when you replied. Agreed: it looks like the bulk of the time is spent in qgroups. I spent some time with sysrq-l and ftrace:

 ? __rcu_read_unlock+0x5/0x50
 ? return_to_handler+0x15/0x36
 __rcu_read_unlock+0x5/0x50
 find_extent_buffer+0x47/0x90 extent_io.c:4888
 read_block_for_search.isra.12+0xc8/0x350 ctree.c:2399
 btrfs_search_slot+0x3e7/0x9c0 ctree.c:2837
 btrfs_next_old_leaf+0x1dc/0x410 ctree.c:5702
 btrfs_next_old_item ctree.h:2952
 add_all_parents backref.c:487
 resolve_indirect_refs+0x3f7/0x7e0 backref.c:575
 find_parent_nodes+0x42d/0x1290 backref.c:1236
 ? find_parent_nodes+0x5/0x1290 backref.c:1114
 btrfs_find_all_roots_safe+0x98/0x100 backref.c:1414
 btrfs_find_all_roots+0x52/0x70 backref.c:1442
 btrfs_qgroup_trace_extent_post+0x27/0x60 qgroup.c:1503
 btrfs_qgroup_trace_leaf_items+0x104/0x130 qgroup.c:1589
 btrfs_qgroup_trace_subtree+0x26a/0x3a0 qgroup.c:1750
 do_walk_down+0x33c/0x5a0 extent-tree.c:8883
 walk_down_tree+0xa8/0xd0 extent-tree.c:9041
 btrfs_drop_snapshot+0x370/0x8b0 extent-tree.c:9203
 merge_reloc_roots+0xcf/0x220
 btrfs_recover_relocation+0x26d/0x400
 ? btrfs_cleanup_fs_roots+0x16a/0x180
 btrfs_remount+0x32e/0x510
 do_remount_sb+0x67/0x1e0
 do_mount+0x712/0xc90

The mount is looping in btrfs_qgroup_trace_subtree, as evidenced by the following ftrace filter:

fileserver:/sys/kernel/tracing# cat set_ftrace_filter
btrfs_qgroup_trace_extent
btrfs_qgroup_trace_subtree
# cat trace
...
 mount-6803 [003] .... 80407.649752: btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_subtree
 mount-6803 [003] .... 80407.649772: btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
 mount-6803 [003] .... 80407.649797: btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
 mount-6803 [003] .... 80407.649821: btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
 mount-6803 [003] .... 80407.649846: btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
 mount-6803 [003] .... 80407.701652: btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
 mount-6803 [003] .... 80407.754547: btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
 mount-6803 [003] .... 80407.754574: btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
 mount-6803 [003] .... 80407.754598: btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
 mount-6803 [003] .... 80407.754622: btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
 mount-6803 [003] .... 80407.754646: btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
... repeats 240 times
 mount-6803 [002] .... 80412.568804: btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
 mount-6803 [002] .... 80412.568825: btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
 mount-6803 [002] .... 80412.568850: btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_subtree
 mount-6803 [002] .... 80412.568872: btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items

It looks like each invocation of btrfs_qgroup_trace_subtree is taking forever:

 mount-6803 [006] .... 80641.627709: btrfs_qgroup_trace_subtree <-do_walk_down
 mount-6803 [003] .... 81433.760945: btrfs_qgroup_trace_subtree <-do_walk_down

(do_walk_down added to the trace here)

 mount-6803 [001] .... 82124.623557: do_walk_down <-walk_down_tree
 mount-6803 [001] .... 82124.623567: btrfs_qgroup_trace_subtree <-do_walk_down
 mount-6803 [006] .... 82695.241306: do_walk_down <-walk_down_tree
 mount-6803 [006] .... 82695.241316: btrfs_qgroup_trace_subtree <-do_walk_down

So that's 10-13 minutes per cycle.

> 11T, with highly deduped usage, is really the worst-case scenario for qgroup.
> Qgroup is not really good at handling highly reflinked files, nor balance.
> When they combine, it gets worse.

I'm not really seeing the use case for qgroups if they melt down on large filesystems with a shared base plus individual changes.

> I'll add a new rescue subcommand, 'btrfs rescue disable-quota', for you
> to disable quota offline.

Ok.
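For what it's worth, the 10-13 minute figure is just the deltas between successive btrfs_qgroup_trace_subtree entries; quick arithmetic on the timestamps copied from the trace above:

```python
# Timestamps (seconds since boot) of consecutive btrfs_qgroup_trace_subtree
# invocations, copied from the ftrace output above.
stamps = [80641.627709, 81433.760945, 82124.623567, 82695.241316]

# Time between each pair of consecutive invocations.
deltas = [b - a for a, b in zip(stamps, stamps[1:])]
for d in deltas:
    print(f"{d:7.1f} s = {d / 60:4.1f} min")
# ->  792.1 s = 13.2 min
#     690.9 s = 11.5 min
#     570.6 s =  9.5 min
```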
I was looking at just doing this to speed things up:

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 51b5e2da708c..c5bf937b79f0 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8877,7 +8877,7 @@ static noinline int do_walk_down(struct btrfs_trans_handle *trans,
 			parent = 0;
 		}
 
-		if (need_account) {
+		if (0) {
 			ret = btrfs_qgroup_trace_subtree(trans, root, next,
 							 generation, level - 1);
 			if (ret) {
 				btrfs_err_rl(fs_info,
 					     "Error %d accounting shared subtree. Quota is out of sync, rescan required.",
 					     ret);
 			}

If I follow, this will leave me with inconsistent qgroups, and a full rescan will be required. That seems an acceptable tradeoff, since the best plan going forward appears to be nuking the qgroups anyway.

There's still the btrfs-transaction spin, but I'm hoping that's related to qgroups as well.

>
> Thanks,
> Qu

Appreciate it. I was going to go with my hackjob patch to avoid any untested rewriting: there's already an error path for "something went wrong updating qgroups during walk_tree", so it seemed safest to take advantage of it. I'll patch either the kernel or the btrfs programs, whichever you think is best.
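For completeness, once the patched kernel lets the remount finish, I assume the out-of-sync qgroups could be dealt with from userspace using the existing btrfs-progs commands (the /mnt mount point here is just a placeholder):

```shell
# Rebuild qgroup accounting from scratch; -w waits for the rescan to finish.
btrfs quota rescan -w /mnt

# Or, since the plan is to drop qgroups anyway, disable quota entirely.
btrfs quota disable /mnt
```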