On Thu, Sep 27, 2018 at 02:42:28PM +0800, Qu Wenruo wrote:
> This patchset can be fetched from github:
> https://github.com/adam900710/linux/tree/qgroup_balance_skip_trees
> The base commit is v4.19-rc1 tag.
>
> There are a lot of reports of system hang for balance on quota enabled
> fs.
> It's most obvious for large fs.
>
> The hang is caused by tons of unmodified extents marked as qgroup dirty.
> Such unmodified/unrelated sources include:
> 1) Unmodified subtree
> 2) Subtree drop for reloc tree
> (BTW, other sources include unmodified file extent items)
>
> E.g.
> OO = Old tree blocks from file tree
> NN = New tree blocks from reloc tree
>
>         file tree                         reloc tree
>           OO (a)                            NN (a)
>          /      \                          /      \
>    (b) OO        OO (c)              (b) NN        NN (c)
>       /  \      /  \                    /  \      /  \
>      OO  OO    OO  OO                  OO  OO    OO  NN
>     (d) (e)   (f) (g)                 (d) (e)   (f) (g)
>
> In above case, balance will modify nodeptr in OO(a) to point to NN(b)
> and NN(c), and modify NN(a) to point to OO(b) and OO(c).
>
> Before this patch, quota will mark the whole subtree from its parent
> down to the leaves as dirty.
> So btrfs quota needs to trace all tree blocks from (a) to (g).
>
> However tree blocks (d) (e) (f) are shared between both trees, thus
> there is no need to trace those 3 tree blocks.
>
> This patchset will change how this works by only tracing modified tree
> blocks in reloc tree, and their counterparts in file tree.
>
> With this patch, we could skip tree blocks OO(d)~OO(f) in above example,
> thus reduce some overhead caused by qgroup.
>
> The improvement is mostly related to metadata relocation.
>
> Also, for metadata relocation, we don't really need to trace data
> extents as they're not modified at all.
>
>
> [[Benchmark]]
> Hardware:
>     VM 4G vRAM, 8 vCPUs,
>     disk is using 'unsafe' cache mode,
>     backing device is SAMSUNG 850 evo SSD.
>     Host has 16G ram.
>
> Mkfs parameter:
>     --nodesize 4K (To bump up tree size)
>
> Initial subvolume contents:
>     4G data copied from /usr and /lib.
>     (With enough regular small files)
>
> Snapshots:
>     16 snapshots of the original subvolume.
>     each snapshot has 3 random files modified.
>
> balance parameter:
>     -m
>
> So the content should be pretty similar to a real world root fs layout.
>
>                      | v4.19-rc1    | w/ patchset    | diff (*)
> ---------------------------------------------------------------
> relocated extents    | 22874        | 22856          | +0.1%
> qgroup dirty extents | 225251       | 140431         | -37.7%
> time (sys)           | 40.161s      | 20.574s        | -48.7%
> time (real)          | 42.163s      | 25.173s        | -40.3%
>
> *: (val_new - val_old) / val_old * 100%
>
> And the difference is becoming more and more obvious if more snapshots
> are created.
>
> If we increase the number of snapshots to 64 (4 times the number of
> references, and 64 snapshots is already not recommended to use with
> quota)
>
>                      | v4.19-rc1    | w/ patchset    | diff (*)
> ---------------------------------------------------------------
> relocated extents    | 22462        | 22467          | +0.0%
> qgroup dirty extents | 314380       | 140826         | -55.2%
> time (sys)           | 158.033s     | 74.292s        | -53.0%
> time (real)          | 197.395s     | 90.529s        | -67.6%
>
> For larger fs the saving should be even more obvious.
>
> Changelog:
> v2:
>   Rename "tree reloc tree" to "reloc tree".
>   Add patch "Don't trace subtree if we're dropping reloc tree" into the
>   patchset.
>   Fix wrong btrfs_bin_search() call, which leads to unexpected ENOENT
>   error for btrfs_qgroup_trace_extent_swap(). Now use dst_path->slots[]
>   directly.
>
> v3:
>   Add new patch to avoid unnecessary data extents trace for metadata
>   relocation.
>   Better delayed ref time root owner detection to avoid unnecessary tree
>   block tracing.
>   Add benchmark result for the patchset.
>
> v4:
>   Move part of the benchmark result from cover letter to real patches.
>   - The result is *NEW* result, since the host kernel got several
>     updates.
>     Spectre fixes seem to degrade a little memory performance.
>   - Per-patch performance change on total balance time, compared to
>     *previous* patch:
>     4th     -5%
>     5th     -25%
>     6th     -15%
>     7th     0%
>   - Cover letter still uses old test result.
>   - In patch commit message, the result is still compared to v4.19-rc1
>
>   Add more robust level check for qgroup_trace_new_subtree_blocks().
>
>   Make btrfs_qgroup_trace_subtree_swap() report wrong parameter
>   order at runtime. (Since there is only one caller, report at runtime
>   should be enough to cover development time error).
>
>   Move the comment for btrfs_qgroup_trace_subtree_swap() to qgroup.c.
>   (Other comment cleanup will be sent as a separate patch)
>
>   Fix one uncaught "tree reloc tree" naming.
>
>   Remove unrelated changes in patch 6.
>
>
> Qu Wenruo (7):
>   btrfs: qgroup: Introduce trace event to analyse the number of dirty
>     extents accounted
>   btrfs: qgroup: Introduce function to trace two swaped extents
>   btrfs: qgroup: Introduce function to find all new tree blocks of reloc
>     tree
>   btrfs: qgroup: Use generation aware subtree swap to mark dirty extents
>   btrfs: qgroup: Don't trace subtree if we're dropping reloc tree
>   btrfs: delayed-ref: Introduce new parameter for
>     btrfs_add_delayed_tree_ref() to reduce unnecessary qgroup tracing
>   btrfs: qgroup: Only trace data extents in leaves if we're relocating
>     data block group

Now added to misc-next, with a few minor adjustments.
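
For anyone following the thread, the idea from the cover letter (trace only the modified blocks of the reloc tree and their counterparts in the file tree, skip whatever is still shared) can be illustrated with a toy model. This is only a sketch and not the kernel code: `Block`, `trace_swapped()` and `trace_example()` are made-up names, and a shared block is detected here by plain object identity, whereas the patch titles say the real code is generation aware.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    """Toy tree block (hypothetical type, not a kernel structure)."""
    name: str          # position in the tree, e.g. "b"
    new: bool          # True for NN (reloc tree block), False for OO
    children: List["Block"] = field(default_factory=list)

    @property
    def label(self) -> str:
        return ("NN" if self.new else "OO") + f"({self.name})"


def trace_swapped(old: Block, new: Block, traced: List[str]) -> None:
    """Walk two swapped subtrees in parallel, recording only differing blocks."""
    if old is new:
        # Shared between file tree and reloc tree: nothing below it
        # changed, so the whole subtree is skipped.
        return
    traced.append(old.label)
    traced.append(new.label)
    for old_child, new_child in zip(old.children, new.children):
        trace_swapped(old_child, new_child, traced)


def trace_example() -> List[str]:
    """Rebuild the (b)/(c) subtrees from the cover letter diagram."""
    d, e, f = Block("d", False), Block("e", False), Block("f", False)
    old_g, new_g = Block("g", False), Block("g", True)
    old_b, old_c = Block("b", False, [d, e]), Block("c", False, [f, old_g])
    new_b, new_c = Block("b", True, [d, e]), Block("c", True, [f, new_g])

    traced: List[str] = []
    # Balance swaps the subtrees rooted at (b) and (c).
    for old, new in ((old_b, new_b), (old_c, new_c)):
        trace_swapped(old, new, traced)
    return traced


print(trace_example())
```

In this model the shared leaves (d), (e) and (f) never show up in the result, which matches the three tree blocks the cover letter says can be skipped.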