On Fri, Sep 14, 2018 at 05:11:02PM +0200, David Sterba wrote:
> On Tue, Sep 11, 2018 at 11:31:49AM -0700, Liu Bo wrote:
> > On Tue, Sep 11, 2018 at 05:34:03PM +0200, David Sterba wrote:
> > > On Thu, Aug 23, 2018 at 03:51:48AM +0800, Liu Bo wrote:
> > > > Several structs in btrfs are using rb_first() in a while loop, it'd be
> > > > more efficient to do this with rb_first_cached() which has the O(1)
> > > > complexity.
> > >
> > > We'd like to see some numbers, though just by reading the code I think
> > > there's going to be a noticeable improvement for some workloads.
> > >
> > > There's a common pattern:
> > >
> > > 	while (node = rb_first()) {
> > > 		entry = rb_entry(node);
> > > 		next = rb_next(node);
> > > 		rb_erase(node);
> > > 		cleanup(entry);
> > > 	}
> > >
> > > rb_first() needs to traverse the tree up to logN depth, rb_erase() can
> > > completely reshuffle the tree. With the caching we'll skip the traversal
> > > in rb_first(). That's a cached memory access vs a looped pointer
> > > dereference trade-off that IMHO has a clear winner.
> > >
> > > As the pattern in many cases traverses the whole tree and removes all
> > > entries, ideally we could get rid of the rebalancing too. The entries
> > > would be cleaned up in a postorder way, ie. reset the data and kfree at
> > > the end. This requires that the order of freeing does not matter, which
> > > might not apply to some trees.
> >
> > The order of freeing does not matter, but we need the tree to return
> > the correct next node to free; IOW, we have to maintain the rbtree
> > until the last two nodes.
> >
> > > I did not find suitable rbtree functions or helpers for that,
> > > rbtree_postorder_for_each_entry_safe does not allow rb_erase, and
> > > rb_erase itself forces the rebalancing internally. But I think we can
> > > implement that.
> > >
> > > We could possibly use rb_next_postorder, which makes sure that all
> > > children are traversed before the parent, so this is at least something
> > > that could be considered.
> >
> > Qu was inquiring about the same thing, I haven't got time to dig into it
> > further.
> >
> > The question we need to answer is whether we care about how fast the
> > destruction can be, as some of these trees are on critical paths while
> > some are not.
>
> Yeah, I take __btrfs_return_cluster_to_free_space as an example where
> the more efficient tree destruction would shorten the time where a lock
> is held.
>
> In other cases it might complicate the code too much, though the
> performance is not critical, eg. at umount time. Although, it'd be good
> to optimize that too now that we know how.
>
> And in the remaining cases it's not possible to avoid rb_erase. I had a
> closer look at btrfs_cleanup_defrag_inodes and btrfs_put_tree_mod_seq.
> Based on that I'd like to add this series to for-next; the first node
> caching is clear enough and safe. We can do the other optimizations
> later.
This needs more fine-grained debugging to see where the cost lies.

About the perf. numbers of rb_first_cached(), I measured the time spent
in the extent map loop of evict_inode_truncate_pages(), called from
btrfs_evict_inode():

	while (!RB_EMPTY_ROOT(&map_tree->map)) {
		rb_first();
		rb_erase();
	}

For an em tree containing 10,000 extent maps:

- with rb_first_cached, the loop takes 4880454 ns,
- with rb_first, the loop takes 4584148 ns,

so it looks like the difference is not that much, ~6%. After all it's
about scalability: the larger the rb tree gets, the more
rb_first_cached() wins.

thanks,
-liubo