Hi! Since upgrading from 2.6.35+bits to 2.6.38, and more recently to 3.0, our "big btrfs backup box" with 20 * 3 TB AoE-attached btrfs volumes has been showing more CPU usage, and backups were no longer completing within a day. I tried Linus HEAD from yesterday merged with btrfs for-linus (the same as Linus HEAD as of today), and things are better again, but "perf top" output still looks pretty interesting after a night of rsync running:
 samples  pcnt function                           DSO
 _______ _____ __________________________________ ______________
13537.00 59.2% rb_next                            [kernel]
 3539.00 15.5% _raw_spin_lock                     [kernel]
 1668.00  7.3% setup_cluster_no_bitmap            [kernel]
  799.00  3.5% tree_search_offset                 [kernel]
  476.00  2.1% fill_window                        [kernel]
  370.00  1.6% find_free_extent                   [kernel]
  238.00  1.0% longest_match                      [kernel]
  128.00  0.6% build_tree                         [kernel]
   95.00  0.4% pqdownheap                         [kernel]
   79.00  0.3% chksum_update                      [kernel]
   72.00  0.3% btrfs_find_space_cluster           [kernel]
   65.00  0.3% deflate_fast                       [kernel]
   61.00  0.3% memcpy                             [kernel]

With call-graphs enabled:

-  50.24%  btrfs-transacti  [kernel.kallsyms]  [k] rb_next
   - rb_next
      - 97.36% setup_cluster_no_bitmap
           btrfs_find_space_cluster
           find_free_extent
           btrfs_reserve_extent
           btrfs_alloc_free_block
           __btrfs_cow_block
         + btrfs_cow_block
      - 2.29% btrfs_find_space_cluster
           find_free_extent
           btrfs_reserve_extent
           btrfs_alloc_free_block
           __btrfs_cow_block
           btrfs_cow_block
         - btrfs_search_slot
            - 56.96% lookup_inline_extent_backref
               - 97.23% __btrfs_free_extent
                    run_clustered_refs
                  - btrfs_run_delayed_refs
                     - 91.23% btrfs_commit_transaction
                          transaction_kthread
                          kthread
                          kernel_thread_helper
                     - 8.77% btrfs_write_dirty_block_groups
                          commit_cowonly_roots
                          btrfs_commit_transaction
                          transaction_kthread
                          kthread
                          kernel_thread_helper
               - 2.77% insert_inline_extent_backref
                    __btrfs_inc_extent_ref
                    run_clustered_refs
                    btrfs_run_delayed_refs
                    btrfs_commit_transaction
                    transaction_kthread
                    kthread
                    kernel_thread_helper
            - 41.03% btrfs_insert_empty_items
               - 99.89% run_clustered_refs
                  - btrfs_run_delayed_refs
                     + 89.93% btrfs_commit_transaction
                     + 10.07% btrfs_write_dirty_block_groups
            + 1.87% btrfs_write_dirty_block_groups
-   7.41%  btrfs-transacti  [kernel.kallsyms]  [k] setup_cluster_no_bitmap
   + setup_cluster_no_bitmap
+   4.34%  rsync            [kernel.kallsyms]  [k] _raw_spin_lock
+   3.68%  rsync            [kernel.kallsyms]  [k] rb_next
+   3.09%  btrfs-transacti  [kernel.kallsyms]  [k] tree_search_offset
+   1.40%  btrfs-delalloc-  [kernel.kallsyms]  [k] fill_window
+   1.31%  btrfs-transacti  [kernel.kallsyms]  [k] _raw_spin_lock
+   1.19%  btrfs-delalloc-  [kernel.kallsyms]  [k] longest_match
+   1.18%  btrfs-delalloc-  [kernel.kallsyms]  [k] deflate_fast
+   1.09%  btrfs-transacti  [kernel.kallsyms]  [k] find_free_extent
+   0.90%  btrfs-delalloc-  [kernel.kallsyms]  [k] pqdownheap
+   0.67%  btrfs-delalloc-  [kernel.kallsyms]  [k] compress_block
+   0.66%  btrfs-delalloc-  [kernel.kallsyms]  [k] build_tree
+   0.61%  rsync            [kernel.kallsyms]  [k] page_fault

rb_next() called from setup_cluster_no_bitmap() is very hot. From the annotated assembly output, it looks like the "while (window_free <= min_bytes)" loop is where the CPU is spending most of its time.

A few thoughts:

- Shouldn't (window_free <= min_bytes) be (window_free < min_bytes)?

- I'm not really up to speed on SMP memory caching behaviour, but I suspect the constant list creation of bitmap entries from the shared free_space_cache objects might be bouncing these pages between CPUs, which would explain why instructions that dereference the object pointers almost always seem to be cache misses. Or there's simply too much of this stuff in memory for it to fit in cache.
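To make that concrete, here is roughly the shape of the loop I mean, condensed and paraphrased from my reading of fs/btrfs/free-space-cache.c in 3.0 -- not an exact quote, just the structure:

	/* Paraphrased sketch of the hot loop in setup_cluster_no_bitmap().
	 * 'entry' walks the per-block-group free space rb-tree by offset. */
	while (window_free <= min_bytes) {	/* shouldn't this be '<'? */
		/* each iteration chases another rb_node pointer; in the
		 * annotated assembly this load is nearly always a miss */
		node = rb_next(&entry->offset_index);
		if (!node)
			return -ENOSPC;
		entry = rb_entry(node, struct btrfs_free_space,
				 offset_index);

		/* bitmap entries get appended to the shared list and
		 * skipped -- the constant list creation mentioned above */
		if (entry->bitmap) {
			if (list_empty(&entry->list))
				list_add_tail(&entry->list, bitmaps);
			continue;
		}

		/* the gap checks: if the free space is too spread out,
		 * reset the window and start over at this entry */
		if (entry->offset - (prev->offset + prev->bytes) > max_gap ||
		    entry->offset - window_start > (min_bytes * 2)) {
			window_start = entry->offset;
			window_free = entry->bytes;
		} else {
			window_free += entry->bytes;
		}
		prev = entry;
	}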
Top of slabtop -sc:

   OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
1760061 1706286  96%    0.97K  53351       33   1707232K nfs_inode_cache
1623423 1617242  99%    0.95K  49279       33   1576928K btrfs_inode_cache
 788998  676959  85%    0.55K  28204       28    451264K radix_tree_node
1379889 1344544  97%    0.19K  65709       21    262836K dentry
1399100 1248587  89%    0.16K  55964       25    223856K extent_buffers
1077876 1007921  93%    0.11K  29941       36    119764K journal_head

This is all per-block-group, but I don't know how many block groups the thing keeps looking at. There are 20 mounted volumes, as I mentioned, 16 GB of memory, and 4 apparent cores (dual HT Xeon).

The comparisons against max_gap and "entry->offset - window_start > (min_bytes * 2)" (both visible in the sketch above) are also hot parts of the loop, but they are minor compared to the initial dereference within rb_next(), which pretty much always looks like a cache miss. Not walking so far would seem worthwhile, if possible.

So, is all of the gap-avoidance logic really necessary? I presume it's there to avoid fragmentation. Would it make sense to keep some kind of pointer around to the last useful offset, or something along those lines? (e.g. make the block group checking circular instead of walking the whole thing each time.) I'm just stabbing in the dark without more counters to see what's really going on here.

I see Josef's 86d4a77ba3dc4ace238a0556541a41df2bd71d49 introduced the bitmaps list. I could try temporarily reverting it (some fixups needed) if anybody thinks my cache-bouncing theory is at all plausible.

Cheers!

Simon-
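P.S. To make the "circular" idea slightly less hand-wavy, here is the kind of thing I'm imagining -- completely untested, and note that ->cluster_hint is an invented field that does not exist in the real struct btrfs_free_space_ctl:

	/* Hypothetical helper: resume the cluster scan from wherever the
	 * last successful search left off, wrapping back to the caller's
	 * offset when the hint runs off the end of the tree. */
	static struct btrfs_free_space *
	cluster_search_start(struct btrfs_free_space_ctl *ctl, u64 offset)
	{
		struct btrfs_free_space *entry;

		/* try the remembered position first */
		entry = tree_search_offset(ctl, ctl->cluster_hint, 0, 1);
		if (entry)
			return entry;

		/* nothing at or after the hint -- wrap around */
		return tree_search_offset(ctl, offset, 0, 1);
	}

On success, setup_cluster_no_bitmap() would store the end of the window back into ctl->cluster_hint, so the next search doesn't re-walk the same entries from the start of the block group.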