From: Omar Sandoval <osan...@fb.com>

Hi,
At Facebook, we are still running into the issue of long commit stalls on large filesystems (say, 10s of TBs). 1bbc621ef284 ("Btrfs: allow block group cache writeout outside critical section in commit") was a stopgap, but it wasn't enough, as it's still possible for a very busy filesystem to have a lot of block groups dirtied between the time we do the initial free space cache writeout and redo it in the critical section.

Like Chris mentioned at LPC, I've been working on another solution to put these issues behind us. The solution we came up with is to track free space in a separate B-tree and update it in tandem with the extent tree. Using a B-tree rather than another ad-hoc mechanism has the advantage of being well-understood and giving us the proper metadata allocation profile by default, even if it could be slightly less efficient than something else purpose-built.

In any case, the scalability win is clear. All of the tests below were run on a Fusion-io card. My stress testing workload, fallocating 50 TB worth of space on a 100 TB sparse filesystem image and then freeing it all, could cause stalls in the critical section for tens of seconds writing out the free space cache. Using no free space cache or the free space tree, commits spend only about a tenth of a second in the critical section.

The time to load the free space tree is still reasonable as well. To test this, I created extremely fragmented block groups and then ran a workload that dirtied every inode in the filesystem, measuring how long we spent loading free space. The free space cache comes out on top, costing only ~30 ms. Using no cache is much worse, costing about 3-5 seconds. The free space tree is in between, taking 100-500 ms total (keep in mind that this is for the whole test lasting several minutes, not just for one block group). A lot of this overhead is actually manipulating the in-memory free space structures, so there's room for improvement in the future.
Finally, we keep the disk usage under control by switching to a bitmap format when it becomes more efficient than using extents. Using 256-byte bitmaps and 4096-byte blocks, a 1 GB block group in the worst case requires 1 GB / (256 * 8 * 4096 B) = 128 bitmaps, at 256 + sizeof(btrfs_item) = 256 + 25 = 281 bytes per bitmap, for a grand total of ~35 KB of overhead per block group, comparable to the free space cache. This incurs the cost of converting between the two formats while running delayed refs, but I found that this takes <5 ms on my device.

There are a couple of things that I wouldn't mind some comments on.

Firstly, when doing the conversion between the extent and bitmap formats, we vmalloc a chunk of memory to buffer the free space in. For a 1 GB block group, this is 32 KB of memory. This happens during a transaction commit, so I'm worried about whether this will cause problems in low-memory situations. I chose to do this instead of handling the extent or bitmap items one by one because this is the simplest way to guarantee that the free space tree does not become larger at any point during the conversion (for example, imagine that our metadata space is almost full and we try to process a bitmap that becomes several extents). I'm wondering if anyone has any better ideas about how to handle this.

Secondly, I *think* that using the commit root in load_free_space_tree(), as is done in caching_thread(), is correct, but I'm not 100% sure.

This series is on top of v4.2. I've run it through xfstests and some manual stress tests as well. I'm sending it from my personal email address because soon I'll be back to finish up school, but I'll still be looking at any comments that anyone might have.

Thanks!
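
For anyone who wants to check the worst-case bitmap overhead and the conversion buffer size quoted above, here is a minimal userspace sketch of the arithmetic. It is not code from the series: the macro and function names are made up for illustration, and the only inputs are the figures stated in this letter (256-byte bitmaps, 4096-byte blocks, sizeof(btrfs_item) = 25, one bit per block in the conversion buffer).

```c
#include <stdint.h>

/* Figures quoted in the cover letter; the names are hypothetical. */
#define FST_BITMAP_BYTES  256ULL  /* bitmap payload per item */
#define FST_BLOCK_BYTES   4096ULL /* filesystem block size */
#define FST_ITEM_HEADER   25ULL   /* sizeof(btrfs_item) as quoted above */

/* Bytes of block group space described by one bitmap item (one bit per block). */
static inline uint64_t fst_bytes_per_bitmap(void)
{
	return FST_BITMAP_BYTES * 8 * FST_BLOCK_BYTES;
}

/* Worst case: the whole block group is described by bitmap items. */
static inline uint64_t fst_worst_case_bitmaps(uint64_t block_group_bytes)
{
	uint64_t span = fst_bytes_per_bitmap();

	return (block_group_bytes + span - 1) / span; /* round up */
}

/* On-disk bytes for those bitmaps, counting the item header for each one. */
static inline uint64_t fst_worst_case_overhead(uint64_t block_group_bytes)
{
	return fst_worst_case_bitmaps(block_group_bytes) *
	       (FST_BITMAP_BYTES + FST_ITEM_HEADER);
}

/* Size of a one-bit-per-block buffer covering the whole block group. */
static inline uint64_t fst_conversion_buffer_bytes(uint64_t block_group_bytes)
{
	return block_group_bytes / FST_BLOCK_BYTES / 8;
}
```

For a 1 GB (1ULL << 30 bytes) block group these helpers give 128 bitmaps, 35968 bytes (~35 KB) of worst-case overhead, and a 32768-byte (32 KB) buffer, matching the numbers above.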
Omar Sandoval (6):
  Btrfs: add extent buffer bitmap operations
  Btrfs: add helpers for read-only compat bits
  Btrfs: introduce the free space B-tree on-disk format
  Btrfs: implement the free space B-tree
  Btrfs: wire up the free space tree to the extent tree
  Btrfs: add free space tree mount option

 fs/btrfs/Makefile            |    2 +-
 fs/btrfs/ctree.h             |  104 ++-
 fs/btrfs/disk-io.c           |   26 +
 fs/btrfs/extent-tree.c       |   88 ++-
 fs/btrfs/extent_io.c         |  101 +++
 fs/btrfs/extent_io.h         |    6 +
 fs/btrfs/free-space-tree.c   | 1468 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/free-space-tree.h   |   39 ++
 fs/btrfs/super.c             |   21 +-
 include/trace/events/btrfs.h |    3 +-
 10 files changed, 1843 insertions(+), 15 deletions(-)
 create mode 100644 fs/btrfs/free-space-tree.c
 create mode 100644 fs/btrfs/free-space-tree.h

--
2.5.1