[PATCH v2] btrfs: use tagged writepage to mitigate livelock of snapshot
Snapshot is expected to be fast. But if there are writers steadily
creating dirty pages in our subvolume, the snapshot may take a very long
time to complete. To fix the problem, we use tagged writepage for the
snapshot flusher as we do in the generic write_cache_pages(): we quickly
tag all dirty pages with a TOWRITE tag, then do the hard work of
writepage only on pages with the TOWRITE tag, so we omit pages dirtied
after the snapshot command.

We do a simple snapshot speed test on an Intel D-1531 box:

fio --ioengine=libaio --iodepth=32 --bs=4k --rw=write --size=64G
--direct=0 --thread=1 --numjobs=1 --time_based --runtime=120
--filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio

original: 1m58sec
patched:  6.54sec

This is the best case for this patch since, for a sequential write case,
we omit nearly all pages dirtied after the snapshot command.

For a multi-writer, random write test:

fio --ioengine=libaio --iodepth=32 --bs=4k --rw=randwrite --size=64G
--direct=0 --thread=1 --numjobs=4 --time_based --runtime=120
--filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio

original: 15.83sec
patched:  10.35sec

The improvement is smaller than in the sequential write case, since we
omit only about half of the pages dirtied after the snapshot command.

Signed-off-by: Ethan Lien
---
V2:
	Add more details in commit message.
	Rename BTRFS_INODE_TAGGED_FLUSH to BTRFS_INODE_SNAPSHOT_FLUSH.
	Remove unnecessary sync_mode check.
	Make start_delalloc_inodes() take a boolean argument.

 fs/btrfs/btrfs_inode.h |  1 +
 fs/btrfs/ctree.h       |  2 +-
 fs/btrfs/extent_io.c   | 14 ++++++++++++--
 fs/btrfs/inode.c       | 10 ++++++----
 fs/btrfs/ioctl.c       |  2 +-
 5 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 1343ac57b438..ffc9a1c77375 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -29,6 +29,7 @@ enum {
 	BTRFS_INODE_IN_DELALLOC_LIST,
 	BTRFS_INODE_READDIO_NEED_LOCK,
 	BTRFS_INODE_HAS_PROPS,
+	BTRFS_INODE_SNAPSHOT_FLUSH,
 };
 
 /* in memory btrfs inode */
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2cddfe7806a4..82682da5a40d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3155,7 +3155,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
 			       struct inode *inode, u64 new_size,
 			       u32 min_type);
 
-int btrfs_start_delalloc_inodes(struct btrfs_root *root);
+int btrfs_start_delalloc_snapshot(struct btrfs_root *root);
 int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, int nr);
 int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
 			      unsigned int extra_bits,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4dd6faab02bb..93f2e413535d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3928,12 +3928,22 @@ static int extent_write_cache_pages(struct address_space *mapping,
 			range_whole = 1;
 		scanned = 1;
 	}
-	if (wbc->sync_mode == WB_SYNC_ALL)
+
+	/*
+	 * We do the tagged writepage as long as the snapshot flush bit is set
+	 * and we are the first one who do the filemap_flush() on this inode.
+	 */
+	if (range_whole && wbc->nr_to_write == LONG_MAX &&
+	    test_and_clear_bit(BTRFS_INODE_SNAPSHOT_FLUSH,
+			       &BTRFS_I(inode)->runtime_flags))
+		wbc->tagged_writepages = 1;
+
+	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
 		tag = PAGECACHE_TAG_TOWRITE;
 	else
 		tag = PAGECACHE_TAG_DIRTY;
 retry:
-	if (wbc->sync_mode == WB_SYNC_ALL)
+	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
 		tag_pages_for_writeback(mapping, index, end);
 	done_index = index;
 	while (!done && !nr_to_write_done && (index <= end) &&
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3ea5339603cf..593445d122ed 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9975,7 +9975,7 @@ static struct btrfs_delalloc_work *btrfs_alloc_delalloc_work(struct inode *inode
  * some fairly slow code that needs optimization. This walks the list
  * of all the inodes with pending delalloc and forces them to disk.
  */
-static int start_delalloc_inodes(struct btrfs_root *root, int nr)
+static int start_delalloc_inodes(struct btrfs_root *root, int nr, bool snapshot)
 {
 	struct btrfs_inode *binode;
 	struct inode *inode;
@@ -10003,6 +10003,8 @@ static int start_delalloc_inodes(struct btrfs_root *root, int nr)
 		}
 		spin_unlock(&root->delalloc_lock);
 
+
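The trick borrowed from write_cache_pages() is easy to see in isolation:
first tag the pages that are dirty right now, then write back only the
tagged ones, so pages dirtied while we are flushing are left for a later
pass instead of extending the current one forever. Below is a minimal
userspace model of that idea; array slots stand in for pages and for the
radix-tree TOWRITE tag, the names are made up, and none of this is kernel
code.

/*
 * Userspace model of the tag-then-write pattern used by tagged_writepages.
 * "Pages" are array slots; a writer keeps dirtying new slots while the
 * flusher runs, yet the flush still terminates.
 */
#include <stdbool.h>
#include <stdio.h>

#define NPAGES 64

struct fake_page { bool dirty; bool towrite; };
static struct fake_page pages[NPAGES];
static int next_dirty;			/* simulated concurrent writer */

static void writer_dirties_one_more(void)
{
	if (next_dirty < NPAGES)
		pages[next_dirty++].dirty = true;
}

/* Snapshot-style flush: only pages dirty at the start get written. */
static int flush_tagged(void)
{
	int written = 0;

	for (int i = 0; i < NPAGES; i++)	/* cheap pass: tag only */
		if (pages[i].dirty)
			pages[i].towrite = true;

	for (int i = 0; i < NPAGES; i++) {
		if (!pages[i].towrite)
			continue;		/* dirtied after the tag pass */
		pages[i].dirty = pages[i].towrite = false;
		written++;
		writer_dirties_one_more();	/* writer races with us */
	}
	return written;
}

int main(void)
{
	for (int i = 0; i < 8; i++)		/* initial dirty set */
		writer_dirties_one_more();
	printf("flushed %d pages, %d left dirty for later writeback\n",
	       flush_tagged(), next_dirty - 8);
	return 0;
}

Flushing the plain DIRTY set instead would have to chase the pages the
writer keeps adding, which is exactly the livelock the patch avoids.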
[PATCH] btrfs: use tagged writepage to mitigate livelock of snapshot
Snapshot is expected to be fast. But if there are writers steadily
creating dirty pages in our subvolume, the snapshot may take a very long
time to complete. To fix the problem, we use tagged writepage for the
snapshot flusher as we do in the generic write_cache_pages(), so we can
omit pages dirtied after the snapshot command.

We do a simple snapshot speed test on an Intel D-1531 box:

fio --ioengine=libaio --iodepth=32 --bs=4k --rw=write --size=64G
--direct=0 --thread=1 --numjobs=1 --time_based --runtime=120
--filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio

original: 1m58sec
patched:  6.54sec

This is the best case for this patch since, for a sequential write case,
we omit nearly all pages dirtied after the snapshot command.

For a multi-writer, random write test:

fio --ioengine=libaio --iodepth=32 --bs=4k --rw=randwrite --size=64G
--direct=0 --thread=1 --numjobs=4 --time_based --runtime=120
--filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio

original: 15.83sec
patched:  10.35sec

The improvement is smaller than in the sequential write case, since we
omit only about half of the pages dirtied after the snapshot command.

Signed-off-by: Ethan Lien
---
 fs/btrfs/btrfs_inode.h |  1 +
 fs/btrfs/ctree.h       |  2 +-
 fs/btrfs/extent_io.c   | 16 ++++++++++++++--
 fs/btrfs/inode.c       | 10 ++++++----
 fs/btrfs/ioctl.c       |  2 +-
 5 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 1343ac57b438..4182bfbb56be 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -29,6 +29,7 @@ enum {
 	BTRFS_INODE_IN_DELALLOC_LIST,
 	BTRFS_INODE_READDIO_NEED_LOCK,
 	BTRFS_INODE_HAS_PROPS,
+	BTRFS_INODE_TAGGED_FLUSH,
 };
 
 /* in memory btrfs inode */
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2cddfe7806a4..82682da5a40d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3155,7 +3155,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
 			       struct inode *inode, u64 new_size,
 			       u32 min_type);
 
-int btrfs_start_delalloc_inodes(struct btrfs_root *root);
+int btrfs_start_delalloc_snapshot(struct btrfs_root *root);
 int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, int nr);
 int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
 			      unsigned int extra_bits,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4dd6faab02bb..c21d8a0e010a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3928,12 +3928,24 @@ static int extent_write_cache_pages(struct address_space *mapping,
 			range_whole = 1;
 		scanned = 1;
 	}
-	if (wbc->sync_mode == WB_SYNC_ALL)
+
+	/*
+	 * We don't care if we are the one who set BTRFS_INODE_TAGGED_FLUSH in
+	 * start_delalloc_inodes(). We do the tagged writepage as long as we are
+	 * the first one who do the filemap_flush() on this inode.
+	 */
+	if (range_whole && wbc->nr_to_write == LONG_MAX &&
+	    wbc->sync_mode == WB_SYNC_NONE &&
+	    test_and_clear_bit(BTRFS_INODE_TAGGED_FLUSH,
+			       &BTRFS_I(inode)->runtime_flags))
+		wbc->tagged_writepages = 1;
+
+	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
 		tag = PAGECACHE_TAG_TOWRITE;
 	else
 		tag = PAGECACHE_TAG_DIRTY;
 retry:
-	if (wbc->sync_mode == WB_SYNC_ALL)
+	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
 		tag_pages_for_writeback(mapping, index, end);
 	done_index = index;
 	while (!done && !nr_to_write_done && (index <= end) &&
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3ea5339603cf..3df3cbbe91c5 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9975,7 +9975,7 @@ static struct btrfs_delalloc_work *btrfs_alloc_delalloc_work(struct inode *inode
  * some fairly slow code that needs optimization. This walks the list
  * of all the inodes with pending delalloc and forces them to disk.
  */
-static int start_delalloc_inodes(struct btrfs_root *root, int nr)
+static int start_delalloc_inodes(struct btrfs_root *root, int nr, int snapshot)
 {
 	struct btrfs_inode *binode;
 	struct inode *inode;
@@ -10003,6 +10003,8 @@ static int start_delalloc_inodes(struct btrfs_root *root, int nr)
 		}
 		spin_unlock(&root->delalloc_lock);
 
+		if (snapshot)
+			set_bit(BTRFS_INODE_TAGGED_FLUSH, &binode->runtime_flags);
 		work = btrfs_alloc_delalloc_work(inode);
 		if (!work) {
[PATCH v3] btrfs: use customized batch size for total_bytes_pinned
In commit b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly
pinned bytes") we use total_bytes_pinned to track how many bytes we are
going to free in this transaction. When we are close to ENOSPC, we check
it and know whether we can make the allocation by committing the current
transaction. For every data/metadata extent we are going to free, we add
to total_bytes_pinned in btrfs_free_extent() and btrfs_free_tree_block(),
and release it in unpin_extent_range() when we finish the transaction. So
this is a variable we frequently update but rarely read - just the right
use case for a percpu_counter.

But in that commit we update total_bytes_pinned with the default batch
size of 32, making every update essentially a spin-lock-protected update.
Since every spin lock/unlock operation involves syncing a globally used
variable and some kind of barrier in an SMP system, this is more
expensive than using total_bytes_pinned as a simple atomic64_t.

So fix this by using a customized batch size. Since we only read
total_bytes_pinned when we are close to ENOSPC and fail to allocate a new
chunk, we can use a really large batch size and have nearly no penalty in
most cases.

[Test]
We test the patch on a 4-core x86 machine:
1. fallocate a 16GiB test file.
2. Take a snapshot (so all following writes will be cow writes).
3. Run a 180 sec, 4 jobs, 4K random write fio on the test file.

We also add a temporary lockdep class on the percpu_counter's spin lock
used by total_bytes_pinned to track lock_stat.

[Results]
unpatched:
lock_stat version 0.4
------------------------------------------------------------------
class name:      total_bytes_pinned_percpu
con-bounces:     82            contentions:    82
waittime-min:    0.21          waittime-max:   0.61
waittime-total:  29.46         waittime-avg:   0.36
acq-bounces:     298340        acquisitions:   635973
holdtime-min:    0.09          holdtime-max:   11.01
holdtime-total:  173476.25     holdtime-avg:   0.27

patched:
lock_stat version 0.4
------------------------------------------------------------------
class name:      total_bytes_pinned_percpu
con-bounces:     1             contentions:    1
waittime-min:    0.62          waittime-max:   0.62
waittime-total:  0.62          waittime-avg:   0.62
acq-bounces:     13601         acquisitions:   31542
holdtime-min:    0.14          holdtime-max:   9.61
holdtime-total:  11016.90      holdtime-avg:   0.35

[Analysis]
Since the spin lock only protects a single in-memory variable, the
contentions (the number of lock acquisitions that had to wait) are low in
both the unpatched and patched versions. But looking at acquisitions and
acq-bounces, we get much lower counts in the patched version. Here the
most important metric is acq-bounces: it counts how many times the lock
gets transferred between different cpus, so the patch really reduces
cacheline bouncing of the spin lock (and of the global counter of the
percpu_counter) in an SMP system.

Fixes: b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly pinned bytes")
Signed-off-by: Ethan Lien
---
V2:
	Rewrite commit comments.
	Add lock_stat test.
	Pull dirty_metadata_bytes out to a separate patch.
V3:
	Declare batch size as a constant.

 fs/btrfs/extent-tree.c | 47 ++++++++++++++++++++++++++++-------------
 1 file changed, 32 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3d9fe58c0080..3788e87c174e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -51,6 +51,14 @@ enum {
 	CHUNK_ALLOC_FORCE = 2,
 };
 
+/*
+ * Use large batch size to reduce update overhead.
+ * For reader side, we only read it when we are close to ENOSPC and
+ * the read overhead is mostly related to # of cpus, so it is OK to
+ * use arbitrary large value here.
+ */
+#define TOTAL_BYTES_PINNED_BATCH SZ_128M
+
 static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 			       struct btrfs_fs_info *fs_info,
 			       struct btrfs_delayed_ref_node *node, u64 parent,
@@ -758,7 +766,8 @@ static void add_pinned_bytes(struct btrfs_fs_info *fs_info, s64 num_bytes,
 
 	space_info = __find_space_info(fs_info, flags);
 	ASSERT(space_info);
-	percpu_counter_add(&space_info->total_bytes_pinned, num_bytes);
+	percpu_counter_add_batch(&space_info->total_bytes_pinned, num_bytes,
+		    TOTAL_BYTES_PINNED_BATCH);
 }
 
 /*
@@ -2598,8 +2607,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
 			flags = BTRFS_BLOCK_GROUP_METADATA;
 		space_info = __find_space_info(fs_info, flags);
 		ASSERT(space_info);
-		percpu_counter_add(&space_info->total_bytes_pinned,
-				   -head->num
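For reference, the behaviour the patch relies on in
percpu_counter_add_batch() can be modelled in a few lines of userspace C:
each CPU accumulates a private delta and only folds it into the shared,
lock-protected count once the delta reaches the batch. With 16KiB-sized
updates, a batch of 32 takes the lock on every update while a huge batch
almost never does. This is a simplified model with made-up names, not the
kernel implementation.

/*
 * Userspace model of percpu_counter_add_batch(): per-CPU deltas are folded
 * into the shared count (the "locked" slow path) only when they reach the
 * batch size.
 */
#include <stdio.h>
#include <stdlib.h>

struct pcpu_counter {
	long long count;		/* shared, lock-protected in the kernel */
	long long *pcpu;		/* one slot per CPU */
	long long lock_acquisitions;
};

static void counter_add_batch(struct pcpu_counter *c, int cpu,
			      long long amount, long long batch)
{
	long long v = c->pcpu[cpu] + amount;

	if (v >= batch || v <= -batch) {	/* slow path: fold into global */
		c->count += v;
		c->pcpu[cpu] = 0;
		c->lock_acquisitions++;
	} else {				/* fast path: stays per-CPU */
		c->pcpu[cpu] = v;
	}
}

int main(void)
{
	enum { CPUS = 4, UPDATES = 100000, LEAF = 16 * 1024 };
	long long batches[] = { 32, 128 * 1024 * 1024 /* SZ_128M */ };

	for (int b = 0; b < 2; b++) {
		struct pcpu_counter c = { .pcpu = calloc(CPUS, sizeof(long long)) };

		for (int i = 0; i < UPDATES; i++)
			counter_add_batch(&c, i % CPUS, LEAF, batches[b]);
		printf("batch %-9lld -> %lld slow-path (locked) updates\n",
		       batches[b], c.lock_acquisitions);
		free(c.pcpu);
	}
	return 0;
}

With the default batch of 32 every one of the 100000 updates hits the
locked path; with the 128M batch only a handful do, which mirrors the
drop in acquisitions and acq-bounces shown in the lock_stat results above.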
[PATCH v2] btrfs: use customized batch size for total_bytes_pinned
In commit b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly
pinned bytes") we use total_bytes_pinned to track how many bytes we are
going to free in this transaction. When we are close to ENOSPC, we check
it and know whether we can make the allocation by committing the current
transaction. For every data/metadata extent we are going to free, we add
to total_bytes_pinned in btrfs_free_extent() and btrfs_free_tree_block(),
and release it in unpin_extent_range() when we finish the transaction. So
this is a variable we frequently update but rarely read - just the right
use case for a percpu_counter.

But in that commit we update total_bytes_pinned with the default batch
size of 32, making every update essentially a spin-lock-protected update.
Since every spin lock/unlock operation involves syncing a globally used
variable and some kind of barrier in an SMP system, this is more
expensive than using total_bytes_pinned as a simple atomic64_t.

So fix this by using a customized batch size. Since we only read
total_bytes_pinned when we are close to ENOSPC and fail to allocate a new
chunk, we can use a really large batch size and have nearly no penalty in
most cases.

[Test]
We test the patch on a 4-core x86 machine:
1. fallocate a 16GiB test file.
2. Take a snapshot (so all following writes will be cow writes).
3. Run a 180 sec, 4 jobs, 4K random write fio on the test file.

We also add a temporary lockdep class on the percpu_counter's spin lock
used by total_bytes_pinned to track lock_stat.

[Results]
unpatched:
lock_stat version 0.4
------------------------------------------------------------------
class name:      total_bytes_pinned_percpu
con-bounces:     82            contentions:    82
waittime-min:    0.21          waittime-max:   0.61
waittime-total:  29.46         waittime-avg:   0.36
acq-bounces:     298340        acquisitions:   635973
holdtime-min:    0.09          holdtime-max:   11.01
holdtime-total:  173476.25     holdtime-avg:   0.27

patched:
lock_stat version 0.4
------------------------------------------------------------------
class name:      total_bytes_pinned_percpu
con-bounces:     1             contentions:    1
waittime-min:    0.62          waittime-max:   0.62
waittime-total:  0.62          waittime-avg:   0.62
acq-bounces:     13601         acquisitions:   31542
holdtime-min:    0.14          holdtime-max:   9.61
holdtime-total:  11016.90      holdtime-avg:   0.35

[Analysis]
Since the spin lock only protects a single in-memory variable, the
contentions (the number of lock acquisitions that had to wait) are low in
both the unpatched and patched versions. But looking at acquisitions and
acq-bounces, we get much lower counts in the patched version. Here the
most important metric is acq-bounces: it counts how many times the lock
gets transferred between different cpus, so the patch really reduces
cacheline bouncing of the spin lock (and of the global counter of the
percpu_counter) in an SMP system.

Fixes: b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly pinned bytes")
Signed-off-by: Ethan Lien
---
V2:
	Rewrite commit comments.
	Add lock_stat test.
	Pull dirty_metadata_bytes out to a separate patch.

 fs/btrfs/ctree.h       |  1 +
 fs/btrfs/extent-tree.c | 46 +++++++++++++++++++++++++++--------------
 2 files changed, 32 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 118346aceea9..df682a521635 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -422,6 +422,7 @@ struct btrfs_space_info {
 	 * time the transaction commits.
 	 */
 	struct percpu_counter total_bytes_pinned;
+	s32 total_bytes_pinned_batch;
 
 	struct list_head list;
 	/* Protected by the spinlock 'lock'. */
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3d9fe58c0080..937113534ef4 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -758,7 +758,8 @@ static void add_pinned_bytes(struct btrfs_fs_info *fs_info, s64 num_bytes,
 
 	space_info = __find_space_info(fs_info, flags);
 	ASSERT(space_info);
-	percpu_counter_add(&space_info->total_bytes_pinned, num_bytes);
+	percpu_counter_add_batch(&space_info->total_bytes_pinned, num_bytes,
+				 space_info->total_bytes_pinned_batch);
 }
 
 /*
@@ -2598,8 +2599,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
 			flags = BTRFS_BLOCK_GROUP_METADATA;
 		space_info = __find_space_info(fs_info, flags);
 		ASSERT(space_info);
-		percpu_counter_add(&space_info->total_bytes_pinned,
-				   -head->num_bytes);
+		percpu_counter_add_batch(&space_info->total_bytes_pinned,
+
[PATCH] btrfs: use correct compare function of dirty_metadata_bytes
We use a customized, nodesize-based batch value to update
dirty_metadata_bytes. We should also use the batch version of the compare
function, or we will easily hit the fast path and get a false result from
percpu_counter_compare().

Signed-off-by: Ethan Lien
---
 fs/btrfs/disk-io.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 205092dc9390..dfed08e70ec1 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -961,8 +961,9 @@ static int btree_writepages(struct address_space *mapping,
 
 		fs_info = BTRFS_I(mapping->host)->root->fs_info;
 		/* this is a bit racy, but that's ok */
-		ret = percpu_counter_compare(&fs_info->dirty_metadata_bytes,
-					     BTRFS_DIRTY_METADATA_THRESH);
+		ret = __percpu_counter_compare(&fs_info->dirty_metadata_bytes,
+					       BTRFS_DIRTY_METADATA_THRESH,
+					       fs_info->dirty_metadata_batch);
 		if (ret < 0)
 			return 0;
 	}
@@ -4150,8 +4151,9 @@ static void __btrfs_btree_balance_dirty(struct btrfs_fs_info *fs_info,
 	if (flush_delayed)
 		btrfs_balance_delayed_items(fs_info);
 
-	ret = percpu_counter_compare(&fs_info->dirty_metadata_bytes,
-				     BTRFS_DIRTY_METADATA_THRESH);
+	ret = __percpu_counter_compare(&fs_info->dirty_metadata_bytes,
+				       BTRFS_DIRTY_METADATA_THRESH,
+				       fs_info->dirty_metadata_batch);
 	if (ret > 0) {
 		balance_dirty_pages_ratelimited(fs_info->btree_inode->i_mapping);
 	}
-- 
2.17.1
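Why the batch matters for the compare as well: as far as I understand
__percpu_counter_compare(), its fast path trusts the approximate global
count whenever it is further from the threshold than
batch * num_online_cpus(), and only falls back to an exact sum otherwise.
If updates use a large custom batch but the compare assumes the default
batch of 32, that error bound is far too small and the fast path can
report the wrong side of the threshold. A small userspace illustration
with made-up numbers (only the inequality matters; these are not the
kernel's actual thresholds):

/*
 * Model of the fast/slow path split in __percpu_counter_compare() and how
 * a too-small batch yields a false "under threshold" answer.
 */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	long long thresh  = 32 * 1024 * 1024;	/* dirty-metadata threshold (illustrative) */
	long long approx  = 30 * 1024 * 1024;	/* global count, per-CPU deltas not folded */
	long long hidden  = 4 * 1024 * 1024;	/* bytes still sitting in per-CPU slots */
	int cpus          = 4;
	long long batches[] = { 32, 4 * 1024 * 1024 /* nodesize-scaled batch */ };

	for (int i = 0; i < 2; i++) {
		long long err = batches[i] * cpus;	/* max per-CPU error */

		if (llabs(approx - thresh) > err)
			/* fast path: sign of (approx - thresh) is trusted */
			printf("batch %-8lld: fast path says %s threshold\n",
			       batches[i], approx > thresh ? "over" : "under");
		else
			printf("batch %-8lld: slow path, exact sum = %lld (over: %s)\n",
			       batches[i], approx + hidden,
			       approx + hidden > thresh ? "yes" : "no");
	}
	return 0;
}

With batch 32 the fast path wrongly reports "under threshold" even though
the exact sum is above it; with the batch that was actually used for the
updates, the compare correctly drops to the exact sum.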
[PATCH] btrfs: use customized batch size for total_bytes_pinned
We use percpu_counter to reduce lock contention on a frequently updated
counter. But the default batch size of 32 is suitable only for inc/dec
updates, not for sector-size updates. The end result is that we still
acquire the spin lock for every percpu_counter_add(). So use a customized
batch size for total_bytes_pinned as we have already done for
dirty_metadata_bytes and delalloc_bytes. Also fix dirty_metadata_bytes to
use __percpu_counter_compare() with the customized batch so we get a
precise value of dirty_metadata_bytes.

Signed-off-by: Ethan Lien
---
 fs/btrfs/ctree.h       |  1 +
 fs/btrfs/disk-io.c     | 10 ++++++----
 fs/btrfs/extent-tree.c | 43 +++++++++++++++++++++++++++-------------
 3 files changed, 35 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 118346aceea9..df682a521635 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -422,6 +422,7 @@ struct btrfs_space_info {
 	 * time the transaction commits.
 	 */
 	struct percpu_counter total_bytes_pinned;
+	s32 total_bytes_pinned_batch;
 
 	struct list_head list;
 	/* Protected by the spinlock 'lock'. */
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 205092dc9390..dfed08e70ec1 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -961,8 +961,9 @@ static int btree_writepages(struct address_space *mapping,
 
 		fs_info = BTRFS_I(mapping->host)->root->fs_info;
 		/* this is a bit racy, but that's ok */
-		ret = percpu_counter_compare(&fs_info->dirty_metadata_bytes,
-					     BTRFS_DIRTY_METADATA_THRESH);
+		ret = __percpu_counter_compare(&fs_info->dirty_metadata_bytes,
+					       BTRFS_DIRTY_METADATA_THRESH,
+					       fs_info->dirty_metadata_batch);
 		if (ret < 0)
 			return 0;
 	}
@@ -4150,8 +4151,9 @@ static void __btrfs_btree_balance_dirty(struct btrfs_fs_info *fs_info,
 	if (flush_delayed)
 		btrfs_balance_delayed_items(fs_info);
 
-	ret = percpu_counter_compare(&fs_info->dirty_metadata_bytes,
-				     BTRFS_DIRTY_METADATA_THRESH);
+	ret = __percpu_counter_compare(&fs_info->dirty_metadata_bytes,
+				       BTRFS_DIRTY_METADATA_THRESH,
+				       fs_info->dirty_metadata_batch);
 	if (ret > 0) {
 		balance_dirty_pages_ratelimited(fs_info->btree_inode->i_mapping);
 	}
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3d9fe58c0080..a067c047904c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -758,7 +758,8 @@ static void add_pinned_bytes(struct btrfs_fs_info *fs_info, s64 num_bytes,
 
 	space_info = __find_space_info(fs_info, flags);
 	ASSERT(space_info);
-	percpu_counter_add(&space_info->total_bytes_pinned, num_bytes);
+	percpu_counter_add_batch(&space_info->total_bytes_pinned, num_bytes,
+				 space_info->total_bytes_pinned_batch);
 }
 
 /*
@@ -2598,8 +2599,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
 			flags = BTRFS_BLOCK_GROUP_METADATA;
 		space_info = __find_space_info(fs_info, flags);
 		ASSERT(space_info);
-		percpu_counter_add(&space_info->total_bytes_pinned,
-				   -head->num_bytes);
+		percpu_counter_add_batch(&space_info->total_bytes_pinned,
+					 -head->num_bytes,
+					 space_info->total_bytes_pinned_batch);
 
 		if (head->is_data) {
 			spin_lock(&delayed_refs->lock);
@@ -4035,6 +4037,10 @@ static int create_space_info(struct btrfs_fs_info *info, u64 flags)
 	INIT_LIST_HEAD(&space_info->ro_bgs);
 	INIT_LIST_HEAD(&space_info->tickets);
 	INIT_LIST_HEAD(&space_info->priority_tickets);
+	if (flags & BTRFS_BLOCK_GROUP_DATA)
+		space_info->total_bytes_pinned_batch = info->delalloc_batch;
+	else
+		space_info->total_bytes_pinned_batch = info->dirty_metadata_batch;
 
 	ret = kobject_init_and_add(&space_info->kobj, &space_info_ktype,
 				    info->space_info_kobj, "%s",
@@ -4309,9 +4315,10 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes)
 		 * allocation, and no removed chunk in current transaction,
 		 * don't bother committing the transaction.
 		 */
-		have_pinned_space = percpu_counter_compare(
+		have_pinned_space = __percpu_counter_compare(
 			&data_
[PATCH v2] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
[Problem description and how we fix it]
We should balance dirty metadata pages at the end of
btrfs_finish_ordered_io, since a small, unmergeable random write can
potentially produce dirty metadata which is multiple times larger than
the data itself. For example, a small, unmergeable 4KiB write may
produce:

16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree
16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree
16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree

Although we do call balance dirty pages on the write side, in the
buffered write path most metadata is dirtied only after we reach the
dirty background limit (which so far only counts dirty data pages) and
wake up the flusher thread. If there are many small, unmergeable random
writes spread over a large btree, we'll see a burst of dirty pages that
exceeds the dirty_bytes limit after we wake up the flusher thread - which
is not what we expect. On our machine, it caused an out-of-memory problem
since a page cannot be dropped while it is marked dirty.

One may worry that we could sleep in btrfs_btree_balance_dirty_nodelay(),
but since we run btrfs_finish_ordered_io in a separate worker, it will
not stop the flusher from consuming dirty pages. Also, we use a different
worker for metadata writeback endio, so sleeping in
btrfs_finish_ordered_io helps us throttle the size of dirty metadata
pages.

[Reproduce steps]
To reproduce the problem, we need 4KiB writes randomly spread over a
large btree. On our 2GiB RAM machine:

1) Create 4 subvolumes.
2) Run fio on each subvolume:

   [global]
   direct=0
   rw=randwrite
   ioengine=libaio
   bs=4k
   iodepth=16
   numjobs=1
   group_reporting
   size=128G
   runtime=1800
   norandommap
   time_based
   randrepeat=0

3) Take a snapshot on each subvolume and repeat fio on the existing
   files.
4) Repeat step (3) until we get large btrees. In our case, by observing
   btrfs_root_item->bytes_used, we have 2GiB of metadata in each
   subvolume tree and 12GiB of metadata in the extent tree.
5) Stop all fio, take a snapshot again, and wait until all delayed work
   is completed.
6) Start all fio. A few seconds later we hit OOM when the flusher starts
   to work.

It can be reproduced even when using nocow writes.

Signed-off-by: Ethan Lien
---
V2:
	Replace btrfs_btree_balance_dirty with
	btrfs_btree_balance_dirty_nodelay.
	Add reproduce steps.

 fs/btrfs/inode.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8e604e7071f1..e54547df24ee 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3158,6 +3158,8 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 	/* once for the tree */
 	btrfs_put_ordered_extent(ordered_extent);
 
+	btrfs_btree_balance_dirty_nodelay(fs_info);
+
 	return ret;
 }
-- 
2.17.0
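To put rough numbers on the amplification described above (assuming the
default 16KiB nodesize and counting only the three trees named in the
commit message; real trees share dirty leaves between neighbouring
writes, so this is a worst-case sketch, not a measurement):

/*
 * Worst-case dirty-metadata amplification for one unmergeable 4KiB write:
 * a 16KiB leaf (and possibly a 16KiB node) in the subvolume, checksum and
 * extent trees.
 */
#include <stdio.h>

int main(void)
{
	const int data = 4 * 1024;
	const int nodesize = 16 * 1024;
	const int trees = 3;			/* subvolume, checksum, extent */

	int leaves_only = trees * nodesize;		/* 48 KiB */
	int leaves_and_nodes = trees * 2 * nodesize;	/* 96 KiB */

	printf("4KiB write -> %d..%d KiB dirty metadata (%dx..%dx the data)\n",
	       leaves_only / 1024, leaves_and_nodes / 1024,
	       leaves_only / data, leaves_and_nodes / data);
	return 0;
}

A 12x-24x ratio is why accounting only dirty data pages lets dirty
metadata pile up far past the limit before the flusher is woken.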
[PATCH v2] btrfs: Skip some btrfs_cross_ref_exist() check in nocow path
In the nocow path, we check whether the extent is snapshotted in
btrfs_cross_ref_exist(). We can do a similar check earlier and avoid an
unnecessary search into the extent tree.

A fio test on an Intel D-1531, 16GB RAM, SSD RAID-5 machine as follows:

[global]
group_reporting
time_based
thread=1
ioengine=libaio
bs=4k
iodepth=32
size=64G
runtime=180
numjobs=8
rw=randwrite

[file1]
filename=/mnt/nocow/testfile

IOPS result:    unpatched    patched
1 fio round:      46670       46958
snapshot
2 fio round:      51826       54498
3 fio round:      59767       61289

After the snapshot, the first fio round gets about a 5% performance gain.
As we continually write to the same file, all writes resume to nocow mode
and eventually we have no performance gain.

Signed-off-by: Ethan Lien
---
V2:
	Add comment and performance test.

 fs/btrfs/inode.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d241285a0d2a..177630337108 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1373,6 +1373,13 @@ static noinline int run_delalloc_nocow(struct inode *inode,
 			    btrfs_file_extent_encryption(leaf, fi) ||
 			    btrfs_file_extent_other_encoding(leaf, fi))
 				goto out_check;
+			/*
+			 * We can skip the checking of generation of
+			 * extent item in btrfs_cross_ref_exist().
+			 */
+			if (btrfs_file_extent_generation(leaf, fi) <=
+			    btrfs_root_last_snapshot(&root->root_item))
+				goto out_check;
 			if (extent_type == BTRFS_FILE_EXTENT_REG && !force)
 				goto out_check;
 			if (btrfs_extent_readonly(fs_info, disk_bytenr))
@@ -7368,6 +7375,14 @@ noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
 	    btrfs_file_extent_other_encoding(leaf, fi))
 		goto out;
 
+	/*
+	 * We can skip the checking of generation of
+	 * extent item in btrfs_cross_ref_exist().
+	 */
+	if (btrfs_file_extent_generation(leaf, fi) <=
+	    btrfs_root_last_snapshot(&root->root_item))
+		goto out;
+
 	backref_offset = btrfs_file_extent_offset(leaf, fi);
 
 	if (orig_start) {
-- 
2.17.0
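The early-out the patch adds is simply: an extent whose generation is not
newer than the subvolume's last snapshot must already be referenced by
that snapshot, so it is cross-referenced and we have to COW it - no
backref walk needed. A standalone sketch of that logic follows; the types
and the expensive_backref_walk() stub are stand-ins invented for
illustration, and only the generation comparison mirrors the patch.

/*
 * Standalone model of the early-out placed in front of
 * btrfs_cross_ref_exist() by the patch.
 */
#include <stdbool.h>
#include <stdio.h>

struct file_extent { unsigned long long generation; };
struct root_item   { unsigned long long last_snapshot; };

/* Stands in for the extent-tree search done by btrfs_cross_ref_exist(). */
static bool expensive_backref_walk(const struct file_extent *fe)
{
	printf("  ...walking backrefs for extent (gen %llu)\n", fe->generation);
	return false;		/* pretend no other reference was found */
}

static bool must_cow(const struct file_extent *fe, const struct root_item *ri)
{
	/*
	 * Extent created no later than the last snapshot: it is necessarily
	 * referenced by that snapshot, so it is shared and we can skip the
	 * search entirely.
	 */
	if (fe->generation <= ri->last_snapshot)
		return true;

	return expensive_backref_walk(fe);
}

int main(void)
{
	struct root_item ri = { .last_snapshot = 100 };
	struct file_extent old_extent = { .generation = 90 };
	struct file_extent new_extent = { .generation = 150 };

	printf("old extent: must cow = %d\n", must_cow(&old_extent, &ri));
	printf("new extent: must cow = %d\n", must_cow(&new_extent, &ri));
	return 0;
}

Only extents written after the last snapshot still pay for the backref
walk, which matches the fio results: the first round after the snapshot
benefits, and once everything has been rewritten the gain disappears.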
[PATCH] btrfs: Skip some btrfs_cross_ref_exist() check in nocow path
In the nocow path, we check whether the extent is snapshotted in
btrfs_cross_ref_exist(). We can do a similar check earlier and avoid an
unnecessary search into the extent tree.

Signed-off-by: Ethan Lien
---
 fs/btrfs/inode.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d241285a0d2a..96927f2ccd4b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1373,6 +1373,9 @@ static noinline int run_delalloc_nocow(struct inode *inode,
 			    btrfs_file_extent_encryption(leaf, fi) ||
 			    btrfs_file_extent_other_encoding(leaf, fi))
 				goto out_check;
+			if (btrfs_file_extent_generation(leaf, fi) <=
+			    btrfs_root_last_snapshot(&root->root_item))
+				goto out_check;
 			if (extent_type == BTRFS_FILE_EXTENT_REG && !force)
 				goto out_check;
 			if (btrfs_extent_readonly(fs_info, disk_bytenr))
@@ -7368,6 +7371,10 @@ noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
 	    btrfs_file_extent_other_encoding(leaf, fi))
 		goto out;
 
+	if (btrfs_file_extent_generation(leaf, fi) <=
+	    btrfs_root_last_snapshot(&root->root_item))
+		goto out;
+
 	backref_offset = btrfs_file_extent_offset(leaf, fi);
 
 	if (orig_start) {
-- 
2.17.0
[PATCH] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
We should balance dirty metadata pages at the end of
btrfs_finish_ordered_io, since a small, unmergeable random write can
potentially produce dirty metadata which is multiple times larger than
the data itself. For example, a small, unmergeable 4KiB write may
produce:

16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree
16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree
16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree

Although we do call balance dirty pages on the write side, in the
buffered write path most metadata is dirtied only after we reach the
dirty background limit (which so far only counts dirty data pages) and
wake up the flusher thread. If there are many small, unmergeable random
writes spread over a large btree, we'll see a burst of dirty pages that
exceeds the dirty_bytes limit after we wake up the flusher thread - which
is not what we expect. On our machine, it caused an out-of-memory problem
since a page cannot be dropped while it is marked dirty.

One may worry that we could sleep in btrfs_btree_balance_dirty(), but
since we run btrfs_finish_ordered_io in a separate worker, it will not
stop the flusher from consuming dirty pages. Also, we use a different
worker for metadata writeback endio, so sleeping in
btrfs_finish_ordered_io helps us throttle the size of dirty metadata
pages.

Signed-off-by: Ethan Lien
---
 fs/btrfs/inode.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d241285a0d2a..d09433ab126e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3151,6 +3151,8 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 	/* once for the tree */
 	btrfs_put_ordered_extent(ordered_extent);
 
+	btrfs_btree_balance_dirty(fs_info);
+
 	return ret;
 }
-- 
2.17.0