[PATCH] btrfs: Simplify the calculation of variables
Fix the following coccicheck warning:

./fs/btrfs/delayed-inode.c:1157:39-41: WARNING !A || A && B is
equivalent to !A || B.

Reported-by: Abaci Robot
Suggested-by: Jiapeng Zhong
Signed-off-by: Abaci Team
---
 fs/btrfs/delayed-inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 70c0340..ec0b50b8 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1154,7 +1154,7 @@ static int __btrfs_run_delayed_items(struct btrfs_trans_handle *trans, int nr)
 
 	delayed_root = fs_info->delayed_root;
 
 	curr_node = btrfs_first_delayed_node(delayed_root);
-	while (curr_node && (!count || (count && nr--))) {
+	while (curr_node && (!count || nr--)) {
 		ret = __btrfs_commit_inode_delayed_items(trans, path,
 							 curr_node);
 		if (ret) {
-- 
1.8.3.1
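For readers unfamiliar with this coccinelle rule: the equivalence also covers
the side effect, since nr-- is only reached when count is non-zero in both
forms. A minimal standalone check (hypothetical userspace test, not kernel
code):

  #include <assert.h>
  #include <stdio.h>

  /* Verify that (!count || (count && nr--)) and (!count || nr--) agree
   * on both the truth value and the side effect on nr. */
  int main(void)
  {
  	for (int count = 0; count <= 1; count++) {
  		for (int start = 0; start <= 2; start++) {
  			int nr1 = start, nr2 = start;
  			int a = (!count || (count && nr1--));
  			int b = (!count || nr2--);
  			assert(a == b);     /* same truth value */
  			assert(nr1 == nr2); /* nr-- short-circuits identically */
  		}
  	}
  	printf("equivalent\n");
  	return 0;
  }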
Re: btrfs becomes read-only
kernel version:

aleksey@host:~$ sudo uname --all
Linux host 4.15.0-132-generic #136~16.04.1-Ubuntu SMP Tue Jan 12 18:22:20 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

drive make/model: The drive is an external 5-bay HDD enclosure with RAID-5,
connected via USB 3 (made by Orico, https://www.orico.cc/us/product/detail/3622.html),
with 5 WD Red 10 TB drives. We use this drive for backups.

When I try to run btrfs check I get an error message:

aleksey@host:~$ sudo btrfs check --readonly /dev/sdg
Couldn't open file system

aleksey@host:~$ sudo smartctl -x /dev/sdg
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.15.0-132-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     JMicron H/W RAID5
Serial Number:    1IZ6AZCKMIFZYJ8A7V0W
Firmware Version: 0964
User Capacity:    40,003,191,177,216 bytes [40.0 TB]
Sector Size:      512 bytes logical/physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA/ATAPI-7 (minor revision not indicated)
Local Time is:    Wed Jan 27 08:51:02 2021 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Disabled
Write cache is:   Enabled
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Unavailable

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Total time to complete Offline
data collection:		(    0) seconds.
Offline data collection
capabilities:			(0x00)	Offline data collection not supported.
SMART capabilities:            (0x)	Automatic saving of SMART data is not
					implemented.
Error logging capability:        (0x00)	Error logging NOT supported.
					No General Purpose Logging support.

General Purpose Log Directory not supported

SMART Log Directory Version 0
Address    Access  R/W   Size  Description
0x00           SL  R/O      1  Log Directory

SMART Extended Comprehensive Error Log (GP Log 0x03) not supported
SMART Error Log not supported
SMART Extended Self-test Log (GP Log 0x07) not supported
SMART Self-test Log not supported
Selective Self-tests/Logging not supported
SCT Commands not supported
Device Statistics (GP/SMART Log 0x04) not supported
SATA Phy Event Counters (GP Log 0x11) not supported

Best regards,
Alexey Isaev, RQC

On 27.01.2021 at 10:38, Chris Murphy wrote:
> On Wed, Jan 27, 2021 at 12:22 AM Alexey Isaev wrote:
>> Hello! The btrfs volume becomes read-only with these messages in dmesg.
>> What can I do to repair the btrfs partition?
>> [Jan25 08:18] BTRFS error (device sdg): parent transid verify failed on 52180048330752 wanted 132477 found 132432
>> [  +0.007587] BTRFS error (device sdg): parent transid verify failed on 52180048330752 wanted 132477 found 132432
>> [  +0.000132] BTRFS error (device sdg): qgroup scan failed with -5
>> [Jan25 19:52] BTRFS error (device sdg): parent transid verify failed on 52180048330752 wanted 132477 found 132432
>> [  +0.009783] BTRFS error (device sdg): parent transid verify failed on 52180048330752 wanted 132477 found 132432
>> [  +0.000132] BTRFS: error (device sdg) in __btrfs_cow_block:1176: errno=-5 IO failure
>> [  +0.60] BTRFS info (device sdg): forced readonly
>> [  +0.04] BTRFS info (device sdg): failed to delete reference to ftrace.h, inode 2986197 parent 2989315
>> [  +0.02] BTRFS: error (device sdg) in __btrfs_unlink_inode:4220: errno=-5 IO failure
>> [  +0.006071] BTRFS error (device sdg): pending csums is 430080
>
> What kernel version? What drive make/model?
>
> wanted 132477 found 132432 indicates the drive has lost ~45 transactions;
> that's not good and also weird. There's no crash or any other errors? A
> complete dmesg might be more revealing. And also:
>
> smartctl -x /dev/sdg
> btrfs check --readonly /dev/sdg
>
> After that I suggest https://btrfs.wiki.kernel.org/index.php/Restore
> and try to get any important data out if it's not backed up. You can try
> btrfs-find-root to get a listing of roots, most recent to oldest. Start
> at the top, and plug that address in as 'btrfs restore -t' and see if
> it'll pull anything out. You likely need -i and -v options as well.
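For anyone following the same recovery path, the suggested workflow is
roughly this (a sketch only: /dev/sdg and /mnt/recovery are placeholders,
and ADDR must be taken from the btrfs-find-root output):

  # list candidate tree roots, most recent first
  sudo btrfs-find-root /dev/sdg

  # try pulling data out using one of the listed tree roots;
  # -i ignores errors, -v lists each restored file
  sudo btrfs restore -t ADDR -i -v /dev/sdg /mnt/recovery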
[PATCH] fs: btrfs: Select SHA256 in Kconfig
From: Matthias Brugger

Since commit 565a4147d17a ("fs: btrfs: Add more checksum algorithms")
btrfs uses the sha256 checksum algorithm. But the Kconfig entry fails to
select it. This leads to compilation errors:

fs/built-in.o: In function `hash_sha256':
fs/btrfs/crypto/hash.c:25: undefined reference to `sha256_starts'
fs/btrfs/crypto/hash.c:26: undefined reference to `sha256_update'
fs/btrfs/crypto/hash.c:27: undefined reference to `sha256_finish'

Signed-off-by: Matthias Brugger
---
 fs/btrfs/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig
index f302b1fbef..2a32f42ad1 100644
--- a/fs/btrfs/Kconfig
+++ b/fs/btrfs/Kconfig
@@ -4,6 +4,7 @@ config FS_BTRFS
 	select LZO
 	select ZSTD
 	select RBTREE
+	select SHA256
 	help
 	  This provides a single-device read-only BTRFS support. BTRFS is a
 	  next-generation Linux file system based on the copy-on-write
-- 
2.30.0
Re: [PATCH] fs: btrfs: Select SHA256 in Kconfig
On 2021/1/27 5:42 PM, matthias@kernel.org wrote:
> From: Matthias Brugger
>
> Since commit 565a4147d17a ("fs: btrfs: Add more checksum algorithms")
> btrfs uses the sha256 checksum algorithm. But the Kconfig entry fails to
> select it. This leads to compilation errors:
>
> fs/built-in.o: In function `hash_sha256':
> fs/btrfs/crypto/hash.c:25: undefined reference to `sha256_starts'
> fs/btrfs/crypto/hash.c:26: undefined reference to `sha256_update'
> fs/btrfs/crypto/hash.c:27: undefined reference to `sha256_finish'
>
> Signed-off-by: Matthias Brugger

Oh, forgot to select sha256. Thanks for the fix.

Reviewed-by: Qu Wenruo

Thanks,
Qu

> ---
>  fs/btrfs/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig
> index f302b1fbef..2a32f42ad1 100644
> --- a/fs/btrfs/Kconfig
> +++ b/fs/btrfs/Kconfig
> @@ -4,6 +4,7 @@ config FS_BTRFS
>  	select LZO
>  	select ZSTD
>  	select RBTREE
> +	select SHA256
>  	help
>  	  This provides a single-device read-only BTRFS support. BTRFS is a
>  	  next-generation Linux file system based on the copy-on-write
[PATCH] btrfs: Avoid calling btrfs_get_chunk_map() twice
From: Michal Rostecki

Before this change, the btrfs_get_io_geometry() function was calling
btrfs_get_chunk_map() to get the extent mapping, necessary for
calculating the I/O geometry. It was using that extent mapping only
internally and freeing the pointer after its execution.

That resulted in calling btrfs_get_chunk_map() de facto twice by the
__btrfs_map_block() function. It was calling btrfs_get_io_geometry()
first and then calling btrfs_get_chunk_map() directly to get the extent
mapping, used by the rest of the function.

This change fixes that by passing the extent mapping to the
btrfs_get_io_geometry() function as an argument.

Fixes: 89b798ad1b42 ("btrfs: Use btrfs_get_io_geometry appropriately")
Signed-off-by: Michal Rostecki
---
 fs/btrfs/inode.c   | 37 -
 fs/btrfs/volumes.c | 39 ---
 fs/btrfs/volumes.h |  5 +++--
 3 files changed, 47 insertions(+), 34 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0dbe1aaa0b71..a4ce8501ed4d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2183,9 +2183,10 @@ int btrfs_bio_fits_in_stripe(struct page *page, size_t size, struct bio *bio,
 	struct inode *inode = page->mapping->host;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	u64 logical = bio->bi_iter.bi_sector << 9;
+	struct extent_map *em;
 	u64 length = 0;
 	u64 map_length;
-	int ret;
+	int ret = 0;
 	struct btrfs_io_geometry geom;
 
 	if (bio_flags & EXTENT_BIO_COMPRESSED)
@@ -2193,14 +2194,21 @@ int btrfs_bio_fits_in_stripe(struct page *page, size_t size, struct bio *bio,
 
 	length = bio->bi_iter.bi_size;
 	map_length = length;
-	ret = btrfs_get_io_geometry(fs_info, btrfs_op(bio), logical, map_length,
-				    &geom);
+	em = btrfs_get_chunk_map(fs_info, logical, map_length);
+	if (IS_ERR(em))
+		return PTR_ERR(em);
+	ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio), logical,
+				    map_length, &geom);
 	if (ret < 0)
-		return ret;
+		goto out;
 
-	if (geom.len < length + size)
-		return 1;
-	return 0;
+	if (geom.len < length + size) {
+		ret = 1;
+		goto out;
+	}
+out:
+	free_extent_map(em);
+	return ret;
 }
 
 /*
@@ -7941,10 +7949,12 @@ static blk_qc_t btrfs_submit_direct(struct inode *inode, struct iomap *iomap,
 	u64 submit_len;
 	int clone_offset = 0;
 	int clone_len;
+	int logical;
 	int ret;
 	blk_status_t status;
 	struct btrfs_io_geometry geom;
 	struct btrfs_dio_data *dio_data = iomap->private;
+	struct extent_map *em;
 
 	dip = btrfs_create_dio_private(dio_bio, inode, file_offset);
 	if (!dip) {
@@ -7970,11 +7980,17 @@ static blk_qc_t btrfs_submit_direct(struct inode *inode, struct iomap *iomap,
 	}
 
 	start_sector = dio_bio->bi_iter.bi_sector;
+	logical = start_sector << 9;
 	submit_len = dio_bio->bi_iter.bi_size;
 
 	do {
-		ret = btrfs_get_io_geometry(fs_info, btrfs_op(dio_bio),
-					    start_sector << 9, submit_len,
+		em = btrfs_get_chunk_map(fs_info, logical, submit_len);
+		if (IS_ERR(em)) {
+			status = errno_to_blk_status(ret);
+			goto out_err;
+		}
+		ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(dio_bio),
+					    logical, submit_len,
 					    &geom);
 		if (ret) {
 			status = errno_to_blk_status(ret);
@@ -8030,12 +8046,15 @@ static blk_qc_t btrfs_submit_direct(struct inode *inode, struct iomap *iomap,
 		clone_offset += clone_len;
 		start_sector += clone_len >> 9;
 		file_offset += clone_len;
+
+		free_extent_map(em);
 	} while (submit_len > 0);
 	return BLK_QC_T_NONE;
 
 out_err:
 	dip->dio_bio->bi_status = status;
 	btrfs_dio_private_put(dip);
+	free_extent_map(em);
 	return BLK_QC_T_NONE;
 }
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a8ec8539cd8d..4c753b17c0a2 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5940,23 +5940,24 @@ static bool need_full_stripe(enum btrfs_map_op op)
 }
 
 /*
- * btrfs_get_io_geometry - calculates the geomery of a particular (address, len)
  *			   tuple. This information is used to calculate how big a
  *			   particular bio can get before it straddles a stripe.
  *
- * @fs_info - the filesystem
- * @logical - address that we want to figure out the geometry of
- * @len    - the length o
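Condensed, the new calling convention looks like this (a sketch distilled
from the diff above, with a hypothetical wrapper function; the point is
that the caller now owns the extent map reference):

  static int example_map_io(struct btrfs_fs_info *fs_info, struct bio *bio,
  			  u64 logical, u64 map_length)
  {
  	struct btrfs_io_geometry geom;
  	struct extent_map *em;
  	int ret;

  	/* look up the chunk mapping once ... */
  	em = btrfs_get_chunk_map(fs_info, logical, map_length);
  	if (IS_ERR(em))
  		return PTR_ERR(em);

  	/* ... pass it to the geometry calculation ... */
  	ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio), logical,
  				    map_length, &geom);

  	/* ... reuse em for any further mapping work instead of a second
  	 * btrfs_get_chunk_map() call, then drop the reference. */
  	free_extent_map(em);
  	return ret;
  }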
Re: [PATCH 03/17] blk-crypto: use bio_kmalloc in blk_crypto_clone_bio
On Tue, Jan 26, 2021 at 03:52:33PM +0100, Christoph Hellwig wrote:
> Use bio_kmalloc instead of open coding it.
> 
> Signed-off-by: Christoph Hellwig
> ---
>  block/blk-crypto-fallback.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/blk-crypto-fallback.c b/block/blk-crypto-fallback.c
> index 50c225398e4d60..e8327c50d7c9f4 100644
> --- a/block/blk-crypto-fallback.c
> +++ b/block/blk-crypto-fallback.c
> @@ -164,7 +164,7 @@ static struct bio *blk_crypto_clone_bio(struct bio *bio_src)
>  	struct bio_vec bv;
>  	struct bio *bio;
>  
> -	bio = bio_alloc_bioset(GFP_NOIO, bio_segments(bio_src), NULL);
> +	bio = bio_kmalloc(GFP_NOIO, bio_segments(bio_src));
>  	if (!bio)
>  		return NULL;
>  	bio->bi_bdev = bio_src->bi_bdev;
> -- 

Looks good,

Reviewed-by: Eric Biggers
[PATCH v4 11/12] btrfs: adjust the flush trace point to include the source
Since we have normal ticketed flushing and preemptive flushing, adjust
the tracepoint so that we know the source of the flushing action to make
it easier to debug problems.

Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
---
 fs/btrfs/space-info.c        | 20 
 include/trace/events/btrfs.h | 10 ++
 2 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 6afb9cac694a..48c2a4eae235 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -667,7 +667,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
  */
 static void flush_space(struct btrfs_fs_info *fs_info,
 			struct btrfs_space_info *space_info, u64 num_bytes,
-			enum btrfs_flush_state state)
+			enum btrfs_flush_state state, bool for_preempt)
 {
 	struct btrfs_root *root = fs_info->extent_root;
 	struct btrfs_trans_handle *trans;
@@ -750,7 +750,7 @@ static void flush_space(struct btrfs_fs_info *fs_info,
 	}
 
 	trace_btrfs_flush_space(fs_info, space_info->flags, num_bytes, state,
-				ret);
+				ret, for_preempt);
 	return;
 }
 
@@ -989,7 +989,8 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
 
 	flush_state = FLUSH_DELAYED_ITEMS_NR;
 	do {
-		flush_space(fs_info, space_info, to_reclaim, flush_state);
+		flush_space(fs_info, space_info, to_reclaim, flush_state,
+			    false);
 		spin_lock(&space_info->lock);
 		if (list_empty(&space_info->tickets)) {
 			space_info->flush = 0;
@@ -1125,7 +1126,7 @@ static void btrfs_preempt_reclaim_metadata_space(struct work_struct *work)
 			to_reclaim >>= 2;
 		if (!to_reclaim)
 			to_reclaim = btrfs_calc_insert_metadata_size(fs_info, 1);
-		flush_space(fs_info, space_info, to_reclaim, flush);
+		flush_space(fs_info, space_info, to_reclaim, flush, true);
 		cond_resched();
 		spin_lock(&space_info->lock);
 	}
@@ -1222,7 +1223,8 @@ static void btrfs_async_reclaim_data_space(struct work_struct *work)
 	spin_unlock(&space_info->lock);
 
 	while (!space_info->full) {
-		flush_space(fs_info, space_info, U64_MAX, ALLOC_CHUNK_FORCE);
+		flush_space(fs_info, space_info, U64_MAX, ALLOC_CHUNK_FORCE,
+			    false);
 		spin_lock(&space_info->lock);
 		if (list_empty(&space_info->tickets)) {
 			space_info->flush = 0;
@@ -1235,7 +1237,7 @@ static void btrfs_async_reclaim_data_space(struct work_struct *work)
 
 	while (flush_state < ARRAY_SIZE(data_flush_states)) {
 		flush_space(fs_info, space_info, U64_MAX,
-			    data_flush_states[flush_state]);
+			    data_flush_states[flush_state], false);
 		spin_lock(&space_info->lock);
 		if (list_empty(&space_info->tickets)) {
 			space_info->flush = 0;
@@ -1308,7 +1310,8 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
 
 	flush_state = 0;
 	do {
-		flush_space(fs_info, space_info, to_reclaim, states[flush_state]);
+		flush_space(fs_info, space_info, to_reclaim, states[flush_state],
+			    false);
 		flush_state++;
 		spin_lock(&space_info->lock);
 		if (ticket->bytes == 0) {
@@ -1324,7 +1327,8 @@ static void priority_reclaim_data_space(struct btrfs_fs_info *fs_info,
 					struct reserve_ticket *ticket)
 {
 	while (!space_info->full) {
-		flush_space(fs_info, space_info, U64_MAX, ALLOC_CHUNK_FORCE);
+		flush_space(fs_info, space_info, U64_MAX, ALLOC_CHUNK_FORCE,
+			    false);
 		spin_lock(&space_info->lock);
 		if (ticket->bytes == 0) {
 			spin_unlock(&space_info->lock);
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 0cf02dfd4c01..74e5cc247b80 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1113,15 +1113,16 @@ TRACE_EVENT(btrfs_trigger_flush,
 TRACE_EVENT(btrfs_flush_space,
 
 	TP_PROTO(const struct btrfs_fs_info *fs_info, u64 flags, u64 num_bytes,
-		 int state, int ret),
+		 int state, int ret, bool for_preempt),
 
-	TP_ARGS(fs_info, flags, num_bytes, state, ret),
+	TP_ARGS(fs_info, flags, num_bytes, state, ret, for_preempt),
 
 	TP_STRUCT__entry_btrfs(
 		__field(	u64,	flags		)
 		__field(	u64,	num_bytes	)
 		__fi
[PATCH v4 08/12] btrfs: rework btrfs_calc_reclaim_metadata_size
Currently btrfs_calc_reclaim_metadata_size does two things: it returns
the space currently required for flushing by the tickets, and if there
are no tickets it calculates a value for the preemptive flushing.

However for the normal ticketed flushing we really only care about the
space required for tickets. We will accidentally come in and flush one
time, but as soon as we see there are no tickets we bail out of our
flushing.

Fix this by making btrfs_calc_reclaim_metadata_size really only tell us
what is required for flushing if we have people waiting on space. Then
move the preemptive flushing logic into need_preemptive_reclaim(). We
ignore btrfs_calc_reclaim_metadata_size() in need_preemptive_reclaim()
because if we are in this path then we made our reservation and there
are no pending tickets currently, so we do not need to check it; simply
do the fuzzy logic to check if we're getting low on space.

Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
---
 fs/btrfs/space-info.c | 44 ---
 1 file changed, 20 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index c3c586b33b4b..8f3b4cc8b812 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -759,7 +759,6 @@ btrfs_calc_reclaim_metadata_size(struct btrfs_fs_info *fs_info,
 {
 	u64 used;
 	u64 avail;
-	u64 expected;
 	u64 to_reclaim = space_info->reclaim_size;
 
 	lockdep_assert_held(&space_info->lock);
@@ -777,28 +776,6 @@ btrfs_calc_reclaim_metadata_size(struct btrfs_fs_info *fs_info,
 	if (space_info->total_bytes + avail < used)
 		to_reclaim += used - (space_info->total_bytes + avail);
 
-	if (to_reclaim)
-		return to_reclaim;
-
-	to_reclaim = min_t(u64, num_online_cpus() * SZ_1M, SZ_16M);
-	if (btrfs_can_overcommit(fs_info, space_info, to_reclaim,
-				 BTRFS_RESERVE_FLUSH_ALL))
-		return 0;
-
-	used = btrfs_space_info_used(space_info, true);
-
-	if (btrfs_can_overcommit(fs_info, space_info, SZ_1M,
-				 BTRFS_RESERVE_FLUSH_ALL))
-		expected = div_factor_fine(space_info->total_bytes, 95);
-	else
-		expected = div_factor_fine(space_info->total_bytes, 90);
-
-	if (used > expected)
-		to_reclaim = used - expected;
-	else
-		to_reclaim = 0;
-	to_reclaim = min(to_reclaim, space_info->bytes_may_use +
-				     space_info->bytes_reserved);
 	return to_reclaim;
 }
 
@@ -807,6 +784,7 @@ static bool need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
 				    u64 used)
 {
 	u64 thresh = div_factor_fine(space_info->total_bytes, 98);
+	u64 to_reclaim, expected;
 
 	/* If we're just plain full then async reclaim just slows us down. */
 	if ((space_info->bytes_used + space_info->bytes_reserved) >= thresh)
@@ -819,7 +797,25 @@ static bool need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
 	if (space_info->reclaim_size)
 		return 0;
 
-	if (!btrfs_calc_reclaim_metadata_size(fs_info, space_info))
+	to_reclaim = min_t(u64, num_online_cpus() * SZ_1M, SZ_16M);
+	if (btrfs_can_overcommit(fs_info, space_info, to_reclaim,
+				 BTRFS_RESERVE_FLUSH_ALL))
+		return 0;
+
+	used = btrfs_space_info_used(space_info, true);
+	if (btrfs_can_overcommit(fs_info, space_info, SZ_1M,
+				 BTRFS_RESERVE_FLUSH_ALL))
+		expected = div_factor_fine(space_info->total_bytes, 95);
+	else
+		expected = div_factor_fine(space_info->total_bytes, 90);
+
+	if (used > expected)
+		to_reclaim = used - expected;
+	else
+		to_reclaim = 0;
+	to_reclaim = min(to_reclaim, space_info->bytes_may_use +
+				     space_info->bytes_reserved);
+	if (!to_reclaim)
 		return 0;
 
 	return (used >= thresh && !btrfs_fs_closing(fs_info) &&
-- 
2.26.2
[PATCH v4 09/12] btrfs: simplify the logic in need_preemptive_flushing
A lot of this was added all in one go with no explanation, and is a bit
unwieldy and confusing. Simplify the logic to start preemptive flushing
if we've reserved more than half of our available free space.

Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
---
 fs/btrfs/space-info.c | 73 ---
 1 file changed, 48 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 8f3b4cc8b812..1c4226f78e27 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -780,11 +780,11 @@ btrfs_calc_reclaim_metadata_size(struct btrfs_fs_info *fs_info,
 }
 
 static bool need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
-				    struct btrfs_space_info *space_info,
-				    u64 used)
+				    struct btrfs_space_info *space_info)
 {
+	u64 ordered, delalloc;
 	u64 thresh = div_factor_fine(space_info->total_bytes, 98);
-	u64 to_reclaim, expected;
+	u64 used;
 
 	/* If we're just plain full then async reclaim just slows us down. */
 	if ((space_info->bytes_used + space_info->bytes_reserved) >= thresh)
@@ -797,26 +797,52 @@ static bool need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
 	if (space_info->reclaim_size)
 		return 0;
 
-	to_reclaim = min_t(u64, num_online_cpus() * SZ_1M, SZ_16M);
-	if (btrfs_can_overcommit(fs_info, space_info, to_reclaim,
-				 BTRFS_RESERVE_FLUSH_ALL))
-		return 0;
+	/*
+	 * If we have over half of the free space occupied by reservations or
+	 * pinned then we want to start flushing.
+	 *
+	 * We do not do the traditional thing here, which is to say
+	 *
+	 *	if (used >= ((total_bytes + avail) >> 1))
+	 *		return 1;
+	 *
+	 * because this doesn't quite work how we want. If we had more than 50%
+	 * of the space_info used by bytes_used and we had 0 available we'd just
+	 * constantly run the background flusher. Instead we want it to kick in
+	 * if our reclaimable space exceeds 50% of our available free space.
+	 */
+	thresh = calc_available_free_space(fs_info, space_info,
+					   BTRFS_RESERVE_FLUSH_ALL);
+	thresh += (space_info->total_bytes - space_info->bytes_used -
+		   space_info->bytes_reserved - space_info->bytes_readonly);
+	thresh >>= 1;
 
-	used = btrfs_space_info_used(space_info, true);
-	if (btrfs_can_overcommit(fs_info, space_info, SZ_1M,
-				 BTRFS_RESERVE_FLUSH_ALL))
-		expected = div_factor_fine(space_info->total_bytes, 95);
-	else
-		expected = div_factor_fine(space_info->total_bytes, 90);
+	used = space_info->bytes_pinned;
 
-	if (used > expected)
-		to_reclaim = used - expected;
+	/*
+	 * If we have more ordered bytes than delalloc bytes then we're either
+	 * doing a lot of DIO, or we simply don't have a lot of delalloc waiting
+	 * around. Preemptive flushing is only useful in that it can free up
+	 * space before tickets need to wait for things to finish. In the case
+	 * of ordered extents, preemptively waiting on ordered extents gets us
+	 * nothing, if our reservations are tied up in ordered extents we'll
+	 * simply have to slow down writers by forcing them to wait on ordered
+	 * extents.
+	 *
+	 * In the case that ordered is larger than delalloc, only include the
+	 * block reserves that we would actually be able to directly reclaim
+	 * from. In this case if we're heavy on metadata operations this will
+	 * clearly be heavy enough to warrant preemptive flushing. In the case
+	 * of heavy DIO or ordered reservations, preemptive flushing will just
+	 * waste time and cause us to slow down.
+	 */
+	ordered = percpu_counter_sum_positive(&fs_info->ordered_bytes);
+	delalloc = percpu_counter_sum_positive(&fs_info->delalloc_bytes);
+	if (ordered >= delalloc)
+		used += fs_info->delayed_refs_rsv.reserved +
+			fs_info->delayed_block_rsv.reserved;
 	else
-		to_reclaim = 0;
-	to_reclaim = min(to_reclaim, space_info->bytes_may_use +
-				     space_info->bytes_reserved);
-	if (!to_reclaim)
-		return 0;
+		used += space_info->bytes_may_use;
 
 	return (used >= thresh && !btrfs_fs_closing(fs_info) &&
 		!test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state));
@@ -1013,7 +1039,6 @@ static void btrfs_preempt_reclaim_metadata_space(struct work_struct *work)
 	struct btrfs_block_rsv *delayed_refs_rsv;
 	struct btrfs_block_rsv *global_rsv;
 	struct btrfs_block_rsv *trans_rsv;
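To make the new threshold concrete, here is a minimal userspace sketch of
the arithmetic with made-up numbers (every input is an assumption for
illustration, including the calc_available_free_space() result):

  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
  	uint64_t total_bytes    = 8ULL << 30; /* 8 GiB of metadata space */
  	uint64_t bytes_used     = 3ULL << 30;
  	uint64_t bytes_reserved = 1ULL << 30;
  	uint64_t bytes_readonly = 0;
  	uint64_t avail          = 2ULL << 30; /* calc_available_free_space() */
  	uint64_t pinned         = 1ULL << 30;
  	uint64_t bytes_may_use  = 2ULL << 30; /* delalloc-heavy case */

  	/* flush once reclaimable space exceeds half the free space */
  	uint64_t thresh = avail + total_bytes - bytes_used -
  			  bytes_reserved - bytes_readonly;
  	thresh >>= 1;

  	/* ordered < delalloc branch: count pinned plus bytes_may_use */
  	uint64_t used = pinned + bytes_may_use;

  	printf("thresh=%llu used=%llu -> %s\n",
  	       (unsigned long long)thresh, (unsigned long long)used,
  	       used >= thresh ? "start preemptive flushing" : "do nothing");
  	return 0;
  }

With these numbers the threshold works out to 3 GiB and the reclaimable
total to 3 GiB, so the background flusher would kick in.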
[PATCH v4 02/12] btrfs: add a trace point for reserve tickets
While debugging an ENOSPC related performance problem I needed to see the
time difference between start and end of a reserve ticket, so add a
trace point to report when we handle a reserve ticket.

I opted to spit out start_ns itself without calculating the difference
because there could be a gap between enabling the tracepoint and setting
start_ns. Doing it this way allows us to filter on 0 start_ns so we
don't get bogus entries, and we can easily calculate the time difference
with bpftrace or something else.

Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
---
 fs/btrfs/space-info.c        | 12 +++-
 include/trace/events/btrfs.h | 29 +
 2 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 975bb109e8b9..af6ab30e36e7 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -1220,6 +1220,8 @@ static void wait_reserve_ticket(struct btrfs_fs_info *fs_info,
  * @fs_info:    the filesystem
  * @space_info: space info for the reservation
  * @ticket:     ticket for the reservation
+ * @start_ns:   timestamp when the reservation started
+ * @orig_bytes: amount of bytes originally reserved
  * @flush:      how much we can flush
  *
  * This does the work of figuring out how to flush for the ticket, waiting for
@@ -1228,6 +1230,7 @@ static void wait_reserve_ticket(struct btrfs_fs_info *fs_info,
 static int handle_reserve_ticket(struct btrfs_fs_info *fs_info,
 				 struct btrfs_space_info *space_info,
 				 struct reserve_ticket *ticket,
+				 u64 start_ns, u64 orig_bytes,
 				 enum btrfs_reserve_flush_enum flush)
 {
 	int ret;
@@ -1283,6 +1286,8 @@ static int handle_reserve_ticket(struct btrfs_fs_info *fs_info,
 	 * space wasn't reserved at all).
 	 */
 	ASSERT(!(ticket->bytes == 0 && ticket->error));
+	trace_btrfs_reserve_ticket(fs_info, space_info->flags, orig_bytes,
+				   start_ns, flush, ticket->error);
 	return ret;
 }
 
@@ -1317,6 +1322,7 @@ static int __reserve_bytes(struct btrfs_fs_info *fs_info,
 {
 	struct work_struct *async_work;
 	struct reserve_ticket ticket;
+	u64 start_ns = 0;
 	u64 used;
 	int ret = 0;
 	bool pending_tickets;
@@ -1369,6 +1375,9 @@ static int __reserve_bytes(struct btrfs_fs_info *fs_info,
 		space_info->reclaim_size += ticket.bytes;
 		init_waitqueue_head(&ticket.wait);
 		ticket.steal = (flush == BTRFS_RESERVE_FLUSH_ALL_STEAL);
+		if (trace_btrfs_reserve_ticket_enabled())
+			start_ns = ktime_get_ns();
+
 		if (flush == BTRFS_RESERVE_FLUSH_ALL ||
 		    flush == BTRFS_RESERVE_FLUSH_ALL_STEAL ||
 		    flush == BTRFS_RESERVE_FLUSH_DATA) {
@@ -1405,7 +1414,8 @@ static int __reserve_bytes(struct btrfs_fs_info *fs_info,
 	if (!ret || flush == BTRFS_RESERVE_NO_FLUSH)
 		return ret;
 
-	return handle_reserve_ticket(fs_info, space_info, &ticket, flush);
+	return handle_reserve_ticket(fs_info, space_info, &ticket, start_ns,
+				     orig_bytes, flush);
 }
 
 /**
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index b9896fc06160..b0ea2a108be3 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -2026,6 +2026,35 @@ TRACE_EVENT(btrfs_convert_extent_bit,
 		  __print_flags(__entry->clear_bits, "|", EXTENT_FLAGS))
 );
 
+TRACE_EVENT(btrfs_reserve_ticket,
+	TP_PROTO(const struct btrfs_fs_info *fs_info, u64 flags, u64 bytes,
+		 u64 start_ns, int flush, int error),
+
+	TP_ARGS(fs_info, flags, bytes, start_ns, flush, error),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	u64,	flags		)
+		__field(	u64,	bytes		)
+		__field(	u64,	start_ns	)
+		__field(	int,	flush		)
+		__field(	int,	error		)
+	),
+
+	TP_fast_assign_btrfs(fs_info,
+		__entry->flags		= flags;
+		__entry->bytes		= bytes;
+		__entry->start_ns	= start_ns;
+		__entry->flush		= flush;
+		__entry->error		= error;
+	),
+
+	TP_printk_btrfs("flags=%s bytes=%llu start_ns=%llu flush=%s error=%d",
+			__print_flags(__entry->flags, "|", BTRFS_GROUP_FLAGS),
+			__entry->bytes, __entry->start_ns,
+			__print_symbolic(__entry->flush, FLUSH_ACTIONS),
+			__entry->error)
+);
+
 DECLARE_EVENT_CLASS(btrfs_sleep_tree_lock,
 	TP_PROTO(const struct extent_buffer *eb, u64 start_ns),
 
-- 
2.26.2
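As a concrete example of the bpftrace usage mentioned above (a sketch: it
assumes this patch is applied, and that bpftrace's nsecs builtin and the
patch's ktime_get_ns() both read CLOCK_MONOTONIC, which they do), this
one-liner histograms ticket wait times while filtering out the bogus zero
start_ns entries:

  # histogram of reserve ticket wait times, in nanoseconds
  bpftrace -e 'tracepoint:btrfs:btrfs_reserve_ticket
  	/args->start_ns > 0/ { @wait_ns = hist(nsecs - args->start_ns); }'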
[PATCH v4 04/12] btrfs: introduce a FORCE_COMMIT_TRANS flush operation
Solely for preemptive flushing, we want to be able to force the
transaction commit without any of the ambiguity of
may_commit_transaction(). This is because may_commit_transaction()
checks tickets and such, and in preemptive flushing we already know
it'll be helpful, so use this to keep the code nice and clean and
straightforward.

Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
---
 fs/btrfs/ctree.h             |  1 +
 fs/btrfs/space-info.c        | 14 ++
 include/trace/events/btrfs.h |  3 ++-
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7d8660227520..90726954b883 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2740,6 +2740,7 @@ enum btrfs_flush_state {
 	ALLOC_CHUNK_FORCE	=	8,
 	RUN_DELAYED_IPUTS	=	9,
 	COMMIT_TRANS		=	10,
+	FORCE_COMMIT_TRANS	=	11,
 };
 
 int btrfs_subvolume_reserve_metadata(struct btrfs_root *root,
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index ff138cef7d0b..94c9534505c5 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -735,6 +735,14 @@ static void flush_space(struct btrfs_fs_info *fs_info,
 	case COMMIT_TRANS:
 		ret = may_commit_transaction(fs_info, space_info);
 		break;
+	case FORCE_COMMIT_TRANS:
+		trans = btrfs_join_transaction(root);
+		if (IS_ERR(trans)) {
+			ret = PTR_ERR(trans);
+			break;
+		}
+		ret = btrfs_commit_transaction(trans);
+		break;
 	default:
 		ret = -ENOSPC;
 		break;
@@ -1037,6 +1045,12 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
  * For data we start with alloc chunk force, however we could have been full
  * before, and then the transaction commit could have freed new block groups,
  * so if we now have space to allocate do the force chunk allocation.
+ *
+ * FORCE_COMMIT_TRANS
+ *   For use by the preemptive flusher. We use this to bypass the ticketing
+ *   checks in may_commit_transaction, as we have more information about the
+ *   overall state of the system and may want to commit the transaction ahead
+ *   of actual ENOSPC conditions.
  */
 static const enum btrfs_flush_state data_flush_states[] = {
 	FLUSH_DELALLOC_WAIT,
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index b0ea2a108be3..0cf02dfd4c01 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -99,7 +99,8 @@ struct btrfs_space_info;
 	EM( ALLOC_CHUNK,		"ALLOC_CHUNK")		\
 	EM( ALLOC_CHUNK_FORCE,		"ALLOC_CHUNK_FORCE")	\
 	EM( RUN_DELAYED_IPUTS,		"RUN_DELAYED_IPUTS")	\
-	EMe(COMMIT_TRANS,		"COMMIT_TRANS")
+	EM( COMMIT_TRANS,		"COMMIT_TRANS")		\
+	EMe(FORCE_COMMIT_TRANS,		"FORCE_COMMIT_TRANS")
 
 /*
  * First define the enums in the above macros to be exported to userspace via
-- 
2.26.2
Re: [PATCH v3 01/12] btrfs: make flush_space take a enum btrfs_flush_state instead of int
On 1/26/21 1:36 PM, David Sterba wrote:
> On Fri, Oct 09, 2020 at 09:28:18AM -0400, Josef Bacik wrote:
>> I got an automated message from somebody who runs clang against our
>> kernels and it's because I used the wrong enum type for what I passed
>> into flush_space. Change the argument to be explicitly the enum we're
>> expecting to make everything consistent. Maybe eventually gcc will
>> catch errors like this.
>
> I can't find any such public report and none of the clang warnings seem
> to be specific about that. Local run with clang 11 does not produce any
> warning.

IDK, it was a private email just to me with the following output from clang:

fs/btrfs/space-info.c:1115:12: warning: implicit conversion from enumeration type 'enum btrfs_flush_state' to different enumeration type 'enum btrfs_reserve_flush_enum' [-Wenum-conversion]
                        flush = FLUSH_DELALLOC;
                              ~ ^~~~~~~~~~~~~~
fs/btrfs/space-info.c:1120:12: warning: implicit conversion from enumeration type 'enum btrfs_flush_state' to different enumeration type 'enum btrfs_reserve_flush_enum' [-Wenum-conversion]
                        flush = FORCE_COMMIT_TRANS;
                              ~ ^~~~~~~~~~~~~~~~~~
fs/btrfs/space-info.c:1124:12: warning: implicit conversion from enumeration type 'enum btrfs_flush_state' to different enumeration type 'enum btrfs_reserve_flush_enum' [-Wenum-conversion]
                        flush = FLUSH_DELAYED_ITEMS_NR;
                              ~ ^~~~~~~~~~~~~~~~~~~~~~
fs/btrfs/space-info.c:1127:12: warning: implicit conversion from enumeration type 'enum btrfs_flush_state' to different enumeration type 'enum btrfs_reserve_flush_enum' [-Wenum-conversion]
                        flush = FLUSH_DELAYED_REFS_NR;
                              ~ ^~~~~~~~~~~~~~~~~~~~~

I figure it made sense, might as well fix it. Do we have that option for
our shiny new -W build options? Thanks,

Josef
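For reference, the class of mistake the warning catches reproduces in a
few lines (hypothetical stand-in enums, not the kernel definitions;
compile with clang and -Wenum-conversion):

  enum btrfs_flush_state { FLUSH_DELAYED_ITEMS_NR = 1 };
  enum btrfs_reserve_flush_enum { BTRFS_RESERVE_NO_FLUSH = 0 };

  int main(void)
  {
  	enum btrfs_reserve_flush_enum flush;

  	/* clang: implicit conversion from enumeration type
  	 * 'enum btrfs_flush_state' to different enumeration type
  	 * 'enum btrfs_reserve_flush_enum' [-Wenum-conversion] */
  	flush = FLUSH_DELAYED_ITEMS_NR;
  	return (int)flush;
  }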
[PATCH v4 00/12] Improve preemptive ENOSPC flushing
v3->v4:
- Added some comments.
- Updated the tracepoint for the preemptive flushing to take a bool
  instead of an int.
- Removed 'inline' from need_preemptive_flushing.

v2->v3:
- Added a cleanup to make sure we pass the right enum around for
  flush_space().
- Fixed a problem reported by a clang build where I used the wrong enum
  in the preemptive flushing for flush_space().
- Fixed up some nits pointed out by Nikolay.

v1->v2:
- Added a FORCE_COMMIT_TRANS flush operation so we can keep the
  flush_space stuff consistent and get all the normal tracepoints.
- Renamed fs_info->dio_bytes to ->ordered_bytes and changed it to count
  all ordered extents that were pending, not just DIO ordered extents
  that were pending.
- Reworked the clamping to not apply if we're not doing a lot of
  delalloc reservations.
- Reworked the preempt flushing loop to be more straightforward.
- Fixed the need_preemptive_flushing() helper to take into account DIO
  heavy workloads.

--- Original email ---

A while ago Nikolay started digging into a problem where they were
seeing an around 20% regression on random writes, and he bisected it
down to

  btrfs: don't end the transaction for delayed refs in throttle

However this wasn't actually the cause of the problem.

This patch removed the code that would preemptively end the transactions
if we were low on space. Because we had just introduced the ticketing
code, this was no longer necessary and was causing a lot of transaction
commits.

And in Nikolay's testing he validated this: we would see like 100x more
transaction commits without that patch than with it, but the write
regression clearly appeared when this patch was applied.

The root cause of this is that the transaction commits were essentially
happening so quickly that we didn't end up needing to wait on space in
the ENOSPC ticketing code as much, and thus were able to write pretty
quickly. With this gone, we now were getting a sawtoothy sort of
behavior where we'd run up, stop while we flushed metadata space, run
some more, stop again, etc.

When I implemented the ticketing infrastructure, I was trying to get us
out of excessively flushing space because we would sometimes over create
block groups, and thus short circuited flushing if we no longer had
tickets. This had the side effect of breaking the preemptive flushing
code, where we attempted to flush space in the background before we were
forced to wait for space.

Enter this patchset. We still have some of this preemption logic
sprinkled everywhere, so I've separated it out of the normal ticketed
flushing code, and made preemptive flushing its own thing.

The preemptive flushing logic is more specialized than the standard
flushing logic. It attempts to flush in whichever pool has the highest
usage. This means that if most of our space is tied up in pinned
extents, we'll commit the transaction. If most of the space is tied up
in delalloc, we'll flush delalloc, etc.

To test this out I used the fio job that Nikolay used; this needs to be
adjusted so the overall IO size is at least 2x the RAM size for the box
you are testing:

fio --direct=0 --ioengine=sync --thread --directory=/mnt/test --invalidate=1 \
	--group_reporting=1 --runtime=300 --fallocate=none --ramp_time=10 \
	--name=RandomWrites-async-64512-4k-4 --new_group --rw=randwrite \
	--size=2g --numjobs=4 --bs=4k --fsync_on_close=0 --end_fsync=0 \
	--filename_format=FioWorkloads.\$jobnum

I got the following results:

misc-next:
  bw=13.4MiB/s (14.0MB/s), 13.4MiB/s-13.4MiB/s (14.0MB/s-14.0MB/s),
  io=4015MiB (4210MB), run=300323-300323msec

pre-throttling:
  bw=16.9MiB/s (17.7MB/s), 16.9MiB/s-16.9MiB/s (17.7MB/s-17.7MB/s),
  io=5068MiB (5314MB), run=300069-300069msec

my patches:
  bw=18.0MiB/s (18.9MB/s), 18.0MiB/s-18.0MiB/s (18.9MB/s-18.9MB/s),
  io=5403MiB (5666MB), run=31-31msec

Thanks,

Josef

Josef Bacik (12):
  btrfs: make flush_space take a enum btrfs_flush_state instead of int
  btrfs: add a trace point for reserve tickets
  btrfs: track ordered bytes instead of just dio ordered bytes
  btrfs: introduce a FORCE_COMMIT_TRANS flush operation
  btrfs: improve preemptive background space flushing
  btrfs: rename need_do_async_reclaim
  btrfs: check reclaim_size in need_preemptive_reclaim
  btrfs: rework btrfs_calc_reclaim_metadata_size
  btrfs: simplify the logic in need_preemptive_flushing
  btrfs: implement space clamping for preemptive flushing
  btrfs: adjust the flush trace point to include the source
  btrfs: add a trace class for dumping the current ENOSPC state

 fs/btrfs/ctree.h             |   4 +-
 fs/btrfs/disk-io.c           |   9 +-
 fs/btrfs/ordered-data.c      |  13 +-
 fs/btrfs/space-info.c        | 295 +--
 fs/btrfs/space-info.h        |   4 +
 include/trace/events/btrfs.h | 104 +++-
 6 files changed, 362 insertions(+), 67 deletions(-)

-- 
2.26.2
[PATCH v4 01/12] btrfs: make flush_space take a enum btrfs_flush_state instead of int
I got an automated message from somebody who runs clang against our
kernels, and it's because I used the wrong enum type for what I passed
into flush_space. Change the argument to be explicitly the enum we're
expecting to make everything consistent. Maybe eventually gcc will catch
errors like this.

Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
---
 fs/btrfs/space-info.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index fd8e79e3c10e..975bb109e8b9 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -670,7 +670,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
  */
 static void flush_space(struct btrfs_fs_info *fs_info,
 			struct btrfs_space_info *space_info, u64 num_bytes,
-			int state)
+			enum btrfs_flush_state state)
 {
 	struct btrfs_root *root = fs_info->extent_root;
 	struct btrfs_trans_handle *trans;
@@ -923,7 +923,7 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
 	struct btrfs_fs_info *fs_info;
 	struct btrfs_space_info *space_info;
 	u64 to_reclaim;
-	int flush_state;
+	enum btrfs_flush_state flush_state;
 	int commit_cycles = 0;
 	u64 last_tickets_id;
 
-- 
2.26.2
Re: [PATCH] btrfs: send: use struct send_ctx *sctx for btrfs_compare_trees and changed_cb
On Mon, Jan 25, 2021 at 08:43:25PM +0100, Roman Anasal wrote:
> btrfs_compare_trees and changed_cb use a void *ctx parameter instead of
> struct send_ctx *sctx but when used in changed_cb it is immediately
> casted to `struct send_ctx *sctx = ctx;`.
> 
> changed_cb is only ever called from btrfs_compare_trees and full_send_tree:
> - full_send_tree already passes a struct send_ctx *sctx
> - btrfs_compare_trees is only called by send_subvol with a struct send_ctx *sctx
> - void *ctx in btrfs_compare_trees is only used to be passed to changed_cb
> 
> So casting to/from void *ctx seems unnecessary and directly using
> struct send_ctx *sctx instead provides better type-safety.
> 
> The original reason for using void *ctx in the first place seems to have
> been dropped with
> 1b51d6f ("btrfs: send: remove indirect callback parameter for changed_cb")
> 
> Signed-off-by: Roman Anasal

Makes sense, added to misc-next, thanks.
Re: [PATCH v3 06/12] btrfs: rename need_do_async_reclaim
On Fri, Oct 09, 2020 at 09:28:23AM -0400, Josef Bacik wrote:
> All of our normal flushing is asynchronous reclaim, so this helper is
> poorly named. This is more checking if we need to preemptively flush
> space, so rename it to need_preemptive_reclaim.
> 
> Reviewed-by: Nikolay Borisov
> Signed-off-by: Josef Bacik
> ---
>  fs/btrfs/space-info.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index 0f84bee57c29..f37ead28bd05 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -799,9 +799,9 @@ btrfs_calc_reclaim_metadata_size(struct btrfs_fs_info *fs_info,
>  	return to_reclaim;
>  }
>  
> -static inline int need_do_async_reclaim(struct btrfs_fs_info *fs_info,
> -					struct btrfs_space_info *space_info,
> -					u64 used)
> +static inline bool need_preemptive_reclaim(struct btrfs_fs_info *fs_info,

Following patches add more code to that function so it's not really
suitable for 'static inline', I'd rather drop it.
Re: [PATCH 02/17] btrfs: use bio_kmalloc in __alloc_device
On 1/26/21 7:04 AM, Christoph Hellwig wrote:
> Use bio_kmalloc instead of open coding it.
> 
> Signed-off-by: Christoph Hellwig

Looks good.

Reviewed-by: Chaitanya Kulkarni
Re: [PATCH 01/17] zonefs: use bio_alloc in zonefs_file_dio_append
On 1/26/21 7:01 AM, Christoph Hellwig wrote:
> Use bio_alloc instead of open coding it.
> 
> Signed-off-by: Christoph Hellwig

Ha ha, I was going to send out a patch for this :P.

Reviewed-by: Chaitanya Kulkarni
Re: [PATCH v3 11/12] btrfs: adjust the flush trace point to include the source
On Fri, Oct 09, 2020 at 09:28:28AM -0400, Josef Bacik wrote:
> Since we have normal ticketed flushing and preemptive flushing, adjust
> the tracepoint so that we know the source of the flushing action to make
> it easier to debug problems.
> 
> Reviewed-by: Nikolay Borisov
> Signed-off-by: Josef Bacik
> ---
>  fs/btrfs/space-info.c        | 20 
>  include/trace/events/btrfs.h | 10 ++
>  2 files changed, 18 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index 5ee698c12a7b..30474fa30985 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -664,7 +664,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
>   */
>  static void flush_space(struct btrfs_fs_info *fs_info,
>  			struct btrfs_space_info *space_info, u64 num_bytes,
> -			enum btrfs_flush_state state)
> +			enum btrfs_flush_state state, bool for_preempt)
>  {
>  	struct btrfs_root *root = fs_info->extent_root;
>  	struct btrfs_trans_handle *trans;
> @@ -747,7 +747,7 @@ static void flush_space(struct btrfs_fs_info *fs_info,
>  	}
>  
>  	trace_btrfs_flush_space(fs_info, space_info->flags, num_bytes, state,
> -				ret);
> +				ret, for_preempt);
>  	return;
>  }
>  
> @@ -973,7 +973,8 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
>  
>  	flush_state = FLUSH_DELAYED_ITEMS_NR;
>  	do {
> -		flush_space(fs_info, space_info, to_reclaim, flush_state);
> +		flush_space(fs_info, space_info, to_reclaim, flush_state,
> +			    false);
>  		spin_lock(&space_info->lock);
>  		if (list_empty(&space_info->tickets)) {
>  			space_info->flush = 0;
> @@ -1109,7 +1110,7 @@ static void btrfs_preempt_reclaim_metadata_space(struct work_struct *work)
>  			to_reclaim >>= 2;
>  		if (!to_reclaim)
>  			to_reclaim = btrfs_calc_insert_metadata_size(fs_info, 1);
> -		flush_space(fs_info, space_info, to_reclaim, flush);
> +		flush_space(fs_info, space_info, to_reclaim, flush, true);
>  		cond_resched();
>  		spin_lock(&space_info->lock);
>  	}
> @@ -1200,7 +1201,8 @@ static void btrfs_async_reclaim_data_space(struct work_struct *work)
>  	spin_unlock(&space_info->lock);
>  
>  	while (!space_info->full) {
> -		flush_space(fs_info, space_info, U64_MAX, ALLOC_CHUNK_FORCE);
> +		flush_space(fs_info, space_info, U64_MAX, ALLOC_CHUNK_FORCE,
> +			    false);
>  		spin_lock(&space_info->lock);
>  		if (list_empty(&space_info->tickets)) {
>  			space_info->flush = 0;
> @@ -1213,7 +1215,7 @@ static void btrfs_async_reclaim_data_space(struct work_struct *work)
>  
>  	while (flush_state < ARRAY_SIZE(data_flush_states)) {
>  		flush_space(fs_info, space_info, U64_MAX,
> -			    data_flush_states[flush_state]);
> +			    data_flush_states[flush_state], false);
>  		spin_lock(&space_info->lock);
>  		if (list_empty(&space_info->tickets)) {
>  			space_info->flush = 0;
> @@ -1286,7 +1288,8 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
>  
>  	flush_state = 0;
>  	do {
> -		flush_space(fs_info, space_info, to_reclaim, states[flush_state]);
> +		flush_space(fs_info, space_info, to_reclaim, states[flush_state],
> +			    false);
>  		flush_state++;
>  		spin_lock(&space_info->lock);
>  		if (ticket->bytes == 0) {
> @@ -1302,7 +1305,8 @@ static void priority_reclaim_data_space(struct btrfs_fs_info *fs_info,
>  					struct reserve_ticket *ticket)
>  {
>  	while (!space_info->full) {
> -		flush_space(fs_info, space_info, U64_MAX, ALLOC_CHUNK_FORCE);
> +		flush_space(fs_info, space_info, U64_MAX, ALLOC_CHUNK_FORCE,
> +			    false);
>  		spin_lock(&space_info->lock);
>  		if (ticket->bytes == 0) {
>  			spin_unlock(&space_info->lock);
> diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
> index 0a3d35d952c4..6d93637bae02 100644
> --- a/include/trace/events/btrfs.h
> +++ b/include/trace/events/btrfs.h
> @@ -1112,15 +1112,16 @@ TRACE_EVENT(btrfs_trigger_flush,
>  TRACE_EVENT(btrfs_flush_space,
>  
>  	TP_PROTO(const struct btrfs_fs_info *fs_info, u64 flags, u64 num_bytes,
> -		 int state, int ret),
> +		 int state, int ret, int for_preempt),

for_preempt should be bool

> 
> -	TP_ARGS(fs_info, flags, num_bytes, state, ret),
> +	TP_ARGS(fs_info, flags, num_bytes, state, ret, for_pre
Re: [PATCH 05/17] block: use an on-stack bio in blkdev_issue_flush
On 1/26/21 7:07 AM, Christoph Hellwig wrote:
> There is no point in allocating memory for a synchronous flush.
> 
> Signed-off-by: Christoph Hellwig

True, looks good.

Reviewed-by: Chaitanya Kulkarni
Re: [PATCH 16/17] nilfs2: remove cruft in nilfs_alloc_seg_bio
On 1/26/21 7:27 AM, Christoph Hellwig wrote:
> bio_alloc never returns NULL when it can sleep.
> 
> Signed-off-by: Christoph Hellwig
> ---

Looks good.

Reviewed-by: Chaitanya Kulkarni
Re: [PATCH v2 3/4] btrfs: send: fix invalid commands for inodes with changed rdev but same gen
On Monday, 25.01.2021 at 20:51 +0000, Filipe Manana wrote:
> On Mon, Jan 25, 2021 at 7:51 PM Roman Anasal wrote:
> > This is analogous to the preceding patch ("btrfs: send: fix invalid
> > commands for inodes with changed type but same gen") but for changed
> > rdev:
> > 
> > When doing an incremental send, if a new inode has the same number as an
> > inode in the parent subvolume, was created with the same generation but
> > has differing rdev, it will not be detected as changed and thus not
> > recreated. This will lead to incorrect results on the receiver, where the
> > inode will keep the rdev of the inode in the parent subvolume, or even
> > fail when also the ref is unchanged.
> > 
> > This case does not happen when doing incremental sends with snapshots
> > that are kept read-only by the user all the time, but it may happen if
> > - a snapshot was modified in the same transaction as its parent after it
> >   was created
> > - the subvol used as parent was created independently from the sent subvol
> > 
> > Example reproducers:
> > 
> >   # case 1: same ino at same path
> >   btrfs subvolume create subvol1
> >   btrfs subvolume create subvol2
> >   mknod subvol1/a c 1 3
> >   mknod subvol2/a c 1 5
> >   btrfs property set subvol1 ro true
> >   btrfs property set subvol2 ro true
> >   btrfs send -p subvol1 subvol2 | btrfs receive --dump
> > 
> > The produced tree state here is:
> >   |-- subvol1
> >   |   `-- a (ino 257, c 1,3)
> >   |
> >   `-- subvol2
> >       `-- a (ino 257, c 1,5)
> > 
> > Where subvol1/a and subvol2/a are character devices with differing minor
> > numbers but same inode number and same generation.
> > 
> > Example output of the receive command:
> >   At subvol subvol2
> >   snapshot ./subvol2 uuid=7513941c-4ef7-f847-b05e-4fdfe003af7b transid=9 parent_uuid=b66f015b-c226-2548-9e39-048c7fdbec99 parent_transid=9
> >   utimes   ./subvol2/ atime=2021-01-25T17:14:36+0000 mtime=2021-01-25T17:14:36+0000 ctime=2021-01-25T17:14:36+0000
> >   link     ./subvol2/a dest=a
> >   unlink   ./subvol2/a
> >   utimes   ./subvol2/ atime=2021-01-25T17:14:36+0000 mtime=2021-01-25T17:14:36+0000 ctime=2021-01-25T17:14:36+0000
> >   utimes   ./subvol2/a atime=2021-01-25T17:14:36+0000 mtime=2021-01-25T17:14:36+0000 ctime=2021-01-25T17:14:36+0000
> > 
> > => the `link` command causes the receiver to fail with:
> >    ERROR: link a -> a failed: File exists
> > 
> > Second example:
> > 
> >   # case 2: same ino at different path
> >   btrfs subvolume create subvol1
> >   btrfs subvolume create subvol2
> >   mknod subvol1/a c 1 3
> >   mknod subvol2/b c 1 5
> >   btrfs property set subvol1 ro true
> >   btrfs property set subvol2 ro true
> >   btrfs send -p subvol1 subvol2 | btrfs receive --dump
> 
> As I've told you before for the v1 patchset from a week or two ago,
> this is not a supported scenario for incremental sends.
> Incremental sends are meant to be used on RO snapshots of the same
> subvolume, and those snapshots must never be changed after they were
> created.
> 
> Incremental sends were simply not designed for these cases, and can
> never be guaranteed to work with such cases.
> 
> The bug is not having incremental sends fail right away, with an
> explicit error message, when the send and parent roots aren't RO
> snapshots of the same subvolume.

I am sorry, I didn't want to anger you or to appear to be just stubborn
by posting this. As I wrote in the cover letter, I am aware that this is
not a supported use case and I understand that that makes the patches
likely to be rejected.

As said, the reason I _wrote_ the patches was simply to learn more about
the btrfs code and its internals and see if I would be able to
understand it enough. The reason I _submitted_ them was just to document
what I found out so others could have a look into it, and just in case
it may be useful at a later time.

I also don't want to claim that these will add full support for sending
unrelated roots - they don't! They only handle those very specific edge
cases I found, which are currently _possible_, although still not
supported.

I took a deeper look into the rest to see if it could be supported: the
comparing algorithm actually works fine, even with completely unrelated
subvolumes (i.e. btrfs_compare_trees, changed_cb, changed_inode etc.),
but the processing of the changes (i.e. process_recorded_refs etc.) is
heavily based on (ino, gen) as an identifying handle, which can not be
changed without a high risk of regression - just as you said in your
earlier comments - since the side effects of any changes are hard to see
or understand without a very deep understanding of the whole code; which
is why I didn't even try to touch those parts.

I apologize if I appeared to be stubborn or ignorant of
Re: [PATCH v3 02/12] btrfs: add a trace point for reserve tickets
On Fri, Oct 09, 2020 at 09:28:19AM -0400, Josef Bacik wrote:
> While debugging an ENOSPC related performance problem I needed to see the
> time difference between start and end of a reserve ticket, so add a
> trace point to report when we handle a reserve ticket.
> 
> I opted to spit out start_ns itself without calculating the difference
> because there could be a gap between enabling the tracepoint and setting
> start_ns. Doing it this way allows us to filter on 0 start_ns so we
> don't get bogus entries, and we can easily calculate the time difference
> with bpftrace or something else.
> 
> Reviewed-by: Nikolay Borisov
> Signed-off-by: Josef Bacik
> ---
>  fs/btrfs/space-info.c        | 10 +-
>  include/trace/events/btrfs.h | 29 +
>  2 files changed, 38 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index ba2b72409d46..ac7269cf1904 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -1224,6 +1224,7 @@ static void wait_reserve_ticket(struct btrfs_fs_info *fs_info,
>  static int handle_reserve_ticket(struct btrfs_fs_info *fs_info,
>  				 struct btrfs_space_info *space_info,
>  				 struct reserve_ticket *ticket,
> +				 u64 start_ns, u64 orig_bytes,

Added

  + * @start_ns:   timestamp when the reservation started
  + * @orig_bytes: amount of bytes originally reserved

caught by the recently added kdoc verification.

>  				 enum btrfs_reserve_flush_enum flush)
Re: [PATCH v5 0/8] A variety of lock contention fixes
On Fri, Dec 18, 2020 at 02:24:18PM -0500, Josef Bacik wrote:
> v4->v5:
> - Added "btrfs: remove bogus BUG_ON in alloc_reserved_tree_block", as
>   Nikolay pointed out I needed to explain why we no longer needed one of
>   the delayed ref flushes, which led me down the rabbit hole of trying
>   to understand why it wasn't a problem anymore. Turned out the BUG_ON()
>   is bogus.
> - Added "btrfs: move delayed ref flushing for qgroup into qgroup
>   helper": instead of removing the flushing for qgroups completely, we
>   still sort of need it, even though it's actually still broken, so I've
>   moved it into the qgroup helper.
> - Added Nikolay's rb for the last patch.

Moved from topic branch to misc-next, thanks.
Re: [PATCH v5 2/8] btrfs: only let one thread pre-flush delayed refs in commit
On Tue, Jan 12, 2021 at 11:17:45AM +0200, Nikolay Borisov wrote:
> On 11.01.21 at 23:50, David Sterba wrote:
> > On Mon, Jan 11, 2021 at 10:33:42AM +0200, Nikolay Borisov wrote:
> >> On 8.01.21 at 18:01, David Sterba wrote:
> >>> On Fri, Dec 18, 2020 at 02:24:20PM -0500, Josef Bacik wrote:
> >>>> @@ -2043,23 +2043,22 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
> >>>>  	btrfs_trans_release_metadata(trans);
> >>>>  	trans->block_rsv = NULL;
> >>>>  
> >>>> -	/* make a pass through all the delayed refs we have so far
> >>>> -	 * any runnings procs may add more while we are here
> >>>> -	 */
> >>>> -	ret = btrfs_run_delayed_refs(trans, 0);
> >>>> -	if (ret) {
> >>>> -		btrfs_end_transaction(trans);
> >>>> -		return ret;
> >>>> -	}
> >>>> -
> >>>> -	cur_trans = trans->transaction;
> >>>> -
> >>>>  	/*
> >>>> -	 * set the flushing flag so procs in this transaction have to
> >>>> -	 * start sending their work down.
> >>>> +	 * We only want one transaction commit doing the flushing so we do not
> >>>> +	 * waste a bunch of time on lock contention on the extent root node.
> >>>>  	 */
> >>>> -	cur_trans->delayed_refs.flushing = 1;
> >>>> -	smp_wmb();
> >>>
> >>> This barrier obviously separates the flushing = 1 and the rest of the
> >>> code, now implemented as test_and_set_bit, which implies full barrier.
> >>>
> >>> However, the hunk in btrfs_should_end_transaction removes the barrier
> >>> and I'm not sure whether this is correct:
> >>>
> >>> -	smp_mb();
> >>>  	if (cur_trans->state >= TRANS_STATE_COMMIT_START ||
> >>> -	    cur_trans->delayed_refs.flushing)
> >>> +	    test_bit(BTRFS_DELAYED_REFS_FLUSHING,
> >>> +		     &cur_trans->delayed_refs.flags))
> >>>  		return true;
> >>>
> >>> This is never called under locks so we don't have complete
> >>> synchronization of either the transaction state or the flushing bit.
> >>> btrfs_should_end_transaction is merely a hint and not called in
> >>> critical places so we could probably afford to keep it without a
> >>> barrier, or keep it with comment(s).
> >>
> >> I think the point is moot in this case, because the test_bit either
> >> sees the flag or it doesn't. It's not possible for the flag to be set
> >> AND should_end_transaction to return false - that would be a gross
> >> violation of program correctness.
> >
> > So that's for the flushing part, but what about cur_trans->state?
> 
> Looking at the code, the barrier was there to order the publishing of
> the delayed_ref.flushing (now replaced by the bit flag) against
> surrounding code.
> 
> So independently of this patch, let's reason about trans state. In
> should_end_transaction it's read without holding any locks. (U)
> 
> It's modified in btrfs_cleanup_transaction without holding the
> fs_info->trans_lock (U), but the STATE_ERROR flag is going to be set.
> 
> set in cleanup_transaction under fs_info->trans_lock (L)
> set in btrfs_commit_trans to COMMIT_START under fs_info->trans_lock (L)
> set in btrfs_commit_trans to COMMIT_DOING under fs_info->trans_lock (L)
> set in btrfs_commit_trans to COMMIT_UNBLOCK under fs_info->trans_lock (L)
> 
> set in btrfs_commit_trans to COMMIT_COMPLETED without locks, but at this
> point the transaction is finished and fs_info->running_trans is NULL (U
> but irrelevant).
> 
> So by the looks of it we can have a concurrent READ race with a WRITE,
> due to reads not taking a lock. In this case what we want to ensure is
> that we either see the new or the old state. I consulted with Will
> Deacon and he said that in such a case we'd want to annotate the
> accesses to ->state with (READ|WRITE)_ONCE so as to avoid a theoretical
> tear; in this case I don't think this could happen, but I imagine at
> some point kcsan would flag such an access as racy (which it is).

Thanks for the analysis, I've copied it to the changelog as there's
probably no shorter way to explain it.
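To make the suggested annotation concrete, it would look roughly like
this (a hypothetical sketch only, not part of the merged patch; it marks
the lockless reader and the locked writers of cur_trans->state):

  /* lockless reader, e.g. in btrfs_should_end_transaction() */
  if (READ_ONCE(cur_trans->state) >= TRANS_STATE_COMMIT_START)
  	return true;

  /* writer side, under fs_info->trans_lock */
  WRITE_ONCE(cur_trans->state, TRANS_STATE_COMMIT_START);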
[PATCH 3/7] btrfs: avoid logging new ancestor inodes when logging new inode
From: Filipe Manana

When we fsync a new file, created in the current transaction, we check
all its ancestor inodes and always log them if they were created in the
current transaction - even if we have already logged them before, which
is a waste of time.

So avoid logging new ancestor inodes if they were already logged before
and have no xattrs added/updated/removed since they were last logged.

This patch is part of a patchset comprised of the following patches:

  btrfs: remove unnecessary directory inode item update when deleting dir entry
  btrfs: stop setting nbytes when filling inode item for logging
  btrfs: avoid logging new ancestor inodes when logging new inode
  btrfs: skip logging directories already logged when logging all parents
  btrfs: skip logging inodes already logged when logging new entries
  btrfs: remove unnecessary check_parent_dirs_for_sync()
  btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

Performance results, after applying all patches, are mentioned in the
change log of the last patch.

Signed-off-by: Filipe Manana
---
 fs/btrfs/tree-log.c | 35 +--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index be62759f0aac..105cf316ee27 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -5272,6 +5272,7 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
 	if (S_ISDIR(inode->vfs_inode.i_mode)) {
 		int max_key_type = BTRFS_DIR_LOG_INDEX_KEY;
 
+		clear_bit(BTRFS_INODE_COPY_EVERYTHING, &inode->runtime_flags);
 		if (inode_only == LOG_INODE_EXISTS)
 			max_key_type = BTRFS_XATTR_ITEM_KEY;
 		ret = drop_objectid_items(trans, log, path, ino, max_key_type);
@@ -5520,6 +5521,34 @@ static noinline int check_parent_dirs_for_sync(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+/*
+ * Check if we need to log an inode. This is used in contexts where while
+ * logging an inode we need to log another inode (either that it exists or in
+ * full mode). This is used instead of btrfs_inode_in_log() because the latter
+ * requires the inode to be in the log and have the log transaction committed,
+ * while here we do not care if the log transaction was already committed - our
+ * caller will commit the log later - and we want to avoid logging an inode
+ * multiple times when multiple tasks have joined the same log transaction.
+ */
+static bool need_log_inode(struct btrfs_trans_handle *trans,
+			   struct btrfs_inode *inode)
+{
+	/*
+	 * If this inode does not have new/updated/deleted xattrs since the last
+	 * time it was logged and is flagged as logged in the current transaction,
+	 * we can skip logging it. As for new/deleted names, those are updated in
+	 * the log by link/unlink/rename operations.
+	 * In case the inode was logged and then evicted and reloaded, its
+	 * logged_trans will be 0, in which case we have to fully log it since
+	 * logged_trans is a transient field, not persisted.
+	 */
+	if (inode->logged_trans == trans->transid &&
+	    !test_bit(BTRFS_INODE_COPY_EVERYTHING, &inode->runtime_flags))
+		return false;
+
+	return true;
+}
+
 struct btrfs_dir_list {
 	u64 ino;
 	struct list_head list;
@@ -5848,7 +5877,8 @@ static int log_new_ancestors(struct btrfs_trans_handle *trans,
 		if (IS_ERR(inode))
 			return PTR_ERR(inode);
 
-		if (BTRFS_I(inode)->generation >= trans->transid)
+		if (BTRFS_I(inode)->generation >= trans->transid &&
+		    need_log_inode(trans, BTRFS_I(inode)))
 			ret = btrfs_log_inode(trans, root, BTRFS_I(inode),
 					      LOG_INODE_EXISTS, ctx);
 		btrfs_add_delayed_iput(inode);
@@ -5902,7 +5932,8 @@ static int log_new_ancestors_fast(struct btrfs_trans_handle *trans,
 		if (root != inode->root)
 			break;
 
-		if (inode->generation >= trans->transid) {
+		if (inode->generation >= trans->transid &&
+		    need_log_inode(trans, inode)) {
 			ret = btrfs_log_inode(trans, root, inode,
 					      LOG_INODE_EXISTS, ctx);
 			if (ret)
-- 
2.28.0
[PATCH 4/7] btrfs: skip logging directories already logged when logging all parents
From: Filipe Manana Sometimes when we fsync an inode we need to do a full log of all its ancestors (due to unlink, link or rename operations), which can be an expensive operation, especially if the directories are large. However if we find an ancestor directory inode that is already logged in the current transaction, and has no inserted/updated/deleted xattrs since it was last logged, we can skip logging the directory again. We are safe to skip that since we know that for logged directories, any link, unlink or rename operations that implicate the directory will update the log as necessary. So use the helper need_log_inode(), introduced in a previous commit, to detect already logged directories that can be skipped. This patch is part of a patchset comprised of the following patches: btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit Performance results, after applying all patches, are mentioned in the change log of the last patch. Signed-off-by: Filipe Manana --- fs/btrfs/tree-log.c | 5 + 1 file changed, 5 insertions(+) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 105cf316ee27..c0dce99c2c14 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -5826,6 +5826,11 @@ static int btrfs_log_all_parents(struct btrfs_trans_handle *trans, goto out; } + if (!need_log_inode(trans, BTRFS_I(dir_inode))) { + btrfs_add_delayed_iput(dir_inode); + continue; + } + if (ctx) ctx->log_new_dentries = false; ret = btrfs_log_inode(trans, root, BTRFS_I(dir_inode), -- 2.28.0
[PATCH 0/7] btrfs: more performance improvements for dbench workloads
From: Filipe Manana The following patchset brings one more batch of performance improvements with dbench workloads, or anything that mixes file creation, file writes, renames, unlinks, etc with fsync like dbench does. This patchset is mostly based on avoiding logging directory inodes multiple times when not necessary, falling back to transaction commits less frequently and often waiting less time for transaction commits to complete. Performance results are listed in the change log of the last patch, but in short, I've experienced a reduction of maximum latency up to about -40% and throughput gains up to about +6%. Filipe Manana (7): btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit fs/btrfs/file.c | 1 + fs/btrfs/transaction.c | 39 +++-- fs/btrfs/transaction.h | 2 + fs/btrfs/tree-log.c | 195 - 4 files changed, 92 insertions(+), 145 deletions(-) -- 2.28.0
[PATCH 7/7] btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
From: Filipe Manana Often an fsync needs to fall back to a transaction commit for several reasons (to ensure consistency after a power failure, a new block group was allocated or a temporary error such as ENOMEM or ENOSPC happened). In that case the log is marked as needing a full commit and any concurrent tasks attempting to log inodes or commit the log will also fall back to the transaction commit. When this happens they all wait for the task that first started the transaction commit to finish the transaction commit - however they wait until the full transaction commit happens, which is not needed, as they only need to wait for the superblocks to be persisted and not for unpinning all the extents pinned during the transaction's lifetime, which even for short lived transactions can be a few thousand and take some significant amount of time to complete - for dbench workloads I have observed up to 4~5 milliseconds of time spent unpinning extents in the worst cases, and the number of pinned extents was between 2 and 3 thousand. So allow fsync tasks to skip waiting for the unpinning of extents when they call btrfs_commit_transaction() and they were not the task that started the transaction commit (that one has to do it, the alternative would be to offload the transaction commit to another task so that it could avoid waiting for the extent unpinning or offload the extent unpinning to another task). This patch is part of a patchset comprised of the following patches: btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit After applying the entire patchset, dbench shows improvements with respect to throughput and latency. The script used to measure it is the following:

$ cat dbench-test.sh
#!/bin/bash

DEV=/dev/sdk
MNT=/mnt/sdk
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-m single -d single"

echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

umount $DEV &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT

dbench -D $MNT -t 300 64

umount $MNT

The test was run on a physical machine with 12 cores (Intel Core i7), 64G of ram, using an NVMe device and a non-debug kernel configuration (Debian's default configuration).
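Before the numbers, a rough sketch of the early-wakeup idea described above. All names below are illustrative assumptions (the real implementation lives in fs/btrfs/transaction.c); the point is only that waiters return at an intermediate "superblocks written" state while the initiating task alone pays for the unpinning:

#include <linux/wait.h>

enum commit_state {
	COMMIT_RUNNING,
	COMMIT_SUPER_WRITTEN,	/* superblocks persisted: fsync is durable */
	COMMIT_COMPLETED,	/* pinned extents unpinned, cleanup done */
};

struct commit_ctl {
	enum commit_state state;
	wait_queue_head_t wait;
};

/* The task that started the commit pays the full cost. */
static void do_commit(struct commit_ctl *ctl)
{
	write_superblocks(ctl);		/* hypothetical helper */
	ctl->state = COMMIT_SUPER_WRITTEN;
	wake_up(&ctl->wait);		/* concurrent fsync tasks return here */
	unpin_extents(ctl);		/* hypothetical helper, the slow part */
	ctl->state = COMMIT_COMPLETED;
	wake_up(&ctl->wait);
}

/* A concurrent fsync task that merely joined the commit. */
static void wait_for_commit_super(struct commit_ctl *ctl)
{
	wait_event(ctl->wait, ctl->state >= COMMIT_SUPER_WRITTEN);
}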
Before applying patchset, 32 clients:

Operation      Count     AvgLat    MaxLat
NTCreateX      9627107    0.153    61.938
Close          7072076    0.001     3.175
Rename          407633    1.222    44.439
Unlink         1943895    0.658    44.440
Deltree            256   17.339   110.891
Mkdir              128    0.003     0.009
Qpathinfo      8725406    0.064    17.850
Qfileinfo      1529516    0.001     2.188
Qfsinfo        1599884    0.002     1.457
Sfileinfo       784200    0.005     3.562
Find           3373513    0.411    30.312
WriteX         4802132    0.053    29.054
ReadX         15089959    0.002     5.801
LockX            31344    0.002     0.425
UnlockX          31344    0.001     0.173
Flush           674724    5.952   341.830

Throughput 1008.02 MB/sec  32 clients  32 procs  max_latency=341.833 ms

After applying patchset, 32 clients:

Operation      Count     AvgLat    MaxLat
NTCreateX      9931568    0.111    25.597
Close          7295730    0.001     2.171
Rename          420549    0.982    49.714
Unlink         2005366    0.497    39.015
Deltree            256   11.149    89.242
Mkdir              128    0.002     0.014
Qpathinfo      9001863    0.049    20.761
Qfileinfo      1577730    0.001     2.546
Qfsinfo        1650508    0.002     3.531
Sfileinfo       809031    0.005     5.846
Find           3480259    0.309    23.977
WriteX         4952505    0.043    41.283
ReadX         15568127    0.002     5.476
LockX            32338    0.002     0.978
UnlockX          32338    0.001     2.032
Flush           696017    7.485   228.835

Throughput 1049.91 MB/sec  32 clients  32 procs  max_latency=228.847 ms

--> +4.1% throughput, -39.6% max latency

Before applying patchset, 64 clients:

Operation      Count     AvgLat    MaxLat
NTCreateX      8956748    0.342   108.312
Close          6579660    0.001     3.823
Rename          379209    2.396    81.897
Unlink         1808625    1.108   131.148
Deltree            256   25.632   172.176
Mkdir              128    0.003     0.018
Qpathinfo      8117615    0.131    55.916
Qfileinfo      1423495    0.001     2.635
Qfsinfo        1488496    0.002     5.412
Sfileinfo       729472
[PATCH 1/7] btrfs: remove unnecessary directory inode item update when deleting dir entry
From: Filipe Manana When we remove a directory entry, as part of an unlink operation, if the directory was logged before we must remove the directory index items from the log. We are also updating the inode item of the directory to update its i_size, but that is not necessary because during log replay we do not need it and we correctly adjust the i_size in the inode item of the subvolume as we process directory index items and replay deletes. This is not needed since commit d555438b6e1dad ("Btrfs: drop dir i_size when adding new names on replay"), where we explicitly ignore the i_size of directory inode items on log replay. Before that we used it but it was buggy as mentioned in that commit's change log (i_size got a larger value than it should have). So stop updating the i_size of the directory inode item in the log, as that is a waste of time, adds more lock contention on the log tree and often results in COWing more extent buffers for the log tree. This code path is triggered often during dbench workloads for example. This patch is part of a patchset comprised of the following patches: btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit Performance results, after applying all patches, are mentioned in the change log of the last patch. Signed-off-by: Filipe Manana --- fs/btrfs/tree-log.c | 39 --- 1 file changed, 4 insertions(+), 35 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 8ee0700a980f..5d87afc6058a 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -3379,7 +3379,6 @@ int btrfs_del_dir_entries_in_log(struct btrfs_trans_handle *trans, struct btrfs_path *path; int ret; int err = 0; - int bytes_del = 0; u64 dir_ino = btrfs_ino(dir); if (!inode_logged(trans, dir)) @@ -3406,7 +3405,6 @@ int btrfs_del_dir_entries_in_log(struct btrfs_trans_handle *trans, } if (di) { ret = btrfs_delete_one_dir_name(trans, log, path, di); - bytes_del += name_len; if (ret) { err = ret; goto fail; } @@ -3421,46 +3419,17 @@ int btrfs_del_dir_entries_in_log(struct btrfs_trans_handle *trans, } if (di) { ret = btrfs_delete_one_dir_name(trans, log, path, di); - bytes_del += name_len; if (ret) { err = ret; goto fail; } } - /* update the directory size in the log to reflect the names -* we have removed + /* +* We do not need to update the size field of the directory's inode item +* because on log replay we update the field to reflect all existing +* entries in the directory (see overwrite_item()).
*/ - if (bytes_del) { - struct btrfs_key key; - - key.objectid = dir_ino; - key.offset = 0; - key.type = BTRFS_INODE_ITEM_KEY; - btrfs_release_path(path); - - ret = btrfs_search_slot(trans, log, &key, path, 0, 1); - if (ret < 0) { - err = ret; - goto fail; - } - if (ret == 0) { - struct btrfs_inode_item *item; - u64 i_size; - - item = btrfs_item_ptr(path->nodes[0], path->slots[0], - struct btrfs_inode_item); - i_size = btrfs_inode_size(path->nodes[0], item); - if (i_size > bytes_del) - i_size -= bytes_del; - else - i_size = 0; - btrfs_set_inode_size(path->nodes[0], item, i_size); - btrfs_mark_buffer_dirty(path->nodes[0]); - } else - ret = 0; - btrfs_release_path(path); - } fail: btrfs_free_path(path); out_unlock: -- 2.28.0
[PATCH 5/7] btrfs: skip logging inodes already logged when logging new entries
From: Filipe Manana When logging new directory entries of a directory, we log the inodes of new dentries and the inodes of dentries pointing to directories that may have been created in past transactions. For the case of directories we log in full mode, which can be particularly expensive for large directories. We do use btrfs_inode_in_log() to skip already logged inodes, however for that helper to return true, it requires the log transaction used to log the inode to be already committed. This means that when we have more than one task using the same log transaction we can end up logging an inode multiple times, which is a waste of time and not necessary since the log will be committed by one of the tasks and the others will wait for the log transaction to be committed before returning to user space. So simply replace the use of btrfs_inode_in_log() with the new helper function need_log_inode(), introduced in a previous commit. This patch is part of a patchset comprised of the following patches: btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit Performance results, after applying all patches, are mentioned in the change log of the last patch. Signed-off-by: Filipe Manana --- fs/btrfs/tree-log.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index c0dce99c2c14..6dc376a16cf2 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -5676,7 +5676,7 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans, goto next_dir_inode; } - if (btrfs_inode_in_log(BTRFS_I(di_inode), trans->transid)) { + if (!need_log_inode(trans, BTRFS_I(di_inode))) { btrfs_add_delayed_iput(di_inode); break; } -- 2.28.0
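For comparison, the stricter helper being replaced looked roughly like the following in kernels of that era (quoted approximately from fs/btrfs/btrfs_inode.h, so treat the exact fields as an assumption). The two last_log_commit conditions only hold once the log transaction has been committed, which is exactly what makes it too strict here:

static inline bool btrfs_inode_in_log(struct btrfs_inode *inode, u64 generation)
{
	bool ret = false;

	spin_lock(&inode->lock);
	if (inode->logged_trans == generation &&
	    inode->last_sub_trans <= inode->last_log_commit &&
	    inode->last_sub_trans <= inode->root->last_log_commit)
		ret = true;
	spin_unlock(&inode->lock);
	return ret;
}

need_log_inode() drops the last_log_commit checks, so an inode logged earlier in the same, still uncommitted log transaction is also skipped.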
[PATCH 2/7] btrfs: stop setting nbytes when filling inode item for logging
From: Filipe Manana When we fill an inode item for logging we are setting its nbytes field with the value returned by inode_get_bytes() (a VFS API), however we do not need it because it is not used during log replay. In fact, for fast fsyncs, when we call inode_get_bytes() we may even get an outdated value for nbytes because the nbytes field of the inode is only updated when ordered extents complete, and a fast fsync only waits for writeback to complete, it does not wait for ordered extent completion. So just remove the setup of nbytes and add an explicit comment mentioning why we do not set it. This also avoids adding contention on the inode's i_lock (VFS) with concurrent stat() calls, since that spinlock is used by inode_get_bytes() which is also called by our stat callback (btrfs_getattr()). This patch is part of a patchset comprised of the following patches: btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit Performance results, after applying all patches, are mentioned in the change log of the last patch. Signed-off-by: Filipe Manana --- fs/btrfs/tree-log.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 5d87afc6058a..be62759f0aac 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -3858,7 +3858,14 @@ static void fill_inode_item(struct btrfs_trans_handle *trans, btrfs_set_token_timespec_nsec(&token, &item->ctime, inode->i_ctime.tv_nsec); - btrfs_set_token_inode_nbytes(&token, item, inode_get_bytes(inode)); + /* +* We do not need to set the nbytes field, in fact during a fast fsync +* its value may not even be correct, since a fast fsync does not wait +* for ordered extent completion, which is where we update nbytes, it +* only waits for writeback to complete. During log replay as we find +* file extent items and replay them, we adjust the nbytes field of the +* inode item in subvolume tree as needed (see overwrite_item()). +*/ btrfs_set_token_inode_sequence(&token, item, inode_peek_iversion(inode)); btrfs_set_token_inode_transid(&token, item, trans->transid); -- 2.28.0
[PATCH 6/7] btrfs: remove unnecessary check_parent_dirs_for_sync()
From: Filipe Manana Whenever we fsync an inode, if it is a directory, a regular file that was created in the current transaction or has last_unlink_trans set to the generation of the current transaction, we check if any of its ancestor inodes (and the inode itself if it is a directory) can not be logged and need a fallback to a full transaction commit - if so, we return with a value of 1 in order to fall back to a transaction commit. However we often do not need to fall back to a transaction commit because: 1) The ancestor inode is not an immediate parent, and therefore there is not an explicit request to log it and it is needed neither to guarantee the consistency of the inode originally asked to be logged (fsynced) nor that of its immediate parent; 2) The ancestor inode was already logged before, in which case any link, unlink or rename operation updates the log as needed. So for these two cases we can avoid an unnecessary transaction commit. Therefore remove check_parent_dirs_for_sync() and add a check at the top of btrfs_log_inode() to make us fall back immediately to a transaction commit when we are logging a directory inode that can not be logged and needs a full transaction commit. All we need to protect is the case where after renaming a file someone fsyncs only the old directory, which would result in losing the renamed file after a log replay. This patch is part of a patchset comprised of the following patches: btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit Performance results, after applying all patches, are mentioned in the change log of the last patch. Signed-off-by: Filipe Manana --- fs/btrfs/tree-log.c | 121 ++-- 1 file changed, 15 insertions(+), 106 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 6dc376a16cf2..4c7b283ed2b2 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -5265,6 +5265,21 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans, mutex_lock(&inode->log_mutex); } + /* +* This is for cases where logging a directory could result in losing a +* file after replaying the log. For example, if we move a file from a +* directory A to a directory B, then fsync directory A, we have no way +* to know the file was moved from A to B, so logging just A would +* result in losing the file after a log replay. +*/ + if (S_ISDIR(inode->vfs_inode.i_mode) && + inode_only == LOG_INODE_ALL && + inode->last_unlink_trans >= trans->transid) { + btrfs_set_log_full_commit(trans); + err = 1; + goto out_unlock; + } + /* * a brute force approach to making sure we get the most uptodate * copies of everything. @@ -5428,99 +5443,6 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans, return err; } -/* - * Check if we must fallback to a transaction commit when logging an inode.
- * This must be called after logging the inode and is used only in the context - * when fsyncing an inode requires the need to log some other inode - in which - * case we can't lock the i_mutex of each other inode we need to log as that - * can lead to deadlocks with concurrent fsync against other inodes (as we can - * log inodes up or down in the hierarchy) or rename operations for example. So - * we take the log_mutex of the inode after we have logged it and then check for - * its last_unlink_trans value - this is safe because any task setting - * last_unlink_trans must take the log_mutex and it must do this before it does - * the actual unlink operation, so if we do this check before a concurrent task - * sets last_unlink_trans it means we've logged a consistent version/state of - * all the inode items, otherwise we are not sure and must do a transaction - * commit (the concurrent task might have only updated last_unlink_trans before - * we logged the inode or it might have also done the unlink). - */ -static bool btrfs_must_commit_transaction(struct btrfs_trans_handle *trans, - struct btrfs_inode *inode) -{ - bool ret = false; - - mutex_lock(&inode->log_mutex); - if (inode->last_unlink_trans >= trans->transid) { - /* -* Make sure any commits to the log are forced to be full -* commits. -*/ - btrfs_set_log_full_commit(trans); - ret = true; -
Re: [PATCH] btrfs: fix a bug that btrfs_invalidatepage() can double account ordered extent for subpage
On Wed, Jan 27, 2021 at 8:29 AM Qu Wenruo wrote: > > Commit dbfdb6d1b369 ("Btrfs: Search for all ordered extents that could > span across a page") made btrfs_invalidatepage() search all ordered > extents. > > The offending code looks like this: > > again: > start = page_start; > ordered = btrfs_lookup_ordered_range(inode, start, page_end - start + > 1); > if (ordered) { > end = min(page_end, > ordered->file_offset + ordered->num_bytes - 1); > > /* Do the cleanup */ > > start = end + 1; > if (start < page_end) > goto again; > } > > The behavior is indeed necessary for the incoming subpage support, but > when it iterates through all the ordered extents, it also resets the > search range @start. > > This means, for the following case, we can double account an ordered > extent, causing its bytes_left to underflow: > > Page offset > 0 16K 32K > |<--- OE 1 --->|<--- OE 2 >| > > As the first iteration will find OE 1, which doesn't cover the full > page, after the cleanup code we need to retry. > But the again label will reset start to page_start, so we get OE 1 again, > which double accounts OE 1 and causes its bytes_left to > underflow. > > The only good news is, this problem can only happen for the subpage case, as > for the regular sectorsize == PAGE_SIZE case, we will always find an OE that ends > at or after the page end, thus no way to trigger the problem. > > This patch moves the again label to after start = page_start, to fix > the bug. > This is just a quick fix, which is easy to backport. Hum? Why does it need to be backported? There are no kernel releases that support subpage sector size yet and this does not affect the case where sector size is the same as the page size. > > There will be more comprehensive rework to convert the open coded loop to > a proper while loop. > > Fixes: dbfdb6d1b369 ("Btrfs: Search for all ordered extents that could span > across a page") > Signed-off-by: Qu Wenruo Anyway, it looks good to me, thanks. Reviewed-by: Filipe Manana > --- > fs/btrfs/inode.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index ef6cb7b620d0..2eea7d22405a 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -8184,8 +8184,9 @@ static void btrfs_invalidatepage(struct page *page, > unsigned int offset, > > if (!inode_evicting) > lock_extent_bits(tree, page_start, page_end, &cached_state); > -again: > + > start = page_start; > +again: > ordered = btrfs_lookup_ordered_range(inode, start, page_end - start + > 1); > if (ordered) { > found_ordered = true; > -- > 2.30.0 > -- Filipe David Manana, “Whether you think you can, or you think you can't — you're right.”
Re: [PATCH] btrfs: fix a bug that btrfs_invalidatepage() can double account ordered extent for subpage
On 2021/1/27 6:44 PM, Filipe Manana wrote: On Wed, Jan 27, 2021 at 8:29 AM Qu Wenruo wrote: Commit dbfdb6d1b369 ("Btrfs: Search for all ordered extents that could span across a page") made btrfs_invalidatepage() search all ordered extents. The offending code looks like this: again: start = page_start; ordered = btrfs_lookup_ordered_range(inode, start, page_end - start + 1); if (ordered) { end = min(page_end, ordered->file_offset + ordered->num_bytes - 1); /* Do the cleanup */ start = end + 1; if (start < page_end) goto again; } The behavior is indeed necessary for the incoming subpage support, but when it iterates through all the ordered extents, it also resets the search range @start. This means, for the following case, we can double account an ordered extent, causing its bytes_left to underflow: Page offset 0 16K 32K |<--- OE 1 --->|<--- OE 2 >| As the first iteration will find OE 1, which doesn't cover the full page, after the cleanup code we need to retry. But the again label will reset start to page_start, so we get OE 1 again, which double accounts OE 1 and causes its bytes_left to underflow. The only good news is, this problem can only happen for the subpage case, as for the regular sectorsize == PAGE_SIZE case, we will always find an OE that ends at or after the page end, thus no way to trigger the problem. This patch moves the again label to after start = page_start, to fix the bug. This is just a quick fix, which is easy to backport. Hum? Why does it need to be backported? There are no kernel releases that support subpage sector size yet and this does not affect the case where sector size is the same as the page size. Hmmm, right, there is no need to backport. Hope David can remove that line when merging. Thanks, Qu There will be more comprehensive rework to convert the open coded loop to a proper while loop. Fixes: dbfdb6d1b369 ("Btrfs: Search for all ordered extents that could span across a page") Signed-off-by: Qu Wenruo Anyway, it looks good to me, thanks. Reviewed-by: Filipe Manana --- fs/btrfs/inode.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index ef6cb7b620d0..2eea7d22405a 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -8184,8 +8184,9 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset, if (!inode_evicting) lock_extent_bits(tree, page_start, page_end, &cached_state); -again: + start = page_start; +again: ordered = btrfs_lookup_ordered_range(inode, start, page_end - start + 1); if (ordered) { found_ordered = true; -- 2.30.0
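As a sketch of the "proper while loop" rework mentioned in the changelog (illustrative only, since that conversion had not been posted at this point), the open coded goto loop in btrfs_invalidatepage() could become:

	u64 start = page_start;

	while (start < page_end) {
		struct btrfs_ordered_extent *ordered;
		u64 end;

		ordered = btrfs_lookup_ordered_range(inode, start,
						     page_end - start + 1);
		if (!ordered)
			break;
		end = min(page_end,
			  ordered->file_offset + ordered->num_bytes - 1);

		/* Do the per-ordered-extent cleanup for [start, end] here. */

		btrfs_put_ordered_extent(ordered);
		/* Never rewind: continue right after the handled range. */
		start = end + 1;
	}

Keeping the search start strictly advancing inside the loop makes the double accounting structurally impossible, which is the same property the quick fix achieves by moving the again label.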
Re: [PATCH v2 3/4] btrfs: send: fix invalid commands for inodes with changed rdev but same gen
On Tue, Jan 26, 2021 at 7:19 PM Roman Anasal | BDSU wrote: > > On Monday, 25.01.2021 at 20:51, Filipe Manana wrote: > > On Mon, Jan 25, 2021 at 7:51 PM Roman Anasal > > wrote: > > > This is analogous to the preceding patch ("btrfs: send: fix invalid > > > commands for inodes with changed type but same gen") but for changed > > > rdev: > > > > > > When doing an incremental send, if a new inode has the same number as an > > > inode in the parent subvolume, was created with the same generation but > > > has differing rdev it will not be detected as changed and thus not > > > recreated. This will lead to incorrect results on the receiver where the > > > inode will keep the rdev of the inode in the parent subvolume or even > > > fail when also the ref is unchanged. > > > > > > This case does not happen when doing incremental sends with snapshots > > > that are kept read-only by the user all the time, but it may happen if > > > - a snapshot was modified in the same transaction as its parent after it > > > was created > > > - the subvol used as parent was created independently from the sent > > > subvol > > > > > > Example reproducers: > > > > > > # case 1: same ino at same path > > > btrfs subvolume create subvol1 > > > btrfs subvolume create subvol2 > > > mknod subvol1/a c 1 3 > > > mknod subvol2/a c 1 5 > > > btrfs property set subvol1 ro true > > > btrfs property set subvol2 ro true > > > btrfs send -p subvol1 subvol2 | btrfs receive --dump > > > > > > The produced tree state here is: > > > |-- subvol1 > > > | `-- a (ino 257, c 1,3) > > > | > > > `-- subvol2 > > > `-- a (ino 257, c 1,5) > > > > > > Where subvol1/a and subvol2/a are character devices with differing minor > > > numbers but same inode number and same generation. > > > > > > Example output of the receive command: > > > At subvol subvol2 > > > snapshot ./subvol2 uuid=7513941c-4ef7-f847-b05e-4fdfe003af7b transid=9 parent_uuid=b66f015b-c226-2548-9e39-048c7fdbec99 parent_transid=9 > > > utimes ./subvol2/ atime=2021-01-25T17:14:36+ mtime=2021-01-25T17:14:36+ ctime=2021-01-25T17:14:36+ > > > link ./subvol2/a dest=a > > > unlink ./subvol2/a > > > utimes ./subvol2/ atime=2021-01-25T17:14:36+ mtime=2021-01-25T17:14:36+ ctime=2021-01-25T17:14:36+ > > > utimes ./subvol2/a atime=2021-01-25T17:14:36+ mtime=2021-01-25T17:14:36+ ctime=2021-01-25T17:14:36+ > > > > > > => the `link` command causes the receiver to fail with: > > >ERROR: link a -> a failed: File exists > > > > > > Second example: > > > # case 2: same ino at different path > > > btrfs subvolume create subvol1 > > > btrfs subvolume create subvol2 > > > mknod subvol1/a c 1 3 > > > mknod subvol2/b c 1 5 > > > btrfs property set subvol1 ro true > > > btrfs property set subvol2 ro true > > > btrfs send -p subvol1 subvol2 | btrfs receive --dump > > > > As I've told you before for the v1 patchset from a week or two ago, > > this is not a supported scenario for incremental sends. > > Incremental sends are meant to be used on RO snapshots of the same > > subvolume, and those snapshots must never be changed after they were > > created. > > > > Incremental sends were simply not designed for these cases, and can > > never be guaranteed to work with such cases. > > > > The bug is not having incremental sends fail right away, with an > > explicit error message, when the send and parent roots aren't RO > > snapshots of the same subvolume.
> > I am sorry, I didn't want to anger you or to appear to be just stubborn > by posting this. > > As I wrote in the cover letter I am aware that this is not a supported > use case and I understand that that makes the patches likely to be > rejected. Ok, now I got the cover letter and the remaining v2 patches. Vger has been having some lag this week, only got the mails during the last evening. Thanks. > As said the reason I _wrote_ the patches was simply to learn more about > the btrfs code and its internals and see if I would be able to > understand it enough. The reason I _submitted_ them was just to > document what I found out so others could have a look into it and just > in case it maybe useful at a later time. > > I also don't want to claim that these will add full support for sending > unrelated roots - they don't! They only handle those very specific edge > cases I found, which are currently _possible_, although still not > supported. > > I took a deeper look into the rest to see if it could be supported: > the comparing algorithm actually works fine, even with completely > unrelated subvolumes (i.e. btrfs_compare_trees, changed_cb, > changed_inode etc.), but the processing of
Re: [PATCH] btrfs: Avoid calling btrfs_get_chunk_map() twice
On Wed, Jan 27, 2021 at 9:59 AM Michal Rostecki wrote: > > From: Michal Rostecki > > Before this change, the btrfs_get_io_geometry() function was calling > btrfs_get_chunk_map() to get the extent mapping, necessary for > calculating the I/O geometry. It was using that extent mapping only > internally and freeing the pointer after its execution. > > That resulted in calling btrfs_get_chunk_map() de facto twice by the > __btrfs_map_block() function. It was calling btrfs_get_io_geometry() > first and then calling btrfs_get_chunk_map() directly to get the extent > mapping, used by the rest of the function. > > This change fixes that by passing the extent mapping to the > btrfs_get_io_geometry() function as an argument. > > Fixes: 89b798ad1b42 ("btrfs: Use btrfs_get_io_geometry appropriately") Generally we only use the Fixes tag for bug fixes or serious performance regressions. Have you seen here a serious performance regression? > Signed-off-by: Michal Rostecki > --- > fs/btrfs/inode.c | 37 - > fs/btrfs/volumes.c | 39 --- > fs/btrfs/volumes.h | 5 +++-- > 3 files changed, 47 insertions(+), 34 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 0dbe1aaa0b71..a4ce8501ed4d 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -2183,9 +2183,10 @@ int btrfs_bio_fits_in_stripe(struct page *page, size_t > size, struct bio *bio, > struct inode *inode = page->mapping->host; > struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); > u64 logical = bio->bi_iter.bi_sector << 9; > + struct extent_map *em; > u64 length = 0; > u64 map_length; > - int ret; > + int ret = 0; > struct btrfs_io_geometry geom; > > if (bio_flags & EXTENT_BIO_COMPRESSED) > @@ -2193,14 +2194,21 @@ int btrfs_bio_fits_in_stripe(struct page *page, > size_t size, struct bio *bio, > > length = bio->bi_iter.bi_size; > map_length = length; > - ret = btrfs_get_io_geometry(fs_info, btrfs_op(bio), logical, > map_length, > - &geom); > + em = btrfs_get_chunk_map(fs_info, logical, map_length); > + if (IS_ERR(em)) > + return PTR_ERR(em); > + ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio), logical, > + map_length, &geom); > if (ret < 0) > - return ret; > + goto out; > > - if (geom.len < length + size) > - return 1; > - return 0; > + if (geom.len < length + size) { > + ret = 1; > + goto out; > + } > +out: > + free_extent_map(em); > + return ret; > } > > /* > @@ -7941,10 +7949,12 @@ static blk_qc_t btrfs_submit_direct(struct inode > *inode, struct iomap *iomap, > u64 submit_len; > int clone_offset = 0; > int clone_len; > + int logical; > int ret; > blk_status_t status; > struct btrfs_io_geometry geom; > struct btrfs_dio_data *dio_data = iomap->private; > + struct extent_map *em; > > dip = btrfs_create_dio_private(dio_bio, inode, file_offset); > if (!dip) { > @@ -7970,11 +7980,17 @@ static blk_qc_t btrfs_submit_direct(struct inode > *inode, struct iomap *iomap, > } > > start_sector = dio_bio->bi_iter.bi_sector; > + logical = start_sector << 9; > submit_len = dio_bio->bi_iter.bi_size; > > do { > - ret = btrfs_get_io_geometry(fs_info, btrfs_op(dio_bio), > - start_sector << 9, submit_len, > + em = btrfs_get_chunk_map(fs_info, logical, submit_len); > + if (IS_ERR(em)) { > + status = errno_to_blk_status(ret); > + goto out_err; em must be set to NULL before going to "out_err", otherwise we get a crash due to an invalid memory access. Also, status should be set to "errno_to_blk_status(PTR_ERR(em))". The value of ret at this point is undefined. Other than that, it looks good. Thanks. 
> + } > + ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(dio_bio), > + logical, submit_len, > &geom); > if (ret) { > status = errno_to_blk_status(ret); > @@ -8030,12 +8046,15 @@ static blk_qc_t btrfs_submit_direct(struct inode > *inode, struct iomap *iomap, > clone_offset += clone_len; > start_sector += clone_len >> 9; > file_offset += clone_len; > + > + free_extent_map(em); > } while (submit_len > 0); > return BLK_QC_T_NONE; > > out_err: > dip->dio_bio->bi_status = status; > btrfs_dio_private_put(dip); > + free_extent_map(em); > return BLK_QC_T_NONE; > }
Re: [PATCH] fs: btrfs: Select SHA256 in Kconfig
On Wed, Jan 27, 2021 at 10:42:30AM +0100, matthias@kernel.org wrote: > From: Matthias Brugger > > Since commit 565a4147d17a ("fs: btrfs: Add more checksum algorithms") > btrfs uses the sha256 checksum algorithm. But Kconfig fails to select > it. This leads to compilation errors: > fs/built-in.o: In function `hash_sha256': > fs/btrfs/crypto/hash.c:25: undefined reference to `sha256_starts' > fs/btrfs/crypto/hash.c:26: undefined reference to `sha256_update' > fs/btrfs/crypto/hash.c:27: undefined reference to `sha256_finish' > > Signed-off-by: Matthias Brugger So this is a fix for u-boot; it got me confused, and not for the first time, as there's a Kconfig and the same fs/btrfs/ directory structure.
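The shape of the fix, as the changelog describes it, is a one-line select in U-Boot's fs/btrfs/Kconfig. A sketch, under the assumption that the symbol is FS_BTRFS and that the surrounding lines look something like this (only the added select matters):

config FS_BTRFS
	bool "Enable BTRFS filesystem support"
	select SHA256
	help
	  Read-only btrfs support for U-Boot, used e.g. for loading
	  kernels from btrfs filesystems.

With SHA256 selected, the generic sha256_starts()/sha256_update()/sha256_finish() implementations get linked in and the undefined references quoted above go away.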
Re: [PATCH] btrfs: Avoid calling btrfs_get_chunk_map() twice
On Wed, Jan 27, 2021 at 11:20:55AM +, Filipe Manana wrote: > On Wed, Jan 27, 2021 at 9:59 AM Michal Rostecki wrote: > > > > From: Michal Rostecki > > > > Before this change, the btrfs_get_io_geometry() function was calling > > btrfs_get_chunk_map() to get the extent mapping, necessary for > > calculating the I/O geometry. It was using that extent mapping only > > internally and freeing the pointer after its execution. > > > > That resulted in calling btrfs_get_chunk_map() de facto twice by the > > __btrfs_map_block() function. It was calling btrfs_get_io_geometry() > > first and then calling btrfs_get_chunk_map() directly to get the extent > > mapping, used by the rest of the function. > > > > This change fixes that by passing the extent mapping to the > > btrfs_get_io_geometry() function as an argument. > > > > Fixes: 89b798ad1b42 ("btrfs: Use btrfs_get_io_geometry appropriately") > > Generally we only use the Fixes tag for bug fixes or serious > performance regressions. > Have you seen here a serious performance regression? > No, I didn't see any big difference in terms of performance (at least in seq reads). I will remove the tag in v2. > > Signed-off-by: Michal Rostecki > > --- > > fs/btrfs/inode.c | 37 - > > fs/btrfs/volumes.c | 39 --- > > fs/btrfs/volumes.h | 5 +++-- > > 3 files changed, 47 insertions(+), 34 deletions(-) > > > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > > index 0dbe1aaa0b71..a4ce8501ed4d 100644 > > --- a/fs/btrfs/inode.c > > +++ b/fs/btrfs/inode.c > > @@ -2183,9 +2183,10 @@ int btrfs_bio_fits_in_stripe(struct page *page, > > size_t size, struct bio *bio, > > struct inode *inode = page->mapping->host; > > struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); > > u64 logical = bio->bi_iter.bi_sector << 9; > > + struct extent_map *em; > > u64 length = 0; > > u64 map_length; > > - int ret; > > + int ret = 0; > > struct btrfs_io_geometry geom; > > > > if (bio_flags & EXTENT_BIO_COMPRESSED) > > @@ -2193,14 +2194,21 @@ int btrfs_bio_fits_in_stripe(struct page *page, > > size_t size, struct bio *bio, > > > > length = bio->bi_iter.bi_size; > > map_length = length; > > - ret = btrfs_get_io_geometry(fs_info, btrfs_op(bio), logical, > > map_length, > > - &geom); > > + em = btrfs_get_chunk_map(fs_info, logical, map_length); > > + if (IS_ERR(em)) > > + return PTR_ERR(em); > > + ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio), logical, > > + map_length, &geom); > > if (ret < 0) > > - return ret; > > + goto out; > > > > - if (geom.len < length + size) > > - return 1; > > - return 0; > > + if (geom.len < length + size) { > > + ret = 1; > > + goto out; > > + } > > +out: > > + free_extent_map(em); > > + return ret; > > } > > > > /* > > @@ -7941,10 +7949,12 @@ static blk_qc_t btrfs_submit_direct(struct inode > > *inode, struct iomap *iomap, > > u64 submit_len; > > int clone_offset = 0; > > int clone_len; > > + int logical; > > int ret; > > blk_status_t status; > > struct btrfs_io_geometry geom; > > struct btrfs_dio_data *dio_data = iomap->private; > > + struct extent_map *em; > > > > dip = btrfs_create_dio_private(dio_bio, inode, file_offset); > > if (!dip) { > > @@ -7970,11 +7980,17 @@ static blk_qc_t btrfs_submit_direct(struct inode > > *inode, struct iomap *iomap, > > } > > > > start_sector = dio_bio->bi_iter.bi_sector; > > + logical = start_sector << 9; > > submit_len = dio_bio->bi_iter.bi_size; > > > > do { > > - ret = btrfs_get_io_geometry(fs_info, btrfs_op(dio_bio), > > - start_sector << 9, submit_len, > > + em = btrfs_get_chunk_map(fs_info, logical, submit_len); 
> > + if (IS_ERR(em)) { > > + status = errno_to_blk_status(ret); > > + goto out_err; > > em must be set to NULL before going to "out_err", otherwise we get a > crash due to an invalid memory access. > > Also, status should be set to "errno_to_blk_status(PTR_ERR(em))". The > value of ret at this point is undefined. > > Other than that, it looks good. > > Thanks. > Will fix in v2. Thanks! > > + } > > + ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(dio_bio), > > + logical, submit_len, > > &geom); > > if (ret) { > > status = errno_to_blk_status(ret); > > @@ -8030,12 +8046,15 @@ static blk_qc_t btrfs_submit_direct(struc
Re: [PATCH] fs: btrfs: Select SHA256 in Kconfig
On 2021/1/27 8:01 PM, David Sterba wrote: On Wed, Jan 27, 2021 at 10:42:30AM +0100, matthias@kernel.org wrote: From: Matthias Brugger Since commit 565a4147d17a ("fs: btrfs: Add more checksum algorithms") btrfs uses the sha256 checksum algorithm. But Kconfig fails to select it. This leads to compilation errors: fs/built-in.o: In function `hash_sha256': fs/btrfs/crypto/hash.c:25: undefined reference to `sha256_starts' fs/btrfs/crypto/hash.c:26: undefined reference to `sha256_update' fs/btrfs/crypto/hash.c:27: undefined reference to `sha256_finish' Signed-off-by: Matthias Brugger So this is a fix for u-boot; it got me confused, and not for the first time, as there's a Kconfig and the same fs/btrfs/ directory structure. Well, sometimes a too-unified file structure/code base can also be a problem. Considering I'm also going to continue cross-porting more code to U-boot, any recommendations on this? Using a different prefix? Thanks, Qu
Re: Only one subvolume can be mounted after replace/balance
Thank you Chris, it's resolved now, see below. On 25.01.21 at 23:47, Chris Murphy wrote: On Sat, Jan 23, 2021 at 7:50 AM Jakob Schöttl wrote: Hi, In short: When mounting a second subvolume from a pool, I get this error: "mount: /mnt: wrong fs type, bad option, bad superblock on /dev/sda, missing code page or helper program, or other." dmesg | grep BTRFS only shows this error: info (device sda): disk space caching is enabled error (device sda): Remounting read-write after error is not allowed It went read-only before this because it's confused. You need to unmount it before it can be mounted rw. In some cases a reboot is needed. Oh, I didn't notice that the pool was already mounted (via fstab). The filesystem was out of space and I had to resize both disks separately. And I had to mount with -o skip_balance for that. Now it works again. What happened: In my RAID1 pool with two disks, I successfully replaced one disk with btrfs replace start 2 /dev/sdx After that, I mounted the pool and did I don't understand this sequence. In order to do a replace, the file system is already mounted. That was what I did before my actual problem occurred. But it's resolved now. btrfs fi show /mnt which showed WARNINGs about "filesystems with multiple block group profiles detected" (don't remember exactly) I thought it was a good idea to do btrfs balance start /mnt which finished without errors. Balance alone does not convert block groups to a new profile. You have to explicitly select a conversion filter, e.g. btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt I didn't want to convert to a new profile. I thought btrfs replace automatically uses the same profile as the pool? Regards, Jakob
Re: btrfs becomes read-only
I managed to run btrfs check, but it didn't solve the problem: aleksey@host:~$ sudo btrfs check --repair /dev/sdg [sudo] password for aleksey: enabling repair mode Checking filesystem on /dev/sdg UUID: 070ce9af-6511-4b89-a501-0823514320c1 checking extents parent transid verify failed on 52180048330752 wanted 132477 found 132432 parent transid verify failed on 52180048330752 wanted 132477 found 132432 parent transid verify failed on 52180048330752 wanted 132477 found 132432 parent transid verify failed on 52180048330752 wanted 132477 found 132432 Ignoring transid failure leaf parent key incorrect 52180048330752 bad block 52180048330752 Errors found in extent allocation tree or chunk allocation parent transid verify failed on 52180048330752 wanted 132477 found 132432 Ignoring transid failure Fixed 0 roots. checking free space cache parent transid verify failed on 52180048330752 wanted 132477 found 132432 Ignoring transid failure There is no free space entry for 5100308647936-5101376765952 cache appears valid but isnt 5100303024128 block group 5143252697088 has wrong amount of free space failed to load free space cache for block group 5143252697088 block group 10289697259520 has wrong amount of free space failed to load free space cache for block group 10289697259520 block group 10447537307648 has wrong amount of free space failed to load free space cache for block group 10447537307648 block group 10856632942592 has wrong amount of free space failed to load free space cache for block group 10856632942592 block group 10877034037248 has wrong amount of free space failed to load free space cache for block group 10877034037248 block group 10901730099200 has wrong amount of free space failed to load free space cache for block group 10901730099200 block group 37427448119296 has wrong amount of free space failed to load free space cache for block group 37427448119296 block group 40151531126784 has wrong amount of free space failed to load free space cache for block group 40151531126784 block group 40155826094080 has wrong amount of free space failed to load free space cache for block group 40155826094080 block group 45729619902464 has wrong amount of free space failed to load free space cache for block group 45729619902464 block group 45751094738944 has wrong amount of free space failed to load free space cache for block group 45751094738944 block group 45775790800896 has wrong amount of free space failed to load free space cache for block group 45775790800896 block group 45797265637376 has wrong amount of free space failed to load free space cache for block group 45797265637376 block group 45839141568512 has wrong amount of free space failed to load free space cache for block group 45839141568512 found 31266648615205 bytes used err is -22 total csum bytes: 0 total tree bytes: 2949169152 total fs tree bytes: 0 total extent tree bytes: 2927755264 btree space waste bytes: 813795673 file data blocks allocated: 9188147200 referenced 9188147200 Best regards, Alexey Isaev, RQC On 27.01.2021 10:38, Chris Murphy wrote: On Wed, Jan 27, 2021 at 12:22 AM Alexey Isaev wrote: Hello! BTRFS volume becomes read-only with these messages in dmesg. What can I do to repair the btrfs partition?
[Jan25 08:18] BTRFS error (device sdg): parent transid verify failed on 52180048330752 wanted 132477 found 132432 [ +0.007587] BTRFS error (device sdg): parent transid verify failed on 52180048330752 wanted 132477 found 132432 [ +0.000132] BTRFS error (device sdg): qgroup scan failed with -5 [Jan25 19:52] BTRFS error (device sdg): parent transid verify failed on 52180048330752 wanted 132477 found 132432 [ +0.009783] BTRFS error (device sdg): parent transid verify failed on 52180048330752 wanted 132477 found 132432 [ +0.000132] BTRFS: error (device sdg) in __btrfs_cow_block:1176: errno=-5 IO failure [ +0.60] BTRFS info (device sdg): forced readonly [ +0.04] BTRFS info (device sdg): failed to delete reference to ftrace.h, inode 2986197 parent 2989315 [ +0.02] BTRFS: error (device sdg) in __btrfs_unlink_inode:4220: errno=-5 IO failure [ +0.006071] BTRFS error (device sdg): pending csums is 430080 What kernel version? What drive make/model? wanted 132477 found 132432 indicates the drive has lost ~45 transactions, that's not good and also weird. There's no crash or any other errors? A complete dmesg might be more revealing. And also smartctl -x /dev/sdg btrfs check --readonly /dev/sdg After that I suggest https://btrfs.wiki.kernel.org/index.php/Restore And try to get any important data out if it's not backed up. You can try btrfs-find-root to get a listing of roots, most recent to oldest. Start at the top, and plug that address in as 'btrfs restore -t' and see if it'll pull anything out. You likely need -i and -v options as well.
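Spelled out as commands, the suggested recovery flow looks like this (the bytenr placeholder comes from the btrfs-find-root listing, and /mnt/recovery is an assumed destination directory with enough free space):

# list candidate tree roots, most recent first
sudo btrfs-find-root /dev/sdg

# try pulling data out read-only from a given root; -t takes a bytenr
# from the listing above, -i ignores errors, -v prints what is restored
sudo btrfs restore -t <bytenr> -i -v /dev/sdg /mnt/recovery

If the newest root fails, retry with the next bytenr down the listing.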
Re: [PATCH] fs: btrfs: Select SHA256 in Kconfig
On Wed, Jan 27, 2021 at 08:14:31PM +0800, Qu Wenruo wrote: > > > On 2021/1/27 下午8:01, David Sterba wrote: > > On Wed, Jan 27, 2021 at 10:42:30AM +0100, matthias@kernel.org wrote: > >> From: Matthias Brugger > >> > >> Since commit 565a4147d17a ("fs: btrfs: Add more checksum algorithms") > >> btrfs uses the sha256 checksum algorithm. But Kconfig lacks to select > >> it. This leads to compilation errors: > >> fs/built-in.o: In function `hash_sha256': > >> fs/btrfs/crypto/hash.c:25: undefined reference to `sha256_starts' > >> fs/btrfs/crypto/hash.c:26: undefined reference to `sha256_update' > >> fs/btrfs/crypto/hash.c:27: undefined reference to `sha256_finish' > >> > >> Signed-off-by: Matthias Brugger > > > > So this is a fix for u-boot, got me confused and not for the first time > > as there's Kconfig and the same fs/btrfs/ directory structure. > > > Well, sometimes too unified file structure/code base can also be a problem. > > Considering I'm also going to continue cross-porting more code to > U-boot, any recommendation on this? > Using different prefix? If it's a series then please mention u-boot in the cover letter, no need to change the patches, I'll go check CC if I'm too confused about the patch.
Re: [PATCH] fs: btrfs: Select SHA256 in Kconfig
On Wed, 27 Jan 2021 14:47:58 +0100 David Sterba wrote: > If it's a series then please mention u-boot in the cover letter, no need > to change the patches, I'll go check CC if I'm too confused about the > patch. I would suggest using subject prefix [PATCH u-boot]
[PATCH v2] btrfs: Avoid calling btrfs_get_chunk_map() twice
From: Michal Rostecki Before this change, the btrfs_get_io_geometry() function was calling btrfs_get_chunk_map() to get the extent mapping, necessary for calculating the I/O geometry. It was using that extent mapping only internally and freeing the pointer after its execution. That resulted in calling btrfs_get_chunk_map() de facto twice by the __btrfs_map_block() function. It was calling btrfs_get_io_geometry() first and then calling btrfs_get_chunk_map() directly to get the extent mapping, used by the rest of the function. This change fixes that by passing the extent mapping to the btrfs_get_io_geometry() function as an argument. v2: When btrfs_get_chunk_map() returns an error in btrfs_submit_direct(): - Use errno_to_blk_status(PTR_ERR(em)) as the status - Set em to NULL Signed-off-by: Michal Rostecki --- fs/btrfs/inode.c | 38 +- fs/btrfs/volumes.c | 39 --- fs/btrfs/volumes.h | 5 +++-- 3 files changed, 48 insertions(+), 34 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 0dbe1aaa0b71..e2ee3a9c1140 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -2183,9 +2183,10 @@ int btrfs_bio_fits_in_stripe(struct page *page, size_t size, struct bio *bio, struct inode *inode = page->mapping->host; struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); u64 logical = bio->bi_iter.bi_sector << 9; + struct extent_map *em; u64 length = 0; u64 map_length; - int ret; + int ret = 0; struct btrfs_io_geometry geom; if (bio_flags & EXTENT_BIO_COMPRESSED) @@ -2193,14 +2194,21 @@ int btrfs_bio_fits_in_stripe(struct page *page, size_t size, struct bio *bio, length = bio->bi_iter.bi_size; map_length = length; - ret = btrfs_get_io_geometry(fs_info, btrfs_op(bio), logical, map_length, - &geom); + em = btrfs_get_chunk_map(fs_info, logical, map_length); + if (IS_ERR(em)) + return PTR_ERR(em); + ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio), logical, + map_length, &geom); if (ret < 0) - return ret; + goto out; - if (geom.len < length + size) - return 1; - return 0; + if (geom.len < length + size) { + ret = 1; + goto out; + } +out: + free_extent_map(em); + return ret; } /* @@ -7941,10 +7949,12 @@ static blk_qc_t btrfs_submit_direct(struct inode *inode, struct iomap *iomap, u64 submit_len; int clone_offset = 0; int clone_len; + int logical; int ret; blk_status_t status; struct btrfs_io_geometry geom; struct btrfs_dio_data *dio_data = iomap->private; + struct extent_map *em; dip = btrfs_create_dio_private(dio_bio, inode, file_offset); if (!dip) { @@ -7970,11 +7980,18 @@ static blk_qc_t btrfs_submit_direct(struct inode *inode, struct iomap *iomap, } start_sector = dio_bio->bi_iter.bi_sector; + logical = start_sector << 9; submit_len = dio_bio->bi_iter.bi_size; do { - ret = btrfs_get_io_geometry(fs_info, btrfs_op(dio_bio), - start_sector << 9, submit_len, + em = btrfs_get_chunk_map(fs_info, logical, submit_len); + if (IS_ERR(em)) { + status = errno_to_blk_status(PTR_ERR(em)); + em = NULL; + goto out_err; + } + ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(dio_bio), + logical, submit_len, &geom); if (ret) { status = errno_to_blk_status(ret); @@ -8030,12 +8047,15 @@ static blk_qc_t btrfs_submit_direct(struct inode *inode, struct iomap *iomap, clone_offset += clone_len; start_sector += clone_len >> 9; file_offset += clone_len; + + free_extent_map(em); } while (submit_len > 0); return BLK_QC_T_NONE; out_err: dip->dio_bio->bi_status = status; btrfs_dio_private_put(dip); + free_extent_map(em); return BLK_QC_T_NONE; } diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 
a8ec8539cd8d..4c753b17c0a2 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -5940,23 +5940,24 @@ static bool need_full_stripe(enum btrfs_map_op op) } /* - * btrfs_get_io_geometry - calculates the geomery of a particular (address, len) + * btrfs_get_io_geometry - calculates the geometry of a particular (address, len) *tuple. This information is used to calculate how big a *particular bio can get before it straddles a stripe. * - * @fs_info
Re: [PATCH v2] btrfs: Avoid calling btrfs_get_chunk_map() twice
On 27.01.21 г. 15:57 ч., Michal Rostecki wrote: > From: Michal Rostecki > > Before this change, the btrfs_get_io_geometry() function was calling > btrfs_get_chunk_map() to get the extent mapping, necessary for > calculating the I/O geometry. It was using that extent mapping only > internally and freeing the pointer after its execution. > > That resulted in calling btrfs_get_chunk_map() de facto twice by the > __btrfs_map_block() function. It was calling btrfs_get_io_geometry() > first and then calling btrfs_get_chunk_map() directly to get the extent > mapping, used by the rest of the function. > > This change fixes that by passing the extent mapping to the > btrfs_get_io_geometry() function as an argument. > > v2: > When btrfs_get_chunk_map() returns an error in btrfs_submit_direct(): > - Use errno_to_blk_status(PTR_ERR(em)) as the status > - Set em to NULL > > Signed-off-by: Michal Rostecki > --- > fs/btrfs/inode.c | 38 +- > fs/btrfs/volumes.c | 39 --- > fs/btrfs/volumes.h | 5 +++-- > 3 files changed, 48 insertions(+), 34 deletions(-) So this adds more code but for what benefit? In your reply to Filipe you said you didn't observe this being a performance-affecting change so > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 0dbe1aaa0b71..e2ee3a9c1140 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -2183,9 +2183,10 @@ int btrfs_bio_fits_in_stripe(struct page *page, size_t > size, struct bio *bio, > struct inode *inode = page->mapping->host; > struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); > u64 logical = bio->bi_iter.bi_sector << 9; > + struct extent_map *em; > u64 length = 0; > u64 map_length; > - int ret; > + int ret = 0; > struct btrfs_io_geometry geom; > > if (bio_flags & EXTENT_BIO_COMPRESSED) > @@ -2193,14 +2194,21 @@ int btrfs_bio_fits_in_stripe(struct page *page, > size_t size, struct bio *bio, > > length = bio->bi_iter.bi_size; > map_length = length; > - ret = btrfs_get_io_geometry(fs_info, btrfs_op(bio), logical, map_length, > - &geom); > + em = btrfs_get_chunk_map(fs_info, logical, map_length); > + if (IS_ERR(em)) > + return PTR_ERR(em); > + ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio), logical, > + map_length, &geom); > if (ret < 0) > - return ret; > + goto out; > > - if (geom.len < length + size) > - return 1; > - return 0; > + if (geom.len < length + size) { > + ret = 1; > + goto out; > + } this could be simply if (geom.len +out: > + free_extent_map(em); > + return ret; > } > > /* > @@ -7941,10 +7949,12 @@ static blk_qc_t btrfs_submit_direct(struct inode > *inode, struct iomap *iomap, > u64 submit_len; > int clone_offset = 0; > int clone_len; > + int logical; > int ret; > blk_status_t status; > struct btrfs_io_geometry geom; > struct btrfs_dio_data *dio_data = iomap->private; > + struct extent_map *em; > > dip = btrfs_create_dio_private(dio_bio, inode, file_offset); > if (!dip) { > @@ -7970,11 +7980,18 @@ static blk_qc_t btrfs_submit_direct(struct inode > *inode, struct iomap *iomap, > } > > start_sector = dio_bio->bi_iter.bi_sector; > + logical = start_sector << 9; > submit_len = dio_bio->bi_iter.bi_size; > > do { > - ret = btrfs_get_io_geometry(fs_info, btrfs_op(dio_bio), > - start_sector << 9, submit_len, > + em = btrfs_get_chunk_map(fs_info, logical, submit_len); > + if (IS_ERR(em)) { > + status = errno_to_blk_status(PTR_ERR(em)); > + em = NULL; > + goto out_err; > + } > + ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(dio_bio), > + logical, submit_len, > &geom); > if (ret) { > status = errno_to_blk_status(ret); > @@ -8030,12 
+8047,15 @@ static blk_qc_t btrfs_submit_direct(struct inode > *inode, struct iomap *iomap, > clone_offset += clone_len; > start_sector += clone_len >> 9; > file_offset += clone_len; > + > + free_extent_map(em); > } while (submit_len > 0); > return BLK_QC_T_NONE; > > out_err: > dip->dio_bio->bi_status = status; > btrfs_dio_private_put(dip); > + free_extent_map(em); > return BLK_QC_T_NONE; > } For example in this function you increase complexity by having to deal with free_extent_map as well so I'm not sure this is a net-win. > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index a8ec8539cd8d..4c753b17
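In outline, the change under discussion is this (a sketch of the call shape taken from the diff above, not the complete patch): the caller now performs the chunk-map lookup once and passes the extent mapping down, becoming responsible for freeing it.

    /*
     * Before: btrfs_get_io_geometry() called btrfs_get_chunk_map() itself,
     * so callers that also needed the extent mapping did a second lookup.
     *
     * After (sketch): one lookup, the map is passed in and freed by the caller.
     */
    em = btrfs_get_chunk_map(fs_info, logical, map_length);
    if (IS_ERR(em))
            return PTR_ERR(em);
    ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio), logical,
                                map_length, &geom);
    /* ... use geom ... */
    free_extent_map(em);

The review comments in the thread weigh exactly this trade: one saved lookup versus every caller now having to manage the extent map's lifetime.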
[PATCH] btrfs: remove wrong comment for can_nocow_extent()
From: Filipe Manana The comment for can_nocow_extent() says that the function will flush ordered extents, however that never happens and was never true before the comment was added in commit e4ecaf90bc13 ("btrfs: add comments for btrfs_check_can_nocow() and can_nocow_extent()"). This is true only for the function btrfs_check_can_nocow(), which after that commit was renamed to check_can_nocow(). So just remove that part of the comment. Signed-off-by: Filipe Manana --- fs/btrfs/inode.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 0dbe1aaa0b71..589030cefd90 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -7105,9 +7105,6 @@ static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode, * @strict:if true, omit optimizations that might force us into unnecessary * cow. e.g., don't trust generation number. * - * This function will flush ordered extents in the range to ensure proper - * nocow checks for (nowait == false) case. - * * Return: * >0 and update @len if we can do nocow write * 0 if we can't do nocow write -- 2.28.0
Re: [PATCH v4 00/12] Improve preemptive ENOSPC flushing
On Tue, Jan 26, 2021 at 04:24:24PM -0500, Josef Bacik wrote: > v3->v4: > - Added some comments. > - Updated the tracepoint for the preemptive flushing to take a bool instead of > an int. > - Removed 'inline' from need_preemptive_flushing. Thanks, I'll add it to misc-next.
Re: [PATCH 6/7] btrfs: remove unnecessary check_parent_dirs_for_sync()
On 1/27/21 5:34 AM, fdman...@kernel.org wrote: From: Filipe Manana

Whenever we fsync an inode, if it is a directory, a regular file that was created in the current transaction or has last_unlink_trans set to the generation of the current transaction, we check if any of its ancestor inodes (and the inode itself if it is a directory) can not be logged and need a fallback to a full transaction commit - if so, we return with a value of 1 in order to fall back to a transaction commit.

However we often do not need to fall back to a transaction commit because:

1) The ancestor inode is not an immediate parent, and therefore there is not an explicit request to log it and it is not needed either to guarantee the consistency of the inode originally asked to be logged (fsynced) or its immediate parent;

2) The ancestor inode was already logged before, in which case any link, unlink or rename operation updates the log as needed.

So for these two cases we can avoid an unnecessary transaction commit. Therefore remove check_parent_dirs_for_sync() and add a check at the top of btrfs_log_inode() to make us fall back immediately to a transaction commit when we are logging a directory inode that can not be logged and needs a full transaction commit. All we need to protect is the case where after renaming a file someone fsyncs only the old directory, which would result in losing the renamed file after a log replay.

This patch is part of a patchset comprised of the following patches:

  btrfs: remove unnecessary directory inode item update when deleting dir entry
  btrfs: stop setting nbytes when filling inode item for logging
  btrfs: avoid logging new ancestor inodes when logging new inode
  btrfs: skip logging directories already logged when logging all parents
  btrfs: skip logging inodes already logged when logging new entries
  btrfs: remove unnecessary check_parent_dirs_for_sync()
  btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

Performance results, after applying all patches, are mentioned in the change log of the last patch.

Signed-off-by: Filipe Manana

I'm having a hard time with this one. Previously we would commit the transaction if the inode was a regular file that was created in the current transaction and had been renamed. Now with this patch you're only committing the transaction if we are a directory and were renamed ourselves. Before if you already had directories A and B and then did something like

  echo "foo" > /mnt/test/A/blah
  fsync(/mnt/test/A/blah);
  fsync(/mnt/test/A);
  mv /mnt/test/A/blah /mnt/test/B
  fsync(/mnt/test/B/blah);

we would commit the transaction on this second fsync, but with your patch we are not. I suppose that's keeping in line with how fsync is allowed to work, but it's definitely a change in behavior from what we used to do. Not sure if that's good or not, I'll have to think about it. Thanks, Josef
Re: [PATCH] btrfs: remove wrong comment for can_nocow_extent()
On 1/27/21 10:05 AM, fdman...@kernel.org wrote: From: Filipe Manana The comment for can_nocow_extent() says that the function will flush ordered extents, however that never happens and was never true before the comment was added in commit e4ecaf90bc13 ("btrfs: add comments for btrfs_check_can_nocow() and can_nocow_extent()"). This is true only for the function btrfs_check_can_nocow(), which after that commit was renamed to check_can_nocow(). So just remove that part of the comment. Signed-off-by: Filipe Manana Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH v3 01/12] btrfs: make flush_space take a enum btrfs_flush_state instead of int
On Tue, Jan 26, 2021 at 03:32:36PM -0500, Josef Bacik wrote:
> On 1/26/21 1:36 PM, David Sterba wrote:
> > On Fri, Oct 09, 2020 at 09:28:18AM -0400, Josef Bacik wrote:
> >> I got an automated message from somebody who runs clang against our
> >> kernels and it's because I used the wrong enum type for what I passed
> >> into flush_space. Change the argument to be explicitly the enum we're
> >> expecting to make everything consistent. Maybe eventually gcc will
> >> catch errors like this.
> >
> > I can't find any such public report and none of the clang warnings seem
> > to be specific about that. Local run with clang 11 does not produce any
> > warning.
>
> IDK, it was a private email just to me with the following output from clang
>
> fs/btrfs/space-info.c:1115:12: warning: implicit conversion from enumeration type 'enum btrfs_flush_state' to different enumeration type 'enum btrfs_reserve_flush_enum' [-Wenum-conversion]
>         flush = FLUSH_DELALLOC;
> fs/btrfs/space-info.c:1120:12: warning: implicit conversion from enumeration type 'enum btrfs_flush_state' to different enumeration type 'enum btrfs_reserve_flush_enum' [-Wenum-conversion]
>         flush = FORCE_COMMIT_TRANS;
> fs/btrfs/space-info.c:1124:12: warning: implicit conversion from enumeration type 'enum btrfs_flush_state' to different enumeration type 'enum btrfs_reserve_flush_enum' [-Wenum-conversion]
>         flush = FLUSH_DELAYED_ITEMS_NR;
> fs/btrfs/space-info.c:1127:12: warning: implicit conversion from enumeration type 'enum btrfs_flush_state' to different enumeration type 'enum btrfs_reserve_flush_enum' [-Wenum-conversion]
>         flush = FLUSH_DELAYED_REFS_NR;
>
> I figure it made sense, might as well fix it. Do we have that option for our shiny new -W build options? Thanks,

We can add -Wenum-conversion sure, but neither gcc (10.2.1) nor clang (11.0.1) reproduces the warning for me so I wonder how useful it would be. I've also tried adding -Wextra and -Wall and the build is clean.

Per gcc documentation, -Wenum-conversion is not exactly what the code did: "Warn when a value of enumerated type is implicitly converted to a different enumerated type. This warning is enabled by -Wextra."

And now that I read the warning carefully it does not apply to our code, "btrfs_reserve_flush_enum = btrfs_flush_state" does not happen so it must have been some older revision of the code.
Changing

@@ -1073,7 +1073,7 @@ static void btrfs_preempt_reclaim_metadata_space(struct work_struct *work)
 	spin_lock(&space_info->lock);
 	while (need_preemptive_reclaim(fs_info, space_info)) {
-		enum btrfs_flush_state flush;
+		enum btrfs_reserve_flush_enum flush;

---

gcc says (on misc-next without this patchset)

fs/btrfs/space-info.c: In function ‘btrfs_preempt_reclaim_metadata_space’:
fs/btrfs/space-info.c:1112:10: warning: implicit conversion from ‘enum btrfs_flush_state’ to ‘enum btrfs_reserve_flush_enum’ [-Wenum-conversion]
 1112 |   flush = FLUSH_DELALLOC;
  CC [M]  fs/btrfs/block-group.o
fs/btrfs/space-info.c:1117:10: warning: implicit conversion from ‘enum btrfs_flush_state’ to ‘enum btrfs_reserve_flush_enum’ [-Wenum-conversion]
 1117 |   flush = FORCE_COMMIT_TRANS;
fs/btrfs/space-info.c:1121:10: warning: implicit conversion from ‘enum btrfs_flush_state’ to ‘enum btrfs_reserve_flush_enum’ [-Wenum-conversion]
 1121 |   flush = FLUSH_DELAYED_ITEMS_NR;
fs/btrfs/space-info.c:1124:10: warning: implicit conversion from ‘enum btrfs_flush_state’ to ‘enum btrfs_reserve_flush_enum’ [-Wenum-conversion]
 1124 |   flush = FLUSH_DELAYED_REFS_NR;
fs/btrfs/space-info.c:1135:48: warning: implicit conversion from ‘enum btrfs_reserve_flush_enum’ to ‘enum btrfs_flush_state’ [-Wenum-conversion]
 1135 |  flush_space(fs_info, space_info, to_reclaim, flush, true);

---

So you weren't fixing the reported problem by this patch. Assigning an enum to an int should be natural so there's no reason to fix that, but for clarity I don't mind, the patch stays.
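For anyone who wants to reproduce the warning outside the btrfs tree, here is a minimal standalone example that triggers -Wenum-conversion on both gcc and clang (the names mirror the btrfs enums, but this is not kernel code):

    enum btrfs_flush_state { FLUSH_DELAYED_ITEMS_NR = 1, FLUSH_DELALLOC };
    enum btrfs_reserve_flush_enum { BTRFS_RESERVE_NO_FLUSH, BTRFS_RESERVE_FLUSH_ALL };

    void example(void)
    {
            enum btrfs_reserve_flush_enum flush;

            /* implicit conversion between two distinct enumerated types */
            flush = FLUSH_DELALLOC;
            (void)flush;
    }

Compiled with "gcc -c -Wenum-conversion example.c" (or clang, where the warning is on by default), this produces the same implicit-conversion diagnostic quoted above.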
Re: [PATCH 7/7] btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
On 1/27/21 5:35 AM, fdman...@kernel.org wrote: From: Filipe Manana

Often an fsync needs to fall back to a transaction commit for several reasons (to ensure consistency after a power failure, a new block group was allocated, or a temporary error such as ENOMEM or ENOSPC happened). In that case the log is marked as needing a full commit and any concurrent tasks attempting to log inodes or commit the log will also fall back to the transaction commit. When this happens they all wait for the task that first started the transaction commit to finish the transaction commit - however they wait until the full transaction commit happens, which is not needed, as they only need to wait for the superblocks to be persisted and not for unpinning all the extents pinned during the transaction's lifetime, which even for short lived transactions can be a few thousand and take some significant amount of time to complete - for dbench workloads I have observed up to 4~5 milliseconds of time spent unpinning extents in the worst cases, and the number of pinned extents was between 2 and 3 thousand.

So allow fsync tasks to skip waiting for the unpinning of extents when they call btrfs_commit_transaction() and they were not the task that started the transaction commit (that one has to do it, the alternative would be to offload the transaction commit to another task so that it could avoid waiting for the extent unpinning or offload the extent unpinning to another task).

This patch is part of a patchset comprised of the following patches:

  btrfs: remove unnecessary directory inode item update when deleting dir entry
  btrfs: stop setting nbytes when filling inode item for logging
  btrfs: avoid logging new ancestor inodes when logging new inode
  btrfs: skip logging directories already logged when logging all parents
  btrfs: skip logging inodes already logged when logging new entries
  btrfs: remove unnecessary check_parent_dirs_for_sync()
  btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

After applying the entire patchset, dbench shows improvements with respect to throughput and latency. The script used to measure it is the following:

  $ cat dbench-test.sh
  #!/bin/bash

  DEV=/dev/sdk
  MNT=/mnt/sdk
  MOUNT_OPTIONS="-o ssd"
  MKFS_OPTIONS="-m single -d single"

  echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  umount $DEV &> /dev/null
  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT

  dbench -D $MNT -t 300 64

  umount $MNT

The test was run on a physical machine with 12 cores (Intel corei7), 64G of RAM, using an NVMe device and a non-debug kernel configuration (Debian's default configuration).
Before applying patchset, 32 clients:

Operation      Count    AvgLat    MaxLat
NTCreateX    9627107     0.153    61.938
Close        7072076     0.001     3.175
Rename        407633     1.222    44.439
Unlink       1943895     0.658    44.440
Deltree          256    17.339   110.891
Mkdir            128     0.003     0.009
Qpathinfo    8725406     0.064    17.850
Qfileinfo    1529516     0.001     2.188
Qfsinfo      1599884     0.002     1.457
Sfileinfo     784200     0.005     3.562
Find         3373513     0.411    30.312
WriteX       4802132     0.053    29.054
ReadX       15089959     0.002     5.801
LockX          31344     0.002     0.425
UnlockX        31344     0.001     0.173
Flush         674724     5.952   341.830

Throughput 1008.02 MB/sec  32 clients  32 procs  max_latency=341.833 ms

After applying patchset, 32 clients:

Operation      Count    AvgLat    MaxLat
NTCreateX    9931568     0.111    25.597
Close        7295730     0.001     2.171
Rename        420549     0.982    49.714
Unlink       2005366     0.497    39.015
Deltree          256    11.149    89.242
Mkdir            128     0.002     0.014
Qpathinfo    9001863     0.049    20.761
Qfileinfo    1577730     0.001     2.546
Qfsinfo      1650508     0.002     3.531
Sfileinfo     809031     0.005     5.846
Find         3480259     0.309    23.977
WriteX       4952505     0.043    41.283
ReadX       15568127     0.002     5.476
LockX          32338     0.002     0.978
UnlockX        32338     0.001     2.032
Flush         696017     7.485   228.835

Throughput 1049.91 MB/sec  32 clients  32 procs  max_latency=228.847 ms

--> +4.1% throughput, -39.6% max latency

Before applying patchset, 64 clients:

Operation      Count    AvgLat    MaxLat
NTCreateX    8956748     0.342   108.312
Close        6579660     0.001     3.823
Rename        379209     2.396    81.897
Unlink       1808625     1.108   131.148
Deltree          256    25.632   172.176
Mkdir            128     0.003     0.018
Qpathinfo    8117615     0.131    55
Re: [PATCH] btrfs: Simplify the calculation of variables
On 1/27/21 3:11 AM, Abaci Team wrote: Fix the following coccicheck warnings: ./fs/btrfs/delayed-inode.c:1157:39-41: WARNING !A || A && B is equivalent to !A || B. Reported-by: Abaci Robot Suggested-by: Jiapeng Zhong Signed-off-by: Abaci Team Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH 6/7] btrfs: remove unnecessary check_parent_dirs_for_sync()
On Wed, Jan 27, 2021 at 3:23 PM Josef Bacik wrote: > > On 1/27/21 5:34 AM, fdman...@kernel.org wrote: > > From: Filipe Manana > > > > Whenever we fsync an inode, if it is a directory, a regular file that was > > created in the current transaction or has last_unlink_trans set to the > > generation of the current transaction, we check if any of its ancestor > > inodes (and the inode itself if it is a directory) can not be logged and > > need a fallback to a full transaction commit - if so, we return with a > > value of 1 in order to fall back to a transaction commit. > > > > However we often do not need to fall back to a transaction commit because: > > > > 1) The ancestor inode is not an immediate parent, and therefore there is > > not an explicit request to log it and it is not needed either to > > guarantee the consistency of the inode originally asked to be logged > > (fsynced) or its immediate parent; > > > > 2) The ancestor inode was already logged before, in which case any link, > > unlink or rename operation updates the log as needed. > > > > So for these two cases we can avoid an unnecessary transaction commit. > > Therefore remove check_parent_dirs_for_sync() and add a check at the top > > of btrfs_log_inode() to make us fall back immediately to a transaction > > commit when we are logging a directory inode that can not be logged and > > needs a full transaction commit. All we need to protect is the case where > > after renaming a file someone fsyncs only the old directory, which would > > result in losing the renamed file after a log replay. > > > > This patch is part of a patchset comprised of the following patches: > > > >btrfs: remove unnecessary directory inode item update when deleting dir > > entry > >btrfs: stop setting nbytes when filling inode item for logging > >btrfs: avoid logging new ancestor inodes when logging new inode > >btrfs: skip logging directories already logged when logging all parents > >btrfs: skip logging inodes already logged when logging new entries > >btrfs: remove unnecessary check_parent_dirs_for_sync() > >btrfs: make concurrent fsyncs wait less when waiting for a transaction > > commit > > > > Performance results, after applying all patches, are mentioned in the > > change log of the last patch. > > > > Signed-off-by: Filipe Manana > > I'm having a hard time with this one. > > Previously we would commit the transaction if the inode was a regular file > that > was created in the current transaction and had been renamed. Now with this > patch you're only committing the transaction if we are a directory and were > renamed ourselves. Before if you already had directories A and B and then did > something like > > echo "foo" > /mnt/test/A/blah > fsync(/mnt/test/A/blah); > fsync(/mnt/test/A); > mv /mnt/test/A/blah /mnt/test/B > fsync(/mnt/test/B/blah); > > we would commit the transaction on this second fsync, but with your patch we > are > not. I suppose that's keeping in line with how fsync is allowed to work, but > it's definitely a change in behavior from what we used to do. Not sure if > that's good or not, I'll have to think about it. Thanks, Yes. Because of the rename (or a link), we will set last_unlink_trans to the current transaction, and when logging the file that will cause logging of all its old parents (A).
That was added several years ago to fix corruptions, and it turned out to be needed later as well to ensure we have a behaviour similar to xfs and ext4 (and others) regarding strictly ordered metadata updates (I added several tests to fstests over the years for all the cases). There's also the fact that on replay we will delete any inode refs that aren't in the log (that one was added in commit 1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination")). For that example we also have A updated in the log by the rename. So we know the log is consistent. So that's why the whole check_parent_dirs_for_sync() is not needed. Thanks. > > Josef
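For context, the fallback the patch keeps at the top of btrfs_log_inode() amounts to something like the following sketch (the exact condition in the posted patch may differ):

    /*
     * Sketch: a directory that had entries unlinked or renamed away in
     * the current transaction cannot be logged and forces the caller
     * to fall back to a full transaction commit.
     */
    if (S_ISDIR(inode->vfs_inode.i_mode) &&
        inode->last_unlink_trans >= trans->transid)
            return 1;   /* caller falls back to a transaction commit */

Everything else that check_parent_dirs_for_sync() used to inspect is, per the two points in the commit message, either unnecessary or already kept consistent by the log updates done on link, unlink and rename.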
Re: [PATCH 6/7] btrfs: remove unnecessary check_parent_dirs_for_sync()
On 1/27/21 10:36 AM, Filipe Manana wrote: On Wed, Jan 27, 2021 at 3:23 PM Josef Bacik wrote: On 1/27/21 5:34 AM, fdman...@kernel.org wrote: From: Filipe Manana Whenever we fsync an inode, if it is a directory, a regular file that was created in the current transaction or has last_unlink_trans set to the generation of the current transaction, we check if any of its ancestor inodes (and the inode itself if it is a directory) can not be logged and need a fallback to a full transaction commit - if so, we return with a value of 1 in order to fall back to a transaction commit. However we often do not need to fall back to a transaction commit because: 1) The ancestor inode is not an immediate parent, and therefore there is not an explicit request to log it and it is not needed either to guarantee the consistency of the inode originally asked to be logged (fsynced) or its immediate parent; 2) The ancestor inode was already logged before, in which case any link, unlink or rename operation updates the log as needed. So for these two cases we can avoid an unnecessary transaction commit. Therefore remove check_parent_dirs_for_sync() and add a check at the top of btrfs_log_inode() to make us fall back immediately to a transaction commit when we are logging a directory inode that can not be logged and needs a full transaction commit. All we need to protect is the case where after renaming a file someone fsyncs only the old directory, which would result in losing the renamed file after a log replay. This patch is part of a patchset comprised of the following patches: btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit Performance results, after applying all patches, are mentioned in the change log of the last patch. Signed-off-by: Filipe Manana I'm having a hard time with this one. Previously we would commit the transaction if the inode was a regular file that was created in the current transaction and had been renamed. Now with this patch you're only committing the transaction if we are a directory and were renamed ourselves. Before if you already had directories A and B and then did something like echo "foo" > /mnt/test/A/blah fsync(/mnt/test/A/blah); fsync(/mnt/test/A); mv /mnt/test/A/blah /mnt/test/B fsync(/mnt/test/B/blah); we would commit the transaction on this second fsync, but with your patch we are not. I suppose that's keeping in line with how fsync is allowed to work, but it's definitely a change in behavior from what we used to do. Not sure if that's good or not, I'll have to think about it. Thanks, Yes. Because of the rename (or a link), we will set last_unlink_trans to the current transaction, and when logging the file that will cause logging of all its old parents (A). That was added several years ago to fix corruptions, and it turned out to be needed later as well to ensure we have a behaviour similar to xfs and ext4 (and others) regarding strictly ordered metadata updates (I added several tests to fstests over the years for all the cases).
There's also the fact that on replay we will delete any inode refs that aren't in the log (that one was added in commit 1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination")). For that example we also have A updated in the log by the rename. So we know the log is consistent. So that's why the whole check_parent_dirs_for_sync() is not needed. Ok that's reasonable, thanks, Josef
Re: [PATCH 0/7] btrfs: more performance improvements for dbench workloads
On 1/27/21 5:34 AM, fdman...@kernel.org wrote: From: Filipe Manana

The following patchset brings one more batch of performance improvements with dbench workloads, or anything that mixes file creation, file writes, renames, unlinks, etc. with fsync like dbench does. This patchset is mostly based on avoiding logging directory inodes multiple times when not necessary, falling back to transaction commits less frequently and often waiting less time for transaction commits to complete. Performance results are listed in the change log of the last patch, but in short, I've experienced a reduction of maximum latency up to about -40% and throughput gains up to about +6%.

Filipe Manana (7):
  btrfs: remove unnecessary directory inode item update when deleting dir entry
  btrfs: stop setting nbytes when filling inode item for logging
  btrfs: avoid logging new ancestor inodes when logging new inode
  btrfs: skip logging directories already logged when logging all parents
  btrfs: skip logging inodes already logged when logging new entries
  btrfs: remove unnecessary check_parent_dirs_for_sync()
  btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

 fs/btrfs/file.c        |   1 +
 fs/btrfs/transaction.c |  39 +++--
 fs/btrfs/transaction.h |   2 +
 fs/btrfs/tree-log.c    | 195 -
 4 files changed, 92 insertions(+), 145 deletions(-)

You can add Reviewed-by: Josef Bacik to the whole series, thanks, Josef
Re: [PATCH v5 02/18] btrfs: set UNMAPPED bit early in btrfs_clone_extent_buffer() for subpage support
On 1/26/21 3:33 AM, Qu Wenruo wrote: For the incoming subpage support, an UNMAPPED extent buffer will have different behavior in btrfs_release_extent_buffer(). This means we need to set the UNMAPPED bit early, before calling btrfs_release_extent_buffer(). Currently there is only one caller which relies on btrfs_release_extent_buffer() in its error path while setting the UNMAPPED bit late: - btrfs_clone_extent_buffer() Make it subpage compatible by setting the UNMAPPED bit early, and since we're here, also move the UPTODATE bit early. There is another caller, __alloc_dummy_extent_buffer(), setting the UNMAPPED bit late, but that function cleans up the allocated page manually, thus no modification is needed there. Signed-off-by: Qu Wenruo Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH v5 04/18] btrfs: make attach_extent_buffer_page() handle subpage case
On 1/26/21 3:33 AM, Qu Wenruo wrote: For subpage case, we need to allocate additional memory for each metadata page. So we need to: - Allow attach_extent_buffer_page() to return int to indicate allocation failure - Allow manually pre-allocate subpage memory for alloc_extent_buffer() As we don't want to use GFP_ATOMIC under spinlock, we introduce btrfs_alloc_subpage() and btrfs_free_subpage() functions for this purpose. (The simple wrap for btrfs_free_subpage() is for later convert to kmem_cache. Already internally tested without problem) - Preallocate btrfs_subpage structure for alloc_extent_buffer() We don't want to call memory allocation with spinlock held, so do preallocation before we acquire mapping->private_lock. - Handle subpage and regular case differently in attach_extent_buffer_page() For regular case, no change, just do the usual thing. For subpage case, allocate new memory or use the preallocated memory. For future subpage metadata, we will make use of radix tree to grab extent buffer. Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH v5 01/18] btrfs: merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK to PAGE_START_WRITEBACK
On 1/26/21 3:33 AM, Qu Wenruo wrote: PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK are two defines used in __process_pages_contig(), to let the function know to clear the page dirty bit and then set page writeback. However the page writeback and dirty bits are conflicting (at least for the sector size == PAGE_SIZE case), which means these two always have to be updated together. This means we can merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK into PAGE_START_WRITEBACK. Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH v5 00/18] btrfs: add read-only support for subpage sector size
On 1/26/21 3:33 AM, Qu Wenruo wrote: Patches can be fetched from github: https://github.com/adam900710/linux/tree/subpage Currently the branch also contains partial RW data support (still some ordered extent and data csum mismatch problems). Great thanks to David/Nikolay/Josef for their effort reviewing and merging the preparation patches into misc-next.

=== What works ===

Just from the patchset:
- Data read
  Both regular and compressed data, with csum check.
- Metadata read

This means, with this patchset, 64K page systems can at least mount btrfs with 4K sector size read-only. This should provide the ability to migrate data at least. While on the github branch there is already experimental RW support, there are still ordered extent related bugs for me to fix. Thus only the RO part is sent for review and testing.

=== Patchset structure ===

Patch 01~02: Preparation patches which don't have functional change
Patch 03~12: Subpage metadata allocation and freeing
Patch 13~15: Subpage metadata read path
Patch 16~17: Subpage data read path
Patch 18:    Enable subpage RO support

=== Changelog ===

v1:
- Separate the main implementation from the previous huge patchset
  A huge patchset doesn't make much sense.
- Use bitmap implementation
  Now page::private will be a pointer to the btrfs_subpage structure, which contains bitmaps for various page statuses.

v2:
- Use page::private as btrfs_subpage for extra info
  This replaces the old extent io tree based solution, which reduces latency and doesn't require memory allocation for its operations.
- Cherry-pick new preparation patches from RW development
  Those new preparation patches improve readability on their own.

v3:
- Make dummy extent buffers follow the same subpage accessors
  Fsstress exposed several ASSERT()s for dummy extent buffers. It turns out we need to make dummy extent buffers own the same btrfs_subpage structure so that the eb accessors work properly.
- Two new small __process_pages_contig() related preparation patches
  One to enhance the error handling path for locked_page in __process_pages_contig(), one to merge one macro.
- Extent buffer refs count update
  Except for try_release_extent_buffer(), all other eb users will try to increase the ref count of the eb. For try_release_extent_buffer(), the eb refs check will happen inside the rcu critical section to avoid the eb being freed.
- Comment updates
  Addressing the comments from the mailing list.

v4:
- Get rid of btrfs_subpage::tree_block_bitmap
  This is to reduce lock complexity (no need to bother with an extra subpage lock for metadata, all locks are existing locks). Now eb lookup mostly depends on the radix tree, with small help from btrfs_subpage::under_alloc. Now I haven't experienced metadata related problems any more during my local fsstress tests.
- Fix a race where the metadata page dirty bit can race
  Fixed in the metadata RW patchset though.
- Rebased to latest misc-next branch
  With 4 patches removed, as they are already in misc-next.

v5:
- Use the updated version from David as base
  Most comment/commit message updates should be kept as is.
- A new separate patch to move UNMAPPED bit set timing
- New comment on why we need to prealloc subpage inside a loop
  Mostly for further 16K page size support, where we can have an eb across multiple pages.
- Remove one patch which is too RW specific
  Since it introduces a functional change which only makes sense for RW support, it's not a good idea to include it in RO support.
- Error handling fixes
  Great thanks to Josef.
- Refactor btrfs_subpage allocation/freeing
  Now we have btrfs_alloc_subpage() and btrfs_free_subpage() helpers to do all the allocation/freeing. It's pretty easy to convert to kmem_cache using the above helpers. (Already internally tested using kmem_cache without problems; in fact it was the problems found in the kmem_cache test that led to the new interface.)
- Use btrfs_subpage::eb_refs to replace the old under_alloc
  This makes checking whether the page has any eb left much easier.

Qu Wenruo (18):
  btrfs: merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK to PAGE_START_WRITEBACK
  btrfs: set UNMAPPED bit early in btrfs_clone_extent_buffer() for subpage support
  btrfs: introduce the skeleton of btrfs_subpage structure
  btrfs: make attach_extent_buffer_page() handle subpage case
  btrfs: make grab_extent_buffer_from_page() handle subpage case
  btrfs: support subpage for extent buffer page release

I don't have this patch in my inbox so I can't reply to it directly, but you include refcount.h, but then use normal atomics. Please use the actual refcount_t, as it gets us all the debugging stuff that makes finding problems much easier. Thanks, Josef
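To make the request concrete, refcount_t has essentially the same cost as atomic_t but saturates and warns on overflow and underflow. A sketch of what the conversion would look like (assuming eb_refs never legitimately needs to go back up from zero; if it does, a bare atomic_t is the right tool after all):

    #include <linux/refcount.h>

    struct btrfs_subpage {
            spinlock_t lock;
            refcount_t eb_refs;     /* instead of atomic_t */
            /* ... */
    };

    /* first eb attached to the page: */
    refcount_set(&subpage->eb_refs, 1);
    /* another eb in the same page: */
    refcount_inc(&subpage->eb_refs);
    /* eb detached; free on last reference: */
    if (refcount_dec_and_test(&subpage->eb_refs))
            btrfs_free_subpage(subpage);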
Re: [PATCH v5 05/18] btrfs: make grab_extent_buffer_from_page() handle subpage case
On 1/26/21 3:33 AM, Qu Wenruo wrote: For the subpage case, grab_extent_buffer() can't really get an extent buffer just from btrfs_subpage. We have the radix tree lock protecting us from inserting the same eb into the tree. Thus we don't really need to do the extra hassle, just let alloc_extent_buffer() handle the existing eb in the radix tree. Now if two ebs are being allocated at the same time, one will fail with -EEXIST when inserting into the radix tree. So for grab_extent_buffer(), just always return NULL for the subpage case. Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH v5 06/18] btrfs: support subpage for extent buffer page release
On 1/26/21 3:33 AM, Qu Wenruo wrote: In btrfs_release_extent_buffer_pages(), we need to add extra handling for subpage. Introduce a helper, detach_extent_buffer_page(), to do different handling for regular and subpage cases. For subpage case, handle detaching page private. For unmapped (dummy or cloned) ebs, we can detach the page private immediately as the page can only be attached to one unmapped eb. For mapped ebs, we have to ensure there are no eb in the page range before we delete it, as page->private is shared between all ebs in the same page. But there is a subpage specific race, where we can race with extent buffer allocation, and clear the page private while new eb is still being utilized, like this: Extent buffer A is the new extent buffer which will be allocated, while extent buffer B is the last existing extent buffer of the page. T1 (eb A)| T2 (eb B) ---+-- alloc_extent_buffer() | btrfs_release_extent_buffer_pages() |- p = find_or_create_page() | | |- attach_extent_buffer_page() | | | | |- detach_extent_buffer_page() | ||- if (!page_range_has_eb()) | || No new eb in the page range yet | || As new eb A hasn't yet been | || inserted into radix tree. | ||- btrfs_detach_subpage() | | |- detach_page_private(); |- radix_tree_insert()| Then we have a metadata eb whose page has no private bit. To avoid such race, we introduce a subpage metadata-specific member, btrfs_subpage::eb_refs. In alloc_extent_buffer() we increase eb_refs in the critical section of private_lock. Then page_range_has_eb() will return true for detach_extent_buffer_page(), and will not detach page private. The section is marked by: - btrfs_page_inc_eb_refs() - btrfs_page_dec_eb_refs() Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/extent_io.c | 94 +--- fs/btrfs/subpage.c | 42 fs/btrfs/subpage.h | 13 +- 3 files changed, 133 insertions(+), 16 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 16a29f63cfd1..118874926179 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -4993,25 +4993,39 @@ int extent_buffer_under_io(const struct extent_buffer *eb) test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)); } -/* - * Release all pages attached to the extent buffer. - */ -static void btrfs_release_extent_buffer_pages(struct extent_buffer *eb) +static bool page_range_has_eb(struct btrfs_fs_info *fs_info, struct page *page) { - int i; - int num_pages; - int mapped = !test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags); + struct btrfs_subpage *subpage; - BUG_ON(extent_buffer_under_io(eb)); + lockdep_assert_held(&page->mapping->private_lock); - num_pages = num_extent_pages(eb); - for (i = 0; i < num_pages; i++) { - struct page *page = eb->pages[i]; + if (PagePrivate(page)) { + subpage = (struct btrfs_subpage *)page->private; + if (atomic_read(&subpage->eb_refs)) + return true; + } + return false; +} - if (!page) - continue; +static void detach_extent_buffer_page(struct extent_buffer *eb, struct page *page) +{ + struct btrfs_fs_info *fs_info = eb->fs_info; + const bool mapped = !test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags); + + /* +* For mapped eb, we're going to change the page private, which should +* be done under the private_lock. 
+*/ + if (mapped) + spin_lock(&page->mapping->private_lock); + + if (!PagePrivate(page)) { if (mapped) - spin_lock(&page->mapping->private_lock); + spin_unlock(&page->mapping->private_lock); + return; + } + + if (fs_info->sectorsize == PAGE_SIZE) { /* * We do this since we'll remove the pages after we've * removed the eb from the radix tree, so we could race @@ -5030,9 +5044,49 @@ static void btrfs_release_extent_buffer_pages(struct extent_buffer *eb) */ detach_page_private(page); } - if (mapped) spin_unlock(&page->mapping->private_lock); + return; + } + + /* +* For subpage, we can have dummy eb with page private. In this case, +* we can directly detach the private as such page is only attached to +* one dummy eb, no sharing. +*/ +
Re: [PATCH v5 07/18] btrfs: attach private to dummy extent buffer pages
On 1/26/21 3:33 AM, Qu Wenruo wrote: There are locations where we allocate dummy extent buffers for temporary usage, like in tree_mod_log_rewind() or get_old_root(). These dummy extent buffers will be handled by the same eb accessors, and if they don't have page::private, the subpage eb accessors could fail. To address such problems, make __alloc_dummy_extent_buffer() attach page private for dummy extent buffers too. Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba Yup I'm slow, Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH v5 08/18] btrfs: introduce helpers for subpage uptodate status
On 1/26/21 3:33 AM, Qu Wenruo wrote: Introduce the following functions to handle subpage uptodate status: - btrfs_subpage_set_uptodate() - btrfs_subpage_clear_uptodate() - btrfs_subpage_test_uptodate() These helpers can only be called when the page has subpage attached and the range is ensured to be inside the page. - btrfs_page_set_uptodate() - btrfs_page_clear_uptodate() - btrfs_page_test_uptodate() These helpers can handle both regular sector size and subpage. Although caller should still ensure that the range is inside the page. Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba Reviewed-by: Josef Bacik Thanks, Josef
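These helpers share a common shape; roughly (a sketch assuming a 16-bit per-sector bitmap for 4K sectors in a 64K page, protected by the subpage spinlock - details may differ from the posted patch):

    static void btrfs_subpage_set_uptodate(const struct btrfs_fs_info *fs_info,
                                           struct page *page, u64 start, u32 len)
    {
            struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
            const int nbits = len >> fs_info->sectorsize_bits;
            const int start_bit = (start - page_offset(page)) >>
                                  fs_info->sectorsize_bits;
            const u16 tmp = ((1 << nbits) - 1) << start_bit;
            unsigned long flags;

            spin_lock_irqsave(&subpage->lock, flags);
            subpage->uptodate_bitmap |= tmp;
            /* only flip the page flag once every sector is uptodate */
            if (subpage->uptodate_bitmap == U16_MAX)
                    SetPageUptodate(page);
            spin_unlock_irqrestore(&subpage->lock, flags);
    }

The clear/test variants work the same way on the bitmap, and the btrfs_page_*() wrappers simply dispatch to the plain page flag when sectorsize == PAGE_SIZE.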
Re: [PATCH v5 09/18] btrfs: introduce helpers for subpage error status
On 1/26/21 3:33 AM, Qu Wenruo wrote: Introduce the following functions to handle subpage error status: - btrfs_subpage_set_error() - btrfs_subpage_clear_error() - btrfs_subpage_test_error() These helpers can only be called when the page has subpage attached and the range is ensured to be inside the page. - btrfs_page_set_error() - btrfs_page_clear_error() - btrfs_page_test_error() These helpers can handle both regular sector size and subpage without problem. Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH v5 10/18] btrfs: support subpage in set/clear_extent_buffer_uptodate()
On 1/26/21 3:33 AM, Qu Wenruo wrote: To support subpage in set_extent_buffer_uptodate and clear_extent_buffer_uptodate we only need to use the subpage-aware helpers to update the page bits. Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH v5 11/18] btrfs: support subpage in btrfs_clone_extent_buffer
On 1/26/21 3:33 AM, Qu Wenruo wrote: For btrfs_clone_extent_buffer(), it's mostly the same code as __alloc_dummy_extent_buffer(), except it has an extra page copy. So to make it subpage compatible, we only need to: - Call set_extent_buffer_uptodate() instead of SetPageUptodate() This will set the correct uptodate bit for both the subpage and regular sector size cases. Since we're calling set_extent_buffer_uptodate(), which will also set the EXTENT_BUFFER_UPTODATE bit, we don't need to manually set that bit either. Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH v5 12/18] btrfs: support subpage in try_release_extent_buffer()
On 1/26/21 3:33 AM, Qu Wenruo wrote: Unlike the original try_release_extent_buffer(), try_release_subpage_extent_buffer() will iterate through all the ebs in the page, and try to release each. We can release the full page only after there's no private attached, which means all ebs of that page have been released as well. Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH v5 13/18] btrfs: introduce read_extent_buffer_subpage()
On 1/26/21 3:33 AM, Qu Wenruo wrote: Introduce a helper, read_extent_buffer_subpage(), to do the subpage extent buffer read. The differences between the regular and subpage routines are: - No page locking Here we completely rely on extent locking. Page locking can reduce the concurrency greatly, as if we lock one page to read one extent buffer, all the other extent buffers in the same page will have to wait. - Extent uptodate condition Besides the existing PageUptodate() and EXTENT_BUFFER_UPTODATE checks, we also need to check btrfs_subpage::uptodate_bitmap. - No page iteration Just one page, no need to loop; this greatly simplifies the subpage routine. This patch only implements the bio submit part, no endio support yet. Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH v5 14/18] btrfs: support subpage in endio_readpage_update_page_status()
On 1/26/21 3:33 AM, Qu Wenruo wrote: To handle subpage status update, add the following: - Use btrfs_page_*() subpage-aware helpers to update page status Now we can handle both cases well. - No page unlock for subpage metadata Since subpage metadata doesn't utilize page locking at all, skip it. For subpage data locking, it's handled in later commits. Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH v5 15/18] btrfs: introduce subpage metadata validation check
On 1/26/21 3:33 AM, Qu Wenruo wrote: For the subpage metadata validation check, there are some differences: - Read must finish in one bvec Since we're just reading one subpage range in one page, it should never be split into two bios nor two bvecs. - How to grab the existing eb Instead of grabbing the eb using page->private, we have to search the radix tree, as we don't have any direct pointer at hand. Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba Reviewed-by: Josef Bacik Thanks, Josef
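The radix tree lookup mentioned here looks roughly like this (a sketch; it assumes the buffer radix tree is indexed by bytenr divided by sectorsize in the subpage world, which may not match the posted patch exactly):

    struct extent_buffer *eb;

    rcu_read_lock();
    eb = radix_tree_lookup(&fs_info->buffer_radix,
                           bytenr >> fs_info->sectorsize_bits);
    /* a reference on eb must still be taken before use, as in the real code */
    rcu_read_unlock();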
Re: misc bio allocation cleanups
On 1/26/21 7:52 AM, Christoph Hellwig wrote: > Hi Jens, > > this series contains various cleanups for how bios are allocated or > initialized plus related fallout. Applied, thanks. -- Jens Axboe
Re: [PATCH v5 16/18] btrfs: introduce btrfs_subpage for data inodes
On 1/26/21 3:34 AM, Qu Wenruo wrote: To support subpage sector size, data also needs extra info to make sure which sectors in a page are uptodate/dirty/... This patch will make pages for data inodes get the btrfs_subpage structure attached, and detached when the page is freed. This patch also slightly changes the timing of when set_page_extent_mapped() is called to make sure: - We have page->mapping set page->mapping->host is used to grab btrfs_fs_info, thus we can only call this function after the page is mapped to an inode. One call site attaches pages to an inode manually, thus we have to modify the timing of set_page_extent_mapped() a little. - As soon as possible, before other operations Since memory allocation can fail, we have to do extra error handling. Calling set_page_extent_mapped() as soon as possible can simplify the error handling for several call sites. The idea is pretty much the same as iomap_page, but with more bitmaps for btrfs specific cases. Currently the plan is to switch to iomap if iomap can provide sector aligned write back (only write back dirty sectors, but not the full page; data balance requires this feature). So we will stick to the btrfs specific bitmaps for now. Signed-off-by: Qu Wenruo Signed-off-by: David Sterba --- fs/btrfs/compression.c | 10 ++-- fs/btrfs/extent_io.c| 46 + fs/btrfs/extent_io.h| 3 ++- fs/btrfs/file.c | 24 --- fs/btrfs/free-space-cache.c | 15 +--- fs/btrfs/inode.c| 14 +++ fs/btrfs/ioctl.c| 8 ++- fs/btrfs/reflink.c | 5 +++- fs/btrfs/relocation.c | 11 +++-- 9 files changed, 103 insertions(+), 33 deletions(-) diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index 5ae3fa0386b7..6d203acfdeb3 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -542,13 +542,19 @@ static noinline int add_ra_bio_pages(struct inode *inode, goto next; } - end = last_offset + PAGE_SIZE - 1; /* * at this point, we have a locked page in the page cache * for these bytes in the file. But, we have to make * sure they map to this compressed extent on disk.
*/ - set_page_extent_mapped(page); + ret = set_page_extent_mapped(page); + if (ret < 0) { + unlock_page(page); + put_page(page); + break; + } + + end = last_offset + PAGE_SIZE - 1; lock_extent(tree, last_offset, end); read_lock(&em_tree->lock); em = lookup_extent_mapping(em_tree, last_offset, diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 139a8a77ed72..3213daaa 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -3190,10 +3190,39 @@ static int attach_extent_buffer_page(struct extent_buffer *eb, return ret; } -void set_page_extent_mapped(struct page *page) +int set_page_extent_mapped(struct page *page) { + struct btrfs_fs_info *fs_info; + + ASSERT(page->mapping); + + if (PagePrivate(page)) + return 0; + + fs_info = btrfs_sb(page->mapping->host->i_sb); + + if (fs_info->sectorsize < PAGE_SIZE) + return btrfs_attach_subpage(fs_info, page, BTRFS_SUBPAGE_DATA); + + attach_page_private(page, (void *)EXTENT_PAGE_PRIVATE); + return 0; + +} + +void clear_page_extent_mapped(struct page *page) +{ + struct btrfs_fs_info *fs_info; + + ASSERT(page->mapping); + if (!PagePrivate(page)) - attach_page_private(page, (void *)EXTENT_PAGE_PRIVATE); + return; + + fs_info = btrfs_sb(page->mapping->host->i_sb); + if (fs_info->sectorsize < PAGE_SIZE) + return btrfs_detach_subpage(fs_info, page); + + detach_page_private(page); } static struct extent_map * @@ -3250,7 +3279,12 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached, unsigned long this_bio_flag = 0; struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree; - set_page_extent_mapped(page); + ret = set_page_extent_mapped(page); + if (ret < 0) { + unlock_extent(tree, start, end); + SetPageError(page); + goto out; + } if (!PageUptodate(page)) { if (cleancache_get_page(page) == 0) { @@ -3690,7 +3724,11 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc, flush_dcache_page(page); } - set_page_extent_mapped(page); + ret = set_page_extent_mapped(page); + if (ret < 0) { + SetPageError(page); + goto done; + } if (!epd->extent_locked) { ret = writepage_delall
Re: [PATCH v5 17/18] btrfs: integrate page status update for data read path into begin/end_page_read()
On 1/26/21 3:34 AM, Qu Wenruo wrote: In the btrfs data page read path, the page status update is handled in two different locations: btrfs_do_read_page() { while (cur <= end) { /* No need to read from disk */ if (HOLE/PREALLOC/INLINE){ memset(); set_extent_uptodate(); continue; } /* Read from disk */ ret = submit_extent_page(end_bio_extent_readpage); } end_bio_extent_readpage() { endio_readpage_update_page_status(); } This is fine for the sectorsize == PAGE_SIZE case, as for the above loop we should only hit one branch and then exit. But for subpage, there is more work to be done in the page status update: - Page unlock condition Unlike the regular page size == sectorsize case, we can no longer just unlock a page without a brain. Only the last reader of the page can unlock the page. This means we can unlock the page either in the while() loop, or in the endio function. - Page uptodate condition Since we have multiple sectors to read for a page, we can only mark the full page uptodate if all sectors are uptodate. To handle both subpage and regular cases, introduce a pair of functions to help handle the page status update: - begin_page_read() For the regular case, it does nothing. For the subpage case, it updates the reader counters so that later end_page_read() can know who is the last one to unlock the page. - end_page_read() This is just endio_readpage_update_page_status() renamed. The original name is a little too long and too specific for endio. The only new trick added is the condition for page unlock. Now for subpage data, we unlock the page if we're the last reader. This not only provides the basis for subpage data read, but also hides the special handling of page reads from the main read loop. Signed-off-by: Qu Wenruo Signed-off-by: David Sterba --- fs/btrfs/extent_io.c | 38 -- fs/btrfs/subpage.c | 56 ++-- fs/btrfs/subpage.h | 8 +++ 3 files changed, 78 insertions(+), 24 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 3213daaa..7fc2c62d4eb9 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2839,8 +2839,17 @@ static void endio_readpage_release_extent(struct processed_extent *processed, processed->uptodate = uptodate; } -static void endio_readpage_update_page_status(struct page *page, bool uptodate, - u64 start, u32 len) +static void begin_data_page_read(struct btrfs_fs_info *fs_info, struct page *page) +{ + ASSERT(PageLocked(page)); + if (fs_info->sectorsize == PAGE_SIZE) + return; + + ASSERT(PagePrivate(page)); + btrfs_subpage_start_reader(fs_info, page, page_offset(page), PAGE_SIZE); +} + +static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len) { struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb); @@ -2856,7 +2865,12 @@ static void endio_readpage_update_page_status(struct page *page, bool uptodate, if (fs_info->sectorsize == PAGE_SIZE) unlock_page(page); - /* Subpage locking will be handled in later patches */ + else if (is_data_inode(page->mapping->host)) + /* +* For subpage data, unlock the page if we're the last reader. +* For subpage metadata, page lock is not utilized for read.
+*/ + btrfs_subpage_end_reader(fs_info, page, start, len); } /* @@ -2993,7 +3007,7 @@ static void end_bio_extent_readpage(struct bio *bio) bio_offset += len; /* Update page status and unlock */ - endio_readpage_update_page_status(page, uptodate, start, len); + end_page_read(page, uptodate, start, len); endio_readpage_release_extent(&processed, BTRFS_I(inode), start, end, uptodate); } @@ -3263,6 +3277,7 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached, unsigned int read_flags, u64 *prev_em_start) { struct inode *inode = page->mapping->host; + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); u64 start = page_offset(page); const u64 end = start + PAGE_SIZE - 1; u64 cur = start; @@ -3306,6 +3321,7 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached, kunmap_atomic(userpage); } } You have two error cases above this ret = set_page_extent_mapped(page); if (ret < 0) { unlock_extent(tree, start, end); SetPageError(page); goto out; } and if (!PageUptodat
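The reader bookkeeping that begin_page_read()/end_page_read() rely on boils down to a per-page counter in the subpage structure; a sketch of the two helpers the diff calls into (the names come from the diff, the bodies are an assumption):

    static void btrfs_subpage_start_reader(struct btrfs_fs_info *fs_info,
                                           struct page *page, u64 start, u32 len)
    {
            struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
            const int nbits = len >> fs_info->sectorsize_bits;

            /* one "reader" per sector submitted for read */
            atomic_add(nbits, &subpage->readers);
    }

    static void btrfs_subpage_end_reader(struct btrfs_fs_info *fs_info,
                                         struct page *page, u64 start, u32 len)
    {
            struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
            const int nbits = len >> fs_info->sectorsize_bits;

            /* the last reader of the page is the one that unlocks it */
            if (atomic_sub_and_test(nbits, &subpage->readers))
                    unlock_page(page);
    }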
Re: [PATCH v5 18/18] btrfs: allow RO mount of 4K sector size fs on 64K page system
On 1/26/21 3:34 AM, Qu Wenruo wrote: This adds the basic RO mount ability for 4K sector size on 64K page system. Currently we only plan to support 4K and 64K page system. Signed-off-by: Qu Wenruo Signed-off-by: David Sterba Reviewed-by: Josef Bacik Thanks, Josef
ENOSPC in btrfs_run_delayed_refs with 5.10.8 + zstd
Hi, seems 5.10.8 still has the ENOSPC issue when compression is used (compress-force=zstd,space_cache=v2): Jan 27 11:02:14 kernel: [248571.569840] [ cut here ] Jan 27 11:02:14 kernel: [248571.569843] BTRFS: Transaction aborted (error -28) Jan 27 11:02:14 kernel: [248571.569845] BTRFS: error (device dm-0) in add_to_free_space_tree:1039: errno=-28 No space left Jan 27 11:02:14 kernel: [248571.569848] BTRFS info (device dm-0): forced readonly Jan 27 11:02:14 kernel: [248571.569851] BTRFS: error (device dm-0) in add_to_free_space_tree:1039: errno=-28 No space left Jan 27 11:02:14 kernel: [248571.569852] BTRFS: error (device dm-0) in __btrfs_free_extent:3270: errno=-28 No space left Jan 27 11:02:14 kernel: [248571.569854] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2191: errno=-28 No space left Jan 27 11:02:14 kernel: [248571.569898] WARNING: CPU: 3 PID: 21255 at fs/btrfs/free-space-tree.c:1039 add_to_free_space_tree+0xe8/0x130 Jan 27 11:02:14 kernel: [248571.569913] BTRFS: error (device dm-0) in __btrfs_free_extent:3270: errno=-28 No space left Jan 27 11:02:14 kernel: [248571.569939] Modules linked in: Jan 27 11:02:14 kernel: [248571.569966] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2191: errno=-28 No space left Jan 27 11:02:14 kernel: [248571.569992] bfq zram bcache crc64 loop dm_crypt xfs dm_mod st sr_mod cdrom nf_tables nfnetlink iptable_filter bridge stp llc intel_powerclamp coretemp k$ Jan 27 11:02:14 kernel: [248571.570075] CPU: 3 PID: 21255 Comm: kworker/u50:22 Tainted: G I 5.10.8 #1 Jan 27 11:02:14 kernel: [248571.570076] Hardware name: Dell Inc. PowerEdge R510/0DPRKF, BIOS 1.13.0 03/02/2018 Jan 27 11:02:14 kernel: [248571.570079] Workqueue: events_unbound btrfs_async_reclaim_metadata_space Jan 27 11:02:14 kernel: [248571.570081] RIP: 0010:add_to_free_space_tree+0xe8/0x130 Jan 27 11:02:14 kernel: [248571.570082] Code: 55 50 f0 48 0f ba aa 40 0a 00 00 02 72 22 83 f8 fb 74 4c 83 f8 e2 74 47 89 c6 48 c7 c7 b8 39 49 82 89 44 24 04 e8 8a 99 4a 00 <0f> 0b 8$ Jan 27 11:02:14 kernel: [248571.570083] RSP: 0018:c90009c57b88 EFLAGS: 00010282 Jan 27 11:02:14 kernel: [248571.570084] RAX: RBX: 4000 RCX: 0027 Jan 27 11:02:14 kernel: [248571.570085] RDX: 0027 RSI: 0004 RDI: 888617a58b88 Jan 27 11:02:14 kernel: [248571.570086] RBP: 8889ecb874e0 R08: 888617a58b80 R09: Jan 27 11:02:14 kernel: [248571.570087] R10: 0001 R11: 822372e0 R12: 00574151 Jan 27 11:02:14 kernel: [248571.570087] R13: 8884e05727e0 R14: 88815ae4fc00 R15: 88815ae4fdd8 Jan 27 11:02:14 kernel: [248571.570088] FS: () GS:888617a4() knlGS: Jan 27 11:02:14 kernel: [248571.570089] CS: 0010 DS: ES: CR0: 80050033 Jan 27 11:02:14 kernel: [248571.570090] CR2: 7eb4a3a4f00a CR3: 0260a005 CR4: 000206e0 Jan 27 11:02:14 kernel: [248571.570091] Call Trace: Jan 27 11:02:14 kernel: [248571.570097] __btrfs_free_extent.isra.0+0x56a/0xa10 Jan 27 11:02:14 kernel: [248571.570100] __btrfs_run_delayed_refs+0x659/0xf20 Jan 27 11:02:14 kernel: [248571.570102] btrfs_run_delayed_refs+0x73/0x200 Jan 27 11:02:14 kernel: [248571.570103] flush_space+0x4e8/0x5e0 Jan 27 11:02:14 kernel: [248571.570105] ? btrfs_get_alloc_profile+0x66/0x1b0 Jan 27 11:02:14 kernel: [248571.570106] ? btrfs_get_alloc_profile+0x66/0x1b0 Jan 27 11:02:14 kernel: [248571.570107] btrfs_async_reclaim_metadata_space+0x107/0x3a0 Jan 27 11:02:14 kernel: [248571.570111] process_one_work+0x1b6/0x350 Jan 27 11:02:14 kernel: [248571.570112] worker_thread+0x50/0x3b0 Jan 27 11:02:14 kernel: [248571.570114] ? 
process_one_work+0x350/0x350 Jan 27 11:02:14 kernel: [248571.570116] kthread+0xfe/0x140 Jan 27 11:02:14 kernel: [248571.570117] ? kthread_park+0x90/0x90 Jan 27 11:02:14 kernel: [248571.570120] ret_from_fork+0x22/0x30 Jan 27 11:02:14 kernel: [248571.570122] ---[ end trace 568d2f30de65b1c0 ]--- Jan 27 11:02:14 kernel: [248571.570123] BTRFS: error (device dm-0) in add_to_free_space_tree:1039: errno=-28 No space left Jan 27 11:02:14 kernel: [248571.570151] BTRFS: error (device dm-0) in __btrfs_free_extent:3270: errno=-28 No space left Jan 27 11:02:14 kernel: [248571.570178] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2191: errno=-28 No space left

btrfs fi usage:

Overall:
    Device size:                 931.49GiB
    Device allocated:            931.49GiB
    Device unallocated:            1.00MiB
    Device missing:                  0.00B
    Used:                        786.39GiB
    Free (estimated):            107.69GiB  (min: 107.69GiB)
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:              512.00MiB  (used: 0.00B)
    Multiple profiles:                  no

Data,sing
Re: [PATCH v14 12/42] btrfs: calculate allocation offset for conventional zones
On 1/25/21 9:24 PM, Naohiro Aota wrote: Conventional zones do not have a write pointer, so we cannot use it to determine the allocation offset if a block group contains a conventional zone. But instead, we can consider the end of the last allocated extent in the block group as the allocation offset. For a new block group, we cannot calculate the allocation offset by consulting the extent tree, because doing so can cause a deadlock by taking an extent buffer lock after the chunk mutex (which is already taken in btrfs_make_block_group()). Since it is a new block group, we can simply set the allocation offset to 0 anyway. Signed-off-by: Naohiro Aota Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH v14 13/42] btrfs: track unusable bytes for zones
On 1/25/21 9:24 PM, Naohiro Aota wrote: In zoned btrfs a region that was once written and then freed is not usable until we reset the underlying zone. So we need to distinguish such unusable space from usable free space. Therefore we need to introduce the "zone_unusable" field to the block group structure, and "bytes_zone_unusable" to the space_info structure, to track the unusable space. Pinned bytes are always reclaimed to the unusable space. But when an allocated region is returned before being used, e.g. when the block group becomes read-only between allocation time and reservation time, we can safely return the region to the block group. For that situation, this commit introduces "btrfs_add_free_space_unused". This behaves the same as btrfs_add_free_space() on regular btrfs. On zoned btrfs, it rewinds the allocation offset. Signed-off-by: Naohiro Aota Reviewed-by: Josef Bacik Thanks, Josef
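Conceptually, the new helper only differs from btrfs_add_free_space() on zoned filesystems; a sketch of the idea (not the actual implementation - the rewind condition here is an assumption):

    int btrfs_add_free_space_unused(struct btrfs_block_group *bg,
                                    u64 bytenr, u64 size)
    {
            if (!btrfs_is_zoned(bg->fs_info))
                    return btrfs_add_free_space(bg, bytenr, size);

            /*
             * The region was allocated but never written, so it still
             * sits at the tip of the zone: rewind the allocation offset
             * so the space stays usable instead of becoming
             * zone_unusable until a zone reset.
             */
            if (bytenr + size == bg->start + bg->alloc_offset)
                    bg->alloc_offset -= size;

            return btrfs_add_free_space(bg, bytenr, size);
    }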
Re: [PATCH v14 41/42] btrfs: serialize log transaction on ZONED mode
On 1/25/21 9:25 PM, Naohiro Aota wrote: This is the 2/3 patch to enable tree-log on ZONED mode. Since we can start more than one log transactions per subvolume simultaneously, nodes from multiple transactions can be allocated interleaved. Such mixed allocation results in non-sequential writes at the time of log transaction commit. The nodes of the global log root tree (fs_info->log_root_tree), also have the same mixed allocation problem. This patch serializes log transactions by waiting for a committing transaction when someone tries to start a new transaction, to avoid the mixed allocation problem. We must also wait for running log transactions from another subvolume, but there is no easy way to detect which subvolume root is running a log transaction. So, this patch forbids starting a new log transaction when other subvolumes already allocated the global log root tree. Signed-off-by: Naohiro Aota Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH v14 22/42] btrfs: split ordered extent when bio is sent
On 1/25/21 9:25 PM, Naohiro Aota wrote:
For a zone append write, the device decides the location the data is written to, so we cannot ensure that two bios are written consecutively on the device. In order to ensure that an ordered extent maps to a contiguous region on disk, we need to maintain a "one bio == one ordered extent" rule. This commit implements splitting of the ordered extent and extent map on bio submission to adhere to that rule.
[testbot] made extract_ordered_extent static
Reported-by: kernel test robot
Signed-off-by: Naohiro Aota

Reviewed-by: Josef Bacik

Thanks,
Josef
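A hedged sketch of the split operation in plain C; names and types are simplified, and the real extract_ordered_extent() also splits the extent map rather than just a byte range:

struct ordered_extent {
        unsigned long long start;   /* logical start */
        unsigned long long len;     /* byte length */
};

/* Split 'oe' at 'pre' bytes so the front part matches exactly the bio
 * being submitted; the remainder becomes a new ordered extent 'tail'.
 * This preserves the "one bio == one ordered extent" invariant. */
static int split_ordered_extent(struct ordered_extent *oe,
                                unsigned long long pre,
                                struct ordered_extent *tail)
{
        if (pre == 0 || pre >= oe->len)
                return -1;      /* nothing to split off */

        tail->start = oe->start + pre;
        tail->len = oe->len - pre;
        oe->len = pre;
        return 0;
}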
Re: [PATCH] btrfs: remove wrong comment for can_nocow_extent()
On Wed, Jan 27, 2021 at 03:05:41PM +, fdman...@kernel.org wrote: > From: Filipe Manana > > The comment for can_nocow_extent() says that the function will flush > ordered extents, however that never happens and was never true before the > comment was added in commit e4ecaf90bc13 ("btrfs: add comments for > btrfs_check_can_nocow() and can_nocow_extent()"). This is true only for > the function btrfs_check_can_nocow(), which after that commit was renamed > to check_can_nocow(). So just remove that part of the comment. > > Signed-off-by: Filipe Manana Added to misc-next, thanks.
Re: [PATCH v4 04/18] btrfs: make attach_extent_buffer_page() to handle subpage case
On Tue, Jan 26, 2021 at 03:29:17PM +0800, Qu Wenruo wrote:
> On 2021/1/20 6:35 AM, David Sterba wrote:
> > On Tue, Jan 19, 2021 at 04:54:28PM -0500, Josef Bacik wrote:
> >> On 1/16/21 2:15 AM, Qu Wenruo wrote:
> >>> +/* For rare cases where we need to pre-allocate a btrfs_subpage structure */
> >>> +static inline int btrfs_alloc_subpage(struct btrfs_fs_info *fs_info,
> >>> +                                      struct btrfs_subpage **ret)
> >>> +{
> >>> +        if (fs_info->sectorsize == PAGE_SIZE)
> >>> +                return 0;
> >>> +
> >>> +        *ret = kzalloc(sizeof(struct btrfs_subpage), GFP_NOFS);
> >>> +        if (!*ret)
> >>> +                return -ENOMEM;
> >>> +        return 0;
> >>> +}
> >>
> >> We're allocating these for every metadata page, that deserves a dedicated kmem_cache. Thanks,
> >
> > I'm not opposed to that idea, but for the first implementation I'm OK with using the default slabs. As the subpage support depends on the filesystem, creating the cache unconditionally would waste resources, and creating it on demand would need some care. Either way I'd rather see it in a separate patch.
>
> Well, too late for me to see this comment...
>
> As I have already converted to a kmem cache.
>
> But the good news is that the latest version has extra refactoring of the memory allocation/freeing, so now we only need to change two lines to change how we allocate memory for subpage (although the cache allocation code would still need to be removed).
>
> Will convert it back to the default slab, but will also keep the refactoring there to make a later conversion to kmem_cache easier.
>
> So don't be too surprised to see a function like this in the next version:
>
>         btrfs_free_subpage(struct btrfs_subpage *subpage)
>         {
>                 kfree(subpage);
>         }

I had hoped to save you the time of converting it to the kmem slabs, so no need to revert it back to kmalloc, keep what you have. Switching via the helper would be easier should we need to reconsider it for some reason.
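The value of confining allocation and freeing to one helper pair, as discussed above, is that switching the backing allocator later only touches the helper bodies. A hedged userspace sketch of that shape (not the kernel code; the struct and function names are invented):

#include <stdlib.h>

struct subpage { unsigned long bitmap; };   /* placeholder fields */

/* Only these two functions know how subpage memory is obtained, so a
 * later switch to kmem_cache_alloc()/kmem_cache_free() is a two-line
 * change confined to their bodies. */
static struct subpage *subpage_alloc(void)
{
        return calloc(1, sizeof(struct subpage));
}

static void subpage_free(struct subpage *s)
{
        free(s);
}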
Re: btrfs becomes read-only
On Wed, Jan 27, 2021 at 1:57 AM Alexey Isaev wrote: > > kernel version: > > aleksey@host:~$ sudo uname --all > Linux host 4.15.0-132-generic #136~16.04.1-Ubuntu SMP Tue Jan 12 > 18:22:20 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux This is an old and EOL kernel. It could be a long-fixed Btrfs bug that caused this problem, I'm not sure. I suggest 5.4.93+ if you need a longterm kernel; otherwise 5.10.11 is the current stable kernel. > > drive make/model: > > Drive is an external 5 bay HDD enclosure with raid-5 connected via usb-3 > (made by Orico https://www.orico.cc/us/product/detail/3622.html) > with 5 WD Red 10 TB. We use this drive for backups. > > When I try to run btrfs check I get an error message: > > aleksey@host:~$ sudo btrfs check --readonly /dev/sdg > Couldn't open file system OK, is it now on some other dev node? A relatively recent btrfs-progs is also recommended; 5.10 is current and I probably wouldn't use anything older than 5.6.1. > aleksey@host:~$ sudo smartctl -x /dev/sdg Yeah, that probably won't work since it's behind a raid5 controller. I think there are smartctl commands to enable passthrough and get information for each drive, so that you don't have to put it in JBOD mode, but I'm not familiar with how to do that. Anyway, it's a good idea to find out if SMART is reporting any problems on any drive, but it's not urgent. -- Chris Murphy
Re: btrfs becomes read-only
On Wed, Jan 27, 2021 at 6:05 AM Alexey Isaev wrote: > > I managed to run btrfs check, but it didn't solve the problem: > > aleksey@host:~$ sudo btrfs check --repair /dev/sdg OK, it's risky to run --repair without a developer giving a go-ahead, in particular with older versions of btrfs-progs. There are warnings in the man page about it. > [sudo] password for aleksey: > enabling repair mode > Checking filesystem on /dev/sdg > UUID: 070ce9af-6511-4b89-a501-0823514320c1 > checking extents > parent transid verify failed on 52180048330752 wanted 132477 found 132432 > parent transid verify failed on 52180048330752 wanted 132477 found 132432 > parent transid verify failed on 52180048330752 wanted 132477 found 132432 > parent transid verify failed on 52180048330752 wanted 132477 found 132432 > Ignoring transid failure > leaf parent key incorrect 52180048330752 > bad block 52180048330752 > Errors found in extent allocation tree or chunk allocation > parent transid verify failed on 52180048330752 wanted 132477 found 132432 Yeah, it's not finding what it's expecting to find there. Any power failure or crash in the history of the file system? What do you get for: btrfs insp dump-s -f /dev/sdg -- Chris Murphy
Re: Only one subvolume can be mounted after replace/balance
On Wed, Jan 27, 2021 at 6:10 AM Jakob Schöttl wrote: > > Thank you Chris, it's resolved now, see below. > > On 25.01.21 at 23:47, Chris Murphy wrote: > > On Sat, Jan 23, 2021 at 7:50 AM Jakob Schöttl wrote: > >> Hi, > >> > >> In short: > >> When mounting a second subvolume from a pool, I get this error: > >> "mount: /mnt: wrong fs type, bad option, bad superblock on /dev/sda, > >> missing code page or helper program, or other." > >> dmesg | grep BTRFS only shows this error: > >> info (device sda): disk space caching is enabled > >> error (device sda): Remounting read-write after error is not allowed > > It went read-only before this because it's confused. You need to > > unmount it before it can be mounted rw. In some cases a reboot is > > needed. > Oh, I didn't notice that the pool was already mounted (via fstab). > The filesystem was out of space and I had to resize both disks > separately. And I had to mount with -o skip_balance for that. Now it > works again. > > >> What happened: > >> > >> In my RAID1 pool with two disks, I successfully replaced one disk with > >> > >> btrfs replace start 2 /dev/sdx > >> > >> After that, I mounted the pool and did > > I don't understand this sequence. In order to do a replace, the file > > system is already mounted. > That was what I did before my actual problem occurred. But it's > resolved now. > > >> btrfs fi show /mnt > >> > >> which showed WARNINGs about > >> "filesystems with multiple block group profiles detected" > >> (don't remember exactly) > >> > >> I thought it was a good idea to do > >> > >> btrfs balance start /mnt > >> > >> which finished without errors. > > Balance alone does not convert block groups to a new profile. You have > > to explicitly select a conversion filter, e.g. > > > > btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt > I didn't want to convert to a new profile. I thought btrfs replace > automatically uses the same profile as the pool? Btrfs replace does not change the profile. But you reported mixed profile block groups, which means a conversion is indicated to make sure they're all the same. Please post: sudo btrfs fi us /mnt Let's see what the block groups are and what you want them to be, and then see what conversion command might be indicated. -- Chris Murphy
Re: ENOSPC in btrfs_run_delayed_refs with 5.10.8 + zstd
On Wed, Jan 27, 2021 at 10:27 AM Martin Raiber wrote: > > Hi, > > seems 5.10.8 still has the ENOSPC issue when compression is used > (compress-force=zstd,space_cache=v2): > > Jan 27 11:02:14 kernel: [248571.569840] [ cut here ] > Jan 27 11:02:14 kernel: [248571.569843] BTRFS: Transaction aborted (error > -28) > Jan 27 11:02:14 kernel: [248571.569845] BTRFS: error (device dm-0) in > add_to_free_space_tree:1039: errno=-28 No space left > Jan 27 11:02:14 kernel: [248571.569848] BTRFS info (device dm-0): forced > readonly > Jan 27 11:02:14 kernel: [248571.569851] BTRFS: error (device dm-0) in > add_to_free_space_tree:1039: errno=-28 No space left > Jan 27 11:02:14 kernel: [248571.569852] BTRFS: error (device dm-0) in > __btrfs_free_extent:3270: errno=-28 No space left > Jan 27 11:02:14 kernel: [248571.569854] BTRFS: error (device dm-0) in > btrfs_run_delayed_refs:2191: errno=-28 No space left > Jan 27 11:02:14 kernel: [248571.569898] WARNING: CPU: 3 PID: 21255 at > fs/btrfs/free-space-tree.c:1039 add_to_free_space_tree+0xe8/0x130 > Jan 27 11:02:14 kernel: [248571.569913] BTRFS: error (device dm-0) in > __btrfs_free_extent:3270: errno=-28 No space left > Jan 27 11:02:14 kernel: [248571.569939] Modules linked in: > Jan 27 11:02:14 kernel: [248571.569966] BTRFS: error (device dm-0) in > btrfs_run_delayed_refs:2191: errno=-28 No space left > Jan 27 11:02:14 kernel: [248571.569992] bfq zram bcache crc64 loop dm_crypt > xfs dm_mod st sr_mod cdrom nf_tables nfnetlink iptable_filter bridge stp llc > intel_powerclamp coretemp k$ > Jan 27 11:02:14 kernel: [248571.570075] CPU: 3 PID: 21255 Comm: > kworker/u50:22 Tainted: G I 5.10.8 #1 > Jan 27 11:02:14 kernel: [248571.570076] Hardware name: Dell Inc. PowerEdge > R510/0DPRKF, BIOS 1.13.0 03/02/2018 > Jan 27 11:02:14 kernel: [248571.570079] Workqueue: events_unbound > btrfs_async_reclaim_metadata_space > Jan 27 11:02:14 kernel: [248571.570081] RIP: > 0010:add_to_free_space_tree+0xe8/0x130 > Jan 27 11:02:14 kernel: [248571.570082] Code: 55 50 f0 48 0f ba aa 40 0a 00 > 00 02 72 22 83 f8 fb 74 4c 83 f8 e2 74 47 89 c6 48 c7 c7 b8 39 49 82 89 44 24 > 04 e8 8a 99 4a 00 <0f> 0b 8$ > Jan 27 11:02:14 kernel: [248571.570083] RSP: 0018:c90009c57b88 EFLAGS: > 00010282 > Jan 27 11:02:14 kernel: [248571.570084] RAX: RBX: > 4000 RCX: 0027 > Jan 27 11:02:14 kernel: [248571.570085] RDX: 0027 RSI: > 0004 RDI: 888617a58b88 > Jan 27 11:02:14 kernel: [248571.570086] RBP: 8889ecb874e0 R08: > 888617a58b80 R09: > Jan 27 11:02:14 kernel: [248571.570087] R10: 0001 R11: > 822372e0 R12: 00574151 > Jan 27 11:02:14 kernel: [248571.570087] R13: 8884e05727e0 R14: > 88815ae4fc00 R15: 88815ae4fdd8 > Jan 27 11:02:14 kernel: [248571.570088] FS: () > GS:888617a4() knlGS: > Jan 27 11:02:14 kernel: [248571.570089] CS: 0010 DS: ES: CR0: > 80050033 > Jan 27 11:02:14 kernel: [248571.570090] CR2: 7eb4a3a4f00a CR3: > 0260a005 CR4: 000206e0 > Jan 27 11:02:14 kernel: [248571.570091] Call Trace: > Jan 27 11:02:14 kernel: [248571.570097] > __btrfs_free_extent.isra.0+0x56a/0xa10 > Jan 27 11:02:14 kernel: [248571.570100] __btrfs_run_delayed_refs+0x659/0xf20 > Jan 27 11:02:14 kernel: [248571.570102] btrfs_run_delayed_refs+0x73/0x200 > Jan 27 11:02:14 kernel: [248571.570103] flush_space+0x4e8/0x5e0 > Jan 27 11:02:14 kernel: [248571.570105] ? btrfs_get_alloc_profile+0x66/0x1b0 > Jan 27 11:02:14 kernel: [248571.570106] ? 
btrfs_get_alloc_profile+0x66/0x1b0 > Jan 27 11:02:14 kernel: [248571.570107] > btrfs_async_reclaim_metadata_space+0x107/0x3a0 > Jan 27 11:02:14 kernel: [248571.570111] process_one_work+0x1b6/0x350 > Jan 27 11:02:14 kernel: [248571.570112] worker_thread+0x50/0x3b0 > Jan 27 11:02:14 kernel: [248571.570114] ? process_one_work+0x350/0x350 > Jan 27 11:02:14 kernel: [248571.570116] kthread+0xfe/0x140 > Jan 27 11:02:14 kernel: [248571.570117] ? kthread_park+0x90/0x90 > Jan 27 11:02:14 kernel: [248571.570120] ret_from_fork+0x22/0x30 > Jan 27 11:02:14 kernel: [248571.570122] ---[ end trace 568d2f30de65b1c0 ]--- > Jan 27 11:02:14 kernel: [248571.570123] BTRFS: error (device dm-0) in > add_to_free_space_tree:1039: errno=-28 No space left > Jan 27 11:02:14 kernel: [248571.570151] BTRFS: error (device dm-0) in > __btrfs_free_extent:3270: errno=-28 No space left > Jan 27 11:02:14 kernel: [248571.570178] BTRFS: error (device dm-0) in > btrfs_run_delayed_refs:2191: errno=-28 No space left > > > btrfs fi usage: > > Overall: > Device size: 931.49GiB > Device allocated:931.49GiB > Device unallocated:1.00MiB > Device missing: 0.00B > Used:786.39GiB > Free (estimated):
Re: [PATCH v5 00/18] btrfs: add read-only support for subpage sector size
On 2021/1/28 12:17 AM, Josef Bacik wrote:
On 1/26/21 3:33 AM, Qu Wenruo wrote:
Patches can be fetched from github:
https://github.com/adam900710/linux/tree/subpage
Currently the branch also contains partial RW data support (still some ordered extent and data csum mismatch problems).

Great thanks to David/Nikolay/Josef for their effort reviewing and merging the preparation patches into misc-next.

=== What works ===

Just from the patchset:
- Data read
  Both regular and compressed data, with csum check.
- Metadata read

This means that with this patchset, 64K page systems can at least mount btrfs with 4K sector size read-only. This should provide the ability to migrate data at least. While the github branch already has experimental RW support, there are still ordered extent related bugs for me to fix, thus only the RO part is sent for review and testing.

=== Patchset structure ===

Patch 01~02: Preparation patches which don't have functional changes
Patch 03~12: Subpage metadata allocation and freeing
Patch 13~15: Subpage metadata read path
Patch 16~17: Subpage data read path
Patch 18:    Enable subpage RO support

=== Changelog ===

v1:
- Separate the main implementation from the previous huge patchset
  A huge patchset doesn't make much sense.
- Use a bitmap implementation
  Now page::private will be a pointer to a btrfs_subpage structure, which contains bitmaps for the various page statuses.

v2:
- Use page::private as btrfs_subpage for extra info
  This replaces the old extent io tree based solution, which reduces latency and doesn't require memory allocation for its operations.
- Cherry-pick new preparation patches from RW development
  Those new preparation patches improve readability on their own.

v3:
- Make dummy extent buffers follow the same subpage accessors
  Fsstress exposed several ASSERT()s for dummy extent buffers. It turns out we need to make dummy extent buffers own the same btrfs_subpage structure for the eb accessors to work properly.
- Two new small __process_pages_contig() related preparation patches
  One to make __process_pages_contig() enhance the error handling path for locked_page, one to merge one macro.
- Extent buffer refs count update
  Except for try_release_extent_buffer(), all other eb users will try to increase the ref count of the eb. For try_release_extent_buffer(), the eb refs check will happen inside the rcu critical section to avoid the eb being freed.
- Comment updates
  Addressing the comments from the mailing list.

v4:
- Get rid of btrfs_subpage::tree_block_bitmap
  This is to reduce lock complexity (no need to bother with an extra subpage lock for metadata; all locks are existing locks). Now eb lookup mostly depends on the radix tree, with small help from btrfs_subpage::under_alloc. I haven't experienced metadata related problems any more during my local fsstress tests.
- Fix a race on the metadata page dirty bit
  Fixed in the metadata RW patchset though.
- Rebased to the latest misc-next branch
  With 4 patches removed, as they are already in misc-next.

v5:
- Use the updated version from David as the base
  Most comment/commit message updates should be kept as is.
- A new separate patch to move the UNMAPPED bit set timing
- New comment on why we need to prealloc subpage inside a loop
  Mostly for future 16K page size support, where we can have an eb across multiple pages.
- Remove one patch which is too RW specific
  Since it introduces a functional change which only makes sense for RW support, it's not a good idea to include it in RO support.
- Error handling fixes
  Great thanks to Josef.
- Refactor btrfs_subpage allocation/freeing
  Now we have btrfs_alloc_subpage() and btrfs_free_subpage() helpers to do all the allocation/freeing. It's pretty easy to convert to kmem_cache using the above helpers. (Already internally tested using kmem_cache without problems; in fact, it was the problems found in the kmem_cache test that led to the new interface.)
- Use btrfs_subpage::eb_refs to replace the old under_alloc
  This makes checking whether the page has any ebs left much easier.

Qu Wenruo (18):
  btrfs: merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK to PAGE_START_WRITEBACK
  btrfs: set UNMAPPED bit early in btrfs_clone_extent_buffer() for subpage support
  btrfs: introduce the skeleton of btrfs_subpage structure
  btrfs: make attach_extent_buffer_page() handle subpage case
  btrfs: make grab_extent_buffer_from_page() handle subpage case
  btrfs: support subpage for extent buffer page release

I don't have this patch in my inbox so I can't reply to it directly, but you include refcount.h and then use normal atomics. Please use the actual refcount_t, as it gets us all the debugging stuff that makes finding problems much easier. Thanks,

My bad, my initial plan was to use refcount, but the use case has a valid 0 refcount usage, thus refcount is not good here. I'll remove the remaini
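For context on the refcount_t point: refcount_t deliberately saturates and warns when a counter is incremented from zero, since for object lifetimes zero means "already freed". A counter where zero is a valid live state, as described for the subpage eb references here, therefore has to stay a plain atomic. An illustrative kernel-style comparison (written as a comment since it is not compilable standalone; the sp/eb_refs names are made up):

/* With refcount_t, this pattern triggers a warning:
 *
 *      refcount_set(&sp->refs, 0);     // page attached, no ebs yet
 *      refcount_inc(&sp->refs);        // WARN: "addition on 0; use-after-free"
 *
 * With a plain atomic the same pattern is fine:
 *
 *      atomic_set(&sp->eb_refs, 0);    // 0 is a legitimate state here
 *      atomic_inc(&sp->eb_refs);       // first eb attached to the page
 *      atomic_dec(&sp->eb_refs);       // back to 0; the page may stay around
 */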