Re: [PATCH v3] btrfs: Fix race condition between delayed refs and blockgroup removal
On 18.04.2018 23:47, David Sterba wrote: > On Wed, Apr 18, 2018 at 09:41:54AM +0300, Nikolay Borisov wrote: >> When the delayed refs for a head are all run, eventually >> cleanup_ref_head is called which (in case of deletion) obtains a >> reference for the relevant btrfs_space_info struct by querying the bg >> for the range. This is problematic because when the last extent of a >> bg is deleted a race window emerges between removal of that bg and the >> subsequent invocation of cleanup_ref_head. This can result in cache being >> null >> and either a null pointer dereference or assertion failure. >> >> task: 8d04d31ed080 task.stack: 9e5dc10cc000 >> RIP: 0010:assfail.constprop.78+0x18/0x1a [btrfs] >> RSP: 0018:9e5dc10cfbe8 EFLAGS: 00010292 >> RAX: 0044 RBX: RCX: >> RDX: 8d04ffc1f868 RSI: 8d04ffc178c8 RDI: 8d04ffc178c8 >> RBP: 8d04d29e5ea0 R08: 01f0 R09: 0001 >> R10: 9e5dc0507d58 R11: 0001 R12: 8d04d29e5ea0 >> R13: 8d04d29e5f08 R14: 8d04efe29b40 R15: 8d04efe203e0 >> FS: 7fbf58ead500() GS:8d04ffc0() >> knlGS: >> CS: 0010 DS: ES: CR0: 80050033 >> CR2: 7fe6c6975648 CR3: 13b2a000 CR4: 06f0 >> DR0: DR1: DR2: >> DR3: DR6: fffe0ff0 DR7: 0400 >> Call Trace: >> __btrfs_run_delayed_refs+0x10e7/0x12c0 [btrfs] >> btrfs_run_delayed_refs+0x68/0x250 [btrfs] >> btrfs_should_end_transaction+0x42/0x60 [btrfs] >> btrfs_truncate_inode_items+0xaac/0xfc0 [btrfs] >> btrfs_evict_inode+0x4c6/0x5c0 [btrfs] >> evict+0xc6/0x190 >> do_unlinkat+0x19c/0x300 >> do_syscall_64+0x74/0x140 >> entry_SYSCALL_64_after_hwframe+0x3d/0xa2 >> RIP: 0033:0x7fbf589c57a7 >> >> To fix this, introduce a new flag "is_system" to head_ref structs, >> which is populated at insertion time. This allows to decouple the >> querying for the spaceinfo from querying the possibly deleted bg. >> >> Fixes: d7eae3403f46 ("Btrfs: rework delayed ref total_bytes_pinned >> accounting") >> Suggested-by: Omar Sandoval >> Signed-off-by: Nikolay Borisov > > Added to next, thanks. 
> >> @@ -550,8 +550,10 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
> >>  		struct btrfs_delayed_ref_head *head_ref,
> >>  		struct btrfs_qgroup_extent_record *qrecord,
> >>  		u64 bytenr, u64 num_bytes, u64 ref_root, u64 reserved,
> >> -	int action, int is_data, int *qrecord_inserted_ret,
> >> +	int action, int is_data, int is_system,
> >> +	int *qrecord_inserted_ret,
> >>  		int *old_ref_mod, int *new_ref_mod)
>
> At some point we might want to peel off a few arguments of that function.
> It now has 14. The fs_info/trans should work and the
> action/is_data/is_system could be merged to a bitmask.

Yes, I sent a series which splits the delayed ref initialisation + addition.
It will conflict with this patch but I intend to rebase. I'm talking about
[PATCH 0/8] Delayed refs cleanups/streamlining

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 3/3] btrfs: remove le_test_bit()
With commit b18253ec57c0 ("btrfs: optimize free space tree bitmap conversion"), there are no more callers to le_test_bit(). This patch removes le_test_bit(). Signed-off-by: Howard McLauchlan --- fs/btrfs/extent_io.h | 5 - 1 file changed, 5 deletions(-) diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h index d34416c831bf..c5e80d60d71b 100644 --- a/fs/btrfs/extent_io.h +++ b/fs/btrfs/extent_io.h @@ -79,11 +79,6 @@ #define BITMAP_LAST_BYTE_MASK(nbits) \ (BYTE_MASK >> (-(nbits) & (BITS_PER_BYTE - 1))) -static inline int le_test_bit(int nr, const u8 *addr) -{ - return 1U & (addr[BIT_BYTE(nr)] >> (nr & (BITS_PER_BYTE-1))); -} - struct extent_state; struct btrfs_root; struct btrfs_inode; -- 2.17.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 1/3] btrfs: clean up le_bitmap_{set, clear}()
le_bitmap_set() is only used by free-space-tree, so move it there and make it static. le_bitmap_clear() is not used, so remove it. Signed-off-by: Howard McLauchlan --- V1->V2: Also move le_bitmap_set() based on 4.17-rc1 fs/btrfs/extent_io.c | 40 -- fs/btrfs/extent_io.h | 3 --- fs/btrfs/free-space-tree.c | 20 +++ 3 files changed, 20 insertions(+), 43 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index e99b329002cf..9a521e5e297d 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -5620,46 +5620,6 @@ void copy_extent_buffer(struct extent_buffer *dst, struct extent_buffer *src, } } -void le_bitmap_set(u8 *map, unsigned int start, int len) -{ - u8 *p = map + BIT_BYTE(start); - const unsigned int size = start + len; - int bits_to_set = BITS_PER_BYTE - (start % BITS_PER_BYTE); - u8 mask_to_set = BITMAP_FIRST_BYTE_MASK(start); - - while (len - bits_to_set >= 0) { - *p |= mask_to_set; - len -= bits_to_set; - bits_to_set = BITS_PER_BYTE; - mask_to_set = ~0; - p++; - } - if (len) { - mask_to_set &= BITMAP_LAST_BYTE_MASK(size); - *p |= mask_to_set; - } -} - -void le_bitmap_clear(u8 *map, unsigned int start, int len) -{ - u8 *p = map + BIT_BYTE(start); - const unsigned int size = start + len; - int bits_to_clear = BITS_PER_BYTE - (start % BITS_PER_BYTE); - u8 mask_to_clear = BITMAP_FIRST_BYTE_MASK(start); - - while (len - bits_to_clear >= 0) { - *p &= ~mask_to_clear; - len -= bits_to_clear; - bits_to_clear = BITS_PER_BYTE; - mask_to_clear = ~0; - p++; - } - if (len) { - mask_to_clear &= BITMAP_LAST_BYTE_MASK(size); - *p &= ~mask_to_clear; - } -} - /* * eb_bitmap_offset() - calculate the page and offset of the byte containing the * given bit number diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h index a53009694b16..d34416c831bf 100644 --- a/fs/btrfs/extent_io.h +++ b/fs/btrfs/extent_io.h @@ -84,9 +84,6 @@ static inline int le_test_bit(int nr, const u8 *addr) return 1U & (addr[BIT_BYTE(nr)] >> (nr & (BITS_PER_BYTE-1))); } -void 
le_bitmap_set(u8 *map, unsigned int start, int len); -void le_bitmap_clear(u8 *map, unsigned int start, int len); - struct extent_state; struct btrfs_root; struct btrfs_inode; diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c index 32a0f6cb5594..e03830d83311 100644 --- a/fs/btrfs/free-space-tree.c +++ b/fs/btrfs/free-space-tree.c @@ -157,6 +157,26 @@ static u8 *alloc_bitmap(u32 bitmap_size) return ret; } +static void le_bitmap_set(u8 *map, unsigned int start, int len) +{ + u8 *p = map + BIT_BYTE(start); + const unsigned int size = start + len; + int bits_to_set = BITS_PER_BYTE - (start % BITS_PER_BYTE); + u8 mask_to_set = BITMAP_FIRST_BYTE_MASK(start); + + while (len - bits_to_set >= 0) { + *p |= mask_to_set; + len -= bits_to_set; + bits_to_set = BITS_PER_BYTE; + mask_to_set = ~0; + p++; + } + if (len) { + mask_to_set &= BITMAP_LAST_BYTE_MASK(size); + *p |= mask_to_set; + } +} + int convert_free_space_to_bitmaps(struct btrfs_trans_handle *trans, struct btrfs_fs_info *fs_info, struct btrfs_block_group_cache *block_group, -- 2.17.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 2/3] btrfs: optimize free space tree bitmap conversion
Presently, convert_free_space_to_extents() does a linear scan of the bitmap. We can speed this up with find_next_{bit,zero_bit}_le(). This patch replaces the linear scan with find_next_{bit,zero_bit}_le(). Testing shows a 20-33% decrease in execution time for convert_free_space_to_extents(). Suggested-by: Omar Sandoval Signed-off-by: Howard McLauchlan --- V1->V2: Round the bitmap size accordingly when allocating bitmap. fs/btrfs/free-space-tree.c | 61 ++ 1 file changed, 23 insertions(+), 38 deletions(-) diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c index e03830d83311..7019afe6e727 100644 --- a/fs/btrfs/free-space-tree.c +++ b/fs/btrfs/free-space-tree.c @@ -138,10 +138,11 @@ static inline u32 free_space_bitmap_size(u64 size, u32 sectorsize) return DIV_ROUND_UP((u32)div_u64(size, sectorsize), BITS_PER_BYTE); } -static u8 *alloc_bitmap(u32 bitmap_size) +static unsigned long *alloc_bitmap(u32 bitmap_size) { - u8 *ret; + unsigned long *ret; unsigned int nofs_flag; + u32 bitmap_rounded_size = round_up(bitmap_size, sizeof(unsigned long)); /* * GFP_NOFS doesn't work with kvmalloc(), but we really can't recurse @@ -152,14 +153,14 @@ static u8 *alloc_bitmap(u32 bitmap_size) * know that recursion is unsafe. 
*/ nofs_flag = memalloc_nofs_save(); - ret = kvzalloc(bitmap_size, GFP_KERNEL); + ret = kvzalloc(bitmap_rounded_size, GFP_KERNEL); memalloc_nofs_restore(nofs_flag); return ret; } -static void le_bitmap_set(u8 *map, unsigned int start, int len) +static void le_bitmap_set(unsigned long *map, unsigned int start, int len) { - u8 *p = map + BIT_BYTE(start); + u8 *p = ((u8 *)map) + BIT_BYTE(start); const unsigned int size = start + len; int bits_to_set = BITS_PER_BYTE - (start % BITS_PER_BYTE); u8 mask_to_set = BITMAP_FIRST_BYTE_MASK(start); @@ -186,7 +187,8 @@ int convert_free_space_to_bitmaps(struct btrfs_trans_handle *trans, struct btrfs_free_space_info *info; struct btrfs_key key, found_key; struct extent_buffer *leaf; - u8 *bitmap, *bitmap_cursor; + unsigned long *bitmap; + char *bitmap_cursor; u64 start, end; u64 bitmap_range, i; u32 bitmap_size, flags, expected_extent_count; @@ -275,7 +277,7 @@ int convert_free_space_to_bitmaps(struct btrfs_trans_handle *trans, goto out; } - bitmap_cursor = bitmap; + bitmap_cursor = (char *)bitmap; bitmap_range = fs_info->sectorsize * BTRFS_FREE_SPACE_BITMAP_BITS; i = start; while (i < end) { @@ -324,13 +326,10 @@ int convert_free_space_to_extents(struct btrfs_trans_handle *trans, struct btrfs_free_space_info *info; struct btrfs_key key, found_key; struct extent_buffer *leaf; - u8 *bitmap; + unsigned long *bitmap; u64 start, end; - /* Initialize to silence GCC. 
*/ - u64 extent_start = 0; - u64 offset; u32 bitmap_size, flags, expected_extent_count; - int prev_bit = 0, bit, bitnr; + unsigned long nrbits, start_bit, end_bit; u32 extent_count = 0; int done = 0, nr; int ret; @@ -368,7 +367,7 @@ int convert_free_space_to_extents(struct btrfs_trans_handle *trans, break; } else if (found_key.type == BTRFS_FREE_SPACE_BITMAP_KEY) { unsigned long ptr; - u8 *bitmap_cursor; + char *bitmap_cursor; u32 bitmap_pos, data_size; ASSERT(found_key.objectid >= start); @@ -378,7 +377,7 @@ int convert_free_space_to_extents(struct btrfs_trans_handle *trans, bitmap_pos = div_u64(found_key.objectid - start, fs_info->sectorsize * BITS_PER_BYTE); - bitmap_cursor = bitmap + bitmap_pos; + bitmap_cursor = ((char *)bitmap) + bitmap_pos; data_size = free_space_bitmap_size(found_key.offset, fs_info->sectorsize); @@ -412,32 +411,16 @@ int convert_free_space_to_extents(struct btrfs_trans_handle *trans, btrfs_mark_buffer_dirty(leaf); btrfs_release_path(path); - offset = start; - bitnr = 0; - while (offset < end) { - bit = !!le_test_bit(bitnr, bitmap); - if (prev_bit == 0 && bit == 1) { - extent_start = offset; - } else if (prev_bit == 1 && bit == 0) { - key.objectid = extent_start; - key.type = BTRFS_FREE_SPACE_EXTENT_K
Re: d_instantiate() and unlock_new_inode() order in btrfs_mkdir()
On Thu, Apr 19, 2018 at 01:15:59AM +0100, Al Viro wrote:
> On Thu, Apr 19, 2018 at 01:06:13AM +0100, Al Viro wrote:
> > On Wed, Apr 18, 2018 at 05:00:29PM -0700, Eric Biggers wrote:
> > > Hi Chris and other btrfs folks,
> > >
> > > btrfs_mkdir() calls d_instantiate() before unlock_new_inode(), which is
> > > wrong because it exposes the inode to lookups before it's been fully
> > > initialized.
> >
> > Huh? It *is* fully initialized by that point; what else is left to do?
>
> ISTR something about false positives from lockdep (with
> lockdep_annotate_inode_mutex_key() called too late, perhaps?); said that, it
> was a long time ago and I don't remember details at the moment... Are you
> actually seeing a deadlock there or is that just lockdep complaining?

It's an actual deadlock. unlock_new_inode() calls
lockdep_annotate_inode_mutex_key(), which calls init_rwsem(), which resets
i_rwsem->count while it's read-locked by lookup_slow(). Then the unlock in
lookup_slow() makes i_rwsem->count negative, which makes it appear to be
write-locked.

So no, the inode isn't fully initialized until unlock_new_inode() has run.

Eric
Re: d_instantiate() and unlock_new_inode() order in btrfs_mkdir()
On Thu, Apr 19, 2018 at 01:06:13AM +0100, Al Viro wrote:
> On Wed, Apr 18, 2018 at 05:00:29PM -0700, Eric Biggers wrote:
> > Hi Chris and other btrfs folks,
> >
> > btrfs_mkdir() calls d_instantiate() before unlock_new_inode(), which is
> > wrong because it exposes the inode to lookups before it's been fully
> > initialized.
>
> Huh? It *is* fully initialized by that point; what else is left to do?

ISTR something about false positives from lockdep (with
lockdep_annotate_inode_mutex_key() called too late, perhaps?); said that, it
was a long time ago and I don't remember details at the moment... Are you
actually seeing a deadlock there or is that just lockdep complaining?
Re: d_instantiate() and unlock_new_inode() order in btrfs_mkdir()
On Wed, Apr 18, 2018 at 05:00:29PM -0700, Eric Biggers wrote:
> Hi Chris and other btrfs folks,
>
> btrfs_mkdir() calls d_instantiate() before unlock_new_inode(), which is
> wrong because it exposes the inode to lookups before it's been fully
> initialized.

Huh? It *is* fully initialized by that point; what else is left to do?
d_instantiate() and unlock_new_inode() order in btrfs_mkdir()
Hi Chris and other btrfs folks,

btrfs_mkdir() calls d_instantiate() before unlock_new_inode(), which is wrong
because it exposes the inode to lookups before it's been fully initialized.
Most filesystems get it right, but f2fs and btrfs don't. I sent a f2fs patch
(https://marc.info/?l=linux-fsdevel&m=152409178431350) and was going to send
a btrfs patch too, but in btrfs_mkdir() there is actually a comment claiming
that the existing order is intentional:

	d_instantiate(dentry, inode);
	/*
	 * mkdir is special. We're unlocking after we call d_instantiate
	 * to avoid a race with nfsd calling d_instantiate.
	 */
	unlock_new_inode(inode);

Unfortunately, I cannot find what it is referring to. The comment was added
by commit b0d5d10f41a0 ("Btrfs: use insert_inode_locked4 for inode
creation"). Chris, do you remember exactly what you had in mind when you
wrote this?

And in case anyone wants it, here's a reproducer for the deadlock caused by
the current code that calls d_instantiate() before unlock_new_inode().
Note: it needs CONFIG_DEBUG_LOCK_ALLOC=y.

	#include <sys/stat.h>
	#include <unistd.h>

	int main()
	{
		struct stat stbuf;

		if (fork() == 0) {
			for (;;)
				stat("dir/file", &stbuf);
		} else {
			for (;;) {
				mkdir("dir", 0777);
				stat("dir/file", &stbuf);
				rmdir("dir");
			}
		}
	}
Re: [PATCH v3] btrfs: Do super block verification before writing it to disk
On 2018-04-19 06:04, David Sterba wrote:
> On Tue, Apr 17, 2018 at 09:47:19AM +0800, Qu Wenruo wrote:
>> @@ -2680,7 +2681,7 @@ int open_ctree(struct super_block *sb,
>>
>>  	memcpy(fs_info->fsid, fs_info->super_copy->fsid, BTRFS_FSID_SIZE);
>>
>> -	ret = btrfs_check_super_valid(fs_info);
>> +	ret = btrfs_check_super_valid(fs_info, fs_info->super_copy);
>>  	if (ret) {
>>  		btrfs_err(fs_info, "superblock contains fatal errors");
>>  		err = -EINVAL;
>> @@ -3310,6 +3311,27 @@ static int write_dev_supers(struct btrfs_device
>> *device,
>
> This is in write_dev_supers, so the superblock is checked
> number-of-devices times. The caller write_all_supers rewrites the device
> item so it matches the device it's going to write to. But
> btrfs_check_super_valid does not validate the dev_item, so all the
> validation does not bring much benefit, as it repeatedly checks the same
> data.
>
> So, what if the validation is done only once in write_all_supers? Lock
> the devices, validate, if it fails, report that and unlock devices and
> go readonly.

Makes sense. I'll update btrfs_check_super_valid() to cooperate with that
in the next update.

Thanks,
Qu

> There's a difference to what you implemented: if the in-memory superblock
> corruption happens between writing to the devices, there are some left
> with the new superblock and some with the old.
>
> Although this sounds quite improbable, I think that doing the check in
> advance would save some trouble if that happens. The superblocks on all
> devices will match.
Re: [PATCH 2/2] btrfs: optimize free space tree bitmap conversion
On 04/18/2018 03:26 PM, David Sterba wrote:
> On Wed, Apr 18, 2018 at 02:30:59PM -0700, Howard McLauchlan wrote:
>> Presently, convert_free_space_to_extents() does a linear scan of the
>> bitmap. We can speed this up with find_next_{bit,zero_bit}_le().
>>
>> This patch replaces the linear scan with find_next_{bit,zero_bit}_le().
>> Testing shows a 20-33% decrease in execution time for
>> convert_free_space_to_extents().
>>
>> Suggested-by: Omar Sandoval
>> Signed-off-by: Howard McLauchlan
>> ---
>>
>> Since we change bitmap to be unsigned long, we have to do some casting
>> for the bitmap cursor. In le_bitmap_set() it makes sense to use u8, as
>> we are doing bit operations. Everywhere else, we're just using it for
>> pointer arithmetic and not directly accessing it, so char seems more
>> appropriate.
>
> Ok, makes sense for just passing the pointers around. I'll add the text
> to the changelog and apply the patch to next.
>
>> -	bit = !!le_test_bit(bitnr, bitmap);
>
> This is the last use of le_test_bit, so it can be removed (in another
> patch).

I just found an issue in this patch that should be fixed. I'll address
moving le_bitmap_set(), removing le_test_bit(), and send a V2 for both
these patches.

Howard
Re: [PATCH 2/2] btrfs: optimize free space tree bitmap conversion
On Wed, Apr 18, 2018 at 02:30:59PM -0700, Howard McLauchlan wrote:
> Presently, convert_free_space_to_extents() does a linear scan of the
> bitmap. We can speed this up with find_next_{bit,zero_bit}_le().
>
> This patch replaces the linear scan with find_next_{bit,zero_bit}_le().
> Testing shows a 20-33% decrease in execution time for
> convert_free_space_to_extents().
>
> Suggested-by: Omar Sandoval
> Signed-off-by: Howard McLauchlan
> ---
>
> Since we change bitmap to be unsigned long, we have to do some casting for
> the bitmap cursor. In le_bitmap_set() it makes sense to use u8, as we are
> doing bit operations. Everywhere else, we're just using it for pointer
> arithmetic and not directly accessing it, so char seems more appropriate.

Ok, makes sense for just passing the pointers around. I'll add the text
to the changelog and apply the patch to next.

> -	bit = !!le_test_bit(bitnr, bitmap);

This is the last use of le_test_bit(), so it can be removed (in another
patch).
Re: [PATCH 1/2] btrfs: remove le_bitmap_clear()
On Wed, Apr 18, 2018 at 02:30:58PM -0700, Howard McLauchlan wrote:
> le_bitmap_clear() was implemented as a parallel to le_bitmap_set() but
> is actually not being used. This patch removes it.
>
> Signed-off-by: Howard McLauchlan

Reviewed-by: David Sterba

I've noticed that le_bitmap_set() is only used by the free-space-tree, so
it can be moved there and made static.
Re: [PATCH v3] btrfs: Do super block verification before writing it to disk
On Tue, Apr 17, 2018 at 09:47:19AM +0800, Qu Wenruo wrote:
> @@ -2680,7 +2681,7 @@ int open_ctree(struct super_block *sb,
>
>  	memcpy(fs_info->fsid, fs_info->super_copy->fsid, BTRFS_FSID_SIZE);
>
> -	ret = btrfs_check_super_valid(fs_info);
> +	ret = btrfs_check_super_valid(fs_info, fs_info->super_copy);
>  	if (ret) {
>  		btrfs_err(fs_info, "superblock contains fatal errors");
>  		err = -EINVAL;
> @@ -3310,6 +3311,27 @@ static int write_dev_supers(struct btrfs_device
> *device,

This is in write_dev_supers, so the superblock is checked
number-of-devices times. The caller write_all_supers rewrites the device
item so it matches the device it's going to write to. But
btrfs_check_super_valid does not validate the dev_item, so all the
validation does not bring much benefit, as it repeatedly checks the same
data.

So, what if the validation is done only once in write_all_supers? Lock
the devices, validate, and if that fails, report it, unlock the devices,
and go read-only.

There's a difference to what you implemented: if the in-memory superblock
corruption happens between writing to the devices, some are left with the
new superblock and some with the old.

Although this sounds quite improbable, I think that doing the check in
advance would save some trouble if that happens. The superblocks on all
devices will match.
[PATCH 2/2] btrfs: optimize free space tree bitmap conversion
Presently, convert_free_space_to_extents() does a linear scan of the bitmap. We can speed this up with find_next_{bit,zero_bit}_le(). This patch replaces the linear scan with find_next_{bit,zero_bit}_le(). Testing shows a 20-33% decrease in execution time for convert_free_space_to_extents(). Suggested-by: Omar Sandoval Signed-off-by: Howard McLauchlan --- Since we change bitmap to be unsigned long, we have to do some casting for the bitmap cursor. In le_bitmap_set() it makes sense to use u8, as we are doing bit operations. Everywhere else, we're just using it for pointer arithmetic and not directly accessing it, so char seems more appropriate. Howard fs/btrfs/extent_io.c | 4 +-- fs/btrfs/extent_io.h | 2 +- fs/btrfs/free-space-tree.c | 54 ++ 3 files changed, 22 insertions(+), 38 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index cf50278f1c59..1d984b0acd66 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -5620,9 +5620,9 @@ void copy_extent_buffer(struct extent_buffer *dst, struct extent_buffer *src, } } -void le_bitmap_set(u8 *map, unsigned int start, int len) +void le_bitmap_set(unsigned long *map, unsigned int start, int len) { - u8 *p = map + BIT_BYTE(start); + u8 *p = ((u8 *)map) + BIT_BYTE(start); const unsigned int size = start + len; int bits_to_set = BITS_PER_BYTE - (start % BITS_PER_BYTE); u8 mask_to_set = BITMAP_FIRST_BYTE_MASK(start); diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h index 4b8b072d9594..01261306e5f9 100644 --- a/fs/btrfs/extent_io.h +++ b/fs/btrfs/extent_io.h @@ -84,7 +84,7 @@ static inline int le_test_bit(int nr, const u8 *addr) return 1U & (addr[BIT_BYTE(nr)] >> (nr & (BITS_PER_BYTE-1))); } -void le_bitmap_set(u8 *map, unsigned int start, int len); +void le_bitmap_set(unsigned long *map, unsigned int start, int len); struct extent_state; struct btrfs_root; diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c index 32a0f6cb5594..0ddf96b01a3a 100644 --- 
a/fs/btrfs/free-space-tree.c +++ b/fs/btrfs/free-space-tree.c @@ -138,9 +138,9 @@ static inline u32 free_space_bitmap_size(u64 size, u32 sectorsize) return DIV_ROUND_UP((u32)div_u64(size, sectorsize), BITS_PER_BYTE); } -static u8 *alloc_bitmap(u32 bitmap_size) +static unsigned long *alloc_bitmap(u32 bitmap_size) { - u8 *ret; + unsigned long *ret; unsigned int nofs_flag; /* @@ -166,7 +166,8 @@ int convert_free_space_to_bitmaps(struct btrfs_trans_handle *trans, struct btrfs_free_space_info *info; struct btrfs_key key, found_key; struct extent_buffer *leaf; - u8 *bitmap, *bitmap_cursor; + unsigned long *bitmap; + char *bitmap_cursor; u64 start, end; u64 bitmap_range, i; u32 bitmap_size, flags, expected_extent_count; @@ -255,7 +256,7 @@ int convert_free_space_to_bitmaps(struct btrfs_trans_handle *trans, goto out; } - bitmap_cursor = bitmap; + bitmap_cursor = (char *)bitmap; bitmap_range = fs_info->sectorsize * BTRFS_FREE_SPACE_BITMAP_BITS; i = start; while (i < end) { @@ -304,13 +305,10 @@ int convert_free_space_to_extents(struct btrfs_trans_handle *trans, struct btrfs_free_space_info *info; struct btrfs_key key, found_key; struct extent_buffer *leaf; - u8 *bitmap; + unsigned long *bitmap; u64 start, end; - /* Initialize to silence GCC. 
*/ - u64 extent_start = 0; - u64 offset; u32 bitmap_size, flags, expected_extent_count; - int prev_bit = 0, bit, bitnr; + unsigned long nrbits, start_bit, end_bit; u32 extent_count = 0; int done = 0, nr; int ret; @@ -348,7 +346,7 @@ int convert_free_space_to_extents(struct btrfs_trans_handle *trans, break; } else if (found_key.type == BTRFS_FREE_SPACE_BITMAP_KEY) { unsigned long ptr; - u8 *bitmap_cursor; + char *bitmap_cursor; u32 bitmap_pos, data_size; ASSERT(found_key.objectid >= start); @@ -358,7 +356,7 @@ int convert_free_space_to_extents(struct btrfs_trans_handle *trans, bitmap_pos = div_u64(found_key.objectid - start, fs_info->sectorsize * BITS_PER_BYTE); - bitmap_cursor = bitmap + bitmap_pos; + bitmap_cursor = ((char *)bitmap) + bitmap_pos; data_size = free_space_bitmap_size(found_key.offset, fs_info->sectorsize); @@ -392,32 +39
[PATCH 1/2] btrfs: remove le_bitmap_clear()
le_bitmap_clear() was implemented as a parallel to le_bitmap_set() but is actually not being used. This patch removes it. Signed-off-by: Howard McLauchlan --- fs/btrfs/extent_io.c | 20 fs/btrfs/extent_io.h | 1 - 2 files changed, 21 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index e99b329002cf..cf50278f1c59 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -5640,26 +5640,6 @@ void le_bitmap_set(u8 *map, unsigned int start, int len) } } -void le_bitmap_clear(u8 *map, unsigned int start, int len) -{ - u8 *p = map + BIT_BYTE(start); - const unsigned int size = start + len; - int bits_to_clear = BITS_PER_BYTE - (start % BITS_PER_BYTE); - u8 mask_to_clear = BITMAP_FIRST_BYTE_MASK(start); - - while (len - bits_to_clear >= 0) { - *p &= ~mask_to_clear; - len -= bits_to_clear; - bits_to_clear = BITS_PER_BYTE; - mask_to_clear = ~0; - p++; - } - if (len) { - mask_to_clear &= BITMAP_LAST_BYTE_MASK(size); - *p &= ~mask_to_clear; - } -} - /* * eb_bitmap_offset() - calculate the page and offset of the byte containing the * given bit number diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h index a53009694b16..4b8b072d9594 100644 --- a/fs/btrfs/extent_io.h +++ b/fs/btrfs/extent_io.h @@ -85,7 +85,6 @@ static inline int le_test_bit(int nr, const u8 *addr) } void le_bitmap_set(u8 *map, unsigned int start, int len); -void le_bitmap_clear(u8 *map, unsigned int start, int len); struct extent_state; struct btrfs_root; -- 2.17.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3] btrfs: Fix race condition between delayed refs and blockgroup removal
On Wed, Apr 18, 2018 at 09:41:54AM +0300, Nikolay Borisov wrote: > When the delayed refs for a head are all run, eventually > cleanup_ref_head is called which (in case of deletion) obtains a > reference for the relevant btrfs_space_info struct by querying the bg > for the range. This is problematic because when the last extent of a > bg is deleted a race window emerges between removal of that bg and the > subsequent invocation of cleanup_ref_head. This can result in cache being null > and either a null pointer dereference or assertion failure. > > task: 8d04d31ed080 task.stack: 9e5dc10cc000 > RIP: 0010:assfail.constprop.78+0x18/0x1a [btrfs] > RSP: 0018:9e5dc10cfbe8 EFLAGS: 00010292 > RAX: 0044 RBX: RCX: > RDX: 8d04ffc1f868 RSI: 8d04ffc178c8 RDI: 8d04ffc178c8 > RBP: 8d04d29e5ea0 R08: 01f0 R09: 0001 > R10: 9e5dc0507d58 R11: 0001 R12: 8d04d29e5ea0 > R13: 8d04d29e5f08 R14: 8d04efe29b40 R15: 8d04efe203e0 > FS: 7fbf58ead500() GS:8d04ffc0() > knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 7fe6c6975648 CR3: 13b2a000 CR4: 06f0 > DR0: DR1: DR2: > DR3: DR6: fffe0ff0 DR7: 0400 > Call Trace: >__btrfs_run_delayed_refs+0x10e7/0x12c0 [btrfs] >btrfs_run_delayed_refs+0x68/0x250 [btrfs] >btrfs_should_end_transaction+0x42/0x60 [btrfs] >btrfs_truncate_inode_items+0xaac/0xfc0 [btrfs] >btrfs_evict_inode+0x4c6/0x5c0 [btrfs] >evict+0xc6/0x190 >do_unlinkat+0x19c/0x300 >do_syscall_64+0x74/0x140 >entry_SYSCALL_64_after_hwframe+0x3d/0xa2 > RIP: 0033:0x7fbf589c57a7 > > To fix this, introduce a new flag "is_system" to head_ref structs, > which is populated at insertion time. This allows to decouple the > querying for the spaceinfo from querying the possibly deleted bg. > > Fixes: d7eae3403f46 ("Btrfs: rework delayed ref total_bytes_pinned > accounting") > Suggested-by: Omar Sandoval > Signed-off-by: Nikolay Borisov Added to next, thanks. 
> @@ -550,8 +550,10 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
>  		struct btrfs_delayed_ref_head *head_ref,
>  		struct btrfs_qgroup_extent_record *qrecord,
>  		u64 bytenr, u64 num_bytes, u64 ref_root, u64 reserved,
> -	int action, int is_data, int *qrecord_inserted_ret,
> +	int action, int is_data, int is_system,
> +	int *qrecord_inserted_ret,
>  		int *old_ref_mod, int *new_ref_mod)

At some point we might want to peel off a few arguments of that function.
It now has 14. The fs_info/trans should work and the
action/is_data/is_system could be merged to a bitmask.
Re: nvme+btrfs+compression sensibility and benchmark
Thank you, all. Though the info is useful, there's not a clear consensus on
what I should expect. For interest's sake, I'll post benchmarks from the
device itself when it arrives. I'm expecting at least that I'll be blown
away :)

On 04/18/2018 09:23 PM, Chris Murphy wrote:
> On Wed, Apr 18, 2018 at 10:38 AM, Austin S. Hemmelgarn
> <ahferro...@gmail.com> wrote:
>> For reference, the zstd compression in BTRFS uses level 3 by default (as
>> does zlib compression IIRC), though I'm not sure about lzop (I think it
>> uses the lowest compression setting).
>
> The user space tool, zstd, does default to 3, according to its man page.
>
>     -#    # compression level [1-19] (default: 3)
>
> However, the kernel is claiming it's level 0, which doesn't exist in the
> man page. So I have no idea what we're using. This is what I get with
> mount option compress=zstd
>
>     [    4.097858] BTRFS info (device nvme0n1p9): use zstd compression, level 0
>
> --
> Chris Murphy
Re: [PATCH v5 0/3] Allow rmdir(2) to delete a subvolume
On Wed, Apr 18, 2018 at 11:32:57AM +0900, Misono Tomohiro wrote:
> changelog:
> v4 -> v5 ... Merge 2nd and 3rd patches and update commit log
>              No code change in total
> v3 -> v4 ... Reorganize patches and update commit log
>              No code change in total
> v2 -> v3 ... Use if-else block instead of two if blocks and
>              add Tested-by tag in 2nd patch
> v1 -> v2 ... Split the patch to hopefully make review easier
>
> Note: I will send a xfstest if this series is merged.
>
> This series changes the behavior of rmdir(2) and allows it to delete an
> empty subvolume.
>
> To do so, the 1st and 2nd patches refactor btrfs_ioctl_snap_destroy()
> and extract the actual deletion process as btrfs_delete_subvolume()
> (the remaining part of btrfs_ioctl_snap_destroy() is mainly permission
> checks).
>
> Then, the 3rd patch changes btrfs_rmdir() to call this function. The
> required permission check is already done in the vfs layer.
>
> Tomohiro Misono (3):
>   btrfs: Move may_destroy_subvol() from ioctl.c to inode.c
>   btrfs: Factor out the main deletion process from btrfs_ioctl_snap_destroy()
>   btrfs: Allow rmdir(2) to delete an empty subvolume

Looks good to me now, thanks. I'll add it to next for more testing.

The 3rd patch may need some updates in the changelog about the change in
behaviour, but this can be added later.

Tests should cover the common usage like rm -rf /path/to/subvol, removing
the empty subvol (the subvol stub), snapshots and rw subvolumes, nested in
a nontrivial way.
Re: nvme+btrfs+compression sensibility and benchmark
On Wed, Apr 18, 2018 at 10:38 AM, Austin S. Hemmelgarn wrote: > For reference, the zstd compression in BTRFS uses level 3 by default (as > does zlib compression IIRC), though I'm not sure about lzop (I think it > uses the lowest compression setting). > The user space tool, zstd, does default to 3, according to its man page. -# # compression level [1-19] (default: 3) However, the kernel is claiming it's level 0, which doesn't exist in the man page. So I have no idea what we're using. This is what I get with mount option compress=zstd [4.097858] BTRFS info (device nvme0n1p9): use zstd compression, level 0 -- Chris Murphy
Re: [PATCH] btrfs: optimize free space tree bitmap conversion
On 04/18/2018 08:50 AM, David Sterba wrote: > On Tue, Apr 17, 2018 at 04:19:04PM -0700, Howard McLauchlan wrote: >> Presently, convert_free_space_to_extents() does a linear scan of the >> bitmap. We can speed this up with find_next_{bit,zero_bit}_le(). >> >> This patch replaces the linear scan with find_next_{bit,zero_bit}_le(). >> Testing shows a 20-33% decrease in execution time for >> convert_free_space_to_extents(). > Sounds good. > >> Suggested-by: Omar Sandoval >> Signed-off-by: Howard McLauchlan >> --- >> Based on 4.17-rc1 >> >> fs/btrfs/extent_io.c | 12 - >> fs/btrfs/extent_io.h | 4 +-- >> fs/btrfs/free-space-tree.c | 54 ++ >> 3 files changed, 27 insertions(+), 43 deletions(-) >> >> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c >> index e99b329002cf..1c0e7ce49556 100644 >> --- a/fs/btrfs/extent_io.c >> +++ b/fs/btrfs/extent_io.c >> @@ -5620,12 +5620,12 @@ void copy_extent_buffer(struct extent_buffer *dst, >> struct extent_buffer *src, >> } >> } >> >> -void le_bitmap_set(u8 *map, unsigned int start, int len) >> +void le_bitmap_set(unsigned long *map, unsigned int start, int len) >> { >> -u8 *p = map + BIT_BYTE(start); >> +char *p = ((char *)map) + BIT_BYTE(start); >> const unsigned int size = start + len; >> int bits_to_set = BITS_PER_BYTE - (start % BITS_PER_BYTE); >> -u8 mask_to_set = BITMAP_FIRST_BYTE_MASK(start); >> +char mask_to_set = BITMAP_FIRST_BYTE_MASK(start); > Why do you switch to char? As there are bit operations done below (that > you don't change), the unsigned type seems correct. 
> >> >> while (len - bits_to_set >= 0) { >> *p |= mask_to_set; >> @@ -5640,12 +5640,12 @@ void le_bitmap_set(u8 *map, unsigned int start, int >> len) >> } >> } >> >> -void le_bitmap_clear(u8 *map, unsigned int start, int len) >> +void le_bitmap_clear(unsigned long *map, unsigned int start, int len) >> { >> -u8 *p = map + BIT_BYTE(start); >> +char *p = ((char *)map) + BIT_BYTE(start); >> const unsigned int size = start + len; >> int bits_to_clear = BITS_PER_BYTE - (start % BITS_PER_BYTE); >> -u8 mask_to_clear = BITMAP_FIRST_BYTE_MASK(start); >> +char mask_to_clear = BITMAP_FIRST_BYTE_MASK(start); > Same here and in below. Otherwise it looks ok. > >> >> while (len - bits_to_clear >= 0) { >> *p &= ~mask_to_clear; >> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h >> index a53009694b16..a6db233f4a40 100644 >> --- a/fs/btrfs/extent_io.h >> +++ b/fs/btrfs/extent_io.h >> @@ -84,8 +84,8 @@ static inline int le_test_bit(int nr, const u8 *addr) >> return 1U & (addr[BIT_BYTE(nr)] >> (nr & (BITS_PER_BYTE-1))); >> } >> >> -void le_bitmap_set(u8 *map, unsigned int start, int len); >> -void le_bitmap_clear(u8 *map, unsigned int start, int len); >> +void le_bitmap_set(unsigned long *map, unsigned int start, int len); >> +void le_bitmap_clear(unsigned long *map, unsigned int start, int len); >> >> struct extent_state; >> struct btrfs_root; >> diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c >> index 32a0f6cb5594..0ddf96b01a3a 100644 >> --- a/fs/btrfs/free-space-tree.c >> +++ b/fs/btrfs/free-space-tree.c >> @@ -138,9 +138,9 @@ static inline u32 free_space_bitmap_size(u64 size, u32 >> sectorsize) >> return DIV_ROUND_UP((u32)div_u64(size, sectorsize), BITS_PER_BYTE); >> } >> >> -static u8 *alloc_bitmap(u32 bitmap_size) >> +static unsigned long *alloc_bitmap(u32 bitmap_size) >> { >> -u8 *ret; >> +unsigned long *ret; >> unsigned int nofs_flag; >> >> /* >> @@ -166,7 +166,8 @@ int convert_free_space_to_bitmaps(struct >> btrfs_trans_handle *trans, >> struct 
btrfs_free_space_info *info; >> struct btrfs_key key, found_key; >> struct extent_buffer *leaf; >> -u8 *bitmap, *bitmap_cursor; >> +unsigned long *bitmap; >> +char *bitmap_cursor; >> u64 start, end; >> u64 bitmap_range, i; >> u32 bitmap_size, flags, expected_extent_count; >> @@ -255,7 +256,7 @@ int convert_free_space_to_bitmaps(struct >> btrfs_trans_handle *trans, >> goto out; >> } >> >> -bitmap_cursor = bitmap; >> +bitmap_cursor = (char *)bitmap; >> bitmap_range = fs_info->sectorsize * BTRFS_FREE_SPACE_BITMAP_BITS; >> i = start; >> while (i < end) { >> @@ -304,13 +305,10 @@ int convert_free_space_to_extents(struct >> btrfs_trans_handle *trans, >> struct btrfs_free_space_info *info; >> struct btrfs_key key, found_key; >> struct extent_buffer *leaf; >> -u8 *bitmap; >> +unsigned long *bitmap; >> u64 start, end; >> -/* Initialize to silence GCC. */ >> -u64 extent_start = 0; >> -u64 offset; >> u32 bitmap_size, flags, expected_extent_count; >> -int prev_bit = 0, bit, bitnr; >> +unsigned long nrbits, start_bit, end_bit; >> u32 extent_count = 0; >> int done = 0, nr; >> int ret; >>
Re: [PATCH] btrfs: delayed-inode: Remove wrong qgroup meta reservation calls
On Wed, Apr 18, 2018 at 09:08:10AM +0800, Qu Wenruo wrote: > >>dst_rsv = &fs_info->delayed_block_rsv; > >> > >>num_bytes = btrfs_calc_trans_metadata_size(fs_info, 1); > >> + > >> + /* > >> + * Here we migrate space rsv from transaction rsv, since have > >> + * already reserved space when starting a transaction. > >> + * So no need to reserve qgroup space here. > >> + */ > > > > Please format the comments to the full line width. > > Right, the already can go previous line without exceeding 80 chars. > > But the "So no need to..." line is a new line so it will not take up any > space of previous line anyway. You can split the loosely related parts by an empty line, ie. paragraphs, but I don't tend to like if the new sentence on a new line when it logically follows the previous one. > >> @@ -647,7 +657,9 @@ static int btrfs_delayed_inode_reserve_metadata( > >> "delayed_inode", > >> btrfs_ino(inode), > >> num_bytes, 1); > >> - } > >> + } else > >> + btrfs_qgroup_free_meta_prealloc(root, > >> + fs_info->nodesize); > > > > Please don't diverge from the coding style, I'm fixing such issues but, > > you know. > > For this, did you mean the bracket for else branch? Yes. if () { ... } else { ... } -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: nvme+btrfs+compression sensibility and benchmark
On 18.04.2018 21:28, David Sterba wrote: > On Wed, Apr 18, 2018 at 06:14:07PM +0300, Nikolay Borisov wrote: >> >> >> On 18.04.2018 18:10, Brendan Hide wrote: >>> Hi, all >>> >>> I'm looking for some advice re compression with NVME. Compression helps >>> performance with a minor CPU hit - but is it still worth it with the far >>> higher throughputs offered by newer PCI and NVME-type SSDs? >>> >>> I've ordered a PCIe-to-M.2 adapter along with a 1TB 960 Evo drive for my >>> home desktop. I previously used compression on an older SATA-based Intel >>> 520 SSD, where compression made sense. >>> >>> However, the wisdom isn't so clear-cut if the SSD is potentially faster >>> than the compression algorithm with my CPU (aging i7 3770). >>> >>> Testing using a copy of the kernel source tarball in tmpfs it seems my >>> system can compress/decompress at about 670MB/s using zstd with 8 >>> threads. lzop isn't that far behind. But I'm not sure if the benchmark >>> I'm running is the same as how btrfs would be using it internally. >>> >>> Given these numbers I'm inclined to believe compression will make things >>> slower - but can't be sure without knowing if I'm testing correctly. >>> >>> What is the best practice with benchmarking and with NVME/PCI storage? >> >> btrfs doesn't support DAX so using it on NVME doesn't make much sense >> performance wise. > > Isn't NVMe just "the faster SSD"? Not the persistent memory thing. Indeed, brain fart on my part. NVDIMM is the persistent memory thing.
Re: nvme+btrfs+compression sensibility and benchmark
On Wed, Apr 18, 2018 at 06:14:07PM +0300, Nikolay Borisov wrote: > > > On 18.04.2018 18:10, Brendan Hide wrote: > > Hi, all > > > > I'm looking for some advice re compression with NVME. Compression helps > > performance with a minor CPU hit - but is it still worth it with the far > > higher throughputs offered by newer PCI and NVME-type SSDs? > > > > I've ordered a PCIe-to-M.2 adapter along with a 1TB 960 Evo drive for my > > home desktop. I previously used compression on an older SATA-based Intel > > 520 SSD, where compression made sense. > > > > However, the wisdom isn't so clear-cut if the SSD is potentially faster > > than the compression algorithm with my CPU (aging i7 3770). > > > > Testing using a copy of the kernel source tarball in tmpfs it seems my > > system can compress/decompress at about 670MB/s using zstd with 8 > > threads. lzop isn't that far behind. But I'm not sure if the benchmark > > I'm running is the same as how btrfs would be using it internally. > > > > Given these numbers I'm inclined to believe compression will make things > > slower - but can't be sure without knowing if I'm testing correctly. > > > > What is the best practice with benchmarking and with NVME/PCI storage? > > btrfs doesn't support DAX so using it on NVME doesn't make much sense > performance wise. Isn't NVMe just "the faster SSD"? Not the persistent memory thing.
Re: [PATCH] btrfs: fix unaligned access in readdir
On Wed, Apr 18, 2018 at 6:17 AM, David Sterba wrote: > The last update to readdir introduced a temporary buffer to store the > emitted readdir data, but as there are file names of variable length, > there's a lot of unaligned access. > > This was observed on a sparc64 machine: > > Kernel unaligned access at TPC[102f3080] btrfs_real_readdir+0x51c/0x718 > [btrfs] > Reviewed-by: Liu Bo thanks, liubo > Fixes: 23b5ec74943 ("btrfs: fix readdir deadlock with pagefault") > CC: sta...@vger.kernel.org # 4.14+ > Reported-and-tested-by: René Rebe > Signed-off-by: David Sterba > --- > fs/btrfs/inode.c | 20 > 1 file changed, 12 insertions(+), 8 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index e064c49c9a9a..d241285a0d2a 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -31,6 +31,7 @@ > #include > #include > #include > +#include > #include "ctree.h" > #include "disk-io.h" > #include "transaction.h" > @@ -5905,11 +5906,13 @@ static int btrfs_filldir(void *addr, int entries, > struct dir_context *ctx) > struct dir_entry *entry = addr; > char *name = (char *)(entry + 1); > > - ctx->pos = entry->offset; > - if (!dir_emit(ctx, name, entry->name_len, entry->ino, > - entry->type)) > + ctx->pos = get_unaligned(&entry->offset); > + if (!dir_emit(ctx, name, get_unaligned(&entry->name_len), > +get_unaligned(&entry->ino), > +get_unaligned(&entry->type))) > return 1; > - addr += sizeof(struct dir_entry) + entry->name_len; > + addr += sizeof(struct dir_entry) + > + get_unaligned(&entry->name_len); > ctx->pos++; > } > return 0; > @@ -5999,14 +6002,15 @@ static int btrfs_real_readdir(struct file *file, > struct dir_context *ctx) > } > > entry = addr; > - entry->name_len = name_len; > + put_unaligned(name_len, &entry->name_len); > name_ptr = (char *)(entry + 1); > read_extent_buffer(leaf, name_ptr, (unsigned long)(di + 1), >name_len); > - entry->type = btrfs_filetype_table[btrfs_dir_type(leaf, di)]; > + put_unaligned(btrfs_filetype_table[btrfs_dir_type(leaf, 
di)], > + &entry->type); > btrfs_dir_item_key_to_cpu(leaf, di, &location); > - entry->ino = location.objectid; > - entry->offset = found_key.offset; > + put_unaligned(location.objectid, &entry->ino); > + put_unaligned(found_key.offset, &entry->offset); > entries++; > addr += sizeof(struct dir_entry) + name_len; > total_len += sizeof(struct dir_entry) + name_len; > -- > 2.16.2 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Filesystem full with unallocated space?
On 17.04.2018 19:08, Matthew Lai wrote: > Hello! > > I am getting ENOSPC on my root filesystem with plenty of unallocated > space according to "fi usage". Any idea why that may be? This is a root > partition for Debian Stable. Not doing anything unusual that I'm aware > of. No snapshots. > > Thanks! > > Matthew > > uname -a: > Linux bigfoot 4.9.0-4-amd64 #1 SMP Debian 4.9.65-3+deb9u1 (2017-12-23) > x86_64 GNU/Linux > > btrfs --version: > btrfs-progs v4.7.3 > > btrfs fi show: > Label: none uuid: 2364c63f-e20c-410f-90b4-05f722ee1c77 > Total devices 1 FS bytes used 176.27GiB > devid 1 size 246.33GiB used 185.01GiB path /dev/sda2 > > btrfs fi df /: > Data, single: total=183.00GiB, used=175.78GiB > System, single: total=4.00MiB, used=48.00KiB > Metadata, single: total=2.01GiB, used=504.81MiB > GlobalReserve, single: total=211.39MiB, used=0.00B > You haven't shown "btrfs fi usage". In any case there were some patches in more recent kernels which deal with premature ENOSPC: 996478ca9c46 ("btrfs: change how we decide to commit transactions during flushing") 17024ad0a0fd ("Btrfs: fix early ENOSPC due to delalloc") I'd advise you to update to 4.14 stable. Otherwise running: git log --oneline --grep "enospc" fs/btrfs/ will show you likely candidates.
Re: nvme+btrfs+compression sensibility and benchmark
On 2018-04-18 11:10, Brendan Hide wrote: > Hi, all > > I'm looking for some advice re compression with NVME. Compression helps > performance with a minor CPU hit - but is it still worth it with the far > higher throughputs offered by newer PCI and NVME-type SSDs? > > I've ordered a PCIe-to-M.2 adapter along with a 1TB 960 Evo drive for my > home desktop. I previously used compression on an older SATA-based Intel > 520 SSD, where compression made sense. > > However, the wisdom isn't so clear-cut if the SSD is potentially faster > than the compression algorithm with my CPU (aging i7 3770). > > Testing using a copy of the kernel source tarball in tmpfs it seems my > system can compress/decompress at about 670MB/s using zstd with 8 > threads. lzop isn't that far behind. But I'm not sure if the benchmark > I'm running is the same as how btrfs would be using it internally. BTRFS compresses chunks of 128k at a time and compresses each block one at a time (it doesn't do multi-threaded compression). You can simulate this a bit better by splitting the files you're trying to compress into 128k chunks (calling `split -b 131072` on the file will do this quickly and easily), and then passing all those chunks to the compression program _at the same time_ (this eliminates the overhead of re-invoking the compressor for each block), and then running it with one thread. For reference, the zstd compression in BTRFS uses level 3 by default (as does zlib compression IIRC), though I'm not sure about lzop (I think it uses the lowest compression setting). Note that this will still not be entirely accurate (there are significant differences in buffer handling in the in-kernel implementations because of memory management differences). Another option is to see how long it takes to copy the test data into a ZRAM device. This will eliminate the storage overhead, and use the same compression algorithms that BTRFS does (the only big difference is that it compresses by page, so it will use 4k blocks instead of 128k). zRAM currently doesn't support zstd (though patches have been posted), but it by default uses lzo, and it supports deflate as well (which is essentially the same mathematically as the 'zlib' compression method in BTRFS). > Given these numbers I'm inclined to believe compression will make things > slower - but can't be sure without knowing if I'm testing correctly. On NVMe, yes, it's probably not worth it for speed. It may however help in other ways. Compressed writes are smaller than normal writes. This means that rewriting a file that is compressed by the filesystem will result in fewer rewritten blocks of storage, which can be useful when dealing with flash memory. Less written data also means you leave a bit more free space for the wear-leveling algorithms to work with, which can improve performance on some devices.
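The splitting approach suggested above can be sketched roughly as follows. This is a hypothetical harness, not a btrfs tool: gzip -3 stands in for btrfs's zlib level 3 since zstd may not be installed, the 4 MiB urandom sample is arbitrary (a kernel tarball would be a more realistic, compressible input), and paths are made up.

```shell
# Simulate btrfs-style compression: independent 128 KiB blocks, one thread,
# one compressor invocation over all blocks (avoids per-block startup cost).
# gzip -3 approximates btrfs's zlib level 3; use "zstd -3 -T1" if available.
# Wrap the gzip line in time(1) to measure throughput.
workdir=$(mktemp -d) && cd "$workdir"
dd if=/dev/urandom of=sample.bin bs=1M count=4 2>/dev/null
split -b 131072 sample.bin chunk.   # 131072 bytes = 128 KiB, as btrfs uses
gzip -3 chunk.??                    # compresses every block, single-threaded
ls chunk.??.gz | wc -l              # 4 MiB / 128 KiB = 32 compressed blocks
```

As noted above this still won't match the in-kernel buffer handling exactly, but it is much closer to btrfs's access pattern than compressing one large file with eight threads.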
Re: [PATCH] btrfs: optimize free space tree bitmap conversion
On Tue, Apr 17, 2018 at 04:19:04PM -0700, Howard McLauchlan wrote: > Presently, convert_free_space_to_extents() does a linear scan of the > bitmap. We can speed this up with find_next_{bit,zero_bit}_le(). > > This patch replaces the linear scan with find_next_{bit,zero_bit}_le(). > Testing shows a 20-33% decrease in execution time for > convert_free_space_to_extents(). Sounds good. > Suggested-by: Omar Sandoval > Signed-off-by: Howard McLauchlan > --- > Based on 4.17-rc1 > > fs/btrfs/extent_io.c | 12 - > fs/btrfs/extent_io.h | 4 +-- > fs/btrfs/free-space-tree.c | 54 ++ > 3 files changed, 27 insertions(+), 43 deletions(-) > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c > index e99b329002cf..1c0e7ce49556 100644 > --- a/fs/btrfs/extent_io.c > +++ b/fs/btrfs/extent_io.c > @@ -5620,12 +5620,12 @@ void copy_extent_buffer(struct extent_buffer *dst, > struct extent_buffer *src, > } > } > > -void le_bitmap_set(u8 *map, unsigned int start, int len) > +void le_bitmap_set(unsigned long *map, unsigned int start, int len) > { > - u8 *p = map + BIT_BYTE(start); > + char *p = ((char *)map) + BIT_BYTE(start); > const unsigned int size = start + len; > int bits_to_set = BITS_PER_BYTE - (start % BITS_PER_BYTE); > - u8 mask_to_set = BITMAP_FIRST_BYTE_MASK(start); > + char mask_to_set = BITMAP_FIRST_BYTE_MASK(start); Why do you switch to char? As there are bit operations done below (that you don't change), the unsigned type seems correct. 
> > while (len - bits_to_set >= 0) { > *p |= mask_to_set; > @@ -5640,12 +5640,12 @@ void le_bitmap_set(u8 *map, unsigned int start, int > len) > } > } > > -void le_bitmap_clear(u8 *map, unsigned int start, int len) > +void le_bitmap_clear(unsigned long *map, unsigned int start, int len) > { > - u8 *p = map + BIT_BYTE(start); > + char *p = ((char *)map) + BIT_BYTE(start); > const unsigned int size = start + len; > int bits_to_clear = BITS_PER_BYTE - (start % BITS_PER_BYTE); > - u8 mask_to_clear = BITMAP_FIRST_BYTE_MASK(start); > + char mask_to_clear = BITMAP_FIRST_BYTE_MASK(start); Same here and in below. Otherwise it looks ok. > > while (len - bits_to_clear >= 0) { > *p &= ~mask_to_clear; > diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h > index a53009694b16..a6db233f4a40 100644 > --- a/fs/btrfs/extent_io.h > +++ b/fs/btrfs/extent_io.h > @@ -84,8 +84,8 @@ static inline int le_test_bit(int nr, const u8 *addr) > return 1U & (addr[BIT_BYTE(nr)] >> (nr & (BITS_PER_BYTE-1))); > } > > -void le_bitmap_set(u8 *map, unsigned int start, int len); > -void le_bitmap_clear(u8 *map, unsigned int start, int len); > +void le_bitmap_set(unsigned long *map, unsigned int start, int len); > +void le_bitmap_clear(unsigned long *map, unsigned int start, int len); > > struct extent_state; > struct btrfs_root; > diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c > index 32a0f6cb5594..0ddf96b01a3a 100644 > --- a/fs/btrfs/free-space-tree.c > +++ b/fs/btrfs/free-space-tree.c > @@ -138,9 +138,9 @@ static inline u32 free_space_bitmap_size(u64 size, u32 > sectorsize) > return DIV_ROUND_UP((u32)div_u64(size, sectorsize), BITS_PER_BYTE); > } > > -static u8 *alloc_bitmap(u32 bitmap_size) > +static unsigned long *alloc_bitmap(u32 bitmap_size) > { > - u8 *ret; > + unsigned long *ret; > unsigned int nofs_flag; > > /* > @@ -166,7 +166,8 @@ int convert_free_space_to_bitmaps(struct > btrfs_trans_handle *trans, > struct btrfs_free_space_info *info; > struct btrfs_key key, 
found_key; > struct extent_buffer *leaf; > - u8 *bitmap, *bitmap_cursor; > + unsigned long *bitmap; > + char *bitmap_cursor; > u64 start, end; > u64 bitmap_range, i; > u32 bitmap_size, flags, expected_extent_count; > @@ -255,7 +256,7 @@ int convert_free_space_to_bitmaps(struct > btrfs_trans_handle *trans, > goto out; > } > > - bitmap_cursor = bitmap; > + bitmap_cursor = (char *)bitmap; > bitmap_range = fs_info->sectorsize * BTRFS_FREE_SPACE_BITMAP_BITS; > i = start; > while (i < end) { > @@ -304,13 +305,10 @@ int convert_free_space_to_extents(struct > btrfs_trans_handle *trans, > struct btrfs_free_space_info *info; > struct btrfs_key key, found_key; > struct extent_buffer *leaf; > - u8 *bitmap; > + unsigned long *bitmap; > u64 start, end; > - /* Initialize to silence GCC. */ > - u64 extent_start = 0; > - u64 offset; > u32 bitmap_size, flags, expected_extent_count; > - int prev_bit = 0, bit, bitnr; > + unsigned long nrbits, start_bit, end_bit; > u32 extent_count = 0; > int done = 0, nr; > int ret; > @@ -348,7 +346,7 @@ int convert_free_space_to_extents(struct > btrfs_trans_handle *trans, >
Re: nvme+btrfs+compression sensibility and benchmark
On 18.04.2018 18:10, Brendan Hide wrote: > Hi, all > > I'm looking for some advice re compression with NVME. Compression helps > performance with a minor CPU hit - but is it still worth it with the far > higher throughputs offered by newer PCI and NVME-type SSDs? > > I've ordered a PCIe-to-M.2 adapter along with a 1TB 960 Evo drive for my > home desktop. I previously used compression on an older SATA-based Intel > 520 SSD, where compression made sense. > > However, the wisdom isn't so clear-cut if the SSD is potentially faster > than the compression algorithm with my CPU (aging i7 3770). > > Testing using a copy of the kernel source tarball in tmpfs it seems my > system can compress/decompress at about 670MB/s using zstd with 8 > threads. lzop isn't that far behind. But I'm not sure if the benchmark > I'm running is the same as how btrfs would be using it internally. > > Given these numbers I'm inclined to believe compression will make things > slower - but can't be sure without knowing if I'm testing correctly. > > What is the best practice with benchmarking and with NVME/PCI storage? btrfs doesn't support DAX so using it on NVME doesn't make much sense performance wise.
nvme+btrfs+compression sensibility and benchmark
Hi, all I'm looking for some advice re compression with NVME. Compression helps performance with a minor CPU hit - but is it still worth it with the far higher throughputs offered by newer PCI and NVME-type SSDs? I've ordered a PCIe-to-M.2 adapter along with a 1TB 960 Evo drive for my home desktop. I previously used compression on an older SATA-based Intel 520 SSD, where compression made sense. However, the wisdom isn't so clear-cut if the SSD is potentially faster than the compression algorithm with my CPU (aging i7 3770). Testing using a copy of the kernel source tarball in tmpfs it seems my system can compress/decompress at about 670MB/s using zstd with 8 threads. lzop isn't that far behind. But I'm not sure if the benchmark I'm running is the same as how btrfs would be using it internally. Given these numbers I'm inclined to believe compression will make things slower - but can't be sure without knowing if I'm testing correctly. What is the best practice with benchmarking and with NVME/PCI storage?
Re: [PATCH v2] btrfs: add comment about BTRFS_FS_EXCL_OP
On Wed, Apr 18, 2018 at 02:59:25PM +0800, Anand Jain wrote: > Adds comments about BTRFS_FS_EXCL_OP to existing comments > about the device locks. > > Signed-off-by: Anand Jain Reviewed-by: David Sterba
Re: Kernel unaligned access at ... btrfs_real_readdir+0x51c/0x718 [btrfs]
On Tue, Apr 17, 2018 at 12:18:45PM +0200, René Rebe wrote: > On 16 Apr 2018, at 21:15, David Sterba wrote: > > On Mon, Apr 16, 2018 at 09:55:45PM +0200, René Rebe wrote: > >> On 04/16/2018 06:48 PM, David Sterba wrote: > >>> The warnings are valid, there's unaligned access introduced by patch > >>> > >>> 23b5ec74943f44378b68c0edd8e210a86318ea5e > >>> btrfs: fix readdir deadlock with pagefault > >>> > >>> The directory entries (struct dir_entry) are copied to a temporary > >>> buffer as they fit, ie. no alignment, and the members accessed in > >>> several places. > >>> > >>> The following patch adds the proper unaligned access, only compile-tested. > >>> Please test and let me know, thanks! > >> Would have loved to immediately give it a try, however, sorry, > >> I forgot to mention I'm on the latest stable release -4.16.2- > >> on a first glance this does not look like it does just apply. > >> > >> I would re-base myself if I would not also have a glibc initialization > >> bug to hunt and debug, too :-/ > >> > >> If you happen to also rebase it for current -stable, ... ;-) > > > > Sure, attached a 4.16.2 version. > > <0001-test-readdir-unaligned-access.patch> > > Cool, thanks, built and so far it works, the warnings are gone. > > https://www.youtube.com/watch?v=XYsKct4T2xk Thanks for testing on real hardware! I'm not aware of regular runtime testing on sparc; the other architectures with strict alignment requirements probably have the emulation turned on by default, so the warnings do not appear. Or nobody bothers to report them, though the fixes are typically simple. I like your vintage hw videos, keep going!
[PATCH] btrfs: fix unaligned access in readdir
The last update to readdir introduced a temporary buffer to store the emitted readdir data, but as there are file names of variable length, there's a lot of unaligned access. This was observed on a sparc64 machine: Kernel unaligned access at TPC[102f3080] btrfs_real_readdir+0x51c/0x718 [btrfs] Fixes: 23b5ec74943 ("btrfs: fix readdir deadlock with pagefault") CC: sta...@vger.kernel.org # 4.14+ Reported-and-tested-by: René Rebe Signed-off-by: David Sterba --- fs/btrfs/inode.c | 20 1 file changed, 12 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index e064c49c9a9a..d241285a0d2a 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -31,6 +31,7 @@ #include #include #include +#include #include "ctree.h" #include "disk-io.h" #include "transaction.h" @@ -5905,11 +5906,13 @@ static int btrfs_filldir(void *addr, int entries, struct dir_context *ctx) struct dir_entry *entry = addr; char *name = (char *)(entry + 1); - ctx->pos = entry->offset; - if (!dir_emit(ctx, name, entry->name_len, entry->ino, - entry->type)) + ctx->pos = get_unaligned(&entry->offset); + if (!dir_emit(ctx, name, get_unaligned(&entry->name_len), +get_unaligned(&entry->ino), +get_unaligned(&entry->type))) return 1; - addr += sizeof(struct dir_entry) + entry->name_len; + addr += sizeof(struct dir_entry) + + get_unaligned(&entry->name_len); ctx->pos++; } return 0; @@ -5999,14 +6002,15 @@ static int btrfs_real_readdir(struct file *file, struct dir_context *ctx) } entry = addr; - entry->name_len = name_len; + put_unaligned(name_len, &entry->name_len); name_ptr = (char *)(entry + 1); read_extent_buffer(leaf, name_ptr, (unsigned long)(di + 1), name_len); - entry->type = btrfs_filetype_table[btrfs_dir_type(leaf, di)]; + put_unaligned(btrfs_filetype_table[btrfs_dir_type(leaf, di)], + &entry->type); btrfs_dir_item_key_to_cpu(leaf, di, &location); - entry->ino = location.objectid; - entry->offset = found_key.offset; + put_unaligned(location.objectid, &entry->ino); + 
put_unaligned(found_key.offset, &entry->offset); entries++; addr += sizeof(struct dir_entry) + name_len; total_len += sizeof(struct dir_entry) + name_len; -- 2.16.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] btrfs: update uuid_mutex and device_list_mutex comments
Make the uuid_mutex and device_list_mutex comments inline with the changes.

Signed-off-by: Anand Jain
---
Hi David,

Can you kindly add this patch to the set: 'Review uuid_mutex usage'

Thanks, Anand

 fs/btrfs/volumes.c | 21 +
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 6d7f10729713..12617a8a7083 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -155,29 +155,26 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
  *
  * uuid_mutex (global lock)
  *
- * protects the fs_uuids list that tracks all per-fs fs_devices, resulting from
+ * Protects the fs_uuids list that tracks all per-fs fs_devices, resulting from
  * the SCAN_DEV ioctl registration or from mount either implicitly (the first
- * device) or requested by the device= mount option
- *
- * the mutex can be very coarse and can cover long-running operations
- *
- * protects: updates to fs_devices counters like missing devices, rw devices,
- * seeding, structure cloning, opening/closing devices at mount/umount time
+ * device) or requested by the device= mount option.
  *
  * global::fs_devs - add, remove, updates to the global list
  *
- * does not protect: manipulation of the fs_devices::devices list!
+ * Does not protect: manipulation of the fs_devices::devices list!
  *
  * btrfs_device::name - renames (write side), read is RCU
  *
  * fs_devices::device_list_mutex (per-fs, with RCU)
  *
- * protects updates to fs_devices::devices, ie. adding and deleting
+ * Protects updates to fs_devices::devices, i.e. adding and deleting, and its
+ * counters like missing devices, rw devices, seeding, structure cloning,
+ * opening/closing devices at mount/umount time.
  *
- * simple list traversal with read-only actions can be done with RCU protection
+ * Simple list traversal with read-only actions can be done with RCU protection.
  *
- * may be used to exclude some operations from running concurrently without any
- * modifications to the list (see write_all_supers)
+ * May be used to exclude some operations from running concurrently without any
+ * modifications to the list (see write_all_supers).
  *
  * balance_mutex
  * -
--
2.15.0
Re: [PATCH] btrfs: add chattr support for send/receive
On Tue, Apr 17, 2018 at 04:39:35PM -0700, Howard McLauchlan wrote:
> Presently btrfs send/receive does not propagate inode attribute flags;
> all chattr operations are effectively discarded upon transmission.
>
> This patch adds kernel support for inode attribute flags. Userspace
> support can be found under the commit:
>
> btrfs-progs: add chattr support for send/receive
>
> An associated xfstest can be found at:
>
> btrfs: add verify chattr support for send/receive test
>
> A caveat is that a user with an updated kernel (aware of chattrs) and an
> older version of btrfs-progs (unaware of chattrs) will fail to receive
> if a chattr is included in the send stream.
>
> Signed-off-by: Howard McLauchlan

This is a send protocol change and must be properly versioned. There are
more known deficiencies from v1, see
https://btrfs.wiki.kernel.org/index.php/Design_notes_on_Send/Receive#Send_stream_v2_draft

> --- a/fs/btrfs/send.h
> +++ b/fs/btrfs/send.h
> @@ -77,6 +77,7 @@ enum btrfs_send_cmd {
>
>  	BTRFS_SEND_C_END,
>  	BTRFS_SEND_C_UPDATE_EXTENT,
> +	BTRFS_SEND_C_CHATTR,
>  	__BTRFS_SEND_C_MAX,

This change

> };
> #define BTRFS_SEND_C_MAX (__BTRFS_SEND_C_MAX - 1)
> @@ -114,6 +115,7 @@ enum {
>  	BTRFS_SEND_A_CLONE_PATH,
>  	BTRFS_SEND_A_CLONE_OFFSET,
>  	BTRFS_SEND_A_CLONE_LEN,
> +	BTRFS_SEND_A_CHATTR,
>
>  	__BTRFS_SEND_A_MAX,

and this will change the numbers of __BTRFS_SEND_*_MAX and the defines
derived from them, which must stay fixed for v1, because they're part of
the public API exported from progs as send.h.

Unfortunately for anybody who wants to implement new additions to the
send stream, the full versioning, backward compatibility, command-line
options and ioctl extensions need to happen first.

A concrete example of how this can be done wrong is the Synology version
of btrfs shipped with their NAS. There are some extensions to the
protocol that work on their kernel, but the send stream cannot be used
on upstream kernels.
[PATCH v3] btrfs-progs: dump-tree: add degraded option
From: Anand Jain

btrfs inspect dump-tree by default scans for the required devices and
picks the disk with the largest generation to read the root tree. But
with 2 or more disks in RAID1/5/6 you may need to know what's on each
disk individually, so the --noscan option tells it to use only the given
disks for the dump. For example:

  mkfs.btrfs -fq -draid1 -mraid1 /dev/sd[b-d]
  btrfs in dump-tree --noscan /dev/sdb /dev/sdc

Without the --noscan option it works as usual, scanning for the devices
on its own. If the minimum required number of devices is not provided
while the --noscan option is used, it errors out as shown below:

  btrfs in dump-tree --noscan /dev/sdb
  check arg type : /dev/sdb
  btrfs-progs v4.14.1
  warning, device 3 is missing
  warning, device 2 is missing
  bytenr mismatch, want=22036480, have=0
  ERROR: cannot read chunk root
  ERROR: unable to open /dev/sdb

Signed-off-by: Anand Jain
---
v2->v3: make it scalable for more than two disks in noscan mode
v1->v2: rename --degraded to --noscan

 cmds-inspect-dump-tree.c | 60 ++--
 1 file changed, 53 insertions(+), 7 deletions(-)

diff --git a/cmds-inspect-dump-tree.c b/cmds-inspect-dump-tree.c
index 0802b31e9596..8625cd608ad0 100644
--- a/cmds-inspect-dump-tree.c
+++ b/cmds-inspect-dump-tree.c
@@ -21,6 +21,7 @@
 #include
 #include
 #include
+#include

 #include "kerncompat.h"
 #include "radix-tree.h"
@@ -183,6 +184,46 @@ static u64 treeid_from_string(const char *str, const char **end)
 	return id;
 }

+static int scan_args(char **argv, int argc, int tmpind, int noscan)
+{
+	int fd;
+	int ret;
+	u64 num_devices;
+	struct btrfs_fs_devices *tmp_devices;
+
+	while (tmpind < argc) {
+		ret = check_arg_type(argv[tmpind]);
+		if (ret != BTRFS_ARG_BLKDEV && ret != BTRFS_ARG_REG) {
+			error("not a block device or regular file: %s",
+			      argv[tmpind]);
+			return -EINVAL;
+		}
+
+		if (!noscan) {
+			tmpind++;
+			continue;
+		}
+
+		fd = open(argv[tmpind], O_RDONLY);
+		if (fd < 0) {
+			error("cannot open %s: %m", argv[tmpind]);
+			return -EINVAL;
+		}
+		ret = btrfs_scan_one_device(fd, argv[tmpind],
+					    &tmp_devices, &num_devices,
+					    BTRFS_SUPER_INFO_OFFSET,
+					    SBREAD_DEFAULT);
+		close(fd);
+		if (ret) {
+			error("cannot scan %s: %s",
+			      argv[tmpind], strerror(-ret));
+			return ret;
+		}
+		tmpind++;
+	}
+	return 0;
+}
+
 const char * const cmd_inspect_dump_tree_usage[] = {
 	"btrfs inspect-internal dump-tree [options] device",
 	"Dump tree structures from a given device",
@@ -199,6 +240,7 @@ const char * const cmd_inspect_dump_tree_usage[] = {
 	"-b|--block print info from the specified block only",
 	"-t|--tree print only tree with the given id (string or number)",
 	"--follow use with -b, to show all children tree blocks of ",
+	"--noscan only uses the devices provided as the argument",
 	NULL
 };

@@ -213,7 +255,7 @@ int cmd_inspect_dump_tree(int argc, char **argv)
 	struct btrfs_disk_key disk_key;
 	struct btrfs_key found_key;
 	char uuidbuf[BTRFS_UUID_UNPARSED_SIZE];
-	int ret;
+	int ret = 0;
 	int slot;
 	int extent_only = 0;
 	int device_only = 0;
@@ -225,10 +267,12 @@ int cmd_inspect_dump_tree(int argc, char **argv)
 	struct btrfs_root *tree_root_scan;
 	u64 tree_id = 0;
 	bool follow = false;
+	bool noscan = false;

 	while (1) {
 		int c;
-		enum { GETOPT_VAL_FOLLOW = 256 };
+		enum { GETOPT_VAL_FOLLOW = 256,
+		       GETOPT_VAL_NOSCAN = 257 };
 		static const struct option long_options[] = {
 			{ "extents", no_argument, NULL, 'e'},
 			{ "device", no_argument, NULL, 'd'},
@@ -238,6 +282,7 @@ int cmd_inspect_dump_tree(int argc, char **argv)
 			{ "block", required_argument, NULL, 'b'},
 			{ "tree", required_argument, NULL, 't'},
 			{ "follow", no_argument, NULL, GETOPT_VAL_FOLLOW },
+			{ "noscan", no_argument, NULL, GETOPT_VAL_NOSCAN },
 			{ NULL, 0, NULL, 0 }
 		};

@@ -293,19 +338,20 @@ int cmd_inspect_dump_tree(int argc, char **argv)
 		case GETOPT_VAL_FOLLOW:
 			follow = true;
 			break;
+		case GETOP
Re: Hard link not persisted on fsync
On Wed, Apr 18, 2018 at 1:02 AM, Jayashree Mohan wrote:
> Hi,
>
> A gentle reminder on the crash consistency bug that we found on btrfs:

Why do you call it a consistency bug? The filesystem does not stay in an
inconsistent state. The link count stays 1 and the dentry used for fsync
(foo) is persisted.

An inconsistency would be if we ended up with a link count of 2 and only
one of the dentries was persisted, or if we ended up with a link count of
1 and both dentries were persisted. Those cases would be detected by an
fsck and could cause fs operations to fail unexpectedly (attempting to
remove a dentry, etc).

> Link count of a file is not persisted even after a fsync. We believe a
> filesystem that ensures strictly ordered metadata behaviour should be
> able to persist the hard link after a fsync on the original file.

The thing is there's no written specification about what's the expected
and correct behavior. Yes, I'm aware of Dave's explanation of strictly
ordered metadata on the other thread and transaction details, but things
on btrfs work very differently from xfs (I'm not saying he's wrong).

For me it makes more sense to persist the hard link, but again, there's
a lack of a specification that demands such behaviour, and I'm not aware
of applications needing that behaviour nor of user reports about it.

> Could you comment on why btrfs does not exhibit this behavior, and if
> it's something you'd want to fix?

I don't know the "why", as I am not the original author of the log tree
(what is used to implement fsync on btrfs), so either it's accidental
behavior (the most likely) or intentional.

Since it's not something that causes any corruption, fs inconsistency,
crash, etc, nor are there user reports complaining about this (afaik),
for me it would be far from a priority at the moment as I'm trying to
fix corruptions and similar issues (not necessarily caused by fsync).
Plus, adding such behaviour would have to be done carefully to not
impact performance, as checking whether the inode has multiple hard
links and which ones need to be persisted (created in the current,
uncommitted, transaction) may have a measurable impact. The current
behaviour is to only guarantee persisting the dentry used for the fsync
call.

>
> Thanks,
> Jayashree Mohan
>
> On Mon, Apr 16, 2018 at 9:35 AM, Jayashree Mohan wrote:
>> Hi,
>>
>> The following seems to be a crash consistency bug on btrfs, wherein
>> the link count is not persisted even after an fsync on the original
>> file.
>>
>> Consider the following workload:
>>   creat foo
>>   link (foo, A/bar)
>>   fsync(foo)
>>   ---Crash---
>>
>> Now, on recovery we expect the metadata of foo to be persisted, i.e.
>> have a link count of 2. However in btrfs, the link count is 1 and
>> file A/bar is not persisted. The expected behaviour would be to
>> persist the dependencies of inode foo. That is to say, shouldn't
>> fsync of foo persist A/bar and correctly update the link count?
>> Note that ext4, xfs and f2fs recover to the correct link count of 2
>> for the above workload.
>>
>> Let us know what you think about this behavior.
>>
>> Thanks,
>> Jayashree Mohan
Re: [PATCH] btrfs: add chattr support for send/receive
On Wed, Apr 18, 2018 at 12:39 AM, Howard McLauchlan wrote:
> Presently btrfs send/receive does not propagate inode attribute flags;
> all chattr operations are effectively discarded upon transmission.
>
> This patch adds kernel support for inode attribute flags. Userspace
> support can be found under the commit:
>
> btrfs-progs: add chattr support for send/receive
>
> An associated xfstest can be found at:
>
> btrfs: add verify chattr support for send/receive test
>
> A caveat is that a user with an updated kernel (aware of chattrs) and an
> older version of btrfs-progs (unaware of chattrs) will fail to receive
> if a chattr is included in the send stream.

So we do have several things missing in send besides attribute flags,
like hole punching for example (there's a list on the wiki). We can't
just add a new command and introduce such a caveat every time we
implement one of the missing and desired features.

In 2014, while wanting to implement some of those features, I introduced
a way to bump the send stream version with room (commands) for all those
missing features, so that all could be implemented later without adding
further backward incompatibility between kernel and btrfs-progs
versions. Some of the threads for reference:

https://patchwork.kernel.org/patch/4021491/
https://www.spinics.net/lists/linux-btrfs/msg35169.html

It never took off, and honestly I don't remember why, as no one added
more comments on the latest versions of the kernel and btrfs-progs
patchsets.
>
> Signed-off-by: Howard McLauchlan
> ---
> Based on 4.17-rc1
>
>  fs/btrfs/ctree.h |   2 +
>  fs/btrfs/ioctl.c |   2 +-
>  fs/btrfs/send.c  | 176 +++
>  fs/btrfs/send.h  |   2 +
>  4 files changed, 154 insertions(+), 28 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 5474ef14d6e6..a0dc6a8a37eb 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1436,6 +1436,8 @@ struct btrfs_map_token {
>  	unsigned long offset;
>  };
>
> +unsigned int btrfs_flags_to_ioctl(unsigned int flags);
> +
>  #define BTRFS_BYTES_TO_BLKS(fs_info, bytes) \
>  	((bytes) >> (fs_info)->sb->s_blocksize_bits)
>
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 632e26d6f7ce..36ce1e589f9e 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -106,7 +106,7 @@ static unsigned int btrfs_mask_flags(umode_t mode, unsigned int flags)
>  /*
>   * Export inode flags to the format expected by the FS_IOC_GETFLAGS ioctl.
>   */
> -static unsigned int btrfs_flags_to_ioctl(unsigned int flags)
> +unsigned int btrfs_flags_to_ioctl(unsigned int flags)
>  {
>  	unsigned int iflags = 0;
>
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> index 221e5cdb060b..da521a5a1843 100644
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -101,6 +101,13 @@ struct send_ctx {
>  	u64 cur_inode_last_extent;
>  	u64 cur_inode_next_write_offset;
>
> +	/*
> +	 * state for chattr purposes
> +	 */
> +	u64 cur_inode_flip_flags;
> +	u64 cur_inode_receive_flags;
> +	int receive_flags_valid;
> +
>  	u64 send_progress;
>
>  	struct list_head new_refs;
> @@ -798,7 +805,7 @@ static int send_rmdir(struct send_ctx *sctx, struct fs_path *path)
>   */
>  static int __get_inode_info(struct btrfs_root *root, struct btrfs_path *path,
>  			    u64 ino, u64 *size, u64 *gen, u64 *mode, u64 *uid,
> -			    u64 *gid, u64 *rdev)
> +			    u64 *gid, u64 *rdev, u64 *flags)
>  {
>  	int ret;
>  	struct btrfs_inode_item *ii;
> @@ -828,6 +835,8 @@ static int __get_inode_info(struct btrfs_root *root, struct btrfs_path *path,
>  		*gid = btrfs_inode_gid(path->nodes[0], ii);
>  	if (rdev)
>  		*rdev = btrfs_inode_rdev(path->nodes[0], ii);
> +	if (flags)
> +		*flags = btrfs_inode_flags(path->nodes[0], ii);
>
>  	return ret;
>  }
> @@ -835,7 +844,7 @@ static int __get_inode_info(struct btrfs_root *root, struct btrfs_path *path,
>  static int get_inode_info(struct btrfs_root *root,
>  			  u64 ino, u64 *size, u64 *gen,
>  			  u64 *mode, u64 *uid, u64 *gid,
> -			  u64 *rdev)
> +			  u64 *rdev, u64 *flags)
>  {
>  	struct btrfs_path *path;
>  	int ret;
> @@ -844,7 +853,7 @@ static int get_inode_info(struct btrfs_root *root,
>  	if (!path)
>  		return -ENOMEM;
>  	ret = __get_inode_info(root, path, ino, size, gen, mode, uid, gid,
> -			       rdev);
> +			       rdev, flags);
>  	btrfs_free_path(path);
>  	return ret;
>  }
> @@ -1233,7 +1242,7 @@ static int __iterate_backrefs(u64 ino, u64 offset, u64 root, void *ctx_)
>  	 * accept clones from these extents.
>  	 */
>  	ret = __get_inode_info(found->root,
Re: [PATCH v2] btrfs: Rewrite retry logic in do_chunk_alloc
On 18.04.2018 10:27, Nikolay Borisov wrote:
> do_chunk_alloc implements logic to detect whether there is currently a
> pending chunk allocation (by means of space_info->chunk_alloc being
> set) and if so it loops around to the 'again' label. Additionally,
> based on the state of the space_info (e.g. whether it's full or not)
> and the return value of should_alloc_chunk() it decides whether this
> is a "hard" error (ENOSPC) or we can just return 0.
>
> This patch refactors all of this:
>
> 1. Put order to the scattered ifs handling the various cases in
> easy-to-read if {} else if {} branches. This makes clear the various
> cases we are interested in handling.
>
> 2. Call should_alloc_chunk only once and use the result in the
> if/else if constructs. All of this is done under space_info->lock, so
> even before, multiple calls of should_alloc_chunk were unnecessary.
>
> 3. Rewrite the "do {} while()" loop currently implemented via label
> into an explicit loop construct.
>
> 4. Move the mutex locking for the case where the caller is the one
> doing the allocation. For the case where the caller needs to wait for
> a concurrent allocation, introduce a pair of mutex_lock/mutex_unlock
> to act as a barrier and reword the comment.
>
> 5. Switch local vars to bool type where pertinent.
>
> All in all this shouldn't introduce any functional changes.
>
> Signed-off-by: Nikolay Borisov
> ---
>
> v2:
> * Introduce missing logic in the case where we have to loop. The
> correct behavior when a concurrent allocation is executing is to
> acquire/release the mutex and loop to check if it makes sense to
> continue with our allocation try.
>
>  fs/btrfs/extent-tree.c | 73 +-
>  1 file changed, 36 insertions(+), 37 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 7bfc03494c39..bfb19bcdeee3 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4601,7 +4601,8 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans,
>  			  struct btrfs_fs_info *fs_info, u64 flags, int force)
>  {
>  	struct btrfs_space_info *space_info;
> -	int wait_for_alloc = 0;
> +	bool wait_for_alloc = false;
> +	bool should_alloc = false;
>  	int ret = 0;
>
>  	/* Don't re-enter if we're already allocating a chunk */
> @@ -4611,45 +4612,43 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans,
>  	space_info = __find_space_info(fs_info, flags);
>  	ASSERT(space_info);
>
> -again:
> -	spin_lock(&space_info->lock);
> -	if (force < space_info->force_alloc)
> -		force = space_info->force_alloc;
> -	if (space_info->full) {
> -		if (should_alloc_chunk(fs_info, space_info, force))
> -			ret = -ENOSPC;
> -		else
> -			ret = 0;
> -		spin_unlock(&space_info->lock);
> -		return ret;
> -	}
> -
> -	if (!should_alloc_chunk(fs_info, space_info, force)) {
> -		spin_unlock(&space_info->lock);
> -		return 0;
> -	} else if (space_info->chunk_alloc) {
> -		wait_for_alloc = 1;
> -	} else {
> -		space_info->chunk_alloc = 1;
> -	}
> -
> -	spin_unlock(&space_info->lock);
> -
> -	mutex_lock(&fs_info->chunk_mutex);
> +	do {
> +		spin_lock(&space_info->lock);
> +		if (force < space_info->force_alloc)
> +			force = space_info->force_alloc;
> +		should_alloc = should_alloc_chunk(fs_info, space_info, force);
> +		if (space_info->full) {
> +			/* No more free physical space */
> +			if (should_alloc)
> +				ret = -ENOSPC;
> +			else
> +				ret = 0;
> +			spin_unlock(&space_info->lock);
> +			return ret;
> +		} else if (!should_alloc) {
> +			spin_unlock(&space_info->lock);
> +			return 0;
> +		} else if (space_info->chunk_alloc) {
> +			/* Someone is already allocating, so we need to block
> +			 * while this someone is finished and then loop, to
> +			 * recheck if we should continue with our allocation
> +			 * try
> +			 */
> +			wait_for_alloc = true;
> +			spin_unlock(&space_info->lock);
> +			mutex_lock(&fs_info->chunk_mutex);
> +			mutex_unlock(&fs_info->chunk_mutex);
> +		} else {
> +			/* Proceed with allocation */
> +			space_info->chunk_alloc = 1;
> +			wait_for_alloc = false;
> +			spin_unlock(&space_info->lock);
> +		}
>
> -	/*
> -	 * The chunk_mutex is held throughout the en
[PATCH v2] btrfs: Rewrite retry logic in do_chunk_alloc
do_chunk_alloc implements logic to detect whether there is currently a
pending chunk allocation (by means of space_info->chunk_alloc being set)
and if so it loops around to the 'again' label. Additionally, based on
the state of the space_info (e.g. whether it's full or not) and the
return value of should_alloc_chunk() it decides whether this is a "hard"
error (ENOSPC) or we can just return 0.

This patch refactors all of this:

1. Put order to the scattered ifs handling the various cases in
easy-to-read if {} else if {} branches. This makes clear the various
cases we are interested in handling.

2. Call should_alloc_chunk only once and use the result in the if/else
if constructs. All of this is done under space_info->lock, so even
before, multiple calls of should_alloc_chunk were unnecessary.

3. Rewrite the "do {} while()" loop currently implemented via label into
an explicit loop construct.

4. Move the mutex locking for the case where the caller is the one doing
the allocation. For the case where the caller needs to wait for a
concurrent allocation, introduce a pair of mutex_lock/mutex_unlock to
act as a barrier and reword the comment.

5. Switch local vars to bool type where pertinent.

All in all this shouldn't introduce any functional changes.

Signed-off-by: Nikolay Borisov
---

v2:
* Introduce missing logic in the case where we have to loop. The correct
behavior when a concurrent allocation is executing is to acquire/release
the mutex and loop to check if it makes sense to continue with our
allocation try.
 fs/btrfs/extent-tree.c | 73 +-
 1 file changed, 36 insertions(+), 37 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 7bfc03494c39..bfb19bcdeee3 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4601,7 +4601,8 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans,
 			  struct btrfs_fs_info *fs_info, u64 flags, int force)
 {
 	struct btrfs_space_info *space_info;
-	int wait_for_alloc = 0;
+	bool wait_for_alloc = false;
+	bool should_alloc = false;
 	int ret = 0;

 	/* Don't re-enter if we're already allocating a chunk */
@@ -4611,45 +4612,43 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans,
 	space_info = __find_space_info(fs_info, flags);
 	ASSERT(space_info);

-again:
-	spin_lock(&space_info->lock);
-	if (force < space_info->force_alloc)
-		force = space_info->force_alloc;
-	if (space_info->full) {
-		if (should_alloc_chunk(fs_info, space_info, force))
-			ret = -ENOSPC;
-		else
-			ret = 0;
-		spin_unlock(&space_info->lock);
-		return ret;
-	}
-
-	if (!should_alloc_chunk(fs_info, space_info, force)) {
-		spin_unlock(&space_info->lock);
-		return 0;
-	} else if (space_info->chunk_alloc) {
-		wait_for_alloc = 1;
-	} else {
-		space_info->chunk_alloc = 1;
-	}
-
-	spin_unlock(&space_info->lock);
-
-	mutex_lock(&fs_info->chunk_mutex);
+	do {
+		spin_lock(&space_info->lock);
+		if (force < space_info->force_alloc)
+			force = space_info->force_alloc;
+		should_alloc = should_alloc_chunk(fs_info, space_info, force);
+		if (space_info->full) {
+			/* No more free physical space */
+			if (should_alloc)
+				ret = -ENOSPC;
+			else
+				ret = 0;
+			spin_unlock(&space_info->lock);
+			return ret;
+		} else if (!should_alloc) {
+			spin_unlock(&space_info->lock);
+			return 0;
+		} else if (space_info->chunk_alloc) {
+			/* Someone is already allocating, so we need to block
+			 * while this someone is finished and then loop, to
+			 * recheck if we should continue with our allocation
+			 * try
+			 */
+			wait_for_alloc = true;
+			spin_unlock(&space_info->lock);
+			mutex_lock(&fs_info->chunk_mutex);
+			mutex_unlock(&fs_info->chunk_mutex);
+		} else {
+			/* Proceed with allocation */
+			space_info->chunk_alloc = 1;
+			wait_for_alloc = false;
+			spin_unlock(&space_info->lock);
+		}

-	/*
-	 * The chunk_mutex is held throughout the entirety of a chunk
-	 * allocation, so once we've acquired the chunk_mutex we know that the
-	 * other guy is done and we need to recheck and see if we should
[PATCH] btrfs-progs: restore: fix mistake on overwrite_ok() if relative path is given
From: Ting-Chang Hou

fstatat() returns -1 with errno EBADF if path_name is a relative path
and the dirfd is not a valid descriptor. This caused overwrite_ok() to
return the wrong value: when restoring a subvolume to a destination
given as a relative path, it would overwrite the existing file rather
than skip it.

Signed-off-by: tchou
---
 cmds-restore.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cmds-restore.c b/cmds-restore.c
index ade35f0..dc042e2 100644
--- a/cmds-restore.c
+++ b/cmds-restore.c
@@ -825,7 +825,7 @@ static int overwrite_ok(const char * path)
 	int ret;

 	/* don't be fooled by symlinks */
-	ret = fstatat(-1, path_name, &st, AT_SYMLINK_NOFOLLOW);
+	ret = fstatat(AT_FDCWD, path_name, &st, AT_SYMLINK_NOFOLLOW);

 	if (!ret) {
 		if (overwrite
--
2.7.4