Re: [PATCH v3] btrfs: Fix race condition between delayed refs and blockgroup removal

2018-04-18 Thread Nikolay Borisov


On 18.04.2018 23:47, David Sterba wrote:
> On Wed, Apr 18, 2018 at 09:41:54AM +0300, Nikolay Borisov wrote:
>> When the delayed refs for a head are all run, eventually
>> cleanup_ref_head is called which (in case of deletion) obtains a
>> reference for the relevant btrfs_space_info struct by querying the bg
>> for the range. This is problematic because when the last extent of a
>> bg is deleted a race window emerges between removal of that bg and the
>> subsequent invocation of cleanup_ref_head. This can result in cache being null
>> and either a null pointer dereference or assertion failure.
>>
>>  task: 8d04d31ed080 task.stack: 9e5dc10cc000
>>  RIP: 0010:assfail.constprop.78+0x18/0x1a [btrfs]
>>  RSP: 0018:9e5dc10cfbe8 EFLAGS: 00010292
>>  RAX: 0044 RBX:  RCX: 
>>  RDX: 8d04ffc1f868 RSI: 8d04ffc178c8 RDI: 8d04ffc178c8
>>  RBP: 8d04d29e5ea0 R08: 01f0 R09: 0001
>>  R10: 9e5dc0507d58 R11: 0001 R12: 8d04d29e5ea0
>>  R13: 8d04d29e5f08 R14: 8d04efe29b40 R15: 8d04efe203e0
>>  FS:  7fbf58ead500() GS:8d04ffc0() knlGS:
>>  CS:  0010 DS:  ES:  CR0: 80050033
>>  CR2: 7fe6c6975648 CR3: 13b2a000 CR4: 06f0
>>  DR0:  DR1:  DR2: 
>>  DR3:  DR6: fffe0ff0 DR7: 0400
>>  Call Trace:
>>   __btrfs_run_delayed_refs+0x10e7/0x12c0 [btrfs]
>>   btrfs_run_delayed_refs+0x68/0x250 [btrfs]
>>   btrfs_should_end_transaction+0x42/0x60 [btrfs]
>>   btrfs_truncate_inode_items+0xaac/0xfc0 [btrfs]
>>   btrfs_evict_inode+0x4c6/0x5c0 [btrfs]
>>   evict+0xc6/0x190
>>   do_unlinkat+0x19c/0x300
>>   do_syscall_64+0x74/0x140
>>   entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>>  RIP: 0033:0x7fbf589c57a7
>>
>> To fix this, introduce a new flag "is_system" to head_ref structs,
>> which is populated at insertion time. This allows us to decouple the
>> space_info lookup from the lookup of the possibly deleted bg.
>>
>> Fixes: d7eae3403f46 ("Btrfs: rework delayed ref total_bytes_pinned accounting")
>> Suggested-by: Omar Sandoval 
>> Signed-off-by: Nikolay Borisov 
> 
> Added to next, thanks.
> 
>> @@ -550,8 +550,10 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
>>   struct btrfs_delayed_ref_head *head_ref,
>>   struct btrfs_qgroup_extent_record *qrecord,
>>   u64 bytenr, u64 num_bytes, u64 ref_root, u64 reserved,
>> - int action, int is_data, int *qrecord_inserted_ret,
>> + int action, int is_data, int is_system,
>> + int *qrecord_inserted_ret,
>>   int *old_ref_mod, int *new_ref_mod)
> 
> At some point we might want to peel off a few arguments of that function.
> It now has 14.  The fs_info/trans change should work, and the
> action/is_data/is_system could be merged into a bitmask.

Yes, I sent a series which splits the delayed ref initialisation and
addition. It will conflict with this patch but I intend to rebase. I'm
talking about [PATCH 0/8] Delayed refs cleanups/streamlining

> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 3/3] btrfs: remove le_test_bit()

2018-04-18 Thread Howard McLauchlan
With commit b18253ec57c0 ("btrfs: optimize free space tree bitmap
conversion"), there are no more callers to le_test_bit(). This patch
removes le_test_bit().

Signed-off-by: Howard McLauchlan 
---
 fs/btrfs/extent_io.h | 5 -
 1 file changed, 5 deletions(-)

diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index d34416c831bf..c5e80d60d71b 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -79,11 +79,6 @@
 #define BITMAP_LAST_BYTE_MASK(nbits) \
(BYTE_MASK >> (-(nbits) & (BITS_PER_BYTE - 1)))
 
-static inline int le_test_bit(int nr, const u8 *addr)
-{
-   return 1U & (addr[BIT_BYTE(nr)] >> (nr & (BITS_PER_BYTE-1)));
-}
-
 struct extent_state;
 struct btrfs_root;
 struct btrfs_inode;
-- 
2.17.0



[PATCH v2 1/3] btrfs: clean up le_bitmap_{set, clear}()

2018-04-18 Thread Howard McLauchlan
le_bitmap_set() is only used by free-space-tree, so move it there and
make it static. le_bitmap_clear() is not used, so remove it.

Signed-off-by: Howard McLauchlan 
---
V1->V2: Also move le_bitmap_set()
based on 4.17-rc1
 fs/btrfs/extent_io.c   | 40 --
 fs/btrfs/extent_io.h   |  3 ---
 fs/btrfs/free-space-tree.c | 20 +++
 3 files changed, 20 insertions(+), 43 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index e99b329002cf..9a521e5e297d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5620,46 +5620,6 @@ void copy_extent_buffer(struct extent_buffer *dst, struct extent_buffer *src,
}
 }
 
-void le_bitmap_set(u8 *map, unsigned int start, int len)
-{
-   u8 *p = map + BIT_BYTE(start);
-   const unsigned int size = start + len;
-   int bits_to_set = BITS_PER_BYTE - (start % BITS_PER_BYTE);
-   u8 mask_to_set = BITMAP_FIRST_BYTE_MASK(start);
-
-   while (len - bits_to_set >= 0) {
-   *p |= mask_to_set;
-   len -= bits_to_set;
-   bits_to_set = BITS_PER_BYTE;
-   mask_to_set = ~0;
-   p++;
-   }
-   if (len) {
-   mask_to_set &= BITMAP_LAST_BYTE_MASK(size);
-   *p |= mask_to_set;
-   }
-}
-
-void le_bitmap_clear(u8 *map, unsigned int start, int len)
-{
-   u8 *p = map + BIT_BYTE(start);
-   const unsigned int size = start + len;
-   int bits_to_clear = BITS_PER_BYTE - (start % BITS_PER_BYTE);
-   u8 mask_to_clear = BITMAP_FIRST_BYTE_MASK(start);
-
-   while (len - bits_to_clear >= 0) {
-   *p &= ~mask_to_clear;
-   len -= bits_to_clear;
-   bits_to_clear = BITS_PER_BYTE;
-   mask_to_clear = ~0;
-   p++;
-   }
-   if (len) {
-   mask_to_clear &= BITMAP_LAST_BYTE_MASK(size);
-   *p &= ~mask_to_clear;
-   }
-}
-
 /*
 * eb_bitmap_offset() - calculate the page and offset of the byte containing the
 * given bit number
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index a53009694b16..d34416c831bf 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -84,9 +84,6 @@ static inline int le_test_bit(int nr, const u8 *addr)
return 1U & (addr[BIT_BYTE(nr)] >> (nr & (BITS_PER_BYTE-1)));
 }
 
-void le_bitmap_set(u8 *map, unsigned int start, int len);
-void le_bitmap_clear(u8 *map, unsigned int start, int len);
-
 struct extent_state;
 struct btrfs_root;
 struct btrfs_inode;
diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c
index 32a0f6cb5594..e03830d83311 100644
--- a/fs/btrfs/free-space-tree.c
+++ b/fs/btrfs/free-space-tree.c
@@ -157,6 +157,26 @@ static u8 *alloc_bitmap(u32 bitmap_size)
return ret;
 }
 
+static void le_bitmap_set(u8 *map, unsigned int start, int len)
+{
+   u8 *p = map + BIT_BYTE(start);
+   const unsigned int size = start + len;
+   int bits_to_set = BITS_PER_BYTE - (start % BITS_PER_BYTE);
+   u8 mask_to_set = BITMAP_FIRST_BYTE_MASK(start);
+
+   while (len - bits_to_set >= 0) {
+   *p |= mask_to_set;
+   len -= bits_to_set;
+   bits_to_set = BITS_PER_BYTE;
+   mask_to_set = ~0;
+   p++;
+   }
+   if (len) {
+   mask_to_set &= BITMAP_LAST_BYTE_MASK(size);
+   *p |= mask_to_set;
+   }
+}
+
 int convert_free_space_to_bitmaps(struct btrfs_trans_handle *trans,
  struct btrfs_fs_info *fs_info,
  struct btrfs_block_group_cache *block_group,
-- 
2.17.0



[PATCH v2 2/3] btrfs: optimize free space tree bitmap conversion

2018-04-18 Thread Howard McLauchlan
Presently, convert_free_space_to_extents() does a linear scan of the
bitmap. We can speed this up with find_next_{bit,zero_bit}_le().

This patch replaces the linear scan with find_next_{bit,zero_bit}_le().
Testing shows a 20-33% decrease in execution time for
convert_free_space_to_extents().

Suggested-by: Omar Sandoval 
Signed-off-by: Howard McLauchlan 
---
V1->V2: Round the bitmap size accordingly when allocating bitmap.

 fs/btrfs/free-space-tree.c | 61 ++
 1 file changed, 23 insertions(+), 38 deletions(-)

diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c
index e03830d83311..7019afe6e727 100644
--- a/fs/btrfs/free-space-tree.c
+++ b/fs/btrfs/free-space-tree.c
@@ -138,10 +138,11 @@ static inline u32 free_space_bitmap_size(u64 size, u32 sectorsize)
return DIV_ROUND_UP((u32)div_u64(size, sectorsize), BITS_PER_BYTE);
 }
 
-static u8 *alloc_bitmap(u32 bitmap_size)
+static unsigned long *alloc_bitmap(u32 bitmap_size)
 {
-   u8 *ret;
+   unsigned long *ret;
unsigned int nofs_flag;
+   u32 bitmap_rounded_size = round_up(bitmap_size, sizeof(unsigned long));
 
/*
 * GFP_NOFS doesn't work with kvmalloc(), but we really can't recurse
@@ -152,14 +153,14 @@ static u8 *alloc_bitmap(u32 bitmap_size)
 * know that recursion is unsafe.
 */
nofs_flag = memalloc_nofs_save();
-   ret = kvzalloc(bitmap_size, GFP_KERNEL);
+   ret = kvzalloc(bitmap_rounded_size, GFP_KERNEL);
memalloc_nofs_restore(nofs_flag);
return ret;
 }
 
-static void le_bitmap_set(u8 *map, unsigned int start, int len)
+static void le_bitmap_set(unsigned long *map, unsigned int start, int len)
 {
-   u8 *p = map + BIT_BYTE(start);
+   u8 *p = ((u8 *)map) + BIT_BYTE(start);
const unsigned int size = start + len;
int bits_to_set = BITS_PER_BYTE - (start % BITS_PER_BYTE);
u8 mask_to_set = BITMAP_FIRST_BYTE_MASK(start);
@@ -186,7 +187,8 @@ int convert_free_space_to_bitmaps(struct btrfs_trans_handle *trans,
struct btrfs_free_space_info *info;
struct btrfs_key key, found_key;
struct extent_buffer *leaf;
-   u8 *bitmap, *bitmap_cursor;
+   unsigned long *bitmap;
+   char *bitmap_cursor;
u64 start, end;
u64 bitmap_range, i;
u32 bitmap_size, flags, expected_extent_count;
@@ -275,7 +277,7 @@ int convert_free_space_to_bitmaps(struct btrfs_trans_handle *trans,
goto out;
}
 
-   bitmap_cursor = bitmap;
+   bitmap_cursor = (char *)bitmap;
bitmap_range = fs_info->sectorsize * BTRFS_FREE_SPACE_BITMAP_BITS;
i = start;
while (i < end) {
@@ -324,13 +326,10 @@ int convert_free_space_to_extents(struct btrfs_trans_handle *trans,
struct btrfs_free_space_info *info;
struct btrfs_key key, found_key;
struct extent_buffer *leaf;
-   u8 *bitmap;
+   unsigned long *bitmap;
u64 start, end;
-   /* Initialize to silence GCC. */
-   u64 extent_start = 0;
-   u64 offset;
u32 bitmap_size, flags, expected_extent_count;
-   int prev_bit = 0, bit, bitnr;
+   unsigned long nrbits, start_bit, end_bit;
u32 extent_count = 0;
int done = 0, nr;
int ret;
@@ -368,7 +367,7 @@ int convert_free_space_to_extents(struct btrfs_trans_handle *trans,
break;
		} else if (found_key.type == BTRFS_FREE_SPACE_BITMAP_KEY) {
unsigned long ptr;
-   u8 *bitmap_cursor;
+   char *bitmap_cursor;
u32 bitmap_pos, data_size;
 
ASSERT(found_key.objectid >= start);
@@ -378,7 +377,7 @@ int convert_free_space_to_extents(struct btrfs_trans_handle *trans,
bitmap_pos = div_u64(found_key.objectid - start,
 fs_info->sectorsize *
 BITS_PER_BYTE);
-   bitmap_cursor = bitmap + bitmap_pos;
+   bitmap_cursor = ((char *)bitmap) + bitmap_pos;
			data_size = free_space_bitmap_size(found_key.offset,
							   fs_info->sectorsize);
 
@@ -412,32 +411,16 @@ int convert_free_space_to_extents(struct btrfs_trans_handle *trans,
btrfs_mark_buffer_dirty(leaf);
btrfs_release_path(path);
 
-   offset = start;
-   bitnr = 0;
-   while (offset < end) {
-   bit = !!le_test_bit(bitnr, bitmap);
-   if (prev_bit == 0 && bit == 1) {
-   extent_start = offset;
-   } else if (prev_bit == 1 && bit == 0) {
-   key.objectid = extent_start;
-   key.type = BTRFS_FREE_SPACE_EXTENT_K

Re: d_instantiate() and unlock_new_inode() order in btrfs_mkdir()

2018-04-18 Thread Eric Biggers
On Thu, Apr 19, 2018 at 01:15:59AM +0100, Al Viro wrote:
> On Thu, Apr 19, 2018 at 01:06:13AM +0100, Al Viro wrote:
> > On Wed, Apr 18, 2018 at 05:00:29PM -0700, Eric Biggers wrote:
> > > Hi Chris and other btrfs folks,
> > > 
> > > btrfs_mkdir() calls d_instantiate() before unlock_new_inode(), which is 
> > > wrong
> > > because it exposes the inode to lookups before it's been fully 
> > > initialized.
> > 
> > Huh?  It *is* fully initialized by that point; what else is left to do?
> 
>   ISTR something about false positives from lockdep (with
> lockdep_annotate_inode_mutex_key() called too late, perhaps?); said that, it
> was a long time ago and I don't remember details at the moment...  Are you
> actually seeing a deadlock there or is that just lockdep complaining?

It's an actual deadlock.  unlock_new_inode() calls
lockdep_annotate_inode_mutex_key() which calls init_rwsem(), which resets
i_rwsem->count while it's read-locked by lookup_slow().  Then the unlock in
lookup_slow() makes i_rwsem->count negative, which makes it appear to be
write-locked.

So no, the inode isn't fully initialized until unlock_new_inode() has run.

Eric


Re: d_instantiate() and unlock_new_inode() order in btrfs_mkdir()

2018-04-18 Thread Al Viro
On Thu, Apr 19, 2018 at 01:06:13AM +0100, Al Viro wrote:
> On Wed, Apr 18, 2018 at 05:00:29PM -0700, Eric Biggers wrote:
> > Hi Chris and other btrfs folks,
> > 
> > btrfs_mkdir() calls d_instantiate() before unlock_new_inode(), which is 
> > wrong
> > because it exposes the inode to lookups before it's been fully initialized.
> 
> Huh?  It *is* fully initialized by that point; what else is left to do?

ISTR something about false positives from lockdep (with
lockdep_annotate_inode_mutex_key() called too late, perhaps?); said that, it
was a long time ago and I don't remember details at the moment...  Are you
actually seeing a deadlock there or is that just lockdep complaining?


Re: d_instantiate() and unlock_new_inode() order in btrfs_mkdir()

2018-04-18 Thread Al Viro
On Wed, Apr 18, 2018 at 05:00:29PM -0700, Eric Biggers wrote:
> Hi Chris and other btrfs folks,
> 
> btrfs_mkdir() calls d_instantiate() before unlock_new_inode(), which is wrong
> because it exposes the inode to lookups before it's been fully initialized.

Huh?  It *is* fully initialized by that point; what else is left to do?


d_instantiate() and unlock_new_inode() order in btrfs_mkdir()

2018-04-18 Thread Eric Biggers
Hi Chris and other btrfs folks,

btrfs_mkdir() calls d_instantiate() before unlock_new_inode(), which is wrong
because it exposes the inode to lookups before it's been fully initialized.
Most filesystems get it right, but f2fs and btrfs don't.  I sent a f2fs patch
(https://marc.info/?l=linux-fsdevel&m=152409178431350) and was going to send a
btrfs patch too, but in btrfs_mkdir() there is actually a comment claiming that
the existing order is intentional:

d_instantiate(dentry, inode);
/*
 * mkdir is special.  We're unlocking after we call d_instantiate
 * to avoid a race with nfsd calling d_instantiate.
 */
unlock_new_inode(inode);

Unfortunately, I cannot find what it is referring to.  The comment was added by
commit b0d5d10f41a0 ("Btrfs: use insert_inode_locked4 for inode creation").
Chris, do you remember exactly what you had in mind when you wrote this?

And in case anyone wants it, here's a reproducer for the deadlock caused by the
current code that calls d_instantiate() before unlock_new_inode().  Note: it
needs CONFIG_DEBUG_LOCK_ALLOC=y.

#include <sys/stat.h>
#include <unistd.h>

int main()
{
struct stat stbuf;

if (fork() == 0) {
for (;;)
stat("dir/file", &stbuf);
} else {
for (;;) {
mkdir("dir", 0777);
stat("dir/file", &stbuf);
rmdir("dir");
}
}
}


Re: [PATCH v3] btrfs: Do super block verification before writing it to disk

2018-04-18 Thread Qu Wenruo


On 2018-04-19 06:04, David Sterba wrote:
> On Tue, Apr 17, 2018 at 09:47:19AM +0800, Qu Wenruo wrote:
>> @@ -2680,7 +2681,7 @@ int open_ctree(struct super_block *sb,
>>  
>>  memcpy(fs_info->fsid, fs_info->super_copy->fsid, BTRFS_FSID_SIZE);
>>  
>> -ret = btrfs_check_super_valid(fs_info);
>> +ret = btrfs_check_super_valid(fs_info, fs_info->super_copy);
>>  if (ret) {
>>  btrfs_err(fs_info, "superblock contains fatal errors");
>>  err = -EINVAL;
>> @@ -3310,6 +3311,27 @@ static int write_dev_supers(struct btrfs_device *device,
> 
> This is in write_dev_supers, so the superblock is checked
> number-of-devices times. The caller write_all_supers rewrites the device
> item so it matches the device it's going to write to. But,
> btrfs_check_super_valid does not validate the dev_item so all the
> validation does not bring much benefit, as it repeatedly checks the same
> data.
> 
> So, what if the validation is done only once in write_all_supers? Lock
> the devices, validate, if it fails, report that and unlock devices and
> go readonly.

Makes sense.

I'll update btrfs_check_super_valid() to cooperate with that in next update.

Thanks,
Qu

> 
> There's a difference from what you implemented: if the in-memory superblock
> corruption happens between writes to the devices, some devices are left
> with the new superblock and some with the old.
> 
> Although this sounds quite improbable, I think that doing the check in
> advance would save some trouble if that happens. The superblocks on all
> devices will match.
> 





Re: [PATCH 2/2] btrfs: optimize free space tree bitmap conversion

2018-04-18 Thread Howard McLauchlan
On 04/18/2018 03:26 PM, David Sterba wrote:
> On Wed, Apr 18, 2018 at 02:30:59PM -0700, Howard McLauchlan wrote:
>> Presently, convert_free_space_to_extents() does a linear scan of the
>> bitmap. We can speed this up with find_next_{bit,zero_bit}_le().
>>
>> This patch replaces the linear scan with find_next_{bit,zero_bit}_le().
>> Testing shows a 20-33% decrease in execution time for
>> convert_free_space_to_extents().
>>
>> Suggested-by: Omar Sandoval 
>> Signed-off-by: Howard McLauchlan 
>> ---
>>
>> Since we change bitmap to be unsigned long, we have to do some casting for 
>> the
>> bitmap cursor. In le_bitmap_set() it makes sense to use u8, as we are doing
>> bit operations. Everywhere else, we're just using it for pointer arithmetic 
>> and
>> not directly accessing it, so char seems more appropriate.
> Ok, makes sense for just passing the pointers around. I'll add the text
> to changelog and apply the patch to next.
>
>> -bit = !!le_test_bit(bitnr, bitmap);
> This is the last use of le_test_bit, so it can be removed (in another
> patch).
I just found an issue in this patch that should be fixed. I'll address
moving le_bitmap_set() and le_test_bit() and send a V2 for both patches.

Howard


Re: [PATCH 2/2] btrfs: optimize free space tree bitmap conversion

2018-04-18 Thread David Sterba
On Wed, Apr 18, 2018 at 02:30:59PM -0700, Howard McLauchlan wrote:
> Presently, convert_free_space_to_extents() does a linear scan of the
> bitmap. We can speed this up with find_next_{bit,zero_bit}_le().
> 
> This patch replaces the linear scan with find_next_{bit,zero_bit}_le().
> Testing shows a 20-33% decrease in execution time for
> convert_free_space_to_extents().
> 
> Suggested-by: Omar Sandoval 
> Signed-off-by: Howard McLauchlan 
> ---
> 
> Since we change bitmap to be unsigned long, we have to do some casting for the
> bitmap cursor. In le_bitmap_set() it makes sense to use u8, as we are doing
> bit operations. Everywhere else, we're just using it for pointer arithmetic 
> and
> not directly accessing it, so char seems more appropriate.

Ok, makes sense for just passing the pointers around. I'll add the text
to changelog and apply the patch to next.

> - bit = !!le_test_bit(bitnr, bitmap);

This is the last use of le_test_bit, so it can be removed (in another
patch).


Re: [PATCH 1/2] btrfs: remove le_bitmap_clear()

2018-04-18 Thread David Sterba
On Wed, Apr 18, 2018 at 02:30:58PM -0700, Howard McLauchlan wrote:
> le_bitmap_clear() was implemented as a parallel to le_bitmap_set() but
> is actually not being used. This patch removes it.
> 
> Signed-off-by: Howard McLauchlan 

Reviewed-by: David Sterba 

I've noticed that le_bitmap_set is only used for the free-space-tree so
it can be moved there and made static.


Re: [PATCH v3] btrfs: Do super block verification before writing it to disk

2018-04-18 Thread David Sterba
On Tue, Apr 17, 2018 at 09:47:19AM +0800, Qu Wenruo wrote:
> @@ -2680,7 +2681,7 @@ int open_ctree(struct super_block *sb,
>  
>   memcpy(fs_info->fsid, fs_info->super_copy->fsid, BTRFS_FSID_SIZE);
>  
> - ret = btrfs_check_super_valid(fs_info);
> + ret = btrfs_check_super_valid(fs_info, fs_info->super_copy);
>   if (ret) {
>   btrfs_err(fs_info, "superblock contains fatal errors");
>   err = -EINVAL;
> @@ -3310,6 +3311,27 @@ static int write_dev_supers(struct btrfs_device *device,

This is in write_dev_supers, so the superblock is checked
number-of-devices times. The caller write_all_supers rewrites the device
item so it matches the device it's going to write to. But,
btrfs_check_super_valid does not validate the dev_item so all the
validation does not bring much benefit, as it repeatedly checks the same
data.

So, what if the validation is done only once in write_all_supers? Lock
the devices, validate, if it fails, report that and unlock devices and
go readonly.

There's a difference from what you implemented: if the in-memory superblock
corruption happens between writes to the devices, some devices are left
with the new superblock and some with the old.

Although this sounds quite improbable, I think that doing the check in
advance would save some trouble if that happens. The superblocks on all
devices will match.


[PATCH 2/2] btrfs: optimize free space tree bitmap conversion

2018-04-18 Thread Howard McLauchlan
Presently, convert_free_space_to_extents() does a linear scan of the
bitmap. We can speed this up with find_next_{bit,zero_bit}_le().

This patch replaces the linear scan with find_next_{bit,zero_bit}_le().
Testing shows a 20-33% decrease in execution time for
convert_free_space_to_extents().

Suggested-by: Omar Sandoval 
Signed-off-by: Howard McLauchlan 
---

Since we change bitmap to be unsigned long, we have to do some casting for the
bitmap cursor. In le_bitmap_set() it makes sense to use u8, as we are doing
bit operations. Everywhere else, we're just using it for pointer arithmetic and
not directly accessing it, so char seems more appropriate.

Howard

 fs/btrfs/extent_io.c   |  4 +--
 fs/btrfs/extent_io.h   |  2 +-
 fs/btrfs/free-space-tree.c | 54 ++
 3 files changed, 22 insertions(+), 38 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index cf50278f1c59..1d984b0acd66 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5620,9 +5620,9 @@ void copy_extent_buffer(struct extent_buffer *dst, struct extent_buffer *src,
}
 }
 
-void le_bitmap_set(u8 *map, unsigned int start, int len)
+void le_bitmap_set(unsigned long *map, unsigned int start, int len)
 {
-   u8 *p = map + BIT_BYTE(start);
+   u8 *p = ((u8 *)map) + BIT_BYTE(start);
const unsigned int size = start + len;
int bits_to_set = BITS_PER_BYTE - (start % BITS_PER_BYTE);
u8 mask_to_set = BITMAP_FIRST_BYTE_MASK(start);
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 4b8b072d9594..01261306e5f9 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -84,7 +84,7 @@ static inline int le_test_bit(int nr, const u8 *addr)
return 1U & (addr[BIT_BYTE(nr)] >> (nr & (BITS_PER_BYTE-1)));
 }
 
-void le_bitmap_set(u8 *map, unsigned int start, int len);
+void le_bitmap_set(unsigned long *map, unsigned int start, int len);
 
 struct extent_state;
 struct btrfs_root;
diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c
index 32a0f6cb5594..0ddf96b01a3a 100644
--- a/fs/btrfs/free-space-tree.c
+++ b/fs/btrfs/free-space-tree.c
@@ -138,9 +138,9 @@ static inline u32 free_space_bitmap_size(u64 size, u32 sectorsize)
return DIV_ROUND_UP((u32)div_u64(size, sectorsize), BITS_PER_BYTE);
 }
 
-static u8 *alloc_bitmap(u32 bitmap_size)
+static unsigned long *alloc_bitmap(u32 bitmap_size)
 {
-   u8 *ret;
+   unsigned long *ret;
unsigned int nofs_flag;
 
/*
@@ -166,7 +166,8 @@ int convert_free_space_to_bitmaps(struct btrfs_trans_handle *trans,
struct btrfs_free_space_info *info;
struct btrfs_key key, found_key;
struct extent_buffer *leaf;
-   u8 *bitmap, *bitmap_cursor;
+   unsigned long *bitmap;
+   char *bitmap_cursor;
u64 start, end;
u64 bitmap_range, i;
u32 bitmap_size, flags, expected_extent_count;
@@ -255,7 +256,7 @@ int convert_free_space_to_bitmaps(struct btrfs_trans_handle *trans,
goto out;
}
 
-   bitmap_cursor = bitmap;
+   bitmap_cursor = (char *)bitmap;
bitmap_range = fs_info->sectorsize * BTRFS_FREE_SPACE_BITMAP_BITS;
i = start;
while (i < end) {
@@ -304,13 +305,10 @@ int convert_free_space_to_extents(struct btrfs_trans_handle *trans,
struct btrfs_free_space_info *info;
struct btrfs_key key, found_key;
struct extent_buffer *leaf;
-   u8 *bitmap;
+   unsigned long *bitmap;
u64 start, end;
-   /* Initialize to silence GCC. */
-   u64 extent_start = 0;
-   u64 offset;
u32 bitmap_size, flags, expected_extent_count;
-   int prev_bit = 0, bit, bitnr;
+   unsigned long nrbits, start_bit, end_bit;
u32 extent_count = 0;
int done = 0, nr;
int ret;
@@ -348,7 +346,7 @@ int convert_free_space_to_extents(struct btrfs_trans_handle *trans,
break;
		} else if (found_key.type == BTRFS_FREE_SPACE_BITMAP_KEY) {
unsigned long ptr;
-   u8 *bitmap_cursor;
+   char *bitmap_cursor;
u32 bitmap_pos, data_size;
 
ASSERT(found_key.objectid >= start);
@@ -358,7 +356,7 @@ int convert_free_space_to_extents(struct btrfs_trans_handle *trans,
bitmap_pos = div_u64(found_key.objectid - start,
 fs_info->sectorsize *
 BITS_PER_BYTE);
-   bitmap_cursor = bitmap + bitmap_pos;
+   bitmap_cursor = ((char *)bitmap) + bitmap_pos;
			data_size = free_space_bitmap_size(found_key.offset,
							   fs_info->sectorsize);
 
@@ -392,32 +39

[PATCH 1/2] btrfs: remove le_bitmap_clear()

2018-04-18 Thread Howard McLauchlan
le_bitmap_clear() was implemented as a parallel to le_bitmap_set() but
is actually not being used. This patch removes it.

Signed-off-by: Howard McLauchlan 
---
 fs/btrfs/extent_io.c | 20 
 fs/btrfs/extent_io.h |  1 -
 2 files changed, 21 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index e99b329002cf..cf50278f1c59 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5640,26 +5640,6 @@ void le_bitmap_set(u8 *map, unsigned int start, int len)
}
 }
 
-void le_bitmap_clear(u8 *map, unsigned int start, int len)
-{
-   u8 *p = map + BIT_BYTE(start);
-   const unsigned int size = start + len;
-   int bits_to_clear = BITS_PER_BYTE - (start % BITS_PER_BYTE);
-   u8 mask_to_clear = BITMAP_FIRST_BYTE_MASK(start);
-
-   while (len - bits_to_clear >= 0) {
-   *p &= ~mask_to_clear;
-   len -= bits_to_clear;
-   bits_to_clear = BITS_PER_BYTE;
-   mask_to_clear = ~0;
-   p++;
-   }
-   if (len) {
-   mask_to_clear &= BITMAP_LAST_BYTE_MASK(size);
-   *p &= ~mask_to_clear;
-   }
-}
-
 /*
 * eb_bitmap_offset() - calculate the page and offset of the byte containing the
 * given bit number
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index a53009694b16..4b8b072d9594 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -85,7 +85,6 @@ static inline int le_test_bit(int nr, const u8 *addr)
 }
 
 void le_bitmap_set(u8 *map, unsigned int start, int len);
-void le_bitmap_clear(u8 *map, unsigned int start, int len);
 
 struct extent_state;
 struct btrfs_root;
-- 
2.17.0



Re: [PATCH v3] btrfs: Fix race condition between delayed refs and blockgroup removal

2018-04-18 Thread David Sterba
On Wed, Apr 18, 2018 at 09:41:54AM +0300, Nikolay Borisov wrote:
> When the delayed refs for a head are all run, eventually
> cleanup_ref_head is called which (in case of deletion) obtains a
> reference for the relevant btrfs_space_info struct by querying the bg
> for the range. This is problematic because when the last extent of a
> bg is deleted a race window emerges between removal of that bg and the
> subsequent invocation of cleanup_ref_head. This can result in cache being null
> and either a null pointer dereference or assertion failure.
> 
>   task: 8d04d31ed080 task.stack: 9e5dc10cc000
>   RIP: 0010:assfail.constprop.78+0x18/0x1a [btrfs]
>   RSP: 0018:9e5dc10cfbe8 EFLAGS: 00010292
>   RAX: 0044 RBX:  RCX: 
>   RDX: 8d04ffc1f868 RSI: 8d04ffc178c8 RDI: 8d04ffc178c8
>   RBP: 8d04d29e5ea0 R08: 01f0 R09: 0001
>   R10: 9e5dc0507d58 R11: 0001 R12: 8d04d29e5ea0
>   R13: 8d04d29e5f08 R14: 8d04efe29b40 R15: 8d04efe203e0
>   FS:  7fbf58ead500() GS:8d04ffc0() knlGS:
>   CS:  0010 DS:  ES:  CR0: 80050033
>   CR2: 7fe6c6975648 CR3: 13b2a000 CR4: 06f0
>   DR0:  DR1:  DR2: 
>   DR3:  DR6: fffe0ff0 DR7: 0400
>   Call Trace:
>__btrfs_run_delayed_refs+0x10e7/0x12c0 [btrfs]
>btrfs_run_delayed_refs+0x68/0x250 [btrfs]
>btrfs_should_end_transaction+0x42/0x60 [btrfs]
>btrfs_truncate_inode_items+0xaac/0xfc0 [btrfs]
>btrfs_evict_inode+0x4c6/0x5c0 [btrfs]
>evict+0xc6/0x190
>do_unlinkat+0x19c/0x300
>do_syscall_64+0x74/0x140
>entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>   RIP: 0033:0x7fbf589c57a7
> 
> To fix this, introduce a new flag "is_system" to head_ref structs,
> which is populated at insertion time. This allows us to decouple the
> space_info lookup from the lookup of the possibly deleted bg.
> 
> Fixes: d7eae3403f46 ("Btrfs: rework delayed ref total_bytes_pinned accounting")
> Suggested-by: Omar Sandoval 
> Signed-off-by: Nikolay Borisov 

Added to next, thanks.

> @@ -550,8 +550,10 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
>struct btrfs_delayed_ref_head *head_ref,
>struct btrfs_qgroup_extent_record *qrecord,
>u64 bytenr, u64 num_bytes, u64 ref_root, u64 reserved,
> -  int action, int is_data, int *qrecord_inserted_ret,
> +  int action, int is_data, int is_system,
> +  int *qrecord_inserted_ret,
>int *old_ref_mod, int *new_ref_mod)

At some point we might want to peel off a few arguments of that function.
It now has 14.  The fs_info/trans change should work, and the
action/is_data/is_system could be merged into a bitmask.


Re: nvme+btrfs+compression sensibility and benchmark

2018-04-18 Thread Brendan Hide

Thank you, all

Though the info is useful, there's not a clear consensus on what I 
should expect. For interest's sake, I'll post benchmarks from the device 
itself when it arrives.


I'm expecting at least that I'll be blown away :)

On 04/18/2018 09:23 PM, Chris Murphy wrote:



On Wed, Apr 18, 2018 at 10:38 AM, Austin S. Hemmelgarn 
mailto:ahferro...@gmail.com>> wrote:


For reference, the zstd compression in BTRFS uses level 3 by default
(as does zlib compression IIRC), though I'm not sure about lzop (I
think it uses the lowest compression setting).



The user space tool, zstd, does default to 3, according to its man page.

    -# # compression level [1-19] (default: 3)


However, the kernel is claiming it's level 0, which doesn't exist in the 
man page. So I have no idea what we're using. This is what I get with 
mount option compress=zstd


[    4.097858] BTRFS info (device nvme0n1p9): use zstd compression, level 0




--
Chris Murphy



Re: [PATCH v5 0/3] Allow rmdir(2) to delete a subvolume

2018-04-18 Thread David Sterba
On Wed, Apr 18, 2018 at 11:32:57AM +0900, Misono Tomohiro wrote:
> Changelog:
>   v4 -> v5 ... Merge 2nd and 3rd patches and update commit log
>No code change in total
>   v3 -> v4 ... Reorganize patches and update commit log
>No code change in total
>   v2 -> v3 ... Use if-else block instead of two if blocks and
>add Tested-by tag in 2nd patch
>   v1 -> v2 ... Split the patch to hopefully make review easier
> 
> Note: I will send a xfstest if this series is merged.
> 
> This series changes the behavior of rmdir(2) and allow it to delete an
> empty subvolume.
> 
> To that end, the 1st and 2nd patches refactor btrfs_ioctl_snap_destroy()
> and extract the actual deletion process into btrfs_delete_subvolume()
> (what remains in btrfs_ioctl_snap_destroy() is mainly permission checks).
> 
> Then, the 3rd patch changes btrfs_rmdir() to call this function. The
> required permission check is already done in vfs layer.
> 
> Tomohiro Misono (3):
>   btrfs: Move may_destroy_subvol() from ioctl.c to inode.c
>   btrfs: Factor out the main deletion process from btrfs_ioctl_snap_destroy()
>   btrfs: Allow rmdir(2) to delete an empty subvolume

Looks good to me now, thanks. I'll add it to next for more testing. The
3rd patch may need some updates in the changelog about the change in
behaviour, but this can be added later.

Tests should cover the common usage like rm -rf /path/to/subvol,
removing the empty subvol (the subvol stub), snapshots and rw
subvolumes, nested in a nontrivial way.


Re: nvme+btrfs+compression sensibility and benchmark

2018-04-18 Thread Chris Murphy
On Wed, Apr 18, 2018 at 10:38 AM, Austin S. Hemmelgarn  wrote:

> For reference, the zstd compression in BTRFS uses level 3 by default (as
> does zlib compression IIRC), though I'm not sure about lzop (I think it
> uses the lowest compression setting).
>


The user space tool, zstd, does default to 3, according to its man page.

   -# # compression level [1-19] (default: 3)


However, the kernel is claiming it's level 0, which doesn't exist in the
man page. So I have no idea what we're using. This is what I get with mount
option compress=zstd

[4.097858] BTRFS info (device nvme0n1p9): use zstd compression, level 0



--
Chris Murphy


Re: [PATCH] btrfs: optimize free space tree bitmap conversion

2018-04-18 Thread Howard McLauchlan
On 04/18/2018 08:50 AM, David Sterba wrote:
> On Tue, Apr 17, 2018 at 04:19:04PM -0700, Howard McLauchlan wrote:
>> Presently, convert_free_space_to_extents() does a linear scan of the
>> bitmap. We can speed this up with find_next_{bit,zero_bit}_le().
>>
>> This patch replaces the linear scan with find_next_{bit,zero_bit}_le().
>> Testing shows a 20-33% decrease in execution time for
>> convert_free_space_to_extents().
> Sounds good.
>
>> Suggested-by: Omar Sandoval 
>> Signed-off-by: Howard McLauchlan 
>> ---
>> Based on 4.17-rc1
>>
>>  fs/btrfs/extent_io.c   | 12 -
>>  fs/btrfs/extent_io.h   |  4 +--
>>  fs/btrfs/free-space-tree.c | 54 ++
>>  3 files changed, 27 insertions(+), 43 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index e99b329002cf..1c0e7ce49556 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -5620,12 +5620,12 @@ void copy_extent_buffer(struct extent_buffer *dst, 
>> struct extent_buffer *src,
>>  }
>>  }
>>  
>> -void le_bitmap_set(u8 *map, unsigned int start, int len)
>> +void le_bitmap_set(unsigned long *map, unsigned int start, int len)
>>  {
>> -u8 *p = map + BIT_BYTE(start);
>> +char *p = ((char *)map) + BIT_BYTE(start);
>>  const unsigned int size = start + len;
>>  int bits_to_set = BITS_PER_BYTE - (start % BITS_PER_BYTE);
>> -u8 mask_to_set = BITMAP_FIRST_BYTE_MASK(start);
>> +char mask_to_set = BITMAP_FIRST_BYTE_MASK(start);
> Why do you switch to char? As there are bit operations done below (that
> you don't change), the unsigned type seems correct.
>
>>  
>>  while (len - bits_to_set >= 0) {
>>  *p |= mask_to_set;
>> @@ -5640,12 +5640,12 @@ void le_bitmap_set(u8 *map, unsigned int start, int 
>> len)
>>  }
>>  }
>>  
>> -void le_bitmap_clear(u8 *map, unsigned int start, int len)
>> +void le_bitmap_clear(unsigned long *map, unsigned int start, int len)
>>  {
>> -u8 *p = map + BIT_BYTE(start);
>> +char *p = ((char *)map) + BIT_BYTE(start);
>>  const unsigned int size = start + len;
>>  int bits_to_clear = BITS_PER_BYTE - (start % BITS_PER_BYTE);
>> -u8 mask_to_clear = BITMAP_FIRST_BYTE_MASK(start);
>> +char mask_to_clear = BITMAP_FIRST_BYTE_MASK(start);
> Same here and in below. Otherwise it looks ok.
>
>>  
>>  while (len - bits_to_clear >= 0) {
>>  *p &= ~mask_to_clear;
>> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
>> index a53009694b16..a6db233f4a40 100644
>> --- a/fs/btrfs/extent_io.h
>> +++ b/fs/btrfs/extent_io.h
>> @@ -84,8 +84,8 @@ static inline int le_test_bit(int nr, const u8 *addr)
>>  return 1U & (addr[BIT_BYTE(nr)] >> (nr & (BITS_PER_BYTE-1)));
>>  }
>>  
>> -void le_bitmap_set(u8 *map, unsigned int start, int len);
>> -void le_bitmap_clear(u8 *map, unsigned int start, int len);
>> +void le_bitmap_set(unsigned long *map, unsigned int start, int len);
>> +void le_bitmap_clear(unsigned long *map, unsigned int start, int len);
>>  
>>  struct extent_state;
>>  struct btrfs_root;
>> diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c
>> index 32a0f6cb5594..0ddf96b01a3a 100644
>> --- a/fs/btrfs/free-space-tree.c
>> +++ b/fs/btrfs/free-space-tree.c
>> @@ -138,9 +138,9 @@ static inline u32 free_space_bitmap_size(u64 size, u32 
>> sectorsize)
>>  return DIV_ROUND_UP((u32)div_u64(size, sectorsize), BITS_PER_BYTE);
>>  }
>>  
>> -static u8 *alloc_bitmap(u32 bitmap_size)
>> +static unsigned long *alloc_bitmap(u32 bitmap_size)
>>  {
>> -u8 *ret;
>> +unsigned long *ret;
>>  unsigned int nofs_flag;
>>  
>>  /*
>> @@ -166,7 +166,8 @@ int convert_free_space_to_bitmaps(struct 
>> btrfs_trans_handle *trans,
>>  struct btrfs_free_space_info *info;
>>  struct btrfs_key key, found_key;
>>  struct extent_buffer *leaf;
>> -u8 *bitmap, *bitmap_cursor;
>> +unsigned long *bitmap;
>> +char *bitmap_cursor;
>>  u64 start, end;
>>  u64 bitmap_range, i;
>>  u32 bitmap_size, flags, expected_extent_count;
>> @@ -255,7 +256,7 @@ int convert_free_space_to_bitmaps(struct 
>> btrfs_trans_handle *trans,
>>  goto out;
>>  }
>>  
>> -bitmap_cursor = bitmap;
>> +bitmap_cursor = (char *)bitmap;
>>  bitmap_range = fs_info->sectorsize * BTRFS_FREE_SPACE_BITMAP_BITS;
>>  i = start;
>>  while (i < end) {
>> @@ -304,13 +305,10 @@ int convert_free_space_to_extents(struct 
>> btrfs_trans_handle *trans,
>>  struct btrfs_free_space_info *info;
>>  struct btrfs_key key, found_key;
>>  struct extent_buffer *leaf;
>> -u8 *bitmap;
>> +unsigned long *bitmap;
>>  u64 start, end;
>> -/* Initialize to silence GCC. */
>> -u64 extent_start = 0;
>> -u64 offset;
>>  u32 bitmap_size, flags, expected_extent_count;
>> -int prev_bit = 0, bit, bitnr;
>> +unsigned long nrbits, start_bit, end_bit;
>>  u32 extent_count = 0;
>>  int done = 0, nr;
>>  int ret;
>> 

Re: [PATCH] btrfs: delayed-inode: Remove wrong qgroup meta reservation calls

2018-04-18 Thread David Sterba
On Wed, Apr 18, 2018 at 09:08:10AM +0800, Qu Wenruo wrote:
> >>dst_rsv = &fs_info->delayed_block_rsv;
> >>  
> >>num_bytes = btrfs_calc_trans_metadata_size(fs_info, 1);
> >> +
> >> +  /*
> >> +   * Here we migrate space rsv from transaction rsv, since have
> >> +   * already reserved space when starting a transaction.
> >> +   * So no need to reserve qgroup space here.
> >> +   */
> > 
> > Please format the comments to the full line width.
> 
> Right, the already can go previous line without exceeding 80 chars.
> 
> But the "So no need to..." line is a new line so it will not take up any
> space of previous line anyway.

You can split the loosely related parts by an empty line, ie.
paragraphs, but I don't tend to like if the new sentence on a new line
when it logically follows the previous one.

> >> @@ -647,7 +657,9 @@ static int btrfs_delayed_inode_reserve_metadata(
> >>  "delayed_inode",
> >>  btrfs_ino(inode),
> >>  num_bytes, 1);
> >> -  }
> >> +  } else
> >> +  btrfs_qgroup_free_meta_prealloc(root,
> >> +  fs_info->nodesize);
> > 
> > Please don't diverge from the coding style, I'm fixing such issues but,
> > you know.
> 
> For this, did you mean the bracket for else branch?

Yes.

if () {
...
} else {
...
}


Re: nvme+btrfs+compression sensibility and benchmark

2018-04-18 Thread Nikolay Borisov


On 18.04.2018 21:28, David Sterba wrote:
> On Wed, Apr 18, 2018 at 06:14:07PM +0300, Nikolay Borisov wrote:
>>
>>
>> On 18.04.2018 18:10, Brendan Hide wrote:
>>> Hi, all
>>>
>>> I'm looking for some advice re compression with NVME. Compression helps
>>> performance with a minor CPU hit - but is it still worth it with the far
>>> higher throughputs offered by newer PCI and NVME-type SSDs?
>>>
>>> I've ordered a PCIe-to-M.2 adapter along with a 1TB 960 Evo drive for my
>>> home desktop. I previously used compression on an older SATA-based Intel
>>> 520 SSD, where compression made sense.
>>>
>>> However, the wisdom isn't so clear-cut if the SSD is potentially faster
>>> than the compression algorithm with my CPU (aging i7 3770).
>>>
>>> Testing using a copy of the kernel source tarball in tmpfs  it seems my
>>> system can compress/decompress at about 670MB/s using zstd with 8
>>> threads. lzop isn't that far behind. But I'm not sure if the benchmark
>>> I'm running is the same as how btrfs would be using it internally.
>>>
>>> Given these numbers I'm inclined to believe compression will make things
>>> slower - but can't be sure without knowing if I'm testing correctly.
>>>
>>> What is the best practice with benchmarking and with NVME/PCI storage?
>>
>> btrfs doesn't support DAX so using it on NVME doesn't make much sense
>> performance wise.
> 
> Isn't NVMe just "the faster SSD"? Not the persistent memory thing.

Indeed, brain fart on my part. NVDIMM is the persistent memory thing.

> 


Re: nvme+btrfs+compression sensibility and benchmark

2018-04-18 Thread David Sterba
On Wed, Apr 18, 2018 at 06:14:07PM +0300, Nikolay Borisov wrote:
> 
> 
> On 18.04.2018 18:10, Brendan Hide wrote:
> > Hi, all
> > 
> > I'm looking for some advice re compression with NVME. Compression helps
> > performance with a minor CPU hit - but is it still worth it with the far
> > higher throughputs offered by newer PCI and NVME-type SSDs?
> > 
> > I've ordered a PCIe-to-M.2 adapter along with a 1TB 960 Evo drive for my
> > home desktop. I previously used compression on an older SATA-based Intel
> > 520 SSD, where compression made sense.
> > 
> > However, the wisdom isn't so clear-cut if the SSD is potentially faster
> > than the compression algorithm with my CPU (aging i7 3770).
> > 
> > Testing using a copy of the kernel source tarball in tmpfs  it seems my
> > system can compress/decompress at about 670MB/s using zstd with 8
> > threads. lzop isn't that far behind. But I'm not sure if the benchmark
> > I'm running is the same as how btrfs would be using it internally.
> > 
> > Given these numbers I'm inclined to believe compression will make things
> > slower - but can't be sure without knowing if I'm testing correctly.
> > 
> > What is the best practice with benchmarking and with NVME/PCI storage?
> 
> btrfs doesn't support DAX so using it on NVME doesn't make much sense
> performance wise.

Isn't NVMe just "the faster SSD"? Not the persistent memory thing.


Re: [PATCH] btrfs: fix unaligned access in readdir

2018-04-18 Thread Liu Bo
On Wed, Apr 18, 2018 at 6:17 AM, David Sterba  wrote:
> The last update to readdir introduced a temporary buffer to store the
> emitted readdir data, but as there are file names of variable length,
> there's a lot of unaligned access.
>
> This was observed on a sparc64 machine:
>
>   Kernel unaligned access at TPC[102f3080] btrfs_real_readdir+0x51c/0x718 
> [btrfs]
>

Reviewed-by: Liu Bo 

thanks,
liubo
> Fixes: 23b5ec74943 ("btrfs: fix readdir deadlock with pagefault")
> CC: sta...@vger.kernel.org # 4.14+
> Reported-and-tested-by: René Rebe 
> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/inode.c | 20 
>  1 file changed, 12 insertions(+), 8 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index e064c49c9a9a..d241285a0d2a 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -31,6 +31,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "ctree.h"
>  #include "disk-io.h"
>  #include "transaction.h"
> @@ -5905,11 +5906,13 @@ static int btrfs_filldir(void *addr, int entries, 
> struct dir_context *ctx)
> struct dir_entry *entry = addr;
> char *name = (char *)(entry + 1);
>
> -   ctx->pos = entry->offset;
> -   if (!dir_emit(ctx, name, entry->name_len, entry->ino,
> - entry->type))
> +   ctx->pos = get_unaligned(&entry->offset);
> +   if (!dir_emit(ctx, name, get_unaligned(&entry->name_len),
> +get_unaligned(&entry->ino),
> +get_unaligned(&entry->type)))
> return 1;
> -   addr += sizeof(struct dir_entry) + entry->name_len;
> +   addr += sizeof(struct dir_entry) +
> +   get_unaligned(&entry->name_len);
> ctx->pos++;
> }
> return 0;
> @@ -5999,14 +6002,15 @@ static int btrfs_real_readdir(struct file *file, 
> struct dir_context *ctx)
> }
>
> entry = addr;
> -   entry->name_len = name_len;
> +   put_unaligned(name_len, &entry->name_len);
> name_ptr = (char *)(entry + 1);
> read_extent_buffer(leaf, name_ptr, (unsigned long)(di + 1),
>name_len);
> -   entry->type = btrfs_filetype_table[btrfs_dir_type(leaf, di)];
> +   put_unaligned(btrfs_filetype_table[btrfs_dir_type(leaf, di)],
> +   &entry->type);
> btrfs_dir_item_key_to_cpu(leaf, di, &location);
> -   entry->ino = location.objectid;
> -   entry->offset = found_key.offset;
> +   put_unaligned(location.objectid, &entry->ino);
> +   put_unaligned(found_key.offset, &entry->offset);
> entries++;
> addr += sizeof(struct dir_entry) + name_len;
> total_len += sizeof(struct dir_entry) + name_len;
> --
> 2.16.2
>


Re: Filesystem full with unallocated space?

2018-04-18 Thread Nikolay Borisov


On 17.04.2018 19:08, Matthew Lai wrote:
> Hello!
> 
> I am getting ENOSPC on my root filesystem with plenty of unallocated
> space according to "fi usage". Any idea why that may be? This is a root
> partition for Debian Stable. Not doing anything unusual that I'm aware
> of. No snapshots.
> 
> Thanks!
> 
> Matthew
> 
> uname -a:
> Linux bigfoot 4.9.0-4-amd64 #1 SMP Debian 4.9.65-3+deb9u1 (2017-12-23)
> x86_64 GNU/Linux
> 
> btrfs --version:
> btrfs-progs v4.7.3
> 
> btrfs fi show:
> Label: none  uuid: 2364c63f-e20c-410f-90b4-05f722ee1c77
>     Total devices 1 FS bytes used 176.27GiB
>     devid    1 size 246.33GiB used 185.01GiB path /dev/sda2
> 
> btrfs fi df /:
> Data, single: total=183.00GiB, used=175.78GiB
> System, single: total=4.00MiB, used=48.00KiB
> Metadata, single: total=2.01GiB, used=504.81MiB
> GlobalReserve, single: total=211.39MiB, used=0.00B
> 


You haven't shown "btrfs fi usage". In any case there were some patches
in more recent kernels which deal with premature ENOSPC:


996478ca9c46 ("btrfs: change how we decide to commit transactions during
flushing")
17024ad0a0fd ("Btrfs: fix early ENOSPC due to delalloc")


I'd advise you to update to 4.14 stable. Otherwise, running:
git log --oneline --grep "enospc" fs/btrfs/

will show you likely candidates.


Re: nvme+btrfs+compression sensibility and benchmark

2018-04-18 Thread Austin S. Hemmelgarn

On 2018-04-18 11:10, Brendan Hide wrote:

Hi, all

I'm looking for some advice re compression with NVME. Compression helps 
performance with a minor CPU hit - but is it still worth it with the far 
higher throughputs offered by newer PCI and NVME-type SSDs?


I've ordered a PCIe-to-M.2 adapter along with a 1TB 960 Evo drive for my 
home desktop. I previously used compression on an older SATA-based Intel 
520 SSD, where compression made sense.


However, the wisdom isn't so clear-cut if the SSD is potentially faster 
than the compression algorithm with my CPU (aging i7 3770).


Testing using a copy of the kernel source tarball in tmpfs, it seems my 
system can compress/decompress at about 670MB/s using zstd with 8 
threads. lzop isn't that far behind. But I'm not sure if the benchmark 
I'm running is the same as how btrfs would be using it internally.
BTRFS compresses chunks of 128k at a time and compresses each block one 
at a time (it doesn't do multi-threaded compression).  You can simulate 
this a bit better by splitting the files you're trying to compress into 
128k chunks (calling `split -b 131072` on the file will do this quickly 
and easily), and then passing all those chunks to the compression 
program _at the same time_ (this eliminates the overhead of re-invoking 
the compressor for each block), and then running it with one thread. 
For reference, the zstd compression in BTRFS uses level 3 by default (as 
does zlib compression IIRC), though I'm not sure about lzop (I think it 
uses the lowest compression setting).
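The procedure suggested above can be sketched as a small script. The paths
and the synthetic input are illustrative, and it assumes the zstd CLI is
installed:

```shell
#!/bin/sh
# Approximate btrfs-style compression: 128 KiB chunks, one thread,
# zstd level 3 (the btrfs default per the discussion above).
set -e

command -v zstd >/dev/null 2>&1 || { echo "zstd not installed, skipping"; exit 0; }

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Any large file works; here we synthesize 8 MiB of compressible data.
yes "some fairly compressible line of text" | head -c 8388608 > "$workdir/input"

# btrfs compresses at most 128 KiB (131072 bytes) at a time.
split -b 131072 "$workdir/input" "$workdir/chunk."

# Compress every chunk in one single-threaded invocation, so the per-chunk
# cost dominates rather than process startup.
time zstd -3 -T1 -q "$workdir"/chunk.*
```

Dividing the input size by the elapsed time gives a rough single-threaded
MB/s figure closer to what the kernel sees than a whole-file,
multi-threaded run.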


Note that this will still not be entirely accurate (there are 
significant differences in buffer handling in the in-kernel 
implementations because of memory management differences).


Another option is to see how long it takes to copy the test data into a 
ZRAM device.  This will eliminate the storage overhead, and use the same 
compression algorithms that BTRFS does (the only big difference is that 
it compresses by page, so it will use 4k blocks instead of 128k).  zRAM 
currently doesn't support zstd (though patches have been posted), but it 
by default uses lzo, and it supports deflate as well (which is 
essentially the same mathematically as the 'zlib' compression method in 
BTRFS).


Given these numbers I'm inclined to believe compression will make things 
slower - but can't be sure without knowing if I'm testing correctly.
On NVMe, yes, it's probably not worth it for speed.  It may however help 
in other ways.  Compressed writes are smaller than normal writes.  This 
means that rewriting a file that is compressed by the filesystem will 
result in fewer rewritten blocks of storage, which can be useful when 
dealing with flash memory.  Less written data also means you leave a bit 
more free space for the wear-leveling algorithms to work with, which can 
improve performance on some devices.



Re: [PATCH] btrfs: optimize free space tree bitmap conversion

2018-04-18 Thread David Sterba
On Tue, Apr 17, 2018 at 04:19:04PM -0700, Howard McLauchlan wrote:
> Presently, convert_free_space_to_extents() does a linear scan of the
> bitmap. We can speed this up with find_next_{bit,zero_bit}_le().
> 
> This patch replaces the linear scan with find_next_{bit,zero_bit}_le().
> Testing shows a 20-33% decrease in execution time for
> convert_free_space_to_extents().

Sounds good.

> Suggested-by: Omar Sandoval 
> Signed-off-by: Howard McLauchlan 
> ---
> Based on 4.17-rc1
> 
>  fs/btrfs/extent_io.c   | 12 -
>  fs/btrfs/extent_io.h   |  4 +--
>  fs/btrfs/free-space-tree.c | 54 ++
>  3 files changed, 27 insertions(+), 43 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index e99b329002cf..1c0e7ce49556 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -5620,12 +5620,12 @@ void copy_extent_buffer(struct extent_buffer *dst, 
> struct extent_buffer *src,
>   }
>  }
>  
> -void le_bitmap_set(u8 *map, unsigned int start, int len)
> +void le_bitmap_set(unsigned long *map, unsigned int start, int len)
>  {
> - u8 *p = map + BIT_BYTE(start);
> + char *p = ((char *)map) + BIT_BYTE(start);
>   const unsigned int size = start + len;
>   int bits_to_set = BITS_PER_BYTE - (start % BITS_PER_BYTE);
> - u8 mask_to_set = BITMAP_FIRST_BYTE_MASK(start);
> + char mask_to_set = BITMAP_FIRST_BYTE_MASK(start);

Why do you switch to char? As there are bit operations done below (that
you don't change), the unsigned type seems correct.

>  
>   while (len - bits_to_set >= 0) {
>   *p |= mask_to_set;
> @@ -5640,12 +5640,12 @@ void le_bitmap_set(u8 *map, unsigned int start, int 
> len)
>   }
>  }
>  
> -void le_bitmap_clear(u8 *map, unsigned int start, int len)
> +void le_bitmap_clear(unsigned long *map, unsigned int start, int len)
>  {
> - u8 *p = map + BIT_BYTE(start);
> + char *p = ((char *)map) + BIT_BYTE(start);
>   const unsigned int size = start + len;
>   int bits_to_clear = BITS_PER_BYTE - (start % BITS_PER_BYTE);
> - u8 mask_to_clear = BITMAP_FIRST_BYTE_MASK(start);
> + char mask_to_clear = BITMAP_FIRST_BYTE_MASK(start);

Same here and in below. Otherwise it looks ok.

>  
>   while (len - bits_to_clear >= 0) {
>   *p &= ~mask_to_clear;
> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index a53009694b16..a6db233f4a40 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -84,8 +84,8 @@ static inline int le_test_bit(int nr, const u8 *addr)
>   return 1U & (addr[BIT_BYTE(nr)] >> (nr & (BITS_PER_BYTE-1)));
>  }
>  
> -void le_bitmap_set(u8 *map, unsigned int start, int len);
> -void le_bitmap_clear(u8 *map, unsigned int start, int len);
> +void le_bitmap_set(unsigned long *map, unsigned int start, int len);
> +void le_bitmap_clear(unsigned long *map, unsigned int start, int len);
>  
>  struct extent_state;
>  struct btrfs_root;
> diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c
> index 32a0f6cb5594..0ddf96b01a3a 100644
> --- a/fs/btrfs/free-space-tree.c
> +++ b/fs/btrfs/free-space-tree.c
> @@ -138,9 +138,9 @@ static inline u32 free_space_bitmap_size(u64 size, u32 
> sectorsize)
>   return DIV_ROUND_UP((u32)div_u64(size, sectorsize), BITS_PER_BYTE);
>  }
>  
> -static u8 *alloc_bitmap(u32 bitmap_size)
> +static unsigned long *alloc_bitmap(u32 bitmap_size)
>  {
> - u8 *ret;
> + unsigned long *ret;
>   unsigned int nofs_flag;
>  
>   /*
> @@ -166,7 +166,8 @@ int convert_free_space_to_bitmaps(struct 
> btrfs_trans_handle *trans,
>   struct btrfs_free_space_info *info;
>   struct btrfs_key key, found_key;
>   struct extent_buffer *leaf;
> - u8 *bitmap, *bitmap_cursor;
> + unsigned long *bitmap;
> + char *bitmap_cursor;
>   u64 start, end;
>   u64 bitmap_range, i;
>   u32 bitmap_size, flags, expected_extent_count;
> @@ -255,7 +256,7 @@ int convert_free_space_to_bitmaps(struct 
> btrfs_trans_handle *trans,
>   goto out;
>   }
>  
> - bitmap_cursor = bitmap;
> + bitmap_cursor = (char *)bitmap;
>   bitmap_range = fs_info->sectorsize * BTRFS_FREE_SPACE_BITMAP_BITS;
>   i = start;
>   while (i < end) {
> @@ -304,13 +305,10 @@ int convert_free_space_to_extents(struct 
> btrfs_trans_handle *trans,
>   struct btrfs_free_space_info *info;
>   struct btrfs_key key, found_key;
>   struct extent_buffer *leaf;
> - u8 *bitmap;
> + unsigned long *bitmap;
>   u64 start, end;
> - /* Initialize to silence GCC. */
> - u64 extent_start = 0;
> - u64 offset;
>   u32 bitmap_size, flags, expected_extent_count;
> - int prev_bit = 0, bit, bitnr;
> + unsigned long nrbits, start_bit, end_bit;
>   u32 extent_count = 0;
>   int done = 0, nr;
>   int ret;
> @@ -348,7 +346,7 @@ int convert_free_space_to_extents(struct 
> btrfs_trans_handle *trans,
> 

Re: nvme+btrfs+compression sensibility and benchmark

2018-04-18 Thread Nikolay Borisov


On 18.04.2018 18:10, Brendan Hide wrote:
> Hi, all
> 
> I'm looking for some advice re compression with NVME. Compression helps
> performance with a minor CPU hit - but is it still worth it with the far
> higher throughputs offered by newer PCI and NVME-type SSDs?
> 
> I've ordered a PCIe-to-M.2 adapter along with a 1TB 960 Evo drive for my
> home desktop. I previously used compression on an older SATA-based Intel
> 520 SSD, where compression made sense.
> 
> However, the wisdom isn't so clear-cut if the SSD is potentially faster
> than the compression algorithm with my CPU (aging i7 3770).
> 
> Testing using a copy of the kernel source tarball in tmpfs  it seems my
> system can compress/decompress at about 670MB/s using zstd with 8
> threads. lzop isn't that far behind. But I'm not sure if the benchmark
> I'm running is the same as how btrfs would be using it internally.
> 
> Given these numbers I'm inclined to believe compression will make things
> slower - but can't be sure without knowing if I'm testing correctly.
> 
> What is the best practice with benchmarking and with NVME/PCI storage?

btrfs doesn't support DAX so using it on NVME doesn't make much sense
performance wise.

> 


nvme+btrfs+compression sensibility and benchmark

2018-04-18 Thread Brendan Hide

Hi, all

I'm looking for some advice re compression with NVME. Compression helps 
performance with a minor CPU hit - but is it still worth it with the far 
higher throughputs offered by newer PCI and NVME-type SSDs?


I've ordered a PCIe-to-M.2 adapter along with a 1TB 960 Evo drive for my 
home desktop. I previously used compression on an older SATA-based Intel 
520 SSD, where compression made sense.


However, the wisdom isn't so clear-cut if the SSD is potentially faster 
than the compression algorithm with my CPU (aging i7 3770).


Testing using a copy of the kernel source tarball in tmpfs, it seems my 
system can compress/decompress at about 670MB/s using zstd with 8 
threads. lzop isn't that far behind. But I'm not sure if the benchmark 
I'm running is the same as how btrfs would be using it internally.


Given these numbers I'm inclined to believe compression will make things 
slower - but can't be sure without knowing if I'm testing correctly.


What is the best practice with benchmarking and with NVME/PCI storage?



Re: [PATCH v2] btrfs: add comment about BTRFS_FS_EXCL_OP

2018-04-18 Thread David Sterba
On Wed, Apr 18, 2018 at 02:59:25PM +0800, Anand Jain wrote:
> Adds comments about BTRFS_FS_EXCL_OP to existing comments
> about the device locks.
> 
> Signed-off-by: Anand Jain 

Reviewed-by: David Sterba 


Re: Kernel unaligned access at ... btrfs_real_readdir+0x51c/0x718 [btrfs]

2018-04-18 Thread David Sterba
On Tue, Apr 17, 2018 at 12:18:45PM +0200, René Rebe wrote:
> On 16 Apr 2018, at 21:15, David Sterba  wrote:
> > On Mon, Apr 16, 2018 at 09:55:45PM +0200, René Rebe wrote:
> >> On 04/16/2018 06:48 PM, David Sterba wrote:
> >>> The warnings are valid, there's unaligned access introduced by patch
> >>> 
> >>> 23b5ec74943f44378b68c0edd8e210a86318ea5e
> >>> btrfs: fix readdir deadlock with pagefault
> >>> 
> >>> The directory entries (struct dir_entry) are copied to a temporary
> >>> buffer as they fit, ie. no alignment, and the members accessed in
> >>> several places.
> >>> 
> >>> The following patch adds the proper unaligned access, only compile-tested.
> >>> Please test and let me know, thanks!
> >> Would have loved to immediately give it a try, however, sorry,
> >> I forgot to mention I'm on the latest stable release -4.16.2-
> >> on a first glance this does not look like it does just apply.
> >> 
> >> I would re-base myself if I would not also have a glibc initialization 
> >> bug to hunt and debug, too :-/
> >> 
> >> If you happen to also rebase it for current -stable, ... ;-)
> > 
> > Sure, attached a 4.16.2 version.
> > <0001-test-readdir-unaligned-access.patch>
> 
> Cool, thanks, built and so far it works, the warnings are gone.
> 
>   https://www.youtube.com/watch?v=XYsKct4T2xk

Thanks for testing on real hardware! I'm not aware of regular runtime
testing on sparc; the other architectures with strict alignment
requirements probably have the emulation turned on by default, so the
warnings do not appear. Or nobody bothers to report them, though the
fixes are typically simple.

I like your vintage hw videos, keep going!


[PATCH] btrfs: fix unaligned access in readdir

2018-04-18 Thread David Sterba
The last update to readdir introduced a temporary buffer to store the
emitted readdir data, but as there are file names of variable length,
there's a lot of unaligned access.

This was observed on a sparc64 machine:

  Kernel unaligned access at TPC[102f3080] btrfs_real_readdir+0x51c/0x718 
[btrfs]

Fixes: 23b5ec74943 ("btrfs: fix readdir deadlock with pagefault")
CC: sta...@vger.kernel.org # 4.14+
Reported-and-tested-by: René Rebe 
Signed-off-by: David Sterba 
---
 fs/btrfs/inode.c | 20 
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e064c49c9a9a..d241285a0d2a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
@@ -5905,11 +5906,13 @@ static int btrfs_filldir(void *addr, int entries, struct dir_context *ctx)
struct dir_entry *entry = addr;
char *name = (char *)(entry + 1);
 
-   ctx->pos = entry->offset;
-   if (!dir_emit(ctx, name, entry->name_len, entry->ino,
- entry->type))
+   ctx->pos = get_unaligned(&entry->offset);
+   if (!dir_emit(ctx, name, get_unaligned(&entry->name_len),
+get_unaligned(&entry->ino),
+get_unaligned(&entry->type)))
return 1;
-   addr += sizeof(struct dir_entry) + entry->name_len;
+   addr += sizeof(struct dir_entry) +
+   get_unaligned(&entry->name_len);
ctx->pos++;
}
return 0;
@@ -5999,14 +6002,15 @@ static int btrfs_real_readdir(struct file *file, struct dir_context *ctx)
}
 
entry = addr;
-   entry->name_len = name_len;
+   put_unaligned(name_len, &entry->name_len);
name_ptr = (char *)(entry + 1);
read_extent_buffer(leaf, name_ptr, (unsigned long)(di + 1),
   name_len);
-   entry->type = btrfs_filetype_table[btrfs_dir_type(leaf, di)];
+   put_unaligned(btrfs_filetype_table[btrfs_dir_type(leaf, di)],
+   &entry->type);
btrfs_dir_item_key_to_cpu(leaf, di, &location);
-   entry->ino = location.objectid;
-   entry->offset = found_key.offset;
+   put_unaligned(location.objectid, &entry->ino);
+   put_unaligned(found_key.offset, &entry->offset);
entries++;
addr += sizeof(struct dir_entry) + name_len;
total_len += sizeof(struct dir_entry) + name_len;
-- 
2.16.2



[PATCH] btrfs: update uuid_mutex and device_list_mutex comments

2018-04-18 Thread Anand Jain
Bring the uuid_mutex and device_list_mutex comments in line with
the changes.

Signed-off-by: Anand Jain 
---
Hi David,

 Can you kindly add this patch to the set: 'Review uuid_mutex usage'

Thanks, Anand

 fs/btrfs/volumes.c | 21 +
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 6d7f10729713..12617a8a7083 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -155,29 +155,26 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
  *
  * uuid_mutex (global lock)
  * 
- * protects the fs_uuids list that tracks all per-fs fs_devices, resulting from
+ * Protects the fs_uuids list that tracks all per-fs fs_devices, resulting from
  * the SCAN_DEV ioctl registration or from mount either implicitly (the first
- * device) or requested by the device= mount option
- *
- * the mutex can be very coarse and can cover long-running operations
- *
- * protects: updates to fs_devices counters like missing devices, rw devices,
- * seeding, structure cloning, openning/closing devices at mount/umount time
+ * device) or requested by the device= mount option.
  *
  * global::fs_devs - add, remove, updates to the global list
  *
- * does not protect: manipulation of the fs_devices::devices list!
+ * Does not protect: manipulation of the fs_devices::devices list!
  *
  * btrfs_device::name - renames (write side), read is RCU
  *
  * fs_devices::device_list_mutex (per-fs, with RCU)
  * 
- * protects updates to fs_devices::devices, ie. adding and deleting
+ * Protects updates to fs_devices::devices, ie. adding and deleting, and its
+ * counters like missing devices, rw devices, seeding, structure cloning,
+ * opening/closing devices at mount/umount time.
  *
- * simple list traversal with read-only actions can be done with RCU protection
+ * Simple list traversal with read-only actions can be done with RCU protection.
  *
- * may be used to exclude some operations from running concurrently without any
- * modifications to the list (see write_all_supers)
+ * May be used to exclude some operations from running concurrently without any
+ * modifications to the list (see write_all_supers).
  *
  * balance_mutex
  * -
-- 
2.15.0



Re: [PATCH] btrfs: add chattr support for send/receive

2018-04-18 Thread David Sterba
On Tue, Apr 17, 2018 at 04:39:35PM -0700, Howard McLauchlan wrote:
> Presently btrfs send/receive does not propagate inode attribute flags;
> all chattr operations are effectively discarded upon transmission.
> 
> This patch adds kernel support for inode attribute flags. Userspace
> support can be found under the commit:
> 
> btrfs-progs: add chattr support for send/receive
> 
> An associated xfstest can be found at:
> 
> btrfs: add verify chattr support for send/receive test
> 
> A caveat is that a user with an updated kernel (aware of chattrs) and an
> older version of btrfs-progs (unaware of chattrs) will fail to receive
> if a chattr is included in the send stream.
> 
> Signed-off-by: Howard McLauchlan 

This is a send protocol change and must be properly versioned. There are
more known deficiencies in v1, see

https://btrfs.wiki.kernel.org/index.php/Design_notes_on_Send/Receive#Send_stream_v2_draft

> --- a/fs/btrfs/send.h
> +++ b/fs/btrfs/send.h
> @@ -77,6 +77,7 @@ enum btrfs_send_cmd {
>  
>   BTRFS_SEND_C_END,
>   BTRFS_SEND_C_UPDATE_EXTENT,
> + BTRFS_SEND_C_CHATTR,
>   __BTRFS_SEND_C_MAX,

This change

>  };
>  #define BTRFS_SEND_C_MAX (__BTRFS_SEND_C_MAX - 1)
> @@ -114,6 +115,7 @@ enum {
>   BTRFS_SEND_A_CLONE_PATH,
>   BTRFS_SEND_A_CLONE_OFFSET,
>   BTRFS_SEND_A_CLONE_LEN,
> + BTRFS_SEND_A_CHATTR,
>  
>   __BTRFS_SEND_A_MAX,

and this will change the numbers of __BTRFS_SEND_*_MAX and the defines
derived from them, which must stay fixed for v1 because they're part of
the public API exported from progs as send.h.

Unfortunately for anybody who wants to implement new additions to the
send stream, the full versioning, backward compatibility, command-line
options, and ioctl extensions need to happen first.

A concrete example of how this can be done wrong is the Synology version
of btrfs shipped with their NAS devices. There are some extensions to the
protocol that work on their kernel, but the send stream cannot be used on
upstream kernels.


[PATCH v3] btrfs-progs: dump-tree: add degraded option

2018-04-18 Thread Anand Jain
From: Anand Jain 

By default, btrfs inspect dump-tree scans for the required devices and
picks the disk with the largest generation to read the root tree.

But with RAID1/5/6 over two or more disks you may need to know what's
on each disk individually, so the new --noscan option tells dump-tree
to use only the given disks.

For example:

 mkfs.btrfs -fq -draid1 -mraid1 /dev/sd[b-d]
 btrfs in dump-tree --noscan /dev/sdb /dev/sdc

 However, without the --noscan option it works as usual, scanning for
 the devices on its own. And if the minimum required number of devices
 is not provided while --noscan is used, it errors out as shown below.

 btrfs in dump-tree --noscan /dev/sdb
 check arg type : /dev/sdb
 btrfs-progs v4.14.1
 warning, device 3 is missing
 warning, device 2 is missing
 bytenr mismatch, want=22036480, have=0
 ERROR: cannot read chunk root
 ERROR: unable to open /dev/sdb

Signed-off-by: Anand Jain 
---
v2->v3: make it scalable for more than two disks in noscan mode
v1->v2: rename --degraded to --noscan

 cmds-inspect-dump-tree.c | 60 ++--
 1 file changed, 53 insertions(+), 7 deletions(-)

diff --git a/cmds-inspect-dump-tree.c b/cmds-inspect-dump-tree.c
index 0802b31e9596..8625cd608ad0 100644
--- a/cmds-inspect-dump-tree.c
+++ b/cmds-inspect-dump-tree.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "kerncompat.h"
 #include "radix-tree.h"
@@ -183,6 +184,46 @@ static u64 treeid_from_string(const char *str, const char **end)
return id;
 }
 
+static int scan_args(char **argv, int argc, int tmpind, int noscan)
+{
+   int fd;
+   int ret;
+   u64 num_devices;
+   struct btrfs_fs_devices *tmp_devices;
+
+   while (tmpind < argc) {
+   ret = check_arg_type(argv[tmpind]);
+   if (ret != BTRFS_ARG_BLKDEV && ret != BTRFS_ARG_REG) {
+   error("not a block device or regular file: %s",
+ argv[tmpind]);
+   return -EINVAL;
+   }
+
+   if (!noscan) {
+   tmpind++;
+   continue;
+   }
+
+   fd = open(argv[tmpind], O_RDONLY);
+   if (fd < 0) {
+   error("cannot open %s: %m", argv[tmpind]);
+   return -EINVAL;
+   }
+   ret = btrfs_scan_one_device(fd, argv[tmpind],
+   &tmp_devices, &num_devices,
+   BTRFS_SUPER_INFO_OFFSET,
+   SBREAD_DEFAULT);
+   close(fd);
+   if (ret) {
+   error("cannot scan %s: %s",
+ argv[tmpind], strerror(-ret));
+   return ret;
+   }
+   tmpind++;
+   }
+   return 0;
+}
+
 const char * const cmd_inspect_dump_tree_usage[] = {
"btrfs inspect-internal dump-tree [options] device",
"Dump tree structures from a given device",
@@ -199,6 +240,7 @@ const char * const cmd_inspect_dump_tree_usage[] = {
"-b|--block  print info from the specified block only",
"-t|--tree print only tree with the given id (string or 
number)",
"--follow   use with -b, to show all children tree blocks 
of ",
+   "--noscan   only uses the devices provided as the argument",
NULL
 };
 
@@ -213,7 +255,7 @@ int cmd_inspect_dump_tree(int argc, char **argv)
struct btrfs_disk_key disk_key;
struct btrfs_key found_key;
char uuidbuf[BTRFS_UUID_UNPARSED_SIZE];
-   int ret;
+   int ret = 0;
int slot;
int extent_only = 0;
int device_only = 0;
@@ -225,10 +267,12 @@ int cmd_inspect_dump_tree(int argc, char **argv)
struct btrfs_root *tree_root_scan;
u64 tree_id = 0;
bool follow = false;
+   bool noscan = false;
 
while (1) {
int c;
-   enum { GETOPT_VAL_FOLLOW = 256 };
+   enum { GETOPT_VAL_FOLLOW = 256,
+  GETOPT_VAL_NOSCAN = 257};
static const struct option long_options[] = {
{ "extents", no_argument, NULL, 'e'},
{ "device", no_argument, NULL, 'd'},
@@ -238,6 +282,7 @@ int cmd_inspect_dump_tree(int argc, char **argv)
{ "block", required_argument, NULL, 'b'},
{ "tree", required_argument, NULL, 't'},
{ "follow", no_argument, NULL, GETOPT_VAL_FOLLOW },
+   { "noscan", no_argument, NULL, GETOPT_VAL_NOSCAN },
{ NULL, 0, NULL, 0 }
};
 
@@ -293,19 +338,20 @@ int cmd_inspect_dump_tree(int argc, char **argv)
case GETOPT_VAL_FOLLOW:
follow = true;
break;
+   case GETOP

Re: Hard link not persisted on fsync

2018-04-18 Thread Filipe Manana
On Wed, Apr 18, 2018 at 1:02 AM, Jayashree Mohan
 wrote:
> Hi,
>
> A gentle reminder on the crash consistency bug that we found on btrfs:

Why do you call it a consistency bug?
The filesystem does not stay in inconsistent state. The link count
stays 1 and the dentry used for fsync (foo) is persisted.
An inconsistency would be if we ended up with a link count of 2 and
only one of the dentries was persisted, or if we ended up with a link
count of 1 and both dentries were persisted.
Those cases would be detected by an fsck and could cause fs operations
to fail unexpectedly (attempting to remove a dentry, etc.).

> Link count of a file is not persisted even after a fsync. We believe a
> filesystem that ensures strictly ordered metadata behaviour should be
> able to persist the hard link after a fsync on the original file.

The thing is, there's no written specification of the expected and
correct behavior.
Yes, I'm aware of Dave's explanation on strictly ordered metadata on
the other thread and transaction details, but things on btrfs work
very differently from xfs (I'm not saying he's wrong).

For me it makes more sense to persist the hard link, but again,
there's a lack of a specification that demands such behaviour, and I'm
not aware of applications needing that behaviour nor user reports
about it.

> Could you comment on why btrfs does not exhibit this behavior, and if
> it's something you'd want to fix?

I don't know the "why", as I am not the original author of the log
tree (what is used to implement fsync on btrfs), so either it's
accidental behavior (the most likely) or intentional.

Since it's not something that causes any corruption, fs inconsistency,
crash, etc., nor are there user reports complaining about this (afaik),
it would be far from a priority for me at the moment, as I'm trying to
fix corruptions and similar issues (not necessarily caused by fsync).

Plus, adding such behaviour would have to be done carefully so as not to
impact performance, as checking whether the inode has multiple hard links
and which ones need to be persisted (created in the current, uncommitted
transaction) may have a measurable impact. The current behaviour is to
only guarantee persisting the dentry used for the fsync call.


>
> Thanks,
> Jayashree Mohan
>
>
>
> On Mon, Apr 16, 2018 at 9:35 AM, Jayashree Mohan
>  wrote:
>> Hi,
>>
>> The following seems to be a crash consistency bug on btrfs, where in
>> the link count is not persisted even after a fsync on the original
>> file.
>>
>> Consider the following workload :
>> creat foo
>> link (foo, A/bar)
>> fsync(foo)
>> ---Crash---
>>
>> Now, on recovery we expect the metadata of foo to be persisted i.e
>> have a link count of 2. However in btrfs, the link count is 1 and file
>> A/bar is not persisted. The expected behaviour would be to persist the
>> dependencies of inode foo. That is to say, shouldn't fsync of foo
>> persist A/bar and correctly update the link count?
>> Note that ext4, xfs and f2fs recover to the correct link count of 2
>> for the above workload.
>>
>> Let us know what you think about this behavior.
>>
>> Thanks,
>> Jayashree Mohan


Re: [PATCH] btrfs: add chattr support for send/receive

2018-04-18 Thread Filipe Manana
On Wed, Apr 18, 2018 at 12:39 AM, Howard McLauchlan  wrote:
> Presently btrfs send/receive does not propagate inode attribute flags;
> all chattr operations are effectively discarded upon transmission.
>
> This patch adds kernel support for inode attribute flags. Userspace
> support can be found under the commit:
>
> btrfs-progs: add chattr support for send/receive
>
> An associated xfstest can be found at:
>
> btrfs: add verify chattr support for send/receive test
>
> A caveat is that a user with an updated kernel (aware of chattrs) and an
> older version of btrfs-progs (unaware of chattrs) will fail to receive
> if a chattr is included in the send stream.

So we do have several things missing in send besides attribute flags,
hole punching for example (there's a list on the wiki).
We can't just add a new command and introduce such a caveat every time
we implement one of the missing and desired features.

In 2014, while wanting to implement some of those features, I
introduced a way to bump the send stream version with room (commands)
for all those missing features, so that all of them could be implemented
later without adding further backward incompatibility between kernel
and btrfs-progs versions.
Some of the threads for reference:

https://patchwork.kernel.org/patch/4021491/
https://www.spinics.net/lists/linux-btrfs/msg35169.html

It never took off, and honestly I don't remember why, as no one added
more comments on the latest versions of the kernel and btrfs-progs
patchsets.


>
> Signed-off-by: Howard McLauchlan 
> ---
> Based on 4.17-rc1
>
>  fs/btrfs/ctree.h |   2 +
>  fs/btrfs/ioctl.c |   2 +-
>  fs/btrfs/send.c  | 176 +++
>  fs/btrfs/send.h  |   2 +
>  4 files changed, 154 insertions(+), 28 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 5474ef14d6e6..a0dc6a8a37eb 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1436,6 +1436,8 @@ struct btrfs_map_token {
> unsigned long offset;
>  };
>
> +unsigned int btrfs_flags_to_ioctl(unsigned int flags);
> +
>  #define BTRFS_BYTES_TO_BLKS(fs_info, bytes) \
> ((bytes) >> (fs_info)->sb->s_blocksize_bits)
>
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 632e26d6f7ce..36ce1e589f9e 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -106,7 +106,7 @@ static unsigned int btrfs_mask_flags(umode_t mode, 
> unsigned int flags)
>  /*
>   * Export inode flags to the format expected by the FS_IOC_GETFLAGS ioctl.
>   */
> -static unsigned int btrfs_flags_to_ioctl(unsigned int flags)
> +unsigned int btrfs_flags_to_ioctl(unsigned int flags)
>  {
> unsigned int iflags = 0;
>
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> index 221e5cdb060b..da521a5a1843 100644
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -101,6 +101,13 @@ struct send_ctx {
> u64 cur_inode_last_extent;
> u64 cur_inode_next_write_offset;
>
> +   /*
> +* state for chattr purposes
> +*/
> +   u64 cur_inode_flip_flags;
> +   u64 cur_inode_receive_flags;
> +   int receive_flags_valid;
> +
> u64 send_progress;
>
> struct list_head new_refs;
> @@ -798,7 +805,7 @@ static int send_rmdir(struct send_ctx *sctx, struct 
> fs_path *path)
>   */
>  static int __get_inode_info(struct btrfs_root *root, struct btrfs_path *path,
>   u64 ino, u64 *size, u64 *gen, u64 *mode, u64 *uid,
> - u64 *gid, u64 *rdev)
> + u64 *gid, u64 *rdev, u64 *flags)
>  {
> int ret;
> struct btrfs_inode_item *ii;
> @@ -828,6 +835,8 @@ static int __get_inode_info(struct btrfs_root *root, 
> struct btrfs_path *path,
> *gid = btrfs_inode_gid(path->nodes[0], ii);
> if (rdev)
> *rdev = btrfs_inode_rdev(path->nodes[0], ii);
> +   if (flags)
> +   *flags = btrfs_inode_flags(path->nodes[0], ii);
>
> return ret;
>  }
> @@ -835,7 +844,7 @@ static int __get_inode_info(struct btrfs_root *root, 
> struct btrfs_path *path,
>  static int get_inode_info(struct btrfs_root *root,
>   u64 ino, u64 *size, u64 *gen,
>   u64 *mode, u64 *uid, u64 *gid,
> - u64 *rdev)
> + u64 *rdev, u64 *flags)
>  {
> struct btrfs_path *path;
> int ret;
> @@ -844,7 +853,7 @@ static int get_inode_info(struct btrfs_root *root,
> if (!path)
> return -ENOMEM;
> ret = __get_inode_info(root, path, ino, size, gen, mode, uid, gid,
> -  rdev);
> +  rdev, flags);
> btrfs_free_path(path);
> return ret;
>  }
> @@ -1233,7 +1242,7 @@ static int __iterate_backrefs(u64 ino, u64 offset, u64 
> root, void *ctx_)
>  * accept clones from these extents.
>  */
> ret = __get_inode_info(found->root, 

[PATCH v2] btrfs: Rewrite retry logic in do_chunk_alloc

2018-04-18 Thread Nikolay Borisov
do_chunk_alloc implements logic to detect whether there is currently
pending chunk allocation  (by means of space_info->chunk_alloc being
set) and if so it loops around to the 'again' label. Additionally,
based on the state of the space_info (e.g. whether it's full or not)
and the return value of should_alloc_chunk() it decides whether this
is a "hard" error (ENOSPC) or we can just return 0.

This patch refactors all of this:

1. Put order to the scattered ifs handling the various cases in an
easy-to-read if {} else if{} branches. This makes clear the various
cases we are interested in handling.

2. Call should_alloc_chunk only once and use the result in the
if/else if constructs. All of this is done under space_info->lock, so
even before multiple calls of should_alloc_chunk were unnecessary.

3. Rewrite the "do {} while()" loop currently implemented via label
into an explicit loop construct.

4. Move the mutex locking for the case where the caller is the one doing
the allocation. For the case where the caller needs to wait for a
concurrent allocation, introduce a pair of mutex_lock/mutex_unlock calls
to act as a barrier and reword the comment.

5. Switch local vars to bool type where pertinent.

All in all this shouldn't introduce any functional changes.

Signed-off-by: Nikolay Borisov 
---

v2: 
 * Introduce missing logic in the case where we have to loop. The correct
 behavior when a concurrent allocation is executing is to acquire/release
 the mutex and loop to check if it makes sense to continue with our
 allocation try.

 fs/btrfs/extent-tree.c | 73 +-
 1 file changed, 36 insertions(+), 37 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 7bfc03494c39..bfb19bcdeee3 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4601,7 +4601,8 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans,
  struct btrfs_fs_info *fs_info, u64 flags, int force)
 {
struct btrfs_space_info *space_info;
-   int wait_for_alloc = 0;
+   bool wait_for_alloc = false;
+   bool should_alloc = false;
int ret = 0;
 
/* Don't re-enter if we're already allocating a chunk */
@@ -4611,45 +4612,43 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans,
space_info = __find_space_info(fs_info, flags);
ASSERT(space_info);
 
-again:
-   spin_lock(&space_info->lock);
-   if (force < space_info->force_alloc)
-   force = space_info->force_alloc;
-   if (space_info->full) {
-   if (should_alloc_chunk(fs_info, space_info, force))
-   ret = -ENOSPC;
-   else
-   ret = 0;
-   spin_unlock(&space_info->lock);
-   return ret;
-   }
-
-   if (!should_alloc_chunk(fs_info, space_info, force)) {
-   spin_unlock(&space_info->lock);
-   return 0;
-   } else if (space_info->chunk_alloc) {
-   wait_for_alloc = 1;
-   } else {
-   space_info->chunk_alloc = 1;
-   }
-
-   spin_unlock(&space_info->lock);
-
-   mutex_lock(&fs_info->chunk_mutex);
+   do {
+   spin_lock(&space_info->lock);
+   if (force < space_info->force_alloc)
+   force = space_info->force_alloc;
+   should_alloc = should_alloc_chunk(fs_info, space_info, force);
+   if (space_info->full) {
+   /* No more free physical space */
+   if (should_alloc)
+   ret = -ENOSPC;
+   else
+   ret = 0;
+   spin_unlock(&space_info->lock);
+   return ret;
+   } else if (!should_alloc) {
+   spin_unlock(&space_info->lock);
+   return 0;
+   } else if (space_info->chunk_alloc) {
+   /* Someone is already allocating, so we need to block
+* while this someone is finished and then loop, to
+* recheck if we should continue with our allocation
+* try
+*/
+   wait_for_alloc = true;
+   spin_unlock(&space_info->lock);
+   mutex_lock(&fs_info->chunk_mutex);
+   mutex_unlock(&fs_info->chunk_mutex);
+   } else {
+   /* Proceed with allocation */
+   space_info->chunk_alloc = 1;
+   wait_for_alloc = false;
+   spin_unlock(&space_info->lock);
+   }
 
-   /*
-* The chunk_mutex is held throughout the entirety of a chunk
-* allocation, so once we've acquired the chunk_mutex we know that the
-* other guy is done and we need to recheck and see if we should

[PATCH] btrfs-progs: restore: fix mistake on overwrite_ok() if relative path is given

2018-04-18 Thread Ting-Chang
From: Ting-Chang Hou 

fstatat() returns -1 with errno EBADF if path_name is a relative path
and dirfd is not a valid descriptor. This made overwrite_ok() return
the wrong result: when restoring a subvolume to a destination given as
a relative path, existing files were overwritten rather than skipped.

Signed-off-by: tchou 
---
 cmds-restore.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cmds-restore.c b/cmds-restore.c
index ade35f0..dc042e2 100644
--- a/cmds-restore.c
+++ b/cmds-restore.c
@@ -825,7 +825,7 @@ static int overwrite_ok(const char * path)
int ret;
 
/* don't be fooled by symlinks */
-   ret = fstatat(-1, path_name, &st, AT_SYMLINK_NOFOLLOW);
+   ret = fstatat(AT_FDCWD, path_name, &st, AT_SYMLINK_NOFOLLOW);
 
if (!ret) {
if (overwrite)
-- 
2.7.4
