Re: [PATCH 04/10] Btrfs: extent map selftest: buffered write vs dio read

2017-12-21 Thread Nikolay Borisov


On 22.12.2017 00:42, Liu Bo wrote:
> This test case simulates the race between buffered write and dio read,
> and checks whether btrfs_get_extent() would return -EEXIST.

Isn't mixing dio/buffered IO on the same file (range?) considered
dangerous in any case?

> 
> Signed-off-by: Liu Bo 
> ---
>  fs/btrfs/tests/extent-map-tests.c | 73 +++++++++++++++++++++++++++++++++++++
>  1 file changed, 73 insertions(+)
> 
> diff --git a/fs/btrfs/tests/extent-map-tests.c b/fs/btrfs/tests/extent-map-tests.c
> index 0407396..2adf55f 100644
> --- a/fs/btrfs/tests/extent-map-tests.c
> +++ b/fs/btrfs/tests/extent-map-tests.c
> @@ -181,6 +181,78 @@ static void test_case_2(struct extent_map_tree *em_tree)
>   free_extent_map_tree(em_tree);
>  }
>  
> +static void __test_case_3(struct extent_map_tree *em_tree, u64 start)
> +{
> + struct extent_map *em;
> + u64 len = SZ_4K;
> + int ret;
> +
> + em = alloc_extent_map();
> + if (!em)
> + /* Skip this test on error. */
> + return;
> +
> + /* Add [4K, 8K) */
> + em->start = SZ_4K;
> + em->len = SZ_4K;
> + em->block_start = SZ_4K;
> + em->block_len = SZ_4K;
> + ret = add_extent_mapping(em_tree, em, 0);
> + ASSERT(ret == 0);
> + free_extent_map(em);
> +
> + em = alloc_extent_map();
> + if (!em)
> + goto out;
> +
> + /* Add [0, 16K) */
> + em->start = 0;
> + em->len = SZ_16K;
> + em->block_start = 0;
> + em->block_len = SZ_16K;
> + ret = btrfs_add_extent_mapping(em_tree, &em, start, len);
> + if (ret)
> + test_msg("case3 [0x%llx 0x%llx): ret %d\n",
> +  start, start + len, ret);
> + /*
> +  * Since bytes within em are contiguous, em->block_start is identical to
> +  * em->start.
> +  */
> + if (em &&
> + (start < em->start || start + len > extent_map_end(em) ||
> +  em->start != em->block_start || em->len != em->block_len))
> + test_msg("case3 [0x%llx 0x%llx): ret %d em (start 0x%llx len 
> 0x%llx block_start 0x%llx block_len 0x%llx)\n",
> +  start, start + len, ret, em->start, em->len,
> +  em->block_start, em->block_len);
> + free_extent_map(em);
> +out:
> + /* free memory */
> + free_extent_map_tree(em_tree);
> +}
> +
> +/*
> + * Test scenario:
> + *
> + * Suppose that no extent map has been loaded into memory yet.
> + * There is a file extent [0, 16K), two jobs are running concurrently
> + * against it, t1 is buffered writing to [4K, 8K) and t2 is doing dio
> + * read from [0, 4K) or [8K, 12K) or [12K, 16K).
> + *
> + * t1 goes ahead of t2 and adds em [4K, 8K) into tree.
> + *
> + * t1   t2
> + *  cow_file_range()  btrfs_get_extent()
> + *-> lookup_extent_mapping()
> + *   -> add_extent_mapping()
> + *-> add_extent_mapping()
> + */
> +static void test_case_3(struct extent_map_tree *em_tree)
> +{
> + __test_case_3(em_tree, 0);
> + __test_case_3(em_tree, SZ_8K);
> + __test_case_3(em_tree, (12 * 1024ULL));
> +}
> +
>  int btrfs_test_extent_map()
>  {
>   struct extent_map_tree *em_tree;
> @@ -196,6 +268,7 @@ int btrfs_test_extent_map()
>  
>   test_case_1(em_tree);
>   test_case_2(em_tree);
> + test_case_3(em_tree);
>  
>   kfree(em_tree);
>   return 0;
> 
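For reference, a worked instance of what the test asserts: with start == 0,
t2's btrfs_get_extent() builds em [0, 16K) while [4K, 8K) already sits in the
tree, so add_extent_mapping() returns -EEXIST; btrfs_add_extent_mapping()
must then merge/trim and hand back an em that still covers the requested
[0, 4K) and, because the file extent's bytes are contiguous, satisfies
em->start == em->block_start and em->len == em->block_len, the exact
invariants the test_msg() checks above.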


Re: [PATCH 03/10] Btrfs: add extent map selftests

2017-12-21 Thread Nikolay Borisov


On 22.12.2017 00:42, Liu Bo wrote:
> We've observed that btrfs_get_extent() and merge_extent_mapping() could
> return -EEXIST in several cases, and these are caused by racy
> conditions, e.g. dio read vs. dio write, which makes the problem very
> tricky to reproduce.
> 
> This adds extent map selftests in order to simulate those racy situations.
> 
> Signed-off-by: Liu Bo 
> ---
>  fs/btrfs/Makefile |   2 +-
>  fs/btrfs/tests/btrfs-tests.c  |   3 +
>  fs/btrfs/tests/btrfs-tests.h  |   1 +
>  fs/btrfs/tests/extent-map-tests.c | 202 ++++++++++++++++++++++++++++++++++
>  4 files changed, 207 insertions(+), 1 deletion(-)
>  create mode 100644 fs/btrfs/tests/extent-map-tests.c
> 
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index 6fe881d..0c43736 100644
> --- a/fs/btrfs/Makefile
> +++ b/fs/btrfs/Makefile
> @@ -19,4 +19,4 @@ btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
>  btrfs-$(CONFIG_BTRFS_FS_RUN_SANITY_TESTS) += tests/free-space-tests.o \
>   tests/extent-buffer-tests.o tests/btrfs-tests.o \
>   tests/extent-io-tests.o tests/inode-tests.o tests/qgroup-tests.o \
> - tests/free-space-tree-tests.o
> + tests/free-space-tree-tests.o tests/extent-map-tests.o
> diff --git a/fs/btrfs/tests/btrfs-tests.c b/fs/btrfs/tests/btrfs-tests.c
> index d3f2537..9786d8c 100644
> --- a/fs/btrfs/tests/btrfs-tests.c
> +++ b/fs/btrfs/tests/btrfs-tests.c
> @@ -277,6 +277,9 @@ int btrfs_run_sanity_tests(void)
>   goto out;
>   }
>   }
> + ret = btrfs_test_extent_map();
> + if (ret)
> + goto out;
>  out:
>   btrfs_destroy_test_fs();
>   return ret;
> diff --git a/fs/btrfs/tests/btrfs-tests.h b/fs/btrfs/tests/btrfs-tests.h
> index 266f1e3..bc0615b 100644
> --- a/fs/btrfs/tests/btrfs-tests.h
> +++ b/fs/btrfs/tests/btrfs-tests.h
> @@ -33,6 +33,7 @@ int btrfs_test_extent_io(u32 sectorsize, u32 nodesize);
>  int btrfs_test_inodes(u32 sectorsize, u32 nodesize);
>  int btrfs_test_qgroups(u32 sectorsize, u32 nodesize);
>  int btrfs_test_free_space_tree(u32 sectorsize, u32 nodesize);
> +int btrfs_test_extent_map(void);
>  struct inode *btrfs_new_test_inode(void);
>  struct btrfs_fs_info *btrfs_alloc_dummy_fs_info(u32 nodesize, u32 sectorsize);
>  void btrfs_free_dummy_fs_info(struct btrfs_fs_info *fs_info);
> diff --git a/fs/btrfs/tests/extent-map-tests.c b/fs/btrfs/tests/extent-map-tests.c
> new file mode 100644
> index 000..0407396
> --- /dev/null
> +++ b/fs/btrfs/tests/extent-map-tests.c
> @@ -0,0 +1,202 @@
> +/*
> + * Copyright (C) 2017 Oracle.  All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public
> + * License v2 as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; if not, write to the
> + * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
> + * Boston, MA 021110-1307, USA.
> + */
> +
> +#include 
> +#include "btrfs-tests.h"
> +#include "../ctree.h"
> +
> +static void free_extent_map_tree(struct extent_map_tree *em_tree)
> +{
> + struct extent_map *em;
> + struct rb_node *node;
> +
> + while (!RB_EMPTY_ROOT(&em_tree->map)) {
> + node = rb_first(&em_tree->map);
> + em = rb_entry(node, struct extent_map, rb_node);
> + remove_extent_mapping(em_tree, em);
> +
> +#ifdef CONFIG_BTRFS_DEBUG
> + if (refcount_read(&em->refs) != 1) {
> + test_msg(
> +  "Oops we have an em leak: em (start 0x%llx len 0x%llx block_start 0x%llx block_len 0x%llx) refs %d\n",
> +  em->start, em->len, em->block_start,
> +  em->block_len, refcount_read(&em->refs));
> +
> + refcount_set(&em->refs, 1);
> + }
> +#endif
> + free_extent_map(em);
> + }
> +}
> +
> +/*
> + * Test scenario:
> + *
> + * Suppose that no extent map has been loaded into memory yet, there is a 
> file
> + * extent [0, 16K), followed by another file extent [16K, 20K), two dio reads
> + * are entering btrfs_get_extent() concurrently, t1 is reading [8K, 16K), t2 
> is
> + * reading [0, 8K)
> + *
> + * t1t2
> + *  btrfs_get_extent()  btrfs_get_extent()
> + *-> lookup_extent_mapping()  ->lookup_extent_mapping()
> + *-> add_extent_mapping(0, 16K)
> + *-> return em
> + *->add_extent_mapping(0, 16K)
> + *-> #handle -EEXIST
> + */
> +static void 

Re: [PATCH 01/10] Btrfs: add helper for em merge logic

2017-12-21 Thread Nikolay Borisov


On 22.12.2017 00:42, Liu Bo wrote:
> This is a prepare work for the following extent map selftest, which
> runs tests against em merge logic.
> 
> Signed-off-by: Liu Bo 
> ---
>  fs/btrfs/ctree.h |   2 ++
>  fs/btrfs/inode.c | 101 ++++++++++++++++++++++++++++++++-------------------
>  2 files changed, 58 insertions(+), 45 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index b2e09fe..328f40f 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3148,6 +3148,8 @@ struct btrfs_delalloc_work 
> *btrfs_alloc_delalloc_work(struct inode *inode,
>   int delay_iput);
>  void btrfs_wait_and_free_delalloc_work(struct btrfs_delalloc_work *work);
>  
> +int btrfs_add_extent_mapping(struct extent_map_tree *em_tree,
> +  struct extent_map **em_in, u64 start, u64 len);
>  struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode,
>   struct page *page, size_t pg_offset, u64 start,
>   u64 len, int create);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index e1a7f3c..527df6f 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -6911,6 +6911,61 @@ static noinline int uncompress_inline(struct 
> btrfs_path *path,
>   return ret;
>  }
>  
> +int btrfs_add_extent_mapping(struct extent_map_tree *em_tree,
> +  struct extent_map **em_in, u64 start, u64 len)

How about adding the following comment above the function: 

/** 
 * btrfs_add_extent_mapping - try to add given extent mapping   
 * @em_tree - the extent tree into which we want to add the mapping 
 * @em_in - extent we are inserting 
 * @start - the start of the logical range of the extent we are adding  
 * @len - logical length of the extent  
 *  
 * Insert @em_in into @em_tree. In case there is an overlapping range, handle   
 * the -EEXIST by either:   
 * a) Returning the existing extent in @em_in if there is a full overlap
 * b) Merging the extents if they are close to each other.
 *  
 * Returns 0 on success or a negative error code   
 *  
 */ 

Also, one thing which I'm not very clear on: why do we need start/len?
Aren't those already set in em_in?



> +{
> + int ret;
> + struct extent_map *em = *em_in;
> +
> + ret = add_extent_mapping(em_tree, em, 0);
> + /* it is possible that someone inserted the extent into the tree
> +  * while we had the lock dropped.  It is also possible that
> +  * an overlapping map exists in the tree
> +  */
> + if (ret == -EEXIST) {
> + struct extent_map *existing;
> +
> + ret = 0;
> +
> + existing = search_extent_mapping(em_tree, start, len);
> +
> + /*
> +  * existing will always be non-NULL, since there must be
> +  * extent causing the -EEXIST.
> +  */
> + if (existing->start == em->start &&
> + extent_map_end(existing) >= extent_map_end(em) &&
> + em->block_start == existing->block_start) {
> + /*
> +  * The existing extent map already encompasses the
> +  * entire extent map we tried to add.
> +  */
> + free_extent_map(em);
> + *em_in = existing;
> + ret = 0;
> + } else if (start >= extent_map_end(existing) ||
> + start <= existing->start) {
> + /*
> +  * The existing extent map is the one nearest to
> +  * the [start, start + len) range which overlaps
> +  */
> + ret = merge_extent_mapping(em_tree, existing,
> +em, start);
> + free_extent_map(existing);
> + if (ret) {
> + free_extent_map(em);
> + *em_in = NULL;
> + }
> + } else {
> + free_extent_map(em);
> + *em_in = existing;
> + ret = 0;
> + }
> + }
> + ASSERT(ret == 0 || ret == -EEXIST);
> + return ret;
> +}
> +
>  /*
>   * a bit scary, this does extent mapping from logical file offset to the 
> disk.
>   * the ugly parts come from merging extents from the disk 

[PATCH v2 01/10] btrfs: qgroup: Split meta rsv type into meta_prealloc and meta_pertrans

2017-12-21 Thread Qu Wenruo
Btrfs uses two different methods to reserve metadata qgroup space.
1) Reserve at btrfs_start_transaction() time
   This is quite straightforward: the caller will use the allocated trans
   handle to modify b-trees.

   In this case, reserved metadata should be kept until qgroup numbers
   are updated.

2) Reserve by using block_rsv first, and later btrfs_join_transaction()
   This is more complicated: the caller will reserve space using block_rsv
   first, and then later call btrfs_join_transaction() to get a trans
   handle.

   In this case, before we modify trees, the reserved space can be
   modified on demand, and after btrfs_join_transaction(), such reserved
   space should also be kept until qgroup numbers are updated.

Since these two types behave quite differently from each other, split the
original "META" reservation type into 2 sub-types:
  META_PERTRANS:
For the above case 1)

  META_PREALLOC:
For reservations that happen before btrfs_join_transaction() in case 2)
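
As a minimal sketch of the two call patterns (simplified, error handling
omitted; the convert step is introduced in patch 3 of this series):

    /* Case 1: META_PERTRANS, reserved when the transaction starts. */
    trans = btrfs_start_transaction(root, nitems);  /* reserves pertrans rsv */
    /* ... modify b-trees using @trans ... */
    btrfs_end_transaction(trans);   /* rsv kept until qgroup numbers update */

    /* Case 2: META_PREALLOC, reserved before joining a transaction. */
    ret = btrfs_qgroup_reserve_meta_prealloc(root, nbytes, true);
    trans = btrfs_join_transaction(root);
    /* ... modify b-trees ... */
    btrfs_qgroup_convert_reserved_meta(root, nbytes); /* prealloc -> pertrans */
    btrfs_end_transaction(trans);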

NOTE: This patch will only convert existing qgroup meta reservation
callers according to their situation, without ensuring all callers
reserve at the correct time.
Such fixes will be addressed in later patches.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent-tree.c   |  8 +++---
 fs/btrfs/qgroup.c| 22 +++---
 fs/btrfs/qgroup.h| 68 
 fs/btrfs/transaction.c   |  8 +++---
 include/trace/events/btrfs.h |  5 ++--
 5 files changed, 86 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 2f4328511ac8..20923fce06c4 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5989,7 +5989,7 @@ int btrfs_subvolume_reserve_metadata(struct btrfs_root 
*root,
if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags)) {
/* One for parent inode, two for dir entries */
num_bytes = 3 * fs_info->nodesize;
-   ret = btrfs_qgroup_reserve_meta(root, num_bytes, true);
+   ret = btrfs_qgroup_reserve_meta_prealloc(root, num_bytes, true);
if (ret)
return ret;
} else {
@@ -6008,7 +6008,7 @@ int btrfs_subvolume_reserve_metadata(struct btrfs_root 
*root,
ret = btrfs_block_rsv_migrate(global_rsv, rsv, num_bytes, 1);
 
if (ret && *qgroup_reserved)
-   btrfs_qgroup_free_meta(root, *qgroup_reserved);
+   btrfs_qgroup_free_meta_prealloc(root, *qgroup_reserved);
 
return ret;
 }
@@ -6084,7 +6084,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode 
*inode, u64 num_bytes)
spin_unlock(&inode->lock);
 
if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags)) {
-   ret = btrfs_qgroup_reserve_meta(root,
+   ret = btrfs_qgroup_reserve_meta_prealloc(root,
nr_extents * fs_info->nodesize, true);
if (ret)
goto out_fail;
@@ -6092,7 +6092,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode 
*inode, u64 num_bytes)
 
ret = btrfs_inode_rsv_refill(inode, flush);
if (unlikely(ret)) {
-   btrfs_qgroup_free_meta(root,
+   btrfs_qgroup_free_meta_prealloc(root,
   nr_extents * fs_info->nodesize);
goto out_fail;
}
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 39bdf5341372..0b85ec21ac1d 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -69,8 +69,10 @@ static const char *qgroup_rsv_type_str(enum 
btrfs_qgroup_rsv_type type)
 {
if (type == BTRFS_QGROUP_RSV_DATA)
return "data";
-   if (type == BTRFS_QGROUP_RSV_META)
-   return "meta";
+   if (type == BTRFS_QGROUP_RSV_META_PERTRANS)
+   return "meta_pertrans";
+   if (type == BTRFS_QGROUP_RSV_META_PREALLOC)
+   return "meta_prealloc";
return NULL;
 }
 #endif
@@ -3076,8 +3078,8 @@ int btrfs_qgroup_release_data(struct inode *inode, u64 
start, u64 len)
return __btrfs_qgroup_release_data(inode, NULL, start, len, 0);
 }
 
-int btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
- bool enforce)
+int __btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
+   enum btrfs_qgroup_rsv_type type, bool enforce)
 {
struct btrfs_fs_info *fs_info = root->fs_info;
int ret;
@@ -3088,14 +3090,14 @@ int btrfs_qgroup_reserve_meta(struct btrfs_root *root, 
int num_bytes,
 
BUG_ON(num_bytes != round_down(num_bytes, fs_info->nodesize));
trace_qgroup_meta_reserve(root, (s64)num_bytes);
-   ret = qgroup_reserve(root, num_bytes, enforce, BTRFS_QGROUP_RSV_META);
+   ret = qgroup_reserve(root, num_bytes, enforce, type);
if (ret < 0)
return ret;
atomic64_add(num_bytes, &root->qgroup_meta_rsv);
return ret;
 }
 
-void 

[PATCH v2 07/10] btrfs: qgroup: Update trace events for metadata reservation

2017-12-21 Thread Qu Wenruo
Now trace_qgroup_meta_reserve() takes an extra type parameter.

And introduce two new trace events:
1) trace_qgroup_meta_free_all_pertrans()
   For btrfs_qgroup_free_meta_all_pertrans()

2) trace_qgroup_meta_convert()
   For btrfs_qgroup_convert_reserved_meta()
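
For illustration, with the updated TP_printk formats below a reservation
and a conversion would render roughly like this (hypothetical refroot and
sizes):

    qgroup_meta_reserve: refroot=5(FS_TREE) type=meta_prealloc diff=16384
    qgroup_meta_convert: refroot=5(FS_TREE) type=meta_prealloc->meta_pertrans diff=16384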

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/qgroup.c|  7 +++---
 include/trace/events/btrfs.h | 55 ++--
 2 files changed, 57 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 96ed678b3588..ee5b05dd10a9 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -3150,7 +3150,7 @@ int __btrfs_qgroup_reserve_meta(struct btrfs_root *root, 
int num_bytes,
return 0;
 
BUG_ON(num_bytes != round_down(num_bytes, fs_info->nodesize));
-   trace_qgroup_meta_reserve(root, (s64)num_bytes);
+   trace_qgroup_meta_reserve(root, type, (s64)num_bytes);
ret = qgroup_reserve(root, num_bytes, enforce, type);
if (ret < 0)
return ret;
@@ -3175,7 +3175,7 @@ void btrfs_qgroup_free_meta_all_pertrans(struct 
btrfs_root *root)
return;
 
/* TODO: Update trace point to handle such free */
-   trace_qgroup_meta_reserve(root, 0);
+   trace_qgroup_meta_free_all_pertrans(root);
/* Special value -1 means to free all reserved space */
btrfs_qgroup_free_refroot(fs_info, root->objectid, (u64)-1,
  BTRFS_QGROUP_RSV_META_PERTRANS);
@@ -3197,7 +3197,7 @@ void __btrfs_qgroup_free_meta(struct btrfs_root *root, 
int num_bytes,
 */
num_bytes = sub_root_meta_rsv(root, num_bytes, type);
BUG_ON(num_bytes != round_down(num_bytes, fs_info->nodesize));
-   trace_qgroup_meta_reserve(root, -(s64)num_bytes);
+   trace_qgroup_meta_reserve(root, type, -(s64)num_bytes);
btrfs_qgroup_free_refroot(fs_info, root->objectid, num_bytes, type);
 }
 
@@ -3257,6 +3257,7 @@ void btrfs_qgroup_convert_reserved_meta(struct btrfs_root 
*root, int num_bytes)
/* Same as btrfs_qgroup_free_meta_prealloc() */
num_bytes = sub_root_meta_rsv(root, num_bytes,
  BTRFS_QGROUP_RSV_META_PREALLOC);
+   trace_qgroup_meta_convert(root, num_bytes);
qgroup_convert_meta(fs_info, root->objectid, num_bytes);
 }
 
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index e3218e673e1d..4460f1e84e45 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1629,6 +1629,28 @@ TRACE_EVENT(qgroup_update_reserve,
 
 TRACE_EVENT(qgroup_meta_reserve,
 
+   TP_PROTO(struct btrfs_root *root, int type, s64 diff),
+
+   TP_ARGS(root, type, diff),
+
+   TP_STRUCT__entry_btrfs(
+   __field(u64,refroot )
+   __field(s64,diff)
+   __field(int,type)
+   ),
+
+   TP_fast_assign_btrfs(root->fs_info,
+   __entry->refroot= root->objectid;
+   __entry->type   = type;
+   __entry->diff   = diff;
+   ),
+
+   TP_printk_btrfs("refroot=%llu(%s) type=%s diff=%lld",
+   show_root_type(__entry->refroot),
+   show_qgroup_rsv_type(__entry->type), __entry->diff)
+);
+
+TRACE_EVENT(qgroup_meta_convert,
+
TP_PROTO(struct btrfs_root *root, s64 diff),
 
TP_ARGS(root, diff),
@@ -1636,6 +1658,7 @@ TRACE_EVENT(qgroup_meta_reserve,
TP_STRUCT__entry_btrfs(
__field(u64,refroot )
__field(s64,diff)
+   __field(int,type)
),
 
TP_fast_assign_btrfs(root->fs_info,
@@ -1643,8 +1666,36 @@ TRACE_EVENT(qgroup_meta_reserve,
__entry->diff   = diff;
),
 
-   TP_printk_btrfs("refroot=%llu(%s) diff=%lld",
-   show_root_type(__entry->refroot), __entry->diff)
+   TP_printk_btrfs("refroot=%llu(%s) type=%s->%s diff=%lld",
+   show_root_type(__entry->refroot),
+   show_qgroup_rsv_type(BTRFS_QGROUP_RSV_META_PREALLOC),
+   show_qgroup_rsv_type(BTRFS_QGROUP_RSV_META_PERTRANS),
+   __entry->diff)
+);
+
+TRACE_EVENT(qgroup_meta_free_all_pertrans,
+
+   TP_PROTO(struct btrfs_root *root),
+
+   TP_ARGS(root),
+
+   TP_STRUCT__entry_btrfs(
+   __field(u64,refroot )
+   __field(s64,diff)
+   __field(int,type)
+   ),
+
+   TP_fast_assign_btrfs(root->fs_info,
+   __entry->refroot= root->objectid;
+   spin_lock(&root->qgroup_meta_rsv_lock);
+   __entry->diff   = -(s64)root->qgroup_meta_rsv_pertrans;
+   spin_unlock(&root->qgroup_meta_rsv_lock);
+   

[PATCH v2 10/10] btrfs: qgroup: Use independent and accurate per inode qgroup rsv

2017-12-21 Thread Qu Wenruo
Unlike the reservation calculation used for inode rsv metadata, qgroup
doesn't really need to care about things like csum size or extent usage
for whole-tree COW.

Qgroup cares more about the net change in extent usage.
That's to say, if we're going to insert one file extent, it will mostly
find its place in an already CoWed tree block, leaving no change in
extent usage. Or it causes a leaf split, resulting in one new net extent
and increasing the qgroup number by nodesize.
(Or, in the even rarer case, it increases the tree level, increasing the
qgroup number by 2 * nodesize.)

So instead of the heavily over-estimated calculation used for the extent
allocator, which cares most about accuracy and avoiding errors, qgroup
doesn't need that over-calculated reservation.

This patch will maintain 2 new members in the btrfs_block_rsv structure
for qgroup, using a much smaller calculation for the qgroup rsv and
reducing false EDQUOT.
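
As a worked comparison (assuming nodesize = 16K, and that
btrfs_calc_trans_metadata_size() reserves 2 * BTRFS_MAX_LEVEL * nodesize =
256K per item): the block_rsv side sizes itself at 256K for one outstanding
file extent, while the worst net growth qgroup can see for it is a leaf
split plus a level bump, i.e. 2 * 16K = 32K. Tracking qgroup_rsv_size
separately therefore stops quota from being charged several times (here 8x)
the realistic worst case.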

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h   | 18 +
 fs/btrfs/extent-tree.c | 55 --
 2 files changed, 62 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0c58f92c2d44..97783ba91e00 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -467,6 +467,24 @@ struct btrfs_block_rsv {
unsigned short full;
unsigned short type;
unsigned short failfast;
+
+   /*
+* Qgroup equivalent for @size and @reserved.
+*
+* Unlike the normal @size/@reserved for inode rsv,
+* qgroup doesn't care about things like csum size nor how many tree
+* blocks it will need to reserve.
+*
+* Qgroup cares more about the *NET* change of extent usage.
+* Even for ONE newly inserted file extent, in the worst case it will
+* only cause a leaf split and a level increase, so nodesize for each
+* file extent is already way overkill.
+*
+* In short, qgroup_rsv_size/reserved is the upper limit of possibly
+* needed qgroup metadata reservation.
+*/
+   u64 qgroup_rsv_size;
+   u64 qgroup_rsv_reserved;
 };
 
 /*
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 986660f301de..9d579c7bcf7f 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5565,14 +5565,18 @@ static void space_info_add_new_bytes(struct 
btrfs_fs_info *fs_info,
 
 static u64 block_rsv_release_bytes(struct btrfs_fs_info *fs_info,
struct btrfs_block_rsv *block_rsv,
-   struct btrfs_block_rsv *dest, u64 num_bytes)
+   struct btrfs_block_rsv *dest, u64 num_bytes,
+   u64 *qgroup_to_release_ret)
 {
struct btrfs_space_info *space_info = block_rsv->space_info;
+   u64 qgroup_to_release = 0;
u64 ret;
 
spin_lock(&block_rsv->lock);
-   if (num_bytes == (u64)-1)
+   if (num_bytes == (u64)-1) {
num_bytes = block_rsv->size;
+   qgroup_to_release = block_rsv->qgroup_rsv_size;
+   }
block_rsv->size -= num_bytes;
if (block_rsv->reserved >= block_rsv->size) {
num_bytes = block_rsv->reserved - block_rsv->size;
@@ -5581,6 +5585,12 @@ static u64 block_rsv_release_bytes(struct btrfs_fs_info 
*fs_info,
} else {
num_bytes = 0;
}
+   if (block_rsv->qgroup_rsv_reserved >= block_rsv->qgroup_rsv_size) {
+   qgroup_to_release = block_rsv->qgroup_rsv_reserved -
+   block_rsv->qgroup_rsv_size;
+   block_rsv->qgroup_rsv_reserved = block_rsv->qgroup_rsv_size;
+   } else
+   qgroup_to_release = 0;
spin_unlock(&block_rsv->lock);
 
ret = num_bytes;
@@ -5603,6 +5613,8 @@ static u64 block_rsv_release_bytes(struct btrfs_fs_info 
*fs_info,
space_info_add_old_bytes(fs_info, space_info,
 num_bytes);
}
+   if (qgroup_to_release_ret)
+   *qgroup_to_release_ret = qgroup_to_release;
return ret;
 }
 
@@ -5744,17 +5756,21 @@ int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
struct btrfs_root *root = inode->root;
struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
u64 num_bytes = 0;
+   u64 qgroup_num_bytes = 0;
int ret = -ENOSPC;
 
spin_lock(&block_rsv->lock);
if (block_rsv->reserved < block_rsv->size)
num_bytes = block_rsv->size - block_rsv->reserved;
+   if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size)
+   qgroup_num_bytes = block_rsv->qgroup_rsv_size -
+  block_rsv->qgroup_rsv_reserved;
spin_unlock(&block_rsv->lock);
 
if (num_bytes == 0)
return 0;
 
-   ret = btrfs_qgroup_reserve_meta_prealloc(root, num_bytes, true);
+   ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_num_bytes, true);
if (ret)

[PATCH v2 08/10] Revert "btrfs: qgroups: Retry after commit on getting EDQUOT"

2017-12-21 Thread Qu Wenruo
This reverts commit 48a89bc4f2ceab87bc858a8eb189636b09c846a7.

The idea of committing the transaction to free some space after hitting
the qgroup limit is good, although the problem is that it easily causes
deadlocks.

One deadlock example is caused by trying to flush data while still
holding it:
Call Trace:
 __schedule+0x49d/0x10f0
 schedule+0xc6/0x290
 schedule_timeout+0x187/0x1c0
 wait_for_completion+0x204/0x3a0
 btrfs_wait_ordered_extents+0xa40/0xaf0 [btrfs]
 qgroup_reserve+0x913/0xa10 [btrfs]
 btrfs_qgroup_reserve_data+0x3ef/0x580 [btrfs]
 btrfs_check_data_free_space+0x96/0xd0 [btrfs]
 __btrfs_buffered_write+0x3ac/0xd40 [btrfs]
 btrfs_file_write_iter+0x62a/0xba0 [btrfs]
 __vfs_write+0x320/0x430
 vfs_write+0x107/0x270
 SyS_write+0xbf/0x150
 do_syscall_64+0x1b0/0x3d0
 entry_SYSCALL64_slow_path+0x25/0x25

Another case can be caused by trying to commit one transaction while
nested inside a trans handle held by ourselves:
btrfs_start_transaction()
|- btrfs_qgroup_reserve_meta_pertrans()
   |- qgroup_reserve()
  |- btrfs_join_transaction()
  |- btrfs_commit_transaction()

The retry causes more problems than expected when the limit is
enabled.
At least a graceful EDQUOT is way better than a kernel deadlock.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/qgroup.c | 23 ---
 1 file changed, 23 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index ee5b05dd10a9..6d5b476feb08 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2416,7 +2416,6 @@ static int qgroup_reserve(struct btrfs_root *root, u64 
num_bytes, bool enforce,
struct btrfs_fs_info *fs_info = root->fs_info;
u64 ref_root = root->root_key.objectid;
int ret = 0;
-   int retried = 0;
struct ulist_node *unode;
struct ulist_iterator uiter;
 
@@ -2430,7 +2429,6 @@ static int qgroup_reserve(struct btrfs_root *root, u64 
num_bytes, bool enforce,
capable(CAP_SYS_RESOURCE))
enforce = false;
 
-retry:
spin_lock(&fs_info->qgroup_lock);
quota_root = fs_info->quota_root;
if (!quota_root)
@@ -2457,27 +2455,6 @@ static int qgroup_reserve(struct btrfs_root *root, u64 
num_bytes, bool enforce,
qg = unode_aux_to_qgroup(unode);
 
if (enforce && !qgroup_check_limits(qg, num_bytes)) {
-   /*
-* Commit the tree and retry, since we may have
-* deletions which would free up space.
-*/
-   if (!retried && qgroup_rsv_total(qg) > 0) {
-   struct btrfs_trans_handle *trans;
-
-   spin_unlock(&fs_info->qgroup_lock);
-   ret = btrfs_start_delalloc_inodes(root, 0);
-   if (ret)
-   return ret;
-   btrfs_wait_ordered_extents(root, U64_MAX, 0, 
(u64)-1);
-   trans = btrfs_join_transaction(root);
-   if (IS_ERR(trans))
-   return PTR_ERR(trans);
-   ret = btrfs_commit_transaction(trans);
-   if (ret)
-   return ret;
-   retried++;
-   goto retry;
-   }
ret = -EDQUOT;
goto out;
}
-- 
2.15.1



[PATCH v2 05/10] btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and item

2017-12-21 Thread Qu Wenruo
Quite similar to delalloc, this makes some modifications to delayed-inode
and delayed-item reservation.
The release path also needs an extra parameter to distinguish a normal
release from an error release.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/delayed-inode.c | 56 ++--
 1 file changed, 40 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 5d73f79ded8b..e8d571d2ba06 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -22,6 +22,7 @@
 #include "disk-io.h"
 #include "transaction.h"
 #include "ctree.h"
+#include "qgroup.h"
 
 #define BTRFS_DELAYED_WRITEBACK512
 #define BTRFS_DELAYED_BACKGROUND   128
@@ -528,11 +529,12 @@ static struct btrfs_delayed_item 
*__btrfs_next_delayed_item(
 }
 
 static int btrfs_delayed_item_reserve_metadata(struct btrfs_trans_handle 
*trans,
-  struct btrfs_fs_info *fs_info,
+  struct btrfs_root *root,
   struct btrfs_delayed_item *item)
 {
struct btrfs_block_rsv *src_rsv;
struct btrfs_block_rsv *dst_rsv;
+   struct btrfs_fs_info *fs_info = root->fs_info;
u64 num_bytes;
int ret;
 
@@ -543,6 +545,12 @@ static int btrfs_delayed_item_reserve_metadata(struct 
btrfs_trans_handle *trans,
dst_rsv = &fs_info->delayed_block_rsv;
 
num_bytes = btrfs_calc_trans_metadata_size(fs_info, 1);
+
+   /*
+* Here we migrate space rsv from the transaction rsv, since we have
+* already reserved space when starting the transaction.
+* So no need to reserve qgroup space here.
+*/
ret = btrfs_block_rsv_migrate(src_rsv, dst_rsv, num_bytes, 1);
if (!ret) {
trace_btrfs_space_reservation(fs_info, "delayed_item",
@@ -554,15 +562,20 @@ static int btrfs_delayed_item_reserve_metadata(struct 
btrfs_trans_handle *trans,
return ret;
 }
 
-static void btrfs_delayed_item_release_metadata(struct btrfs_fs_info *fs_info,
+static void btrfs_delayed_item_release_metadata(struct btrfs_root *root,
struct btrfs_delayed_item *item)
 {
struct btrfs_block_rsv *rsv;
+   struct btrfs_fs_info *fs_info = root->fs_info;
 
if (!item->bytes_reserved)
return;
 
rsv = &fs_info->delayed_block_rsv;
+   /*
+* Check btrfs_delayed_item_reserve_metadata() to see why we don't need
+* to release/reserve qgroup space.
+*/
trace_btrfs_space_reservation(fs_info, "delayed_item",
  item->key.objectid, item->bytes_reserved,
  0);
@@ -598,6 +611,10 @@ static int btrfs_delayed_inode_reserve_metadata(
 */
if (!src_rsv || (!trans->bytes_reserved &&
 src_rsv->type != BTRFS_BLOCK_RSV_DELALLOC)) {
+   ret = btrfs_qgroup_reserve_meta_prealloc(root,
+   fs_info->nodesize, true);
+   if (ret < 0)
+   return ret;
ret = btrfs_block_rsv_add(root, dst_rsv, num_bytes,
  BTRFS_RESERVE_NO_FLUSH);
/*
@@ -614,7 +631,9 @@ static int btrfs_delayed_inode_reserve_metadata(
  "delayed_inode",
  btrfs_ino(inode),
  num_bytes, 1);
-   }
+   } else
+   btrfs_qgroup_free_meta_prealloc(root,
+   fs_info->nodesize);
return ret;
}
 
@@ -629,7 +648,8 @@ static int btrfs_delayed_inode_reserve_metadata(
 }
 
 static void btrfs_delayed_inode_release_metadata(struct btrfs_fs_info *fs_info,
-   struct btrfs_delayed_node *node)
+   struct btrfs_delayed_node *node,
+   bool qgroup_free)
 {
struct btrfs_block_rsv *rsv;
 
@@ -641,6 +661,12 @@ static void btrfs_delayed_inode_release_metadata(struct 
btrfs_fs_info *fs_info,
  node->inode_id, node->bytes_reserved, 0);
btrfs_block_rsv_release(fs_info, rsv,
node->bytes_reserved);
+   if (qgroup_free)
+   btrfs_qgroup_free_meta_prealloc(node->root,
+   node->bytes_reserved);
+   else
+   btrfs_qgroup_convert_reserved_meta(node->root,
+   node->bytes_reserved);
node->bytes_reserved = 0;
 }
 
@@ -742,7 +768,7 @@ static int btrfs_batch_insert_items(struct btrfs_root *root,
curr->data_len);

[PATCH v2 00/10] Use split qgroup rsv type

2017-12-21 Thread Qu Wenruo
[Overall]
The previous rework on qgroup reservation system put a lot of effort on
data, which works quite fine.

But it put less focus on metadata reservation, causing problems
like metadata reservation underflow and noisy kernel warnings.

This patchset will try to address the remaining problem of metadata
reservation.

The idea of new qgroup metadata reservation is to use 2 types of
metadata reservation:
1) Per-transaction reservation
   Life span will be inside a transaction. Will be freed at transaction
   commit time.

2) Preallocated reservation
   For case where we reserve space before starting a transaction.
   Operations like delalloc and delayed-inode/item belong to this type.

   This works similarly to block_rsv; its reservation can be
   reserved/released at any time the caller likes.

   The only point to note is that if a preallocated reservation is used
   and finishes without problem, it should be converted to the
   per-transaction type instead of just being freed.
   This is to cooperate with the qgroup update at commit time.

For the preallocated type, this patchset integrates it into the inode_rsv
mechanism reworked by Josef, and into delayed-inode/item reservation.


[Problem: Over-reserve for metadata operation]
With the latest work on using more accurate and less over-estimated
numbers for delalloc, for a 128M limit we can now write about 123M.
(Although still worse than the previous 126M, that's because the previous
implementation could free prealloc reserved space.)

But it can't handle metadata operations, like inode creation, well.

For a test case like btrfs/139, we can only create about 50+ 4M files
before hitting the 1G limit, i.e. roughly 200M of data exhausts 1G of
quota, with the rest consumed by over-reserved metadata.
(Double checked, there is no qgroup rsv leaking.)

So there is still some room to improve the over-reserve behavior.

[Patch structure]
Patches 1~8 are mostly the same, though some of them received small
updates.
Patch 5 received a small EDQUOT handler fix exposed by fstests.
Patch 9~10 are new patches to address the over-reserve behavior.

Changelog:
v2:
  Use independent qgroup rsv numbers instead of reusing the over-estimated
  numbers used by block_rsv, which greatly reduces the early EDQUOT
  problem for delalloc.

  Use transaction_kthread to do an early commit in the hope of freeing up
  some pertrans space.


Qu Wenruo (10):
  btrfs: qgroup: Split meta rsv type into meta_prealloc and
meta_pertrans
  btrfs: qgroup: Don't use root->qgroup_meta_rsv for qgroup
  btrfs: qgroup: Introduce function to convert META_PREALLOC into
META_PERTRANS
  btrfs: qgroup: Use separate meta reservation type for delalloc
  btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and
item
  btrfs: qgroup: Use root->qgroup_meta_rsv_* to record qgroup meta
reserved space
  btrfs: qgroup: Update trace events for metadata reservation
  Revert "btrfs: qgroups: Retry after commit on getting EDQUOT"
  btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT
  btrfs: qgroup: Use independent and accurate per inode qgroup rsv

 fs/btrfs/ctree.h |  33 +-
 fs/btrfs/delayed-inode.c |  56 +++---
 fs/btrfs/disk-io.c   |   2 +-
 fs/btrfs/extent-tree.c   |  98 --
 fs/btrfs/file.c  |  15 +--
 fs/btrfs/free-space-cache.c  |   2 +-
 fs/btrfs/inode-map.c |   4 +-
 fs/btrfs/inode.c |  27 ++---
 fs/btrfs/ioctl.c |  10 +-
 fs/btrfs/ordered-data.c  |   2 +-
 fs/btrfs/qgroup.c| 241 ++-
 fs/btrfs/qgroup.h|  76 +-
 fs/btrfs/relocation.c|   9 +-
 fs/btrfs/transaction.c   |   8 +-
 include/trace/events/btrfs.h |  60 ++-
 15 files changed, 500 insertions(+), 143 deletions(-)

-- 
2.15.1



[PATCH v2 04/10] btrfs: qgroup: Use separate meta reservation type for delalloc

2017-12-21 Thread Qu Wenruo
Before this patch, btrfs qgroup mixed per-transaction meta rsv with
preallocated meta rsv, making it quite easy to underflow the qgroup meta
reservation.

Since we have the new qgroup meta rsv types, apply it to delalloc
reservation.

Now for delalloc, most of its reserved space will use META_PREALLOC qgroup
rsv type.

And for callers reducing outstanding extent like btrfs_finish_ordered_io(),
they will convert corresponding META_PREALLOC reservation to
META_PERTRANS.

This is mainly due to the fact that current qgroup numbers will only be
updated in btrfs_commit_transaction(); that's to say, if we don't keep
such a placeholder reservation, we can exceed the qgroup limit.

And for callers freeing outstanding extent in error handler, we will
just free META_PREALLOC bytes.

This behavior requires callers of btrfs_qgroup_release_meta() or
btrfs_qgroup_convert_meta() to be aware of which type they are dealing
with. So in this patch, btrfs_delalloc_release_metadata() and its callers
get an extra parameter telling qgroup whether to convert or release the
meta reservation.

The good news is that even if we use the wrong type (convert or free), it
won't cause an obvious bug, as the prealloc type is always in good shape,
and the choice only affects whether per-trans meta is increased.

So in the worst case, the metadata limit can sometimes be exceeded (no
convert at all) or be reached too soon (no free at all).
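
A minimal sketch of the release-side decision this introduces (hypothetical
call sites, matching the new signatures below):

    /* Normal completion: keep the rsv as pertrans until commit. */
    btrfs_delalloc_release_metadata(inode, num_bytes, false); /* convert */

    /* Error path: the pending change never happens, just give it back. */
    btrfs_delalloc_release_metadata(inode, num_bytes, true);  /* free */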

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h|  9 ++---
 fs/btrfs/extent-tree.c  | 45 +
 fs/btrfs/file.c | 15 ---
 fs/btrfs/free-space-cache.c |  2 +-
 fs/btrfs/inode-map.c|  4 ++--
 fs/btrfs/inode.c| 27 ++-
 fs/btrfs/ioctl.c| 10 ++
 fs/btrfs/ordered-data.c |  2 +-
 fs/btrfs/relocation.c   |  9 +
 9 files changed, 68 insertions(+), 55 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index bd2bf589e6c8..091c0e06b32e 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2728,7 +2728,8 @@ int btrfs_check_data_free_space(struct inode *inode,
 void btrfs_free_reserved_data_space(struct inode *inode,
struct extent_changeset *reserved, u64 start, u64 len);
 void btrfs_delalloc_release_space(struct inode *inode,
-   struct extent_changeset *reserved, u64 start, u64 len);
+ struct extent_changeset *reserved,
+ u64 start, u64 len, bool qgroup_free);
 void btrfs_free_reserved_data_space_noquota(struct inode *inode, u64 start,
u64 len);
 void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans,
@@ -2743,10 +2744,12 @@ int btrfs_subvolume_reserve_metadata(struct btrfs_root 
*root,
 u64 *qgroup_reserved, bool use_global_rsv);
 void btrfs_subvolume_release_metadata(struct btrfs_fs_info *fs_info,
  struct btrfs_block_rsv *rsv);
-void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes);
+void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes,
+   bool qgroup_free);
 
 int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes);
-void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes);
+void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes,
+bool qgroup_free);
 int btrfs_delalloc_reserve_space(struct inode *inode,
struct extent_changeset **reserved, u64 start, u64 len);
 void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 20923fce06c4..986660f301de 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5754,6 +5754,9 @@ int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
if (num_bytes == 0)
return 0;
 
+   ret = btrfs_qgroup_reserve_meta_prealloc(root, num_bytes, true);
+   if (ret)
+   return ret;
ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
if (!ret) {
block_rsv_add_bytes(block_rsv, num_bytes, 0);
@@ -5766,11 +5769,16 @@ int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
 /**
  * btrfs_inode_rsv_release - release any excessive reservation.
  * @inode - the inode we need to release from.
+ * @qgroup_free - free or convert qgroup meta.
+ *   Unlike normal operation, qgroup meta reservation needs to know if
+ *   we are freeing qgroup reservation or just convert them into per-trans.
+ *   Normally @qgroup_free is true for error handler, and false for normal
+ *   release.
  *
  * This is the same as btrfs_block_rsv_release, except that it handles the
  * tracepoint for the reservation.
  */
-void 

[PATCH v2 02/10] btrfs: qgroup: Don't use root->qgroup_meta_rsv for qgroup

2017-12-21 Thread Qu Wenruo
Since qgroup has separate metadata reservation types now, we can
completely get rid of the old root->qgroup_meta_rsv, which mostly acted
as the current META_PERTRANS reservation type.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h   |  3 ---
 fs/btrfs/disk-io.c |  1 -
 fs/btrfs/qgroup.c  | 34 +-
 3 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 13c260b525a1..bd2bf589e6c8 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1260,9 +1260,6 @@ struct btrfs_root {
int send_in_progress;
struct btrfs_subvolume_writers *subv_writers;
atomic_t will_be_snapshotted;
-
-   /* For qgroup metadata space reserve */
-   atomic64_t qgroup_meta_rsv;
 };
 
 struct btrfs_file_private {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a8ecccfc36de..03ea3b926fac 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1184,7 +1184,6 @@ static void __setup_root(struct btrfs_root *root, struct 
btrfs_fs_info *fs_info,
atomic_set(&root->orphan_inodes, 0);
refcount_set(&root->refs, 1);
atomic_set(&root->will_be_snapshotted, 0);
-   atomic64_set(&root->qgroup_meta_rsv, 0);
root->log_transid = 0;
root->log_transid_committed = -1;
root->last_log_commit = 0;
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 0b85ec21ac1d..ce3d6c95d297 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2509,6 +2509,16 @@ static int qgroup_reserve(struct btrfs_root *root, u64 
num_bytes, bool enforce,
return ret;
 }
 
+/*
+ * Free @num_bytes of reserved space with @type for qgroup.
+ * (normally level 0 qgroup).
+ *
+ * Will handle all higher level qgroups as well.
+ *
+ * NOTE: If @num_bytes is (u64)-1, this means to free all bytes of
+ * this qgroup.
+ * This special case is only used for META_PERTRANS type.
+ */
 void btrfs_qgroup_free_refroot(struct btrfs_fs_info *fs_info,
   u64 ref_root, u64 num_bytes,
   enum btrfs_qgroup_rsv_type type)
@@ -2525,6 +2535,10 @@ void btrfs_qgroup_free_refroot(struct btrfs_fs_info 
*fs_info,
if (num_bytes == 0)
return;
 
+   if (num_bytes == (u64)-1 && type != BTRFS_QGROUP_RSV_META_PERTRANS) {
+   WARN(1, "%s: Invalid type to free", __func__);
+   return;
+   }
spin_lock(&fs_info->qgroup_lock);
 
quota_root = fs_info->quota_root;
@@ -2535,6 +2549,13 @@ void btrfs_qgroup_free_refroot(struct btrfs_fs_info 
*fs_info,
if (!qgroup)
goto out;
 
+   /*
+* We're freeing all pertrans rsv, get current value from level 0
+* qgroup as real num_bytes to free.
+*/
+   if (num_bytes == (u64)-1)
+   num_bytes = qgroup->rsv.values[type];
+
ulist_reinit(fs_info->qgroup_ulist);
ret = ulist_add(fs_info->qgroup_ulist, qgroup->qgroupid,
(uintptr_t)qgroup, GFP_ATOMIC);
@@ -3093,24 +3114,21 @@ int __btrfs_qgroup_reserve_meta(struct btrfs_root 
*root, int num_bytes,
ret = qgroup_reserve(root, num_bytes, enforce, type);
if (ret < 0)
return ret;
-   atomic64_add(num_bytes, &root->qgroup_meta_rsv);
return ret;
 }
 
 void btrfs_qgroup_free_meta_all_pertrans(struct btrfs_root *root)
 {
struct btrfs_fs_info *fs_info = root->fs_info;
-   u64 reserved;
 
if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) ||
!is_fstree(root->objectid))
return;
 
-   reserved = atomic64_xchg(&root->qgroup_meta_rsv, 0);
-   if (reserved == 0)
-   return;
-   trace_qgroup_meta_reserve(root, -(s64)reserved);
-   btrfs_qgroup_free_refroot(fs_info, root->objectid, reserved,
+   /* TODO: Update trace point to handle such free */
+   trace_qgroup_meta_reserve(root, 0);
+   /* Special value -1 means to free all reserved space */
+   btrfs_qgroup_free_refroot(fs_info, root->objectid, (u64)-1,
  BTRFS_QGROUP_RSV_META_PERTRANS);
 }
 
@@ -3124,8 +3142,6 @@ void __btrfs_qgroup_free_meta(struct btrfs_root *root, 
int num_bytes,
return;
 
BUG_ON(num_bytes != round_down(num_bytes, fs_info->nodesize));
-   WARN_ON(atomic64_read(&root->qgroup_meta_rsv) < num_bytes);
-   atomic64_sub(num_bytes, &root->qgroup_meta_rsv);
trace_qgroup_meta_reserve(root, -(s64)num_bytes);
btrfs_qgroup_free_refroot(fs_info, root->objectid, num_bytes, type);
 }
-- 
2.15.1



[PATCH v2 09/10] btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT

2017-12-21 Thread Qu Wenruo
Unlike the previous method of trying to commit the transaction inside
qgroup_reserve(), this time we will try to commit the transaction using
fs_info->transaction_kthread, to avoid nested transactions and any need
to worry about lock context.

Since it's an asynchronous function call and we won't wait for the
transaction commit, unlike the previous method, we must call it before we
hit the qgroup limit.

So this patch will use the ratio and size of the qgroup meta_pertrans
reservation as indicators to check if we should trigger a transaction
commit.
(meta_prealloc won't be cleaned up at transaction commit, so committing
would be useless for it anyway.)
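
As a worked example of the threshold below (hypothetical numbers): with
max_rfer = 1G, 200M of data rsv and 40M of meta_prealloc rsv, the
threshold is min((1G - 240M) / 32, 16M) = min(24.5M, 16M) = 16M, so once
the meta_pertrans rsv grows past 16M the transaction kthread is woken to
commit early.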

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/qgroup.c | 43 +--
 1 file changed, 41 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 6d5b476feb08..02eed4250815 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "ctree.h"
 #include "transaction.h"
@@ -2395,8 +2396,21 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle 
*trans,
return ret;
 }
 
-static bool qgroup_check_limits(const struct btrfs_qgroup *qg, u64 num_bytes)
+/*
+ * Two limits to commit transaction in advance.
+ *
+ * For RATIO, it will be 1/RATIO of the remaining limit
+ * (excluding data and prealloc meta) as threshold.
+ * For SIZE, it will be in byte unit as threshold.
+ */
+#define QGROUP_PERTRANS_RATIO  32
+#define QGROUP_PERTRANS_SIZE   SZ_16M
+static bool qgroup_check_limits(struct btrfs_fs_info *fs_info,
+   const struct btrfs_qgroup *qg, u64 num_bytes)
 {
+   u64 limit;
+   u64 threshold;
+
if ((qg->lim_flags & BTRFS_QGROUP_LIMIT_MAX_RFER) &&
qgroup_rsv_total(qg) + (s64)qg->rfer + num_bytes > qg->max_rfer)
return false;
@@ -2405,6 +2419,31 @@ static bool qgroup_check_limits(const struct 
btrfs_qgroup *qg, u64 num_bytes)
qgroup_rsv_total(qg) + (s64)qg->excl + num_bytes > qg->max_excl)
return false;
 
+   /*
+* Even if we passed the check, it's better to check if the
+* meta_pertrans reservation is pushing us near the limit.
+* If there is too much pertrans reservation or it's near the limit,
+* let's try to commit the transaction to free some, using the
+* transaction_kthread.
+*/
+   if ((qg->lim_flags & (BTRFS_QGROUP_LIMIT_MAX_RFER |
+ BTRFS_QGROUP_LIMIT_MAX_EXCL))) {
+   if (qg->lim_flags & BTRFS_QGROUP_LIMIT_MAX_EXCL)
+   limit = qg->max_excl;
+   else
+   limit = qg->max_rfer;
+   threshold = (limit - qg->rsv.values[BTRFS_QGROUP_RSV_DATA] -
+   qg->rsv.values[BTRFS_QGROUP_RSV_META_PREALLOC]) /
+   QGROUP_PERTRANS_RATIO;
+   threshold = min_t(u64, threshold, QGROUP_PERTRANS_SIZE);
+
+   /*
+* Use transaction_kthread to commit transaction, so we no
+* longer need to bother nested transaction nor lock context.
+*/
+   if (qg->rsv.values[BTRFS_QGROUP_RSV_META_PERTRANS] > threshold)
+   wake_up_process(fs_info->transaction_kthread);
+   }
+
return true;
 }
 
@@ -2454,7 +2493,7 @@ static int qgroup_reserve(struct btrfs_root *root, u64 
num_bytes, bool enforce,
 
qg = unode_aux_to_qgroup(unode);
 
-   if (enforce && !qgroup_check_limits(qg, num_bytes)) {
+   if (enforce && !qgroup_check_limits(fs_info, qg, num_bytes)) {
ret = -EDQUOT;
goto out;
}
-- 
2.15.1



[PATCH v2 06/10] btrfs: qgroup: Use root->qgroup_meta_rsv_* to record qgroup meta reserved space

2017-12-21 Thread Qu Wenruo
For the quota disabled->enabled case, it's possible that at reservation
time quota was not enabled, so no bytes were really reserved, while at
release time quota is enabled, so we will try to release bytes we never
really owned.

Such a situation can cause metadata reservation underflow for both types,
though it is less likely for the per-trans type, since enabling quota
will commit the transaction.

To address this, record qgroup meta reserved bytes into
root->qgroup_meta_rsv_pertrans/prealloc.
So at release time we won't free any bytes we didn't reserve.

For DATA, this is already handled by the io_tree, so nothing needs to be
done.
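
A minimal timeline sketch of the underflow this prevents (hypothetical
sizes, following sub_root_meta_rsv() below):

    /* quota disabled */
    btrfs_qgroup_reserve_meta_prealloc(root, SZ_16K, true);
        /* no-op: qgroup_meta_rsv_prealloc stays 0 */
    /* quota gets enabled (enabling commits the running transaction) */
    btrfs_qgroup_free_meta_prealloc(root, SZ_16K);
        /* sub_root_meta_rsv() clamps 16K to the recorded 0, so nothing
         * is freed from the qgroup and no underflow occurs */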

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h   |  5 +
 fs/btrfs/disk-io.c |  1 +
 fs/btrfs/qgroup.c  | 66 ++
 3 files changed, 68 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 091c0e06b32e..0c58f92c2d44 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1260,6 +1260,11 @@ struct btrfs_root {
int send_in_progress;
struct btrfs_subvolume_writers *subv_writers;
atomic_t will_be_snapshotted;
+
+   /* For qgroup metadata reserved space */
+   spinlock_t qgroup_meta_rsv_lock;
+   u64 qgroup_meta_rsv_pertrans;
+   u64 qgroup_meta_rsv_prealloc;
 };
 
 struct btrfs_file_private {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 03ea3b926fac..d8eaadac4330 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1168,6 +1168,7 @@ static void __setup_root(struct btrfs_root *root, struct 
btrfs_fs_info *fs_info,
spin_lock_init(&root->accounting_lock);
spin_lock_init(&root->log_extents_lock[0]);
spin_lock_init(&root->log_extents_lock[1]);
+   spin_lock_init(&root->qgroup_meta_rsv_lock);
mutex_init(&root->objectid_mutex);
mutex_init(&root->log_mutex);
mutex_init(&root->ordered_extent_mutex);
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 24fc6e46f717..96ed678b3588 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2549,11 +2549,11 @@ void btrfs_qgroup_free_refroot(struct btrfs_fs_info 
*fs_info,
if (!qgroup)
goto out;
 
-   /*
-* We're freeing all pertrans rsv, get current value from level 0
-* qgroup as real num_bytes to free.
-*/
if (num_bytes == (u64)-1)
+   /*
+* We're freeing all pertrans rsv, get reserved value from
+* level 0 qgroup as real num_bytes to free.
+*/
num_bytes = qgroup->rsv.values[type];
 
ulist_reinit(fs_info->qgroup_ulist);
@@ -3099,6 +3099,46 @@ int btrfs_qgroup_release_data(struct inode *inode, u64 
start, u64 len)
return __btrfs_qgroup_release_data(inode, NULL, start, len, 0);
 }
 
+static void add_root_meta_rsv(struct btrfs_root *root, int num_bytes,
+ enum btrfs_qgroup_rsv_type type)
+{
+   if (type != BTRFS_QGROUP_RSV_META_PREALLOC &&
+   type != BTRFS_QGROUP_RSV_META_PERTRANS)
+   return;
+   if (num_bytes == 0)
+   return;
+
+   spin_lock(&root->qgroup_meta_rsv_lock);
+   if (type == BTRFS_QGROUP_RSV_META_PREALLOC)
+   root->qgroup_meta_rsv_prealloc += num_bytes;
+   else
+   root->qgroup_meta_rsv_pertrans += num_bytes;
+   spin_unlock(&root->qgroup_meta_rsv_lock);
+}
+
+static int sub_root_meta_rsv(struct btrfs_root *root, int num_bytes,
+enum btrfs_qgroup_rsv_type type)
+{
+   if (type != BTRFS_QGROUP_RSV_META_PREALLOC &&
+   type != BTRFS_QGROUP_RSV_META_PERTRANS)
+   return 0;
+   if (num_bytes == 0)
+   return 0;
+
+   spin_lock(&root->qgroup_meta_rsv_lock);
+   if (type == BTRFS_QGROUP_RSV_META_PREALLOC) {
+   num_bytes = min_t(u64, root->qgroup_meta_rsv_prealloc,
+ num_bytes);
+   root->qgroup_meta_rsv_prealloc -= num_bytes;
+   } else {
+   num_bytes = min_t(u64, root->qgroup_meta_rsv_pertrans,
+ num_bytes);
+   root->qgroup_meta_rsv_pertrans -= num_bytes;
+   }
+   spin_unlock(&root->qgroup_meta_rsv_lock);
+   return num_bytes;
+}
+
 int __btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
enum btrfs_qgroup_rsv_type type, bool enforce)
 {
@@ -3114,6 +3154,15 @@ int __btrfs_qgroup_reserve_meta(struct btrfs_root *root, 
int num_bytes,
ret = qgroup_reserve(root, num_bytes, enforce, type);
if (ret < 0)
return ret;
+   /*
+* Record what we have reserved into root, to avoid a quota
+* disabled->enabled underflow.
+*
+* In that case, we may try to free space we haven't reserved
+* (since quota was disabled), so record what we reserved into root
+* and ensure a later release won't underflow this number.
+*/
+

[PATCH v2 03/10] btrfs: qgroup: Introduce function to convert META_PREALLOC into META_PERTRANS

2017-12-21 Thread Qu Wenruo
For a meta_prealloc reservation user, after btrfs_join_transaction() the
caller will modify the tree, so part (or even all) of the meta_prealloc
reservation should be converted to meta_pertrans and kept until
transaction commit time.

This patch introduces a new function,
btrfs_qgroup_convert_reserved_meta(), to do this for META_PREALLOC
reservation users.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/qgroup.c | 56 +++
 fs/btrfs/qgroup.h |  8 
 2 files changed, 64 insertions(+)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index ce3d6c95d297..24fc6e46f717 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -3146,6 +3146,62 @@ void __btrfs_qgroup_free_meta(struct btrfs_root *root, 
int num_bytes,
btrfs_qgroup_free_refroot(fs_info, root->objectid, num_bytes, type);
 }
 
+static void qgroup_convert_meta(struct btrfs_fs_info *fs_info, u64 ref_root,
+   int num_bytes)
+{
+   struct btrfs_root *quota_root = fs_info->quota_root;
+   struct btrfs_qgroup *qgroup;
+   struct ulist_node *unode;
+   struct ulist_iterator uiter;
+   int ret = 0;
+
+   if (num_bytes == 0)
+   return;
+   if (!quota_root)
+   return;
+
+   spin_lock(&fs_info->qgroup_lock);
+   qgroup = find_qgroup_rb(fs_info, ref_root);
+   if (!qgroup)
+   goto out;
+   ulist_reinit(fs_info->qgroup_ulist);
+   ret = ulist_add(fs_info->qgroup_ulist, qgroup->qgroupid,
+  (uintptr_t)qgroup, GFP_ATOMIC);
+   if (ret < 0)
+   goto out;
+   ULIST_ITER_INIT(&uiter);
+   while ((unode = ulist_next(fs_info->qgroup_ulist, &uiter))) {
+   struct btrfs_qgroup *qg;
+   struct btrfs_qgroup_list *glist;
+
+   qg = unode_aux_to_qgroup(unode);
+
+   qgroup_rsv_release(fs_info, qg, num_bytes,
+   BTRFS_QGROUP_RSV_META_PREALLOC);
+   qgroup_rsv_add(fs_info, qg, num_bytes,
+   BTRFS_QGROUP_RSV_META_PERTRANS);
+   list_for_each_entry(glist, &qg->groups, next_group) {
+   ret = ulist_add(fs_info->qgroup_ulist,
+   glist->group->qgroupid,
+   (uintptr_t)glist->group, GFP_ATOMIC);
+   if (ret < 0)
+   goto out;
+   }
+   }
+out:
+   spin_unlock(&fs_info->qgroup_lock);
+}
+
+void btrfs_qgroup_convert_reserved_meta(struct btrfs_root *root, int num_bytes)
+{
+   struct btrfs_fs_info *fs_info = root->fs_info;
+
+   if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) ||
+   !is_fstree(root->objectid))
+   return;
+   qgroup_convert_meta(fs_info, root->objectid, num_bytes);
+}
+
 /*
  * Check qgroup reserved space leaking, normally at destroy inode
  * time
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index b47740e2e017..4814d680c50f 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -332,5 +332,13 @@ static inline void btrfs_qgroup_free_meta_prealloc(struct 
btrfs_root *root,
  */
 void btrfs_qgroup_free_meta_all_pertrans(struct btrfs_root *root);
 
+/*
+ * Convert @num_bytes of META_PREALLOC reservation to META_PERTRANS.
+ *
+ * This is called when preallocated meta reservation needs to be used.
+ * Normally after btrfs_join_transaction() call.
+ */
+void btrfs_qgroup_convert_reserved_meta(struct btrfs_root *root, int num_bytes);
+
 void btrfs_qgroup_check_reserved_leak(struct inode *inode);
 #endif /* __BTRFS_QGROUP__ */
-- 
2.15.1



Re: [PATCH 5/7] blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq

2017-12-21 Thread jianchao.wang
Sorry for the lack of detail in my description.

On 12/21/2017 09:50 PM, Tejun Heo wrote:
> Hello,
> 
> On Thu, Dec 21, 2017 at 11:56:49AM +0800, jianchao.wang wrote:
>> It's worrying that even though the blk_mark_rq_complete() here is intended
>> to synchronize with the timeout path, it indeed gives
>> blk_mq_complete_request() the capability to exclude with
There could be a scenario where the driver stops a request itself with
blk_mq_complete_request(), or some other interface that invokes it, racing
with the normal completion path of the same request.
For example:
a reset could be triggered through sysfs on nvme-rdma
Then the driver will cancel all the reqs, including in-flight ones.
nvme_rdma_reset_ctrl_work()
nvme_rdma_shutdown_ctrl()

if (ctrl->ctrl.queue_count > 1) {
nvme_stop_queues(&ctrl->ctrl); // quiesce the queue
blk_mq_tagset_busy_iter(&ctrl->tag_set,
nvme_cancel_request, &ctrl->ctrl); // invokes blk_mq_complete_request()
nvme_rdma_destroy_io_queues(ctrl, shutdown);
}


These operations could race with the normal completion path of the
in-flight ones. It should drain all the in-flight ones first here. But
there may be some other places similar to this.
 
>> itself. Maybe this capability should be reserved.
> 
> Can you explain further how that'd help?  The problem there is that if
> you have two competing completions, where one isn't a timeout, there's
> nothing synchronizing the reuse of the request.  IOW, the losing on
> can easily complete the next recycle instance.  The atomic bitops
> might feel like more protection but it's just feels.

In the above case, the request may simultaneously enter the requeue and end paths.

Thanks
Jianchao


[PATCH] btrfs: qgroup: remove unused label 'retry'

2017-12-21 Thread Colin King
From: Colin Ian King 

Label 'retry' is not used, remove it. Cleans up a clang build
warning:

warning: label ‘retry’ defined but not used [-Wunused-label]

Fixes: b283738ab0ad ("Revert "btrfs: qgroups: Retry after commit on getting 
EDQUOT"")
Signed-off-by: Colin Ian King 
---
 fs/btrfs/qgroup.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index de3b96f1005b..2b89848e767d 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2429,7 +2429,6 @@ static int qgroup_reserve(struct btrfs_root *root, u64 
num_bytes, bool enforce,
capable(CAP_SYS_RESOURCE))
enforce = false;
 
-retry:
spin_lock(_info->qgroup_lock);
quota_root = fs_info->quota_root;
if (!quota_root)
-- 
2.14.1



Re: [PATCH] btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes

2017-12-21 Thread Liu Bo
On Sat, Dec 16, 2017 at 08:42:51AM +0200, Nikolay Borisov wrote:
> 
> 
> On 15.12.2017 21:58, Chris Mason wrote:
> > refcounts have a generic implementation and an asm optimized one.  The
> > generic version has extra debugging to make sure that once a refcount
> > goes to zero, refcount_inc won't increase it.
>   
> 
> I guess you meant to say refcount_add

refcount_inc may also just throw a warning without bumping the refcnt.

Thanks,

-liubo
> > 
> > The btrfs delayed inode code wasn't expecting this, and we're tripping
> > over the warnings when the generic refcounts are used.  We ended up with
> > this race:
> > 
> > Process A Process B
> >   btrfs_get_delayed_node()
> >   spin_lock(root->inode_lock)
> >   radix_tree_lookup()
> > __btrfs_release_delayed_node()
> > refcount_dec_and_test(_node->refs)
> > our refcount is now zero
> >   refcount_add(2) <---
> >   warning here, refcount
> >   unchanged
> > 
> > spin_lock(root->inode_lock)
> > radix_tree_delete()
> > 
> > With the generic refcounts, we actually warn again when process B above
> > tries to release his refcount because refcount_add() turned into a
> > no-op.
> > 
> > We saw this in production on older kernels without the asm optimized
> > refcounts.
> > 
> > The fix used here is to use refcount_inc_not_zero() to detect when the
> > object is in the middle of being freed and return NULL.  This is almost
> > always the right answer anyway, since we usually end up pitching the
> > delayed_node if it didn't have fresh data in it.
> > 
> > This also changes __btrfs_release_delayed_node() to remove the extra
> > check for zero refcounts before radix tree deletion.
> > btrfs_get_delayed_node() was the only path that was allowing refcounts
> > to go from zero to one.
> > 
> > Signed-off-by: Chris Mason 
> > Fixes: 6de5f18e7b0da
> > cc:  #4.12+
> > ---
> >  fs/btrfs/delayed-inode.c | 45 ++---
> >  1 file changed, 34 insertions(+), 11 deletions(-)
> > 
> > diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
> > index 5d73f79..84c54af 100644
> > --- a/fs/btrfs/delayed-inode.c
> > +++ b/fs/btrfs/delayed-inode.c
> > @@ -87,6 +87,7 @@ static struct btrfs_delayed_node *btrfs_get_delayed_node(
> >  
> > spin_lock(>inode_lock);
> > node = radix_tree_lookup(>delayed_nodes_tree, ino);
> > +
> > if (node) {
> > if (btrfs_inode->delayed_node) {
> > refcount_inc(>refs);  /* can be accessed */
> > @@ -94,9 +95,30 @@ static struct btrfs_delayed_node *btrfs_get_delayed_node(
> > spin_unlock(>inode_lock);
> > return node;
> > }
> > -   btrfs_inode->delayed_node = node;
> > -   /* can be accessed and cached in the inode */
> > -   refcount_add(2, >refs);
> > +
> > +   /* it's possible that we're racing into the middle of
> > +* removing this node from the radix tree.  In this case,
> > +* the refcount was zero and it should never go back
> > +* to one.  Just return NULL like it was never in the radix
> > +* at all; our release function is in the process of removing
> > +* it.
> > +*
> > +* Some implementations of refcount_inc refuse to
> > +* bump the refcount once it has hit zero.  If we don't do
> > +* this dance here, refcount_inc() may decide to
> > +* just WARN_ONCE() instead of actually bumping the refcount.
> > +*
> > +* If this node is properly in the radix, we want to
> > +* bump the refcount twice, once for the inode
> > +* and once for this get operation.
> > +*/
> > +   if (refcount_inc_not_zero(>refs)) {
> > +   refcount_inc(>refs);
> > +   btrfs_inode->delayed_node = node;
> > +   } else {
> > +   node = NULL;
> > +   }
> > +
> > spin_unlock(>inode_lock);
> > return node;
> > }
> > @@ -254,17 +276,18 @@ static void __btrfs_release_delayed_node(
> > mutex_unlock(_node->mutex);
> >  
> > if (refcount_dec_and_test(_node->refs)) {
> > -   bool free = false;
> > struct btrfs_root *root = delayed_node->root;
> > +
> > spin_lock(>inode_lock);
> > -   if (refcount_read(_node->refs) == 0) {
> > -   radix_tree_delete(>delayed_nodes_tree,
> > - delayed_node->inode_id);
> > -   free = true;
> > -   }
> > +   /*
> > 
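
For reference, the inc-not-zero dance described above can be modeled in
stand-alone C11; a sketch only, since the kernel's refcount_t additionally
saturates and warns:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static bool refcount_inc_not_zero(atomic_int *r)
{
        int old = atomic_load(r);

        /* Never resurrect an object whose count already hit zero. */
        while (old != 0) {
                if (atomic_compare_exchange_weak(r, &old, old + 1))
                        return true;
                /* 'old' was reloaded by the failed CAS; retry. */
        }
        return false;
}

int main(void)
{
        atomic_int refs = 1;

        if (refcount_inc_not_zero(&refs))       /* 1 -> 2, succeeds */
                printf("got a reference, refs=%d\n", atomic_load(&refs));

        atomic_store(&refs, 0);                 /* object being freed */
        if (!refcount_inc_not_zero(&refs))      /* stays 0, fails */
                printf("object is dying, return NULL instead\n");
        return 0;
}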

[PATCH 02/10] Btrfs: move extent map specific code to extent_map.c

2017-12-21 Thread Liu Bo
These helpers are extent map specific; this moves them to extent_map.c.

Signed-off-by: Liu Bo 
---
 fs/btrfs/ctree.h  |   2 -
 fs/btrfs/extent_map.c | 118 ++
 fs/btrfs/extent_map.h |   2 +
 fs/btrfs/inode.c  | 117 -
 4 files changed, 120 insertions(+), 119 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 328f40f..b2e09fe 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3148,8 +3148,6 @@ struct btrfs_delalloc_work 
*btrfs_alloc_delalloc_work(struct inode *inode,
int delay_iput);
 void btrfs_wait_and_free_delalloc_work(struct btrfs_delalloc_work *work);
 
-int btrfs_add_extent_mapping(struct extent_map_tree *em_tree,
-struct extent_map **em_in, u64 start, u64 len);
 struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode,
struct page *page, size_t pg_offset, u64 start,
u64 len, int create);
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 2e348fb..51665de 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -454,3 +454,121 @@ void replace_extent_mapping(struct extent_map_tree *tree,
 
setup_extent_mapping(tree, new, modified);
 }
+
+static struct extent_map *next_extent_map(struct extent_map *em)
+{
+   struct rb_node *next;
+
+   next = rb_next(>rb_node);
+   if (!next)
+   return NULL;
+   return container_of(next, struct extent_map, rb_node);
+}
+
+static struct extent_map *prev_extent_map(struct extent_map *em)
+{
+   struct rb_node *prev;
+
+   prev = rb_prev(>rb_node);
+   if (!prev)
+   return NULL;
+   return container_of(prev, struct extent_map, rb_node);
+}
+
+/*
+ * Given an existing extent in the tree, the existing extent is the nearest
+ * extent to map_start, and an extent that you want to insert, deal with 
overlap
+ * and insert the best fitted new extent into the tree.
+ */
+static int merge_extent_mapping(struct extent_map_tree *em_tree,
+   struct extent_map *existing,
+   struct extent_map *em,
+   u64 map_start)
+{
+   struct extent_map *prev;
+   struct extent_map *next;
+   u64 start;
+   u64 end;
+   u64 start_diff;
+
+   BUG_ON(map_start < em->start || map_start >= extent_map_end(em));
+
+   if (existing->start > map_start) {
+   next = existing;
+   prev = prev_extent_map(next);
+   } else {
+   prev = existing;
+   next = next_extent_map(prev);
+   }
+
+   start = prev ? extent_map_end(prev) : em->start;
+   start = max_t(u64, start, em->start);
+   end = next ? next->start : extent_map_end(em);
+   end = min_t(u64, end, extent_map_end(em));
+   start_diff = start - em->start;
+   em->start = start;
+   em->len = end - start;
+   if (em->block_start < EXTENT_MAP_LAST_BYTE &&
+   !test_bit(EXTENT_FLAG_COMPRESSED, >flags)) {
+   em->block_start += start_diff;
+   em->block_len -= start_diff;
+   }
+   return add_extent_mapping(em_tree, em, 0);
+}
+
+/* This handle the EEXIST case of add_extent_mapping. */
+int btrfs_add_extent_mapping(struct extent_map_tree *em_tree,
+struct extent_map **em_in, u64 start, u64 len)
+{
+   int ret;
+   struct extent_map *em = *em_in;
+
+   ret = add_extent_mapping(em_tree, em, 0);
+   /*
+* It is possible that someone inserted the extent into the tree while
+* we had the lock dropped.  It is also possible that an overlapping map
+* exists in the tree
+*/
+   if (ret == -EEXIST) {
+   struct extent_map *existing;
+
+   ret = 0;
+
+   existing = search_extent_mapping(em_tree, start, len);
+
+   /*
+* existing will always be non-NULL, since there must be
+* extent causing the -EEXIST.
+*/
+   if (existing->start == em->start &&
+   extent_map_end(existing) >= extent_map_end(em) &&
+   em->block_start == existing->block_start) {
+   /*
+* The existing extent map already encompasses the
+* entire extent map we tried to add.
+*/
+   free_extent_map(em);
+   *em_in = existing;
+   ret = 0;
+   } else if (start >= extent_map_end(existing) ||
+   start <= existing->start) {
+   /*
+* The existing extent map is the one nearest to
+* the [start, start + len) range which overlaps
+   
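
For reference, the clamping arithmetic of merge_extent_mapping() can be
followed in a stand-alone sketch with illustrative numbers; the helper name
mirrors the kernel's extent_map_end() but this is not the kernel code:

#include <stdio.h>

struct em {
        unsigned long long start;
        unsigned long long len;
};

static unsigned long long em_end(const struct em *e)
{
        return e->start + e->len;
}

int main(void)
{
        struct em prev = { 0, 8192 };           /* [0K, 8K)   */
        struct em next = { 16384, 8192 };       /* [16K, 24K) */
        struct em em   = { 0, 32768 };          /* candidate [0K, 32K) */
        unsigned long long start, end, start_diff;

        /* start is clamped to the end of prev, end to the start of next */
        start = em_end(&prev) > em.start ? em_end(&prev) : em.start;
        end = next.start < em_end(&em) ? next.start : em_end(&em);

        start_diff = start - em.start;
        em.start = start;
        em.len = end - start;
        printf("merged em: [%llu, %llu), trimmed %llu bytes at the front\n",
               em.start, em_end(&em), start_diff);
        return 0;
}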

[PATCH 09/10] Btrfs: add tracepoint for em's EEXIST case

2017-12-21 Thread Liu Bo
This is adding a tracepoint 'btrfs_handle_em_exist' to help debug the
subtle bugs around merge_extent_mapping.

Signed-off-by: Liu Bo 
---
 fs/btrfs/extent_map.c|  1 +
 include/trace/events/btrfs.h | 35 +++
 2 files changed, 36 insertions(+)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index a8b7e24..40e4d30 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -539,6 +539,7 @@ int btrfs_add_extent_mapping(struct extent_map_tree 
*em_tree,
ret = 0;
 
existing = search_extent_mapping(em_tree, start, len);
+   trace_btrfs_handle_em_exist(existing, em, start, len);
 
/*
 * existing will always be non-NULL, since there must be
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 4342a32..b7ffcf7 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -249,6 +249,41 @@ TRACE_EVENT_CONDITION(btrfs_get_extent,
  __entry->refs, __entry->compress_type)
 );
 
+TRACE_EVENT(btrfs_handle_em_exist,
+
+   TP_PROTO(const struct extent_map *existing, const struct extent_map 
*map, u64 start, u64 len),
+
+   TP_ARGS(existing, map, start, len),
+
+   TP_STRUCT__entry(
+   __field(u64,  e_start   )
+   __field(u64,  e_len )
+   __field(u64,  map_start )
+   __field(u64,  map_len   )
+   __field(u64,  start )
+   __field(u64,  len   )
+   ),
+
+   TP_fast_assign(
+   __entry->e_start= existing->start;
+   __entry->e_len  = existing->len;
+   __entry->map_start  = map->start;
+   __entry->map_len= map->len;
+   __entry->start  = start;
+   __entry->len= len;
+   ),
+
+   TP_printk("start=%llu len=%llu "
+ "existing(start=%llu len=%llu) "
+ "em(start=%llu len=%llu)",
+ (unsigned long long)__entry->start,
+ (unsigned long long)__entry->len,
+ (unsigned long long)__entry->e_start,
+ (unsigned long long)__entry->e_len,
+ (unsigned long long)__entry->map_start,
+ (unsigned long long)__entry->map_len)
+);
+
 /* file extent item */
 DECLARE_EVENT_CLASS(btrfs__file_extent_item_regular,
 
-- 
2.9.4
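
Assuming tracefs is mounted in the usual place, the new event can then be
enabled like any other btrfs tracepoint, e.g.:

echo 1 > /sys/kernel/debug/tracing/events/btrfs/btrfs_handle_em_exist/enable
cat /sys/kernel/debug/tracing/trace_pipe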



[PATCH 10/10] Btrfs: noinline merge_extent_mapping

2017-12-21 Thread Liu Bo
In order to debug subtle bugs around merge_extent_mapping(), perf probe
can be used to check the arguments, but sometimes merge_extent_mapping()
gets inlined by the compiler and can't be probed.

This adds the noinline attribute to merge_extent_mapping().

Signed-off-by: Liu Bo 
---
 fs/btrfs/extent_map.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 40e4d30..f10a6fc 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -480,10 +480,10 @@ static struct extent_map *prev_extent_map(struct 
extent_map *em)
  * extent to map_start, and an extent that you want to insert, deal with 
overlap
  * and insert the best fitted new extent into the tree.
  */
-static int merge_extent_mapping(struct extent_map_tree *em_tree,
-   struct extent_map *existing,
-   struct extent_map *em,
-   u64 map_start, u64 map_len)
+static noinline int merge_extent_mapping(struct extent_map_tree *em_tree,
+struct extent_map *existing,
+struct extent_map *em,
+u64 map_start, u64 map_len)
 {
struct extent_map *prev;
struct extent_map *next;
-- 
2.9.4
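
For reference, the kind of probing this enables might look like the following
(a sketch only -- exact syntax depends on the perf version, and resolving the
arguments by name needs kernel debuginfo):

perf probe -a 'merge_extent_mapping map_start map_len'
perf record -e probe:merge_extent_mapping -aR sleep 10
perf script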



[PATCH 04/10] Btrfs: extent map selftest: buffered write vs dio read

2017-12-21 Thread Liu Bo
This test case simulates the racy situation of buffered write vs dio
read, and checks whether btrfs_get_extent() would return -EEXIST.

Signed-off-by: Liu Bo 
---
 fs/btrfs/tests/extent-map-tests.c | 73 +++
 1 file changed, 73 insertions(+)

diff --git a/fs/btrfs/tests/extent-map-tests.c 
b/fs/btrfs/tests/extent-map-tests.c
index 0407396..2adf55f 100644
--- a/fs/btrfs/tests/extent-map-tests.c
+++ b/fs/btrfs/tests/extent-map-tests.c
@@ -181,6 +181,78 @@ static void test_case_2(struct extent_map_tree *em_tree)
free_extent_map_tree(em_tree);
 }
 
+static void __test_case_3(struct extent_map_tree *em_tree, u64 start)
+{
+   struct extent_map *em;
+   u64 len = SZ_4K;
+   int ret;
+
+   em = alloc_extent_map();
+   if (!em)
+   /* Skip this test on error. */
+   return;
+
+   /* Add [4K, 8K) */
+   em->start = SZ_4K;
+   em->len = SZ_4K;
+   em->block_start = SZ_4K;
+   em->block_len = SZ_4K;
+   ret = add_extent_mapping(em_tree, em, 0);
+   ASSERT(ret == 0);
+   free_extent_map(em);
+
+   em = alloc_extent_map();
+   if (!em)
+   goto out;
+
+   /* Add [0, 16K) */
+   em->start = 0;
+   em->len = SZ_16K;
+   em->block_start = 0;
+   em->block_len = SZ_16K;
+   ret = btrfs_add_extent_mapping(em_tree, , start, len);
+   if (ret)
+   test_msg("case3 [0x%llx 0x%llx): ret %d\n",
+start, start + len, ret);
+   /*
+* Since bytes within em are contiguous, em->block_start is identical to
+* em->start.
+*/
+   if (em &&
+   (start < em->start || start + len > extent_map_end(em) ||
+em->start != em->block_start || em->len != em->block_len))
+   test_msg("case3 [0x%llx 0x%llx): ret %d em (start 0x%llx len 
0x%llx block_start 0x%llx block_len 0x%llx)\n",
+start, start + len, ret, em->start, em->len,
+em->block_start, em->block_len);
+   free_extent_map(em);
+out:
+   /* free memory */
+   free_extent_map_tree(em_tree);
+}
+
+/*
+ * Test scenario:
+ *
+ * Suppose that no extent map has been loaded into memory yet.
+ * There is a file extent [0, 16K), two jobs are running concurrently
+ * against it, t1 is buffered writing to [4K, 8K) and t2 is doing dio
+ * read from [0, 4K) or [8K, 12K) or [12K, 16K).
+ *
+ * t1 goes ahead of t2 and adds em [4K, 8K) into tree.
+ *
+ * t1   t2
+ *  cow_file_range()btrfs_get_extent()
+ *-> lookup_extent_mapping()
+ *   -> add_extent_mapping()
+ *-> add_extent_mapping()
+ */
+static void test_case_3(struct extent_map_tree *em_tree)
+{
+   __test_case_3(em_tree, 0);
+   __test_case_3(em_tree, SZ_8K);
+   __test_case_3(em_tree, (12 * 1024ULL));
+}
+
 int btrfs_test_extent_map()
 {
struct extent_map_tree *em_tree;
@@ -196,6 +268,7 @@ int btrfs_test_extent_map()
 
test_case_1(em_tree);
test_case_2(em_tree);
+   test_case_3(em_tree);
 
kfree(em_tree);
return 0;
-- 
2.9.4



[PATCH 08/10] Btrfs: add WARN_ONCE to detect unexpected error from merge_extent_mapping

2017-12-21 Thread Liu Bo
This is a subtle case, so in order to understand the problem, it'd be good to
know the contents of 'existing' and 'em' when any error occurs.

Signed-off-by: Liu Bo 
---
 fs/btrfs/extent_map.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index d386cfb..a8b7e24 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -550,17 +550,23 @@ int btrfs_add_extent_mapping(struct extent_map_tree 
*em_tree,
*em_in = existing;
ret = 0;
} else {
+   u64 orig_start = em->start;
+   u64 orig_len = em->len;
+
/*
 * The existing extent map is the one nearest to
 * the [start, start + len) range which overlaps
 */
ret = merge_extent_mapping(em_tree, existing,
   em, start, len);
-   free_extent_map(existing);
if (ret) {
free_extent_map(em);
*em_in = NULL;
+   WARN_ONCE(ret, KERN_INFO "Unexpected error %d: 
merge existing(start %llu len %llu) with em(start %llu len %llu)\n",
+ ret, existing->start, existing->len,
+ orig_start, orig_len);
}
+   free_extent_map(existing);
}
}
 
-- 
2.9.4



[PATCH 03/10] Btrfs: add extent map selftests

2017-12-21 Thread Liu Bo
We've observed that btrfs_get_extent() and merge_extent_mapping() could
return -EEXIST in several cases, and these are caused by race
conditions, e.g. dio read vs dio write, which makes the problems very tricky
to reproduce.

This adds extent map selftests in order to simulate those racy situations.

Signed-off-by: Liu Bo 
---
 fs/btrfs/Makefile |   2 +-
 fs/btrfs/tests/btrfs-tests.c  |   3 +
 fs/btrfs/tests/btrfs-tests.h  |   1 +
 fs/btrfs/tests/extent-map-tests.c | 202 ++
 4 files changed, 207 insertions(+), 1 deletion(-)
 create mode 100644 fs/btrfs/tests/extent-map-tests.c

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 6fe881d..0c43736 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -19,4 +19,4 @@ btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
 btrfs-$(CONFIG_BTRFS_FS_RUN_SANITY_TESTS) += tests/free-space-tests.o \
tests/extent-buffer-tests.o tests/btrfs-tests.o \
tests/extent-io-tests.o tests/inode-tests.o tests/qgroup-tests.o \
-   tests/free-space-tree-tests.o
+   tests/free-space-tree-tests.o tests/extent-map-tests.o
diff --git a/fs/btrfs/tests/btrfs-tests.c b/fs/btrfs/tests/btrfs-tests.c
index d3f2537..9786d8c 100644
--- a/fs/btrfs/tests/btrfs-tests.c
+++ b/fs/btrfs/tests/btrfs-tests.c
@@ -277,6 +277,9 @@ int btrfs_run_sanity_tests(void)
goto out;
}
}
+   ret = btrfs_test_extent_map();
+   if (ret)
+   goto out;
 out:
btrfs_destroy_test_fs();
return ret;
diff --git a/fs/btrfs/tests/btrfs-tests.h b/fs/btrfs/tests/btrfs-tests.h
index 266f1e3..bc0615b 100644
--- a/fs/btrfs/tests/btrfs-tests.h
+++ b/fs/btrfs/tests/btrfs-tests.h
@@ -33,6 +33,7 @@ int btrfs_test_extent_io(u32 sectorsize, u32 nodesize);
 int btrfs_test_inodes(u32 sectorsize, u32 nodesize);
 int btrfs_test_qgroups(u32 sectorsize, u32 nodesize);
 int btrfs_test_free_space_tree(u32 sectorsize, u32 nodesize);
+int btrfs_test_extent_map(void);
 struct inode *btrfs_new_test_inode(void);
 struct btrfs_fs_info *btrfs_alloc_dummy_fs_info(u32 nodesize, u32 sectorsize);
 void btrfs_free_dummy_fs_info(struct btrfs_fs_info *fs_info);
diff --git a/fs/btrfs/tests/extent-map-tests.c 
b/fs/btrfs/tests/extent-map-tests.c
new file mode 100644
index 000..0407396
--- /dev/null
+++ b/fs/btrfs/tests/extent-map-tests.c
@@ -0,0 +1,202 @@
+/*
+ * Copyright (C) 2017 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include 
+#include "btrfs-tests.h"
+#include "../ctree.h"
+
+static void free_extent_map_tree(struct extent_map_tree *em_tree)
+{
+   struct extent_map *em;
+   struct rb_node *node;
+
+   while (!RB_EMPTY_ROOT(_tree->map)) {
+   node = rb_first(_tree->map);
+   em = rb_entry(node, struct extent_map, rb_node);
+   remove_extent_mapping(em_tree, em);
+
+#ifdef CONFIG_BTRFS_DEBUG
+   if (refcount_read(>refs) != 1) {
+   test_msg(
+"Oops we have a em leak: em (start 0x%llx len 
0x%llx block_start 0x%llx block_len 0x%llx) refs %d\n",
+em->start, em->len, em->block_start,
+em->block_len, refcount_read(>refs));
+
+   refcount_set(>refs, 1);
+   }
+#endif
+   free_extent_map(em);
+   }
+}
+
+/*
+ * Test scenario:
+ *
+ * Suppose that no extent map has been loaded into memory yet, there is a file
+ * extent [0, 16K), followed by another file extent [16K, 20K), two dio reads
+ * are entering btrfs_get_extent() concurrently, t1 is reading [8K, 16K), t2 is
+ * reading [0, 8K)
+ *
+ * t1t2
+ *  btrfs_get_extent()  btrfs_get_extent()
+ *-> lookup_extent_mapping()  ->lookup_extent_mapping()
+ *-> add_extent_mapping(0, 16K)
+ *-> return em
+ *->add_extent_mapping(0, 16K)
+ *-> #handle -EEXIST
+ */
+static void test_case_1(struct extent_map_tree *em_tree)
+{
+   struct extent_map *em;
+   u64 start = 0;
+   u64 len = SZ_8K;
+   int ret;
+
+   em = alloc_extent_map();
+   if (!em)
+   /* Skip the test on error. */
+  

[PATCH 06/10] Btrfs: fix incorrect block_len in merge_extent_mapping

2017-12-21 Thread Liu Bo
em->block_len may be checked when deciding whether two extent maps are
mergeable.

merge_extent_mapping() only added the front pad when the front part
of em gets truncated, but it's possible that the end part gets
truncated as well.

For both compressed extents and inline extents, em->block_len is not
adjusted at all, while for regular extents em->block_len always
equals em->len; hence this sets em->block_len to em->len.

Signed-off-by: Liu Bo 
---
 fs/btrfs/extent_map.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 51665de..6653b08 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -511,7 +511,7 @@ static int merge_extent_mapping(struct extent_map_tree 
*em_tree,
if (em->block_start < EXTENT_MAP_LAST_BYTE &&
!test_bit(EXTENT_FLAG_COMPRESSED, >flags)) {
em->block_start += start_diff;
-   em->block_len -= start_diff;
+   em->block_len = em->len;
}
return add_extent_mapping(em_tree, em, 0);
 }
-- 
2.9.4
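
A worked example with illustrative numbers: suppose a regular (uncompressed)
em of [0K, 32K) with block_len 32K gets trimmed by merge_extent_mapping() to
fit the hole [8K, 16K), so start_diff = 8K and the trimmed em->len is 8K. Then:

old code: em->block_len -= start_diff  ->  block_len = 24K  (3x the mapped range)
new code: em->block_len  = em->len     ->  block_len =  8K  (matches the mapped range)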



[PATCH 01/10] Btrfs: add helper for em merge logic

2017-12-21 Thread Liu Bo
This is preparatory work for the following extent map selftests, which
run tests against the em merge logic.

Signed-off-by: Liu Bo 
---
 fs/btrfs/ctree.h |   2 ++
 fs/btrfs/inode.c | 101 ++-
 2 files changed, 58 insertions(+), 45 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b2e09fe..328f40f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3148,6 +3148,8 @@ struct btrfs_delalloc_work 
*btrfs_alloc_delalloc_work(struct inode *inode,
int delay_iput);
 void btrfs_wait_and_free_delalloc_work(struct btrfs_delalloc_work *work);
 
+int btrfs_add_extent_mapping(struct extent_map_tree *em_tree,
+struct extent_map **em_in, u64 start, u64 len);
 struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode,
struct page *page, size_t pg_offset, u64 start,
u64 len, int create);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e1a7f3c..527df6f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6911,6 +6911,61 @@ static noinline int uncompress_inline(struct btrfs_path 
*path,
return ret;
 }
 
+int btrfs_add_extent_mapping(struct extent_map_tree *em_tree,
+struct extent_map **em_in, u64 start, u64 len)
+{
+   int ret;
+   struct extent_map *em = *em_in;
+
+   ret = add_extent_mapping(em_tree, em, 0);
+   /* it is possible that someone inserted the extent into the tree
+* while we had the lock dropped.  It is also possible that
+* an overlapping map exists in the tree
+*/
+   if (ret == -EEXIST) {
+   struct extent_map *existing;
+
+   ret = 0;
+
+   existing = search_extent_mapping(em_tree, start, len);
+
+   /*
+* existing will always be non-NULL, since there must be
+* extent causing the -EEXIST.
+*/
+   if (existing->start == em->start &&
+   extent_map_end(existing) >= extent_map_end(em) &&
+   em->block_start == existing->block_start) {
+   /*
+* The existing extent map already encompasses the
+* entire extent map we tried to add.
+*/
+   free_extent_map(em);
+   *em_in = existing;
+   ret = 0;
+   } else if (start >= extent_map_end(existing) ||
+   start <= existing->start) {
+   /*
+* The existing extent map is the one nearest to
+* the [start, start + len) range which overlaps
+*/
+   ret = merge_extent_mapping(em_tree, existing,
+  em, start);
+   free_extent_map(existing);
+   if (ret) {
+   free_extent_map(em);
+   *em_in = NULL;
+   }
+   } else {
+   free_extent_map(em);
+   *em_in = existing;
+   ret = 0;
+   }
+   }
+   ASSERT(ret == 0 || ret == -EEXIST);
+   return ret;
+}
+
 /*
  * a bit scary, this does extent mapping from logical file offset to the disk.
  * the ugly parts come from merging extents from the disk with the in-ram
@@ -7147,51 +7202,7 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode 
*inode,
 
err = 0;
write_lock(_tree->lock);
-   ret = add_extent_mapping(em_tree, em, 0);
-   /* it is possible that someone inserted the extent into the tree
-* while we had the lock dropped.  It is also possible that
-* an overlapping map exists in the tree
-*/
-   if (ret == -EEXIST) {
-   struct extent_map *existing;
-
-   ret = 0;
-
-   existing = search_extent_mapping(em_tree, start, len);
-   /*
-* existing will always be non-NULL, since there must be
-* extent causing the -EEXIST.
-*/
-   if (existing->start == em->start &&
-   extent_map_end(existing) >= extent_map_end(em) &&
-   em->block_start == existing->block_start) {
-   /*
-* The existing extent map already encompasses the
-* entire extent map we tried to add.
-*/
-   free_extent_map(em);
-   em = existing;
-   err = 0;
-
-   } else if (start >= extent_map_end(existing) ||
-   start <= existing->start) {
-   /*
-* The existing extent map is the 

[PATCH 00/10] bugfixes and regression tests of btrfs_get_extent

2017-12-21 Thread Liu Bo
Although
commit e6c4efd87ab0 ("btrfs: Fix and enhance merge_extent_mapping() to insert 
best fitted extent map")
fixed up the negative em->len, it introduced several regressions, several of
which have been fixed by

commit 32be3a1ac6d0 ("btrfs: Fix the wrong condition judgment about subset 
extent map"),
commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map insertion 
in btrfs_get_extent") and
commit 8e2bd3b7fac9 ("Btrfs: deal with existing encompassing extent map in 
btrfs_get_extent()").

Unfortunately, there is one more regression, which was caught recently in
our test farm; more details are explained in patch 7.

While debugging the above issue, I found that all of these bugs are caused
by racy situations which can be very tricky to reproduce, so I made
several extent-map-specific test cases in btrfs's selftest framework.

Patches 1-2 are preparatory work.
Patches 3-5 are regression tests for handling EEXIST from adding an extent map.
Patches 6-7 fix two bugs which can be reproduced by the above test cases.
Patches 8-10 are debugging aids, so that we can easily tell what happened.

Liu Bo (10):
  Btrfs: add helper for em merge logic
  Btrfs: move extent map specific code to extent_map.c
  Btrfs: add extent map selftests
  Btrfs: extent map selftest: buffered write vs dio read
  Btrfs: extent map selftest: dio write vs dio read
  Btrfs: fix incorrect block_len in merge_extent_mapping
  Btrfs: fix unexpected EEXIST from btrfs_get_extent
  Btrfs: add WARN_ONCE to detect unexpected error from
merge_extent_mapping
  Btrfs: add tracepoint for em's EEXIST case
  Btrfs: noinline merge_extent_mapping

 fs/btrfs/Makefile |   2 +-
 fs/btrfs/extent_map.c | 120 +
 fs/btrfs/extent_map.h |   2 +
 fs/btrfs/inode.c  | 108 +---
 fs/btrfs/tests/btrfs-tests.c  |   3 +
 fs/btrfs/tests/btrfs-tests.h  |   1 +
 fs/btrfs/tests/extent-map-tests.c | 363 ++
 include/trace/events/btrfs.h  |  35 
 8 files changed, 526 insertions(+), 108 deletions(-)
 create mode 100644 fs/btrfs/tests/extent-map-tests.c

-- 
2.9.4



[PATCH 05/10] Btrfs: extent map selftest: dio write vs dio read

2017-12-21 Thread Liu Bo
This test case simulates the racy situation of dio write vs dio read,
and checks whether btrfs_get_extent() would return -EEXIST.

Signed-off-by: Liu Bo 
---
 fs/btrfs/tests/extent-map-tests.c | 88 +++
 1 file changed, 88 insertions(+)

diff --git a/fs/btrfs/tests/extent-map-tests.c 
b/fs/btrfs/tests/extent-map-tests.c
index 2adf55f..66d5523 100644
--- a/fs/btrfs/tests/extent-map-tests.c
+++ b/fs/btrfs/tests/extent-map-tests.c
@@ -253,6 +253,93 @@ static void test_case_3(struct extent_map_tree *em_tree)
__test_case_3(em_tree, (12 * 1024ULL));
 }
 
+static void __test_case_4(struct extent_map_tree *em_tree, u64 start)
+{
+   struct extent_map *em;
+   u64 len = SZ_4K;
+   int ret;
+
+   em = alloc_extent_map();
+   if (!em)
+   /* Skip this test on error. */
+   return;
+
+   /* Add [0K, 8K) */
+   em->start = 0;
+   em->len = SZ_8K;
+   em->block_start = 0;
+   em->block_len = SZ_8K;
+   ret = add_extent_mapping(em_tree, em, 0);
+   ASSERT(ret == 0);
+   free_extent_map(em);
+
+   em = alloc_extent_map();
+   if (!em)
+   goto out;
+
+   /* Add [8K, 24K) */
+   em->start = SZ_8K;
+   em->len = 24 * 1024ULL;
+   em->block_start = SZ_16K; /* avoid merging */
+   em->block_len = 24 * 1024ULL;
+   ret = add_extent_mapping(em_tree, em, 0);
+   ASSERT(ret == 0);
+   free_extent_map(em);
+
+   em = alloc_extent_map();
+   if (!em)
+   goto out;
+   /* Add [0K, 32K) */
+   em->start = 0;
+   em->len = SZ_32K;
+   em->block_start = 0;
+   em->block_len = SZ_32K;
+   ret = btrfs_add_extent_mapping(em_tree, , start, len);
+   if (ret)
+   test_msg("case4 [0x%llx 0x%llx): ret %d\n",
+start, len, ret);
+   if (em &&
+   (start < em->start || start + len > extent_map_end(em)))
+   test_msg("case4 [0x%llx 0x%llx): ret %d, added wrong em (start 
0x%llx len 0x%llx block_start 0x%llx block_len 0x%llx)\n",
+start, len, ret, em->start, em->len, em->block_start,
+em->block_len);
+   free_extent_map(em);
+out:
+   /* free memory */
+   free_extent_map_tree(em_tree);
+}
+
+/*
+ * Test scenario:
+ *
+ * Suppose that no extent map has been loaded into memory yet.
+ * There is a file extent [0, 32K), two jobs are running concurrently
+ * against it, t1 is doing dio write to [8K, 32K) and t2 is doing dio
+ * read from [0, 4K) or [4K, 8K).
+ *
+ * t1 goes ahead of t2 and splits em [0, 32K) to em [0K, 8K) and [8K 32K).
+ *
+ * t1t2
+ *  btrfs_get_blocks_direct() btrfs_get_blocks_direct()
+ *   -> btrfs_get_extent()  -> btrfs_get_extent()
+ *   -> lookup_extent_mapping()
+ *   -> add_extent_mapping()-> lookup_extent_mapping()
+ *  # load [0, 32K)
+ *   -> btrfs_new_extent_direct()
+ *   -> btrfs_drop_extent_cache()
+ *  # split [0, 32K)
+ *   -> add_extent_mapping()
+ *  # add [8K, 32K)
+ *  -> add_extent_mapping()
+ * # handle -EEXIST when adding
+ * # [0, 32K)
+ */
+static void test_case_4(struct extent_map_tree *em_tree)
+{
+   __test_case_4(em_tree, 0);
+   __test_case_4(em_tree, SZ_4K);
+}
+
 int btrfs_test_extent_map()
 {
struct extent_map_tree *em_tree;
@@ -269,6 +356,7 @@ int btrfs_test_extent_map()
test_case_1(em_tree);
test_case_2(em_tree);
test_case_3(em_tree);
+   test_case_4(em_tree);
 
kfree(em_tree);
return 0;
-- 
2.9.4



[PATCH 07/10] Btrfs: fix unexpected EEXIST from btrfs_get_extent

2017-12-21 Thread Liu Bo
This fixes a corner case that is caused by a race of dio write vs dio
read/write.

dio write:
[0, 32k) -> [0, 8k) + [8k, 32k)

dio read/write:

While doing get_extent() for [0, 4k), [0, 8k) is found as the existing em.
Even though start == existing->start, em is [0, 32k) and
extent_map_end(em) > extent_map_end(existing), so it goes through
merge_extent_mapping(), which tries to add an em of [8k, 8k) (with length 0),
and get_extent() ends up returning -EEXIST; the dio read/write then gets
-EEXIST, which is confusing to applications.

Here I have enumerated all the possible situations:
1) start < existing->start

                +------------em------------+
    +--prev--+  |     +---existing---+     |
    |        |  |     |              |     |
    +--------+  +-----+--------------+-----+
                   ^
                   start

2) start == existing->start

    +------------em------------+
    |     +---existing---+     |
    |     |              |     |
    +-----+--------------+-----+
          ^
          start

3) start > existing->start && start < (existing->start + existing->len)

    +------------em------------+
    |     +---existing---+     |
    |     |              |     |
    +-----+--------------+-----+
              ^
              start

4) start >= (existing->start + existing->len)

    +------------em------------+
    |     +---existing---+     |  +--next--+
    |     |              |     |  |        |
    +-----+--------------+-----+  +--------+
                           ^
                           start

After going through the above cases one by one, it turns out that if start
is within the existing em (front inclusive), then the existing em should be
returned; otherwise, we try our best to merge the candidate em with sibling
ems to form a larger em.

Reported-by: David Vallender 
Signed-off-by: Liu Bo 
---
 fs/btrfs/extent_map.c | 25 ++---
 1 file changed, 10 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 6653b08..d386cfb 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -483,7 +483,7 @@ static struct extent_map *prev_extent_map(struct extent_map 
*em)
 static int merge_extent_mapping(struct extent_map_tree *em_tree,
struct extent_map *existing,
struct extent_map *em,
-   u64 map_start)
+   u64 map_start, u64 map_len)
 {
struct extent_map *prev;
struct extent_map *next;
@@ -496,9 +496,13 @@ static int merge_extent_mapping(struct extent_map_tree 
*em_tree,
if (existing->start > map_start) {
next = existing;
prev = prev_extent_map(next);
+   if (prev)
+   ASSERT(extent_map_end(prev) <= map_start);
} else {
prev = existing;
next = next_extent_map(prev);
+   if (next)
+   ASSERT(map_start + map_len <= next->start);
}
 
start = prev ? extent_map_end(prev) : em->start;
@@ -540,35 +544,26 @@ int btrfs_add_extent_mapping(struct extent_map_tree 
*em_tree,
 * existing will always be non-NULL, since there must be
 * extent causing the -EEXIST.
 */
-   if (existing->start == em->start &&
-   extent_map_end(existing) >= extent_map_end(em) &&
-   em->block_start == existing->block_start) {
-   /*
-* The existing extent map already encompasses the
-* entire extent map we tried to add.
-*/
+   if (start >= existing->start &&
+   start < extent_map_end(existing)) {
free_extent_map(em);
*em_in = existing;
ret = 0;
-   } else if (start >= extent_map_end(existing) ||
-   start <= existing->start) {
+   } else {
/*
 * The existing extent map is the one nearest to
 * the [start, start + len) range which overlaps
 */
ret = merge_extent_mapping(em_tree, existing,
-  em, start);
+  em, start, len);
free_extent_map(existing);
if (ret) {
free_extent_map(em);
*em_in = NULL;
}
-   } else {
-   free_extent_map(em);
-   *em_in = existing;
-   ret = 0;
}
}
+
 

Btrfs blocked by too many delayed refs

2017-12-21 Thread Martin Raiber
Hi,

I have a problem where too many delayed refs block a btrfs filesystem. I
have one thread that is doing work:

[] io_schedule+0x16/0x40
[] wait_on_page_bit+0x116/0x150
[] read_extent_buffer_pages+0x1c5/0x290
[] btree_read_extent_buffer_pages+0x9d/0x100
[] read_tree_block+0x32/0x50
[] read_block_for_search.isra.30+0x120/0x2e0
[] btrfs_search_slot+0x385/0x990
[] btrfs_insert_empty_items+0x71/0xc0
[] insert_extent_data_ref.isra.49+0x11b/0x2a0
[] __btrfs_inc_extent_ref.isra.59+0x1ee/0x220
[] __btrfs_run_delayed_refs+0x924/0x12c0
[] btrfs_run_delayed_refs+0x7a/0x260
[] create_pending_snapshot+0x5e4/0xf00
[] create_pending_snapshots+0x97/0xc0
[] btrfs_commit_transaction+0x395/0x930
[] btrfs_mksubvol+0x4a6/0x4f0
[] btrfs_ioctl_snap_create_transid+0x185/0x190
[] btrfs_ioctl_snap_create_v2+0x104/0x150
[] btrfs_ioctl+0x5e1/0x23b0
[] do_vfs_ioctl+0x92/0x5a0
[] SyS_ioctl+0x79/0x9

the others are in 'D' state e.g. with

[] call_rwsem_down_write_failed+0x17/0x30
[] filename_create+0x6b/0x150
[] SyS_mkdir+0x44/0xe0

Slabtop shows 2423910 btrfs_delayed_ref_head structs, slowly decreasing.

What I think is happening is that delayed refs are added without being
throttled via btrfs_should_throttle_delayed_refs(). Maybe by
creating a snapshot of a file and then modifying it (some action that
creates delayed refs, is not a truncate, which is already throttled, and
does not commit a transaction, which is also throttled).

Regards,
Martin Raiber



Re: Btrfs allow compression on NoDataCow files? (AFAIK Not, but it does)

2017-12-21 Thread Chris Mason

On 12/20/2017 03:59 PM, Timofey Titovets wrote:

How to reproduce:
touch test_file
chattr +C test_file
dd if=/dev/zero of=test_file bs=1M count=1
btrfs fi def -vrczlib test_file
filefrag -v test_file

test_file
Filesystem type is: 9123683e
File size of test_file is 1048576 (256 blocks of 4096 bytes)
ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31:   72917050..  72917081: 32: encoded
   1:   32..  63:   72917118..  72917149: 32:   72917082: encoded
   2:   64..  95:   72919494..  72919525: 32:   72917150: encoded
   3:   96.. 127:   72927576..  72927607: 32:   72919526: encoded
   4:  128.. 159:   72943261..  72943292: 32:   72927608: encoded
   5:  160.. 191:   72944929..  72944960: 32:   72943293: encoded
   6:  192.. 223:   72944952..  72944983: 32:   72944961: encoded
   7:  224.. 255:   72967084..  72967115: 32:   72944984:
last,encoded,eof
test_file: 8 extents found

I can't find where this error happens in the code at the moment,
but it's reproducible on Linux 4.14.8


We'll silently cow in a few cases, this is one.

-chris


Re: [PATCH 02/14] btrfs: qgroup: Introduce helpers to update and access new qgroup rsv

2017-12-21 Thread Nikolay Borisov


On 12.12.2017 09:34, Qu Wenruo wrote:
> Introduce helpers to:
> 
> 1) Get total reserved space
>For limit calculation
> 2) Add/release reserved space for given type
>With underflow detection and warning
> 3) Add/release reserved space according to child qgroup
> 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/qgroup.c | 68 
> +++
>  1 file changed, 68 insertions(+)
> 
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index f6fe28ca0e86..14346ff9a162 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -47,6 +47,74 @@
>   *  - check all ioctl parameters
>   */
>  
> +/*
> + * Helpers to access qgroup reservation
> + *
> + * Callers should ensure the lock context and type are valid
> + */
> +
> +static u64 qgroup_rsv_total(const struct btrfs_qgroup *qgroup)
> +{
> + u64 ret = 0;
> + int i;
> +
> + for (i = 0; i < BTRFS_QGROUP_RSV_LAST; i++)
> + ret += qgroup->rsv.values[i];
> +
> + return ret;
> +}
> +
> +#ifdef CONFIG_BTRFS_DEBUG
> +static const char *qgroup_rsv_type_str(enum btrfs_qgroup_rsv_type type)
> +{
> + if (type == BTRFS_QGROUP_RSV_DATA)
> + return "data";
> + if (type == BTRFS_QGROUP_RSV_META)
> + return "meta";
> + return NULL;
> +}
> +#endif
> +
> +static void qgroup_rsv_add(struct btrfs_qgroup *qgroup, u64 num_bytes,
> +enum btrfs_qgroup_rsv_type type)
> +{
> + qgroup->rsv.values[type] += num_bytes;
> +}
> +
> +static void qgroup_rsv_release(struct btrfs_qgroup *qgroup, u64 num_bytes,
> +enum btrfs_qgroup_rsv_type type)
> +{
> + if (qgroup->rsv.values[type] >= num_bytes) {
> + qgroup->rsv.values[type] -= num_bytes;
> + return;
> + }

nit: Generally I'd prefer that the if test for the exceptional
condition and handle it inside, i.e. we expect underflow not to occur
most of the time. Not a big deal, so it doesn't warrant a resend, but
something to think about in the future.

> +#ifdef CONFIG_BTRFS_DEBUG
> + WARN_RATELIMIT(1,
> + "qgroup %llu %s reserved space underflow, have %llu to free 
> %llu",
> + qgroup->qgroupid, qgroup_rsv_type_str(type),
> + qgroup->rsv.values[type], num_bytes);
> +#endif
> + qgroup->rsv.values[type] = 0;
> +}
> +
> +static void qgroup_rsv_add_by_qgroup(struct btrfs_qgroup *dest,
> +   struct btrfs_qgroup *src)
> +{
> + int i;
> +
> + for (i = 0; i < BTRFS_QGROUP_RSV_LAST; i++)
> + qgroup_rsv_add(dest, src->rsv.values[i], i);
> +}
> +
> +static void qgroup_rsv_release_by_qgroup(struct btrfs_qgroup *dest,
> +   struct btrfs_qgroup *src)
> +{
> + int i;
> +
> + for (i = 0; i < BTRFS_QGROUP_RSV_LAST; i++)
> + qgroup_rsv_release(dest, src->rsv.values[i], i);
> +}
> +
>  static void btrfs_qgroup_update_old_refcnt(struct btrfs_qgroup *qg, u64 seq,
>  int mod)
>  {
> 


Re: [PATCH 5/7] blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq

2017-12-21 Thread Tejun Heo
Hello,

On Thu, Dec 21, 2017 at 11:56:49AM +0800, jianchao.wang wrote:
> It's worrying that even though the blk_mark_rq_complete() here is intended to 
> synchronize with
> timeout path, but it indeed give the blk_mq_complete_request() the capability 
> to exclude with 
> itself. Maybe this capability should be reserved.

Can you explain further how that'd help?  The problem there is that if
you have two competing completions, where one isn't a timeout, there's
nothing synchronizing the reuse of the request.  IOW, the losing one
can easily complete the next recycle instance.  The atomic bitops
might feel like more protection, but that's just a feeling.

Thanks.

-- 
tejun


Re: Distress Call Please don't ignore

2017-12-21 Thread Sandra Younes
Good Day,

Forgive my intrusion if this message comes to you as a surprise and offends
you, as I am contacting you without your prior consent and writing through
this channel.

I came across your name and contact in the course of my personal search for a
reliable foreign partner. I was assured of your capability and reliability
after going through your profile.

I'm (Miss. Sandra) from Benghazi, Libya. My father of blessed memory, by name
late General Abdel Fattah Younes, was shot dead by an Islamist-linked militia
within the anti-Gaddafi forces on 28th July, 2011, and two days later my
mother and my two brothers were killed one early morning by the rebels as a
result of the civil war that is going on in my country, Libya. After the
burial of my parents, my uncles conspired and sold my father's properties and
left nothing for me. On a fateful morning, I opened my father's briefcase and
discovered a document showing that he had deposited ($6.250M USD) in a
Turkish bank which has a small branch in Canada, with my name as the
legitimate next of kin. Meanwhile, I have located the bank and have also
discussed the possibility of transferring the fund. My father left a clause
with the bank that I must introduce a trusted foreign partner who would be my
trustee to help me invest this fund; hence the need for your assistance. I
request that you be my trustee and assist me in e

You will also be responsible for the investment and management of the fund
for me, and you will also help me get into a good school where I will further
my education.
I agree to give you 40% of the $6.250M once the transfer is done. This is my
true life story. I will be glad to receive your response soonest with more
details, to enable us to start and complete the transfer in less than 14
banking days, as I was informed by the bank manager.

Thanks for giving me your attention,

Yours sincerely,
Miss. Sandra Younes


Re: Unexpected raid1 behaviour

2017-12-21 Thread Austin S. Hemmelgarn

On 2017-12-21 06:44, Andrei Borzenkov wrote:

On Tue, Dec 19, 2017 at 11:47 PM, Austin S. Hemmelgarn
 wrote:

On 2017-12-19 15:41, Tomasz Pala wrote:


On Tue, Dec 19, 2017 at 12:35:20 -0700, Chris Murphy wrote:


with a read only file system. Another reason is the kernel code and
udev rule for device "readiness" means the volume is not "ready" until
all member devices are present. And while the volume is not "ready"
systemd will not even attempt to mount. Solving this requires kernel
and udev work, or possibly a helper, to wait an appropriate amount of



Sth like this? I got such problem a few months ago, my solution was
accepted upstream:

https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f

Rationale is in referred ticket, udev would not support any more btrfs
logic, so unless btrfs handles this itself on kernel level (daemon?),
that is all that can be done.


Or maybe systemd can quit trying to treat BTRFS like a volume manager (which
it isn't) and just try to mount the requested filesystem with the requested
options?


You can't mount filesystem until sufficient number of devices are
present and not waiting (at least attempting to wait) for them opens
you to races on startup. So far systemd position was - it is up to
filesystem to give it something to wait on. And while apparently
everyone agrees that current "btrfs device ready" does not fit the
bill, this is the only thing we have.
No, it isn't.  You can just make the damn mount call with the supplied 
options.  If it succeeds, the volume was ready; if it fails, it wasn't. 
It's that simple, and there's absolutely no reason that systemd can't 
just do that in a loop until it succeeds or a timeout is reached.  That 
isn't any more racy than waiting on them is (waiting on them to be ready 
and then mounting them is a TOCTOU race condition), and it doesn't have 
any of these issues with the volume being completely unusable in a 
degraded state.
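
For illustration, a minimal sketch of that retry-and-timeout mount in C; the
device, mount point, and one-second poll interval are placeholders, and a real
init system would need far more careful errno handling:

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/mount.h>

static int mount_with_timeout(const char *dev, const char *dir,
                              const char *opts, int timeout_secs)
{
        int waited = 0;

        for (;;) {
                if (mount(dev, dir, "btrfs", 0, opts) == 0)
                        return 0;       /* the volume was ready */
                if (waited++ >= timeout_secs) {
                        fprintf(stderr, "mount %s: %s\n", dev,
                                strerror(errno));
                        return -1;
                }
                sleep(1);               /* not ready yet; try again */
        }
}

int main(void)
{
        /* Paths are placeholders. */
        return mount_with_timeout("/dev/sdb1", "/mnt", "", 30);
}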


Also, it's not 'up to the filesystem', it's 'up to the underlying 
device'.  LUKS, LVM, MD, and everything else that's an actual device 
layer is what systemd waits on.  XFS, ext4, and any other filesystem 
except BTRFS (and possibly ZFS, but I'm not 100% sure about that) 
provides absolutely _NOTHING_ to wait on.  Systemd just chose to handle 
BTRFS like a device layer, and not a filesystem, so we have this crap to 
deal with, as well as the fact that it makes it impossible to manually 
mount a BTRFS volume with missing or failed devices in degraded mode 
under systemd (because it unmounts it damn near instantly because it 
somehow thinks it knows better than the user what the user wants to do).


This integration issue was so far silently ignored both by btrfs and
systemd developers. 
It's been ignored by BTRFS devs because there is _nothing_ wrong on this 
side other than the naming choice for the ioctl.  Systemd is _THE ONLY_ 
init system which has this issue, every other one works just fine.


As far as the systemd side, I have no idea why they are ignoring it, 
though I suspect it's the usual spoiled brat mentality that seems to be 
present about everything that people complain about regarding systemd.



Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

2017-12-21 Thread Knut Omang
Joe Perches  writes:

> On Tue, 2017-12-12 at 08:43 +1100, Dave Chinner wrote:
>> On Sat, Dec 09, 2017 at 09:00:18AM -0800, Joe Perches wrote:
>> > On Sat, 2017-12-09 at 09:36 +1100, Dave Chinner wrote:
>> > >  1. Using lockdep_set_novalidate_class() for anything other
>> > >  than device->mutex will throw checkpatch warnings. Nice. (*)
>> > []
>> > > (*) checkpatch.pl is considered mostly harmful round here, too,
>> > > but that's another rant
>> > 
>> > How so?
>> 
>> Short story is that it barfs all over the slightly non-standard
>> coding style used in XFS.
> []
>> This sort of stuff is just lowest-common-denominator noise - great
>> for new code and/or inexperienced developers, but not for working
>> with large bodies of existing code with slightly non-standard
>> conventions.
>
> Completely reasonable.  Thanks.
>
> Do you get many checkpatch submitters for fs/xfs?
>
> If so, could probably do something about adding
> a checkpatch file flag to the directory or equivalent.
>
> Maybe add something like:
>
> fs/xfs/.checkpatch
>
> where the contents turn off most everything

I propose a more fine grained and configurable form of this in

   https://lkml.org/lkml/2017/12/16/343

that also handles sparse and other checkers in a similar way.
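
For illustration, such a per-directory file might simply carry extra
checkpatch options; hypothetical contents, since the mechanism itself is only
a proposal here:

# fs/xfs/.checkpatch (hypothetical)
--ignore LONG_LINE,SPACING,CODE_INDENT
--max-line-length=84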

Thanks,
Knut



Re: Unexpected raid1 behaviour

2017-12-21 Thread Andrei Borzenkov
On Wed, Dec 20, 2017 at 11:07 PM, Chris Murphy  wrote:
>
> YaST doesn't have Btrfs raid1 or raid10 options; and also won't do
> encrypted root with Btrfs either because YaST enforces LVM to do LUKS
> encryption for some weird reason; and it also enforces NOT putting
> Btrfs on LVM.
>

That's incorrect, btrfs on LVM is default on some SLES flavors and one
of the three standard proposals (where you do not need to go in expert
mode) - normal partitions, LVM, encrypted LVM - even on openSUSE.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 19/19] fs: handle inode->i_version more efficiently

2017-12-21 Thread Jan Kara
On Thu 21-12-17 06:25:55, Jeff Layton wrote:
> Got it, I think. How about this (sorry for the unrelated deltas here):
> 
> [PATCH] SQUASH: add memory barriers around i_version accesses

Yep, this looks good to me.

Honza
> 
> Signed-off-by: Jeff Layton 
> ---
>  include/linux/iversion.h | 60 
> +++-
>  1 file changed, 39 insertions(+), 21 deletions(-)
> 
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index a9fbf99709df..1b3b5ed7c5b8 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -89,6 +89,23 @@ inode_set_iversion_raw(struct inode *inode, const u64 val)
>   atomic64_set(>i_version, val);
>  }
>  
> +/**
> + * inode_peek_iversion_raw - grab a "raw" iversion value
> + * @inode: inode from which i_version should be read
> + *
> + * Grab a "raw" inode->i_version value and return it. The i_version is not
> + * flagged or converted in any way. This is mostly used to access a 
> self-managed
> + * i_version.
> + *
> + * With those filesystems, we want to treat the i_version as an entirely
> + * opaque value.
> + */
> +static inline u64
> +inode_peek_iversion_raw(const struct inode *inode)
> +{
> + return atomic64_read(>i_version);
> +}
> +
>  /**
>   * inode_set_iversion - set i_version to a particular value
>   * @inode: inode to set
> @@ -152,7 +169,18 @@ inode_maybe_inc_iversion(struct inode *inode, bool force)
>  {
>   u64 cur, old, new;
>  
> - cur = (u64)atomic64_read(>i_version);
> + /*
> +  * The i_version field is not strictly ordered with any other inode
> +  * information, but the legacy inode_inc_iversion code used a spinlock
> +  * to serialize increments.
> +  *
> +  * Here, we add full memory barriers to ensure that any de-facto
> +  * ordering with other info is preserved.
> +  *
> +  * This barrier pairs with the barrier in inode_query_iversion()
> +  */
> + smp_mb();
> + cur = inode_peek_iversion_raw(inode);
>   for (;;) {
>   /* If flag is clear then we needn't do anything */
>   if (!force && !(cur & I_VERSION_QUERIED))
> @@ -183,23 +211,6 @@ inode_inc_iversion(struct inode *inode)
>   inode_maybe_inc_iversion(inode, true);
>  }
>  
> -/**
> - * inode_peek_iversion_raw - grab a "raw" iversion value
> - * @inode: inode from which i_version should be read
> - *
> - * Grab a "raw" inode->i_version value and return it. The i_version is not
> - * flagged or converted in any way. This is mostly used to access a 
> self-managed
> - * i_version.
> - *
> - * With those filesystems, we want to treat the i_version as an entirely
> - * opaque value.
> - */
> -static inline u64
> -inode_peek_iversion_raw(const struct inode *inode)
> -{
> - return atomic64_read(>i_version);
> -}
> -
>  /**
>   * inode_iversion_need_inc - is the i_version in need of being incremented?
>   * @inode: inode to check
> @@ -248,15 +259,22 @@ inode_query_iversion(struct inode *inode)
>  {
>   u64 cur, old, new;
>  
> - cur = atomic64_read(>i_version);
> + cur = inode_peek_iversion_raw(inode);
>   for (;;) {
>   /* If flag is already set, then no need to swap */
> - if (cur & I_VERSION_QUERIED)
> + if (cur & I_VERSION_QUERIED) {
> + /*
> +  * This barrier (and the implicit barrier in the
> +  * cmpxchg below) pairs with the barrier in
> +  * inode_maybe_inc_iversion().
> +  */
> + smp_mb();
>   break;
> + }
>  
>   new = cur | I_VERSION_QUERIED;
>   old = atomic64_cmpxchg(>i_version, cur, new);
> - if (old == cur)
> + if (likely(old == cur))
>   break;
>   cur = old;
>   }
> -- 
> 2.14.3
> 
-- 
Jan Kara 
SUSE Labs, CR
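
For reference, the barrier pairing being discussed can be modeled in
user-space C11, with seq-cst fences standing in for smp_mb(). This is a
sketch under those assumptions, not the kernel code:

/* Build with: cc -pthread fence.c */
#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

static int data;
static atomic_int published;

static void *writer(void *arg)
{
        data = 42;
        /* pairs with the reader-side fence, like the smp_mb() pair above */
        atomic_thread_fence(memory_order_seq_cst);
        atomic_store_explicit(&published, 1, memory_order_relaxed);
        return NULL;
}

static void *reader(void *arg)
{
        while (!atomic_load_explicit(&published, memory_order_relaxed))
                ;       /* spin until the flag is visible */
        atomic_thread_fence(memory_order_seq_cst);
        printf("data = %d\n", data);    /* guaranteed to print 42 */
        return NULL;
}

int main(void)
{
        pthread_t w, r;

        pthread_create(&r, NULL, reader, NULL);
        pthread_create(&w, NULL, writer, NULL);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
        return 0;
}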


Re: [PATCH v3 19/19] fs: handle inode->i_version more efficiently

2017-12-21 Thread Jeff Layton
On Wed, 2017-12-20 at 17:41 +0100, Jan Kara wrote:
> On Wed 20-12-17 09:03:06, Jeff Layton wrote:
> > On Tue, 2017-12-19 at 09:07 +1100, Dave Chinner wrote:
> > > On Mon, Dec 18, 2017 at 02:35:20PM -0500, Jeff Layton wrote:
> > > > [PATCH] SQUASH: add memory barriers around i_version accesses
> > > 
> > > Why explicit memory barriers rather than annotating the operations
> > > with the required semantics and getting the barriers the arch
> > > requires automatically?  I suspect this should be using
> > > atomic_read_acquire() and atomic_cmpxchg_release(), because AFAICT
> > > the atomic_cmpxchg needs to have release semantics to match the
> > > acquire semantics needed for the load of the current value.
> > > 
> > > From include/linux/atomics.h:
> > > 
> > >  * For compound atomics performing both a load and a store, ACQUIRE
> > >  * semantics apply only to the load and RELEASE semantics only to the
> > >  * store portion of the operation. Note that a failed cmpxchg_acquire
> > >  * does -not- imply any memory ordering constraints.
> > > 
> > > Memory barriers hurt my brain. :/
> > > 
> > > At minimum, shouldn't the atomic op specific barriers be used rather
> > > than full memory barriers? i.e:
> > > 
> > 
> > They hurt my brain too. Yes, definitely atomic-specific barriers should
> > be used here instead, since this is an atomic64_t now.
> > 
> > After going over the docs again...my understanding has always been that
> > you primarily need memory barriers to order accesses to different areas
> > of memory.
> 
> That is correct.
> 
> > As Jan and I were discussing in the other thread, i_version is not
> > synchronized with anything else. In this code, we're only dealing with a
> > single 64-bit word. I don't think there are any races there wrt the API
> > itself.
> 
> There are not but it is like saying that lock implementation is correct
> because the lock state does not get corrupted ;). Who cares about protected
> data...
> 
> > The "legacy" inode_inc_iversion() however _does_ have implied memory
> > barriers from the i_lock. There could be some subtle de-facto ordering
> > there, so I think we probably do want some barriers in here if only to
> > preserve that. It's not likely to cost much, and may save us tracking
> > down some fiddly bugs.
> > 
> > What about this patch? Note that I've only added barriers to
> > inode_maybe_inc_iversion. I don't see that we need it for the other
> > functions, but please do tell me if I'm wrong there:
> > 
> > 8<---
> > 
> > [PATCH] SQUASH: add memory barriers around i_version accesses
> > 
> > Signed-off-by: Jeff Layton 
> > ---
> >  include/linux/iversion.h | 53 +---
> >  1 file changed, 32 insertions(+), 21 deletions(-)
> > 
> > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > index a9fbf99709df..02187a3bec3b 100644
> > --- a/include/linux/iversion.h
> > +++ b/include/linux/iversion.h
> > @@ -89,6 +89,23 @@ inode_set_iversion_raw(struct inode *inode, const u64 val)
> > 	atomic64_set(&inode->i_version, val);
> >  }
> >  
> > +/**
> > + * inode_peek_iversion_raw - grab a "raw" iversion value
> > + * @inode: inode from which i_version should be read
> > + *
> > + * Grab a "raw" inode->i_version value and return it. The i_version is not
> > + * flagged or converted in any way. This is mostly used to access a self-managed
> > + * i_version.
> > + *
> > + * With those filesystems, we want to treat the i_version as an entirely
> > + * opaque value.
> > + */
> > +static inline u64
> > +inode_peek_iversion_raw(const struct inode *inode)
> > +{
> > +	return atomic64_read(&inode->i_version);
> > +}
> > +
> >  /**
> >   * inode_set_iversion - set i_version to a particular value
> >   * @inode: inode to set
> > @@ -152,7 +169,16 @@ inode_maybe_inc_iversion(struct inode *inode, bool force)
> >  {
> > u64 cur, old, new;
> >  
> > -	cur = (u64)atomic64_read(&inode->i_version);
> > +   /*
> > +* The i_version field is not strictly ordered with any other inode
> > +* information, but the legacy inode_inc_iversion code used a spinlock
> > +* to serialize increments.
> > +*
> > +* This code adds full memory barriers to ensure that any de-facto
> > +* ordering with other info is preserved.
> > +*/
> > +   smp_mb__before_atomic();
> 
> This should be just smp_mb(). __before_atomic() pairs with atomic
> operations like atomic_inc(). atomic_read() is a completely unordered
> operation (it happens to be a plain memory read on x86), so
> __before_atomic() is not enough.
> 
> > +   cur = inode_peek_iversion_raw(inode);
> > for (;;) {
> > /* If flag is clear then we needn't do anything */
> > if (!force && !(cur & I_VERSION_QUERIED))
> > @@ -162,8 +188,10 @@ inode_maybe_inc_iversion(struct inode *inode, bool force)
> > new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
> >  
> >
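
For reference, here is a rough sketch of the acquire/release shape suggested
above, written against the kernel's atomic64 API as a drop-in for
include/linux/iversion.h. It is an illustration only: whether release
ordering on the cmpxchg is sufficient (a failed cmpxchg_release implies no
ordering at all) is exactly what remains under discussion, and the bool
return type is an assumption about the surrounding patch.

/* Sketch: atomic-op-specific ordering instead of explicit smp_mb(). */
static inline bool
inode_maybe_inc_iversion(struct inode *inode, bool force)
{
	u64 cur, old, new;

	/* acquire pairs with the release on a successful cmpxchg below */
	cur = atomic64_read_acquire(&inode->i_version);
	for (;;) {
		/* if no one has queried the value, skip the increment */
		if (!force && !(cur & I_VERSION_QUERIED))
			return false;

		new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
		old = atomic64_cmpxchg_release(&inode->i_version, cur, new);
		if (likely(old == cur))
			return true;
		cur = old;
	}
}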

Re: Cannot balance, ENOSPC errors 4.14.2 vanilla kernel

2017-12-21 Thread Qu Wenruo


On 2017-12-21 15:56, Adam Bahe wrote:
> Alright, I have rebuilt kernel 4.14.8 and added the line of code you
> gave me. The kernel is installed and I have a full balance running.
> One thing I noticed right off the bat: the last time I ran a full
> balance, balance status showed something like "14 out of about 200
> chunks balanced", and I found it interesting that it was only trying
> to balance 200 chunks. With your change, a full balance status right
> now shows "34 out of 9770 chunks balanced". It has been running for
> about 10 minutes now, but sometimes it took a while to cause the
> filesystem to go read-only, so we shall wait and see. A full balance
> across 21 devices and 120TB raw or thereabouts will take some time.

Well, if ENOSPC happens, there should be a kernel message printed by
the line I added.

It's recommended to monitor dmesg too.
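
For example, a minimal /dev/kmsg follower in C that prints only the new
debugging line ("dmesg -w | grep ndevs=" achieves the same thing; this
helper is purely illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char buf[8192];
	ssize_t n;
	int fd = open("/dev/kmsg", O_RDONLY);

	if (fd < 0) {
		perror("open /dev/kmsg");
		return 1;
	}
	for (;;) {
		/* each read() returns exactly one log record */
		n = read(fd, buf, sizeof(buf) - 1);
		if (n <= 0)
			continue;	/* e.g. EPIPE after a buffer overrun */
		buf[n] = '\0';
		if (strstr(buf, "ndevs="))
			fputs(buf, stdout);
	}
}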

Thanks,
Qu

> 
> 
> On Wed, Dec 20, 2017 at 4:13 PM, Adam Bahe  wrote:
>> Yeah I had a hunch that it was something to do with the 2TB disks.
>> I've been slowly trying to replace them. But they're the remnants of
>> my old storage system so it has been slow going. When I get some time
>> I will try and compile the kernel with your patch. Thanks!
>>
>> On Wed, Dec 20, 2017 at 12:20 AM, Qu Wenruo  wrote:
>>> Also, if you're OK with compiling a kernel, would you please try the
>>> following diff to help us further enhance the btrfs chunk allocator?
>>>
>>> The chunk allocator itself is designed to handle your case, so it should
>>> pick up the remaining devices and allocate a new RAID10 chunk from the
>>> unallocated space.
>>>
>>> But that is not what actually happens.
>>> I'm wondering if it's the devs_increment and sub_stripes calculation
>>> that's causing the problem.
>>>
>>> If you could help test btrfs with the diff applied, dmesg will contain
>>> extra info when you try to do a device remove.
>>>
>>> Thanks,
>>> Qu
>>>
>>> 
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index 49810b70afd3..851ff13f5c29 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -4732,6 +4732,8 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
>>>
>>> if (ndevs < devs_increment * sub_stripes || ndevs < devs_min) {
>>> ret = -ENOSPC;
>>> +   pr_info("ndevs=%d dev_increment=%d sub_stripes=%d devs_min=%d\n",
>>> +   ndevs, devs_increment, sub_stripes, devs_min);
>>> goto error;
>>> }
>>>
>>>
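To make the failing check concrete: in btrfs_raid_array, RAID10 has
devs_increment = 2 (each stripe is mirrored) and sub_stripes = 2, so the
test above demands at least four devices that can still contribute a
stripe. A standalone illustration in plain C (the 1GiB stripe threshold
and the sample device sizes are arbitrary stand-ins, not the kernel's
sizing logic):

#include <stdbool.h>
#include <stdio.h>

#define RAID10_DEVS_INCREMENT 2	/* each stripe is mirrored */
#define RAID10_SUB_STRIPES    2
#define RAID10_DEVS_MIN       4

static bool raid10_chunk_possible(const unsigned long long *unallocated,
				  int num_devices,
				  unsigned long long min_stripe_bytes)
{
	int ndevs = 0;

	for (int i = 0; i < num_devices; i++)
		if (unallocated[i] >= min_stripe_bytes)
			ndevs++;	/* can still contribute a stripe */

	return ndevs >= RAID10_DEVS_INCREMENT * RAID10_SUB_STRIPES &&
	       ndevs >= RAID10_DEVS_MIN;
}

int main(void)
{
	/* three devices with space left: one short of RAID10's minimum */
	unsigned long long unallocated[] = {
		0, 0, 0, 1ULL << 40, 1ULL << 40, 1ULL << 40,
	};

	printf("RAID10 chunk possible: %s\n",
	       raid10_chunk_possible(unallocated, 6, 1ULL << 30)
	       ? "yes" : "no");
	return 0;
}
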
>>> On 2017-12-20 14:11, Qu Wenruo wrote:


 On 2017-12-20 13:00, Adam Bahe wrote:
> I'm using raid10.

 Pretty much the same.

 RAID10 is RAID1 first, then RAID0.

 For your 20-disk (well, quite amazing) layout, btrfs will try to
 allocate across all disks for RAID10.

 And unfortunately, devids 7, 9, 12, 13, 14, and 15 are already full.

 Normally btrfs should exclude them from chunk allocation, but I think
 some small unaligned unallocated space is making btrfs choose them for
 allocation anyway.

 As a result, no new chunk can be allocated.

 And considering the size of your fs, the common workaround of adding a
 small temporary USB disk to allow the convert doesn't work in your case.


 You could try converting the profile from RAID10 to RAID1, which always
 uses 2 disks to allocate space, so it would be able to allocate the new
 chunks needed to start your convert (a rough simulation follows below).

 And a final suggestion: don't use any striped profile with so many
 unevenly sized disks.

 Thanks,
 Qu
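
As a rough picture of why the RAID1 convert helps on unevenly sized disks:
each RAID1 chunk only needs the two devices with the most unallocated
space. The greedy simulation below uses hypothetical sizes, loosely
echoing the mix of large and 2TB disks above; the real allocator in
fs/btrfs/volumes.c has more constraints.

#include <stdio.h>

#define NDEV 4

int main(void)
{
	/* hypothetical unallocated space in GiB */
	long long free_gib[NDEV] = { 8000, 1800, 1800, 1800 };
	long long chunks = 0;

	for (;;) {
		int a = 0, b = 1;

		/* find the two devices with the most free space */
		for (int i = 0; i < NDEV; i++) {
			if (free_gib[i] > free_gib[a]) {
				b = a;
				a = i;
			} else if (i != a && free_gib[i] > free_gib[b]) {
				b = i;
			}
		}
		if (free_gib[a] < 1 || free_gib[b] < 1)
			break;	/* fewer than two devices with space left */

		free_gib[a]--;	/* one copy of the chunk... */
		free_gib[b]--;	/* ...and its mirror */
		chunks++;
	}
	printf("usable RAID1 data: %lld GiB\n", chunks);
	return 0;
}

For these sizes it prints 5400 GiB, i.e. the classic "total minus the
largest device" bound when one disk dominates.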

>
> On Tue, Dec 19, 2017 at 10:51 PM, Qu Wenruo  wrote:
>>
>>
>>> On 2017-12-20 10:51, Adam Bahe wrote:
>>> Forgot to add, I should have plenty of space:
>>>
>>> Label: 'nas'  uuid: 4fcd5725-b6c6-4d8a-9860-f2fc5474cbcb
>>> Total devices 20 FS bytes used 26.69TiB
>>> devid    1 size 3.64TiB used 3.28TiB path /dev/sdm
>>> devid    2 size 3.64TiB used 3.28TiB path /dev/sde
>>> devid    3 size 7.28TiB used 3.45TiB path /dev/sdt
>>> devid    4 size 9.03TiB used 2.51TiB path /dev/sdo
>>> devid    5 size 7.28TiB used 3.45TiB path /dev/sdi
>>> devid    6 size 7.28TiB used 3.45TiB path /dev/sdd
>>> devid    7 size 1.82TiB used 1.82TiB path /dev/sdp
>>> devid    9 size 1.82TiB used 1.82TiB path /dev/sdw
>>> devid   10 size 1.82TiB used 1.82TiB path /dev/sdk
>>> devid   11 size 9.03TiB used 1.82TiB path /dev/sdy
>>> devid   12 size 1.82TiB used 1.82TiB path /dev/sdg
>>> devid   13 size 1.82TiB used 1.82TiB path /dev/sdl
>>> devid   14 size 1.82TiB used 1.82TiB path /dev/sdr
>>> devid   15 size 1.82TiB used 1.82TiB path /dev/sdf
>>> devid   16 size 5.46TiB used 3.45TiB path /dev/sds
>>>