Re: [PATCH 3/5] btrfs: raid56: Use correct stolen pages to calculate P/Q

2017-03-15 Thread Liu Bo
On Fri, Feb 03, 2017 at 04:20:21PM +0800, Qu Wenruo wrote:
> In the following situation, scrub will calculate a wrong parity and
> overwrite the correct one:
> 
> RAID5 full stripe:
> 
> Before
> | Dev 1          | Dev 2          | Dev 3         |
> | Data stripe 1  | Data stripe 2  | Parity Stripe |
> ------------------------------------------------- 0
> | 0x0000 (Bad)   | 0xcdcd         | 0x0000        |
> ------------------------------------------------- 4K
> | 0xcdcd         | 0xcdcd         | 0x0000        |
> ...
> | 0xcdcd         | 0xcdcd         | 0x0000        |
> ------------------------------------------------- 64K
> 
> After scrubbing dev3 only:
> 
> | Dev 1          | Dev 2          | Dev 3         |
> | Data stripe 1  | Data stripe 2  | Parity Stripe |
> ------------------------------------------------- 0
> | 0xcdcd (Good)  | 0xcdcd         | 0xcdcd (Bad)  |
> ------------------------------------------------- 4K
> | 0xcdcd         | 0xcdcd         | 0x0000        |
> ...
> | 0xcdcd         | 0xcdcd         | 0x0000        |
> ------------------------------------------------- 64K
> 
> The call trace of such corruption is as follows:
> 
> scrub_bio_end_io_worker() gets called for each extent read out
> |- scrub_block_complete()
>    |- Data extent csum mismatch
>    |- scrub_handle_errored_block()
>       |- scrub_recheck_block()
>          |- scrub_submit_raid56_bio_wait()
>             |- raid56_parity_recover()
> 
> Now we have a rbio with correct data stripe 1 recovered.
> Let's call it "good_rbio".
> 
> scrub_parity_check_and_repair()
> |- raid56_parity_submit_scrub_rbio()
>    |- lock_stripe_add()
>    |  |- steal_rbio()
>    |     |- Recovered data is stolen from "good_rbio" and stored into
>    |        rbio->stripe_pages[].
>    |        Now rbio->bio_pages[] holds the bad data read from disk.

At this point, we should already know whether rbio->bio_pages are
corrupted, because rbio->bio_pages are indexed from the list
sparity->pages, and we only do scrub_parity_put after finishing the
endio of reading all pages linked at sparity->pages.

Since the previous checksumming failure already triggered a recovery
and we got the correct data in that rbio, instead of adding this
corrupted page into the new rbio, it'd be fine to skip it and use
rbio->stripe_pages, all of which can be stolen from the previous good
rbio.
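To make that suggestion concrete, here is a minimal standalone C sketch
(not the kernel code; every name in it is made up) of the page-selection
idea: when recalculating parity, prefer a page stolen from the recovered
rbio over the copy read from disk.

#include <stdbool.h>

struct demo_page {
	unsigned char data[4096];
	bool uptodate;	/* set when the page was stolen from a recovered rbio */
};

/* Pick the source buffer for one sector of the parity calculation. */
static const unsigned char *pick_parity_source(const struct demo_page *stolen,
					       const struct demo_page *ondisk)
{
	/* A stolen, up-to-date page carries the recovered (good) data. */
	if (stolen && stolen->uptodate)
		return stolen->data;
	/* Otherwise fall back to what was read from disk. */
	return ondisk->data;
}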

Thanks,

-liubo

>    |- async_scrub_parity()
>       |- scrub_parity_work() (delayed_call to scrub_parity_work)
> 
> scrub_parity_work()
> |- raid56_parity_scrub_stripe()
>    |- validate_rbio_for_parity_scrub()
>       |- finish_parity_scrub()
>          |- Recalculate parity using the *BAD* pages in rbio->bio_pages[],
>             so the good parity is overwritten with the *BAD* one
> 
> The fix is to introduce 2 new members, bad_ondisk_a/b, to struct
> btrfs_raid_bio, to tell the scrub code to use the correct data pages to
> re-calculate parity.
> 
> Reported-by: Goffredo Baroncelli 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/raid56.c | 62 
> +++
>  1 file changed, 58 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> index d2a9a1ee5361..453eefdcb591 100644
> --- a/fs/btrfs/raid56.c
> +++ b/fs/btrfs/raid56.c
> @@ -133,6 +133,16 @@ struct btrfs_raid_bio {
>   /* second bad stripe (for raid6 use) */
>   int failb;
>  
> + /*
> +  * For steal_rbio, we can steal the recovered correct page,
> +  * but in finish_parity_scrub(), we still use the bad on-disk
> +  * page to calculate parity.
> +  * Use these members to tell finish_parity_scrub() to use the
> +  * correct pages.
> +  */
> + int bad_ondisk_a;
> + int bad_ondisk_b;
> +
>   int scrubp;
>   /*
>* number of pages needed to represent the full
> @@ -310,6 +320,12 @@ static void steal_rbio(struct btrfs_raid_bio *src, 
> struct btrfs_raid_bio *dest)
>   if (!test_bit(RBIO_CACHE_READY_BIT, &src->flags))
>   return;
>  
> + /* Record recovered stripe number */
> + if (src->faila != -1)
> + dest->bad_ondisk_a = src->faila;
> + if (src->failb != -1)
> + dest->bad_ondisk_b = src->failb;
> +
>   for (i = 0; i < dest->nr_pages; i++) {
>   s = src->stripe_pages[i];
>   if (!s || !PageUptodate(s)) {
> @@ -999,6 +1015,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct 
> btrfs_fs_info *fs_info,
>   rbio->stripe_npages = stripe_npages;
>   rbio->faila = -1;
>   rbio->failb = -1;
> + rbio->bad_ondisk_a = -1;
> + rbio->bad_ondisk_b = -1;
>   atomic_set(&rbio->refs, 1);
>   atomic_set(&rbio->error, 0);
>   atomic_set(&rbio->stripes_pending, 0);
> @@ -2261,6 +2279,9 @@ static int alloc_rbio_essential_pages(struct 
> btrfs_raid_bio *rbio)
>   int bit;
>   int index;
>   struct page *page;
> + struct page *bio_page;
> + void *ptr;
> + void *bio_ptr;
>  
>   

[PATCH 2/2] btrfs-progs: convert: Make btrfs_reserved_ranges const

2017-03-15 Thread Qu Wenruo
Since the btrfs_reserved_ranges array is just used to store btrfs reserved
ranges, and no one will nor should modify it at run time, making it static
and const is better.

This also eliminates the use of the magic number 3.

Signed-off-by: Qu Wenruo 
---
 convert/main.c  | 16 
 convert/source-fs.c |  6 --
 convert/source-fs.h |  8 ++--
 3 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/convert/main.c b/convert/main.c
index 73c9d889..96358c62 100644
--- a/convert/main.c
+++ b/convert/main.c
@@ -218,7 +218,7 @@ static int create_image_file_range(struct 
btrfs_trans_handle *trans,
 * migrate block will fail as there is already a file extent.
 */
for (i = 0; i < ARRAY_SIZE(btrfs_reserved_ranges); i++) {
-   struct simple_range *reserved = &btrfs_reserved_ranges[i];
+   const struct simple_range *reserved = &btrfs_reserved_ranges[i];
 
/*
 * |-- reserved --|
@@ -320,7 +320,7 @@ static int migrate_one_reserved_range(struct 
btrfs_trans_handle *trans,
  struct btrfs_root *root,
  struct cache_tree *used,
  struct btrfs_inode_item *inode, int fd,
- u64 ino, struct simple_range *range,
+ u64 ino, const struct simple_range *range,
  u32 convert_flags)
 {
u64 cur_off = range->start;
@@ -423,7 +423,7 @@ static int migrate_reserved_ranges(struct 
btrfs_trans_handle *trans,
int ret = 0;
 
for (i = 0; i < ARRAY_SIZE(btrfs_reserved_ranges); i++) {
-   struct simple_range *range = &btrfs_reserved_ranges[i];
+   const struct simple_range *range = &btrfs_reserved_ranges[i];
 
if (range->start > total_bytes)
return ret;
@@ -609,7 +609,7 @@ static int wipe_reserved_ranges(struct cache_tree *tree, 
u64 min_stripe_size,
int ret;
 
for (i = 0; i < ARRAY_SIZE(btrfs_reserved_ranges); i++) {
-   struct simple_range *range = &btrfs_reserved_ranges[i];
+   const struct simple_range *range = &btrfs_reserved_ranges[i];
 
ret = wipe_one_reserved_range(tree, range->start, range->len,
  min_stripe_size, ensure_size);
@@ -1370,7 +1370,7 @@ static int read_reserved_ranges(struct btrfs_root *root, 
u64 ino,
int ret = 0;
 
for (i = 0; i < ARRAY_SIZE(btrfs_reserved_ranges); i++) {
-   struct simple_range *range = &btrfs_reserved_ranges[i];
+   const struct simple_range *range = &btrfs_reserved_ranges[i];
 
if (range->start + range->len >= total_bytes)
break;
@@ -1395,7 +1395,7 @@ static bool is_subset_of_reserved_ranges(u64 start, u64 
len)
bool ret = false;
 
for (i = 0; i < ARRAY_SIZE(btrfs_reserved_ranges); i++) {
-   struct simple_range *range = &btrfs_reserved_ranges[i];
+   const struct simple_range *range = &btrfs_reserved_ranges[i];
 
if (start >= range->start && start + len <= range_end(range)) {
ret = true;
@@ -1620,7 +1620,7 @@ static int do_rollback(const char *devname)
int i;
 
for (i = 0; i < ARRAY_SIZE(btrfs_reserved_ranges); i++) {
-   struct simple_range *range = &btrfs_reserved_ranges[i];
+   const struct simple_range *range = &btrfs_reserved_ranges[i];
 
reserved_ranges[i] = calloc(1, range->len);
if (!reserved_ranges[i]) {
@@ -1730,7 +1730,7 @@ close_fs:
 
for (i = ARRAY_SIZE(btrfs_reserved_ranges) - 1; i >= 0; i--) {
u64 real_size;
-   struct simple_range *range = &btrfs_reserved_ranges[i];
+   const struct simple_range *range = &btrfs_reserved_ranges[i];
 
if (range_end(range) >= fsize)
continue;
diff --git a/convert/source-fs.c b/convert/source-fs.c
index 7cf515b0..8217c893 100644
--- a/convert/source-fs.c
+++ b/convert/source-fs.c
@@ -22,12 +22,6 @@
 #include "convert/common.h"
 #include "convert/source-fs.h"
 
-struct simple_range btrfs_reserved_ranges[3] = {
-   { 0, SZ_1M },
-   { BTRFS_SB_MIRROR_OFFSET(1), SZ_64K },
-   { BTRFS_SB_MIRROR_OFFSET(2), SZ_64K }
-};
-
 static int intersect_with_sb(u64 bytenr, u64 num_bytes)
 {
int i;
diff --git a/convert/source-fs.h b/convert/source-fs.h
index 9f611150..7aabe96b 100644
--- a/convert/source-fs.h
+++ b/convert/source-fs.h
@@ -32,7 +32,11 @@ struct simple_range {
u64 len;
 };
 
-extern struct simple_range btrfs_reserved_ranges[3];
+static const struct simple_range btrfs_reserved_ranges[] = {
+   { 0, SZ_1M },
+   { BTRFS_SB_MIRROR_OFFSET(1), SZ_64K },
+   { BTRFS_SB_MIRROR_OFFSET(2), SZ_64K }
+};
 
 struct 

[PATCH 1/2] btrfs-progs: kerncompat: Fix re-definition of __bitwise

2017-03-15 Thread Qu Wenruo
In the latest Linux API headers, __bitwise is already defined in
/usr/include/linux/types.h.

So kerncompat.h will re-define __bitwise and cause a gcc warning.

Fix it by checking whether __bitwise is already defined.

Signed-off-by: Qu Wenruo 
---
The patch is based on devel branch with the following head:
commit 64abe9f619b614e589339046b6c45dfb8fa8e2a9
Author: David Sterba 
Date:   Wed Mar 15 12:28:16 2017 +0100

btrfs-progs: tests: misc/019, use fssum
---
 kerncompat.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kerncompat.h b/kerncompat.h
index 958bea43..fa96715f 100644
--- a/kerncompat.h
+++ b/kerncompat.h
@@ -317,11 +317,13 @@ static inline void assert_trace(const char *assertion, 
const char *filename,
 #define container_of(ptr, type, member) ({  \
 const typeof( ((type *)0)->member ) *__mptr = (ptr);\
(type *)( (char *)__mptr - offsetof(type,member) );})
+#ifndef __bitwise
 #ifdef __CHECKER__
 #define __bitwise __bitwise__
 #else
 #define __bitwise
-#endif
+#endif /* __CHECKER__ */
+#endif /* __bitwise */
 
 /* Alignment check */
 #define IS_ALIGNED(x, a)(((x) & ((typeof(x))(a) - 1)) == 0)
-- 
2.12.0





Re: [PATCH 2/2] btrfs: replace hardcoded value with SEQ_NONE macro

2017-03-15 Thread Qu Wenruo



At 03/15/2017 10:38 PM, David Sterba wrote:

On Mon, Mar 13, 2017 at 02:32:04PM -0600, ednadol...@gmail.com wrote:

From: Edmund Nadolski 

Define the SEQ_NONE macro to replace (u64)-1 in places where said
value triggers a special-case ref search behavior.



index 9c41fba..20915a6 100644
--- a/fs/btrfs/backref.h
+++ b/fs/btrfs/backref.h
@@ -23,6 +23,8 @@
 #include "ulist.h"
 #include "extent_io.h"

+#define SEQ_NONE   ((u64)-1)


The name SEQ_NONE does not sound that good to me.

The (u64)-1 is to tell the backref walker to only search the current
root, with no need to worry about delayed_refs, since the caller (qgroup)
will ensure that no delayed_ref exists.


While the name SEQ_NONE sounds a little like 0, which is far from the
original meaning.


What about SEQ_FINAL or SEQ_LAST?
Since the time we use (u64)-1 is just before switching commit roots,
it would be better for the name to indicate that.
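Purely as an illustration of the rename being discussed (this is not code
from the patch, just a sketch of the suggestion), the definition itself
would only change in name:

/* Illustrative only: the special sequence value under one of the proposed
 * names.  The value stays (u64)-1; only the name changes, hinting that it
 * is the last/final sequence, used just before commit roots are switched. */
#define SEQ_LAST	((u64)-1)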


Thanks,
Qu





Can you please move the definition to ctree.h, near line 660, where
seq_list and SEQ_LIST_INIT are defined, so they're all grouped together?







Re: [PATCH] btrfs: remove unused qgroup members from btrfs_trans_handle

2017-03-15 Thread Qu Wenruo



At 03/15/2017 11:17 PM, David Sterba wrote:

The members have been effectively unused since "Btrfs: rework qgroup
accounting" (fcebe4562dec83b3); there's no substitute for
assert_qgroups_uptodate, so it's removed as well.

Signed-off-by: David Sterba 


Reviewed-by: Qu Wenruo 

Thanks for the cleanup,
Qu


---
 fs/btrfs/extent-tree.c   |  1 -
 fs/btrfs/qgroup.c| 12 
 fs/btrfs/qgroup.h|  1 -
 fs/btrfs/tests/btrfs-tests.c |  1 -
 fs/btrfs/transaction.c   |  3 ---
 fs/btrfs/transaction.h   |  2 --
 6 files changed, 20 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index be5477676cc8..b5682abf6f68 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3003,7 +3003,6 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle 
*trans,
goto again;
}
 out:
-   assert_qgroups_uptodate(trans);
trans->can_flush_pending_bgs = can_flush_pending_bgs;
return 0;
 }
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index a5da750c1087..2fa0b10d239f 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2487,18 +2487,6 @@ void btrfs_qgroup_free_refroot(struct btrfs_fs_info 
*fs_info,
spin_unlock(&fs_info->qgroup_lock);
 }

-void assert_qgroups_uptodate(struct btrfs_trans_handle *trans)
-{
-   if (list_empty(&trans->qgroup_ref_list) && !trans->delayed_ref_elem.seq)
-   return;
-   btrfs_err(trans->fs_info,
-   "qgroups not uptodate in trans handle %p:  list is%s empty, seq is %#x.%x",
-   trans, list_empty(&trans->qgroup_ref_list) ? "" : " not",
-   (u32)(trans->delayed_ref_elem.seq >> 32),
-   (u32)trans->delayed_ref_elem.seq);
-   BUG();
-}
-
 /*
  * returns < 0 on error, 0 when more leafs are to be scanned.
  * returns 1 when done.
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 26932a8a1993..96fc56ebf55a 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -196,7 +196,6 @@ static inline void btrfs_qgroup_free_delayed_ref(struct 
btrfs_fs_info *fs_info,
btrfs_qgroup_free_refroot(fs_info, ref_root, num_bytes);
trace_btrfs_qgroup_free_delayed_ref(fs_info, ref_root, num_bytes);
 }
-void assert_qgroups_uptodate(struct btrfs_trans_handle *trans);

 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 int btrfs_verify_qgroup_counts(struct btrfs_fs_info *fs_info, u64 qgroupid,
diff --git a/fs/btrfs/tests/btrfs-tests.c b/fs/btrfs/tests/btrfs-tests.c
index ea272432c930..b18ab8f327a5 100644
--- a/fs/btrfs/tests/btrfs-tests.c
+++ b/fs/btrfs/tests/btrfs-tests.c
@@ -237,7 +237,6 @@ void btrfs_init_dummy_trans(struct btrfs_trans_handle 
*trans)
 {
memset(trans, 0, sizeof(*trans));
trans->transid = 1;
-   INIT_LIST_HEAD(&trans->qgroup_ref_list);
trans->type = __TRANS_DUMMY;
 }

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 61b807de3e16..9db3b4ca0264 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -572,7 +572,6 @@ start_transaction(struct btrfs_root *root, unsigned int 
num_items,

h->type = type;
h->can_flush_pending_bgs = true;
-   INIT_LIST_HEAD(&h->qgroup_ref_list);
INIT_LIST_HEAD(&h->new_bgs);

smp_mb();
@@ -917,7 +916,6 @@ static int __btrfs_end_transaction(struct 
btrfs_trans_handle *trans,
wake_up_process(info->transaction_kthread);
err = -EIO;
}
-   assert_qgroups_uptodate(trans);

kmem_cache_free(btrfs_trans_handle_cachep, trans);
if (must_run_delayed_refs) {
@@ -2223,7 +2221,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans)

switch_commit_roots(cur_trans, fs_info);

-   assert_qgroups_uptodate(trans);
ASSERT(list_empty(&cur_trans->dirty_bgs));
ASSERT(list_empty(&cur_trans->io_bgs));
update_super_roots(fs_info);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 5dfb5590fff6..2e560d2abdff 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -125,8 +125,6 @@ struct btrfs_trans_handle {
unsigned int type;
struct btrfs_root *root;
struct btrfs_fs_info *fs_info;
-   struct seq_list delayed_ref_elem;
-   struct list_head qgroup_ref_list;
struct list_head new_bgs;
 };







[PATCH 2/3] fstests: btrfs: Add testcase for btrfs dedupe and metadata balance race test

2017-03-15 Thread Qu Wenruo
Btrfs balance with inband dedupe enable/disable will expose a lot of
hidden dedupe bugs:

1) Enable/disable race bug
2) Btrfs dedupe tree balance corrupting delayed_ref
3) Btrfs dedupe disable racing with balance, causing a balance BUG_ON

Reported-by: Satoru Takeuchi 
Signed-off-by: Qu Wenruo 
---
 tests/btrfs/201 | 112 
 tests/btrfs/201.out |   2 +
 tests/btrfs/group   |   1 +
 3 files changed, 115 insertions(+)
 create mode 100755 tests/btrfs/201
 create mode 100644 tests/btrfs/201.out

diff --git a/tests/btrfs/201 b/tests/btrfs/201
new file mode 100755
index ..d6913c13
--- /dev/null
+++ b/tests/btrfs/201
@@ -0,0 +1,112 @@
+#! /bin/bash
+# FS QA Test 201
+#
+# Btrfs inband dedup enable/disable race with metadata balance
+#
+# This test will exercise the following bugs exposed during development:
+# 1) enable/disable race
+# 2) tree balance causing delayed ref corruption
+# 3) disable racing with balance causing a BUG_ON
+#
+#---
+# Copyright (c) 2016 Fujitsu.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+   killall $FSSTRESS_PROG &> /dev/null
+   kill $trigger_pid &> /dev/null
+   kill $balance_pid &> /dev/null
+   wait
+
+   # See comment later
+   $BTRFS_UTIL_PROG balance cancel $SCRATCH_MNT &> /dev/null
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_btrfs_command dedupe
+_require_btrfs_fs_feature dedupe
+
+# Use 64K dedupe size to keep compatibility for 64K page size
+dedupe_bs=64K
+
+_scratch_mkfs >> $seqres.full 2>&1
+_scratch_mount
+
+mkdir -p $SCRATCH_MNT/stressdir
+
+runtime=$((60 * $TIME_FACTOR))
+
+trigger_work()
+{
+   while true; do
+   _run_btrfs_util_prog dedupe enable -s inmemory \
+   -b $dedupe_bs $SCRATCH_MNT
+   sleep 1
+   _run_btrfs_util_prog dedupe disable $SCRATCH_MNT
+   sleep 1
+   done
+}
+
+# redirect all output, as error output like 'balance cancelled by user'
+# will pollute the golden output.
+_btrfs_stress_balance -m $SCRATCH_MNT &> /dev/null &
+balance_pid=$!
+
+$FSSTRESS_PROG $(_scale_fsstress_args -p 1 -n 1000) $FSSTRESS_AVOID \
+   -d $SCRATCH_MNT/stressdir > /dev/null 2>&1 &
+
+trigger_work &
+trigger_pid=$!
+
+sleep $runtime
+killall $FSSTRESS_PROG &> /dev/null
+kill $trigger_pid &> /dev/null
+kill $balance_pid &> /dev/null
+wait
+
+# Manually stop balance as it's possible balance is still running for a short
+# time. And we don't want to populate $seqres.full, so call $BTRFS_UTIL_PROG
+# directly
+$BTRFS_UTIL_PROG balance cancel $SCRATCH_MNT &> /dev/null
+
+echo "Silence is golden"
+# success, all done
+status=0
+exit
diff --git a/tests/btrfs/201.out b/tests/btrfs/201.out
new file mode 100644
index ..5ac973f5
--- /dev/null
+++ b/tests/btrfs/201.out
@@ -0,0 +1,2 @@
+QA output created by 201
+Silence is golden
diff --git a/tests/btrfs/group b/tests/btrfs/group
index bf001d3c..f87d995c 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -143,3 +143,4 @@
 137 auto quick send
 138 auto compress
 200 auto ib-dedupe
+201 auto ib-dedupe
-- 
2.12.0





[PATCH] fstests: generic: Test space allocation when there is only fragmented space

2017-03-15 Thread Qu Wenruo
This test case will test whether the file system works well when handling a
large write while all the available space is fragmented.

This can expose a bug in an unmerged btrfs patch, which wrongly modified
the delayed allocation code to exit before allocating all the space, causing
a hang when unmounting.

The wrong patch is:
[PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error

The test case will:
1) Fill a small filesystem with page-sized small files
   All these files have a sequential number as the file name
2) Remove the files whose file name is an odd number
   This will free almost half of the space
3) Try to write a file which takes 1/8 of the file system

The method to create a fragmented fs may not be generic enough, but it
should work for most extent-based filesystems, unless a filesystem
allocates extents from both ends of its free space.

Cc: Filipe Manana 
Cc: Liu Bo 
Signed-off-by: Qu Wenruo 
---
 tests/generic/416 | 99 +++
 tests/generic/416.out |  3 ++
 tests/generic/group   |  1 +
 3 files changed, 103 insertions(+)
 create mode 100755 tests/generic/416
 create mode 100644 tests/generic/416.out

diff --git a/tests/generic/416 b/tests/generic/416
new file mode 100755
index 000..925524b
--- /dev/null
+++ b/tests/generic/416
@@ -0,0 +1,99 @@
+#! /bin/bash
+# FS QA Test 416
+#
+# Test fs behavior when large write request can't be met by one single extent
+#
+# Inspired by a bug in a btrfs fix, which doesn't get exposed by current test
+# cases
+#
+#---
+# Copyright (c) 2017 Fujitsu.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs generic
+_supported_os IRIX Linux
+_require_scratch
+
+fs_size=$((128 * 1024 * 1024))
+page_size=$(get_page_size)
+
+# We will never reach this number though
+nr_files=$(($fs_size / $page_size))
+
+# Use a small fs to make the fill faster
+_scratch_mkfs_sized $fs_size >> $seqres.full 2>&1
+
+_scratch_mount
+
+fill_fs()
+{
+   dir=$1
+   for i in $(seq -w $nr_files); do
+   # xfs_io can't return the correct value when it hits ENOSPC, so use
+   # dd here to detect ENOSPC
+   dd if=/dev/zero of=$SCRATCH_MNT/$i bs=$page_size count=1 \
+   &> /dev/null
+   if [ $? -ne 0 ]; then
+   break
+   fi
+   done
+}
+
+fill_fs $SCRATCH_MNT
+
+# remount to sync everything into the fs, and drop all caches
+_scratch_remount
+
+# remove all files with odd file names, which should free near half
+# of the space
+rm $SCRATCH_MNT/*[13579]
+sync
+
+# We should be able to write at least 1/8 of the whole fs size
+# The number 1/8 is for btrfs, which only has about 47M for data.
+# And half of the 47M is already taken up, so only 1/8 is safe here
+$XFS_IO_PROG -f -c "pwrite 0 $(($fs_size / 8))" $SCRATCH_MNT/large_file | \
+   _filter_xfs_io
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/416.out b/tests/generic/416.out
new file mode 100644
index 000..8d2ffac
--- /dev/null
+++ b/tests/generic/416.out
@@ -0,0 +1,3 @@
+QA output created by 416
+wrote 16777216/16777216 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
diff --git a/tests/generic/group b/tests/generic/group
index b510d41..59f94f9 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -418,3 +418,4 @@
 413 auto quick
 414 auto quick clone
 415 auto clone
+416 auto enospc
-- 
2.9.3





[PATCH v10 0/5] In-band de-duplication for btrfs-progs

2017-03-15 Thread Qu Wenruo
Patchset can be fetched from github:
https://github.com/adam900710/btrfs-progs.git dedupe_20170306

Inband dedupe(in-memory backend only) ioctl support for btrfs-progs.

Users/reviewers/testers can still use the previous btrfs-progs patchset to
test; this update just cleans up unsupported functions, like the on-disk
backend and any on-disk format change.

v7 changes:
   Update ctree.h to follow kernel structure change
   Update print-tree to follow kernel structure change
V8 changes:
   Move dedup props and on-disk backend support out of the patchset
   Change command group name to "dedupe-inband", to avoid confusion with
   possible out-of-band dedupe. Suggested by Mark.
   Rebase to latest devel branch.
V9 changes:
   Follow the kernel ioctl change to support the FORCE flag, the new reconf
   ioctl, and more precise error reporting.

v10 changes:
   Rebase to v4.10.
   Add BUILD_ASSERT for btrfs_ioctl_dedupe_args
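
As an aside for readers who have not seen the BUILD_ASSERT mentioned in the
v10 changes above: it is a compile-time assertion, used here so the ioctl
argument struct cannot silently change its ABI size. The snippet below is
only a generic, made-up illustration; the struct, its field layout and the
64-byte size are hypothetical, not the real btrfs_ioctl_dedupe_args.

#include <stdint.h>

#define BUILD_ASSERT(cond) _Static_assert((cond), #cond)

struct demo_ioctl_args {
	uint16_t cmd;		/* which dedupe operation to run */
	uint64_t blocksize;	/* dedupe block size */
	uint8_t  reserved[48];	/* padding up to the fixed ABI size */
};

/* If the struct ever changes size, the build breaks instead of the ABI. */
BUILD_ASSERT(sizeof(struct demo_ioctl_args) == 64);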

Qu Wenruo (5):
  btrfs-progs: Basic framework for dedupe-inband command group
  btrfs-progs: dedupe: Add enable command for dedupe command group
  btrfs-progs: dedupe: Add disable support for inband dedupelication
  btrfs-progs: dedupe: Add status subcommand
  btrfs-progs: dedupe: introduce reconfigure subcommand

 Documentation/Makefile.in  |   1 +
 Documentation/btrfs-dedupe-inband.asciidoc | 167 +++
 Documentation/btrfs.asciidoc   |   4 +
 Makefile   |   2 +-
 btrfs-completion   |   6 +-
 btrfs.c|   2 +
 cmds-dedupe-ib.c   | 437 +
 commands.h |   2 +
 dedupe-ib.h|  41 +++
 ioctl.h|  38 +++
 10 files changed, 698 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/btrfs-dedupe-inband.asciidoc
 create mode 100644 cmds-dedupe-ib.c
 create mode 100644 dedupe-ib.h

-- 
2.12.0





[PATCH v10 1/5] btrfs-progs: Basic framework for dedupe-inband command group

2017-03-15 Thread Qu Wenruo
Add a basic ioctl header and command group framework for later use,
along with basic man page documentation.

Signed-off-by: Qu Wenruo 
---
 Documentation/Makefile.in  |  1 +
 Documentation/btrfs-dedupe-inband.asciidoc | 40 +
 Documentation/btrfs.asciidoc   |  4 +++
 Makefile   |  2 +-
 btrfs.c|  2 ++
 cmds-dedupe-ib.c   | 48 ++
 commands.h |  2 ++
 dedupe-ib.h| 41 +
 ioctl.h| 36 ++
 9 files changed, 175 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/btrfs-dedupe-inband.asciidoc
 create mode 100644 cmds-dedupe-ib.c
 create mode 100644 dedupe-ib.h

diff --git a/Documentation/Makefile.in b/Documentation/Makefile.in
index 539c6b55..f175ae1e 100644
--- a/Documentation/Makefile.in
+++ b/Documentation/Makefile.in
@@ -28,6 +28,7 @@ MAN8_TXT += btrfs-qgroup.asciidoc
 MAN8_TXT += btrfs-replace.asciidoc
 MAN8_TXT += btrfs-restore.asciidoc
 MAN8_TXT += btrfs-property.asciidoc
+MAN8_TXT += btrfs-dedupe-inband.asciidoc
 
 # Category 5 manual page
 MAN5_TXT += btrfs-man5.asciidoc
diff --git a/Documentation/btrfs-dedupe-inband.asciidoc 
b/Documentation/btrfs-dedupe-inband.asciidoc
new file mode 100644
index ..9ee2bc75
--- /dev/null
+++ b/Documentation/btrfs-dedupe-inband.asciidoc
@@ -0,0 +1,40 @@
+btrfs-dedupe(8)
+==
+
+NAME
+
+btrfs-dedupe-inband - manage in-band (write time) de-duplication of a btrfs
+filesystem
+
+SYNOPSIS
+
+*btrfs dedupe-inband*  
+
+DESCRIPTION
+---
+*btrfs dedupe-inband* is used to enable/disable or show current in-band 
de-duplication
+status of a btrfs filesystem.
+
+Kernel support for in-band de-duplication starts from 4.8.
+
+WARNING: In-band de-duplication is still an experimental feature of btrfs,
+use with caution.
+
+SUBCOMMAND
+--
+Nothing yet
+
+EXIT STATUS
+---
+*btrfs dedupe-inband* returns a zero exit status if it succeeds. Non zero is
+returned in case of failure.
+
+AVAILABILITY
+
+*btrfs* is part of btrfs-progs.
+Please refer to the btrfs wiki http://btrfs.wiki.kernel.org for
+further details.
+
+SEE ALSO
+
+`mkfs.btrfs`(8),
diff --git a/Documentation/btrfs.asciidoc b/Documentation/btrfs.asciidoc
index 100a6adf..64fc0d2c 100644
--- a/Documentation/btrfs.asciidoc
+++ b/Documentation/btrfs.asciidoc
@@ -50,6 +50,10 @@ COMMANDS
Do off-line check on a btrfs filesystem. +
See `btrfs-check`(8) for details.
 
+*dedupe*::
+   Control btrfs in-band(write time) de-duplication. +
+   See `btrfs-dedupe`(8) for details.
+
 *device*::
Manage devices managed by btrfs, including add/delete/scan and so
on. +
diff --git a/Makefile b/Makefile
index 67fbc483..24445493 100644
--- a/Makefile
+++ b/Makefile
@@ -102,7 +102,7 @@ cmds_objects = cmds-subvolume.o cmds-filesystem.o 
cmds-device.o cmds-scrub.o \
   cmds-restore.o cmds-rescue.o chunk-recover.o super-recover.o \
   cmds-property.o cmds-fi-usage.o cmds-inspect-dump-tree.o \
   cmds-inspect-dump-super.o cmds-inspect-tree-stats.o cmds-fi-du.o 
\
-  mkfs/common.o
+  mkfs/common.o cmds-dedupe-ib.o
 libbtrfs_objects = send-stream.o send-utils.o kernel-lib/rbtree.o btrfs-list.o 
\
   kernel-lib/crc32c.o \
   uuid-tree.o utils-lib.o rbtree-utils.o
diff --git a/btrfs.c b/btrfs.c
index 9214ae6e..1f055d75 100644
--- a/btrfs.c
+++ b/btrfs.c
@@ -201,6 +201,8 @@ static const struct cmd_group btrfs_cmd_group = {
{ "quota", cmd_quota, NULL, _cmd_group, 0 },
{ "qgroup", cmd_qgroup, NULL, _cmd_group, 0 },
{ "replace", cmd_replace, NULL, _cmd_group, 0 },
+   { "dedupe-inband", cmd_dedupe_ib, NULL, _ib_cmd_group,
+   0 },
{ "help", cmd_help, cmd_help_usage, NULL, 0 },
{ "version", cmd_version, cmd_version_usage, NULL, 0 },
NULL_CMD_STRUCT
diff --git a/cmds-dedupe-ib.c b/cmds-dedupe-ib.c
new file mode 100644
index ..f4d31386
--- /dev/null
+++ b/cmds-dedupe-ib.c
@@ -0,0 +1,48 @@
+/*
+ * Copyright (C) 2017 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ 

[PATCH v10 2/5] btrfs-progs: dedupe: Add enable command for dedupe command group

2017-03-15 Thread Qu Wenruo
Add the enable subcommand for the dedupe command group.

Signed-off-by: Qu Wenruo 
---
 Documentation/btrfs-dedupe-inband.asciidoc | 114 ++-
 btrfs-completion   |   6 +-
 cmds-dedupe-ib.c   | 225 +
 ioctl.h|   2 +
 4 files changed, 345 insertions(+), 2 deletions(-)

diff --git a/Documentation/btrfs-dedupe-inband.asciidoc 
b/Documentation/btrfs-dedupe-inband.asciidoc
index 9ee2bc75..82f970a6 100644
--- a/Documentation/btrfs-dedupe-inband.asciidoc
+++ b/Documentation/btrfs-dedupe-inband.asciidoc
@@ -22,7 +22,119 @@ use with caution.
 
 SUBCOMMAND
 --
-Nothing yet
+*enable* [options] ::
+Enable in-band de-duplication for a filesystem.
++
+`Options`
++
+-f|--force
+Force the 'enable' command to be executed.
+Will skip the memory limit check and allow 'enable' to be executed even if
+in-band de-duplication is already enabled.
++
+NOTE: If re-enabling dedupe with the '-f' option, any unspecified parameter
+will be reset to its default value.
+
+-s|--storage-backend 
+Specify de-duplication hash storage backend.
+Currently only the 'inmemory' backend is supported.
+If not specified, the default value is 'inmemory'.
++
+Refer to the *BACKENDS* section for more information.
+
+-b|--blocksize 
+Specify dedupe block size.
+Supported values are powers of 2 from '16K' to '8M'.
+Default value is '128K'.
++
+Refer to the *DEDUPE BLOCK SIZE* section for more information.
+
+-a|--hash-algorithm 
+Specify hash algorithm.
+Currently only 'sha256' is supported.
+
+-l|--limit-hash 
+Specify maximum number of hashes stored in memory.
+Only works for 'inmemory' backend.
+Conflicts with '-m' option.
++
+Only positive values are valid.
+Default value is '32K'.
+
+-m|--limit-memory 
+Specify maximum memory used for hashes.
+Only works for 'inmemory' backend.
+Conflicts with '-l' option.
++
+Only values larger than or equal to '1024' are valid.
+No default value.
++
+NOTE: Memory limit will be rounded down to kernel internal hash size,
+so the memory limit shown in 'btrfs dedupe status' may be different
+from the .
+
+WARNING: Too large a value for '-l' or '-m' will easily trigger OOM.
+Please use with caution according to system memory.
+
+NOTE: In-band de-duplication is not compatible with compression yet.
+And compression has a higher priority than in-band de-duplication, which means
+that if compression and de-duplication are enabled at the same time, only
+compression will work.
+
+BACKENDS
+
+Btrfs in-band de-duplication will support different storage backends, with
+different use cases and features.
+
+In-memory backend::
+This backend provides backward-compatibility, and more fine-tuning options.
+But hash pool is non-persistent and may exhaust kernel memory if not setup
+properly.
++
+This backend can be used on old btrfs(without '-O dedupe' mkfs option).
+When used on old btrfs, this backend needs to be enabled manually after mount.
++
+Designed for fast hash search speed, in-memory backend will keep all dedupe
+hashes in memory. (Although overall performance is still much the same with
+'ondisk' backend if all 'ondisk' hash can be cached in memory)
++
+And it only keeps a limited number of hashes in memory to avoid exhausting memory.
+Hashes over the limit will be dropped following least-recently-used behavior.
+So this backend has a consistent overhead for a given limit but can\'t ensure
+all duplicated blocks will be de-duplicated.
++
+After umount and mount, the in-memory backend needs to refill its hash pool.
+
+On-disk backend::
+This backend provides persistent hash pool, with more smart memory management
+for hash pool.
+But it\'s not backward-compatible, meaning it must be used with the '-O dedupe'
+mkfs option and older kernels can\'t mount it read-write.
++
+Designed for de-duplication rate, hash pool is stored as btrfs B+ tree on disk.
+This behavior may cause extra disk IO for hash search under high memory
+pressure.
++
+After umount and mount, on-disk backend still has its hash on disk, no need to
+refill its dedupe hash pool.
+
+Currently, only 'inmemory' backend is supported in btrfs-progs.
+
+DEDUPE BLOCK SIZE
+
+In-band de-duplication is done at dedupe block size.
+Any data smaller than dedupe block size won\'t go through in-band
+de-duplication.
+
+And dedupe block size affects dedupe rate and fragmentation heavily.
+
+Smaller block size will cause more fragments, but higher dedupe rate.
+
+Larger block size will cause less fragments, but lower dedupe rate.
+
+In-band de-duplication rate is highly related to the workload pattern.
+So it\'s highly recommended to align dedupe block size to the workload
+block size to make full use of de-duplication.
 
 EXIT STATUS
 ---
diff --git a/btrfs-completion b/btrfs-completion
index 3ede77b6..50f7ea2b 100644
--- a/btrfs-completion
+++ b/btrfs-completion
@@ -29,7 +29,7 @@ _btrfs()
 
local cmd=${words[1]}
 
-commands='subvolume filesystem balance 

[PATCH 3/3] fstests: btrfs: Test inband dedupe with data balance.

2017-03-15 Thread Qu Wenruo
Btrfs balance will relocate data extents, but their dedupe hashes are removed
too late, at run_delayed_ref() time, which will cause extent refs to increase
during balance, making either find_data_references() hit its WARN_ON()
or even run_delayed_refs() fail and abort the transaction.

Add such a concurrency test for inband dedupe and data balance.

Signed-off-by: Qu Wenruo 
---
 tests/btrfs/203 | 109 
 tests/btrfs/203.out |   3 ++
 tests/btrfs/group   |   1 +
 3 files changed, 113 insertions(+)
 create mode 100755 tests/btrfs/203
 create mode 100644 tests/btrfs/203.out

diff --git a/tests/btrfs/203 b/tests/btrfs/203
new file mode 100755
index ..aea756cb
--- /dev/null
+++ b/tests/btrfs/203
@@ -0,0 +1,109 @@
+#! /bin/bash
+# FS QA Test 203
+#
+# Btrfs inband dedupe with balance concurrency test
+#
+# This can spot an inband dedupe error which will increase the delayed ref
+# on a data extent inside a RO block group
+#
+#---
+# Copyright (c) 2016 Fujitsu.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   kill $populate_pid &> /dev/null
+   kill $balance_pid &> /dev/null
+   wait
+   # Check later comment for reason
+   $BTRFS_UTIL_PROG balance cancel $SCRATCH_MNT &> /dev/null
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/reflink
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_cp_reflink
+_require_btrfs_command dedupe
+_require_btrfs_fs_feature dedupe
+
+dedupe_bs=128k
+file_size_in_kilo=4096
+init_file=$SCRATCH_MNT/foo
+run_time=$((60 * $TIME_FACTOR))
+
+_scratch_mkfs >> $seqres.full 2>&1
+_scratch_mount
+
+do_dedupe_balance_test()
+{
+   _run_btrfs_util_prog dedupe enable -b $dedupe_bs -s inmemory 
$SCRATCH_MNT
+
+   # create the initial file and fill hash pool
+   $XFS_IO_PROG -f -c "pwrite -S 0x0 -b $dedupe_bs 0 $dedupe_bs" -c 
"fsync" \
+   $init_file | _filter_xfs_io
+
+   _btrfs_stress_balance $SCRATCH_MNT >/dev/null 2>&1 &
+   balance_pid=$!
+
+   # Populate fs with all 0 data, to trigger enough in-band dedupe work
+   # to race with balance
+   _populate_fs -n 5 -f 1000 -d 1 -r $SCRATCH_MNT \
+   -s $file_size_in_kilo &> /dev/null &
+   populate_pid=$!
+
+   sleep $run_time
+
+   kill $populate_pid
+   kill $balance_pid
+   wait
+
+   # Sometimes even we killed $balance_pid and wait returned,
+   # balance may still be running, use balance cancel to wait it.
+   # As this is just a workaround, we don't want it pollute seqres
+   # so call $BTRFS_UTIL_PROG directly
+   $BTRFS_UTIL_PROG balance cancel $SCRATCH_MNT &> /dev/null
+
+   rm $SCRATCH_MNT/* -rf &> /dev/null
+   _run_btrfs_util_prog dedupe disable $SCRATCH_MNT
+}
+
+do_dedupe_balance_test
+
+# success, all done
+status=0
+exit
diff --git a/tests/btrfs/203.out b/tests/btrfs/203.out
new file mode 100644
index ..404394c3
--- /dev/null
+++ b/tests/btrfs/203.out
@@ -0,0 +1,3 @@
+QA output created by 203
+wrote 131072/131072 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
diff --git a/tests/btrfs/group b/tests/btrfs/group
index f87d995c..2ef7a498 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -144,3 +144,4 @@
 138 auto compress
 200 auto ib-dedupe
 201 auto ib-dedupe
+203 auto ib-dedupe balance
-- 
2.12.0





[PATCH v10 5/5] btrfs-progs: dedupe: introduce reconfigure subcommand

2017-03-15 Thread Qu Wenruo
Introduce the reconfigure subcommand to cooperate with the new kernel ioctl
modification.

Signed-off-by: Qu Wenruo 
---
 Documentation/btrfs-dedupe-inband.asciidoc |  7 +++
 cmds-dedupe-ib.c   | 73 +++---
 2 files changed, 64 insertions(+), 16 deletions(-)

diff --git a/Documentation/btrfs-dedupe-inband.asciidoc 
b/Documentation/btrfs-dedupe-inband.asciidoc
index df068c31..5fc4bb0d 100644
--- a/Documentation/btrfs-dedupe-inband.asciidoc
+++ b/Documentation/btrfs-dedupe-inband.asciidoc
@@ -86,6 +86,13 @@ And compression has higher priority than in-band 
de-duplication, means if
 compression and de-duplication is enabled at the same time, only compression
 will work.
 
+*reconfigure* [options] ::
+Re-configure in-band de-duplication parameters of a filesystem.
++
+In-band de-duplication must be enabled first before re-configuration.
++
+[Options] are the same as for 'btrfs dedupe-inband enable'.
+
 *status* ::
 Show current in-band de-duplication status of a filesystem.
 
diff --git a/cmds-dedupe-ib.c b/cmds-dedupe-ib.c
index 5fd26009..397946fa 100644
--- a/cmds-dedupe-ib.c
+++ b/cmds-dedupe-ib.c
@@ -69,7 +69,6 @@ static const char * const cmd_dedupe_ib_enable_usage[] = {
NULL
 };
 
-
 #define report_fatal_parameter(dargs, old, member, type, err_val, fmt) \
if (dargs->member != old->member && dargs->member == (type)(err_val)) { \
error("unsupported dedupe "#member": %"#fmt"", old->member);\
@@ -92,6 +91,10 @@ static void report_parameter_error(struct 
btrfs_ioctl_dedupe_args *dargs,
}
report_option_parameter(dargs, old, flags, u8, -1, x);
}
+   if (dargs->status == 0 && old->cmd == BTRFS_DEDUPE_CTL_RECONF) {
+   error("must enable dedupe before reconfiguration");
+   return;
+   }
report_fatal_parameter(dargs, old, cmd, u16, -1, u);
report_fatal_parameter(dargs, old, blocksize, u64, -1, llu);
report_fatal_parameter(dargs, old, backend, u16, -1, u);
@@ -102,14 +105,17 @@ static void report_parameter_error(struct 
btrfs_ioctl_dedupe_args *dargs,
return;
 }
 
-static int cmd_dedupe_ib_enable(int argc, char **argv)
+static int enable_reconfig_dedupe(int argc, char **argv, int reconf)
 {
int ret;
int fd = -1;
char *path;
u64 blocksize = BTRFS_DEDUPE_BLOCKSIZE_DEFAULT;
+   int blocksize_set = 0;
u16 hash_algo = BTRFS_DEDUPE_HASH_SHA256;
+   int hash_algo_set = 0;
u16 backend = BTRFS_DEDUPE_BACKEND_INMEMORY;
+   int backend_set = 0;
u64 limit_nr = 0;
u64 limit_mem = 0;
u64 sys_mem = 0;
@@ -131,20 +137,22 @@ static int cmd_dedupe_ib_enable(int argc, char **argv)
{ NULL, 0, NULL, 0}
};
 
-   c = getopt_long(argc, argv, "s:b:a:l:m:", long_options, NULL);
+   c = getopt_long(argc, argv, "s:b:a:l:m:f", long_options, NULL);
if (c < 0)
break;
switch (c) {
case 's':
-   if (!strcasecmp("inmemory", optarg))
+   if (!strcasecmp("inmemory", optarg)) {
backend = BTRFS_DEDUPE_BACKEND_INMEMORY;
-   else {
+   backend_set = 1;
+   } else {
error("unsupported dedupe backend: %s", optarg);
exit(1);
}
break;
case 'b':
blocksize = parse_size(optarg);
+   blocksize_set = 1;
break;
case 'a':
if (strcmp("sha256", optarg)) {
@@ -224,26 +232,40 @@ static int cmd_dedupe_ib_enable(int argc, char **argv)
return 1;
}
memset(&dargs, -1, sizeof(dargs));
-   dargs.cmd = BTRFS_DEDUPE_CTL_ENABLE;
-   dargs.blocksize = blocksize;
-   dargs.hash_algo = hash_algo;
-   dargs.limit_nr = limit_nr;
-   dargs.limit_mem = limit_mem;
-   dargs.backend = backend;
-   if (force)
-   dargs.flags |= BTRFS_DEDUPE_FLAG_FORCE;
-   else
-   dargs.flags = 0;
+   if (reconf) {
+   dargs.cmd = BTRFS_DEDUPE_CTL_RECONF;
+   if (blocksize_set)
+   dargs.blocksize = blocksize;
+   if (hash_algo_set)
+   dargs.hash_algo = hash_algo;
+   if (backend_set)
+   dargs.backend = backend;
+   dargs.limit_nr = limit_nr;
+   dargs.limit_mem = limit_mem;
+   } else {
+   dargs.cmd = BTRFS_DEDUPE_CTL_ENABLE;
+   dargs.blocksize = blocksize;
+   dargs.hash_algo = hash_algo;
+   dargs.limit_nr = limit_nr;
+   dargs.limit_mem = 

[PATCH v10 4/5] btrfs-progs: dedupe: Add status subcommand

2017-03-15 Thread Qu Wenruo
Add status subcommand for dedupe command group.

Signed-off-by: Qu Wenruo 
---
 Documentation/btrfs-dedupe-inband.asciidoc |  3 ++
 btrfs-completion   |  2 +-
 cmds-dedupe-ib.c   | 81 ++
 3 files changed, 85 insertions(+), 1 deletion(-)

diff --git a/Documentation/btrfs-dedupe-inband.asciidoc 
b/Documentation/btrfs-dedupe-inband.asciidoc
index de32eb97..df068c31 100644
--- a/Documentation/btrfs-dedupe-inband.asciidoc
+++ b/Documentation/btrfs-dedupe-inband.asciidoc
@@ -86,6 +86,9 @@ And compression has higher priority than in-band 
de-duplication, means if
 compression and de-duplication is enabled at the same time, only compression
 will work.
 
+*status* ::
+Show current in-band de-duplication status of a filesystem.
+
 BACKENDS
 
 Btrfs in-band de-duplication will support different storage backends, with
diff --git a/btrfs-completion b/btrfs-completion
index 9a6c73ba..fbaae0cc 100644
--- a/btrfs-completion
+++ b/btrfs-completion
@@ -40,7 +40,7 @@ _btrfs()
 commands_property='get set list'
 commands_quota='enable disable rescan'
 commands_qgroup='assign remove create destroy show limit'
-commands_dedupe='enable disable'
+commands_dedupe='enable disable status'
 commands_replace='start status cancel'
 
if [[ "$cur" == -* && $cword -le 3 && "$cmd" != "help" ]]; then
diff --git a/cmds-dedupe-ib.c b/cmds-dedupe-ib.c
index a8b10924..5fd26009 100644
--- a/cmds-dedupe-ib.c
+++ b/cmds-dedupe-ib.c
@@ -299,12 +299,93 @@ out:
return 0;
 }
 
+static const char * const cmd_dedupe_ib_status_usage[] = {
+   "btrfs dedupe status ",
+   "Show current in-band(write time) de-duplication status of a btrfs.",
+   NULL
+};
+
+static int cmd_dedupe_ib_status(int argc, char **argv)
+{
+   struct btrfs_ioctl_dedupe_args dargs;
+   DIR *dirstream;
+   char *path;
+   int fd;
+   int ret;
+   int print_limit = 1;
+
+   if (check_argc_exact(argc, 2))
+   usage(cmd_dedupe_ib_status_usage);
+
+   path = argv[1];
+   fd = open_file_or_dir(path, &dirstream);
+   if (fd < 0) {
+   error("failed to open file or directory: %s", path);
+   ret = 1;
+   goto out;
+   }
+   memset(&dargs, 0, sizeof(dargs));
+   dargs.cmd = BTRFS_DEDUPE_CTL_STATUS;
+
+   ret = ioctl(fd, BTRFS_IOC_DEDUPE_CTL, &dargs);
+   if (ret < 0) {
+   error("failed to get inband deduplication status: %s",
+ strerror(errno));
+   ret = 1;
+   goto out;
+   }
+   ret = 0;
+   if (dargs.status == 0) {
+   printf("Status: \t\t\tDisabled\n");
+   goto out;
+   }
+   printf("Status:\t\t\tEnabled\n");
+
+   if (dargs.hash_algo == BTRFS_DEDUPE_HASH_SHA256)
+   printf("Hash algorithm:\t\tSHA-256\n");
+   else
+   printf("Hash algorithm:\t\tUnrecognized(%x)\n",
+   dargs.hash_algo);
+
+   if (dargs.backend == BTRFS_DEDUPE_BACKEND_INMEMORY) {
+   printf("Backend:\t\tIn-memory\n");
+   print_limit = 1;
+   } else  {
+   printf("Backend:\t\tUnrecognized(%x)\n",
+   dargs.backend);
+   }
+
+   printf("Dedup Blocksize:\t%llu\n", dargs.blocksize);
+
+   if (print_limit) {
+   u64 cur_mem;
+
+   /* Limit nr may be 0 */
+   if (dargs.limit_nr)
+   cur_mem = dargs.current_nr * (dargs.limit_mem /
+   dargs.limit_nr);
+   else
+   cur_mem = 0;
+
+   printf("Number of hash: \t[%llu/%llu]\n", dargs.current_nr,
+   dargs.limit_nr);
+   printf("Memory usage: \t\t[%s/%s]\n",
+   pretty_size(cur_mem),
+   pretty_size(dargs.limit_mem));
+   }
+out:
+   close_file_or_dir(fd, dirstream);
+   return ret;
+}
+
 const struct cmd_group dedupe_ib_cmd_group = {
dedupe_ib_cmd_group_usage, dedupe_ib_cmd_group_info, {
{ "enable", cmd_dedupe_ib_enable, cmd_dedupe_ib_enable_usage,
  NULL, 0},
{ "disable", cmd_dedupe_ib_disable, cmd_dedupe_ib_disable_usage,
  NULL, 0},
+   { "status", cmd_dedupe_ib_status, cmd_dedupe_ib_status_usage,
+ NULL, 0},
NULL_CMD_STRUCT
}
 };
-- 
2.12.0





[PATCH 0/3] Btrfs in-band de-duplication test cases

2017-03-15 Thread Qu Wenruo
Btrfs in-band de-duplication test cases for the in-memory backend, which cover
the bugs exposed during development.

Qu Wenruo (3):
  fstests: btrfs: Add basic test for btrfs in-band de-duplication
  fstests: btrfs: Add testcase for btrfs dedupe and metadata balance
race test
  fstests: btrfs: Test inband dedupe with data balance.

 common/defrag   |  13 ++
 tests/btrfs/200 | 116 
 tests/btrfs/200.out |  22 ++
 tests/btrfs/201 | 112 ++
 tests/btrfs/201.out |   2 +
 tests/btrfs/203 | 109 
 tests/btrfs/203.out |   3 ++
 tests/btrfs/group   |   4 ++
 8 files changed, 381 insertions(+)
 create mode 100755 tests/btrfs/200
 create mode 100644 tests/btrfs/200.out
 create mode 100755 tests/btrfs/201
 create mode 100644 tests/btrfs/201.out
 create mode 100755 tests/btrfs/203
 create mode 100644 tests/btrfs/203.out

-- 
2.12.0





[PATCH v10 3/5] btrfs-progs: dedupe: Add disable support for inband dedupelication

2017-03-15 Thread Qu Wenruo
Add disable subcommand for dedupe command group.

Signed-off-by: Qu Wenruo 
---
 Documentation/btrfs-dedupe-inband.asciidoc |  5 
 btrfs-completion   |  2 +-
 cmds-dedupe-ib.c   | 42 ++
 3 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/Documentation/btrfs-dedupe-inband.asciidoc 
b/Documentation/btrfs-dedupe-inband.asciidoc
index 82f970a6..de32eb97 100644
--- a/Documentation/btrfs-dedupe-inband.asciidoc
+++ b/Documentation/btrfs-dedupe-inband.asciidoc
@@ -22,6 +22,11 @@ use with caution.
 
 SUBCOMMAND
 --
+*disable* ::
+Disable in-band de-duplication for a filesystem.
++
+This will trash all stored dedupe hashes.
++
 *enable* [options] ::
 Enable in-band de-duplication for a filesystem.
 +
diff --git a/btrfs-completion b/btrfs-completion
index 50f7ea2b..9a6c73ba 100644
--- a/btrfs-completion
+++ b/btrfs-completion
@@ -40,7 +40,7 @@ _btrfs()
 commands_property='get set list'
 commands_quota='enable disable rescan'
 commands_qgroup='assign remove create destroy show limit'
-commands_dedupe='enable'
+commands_dedupe='enable disable'
 commands_replace='start status cancel'
 
if [[ "$cur" == -* && $cword -le 3 && "$cmd" != "help" ]]; then
diff --git a/cmds-dedupe-ib.c b/cmds-dedupe-ib.c
index cc9928aa..a8b10924 100644
--- a/cmds-dedupe-ib.c
+++ b/cmds-dedupe-ib.c
@@ -259,10 +259,52 @@ out:
return ret;
 }
 
+static const char * const cmd_dedupe_ib_disable_usage[] = {
+   "btrfs dedupe disable ",
+   "Disable in-band(write time) de-duplication of a btrfs.",
+   NULL
+};
+
+static int cmd_dedupe_ib_disable(int argc, char **argv)
+{
+   struct btrfs_ioctl_dedupe_args dargs;
+   DIR *dirstream;
+   char *path;
+   int fd;
+   int ret;
+
+   if (check_argc_exact(argc, 2))
+   usage(cmd_dedupe_ib_disable_usage);
+
+   path = argv[1];
+   fd = open_file_or_dir(path, &dirstream);
+   if (fd < 0) {
+   error("failed to open file or directory: %s", path);
+   return 1;
+   }
+   memset(&dargs, 0, sizeof(dargs));
+   dargs.cmd = BTRFS_DEDUPE_CTL_DISABLE;
+
+   ret = ioctl(fd, BTRFS_IOC_DEDUPE_CTL, &dargs);
+   if (ret < 0) {
+   error("failed to disable inband deduplication: %s",
+ strerror(errno));
+   ret = 1;
+   goto out;
+   }
+   ret = 0;
+
+out:
+   close_file_or_dir(fd, dirstream);
+   return 0;
+}
+
 const struct cmd_group dedupe_ib_cmd_group = {
dedupe_ib_cmd_group_usage, dedupe_ib_cmd_group_info, {
{ "enable", cmd_dedupe_ib_enable, cmd_dedupe_ib_enable_usage,
  NULL, 0},
+   { "disable", cmd_dedupe_ib_disable, cmd_dedupe_ib_disable_usage,
+ NULL, 0},
NULL_CMD_STRUCT
}
 };
-- 
2.12.0





[PATCH 1/3] fstests: btrfs: Add basic test for btrfs in-band de-duplication

2017-03-15 Thread Qu Wenruo
Add basic test for btrfs in-band de-duplication(inmemory backend), including:
1) Enable
3) Dedup rate
4) File correctness
5) Disable

Signed-off-by: Qu Wenruo 
---
 common/defrag   |  13 ++
 tests/btrfs/200 | 116 
 tests/btrfs/200.out |  22 ++
 tests/btrfs/group   |   2 +
 4 files changed, 153 insertions(+)
 create mode 100755 tests/btrfs/200
 create mode 100644 tests/btrfs/200.out

diff --git a/common/defrag b/common/defrag
index d279382f..0a41714f 100644
--- a/common/defrag
+++ b/common/defrag
@@ -59,6 +59,19 @@ _extent_count()
$XFS_IO_PROG -c "fiemap" $1 | tail -n +2 | grep -v hole | wc -l| 
$AWK_PROG '{print $1}'
 }
 
+# Get the number of unique file extents
+# Unique file extents are extents that have different on-disk bytenrs.
+# Some filesystems support reflinkat(), or in-band de-dup, which can create
+# a file whose file extents all point to the same on-disk bytenr;
+# this can be used to test whether such reflinkat() or in-band de-dup works
+_extent_count_uniq()
+{
+   file=$1
+   $XFS_IO_PROG -c "fiemap" $file >> $seqres.full 2>&1
+   $XFS_IO_PROG -c "fiemap" $file | tail -n +2 | grep -v hole |\
+   $AWK_PROG '{print $3}' | sort | uniq | wc -l
+}
+
 _check_extent_count()
 {
min=$1
diff --git a/tests/btrfs/200 b/tests/btrfs/200
new file mode 100755
index ..1b3e46fd
--- /dev/null
+++ b/tests/btrfs/200
@@ -0,0 +1,116 @@
+#! /bin/bash
+# FS QA Test 200
+#
+# Basic btrfs inband dedupe test for inmemory backend
+#
+#---
+# Copyright (c) 2016 Fujitsu.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/defrag
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_btrfs_command dedupe
+_require_btrfs_fs_feature dedupe
+
+# File size is twice the maximum file extent size of btrfs
+# So even if it falls back to non-dedupe, it will have at least 2 extents
+file_size=256m
+
+_scratch_mkfs >> $seqres.full 2>&1
+_scratch_mount
+
+do_dedupe_test()
+{
+   dedupe_bs=$1
+
+   echo "Testing inmemory dedupe backend with block size $dedupe_bs"
+   _run_btrfs_util_prog dedupe enable -f -s inmemory -b $dedupe_bs \
+   $SCRATCH_MNT
+   # do sync write to ensure dedupe hash is added into dedupe pool
+   $XFS_IO_PROG -f -c "pwrite -b $dedupe_bs 0 $dedupe_bs" -c "fsync"\
+   $SCRATCH_MNT/initial_block | _filter_xfs_io
+
+   # do sync write to ensure we can get stable fiemap later
+   $XFS_IO_PROG -f -c "pwrite -b $dedupe_bs 0 $file_size" -c "fsync"\
+   $SCRATCH_MNT/real_file | _filter_xfs_io
+
+   # Test if real_file is de-duplicated
+   nr_uniq_extents=$(_extent_count_uniq $SCRATCH_MNT/real_file)
+   nr_total_extents=$(_extent_count $SCRATCH_MNT/real_file)
+   nr_deduped_extents=$(($nr_total_extents - $nr_uniq_extents))
+
+   echo "deduped/total: $nr_deduped_extents/$nr_total_extents" \
+   >> $seqres.full
+   # Allow a small amount of dedupe miss, as commit interval or
+   # memory pressure may break a dedupe_bs block and cause
+   # small extent which won't go through dedupe routine
+   _within_tolerance "number of deduped extents" $nr_deduped_extents \
+   $nr_total_extents 5% -v
+
+   # Also check the md5sum to ensure data is not corrupted
+   md5=$(_md5_checksum $SCRATCH_MNT/real_file)
+   echo "md5sum: $md5"
+}
+
+# Test inmemory dedupe first, use 64K dedupe bs to keep compatibility
+# with 64K page size
+do_dedupe_test 64K
+
+# Test 128K(default) dedupe bs
+do_dedupe_test 128K
+
+# Test 1M dedupe bs
+do_dedupe_test 1M
+
+# Check dedupe disable
+_run_btrfs_util_prog dedupe disable $SCRATCH_MNT
+
+# success, all done
+status=0
+exit
+# Check dedupe disable
+_run_btrfs_util_prog dedupe 

Re: [PATCH 4/4] btrfs: add dummy callback for readpage_io_failed and drop checks

2017-03-15 Thread Liu Bo
On Mon, Feb 20, 2017 at 07:31:33PM +0100, David Sterba wrote:
> Make extent_io_ops::readpage_io_failed_hook callback mandatory and
> define a dummy function for btrfs_extent_io_ops. As the failed IO
> callback is not performance critical, the branch vs extra trade off does
> not hurt.
> 
> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/disk-io.c   | 2 +-
>  fs/btrfs/extent_io.c | 2 +-
>  fs/btrfs/extent_io.h | 2 +-
>  fs/btrfs/inode.c | 7 +++
>  4 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 0715b6f3f686..fbf4921f4d60 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -4658,7 +4658,7 @@ static const struct extent_io_ops btree_extent_io_ops = 
> {
>   .readpage_end_io_hook = btree_readpage_end_io_hook,
>   /* note we're sharing with inode.c for the merge bio hook */
>   .merge_bio_hook = btrfs_merge_bio_hook,
> + .readpage_io_failed_hook = btree_io_failed_hook,
>  
>   /* optional callbacks */
> - .readpage_io_failed_hook = btree_io_failed_hook,
>  };
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index f5cff93ab152..eaee7bb2ff7c 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2578,7 +2578,7 @@ static void end_bio_extent_readpage(struct bio *bio)
>   if (likely(uptodate))
>   goto readpage_ok;
>  
> - if (tree->ops && tree->ops->readpage_io_failed_hook) {
> + if (tree->ops) {
>   ret = tree->ops->readpage_io_failed_hook(page, mirror);
>   if (!ret && !bio->bi_error)
>   uptodate = 1;
> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index 5c5e2e6cfb9e..63c8cc970b1c 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -102,6 +102,7 @@ struct extent_io_ops {
>   int (*merge_bio_hook)(struct page *page, unsigned long offset,
> size_t size, struct bio *bio,
> unsigned long bio_flags);
> + int (*readpage_io_failed_hook)(struct page *page, int failed_mirror);
>  
>   /*
>* Optional hooks, called if the pointer is not NULL
> @@ -109,7 +110,6 @@ struct extent_io_ops {
>   int (*fill_delalloc)(struct inode *inode, struct page *locked_page,
>u64 start, u64 end, int *page_started,
>unsigned long *nr_written);
> - int (*readpage_io_failed_hook)(struct page *page, int failed_mirror);
>  
>   int (*writepage_start_hook)(struct page *page, u64 start, u64 end);
>   void (*writepage_end_io_hook)(struct page *page, u64 start, u64 end,
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 72faf9b5616a..a74191fa3934 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -10503,6 +10503,12 @@ static int btrfs_tmpfile(struct inode *dir, struct 
> dentry *dentry, umode_t mode)
>  
>  }
>  
> +__attribute__((const))
> +static int dummy_readpage_io_failed_hook(struct page *page, int 
> failed_mirror)
> +{
> + return 0;
> +}
> +
>  static const struct inode_operations btrfs_dir_inode_operations = {
>   .getattr= btrfs_getattr,
>   .lookup = btrfs_lookup,
> @@ -10545,6 +10551,7 @@ static const struct extent_io_ops btrfs_extent_io_ops 
> = {
>   .submit_bio_hook = btrfs_submit_bio_hook,
>   .readpage_end_io_hook = btrfs_readpage_end_io_hook,
>   .merge_bio_hook = btrfs_merge_bio_hook,
> + .readpage_io_failed_hook = dummy_readpage_io_failed_hook,

This has made us not call bio_readpage_error() to correct corrupted data...

Thanks,

-liubo
>  
>   /* optional callbacks */
>   .fill_delalloc = run_delalloc_range,
> -- 
> 2.10.1
> 


Re: Home storage with btrfs

2017-03-15 Thread Kai Krakow
Am Wed, 15 Mar 2017 23:26:32 +0100
schrieb Kai Krakow :

> Well, bugs can hit you with every filesystem. Nothing as complex as a

Meh... I fooled myself. Find the mistake... ;-)

SPOILER:

"Nothing" should be "something".

-- 
Regards,
Kai

Replies to list-only preferred.



Re: Home storage with btrfs

2017-03-15 Thread Kai Krakow
Am Wed, 15 Mar 2017 23:41:41 +0100
schrieb Kai Krakow :

> Am Wed, 15 Mar 2017 23:26:32 +0100
> schrieb Kai Krakow :
> 
> > Well, bugs can hit you with every filesystem. Nothing as complex as
> > a  
> 
> Meh... I fooled myself. Find the mistake... ;-)
> 
> SPOILER:
> 
> "Nothing" should be "something".

*doublefacepalm*

Please forget what I wrote. The original sentence is correct.

I should get some coffee or go to bed. :-\

-- 
Regards,
Kai

Replies to list-only preferred.



Re: Home storage with btrfs

2017-03-15 Thread Kai Krakow
Am Wed, 15 Mar 2017 07:55:51 + (UTC)
schrieb Duncan <1i5t5.dun...@cox.net>:

> Hérikz Nawarro posted on Mon, 13 Mar 2017 08:29:32 -0300 as excerpted:
> 
> > Today is safe to use btrfs for home storage? No raid, just secure
> > storage for some files and create snapshots from it.  
> 
> 
> I'll echo the others... but with emphasis on a few caveats the others 
> mentioned but didn't give the emphasis I thought they deserved:
> 
> 1) Btrfs is, as I repeatedly put it in post after post, "stabilizing,
> but not yet fully stable and mature."  In general, that means it's
> likely to work quite or even very well for you (as it has done for
> us) if you don't try the too unusual or get too cocky, but get too
> close to the edge and you just might find yourself over that edge.
> Don't worry too much, tho, those edges are clearly marked if you're
> paying attention, and just by asking here, you're already paying way
> more attention than too many we see here... /after/ they've found
> themselves over the edge.  That's a _very_ good sign. =:^)

Well, bugs can hit you with every filesystem. Nothing as complex as a
file system can ever be proven bug free (except FAT maybe). But as a
general-purpose-no-fancy-features-needed FS, btrfs should be on par
with other FS these days.

> 2) "Stabilizing, not fully stable and mature", means even more than
> ever, if you value your data more than the time, hassle and resources
> necessary to have backups, you HAVE them, tested and available for
> practical use should it be necessary.

This is totally not dependent on "stabilizing, not fully stable and
mature". If your data matters to you, do backups. It's that simple. If
you don't do backups, your data isn't important - by definition.

> Of course any sysadmin (and that's what you are for at least your own 
> systems if you're making this choice) worth the name will tell you
> the value of the data is really defined by the number of backups it
> has, not by any arbitrary claims to value absent those backups.  No
> backups, you simply didn't value the data enough to have them,
> whatever claims of value you might otherwise try to make.  Backups,
> you /did/ value the data.

Yes. :-)

> And of course the corollary to that first sysadmin's rule of backups
> is that an untested as restorable backup isn't yet a backup, only a 
> potential backup, because the job isn't finished and it can't be
> properly called a backup until you know you can restore from it if
> necessary.

Even more true. :-)

> And lest anyone get the wrong idea, a snapshot is /not/ a backup for 
> purposes of the above rules.  It's on the same filesystem and
> hardware media and if that goes down... you've lost it just the
> same.  And since that filesystem is still stabilizing, you really
> must be even more prepared for it to go down, even if the chances are
> still quite good it won't.

A good backup should follow the 3-2-1 rule: Have 3 different backup
copies, 2 different media, and store at least 1 copy external/off-site.

For customers, we usually deploy a strategy like this for Windows
machines: Do one local backup using Windows Image Backup to a local
NAS to backup from inside the VM, use a different software to do image
backups from outside of the VM to the local NAS, mirror the "outside
image" to a remote location (cloud storage). And keep some backup
history. Overwriting the one existing backup with a new one won't gain
you anything. All involved software should be able to do efficient
delta backups, otherwise mirroring offsite may be no fun.

In linux, I'm using borgbackup and rsync to have something similar.
Using borgbackup to a local storage, and syncing it offsite with rsync
gives me the 2-1 rule part. You can get the third rule by using rsync
to also mirror the local FS off the machine. But that's usually
overkill for personal backups. Instead, I only have a third copy of
most valuable data like photos, dev stuff, documents, etc.

BTW: For me, different media also means different FS types. So a bug in
one FS wouldn't easily hit the other.

[snip]

> 4) Keep the number of snapshots per subvolume under tight control as 
> already suggested.  A few hundred, NOT a few thousand.  Easy enough
> if you do those snapshots manually, but easy enough to get thousands
> if you're not paying attention to thin out the old ones and using an 
> automated tool such as snapper.

Borgbackup is so fast and storage efficient that you could run it easily
multiple times per day. That in turn means I don't need to rely on
regular snapshots to undo mistakes. I only use snapshots before doing
some knowingly risky stuff to have fast recovery. But that's all,
nothing else should need snapshots (unless you are doing more
advanced stuff like container cloning, VM instance spawning, ...).

> 5) Stay away from quotas.  Either you need the feature and thus need
> a more mature filesystem where it's actually stable and does what it
> says on the label, or you don't, in which 

[PATCH 1/8] nowait aio: Introduce IOCB_RW_FLAG_NOWAIT

2017-03-15 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

This flag informs kernel to bail out if an AIO request will block
for reasons such as file allocations, or a writeback triggered,
or would block while allocating requests while performing
direct I/O.

Unfortunately, aio_flags is not checked for validity. If we
add the flags to aio_flags, it would break existing applications
which have it set to anything besides zero or IOCB_FLAG_RESFD.
So, we are using aio_reserved1 and renaming it to aio_rw_flags.

IOCB_RW_FLAG_NOWAIT is translated to IOCB_NOWAIT for
iocb->ki_flags.
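
For illustration only (this is not part of the patch): a minimal sketch of
how a submitter could use the new flag through the raw AIO syscalls,
assuming the uapi change above is applied. The helper name below is made
up for the example, and depending on where submission would block, the
-EAGAIN may come back either from io_submit() itself or in the completion
event's res field, so a robust caller should check both and defer the
request to a worker thread.

#include <linux/aio_abi.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>

/* Submit one O_DIRECT pwrite that must not block; caller handles EAGAIN. */
static int submit_nowait_pwrite(aio_context_t ctx, int fd, void *buf,
				size_t len, off_t off)
{
	struct iocb cb;
	struct iocb *cbs[1] = { &cb };

	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;			/* fd opened with O_DIRECT */
	cb.aio_lio_opcode = IOCB_CMD_PWRITE;
	cb.aio_buf = (__u64)(unsigned long)buf;	/* suitably aligned buffer */
	cb.aio_nbytes = len;
	cb.aio_offset = off;
	cb.aio_rw_flags = IOCB_RW_FLAG_NOWAIT;	/* new flag from this patch */

	/* raw syscall: returns number of iocbs submitted, -1/errno on error */
	return syscall(__NR_io_submit, ctx, 1, cbs) == 1 ? 0 : -1;
}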

Signed-off-by: Goldwyn Rodrigues 
---
 fs/aio.c | 10 +-
 include/linux/fs.h   |  1 +
 include/uapi/linux/aio_abi.h |  9 -
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index f52d925..41409ac 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1541,11 +1541,16 @@ static int io_submit_one(struct kioctx *ctx, struct 
iocb __user *user_iocb,
ssize_t ret;
 
/* enforce forwards compatibility on users */
-   if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2)) {
+   if (unlikely(iocb->aio_reserved2)) {
pr_debug("EINVAL: reserve field set\n");
return -EINVAL;
}
 
+   if (unlikely(iocb->aio_rw_flags & ~IOCB_RW_FLAG_NOWAIT)) {
+   pr_debug("EINVAL: aio_rw_flags set with incompatible flags\n");
+   return -EINVAL;
+   }
+
/* prevent overflows */
if (unlikely(
(iocb->aio_buf != (unsigned long)iocb->aio_buf) ||
@@ -1586,6 +1591,9 @@ static int io_submit_one(struct kioctx *ctx, struct iocb 
__user *user_iocb,
req->common.ki_flags |= IOCB_EVENTFD;
}
 
+   if (iocb->aio_rw_flags & IOCB_RW_FLAG_NOWAIT)
+   req->common.ki_flags |= IOCB_NOWAIT;
+
ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
if (unlikely(ret)) {
pr_debug("EFAULT: aio_key\n");
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7251f7b..e8d9346 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -270,6 +270,7 @@ struct writeback_control;
 #define IOCB_DSYNC (1 << 4)
 #define IOCB_SYNC  (1 << 5)
 #define IOCB_WRITE (1 << 6)
+#define IOCB_NOWAIT(1 << 7)
 
 struct kiocb {
struct file *ki_filp;
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index bb2554f..6d98cbe 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -54,6 +54,13 @@ enum {
  */
 #define IOCB_FLAG_RESFD(1 << 0)
 
+/*
+ * Flags for aio_rw_flags member of "struct iocb".
+ * IOCB_RW_FLAG_NOWAIT - Set if the user wants the iocb to fail if it
+ * would block for operations such as disk allocation.
+ */
+#define IOCB_RW_FLAG_NOWAIT(1 << 1)
+
 /* read() from /dev/aio returns these structures. */
 struct io_event {
__u64   data;   /* the data field from the iocb */
@@ -79,7 +86,7 @@ struct io_event {
 struct iocb {
/* these are internal to the kernel/libc. */
__u64   aio_data;   /* data to be returned in event's data */
-   __u32   PADDED(aio_key, aio_reserved1);
+   __u32   PADDED(aio_key, aio_rw_flags);
/* the kernel sets aio_key to the req # */
 
/* common fields */
-- 
2.10.2



[PATCH 2/8] nowait aio: Return if cannot get hold of i_rwsem

2017-03-15 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

A failure to lock i_rwsem would mean there is I/O being performed
by another thread. So, let's bail.

Reviewed-by: Christoph Hellwig 
Signed-off-by: Goldwyn Rodrigues 
---
 mm/filemap.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 1694623..e08f3b9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2982,7 +2982,12 @@ ssize_t generic_file_write_iter(struct kiocb *iocb, 
struct iov_iter *from)
struct inode *inode = file->f_mapping->host;
ssize_t ret;
 
-   inode_lock(inode);
+   if (!inode_trylock(inode)) {
+   /* Don't sleep on inode rwsem */
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EAGAIN;
+   inode_lock(inode);
+   }
ret = generic_write_checks(iocb, from);
if (ret > 0)
ret = __generic_file_write_iter(iocb, from);
-- 
2.10.2



[PATCH 3/8] nowait aio: return if direct write will trigger writeback

2017-03-15 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

Find out if the write will trigger a wait due to writeback. If yes,
return -EAGAIN.

This introduces a new function filemap_range_has_page() which
returns true if the file's mapping has a page within the range
mentioned.

Return -EINVAL for buffered AIO: there are multiple causes of
delay such as page locks, dirty throttling logic, page loading
from disk etc. which cannot be taken care of.

Signed-off-by: Goldwyn Rodrigues 
---
 include/linux/fs.h |  2 ++
 mm/filemap.c   | 50 +++---
 2 files changed, 49 insertions(+), 3 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index e8d9346..4a30e8f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2514,6 +2514,8 @@ extern int filemap_fdatawait(struct address_space *);
 extern void filemap_fdatawait_keep_errors(struct address_space *);
 extern int filemap_fdatawait_range(struct address_space *, loff_t lstart,
   loff_t lend);
+extern int filemap_range_has_page(struct address_space *, loff_t lstart,
+  loff_t lend);
 extern int filemap_write_and_wait(struct address_space *mapping);
 extern int filemap_write_and_wait_range(struct address_space *mapping,
loff_t lstart, loff_t lend);
diff --git a/mm/filemap.c b/mm/filemap.c
index e08f3b9..c020e23 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -376,6 +376,39 @@ int filemap_flush(struct address_space *mapping)
 }
 EXPORT_SYMBOL(filemap_flush);
 
+/**
+ * filemap_range_has_page - check if a page exists in range.
+ * @mapping:   address space structure to wait for
+ * @start_byte:offset in bytes where the range starts
+ * @end_byte:  offset in bytes where the range ends (inclusive)
+ *
+ * Find at least one page in the range supplied, usually used to check if
+ * direct writing in this range will trigger a writeback.
+ */
+int filemap_range_has_page(struct address_space *mapping,
+   loff_t start_byte, loff_t end_byte)
+{
+   pgoff_t index = start_byte >> PAGE_SHIFT;
+   pgoff_t end = end_byte >> PAGE_SHIFT;
+   struct pagevec pvec;
+   int ret;
+
+   if (end_byte < start_byte)
+   return 0;
+
+   if (mapping->nrpages == 0)
+   return 0;
+
+   pagevec_init(&pvec, 0);
+   ret = pagevec_lookup(&pvec, mapping, index, 1);
+   if (!ret)
+   return 0;
+   ret = (pvec.pages[0]->index <= end);
+   pagevec_release(&pvec);
+   return ret;
+}
+EXPORT_SYMBOL(filemap_range_has_page);
+
 static int __filemap_fdatawait_range(struct address_space *mapping,
 loff_t start_byte, loff_t end_byte)
 {
@@ -2640,6 +2673,9 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, 
struct iov_iter *from)
 
pos = iocb->ki_pos;
 
+   if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
+   return -EINVAL;
+
if (limit != RLIM_INFINITY) {
if (iocb->ki_pos >= limit) {
send_sig(SIGXFSZ, current, 0);
@@ -2709,9 +2745,17 @@ generic_file_direct_write(struct kiocb *iocb, struct 
iov_iter *from)
write_len = iov_iter_count(from);
end = (pos + write_len - 1) >> PAGE_SHIFT;
 
-   written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 
1);
-   if (written)
-   goto out;
+   if (iocb->ki_flags & IOCB_NOWAIT) {
+   /* If there are pages to writeback, return */
+   if (filemap_range_has_page(inode->i_mapping, pos,
+  pos + iov_iter_count(from)))
+   return -EAGAIN;
+   } else {
+   written = filemap_write_and_wait_range(mapping, pos,
+   pos + write_len - 1);
+   if (written)
+   goto out;
+   }
 
/*
 * After a write we want buffered reads to be sure to go to disk to get
-- 
2.10.2



[PATCH 8/8] nowait aio: btrfs

2017-03-15 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

Return EAGAIN if any of the following conditions is true
 + i_rwsem is not lockable
 + NODATACOW or PREALLOC is not set
 + Cannot nocow at the desired location
 + Writing beyond end of file which is not allocated

Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/file.c  | 25 -
 fs/btrfs/inode.c |  3 +++
 2 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 520cb72..a870e5d 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1823,12 +1823,29 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
ssize_t num_written = 0;
bool sync = (file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host);
ssize_t err;
-   loff_t pos;
-   size_t count;
+   loff_t pos = iocb->ki_pos;
+   size_t count = iov_iter_count(from);
loff_t oldsize;
int clean_page = 0;
 
-   inode_lock(inode);
+   if ((iocb->ki_flags & IOCB_NOWAIT) &&
+   (iocb->ki_flags & IOCB_DIRECT)) {
+   /* Don't sleep on inode rwsem */
+   if (!inode_trylock(inode))
+   return -EAGAIN;
+   /*
+* We will allocate space in case nodatacow is not set,
+* so bail
+*/
+   if (!(BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
+ BTRFS_INODE_PREALLOC)) ||
+   check_can_nocow(BTRFS_I(inode), pos, &count) <= 0) {
+   inode_unlock(inode);
+   return -EAGAIN;
+   }
+   } else
+   inode_lock(inode);
+
err = generic_write_checks(iocb, from);
if (err <= 0) {
inode_unlock(inode);
@@ -1862,8 +1879,6 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 */
update_time_for_write(inode);
 
-   pos = iocb->ki_pos;
-   count = iov_iter_count(from);
start_pos = round_down(pos, fs_info->sectorsize);
oldsize = i_size_read(inode);
if (start_pos > oldsize) {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c40060c..788bb93 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8613,6 +8613,9 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct 
iov_iter *iter)
dio_data.overwrite = 1;
inode_unlock(inode);
relock = true;
+   } else if (iocb->ki_flags & IOCB_NOWAIT) {
+   ret = -EAGAIN;
+   goto out;
}
ret = btrfs_delalloc_reserve_space(inode, offset, count);
if (ret)
-- 
2.10.2



[PATCH 7/8] nowait aio: xfs

2017-03-15 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

If IOCB_NOWAIT is set, bail if the i_rwsem is not lockable
immediately.

If IOMAP_NOWAIT is set, return EAGAIN in xfs_file_iomap_begin
if it needs allocation, either due to file extension, writing to a hole,
or CoW, or if it would wait for other DIOs to finish.

Signed-off-by: Goldwyn Rodrigues 
---
 fs/xfs/xfs_file.c  | 15 +++
 fs/xfs/xfs_iomap.c | 13 +
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 35703a8..08a5eef 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -541,8 +541,11 @@ xfs_file_dio_aio_write(
iolock = XFS_IOLOCK_SHARED;
}
 
-   xfs_ilock(ip, iolock);
-
+   if (!xfs_ilock_nowait(ip, iolock)) {
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EAGAIN;
+   xfs_ilock(ip, iolock);
+   }
ret = xfs_file_aio_write_checks(iocb, from, &iolock);
if (ret)
goto out;
@@ -553,9 +556,13 @@ xfs_file_dio_aio_write(
 * otherwise demote the lock if we had to take the exclusive lock
 * for other reasons in xfs_file_aio_write_checks.
 */
-   if (unaligned_io)
+   if (unaligned_io) {
+   /* If we are going to wait for other DIO to finish, bail */
+   if ((iocb->ki_flags & IOCB_NOWAIT) &&
+   atomic_read(&inode->i_dio_count))
+   return -EAGAIN;
inode_dio_wait(inode);
-   else if (iolock == XFS_IOLOCK_EXCL) {
+   } else if (iolock == XFS_IOLOCK_EXCL) {
xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
iolock = XFS_IOLOCK_SHARED;
}
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 288ee5b..6843725 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1015,6 +1015,11 @@ xfs_file_iomap_begin(
 
if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) {
if (flags & IOMAP_DIRECT) {
+   /* A reflinked inode will result in CoW alloc */
+   if (flags & IOMAP_NOWAIT) {
+   error = -EAGAIN;
+   goto out_unlock;
+   }
/* may drop and re-acquire the ilock */
error = xfs_reflink_allocate_cow(ip, &imap, &shared,
&lockmode);
@@ -1032,6 +1037,14 @@ xfs_file_iomap_begin(
 
if ((flags & IOMAP_WRITE) && imap_needs_alloc(inode, , nimaps)) {
/*
+* If nowait is set bail since we are going to make
+* allocations.
+*/
+   if (flags & IOMAP_NOWAIT) {
+   error = -EAGAIN;
+   goto out_unlock;
+   }
+   /*
 * We cap the maximum length we map here to MAX_WRITEBACK_PAGES
 * pages to keep the chunks of work done where somewhat 
symmetric
 * with the work writeback does. This is a completely arbitrary
-- 
2.10.2



[PATCH 5/8] nowait aio: return on congested block device

2017-03-15 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

A new flag BIO_NOWAIT is introduced to identify bios
originating from an iocb with IOCB_NOWAIT. This flag indicates
to return immediately if a request cannot be made instead
of retrying.

Signed-off-by: Goldwyn Rodrigues 
---
 block/blk-core.c  | 12 ++--
 block/blk-mq-sched.c  |  3 +++
 block/blk-mq.c|  4 
 fs/direct-io.c| 11 +--
 include/linux/bio.h   |  6 ++
 include/linux/blk_types.h |  1 +
 6 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 0eeb99e..2e5cba2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1232,6 +1232,11 @@ static struct request *get_request(struct request_queue 
*q, unsigned int op,
if (!IS_ERR(rq))
return rq;
 
+   if (bio && bio_flagged(bio, BIO_NOWAIT)) {
+   blk_put_rl(rl);
+   return ERR_PTR(-EAGAIN);
+   }
+
if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) 
{
blk_put_rl(rl);
return rq;
@@ -2014,7 +2019,7 @@ blk_qc_t generic_make_request(struct bio *bio)
do {
struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
-   if (likely(blk_queue_enter(q, false) == 0)) {
+   if (likely(blk_queue_enter(q, bio_flagged(bio, BIO_NOWAIT)) == 
0)) {
struct bio_list hold;
struct bio_list lower, same;
 
@@ -2040,7 +2045,10 @@ blk_qc_t generic_make_request(struct bio *bio)
bio_list_merge(_list_on_stack, );
bio_list_merge(_list_on_stack, );
} else {
-   bio_io_error(bio);
+   if (unlikely(bio_flagged(bio, BIO_NOWAIT)))
+   bio_wouldblock_error(bio);
+   else
+   bio_io_error(bio);
}
bio = bio_list_pop(current->bio_list);
} while (bio);
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 09af8ff..40e78b5 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -119,6 +119,9 @@ struct request *blk_mq_sched_get_request(struct 
request_queue *q,
if (likely(!data->hctx))
data->hctx = blk_mq_map_queue(q, data->ctx->cpu);
 
+   if (likely(bio) && bio_flagged(bio, BIO_NOWAIT))
+   data->flags |= BLK_MQ_REQ_NOWAIT;
+
if (e) {
data->flags |= BLK_MQ_REQ_INTERNAL;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 159187a..942ce8c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1518,6 +1518,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue 
*q, struct bio *bio)
rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
if (unlikely(!rq)) {
__wbt_done(q->rq_wb, wb_acct);
+   if (bio && bio_flagged(bio, BIO_NOWAIT))
+   bio_wouldblock_error(bio);
return BLK_QC_T_NONE;
}
 
@@ -1642,6 +1644,8 @@ static blk_qc_t blk_sq_make_request(struct request_queue 
*q, struct bio *bio)
rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
if (unlikely(!rq)) {
__wbt_done(q->rq_wb, wb_acct);
+   if (bio && bio_flagged(bio, BIO_NOWAIT))
+   bio_wouldblock_error(bio);
return BLK_QC_T_NONE;
}
 
diff --git a/fs/direct-io.c b/fs/direct-io.c
index a04ebea..f6835d3 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -386,6 +386,9 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
else
bio->bi_end_io = dio_bio_end_io;
 
+   if (dio->iocb->ki_flags & IOCB_NOWAIT)
+   bio_set_flag(bio, BIO_NOWAIT);
+
sdio->bio = bio;
sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
 }
@@ -480,8 +483,12 @@ static int dio_bio_complete(struct dio *dio, struct bio 
*bio)
unsigned i;
int err;
 
-   if (bio->bi_error)
-   dio->io_error = -EIO;
+   if (bio->bi_error) {
+   if (bio_flagged(bio, BIO_NOWAIT))
+   dio->io_error = -EAGAIN;
+   else
+   dio->io_error = -EIO;
+   }
 
if (dio->is_async && dio->op == REQ_OP_READ && dio->should_dirty) {
err = bio->bi_error;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8e52119..1a92707 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -425,6 +425,12 @@ static inline void bio_io_error(struct bio *bio)
bio_endio(bio);
 }
 
+static inline void bio_wouldblock_error(struct bio *bio)
+{
+   bio->bi_error = -EAGAIN;
+   bio_endio(bio);
+}
+
 struct request_queue;
 extern int bio_phys_segments(struct request_queue *, struct bio *);
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 

[PATCH 6/8] nowait aio: ext4

2017-03-15 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

Return EAGAIN for direct I/O if any of the following conditions is true:
 + i_rwsem cannot be locked without blocking
 + the write extends beyond the end of the file (will trigger allocation)
 + blocks are not allocated at the write location

Signed-off-by: Goldwyn Rodrigues 
---
 fs/ext4/file.c | 48 +++-
 1 file changed, 31 insertions(+), 17 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 8210c1f..e223b9f 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -127,27 +127,22 @@ ext4_unaligned_aio(struct inode *inode, struct iov_iter 
*from, loff_t pos)
return 0;
 }
 
-/* Is IO overwriting allocated and initialized blocks? */
-static bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len)
+/* Are IO blocks allocated */
+static bool ext4_blocks_mapped(struct inode *inode, loff_t pos, loff_t len,
+   struct ext4_map_blocks *map)
 {
-   struct ext4_map_blocks map;
unsigned int blkbits = inode->i_blkbits;
int err, blklen;
 
if (pos + len > i_size_read(inode))
return false;
 
-   map.m_lblk = pos >> blkbits;
-   map.m_len = EXT4_MAX_BLOCKS(len, pos, blkbits);
-   blklen = map.m_len;
+   map->m_lblk = pos >> blkbits;
+   map->m_len = EXT4_MAX_BLOCKS(len, pos, blkbits);
+   blklen = map->m_len;
 
-   err = ext4_map_blocks(NULL, inode, &map, 0);
-   /*
-* 'err==len' means that all of the blocks have been preallocated,
-* regardless of whether they have been initialized or not. To exclude
-* unwritten extents, we need to check m_flags.
-*/
-   return err == blklen && (map.m_flags & EXT4_MAP_MAPPED);
+   err = ext4_map_blocks(NULL, inode, map, 0);
+   return err == blklen;
 }
 
 static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
@@ -204,6 +199,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter 
*from)
 {
struct inode *inode = file_inode(iocb->ki_filp);
int o_direct = iocb->ki_flags & IOCB_DIRECT;
+   int nowait = iocb->ki_flags & IOCB_NOWAIT;
int unaligned_aio = 0;
int overwrite = 0;
ssize_t ret;
@@ -216,7 +212,13 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter 
*from)
return ext4_dax_write_iter(iocb, from);
 #endif
 
-   inode_lock(inode);
+   if (o_direct && nowait) {
+   if (!inode_trylock(inode))
+   return -EAGAIN;
+   } else {
+   inode_lock(inode);
+   }
+
ret = ext4_write_checks(iocb, from);
if (ret <= 0)
goto out;
@@ -235,9 +237,21 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter 
*from)
 
iocb->private = &overwrite;
/* Check whether we do a DIO overwrite or not */
-   if (o_direct && ext4_should_dioread_nolock(inode) && !unaligned_aio &&
-   ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from)))
-   overwrite = 1;
+   if (o_direct && !unaligned_aio) {
+   struct ext4_map_blocks map;
+   if (ext4_blocks_mapped(inode, iocb->ki_pos,
+ iov_iter_count(from), &map)) {
+   /* To exclude unwritten extents, we need to check
+* m_flags.
+*/
+   if (ext4_should_dioread_nolock(inode) &&
+   (map.m_flags & EXT4_MAP_MAPPED))
+   overwrite = 1;
+   } else if (iocb->ki_flags & IOCB_NOWAIT) {
+   ret = -EAGAIN;
+   goto out;
+   }
+   }
 
ret = __generic_file_write_iter(iocb, from);
inode_unlock(inode);
-- 
2.10.2



[PATCH 4/8] nowait-aio: Introduce IOMAP_NOWAIT

2017-03-15 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

IOCB_NOWAIT translates to IOMAP_NOWAIT for iomaps.
This is used by XFS in the XFS patch.

Signed-off-by: Goldwyn Rodrigues 
---
 fs/iomap.c| 2 ++
 include/linux/iomap.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/fs/iomap.c b/fs/iomap.c
index 141c3cd..d1c8175 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -885,6 +885,8 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
} else {
dio->flags |= IOMAP_DIO_WRITE;
flags |= IOMAP_WRITE;
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   flags |= IOMAP_NOWAIT;
}
 
if (mapping->nrpages) {
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 7291810..53f6af8 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -51,6 +51,7 @@ struct iomap {
 #define IOMAP_REPORT   (1 << 2) /* report extent status, e.g. FIEMAP */
 #define IOMAP_FAULT(1 << 3) /* mapping for page fault */
 #define IOMAP_DIRECT   (1 << 4) /* direct I/O */
+#define IOMAP_NOWAIT   (1 << 5) /* Don't wait for writeback */
 
 struct iomap_ops {
/*
-- 
2.10.2



[PATCH 0/8 v3] No wait AIO

2017-03-15 Thread Goldwyn Rodrigues
Formerly known as non-blocking AIO.

This series adds nonblocking feature to asynchronous I/O writes.
io_submit() can be delayed because of a number of reason:
 - Block allocation for files
 - Data writebacks for direct I/O
 - Sleeping because of waiting to acquire i_rwsem
 - Congested block device

The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
any of these conditions are met. This way userspace can push most
of the write()s to the kernel, which completes them to the best of its
ability, and if a write returns -EAGAIN, userspace can defer it to another thread.

In order to enable this, IOCB_RW_FLAG_NOWAIT is introduced in
uapi/linux/aio_abi.h. If set for aio_rw_flags, it translates to
IOCB_NOWAIT for struct iocb, BIO_NOWAIT for bio and IOMAP_NOWAIT for
iomap. aio_rw_flags is a new field replacing aio_reserved1. We could
not use aio_flags because the kernel does not currently validate it.

This feature is provided for direct I/O of asynchronous I/O only. I have
tested it against xfs, ext4, and btrfs.

Changes since v1:
 + changed name from _NONBLOCKING to *_NOWAIT
 + filemap_range_has_page call moved to closer to (just before) calling 
filemap_write_and_wait_range().
 + BIO_NOWAIT limited to get_request()
 + XFS fixes 
- included reflink 
- use of xfs_ilock_nowait() instead of a XFS_IOLOCK_NONBLOCKING flag
- Translate the flag through IOMAP_NOWAIT (iomap) to check for
  block allocation for the file.
 + ext4 coding style

Changes since v2:
 + Using aio_reserved1 as aio_rw_flags instead of aio_flags
 + blk-mq support
 + xfs uptodate with kernel and reflink changes

-- 
Goldwyn




Re: [PATCH v2 0/7] cleanup __btrfs_map_block

2017-03-15 Thread Liu Bo
On Wed, Mar 15, 2017 at 02:07:53PM +0100, David Sterba wrote:
> On Tue, Mar 14, 2017 at 01:33:54PM -0700, Liu Bo wrote:
> > This is attempting to make __btrfs_map_block less scary :)
> > 
> > The major changes are
> > 
> > 1) split operations for discard out of __btrfs_map_block and we don't copy
> > discard operations for the target device of dev replace since they're not
> > as important as writes.
> > 
> > 2) put dev replace stuff into helpers since they're basically
> > self-contained.
> 
> Thank, I'm going to add the branch to 4.12 queue (right now the branch
> is misc-next but it could change),
> 
> https://marc.info/?l=linux-btrfs=148741582021588
> 
> and fix that one too.

Oh, sorry about that, copy-and-paste...

Thanks,

-liubo


possible deadlock between fsfreeze and asynchronous faults

2017-03-15 Thread Nikolay Borisov
Hello, 

Here is a nother lockdep splat I got: 

[ 1131.517411] ==
[ 1131.518059] [ INFO: possible circular locking dependency detected ]
[ 1131.518059] 4.11.0-rc1-nbor #147 Tainted: GW  
[ 1131.518059] ---
[ 1131.518059] xfs_io/2661 is trying to acquire lock:
[ 1131.518059]  (sb_internal#2){.+}, at: [] 
percpu_down_write+0x25/0x120
[ 1131.518059] 
[ 1131.518059] but task is already holding lock:
[ 1131.518059]  (sb_pagefaults){..}, at: [] 
percpu_down_write+0x25/0x120
[ 1131.518059] 
[ 1131.518059] which lock already depends on the new lock.
[ 1131.518059] 
[ 1131.518059] 
[ 1131.518059] the existing dependency chain (in reverse order) is:
[ 1131.518059] 
[ 1131.518059] -> #4 (sb_pagefaults){..}:
[ 1131.518059]lock_acquire+0xc5/0x220
[ 1131.518059]__sb_start_write+0x119/0x1d0
[ 1131.518059]btrfs_page_mkwrite+0x51/0x420
[ 1131.518059]do_page_mkwrite+0x38/0xb0
[ 1131.518059]__handle_mm_fault+0x6b5/0xef0
[ 1131.518059]handle_mm_fault+0x175/0x300
[ 1131.518059]__do_page_fault+0x1e0/0x4d0
[ 1131.518059]trace_do_page_fault+0xaa/0x270
[ 1131.518059]do_async_page_fault+0x19/0x70
[ 1131.518059]async_page_fault+0x28/0x30
[ 1131.518059] 
[ 1131.518059] -> #3 (&mm->mmap_sem){++}:
[ 1131.518059]lock_acquire+0xc5/0x220
[ 1131.518059]down_read+0x47/0x70
[ 1131.518059]get_user_pages_unlocked+0x4f/0x1a0
[ 1131.518059]get_user_pages_fast+0x81/0x170
[ 1131.518059]iov_iter_get_pages+0xc1/0x300
[ 1131.518059]__blockdev_direct_IO+0x14f8/0x34e0
[ 1131.518059]btrfs_direct_IO+0x1e8/0x390
[ 1131.518059]generic_file_direct_write+0xb5/0x160
[ 1131.518059]btrfs_file_write_iter+0x26d/0x500
[ 1131.518059]aio_write+0xdb/0x190
[ 1131.518059]do_io_submit+0x5aa/0x830
[ 1131.518059]SyS_io_submit+0x10/0x20
[ 1131.518059]entry_SYSCALL_64_fastpath+0x23/0xc6
[ 1131.518059] 
[ 1131.518059] -> #2 (&ei->dio_sem){.+}:
[ 1131.518059]lock_acquire+0xc5/0x220
[ 1131.518059]down_write+0x44/0x80
[ 1131.518059]btrfs_log_changed_extents+0x7c/0x660
[ 1131.518059]btrfs_log_inode+0xb78/0xf50
[ 1131.518059]btrfs_log_inode_parent+0x2a9/0xa70
[ 1131.518059]btrfs_log_dentry_safe+0x74/0xa0
[ 1131.518059]btrfs_sync_file+0x321/0x4d0
[ 1131.518059]vfs_fsync_range+0x46/0xc0
[ 1131.518059]vfs_fsync+0x1c/0x20
[ 1131.518059]do_fsync+0x38/0x60
[ 1131.518059]SyS_fsync+0x10/0x20
[ 1131.518059]entry_SYSCALL_64_fastpath+0x23/0xc6
[ 1131.518059] 
[ 1131.518059] -> #1 (&ei->log_mutex){+.+...}:
[ 1131.518059]lock_acquire+0xc5/0x220
[ 1131.518059]__mutex_lock+0x7c/0x960
[ 1131.518059]mutex_lock_nested+0x1b/0x20
[ 1131.518059]btrfs_record_unlink_dir+0x3e/0xb0
[ 1131.518059]btrfs_unlink+0x72/0xf0
[ 1131.518059]vfs_unlink+0xbe/0x1b0
[ 1131.518059]do_unlinkat+0x244/0x280
[ 1131.518059]SyS_unlinkat+0x1d/0x30
[ 1131.518059]entry_SYSCALL_64_fastpath+0x23/0xc6
[ 1131.518059] 
[ 1131.518059] -> #0 (sb_internal#2){.+}:
[ 1131.518059]__lock_acquire+0x16f1/0x17c0
[ 1131.518059]lock_acquire+0xc5/0x220
[ 1131.518059]down_write+0x44/0x80
[ 1131.518059]percpu_down_write+0x25/0x120
[ 1131.518059]freeze_super+0xbf/0x1a0
[ 1131.518059]do_vfs_ioctl+0x598/0x770
[ 1131.518059]SyS_ioctl+0x4c/0x90
[ 1131.518059]entry_SYSCALL_64_fastpath+0x23/0xc6
[ 1131.518059] 
[ 1131.518059] other info that might help us debug this:
[ 1131.518059] 
[ 1131.518059] Chain exists of:
[ 1131.518059]   sb_internal#2 --> &mm->mmap_sem --> sb_pagefaults
[ 1131.518059] 
[ 1131.518059]  Possible unsafe locking scenario:
[ 1131.518059] 
[ 1131.518059]CPU0CPU1
[ 1131.518059]
[ 1131.518059]   lock(sb_pagefaults);
[ 1131.518059]lock(&mm->mmap_sem);
[ 1131.518059]lock(sb_pagefaults);
[ 1131.518059]   lock(sb_internal#2);
[ 1131.518059] 
[ 1131.518059]  *** DEADLOCK ***
[ 1131.518059] 
[ 1131.518059] 3 locks held by xfs_io/2661:
[ 1131.518059]  #0:  (sb_writers#11){.+}, at: [] 
percpu_down_write+0x25/0x120
[ 1131.518059]  #1:  (&type->s_umount_key#33){+.}, at: [] 
freeze_super+0x93/0x1a0
[ 1131.518059]  #2:  (sb_pagefaults){..}, at: [] 
percpu_down_write+0x25/0x120
[ 1131.518059] 
[ 1131.518059] stack backtrace:
[ 1131.518059] CPU: 0 PID: 2661 Comm: xfs_io Tainted: GW   
4.11.0-rc1-nbor #147
[ 1131.518059] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 1131.518059] Call Trace:
[ 1131.518059]  dump_stack+0x85/0xc9
[ 1131.518059]  print_circular_bug+0x2ac/0x2ba
[ 1131.518059]  __lock_acquire+0x16f1/0x17c0
[ 

[PATCH 5/7] btrfs: remove redundant parameter from reada_find_zone

2017-03-15 Thread David Sterba
We can read fs_info from dev.

Signed-off-by: David Sterba 
---
 fs/btrfs/reada.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index 5edf7328f67d..c1fc79cd4b2a 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -235,10 +235,10 @@ int btree_readahead_hook(struct extent_buffer *eb, int 
err)
return ret;
 }
 
-static struct reada_zone *reada_find_zone(struct btrfs_fs_info *fs_info,
- struct btrfs_device *dev, u64 logical,
+static struct reada_zone *reada_find_zone(struct btrfs_device *dev, u64 
logical,
  struct btrfs_bio *bbio)
 {
+   struct btrfs_fs_info *fs_info = dev->fs_info;
int ret;
struct reada_zone *zone;
struct btrfs_block_group_cache *cache = NULL;
@@ -372,7 +372,7 @@ static struct reada_extent *reada_find_extent(struct 
btrfs_fs_info *fs_info,
 if (!dev->bdev)
continue;
 
-   zone = reada_find_zone(fs_info, dev, logical, bbio);
+   zone = reada_find_zone(dev, logical, bbio);
if (!zone)
continue;
 
-- 
2.12.0



[PATCH 1/7] btrfs: preallocate radix tree node for readahead

2017-03-15 Thread David Sterba
We can preallocate the node so insertion does not have to do that under
the lock. The GFP flags for the per-device radix tree are initialized to
 GFP_NOFS & ~__GFP_DIRECT_RECLAIM
but we can use GFP_KERNEL, the same as the allocation above, and also
because readahead is optional and not on any critical writeout path.
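
For reference, the locking pattern the patch follows looks roughly like the
sketch below (illustration only, the tree/lock/item names are placeholders,
not btrfs code): preload outside the lock, insert under the lock, then end
the preload.

#include <linux/radix-tree.h>
#include <linux/spinlock.h>
#include <linux/gfp.h>

static int insert_preloaded(struct radix_tree_root *tree, spinlock_t *lock,
			    unsigned long index, void *item)
{
	int ret;

	ret = radix_tree_preload(GFP_KERNEL);	/* may sleep, no lock held */
	if (ret)
		return ret;

	spin_lock(lock);
	ret = radix_tree_insert(tree, index, item); /* uses preloaded nodes */
	spin_unlock(lock);

	radix_tree_preload_end();		/* release the per-cpu preload */
	return ret;
}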

Signed-off-by: David Sterba 
---
 fs/btrfs/reada.c   | 7 +++
 fs/btrfs/volumes.c | 2 +-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index e88bca87f5d2..fdae8ca79401 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -270,6 +270,12 @@ static struct reada_zone *reada_find_zone(struct 
btrfs_fs_info *fs_info,
if (!zone)
return NULL;
 
+   ret = radix_tree_preload(GFP_KERNEL);
+   if (ret) {
+   kfree(zone);
+   return NULL;
+   }
+
zone->start = start;
zone->end = end;
INIT_LIST_HEAD(&zone->list);
@@ -299,6 +305,7 @@ static struct reada_zone *reada_find_zone(struct 
btrfs_fs_info *fs_info,
zone = NULL;
}
spin_unlock(&fs_info->reada_lock);
+   radix_tree_preload_end();
 
return zone;
 }
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 73d56eef5e60..f158b8657ae3 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -247,7 +247,7 @@ static struct btrfs_device *__alloc_device(void)
atomic_set(&dev->reada_in_flight, 0);
atomic_set(&dev->dev_stats_ccnt, 0);
btrfs_device_data_ordered_init(dev);
-   INIT_RADIX_TREE(&dev->reada_zones, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
+   INIT_RADIX_TREE(&dev->reada_zones, GFP_KERNEL);
INIT_RADIX_TREE(&dev->reada_extents, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
 
return dev;
-- 
2.12.0



[PATCH 7/7] btrfs: remove local blocksize variable in reada_find_extent

2017-03-15 Thread David Sterba
The name is misleading and the local variable serves no purpose.

Signed-off-by: David Sterba 
---
 fs/btrfs/reada.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index 91df381a60ce..64425c3fe4f5 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -318,7 +318,6 @@ static struct reada_extent *reada_find_extent(struct 
btrfs_fs_info *fs_info,
struct btrfs_bio *bbio = NULL;
struct btrfs_device *dev;
struct btrfs_device *prev_dev;
-   u32 blocksize;
u64 length;
int real_stripes;
int nzones = 0;
@@ -339,7 +338,6 @@ static struct reada_extent *reada_find_extent(struct 
btrfs_fs_info *fs_info,
if (!re)
return NULL;
 
-   blocksize = fs_info->nodesize;
re->logical = logical;
re->top = *top;
INIT_LIST_HEAD(&re->extctl);
@@ -349,10 +347,10 @@ static struct reada_extent *reada_find_extent(struct 
btrfs_fs_info *fs_info,
/*
 * map block
 */
-   length = blocksize;
+   length = fs_info->nodesize;
ret = btrfs_map_block(fs_info, BTRFS_MAP_GET_READ_MIRRORS, logical,
&length, &bbio, 0);
-   if (ret || !bbio || length < blocksize)
+   if (ret || !bbio || length < fs_info->nodesize)
goto error;
 
if (bbio->num_stripes > BTRFS_MAX_MIRRORS) {
-- 
2.12.0



[PATCH 4/7] btrfs: remove redundant parameter from btree_readahead_hook

2017-03-15 Thread David Sterba
We can read fs_info from eb.

Signed-off-by: David Sterba 
---
 fs/btrfs/ctree.h   | 3 +--
 fs/btrfs/disk-io.c | 4 ++--
 fs/btrfs/reada.c   | 4 ++--
 3 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 29b7fc28c607..173fac68323a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3671,8 +3671,7 @@ struct reada_control *btrfs_reada_add(struct btrfs_root 
*root,
  struct btrfs_key *start, struct btrfs_key *end);
 int btrfs_reada_wait(void *handle);
 void btrfs_reada_detach(void *handle);
-int btree_readahead_hook(struct btrfs_fs_info *fs_info,
-struct extent_buffer *eb, int err);
+int btree_readahead_hook(struct extent_buffer *eb, int err);
 
 static inline int is_fstree(u64 rootid)
 {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1d4c30327247..995b28179af9 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -762,7 +762,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio 
*io_bio,
 err:
if (reads_done &&
test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
-   btree_readahead_hook(fs_info, eb, ret);
+   btree_readahead_hook(eb, ret);
 
if (ret) {
/*
@@ -787,7 +787,7 @@ static int btree_io_failed_hook(struct page *page, int 
failed_mirror)
eb->read_mirror = failed_mirror;
atomic_dec(&eb->io_pages);
if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
-   btree_readahead_hook(eb->fs_info, eb, -EIO);
+   btree_readahead_hook(eb, -EIO);
return -EIO;/* we fixed nothing */
 }
 
diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index 4c5a9b241cab..5edf7328f67d 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -209,9 +209,9 @@ static void __readahead_hook(struct btrfs_fs_info *fs_info,
return;
 }
 
-int btree_readahead_hook(struct btrfs_fs_info *fs_info,
-struct extent_buffer *eb, int err)
+int btree_readahead_hook(struct extent_buffer *eb, int err)
 {
+   struct btrfs_fs_info *fs_info = eb->fs_info;
int ret = 0;
struct reada_extent *re;
 
-- 
2.12.0



[PATCH 3/7] btrfs: preallocate radix tree node for global readahead tree

2017-03-15 Thread David Sterba
We can preallocate the node so insertion does not have to do that under
the lock. The GFP flags for the global radix tree are initialized to
 GFP_NOFS & ~__GFP_DIRECT_RECLAIM
but we can use GFP_KERNEL, because readahead is optional and not on any
critical writeout path.

Signed-off-by: David Sterba 
---
 fs/btrfs/disk-io.c | 2 +-
 fs/btrfs/reada.c   | 7 +++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 08b74daf35d0..1d4c30327247 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2693,7 +2693,7 @@ int open_ctree(struct super_block *sb,
fs_info->commit_interval = BTRFS_DEFAULT_COMMIT_INTERVAL;
fs_info->avg_delayed_ref_runtime = NSEC_PER_SEC >> 6; /* div by 64 */
/* readahead state */
-   INIT_RADIX_TREE(&fs_info->reada_tree, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
+   INIT_RADIX_TREE(&fs_info->reada_tree, GFP_KERNEL);
spin_lock_init(&fs_info->reada_lock);
 
fs_info->thread_pool_size = min_t(unsigned long,
diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index dd78af5d265d..4c5a9b241cab 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -391,6 +391,10 @@ static struct reada_extent *reada_find_extent(struct 
btrfs_fs_info *fs_info,
goto error;
}
 
+   ret = radix_tree_preload(GFP_KERNEL);
+   if (ret)
+   goto error;
+
/* insert extent in reada_tree + all per-device trees, all or nothing */
btrfs_dev_replace_lock(&fs_info->dev_replace, 0);
spin_lock(&fs_info->reada_lock);
@@ -400,13 +404,16 @@ static struct reada_extent *reada_find_extent(struct 
btrfs_fs_info *fs_info,
re_exist->refcnt++;
spin_unlock(&fs_info->reada_lock);
btrfs_dev_replace_unlock(&fs_info->dev_replace, 0);
+   radix_tree_preload_end();
goto error;
}
if (ret) {
spin_unlock(&fs_info->reada_lock);
btrfs_dev_replace_unlock(&fs_info->dev_replace, 0);
+   radix_tree_preload_end();
goto error;
}
+   radix_tree_preload_end();
prev_dev = NULL;
dev_replace_is_ongoing = btrfs_dev_replace_is_ongoing(
&fs_info->dev_replace);
-- 
2.12.0



[PATCH 0/7] Readahead cleanups

2017-03-15 Thread David Sterba
I spotted some GFP_NOFS uses in readahead and converted them to GFP_KERNEL with
a few cleanups along the way.

David Sterba (7):
  btrfs: preallocate radix tree node for readahead
  btrfs: use simpler readahead zone lookups
  btrfs: preallocate radix tree node for global readahead tree
  btrfs: remove redundant parameter from btree_readahead_hook
  btrfs: remove redundant parameter from reada_find_zone
  btrfs: remove redundant parameter from reada_start_machine_dev
  btrfs: remove local blocksize variable in reada_find_extent

 fs/btrfs/ctree.h   |  3 +-
 fs/btrfs/disk-io.c |  6 ++--
 fs/btrfs/reada.c   | 89 --
 fs/btrfs/volumes.c |  2 +-
 4 files changed, 51 insertions(+), 49 deletions(-)

-- 
2.12.0



[PATCH 2/7] btrfs: use simpler readahead zone lookups

2017-03-15 Thread David Sterba
No point using radix_tree_gang_lookup if we're looking up just one slot.

Signed-off-by: David Sterba 
---
 fs/btrfs/reada.c | 52 ++--
 1 file changed, 22 insertions(+), 30 deletions(-)

diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index fdae8ca79401..dd78af5d265d 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -246,11 +246,9 @@ static struct reada_zone *reada_find_zone(struct 
btrfs_fs_info *fs_info,
u64 end;
int i;
 
-   zone = NULL;
spin_lock(&fs_info->reada_lock);
-   ret = radix_tree_gang_lookup(&dev->reada_zones, (void **)&zone,
-logical >> PAGE_SHIFT, 1);
-   if (ret == 1 && logical >= zone->start && logical <= zone->end) {
+   zone = radix_tree_lookup(&dev->reada_zones, logical >> PAGE_SHIFT);
+   if (zone && logical >= zone->start && logical <= zone->end) {
kref_get(&zone->refcnt);
spin_unlock(&fs_info->reada_lock);
return zone;
@@ -297,9 +295,9 @@ static struct reada_zone *reada_find_zone(struct 
btrfs_fs_info *fs_info,
 
if (ret == -EEXIST) {
kfree(zone);
-   ret = radix_tree_gang_lookup(&dev->reada_zones, (void **)&zone,
-logical >> PAGE_SHIFT, 1);
-   if (ret == 1 && logical >= zone->start && logical <= zone->end)
+   zone = radix_tree_lookup(&dev->reada_zones,
+   logical >> PAGE_SHIFT);
+   if (zone && logical >= zone->start && logical <= zone->end)
kref_get(&zone->refcnt);
else
zone = NULL;
@@ -604,7 +602,6 @@ static int reada_pick_zone(struct btrfs_device *dev)
u64 top_elems = 0;
u64 top_locked_elems = 0;
unsigned long index = 0;
-   int ret;
 
if (dev->reada_curr_zone) {
reada_peer_zones_set_lock(dev->reada_curr_zone, 0);
@@ -615,9 +612,8 @@ static int reada_pick_zone(struct btrfs_device *dev)
while (1) {
struct reada_zone *zone;
 
-   ret = radix_tree_gang_lookup(&dev->reada_zones,
-(void **)&zone, index, 1);
-   if (ret == 0)
+   zone = radix_tree_lookup(&dev->reada_zones, index);
+   if (!zone)
break;
index = (zone->end >> PAGE_SHIFT) + 1;
if (zone->locked) {
@@ -669,19 +665,18 @@ static int reada_start_machine_dev(struct btrfs_fs_info 
*fs_info,
 * a contiguous block of extents, we could also coagulate them or use
 * plugging to speed things up
 */
-   ret = radix_tree_gang_lookup(&dev->reada_extents, (void **)&re,
-dev->reada_next >> PAGE_SHIFT, 1);
-   if (ret == 0 || re->logical > dev->reada_curr_zone->end) {
+   re = radix_tree_lookup(&dev->reada_extents,
+   dev->reada_next >> PAGE_SHIFT);
+   if (!re || re->logical > dev->reada_curr_zone->end) {
ret = reada_pick_zone(dev);
if (!ret) {
spin_unlock(&fs_info->reada_lock);
return 0;
}
-   re = NULL;
-   ret = radix_tree_gang_lookup(&dev->reada_extents, (void **)&re,
-   dev->reada_next >> PAGE_SHIFT, 1);
+   re = radix_tree_lookup(&dev->reada_extents,
+   dev->reada_next >> PAGE_SHIFT);
}
-   if (ret == 0) {
+   if (!re) {
spin_unlock(&fs_info->reada_lock);
return 0;
}
@@ -809,7 +804,6 @@ static void dump_devs(struct btrfs_fs_info *fs_info, int 
all)
struct btrfs_device *device;
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
unsigned long index;
-   int ret;
int i;
int j;
int cnt;
@@ -821,9 +815,9 @@ static void dump_devs(struct btrfs_fs_info *fs_info, int 
all)
index = 0;
while (1) {
struct reada_zone *zone;
-   ret = radix_tree_gang_lookup(&device->reada_zones,
-(void **)&zone, index, 1);
-   if (ret == 0)
+
+   zone = radix_tree_lookup(&device->reada_zones, index);
+   if (!zone)
break;
pr_debug("  zone %llu-%llu elems %llu locked %d devs",
zone->start, zone->end, zone->elems,
@@ -841,11 +835,10 @@ static void dump_devs(struct btrfs_fs_info *fs_info, int 
all)
cnt = 0;
index = 0;
while (all) {
-   struct reada_extent *re = NULL;
+   struct reada_extent *re;
 
-   ret = radix_tree_gang_lookup(&device->reada_extents,
- 

[PATCH 6/7] btrfs: remove redundant parameter from reada_start_machine_dev

2017-03-15 Thread David Sterba
We can read fs_info from dev.

Signed-off-by: David Sterba 
---
 fs/btrfs/reada.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index c1fc79cd4b2a..91df381a60ce 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -649,9 +649,9 @@ static int reada_pick_zone(struct btrfs_device *dev)
return 1;
 }
 
-static int reada_start_machine_dev(struct btrfs_fs_info *fs_info,
-  struct btrfs_device *dev)
+static int reada_start_machine_dev(struct btrfs_device *dev)
 {
+   struct btrfs_fs_info *fs_info = dev->fs_info;
struct reada_extent *re = NULL;
int mirror_num = 0;
struct extent_buffer *eb = NULL;
@@ -763,8 +763,7 @@ static void __reada_start_machine(struct btrfs_fs_info 
*fs_info)
list_for_each_entry(device, &fs_devices->devices, dev_list) {
if (atomic_read(&device->reada_in_flight) <
MAX_IN_FLIGHT)
-   enqueued += reada_start_machine_dev(fs_info,
-   device);
+   enqueued += reada_start_machine_dev(device);
}
mutex_unlock(&fs_devices->device_list_mutex);
total += enqueued;
-- 
2.12.0



[PATCH] btrfs: remove unused qgroup members from btrfs_trans_handle

2017-03-15 Thread David Sterba
The members have been effectively unused since "Btrfs: rework qgroup
accounting" (fcebe4562dec83b3), there's no substitute for
assert_qgroups_uptodate so it's removed as well.

Signed-off-by: David Sterba 
---
 fs/btrfs/extent-tree.c   |  1 -
 fs/btrfs/qgroup.c| 12 
 fs/btrfs/qgroup.h|  1 -
 fs/btrfs/tests/btrfs-tests.c |  1 -
 fs/btrfs/transaction.c   |  3 ---
 fs/btrfs/transaction.h   |  2 --
 6 files changed, 20 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index be5477676cc8..b5682abf6f68 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3003,7 +3003,6 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle 
*trans,
goto again;
}
 out:
-   assert_qgroups_uptodate(trans);
trans->can_flush_pending_bgs = can_flush_pending_bgs;
return 0;
 }
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index a5da750c1087..2fa0b10d239f 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2487,18 +2487,6 @@ void btrfs_qgroup_free_refroot(struct btrfs_fs_info 
*fs_info,
spin_unlock(&fs_info->qgroup_lock);
 }
 
-void assert_qgroups_uptodate(struct btrfs_trans_handle *trans)
-{
-   if (list_empty(&trans->qgroup_ref_list) && !trans->delayed_ref_elem.seq)
-   return;
-   btrfs_err(trans->fs_info,
-   "qgroups not uptodate in trans handle %p:  list is%s empty, seq 
is %#x.%x",
-   trans, list_empty(&trans->qgroup_ref_list) ? "" : " not",
-   (u32)(trans->delayed_ref_elem.seq >> 32),
-   (u32)trans->delayed_ref_elem.seq);
-   BUG();
-}
-
 /*
  * returns < 0 on error, 0 when more leafs are to be scanned.
  * returns 1 when done.
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 26932a8a1993..96fc56ebf55a 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -196,7 +196,6 @@ static inline void btrfs_qgroup_free_delayed_ref(struct 
btrfs_fs_info *fs_info,
btrfs_qgroup_free_refroot(fs_info, ref_root, num_bytes);
trace_btrfs_qgroup_free_delayed_ref(fs_info, ref_root, num_bytes);
 }
-void assert_qgroups_uptodate(struct btrfs_trans_handle *trans);
 
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 int btrfs_verify_qgroup_counts(struct btrfs_fs_info *fs_info, u64 qgroupid,
diff --git a/fs/btrfs/tests/btrfs-tests.c b/fs/btrfs/tests/btrfs-tests.c
index ea272432c930..b18ab8f327a5 100644
--- a/fs/btrfs/tests/btrfs-tests.c
+++ b/fs/btrfs/tests/btrfs-tests.c
@@ -237,7 +237,6 @@ void btrfs_init_dummy_trans(struct btrfs_trans_handle *trans)
 {
memset(trans, 0, sizeof(*trans));
trans->transid = 1;
-   INIT_LIST_HEAD(&trans->qgroup_ref_list);
trans->type = __TRANS_DUMMY;
 }
 
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 61b807de3e16..9db3b4ca0264 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -572,7 +572,6 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
 
h->type = type;
h->can_flush_pending_bgs = true;
-   INIT_LIST_HEAD(&h->qgroup_ref_list);
INIT_LIST_HEAD(&h->new_bgs);
 
smp_mb();
@@ -917,7 +916,6 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
wake_up_process(info->transaction_kthread);
err = -EIO;
}
-   assert_qgroups_uptodate(trans);
 
kmem_cache_free(btrfs_trans_handle_cachep, trans);
if (must_run_delayed_refs) {
@@ -2223,7 +2221,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 
switch_commit_roots(cur_trans, fs_info);
 
-   assert_qgroups_uptodate(trans);
ASSERT(list_empty(&cur_trans->dirty_bgs));
ASSERT(list_empty(&cur_trans->io_bgs));
update_super_roots(fs_info);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 5dfb5590fff6..2e560d2abdff 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -125,8 +125,6 @@ struct btrfs_trans_handle {
unsigned int type;
struct btrfs_root *root;
struct btrfs_fs_info *fs_info;
-   struct seq_list delayed_ref_elem;
-   struct list_head qgroup_ref_list;
struct list_head new_bgs;
 };
 
-- 
2.12.0



Re: [PATCH 1/2] btrfs: provide enumeration for __merge_refs mode argument

2017-03-15 Thread David Sterba
On Mon, Mar 13, 2017 at 02:32:03PM -0600, ednadol...@gmail.com wrote:
> @@ -809,14 +814,12 @@ static int __add_missing_keys(struct btrfs_fs_info *fs_info,
>  /*
>   * merge backrefs and adjust counts accordingly
>   *
> - * mode = 1: merge identical keys, if key is set
>   *FIXME: if we add more keys in __add_prelim_ref, we can merge more here.
>   *   additionally, we could even add a key range for the blocks we
>   *   looked into to merge even more (-> replace unresolved refs by 
> those
>   *   having a parent).

The 'FIXME' seems to refer to mode = 1, but now that you remove it, it's
not clear what it's referring to. Mentioning MERGE_IDENTICAL_KEYS would
be good.
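
For instance, one possible rewording of that comment (just a sketch, not the
actual follow-up; MERGE_IDENTICAL_KEYS is the name suggested above, the rest
is the existing FIXME text from the hunk quoted earlier):

/*
 * merge backrefs and adjust counts accordingly
 *
 * FIXME: if we add more keys in __add_prelim_ref, we can merge more refs
 *        here (in MERGE_IDENTICAL_KEYS mode). additionally, we could even
 *        add a key range for the blocks we looked into to merge even more
 *        (-> replace unresolved refs by those having a parent).
 */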


Re: [PATCH 2/2] btrfs: replace hardcoded value with SEQ_NONE macro

2017-03-15 Thread David Sterba
On Mon, Mar 13, 2017 at 02:32:04PM -0600, ednadol...@gmail.com wrote:
> From: Edmund Nadolski 
> 
> Define the SEQ_NONE macro to replace (u64)-1 in places where said
> value triggers a special-case ref search behavior.

> index 9c41fba..20915a6 100644
> --- a/fs/btrfs/backref.h
> +++ b/fs/btrfs/backref.h
> @@ -23,6 +23,8 @@
>  #include "ulist.h"
>  #include "extent_io.h"
>  
> +#define SEQ_NONE ((u64)-1)

Can you please move the definition to ctree.h, near line 660, where
seq_list and SEQ_LIST_INIT are defined, so they're all grouped together?
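
Roughly, the grouped result in ctree.h could then look like this (the
seq_list / SEQ_LIST_INIT lines are only an approximation of what is already
there; SEQ_NONE is the definition from this patch):

struct seq_list {
	struct list_head list;
	u64 seq;
};

#define SEQ_LIST_INIT(name)	{ .list = LIST_HEAD_INIT((name).list), .seq = 0 }

/* special-case value meaning "no sequence number filter" for ref searches */
#define SEQ_NONE	((u64)-1)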


claim of vtagfs feature

2017-03-15 Thread Alexandr Unger
Hi to everyone here!

First of all, thank you all for this amazing project!

I would like to propose a "virtual TAG file system" feature to be implemented in BTRFS.

What is it?

It is a feature to simplify using and searching data (files) by common tags.

Some tags may be derived from file attributes, like the year of creation
and/or change, so a user can access the same file via different (virtual) paths:

two different files under ~/vtagfs/root/:
firm1/reports/2016/report
firm1/reports/2017/report
(firm1/reports/2017/.tags/report, or just .tags - a system file with all
tags for this file/dir)
... and
2017/reports/firm1/report - will be a link to the second one

Also, an automatic lock on editing files with an "old year" tag could be
set by default... and so on...


How it could work:

/etc/vtagfs/ - place for global configs
~/.vtagfs/ - user space for configs

~/vtagfs/ - default or fixed place for data and links
~/vtagfs/.tags/ - place for data used by vtagfs itself
~/vtagfs/root/ - (root is a fixed dir for all data operated on by the user's
vtagfs - to be known by other programs)
~/vtagfs/root/tag1/ - place for data with tag1 (only tag1)
~/vtagfs/root/tag1/.tag - a marker file indicating that this dir is used as
a tag. It may be used to declare possible sets of tags combined with this one.

~/vtagfs/root/tag1/tag2/ - place for data with tag1 & tag2
...
~/vtagfs/tag1/tag2 - symlink shown to the user
~/vtagfs/tag1/tag3 - symlink shown to the user
~/vtagfs/tag2/tag1 - symlink shown to the user
~/vtagfs/tag2/tag3 - symlink shown to the user
~/vtagfs/tag1/tag2/tag3/ - symlink shown to the user
...
Of course, there must be something like a tagManager to define/operate on
the tags used by the user.

And/or it can be realized transparently through user/system requests like
mkdir, so mkdir tag1 inside ~/vtagfs/ will create tag1...
and cp file1 tag1/tag2/tag3/ will set these tags on file1...
... effectively doing cp file1 ~/vtagfs/root/tag1/tag2/tag3/, which will be
the real place of the file, known to the system.
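
Just to make the idea concrete, here is a tiny userspace sketch (purely
hypothetical, not related to any existing btrfs code; all names are made up)
of mapping an unordered set of tags onto the single real location under
~/vtagfs/root/, so that tag1/tag2/ and tag2/tag1/ end up pointing at the same
place:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* compare two tag strings for qsort() */
static int cmp_tag(const void *a, const void *b)
{
	return strcmp(*(char * const *)a, *(char * const *)b);
}

/* build "<root>/tagA/tagB/..." with the tags in one fixed (sorted) order */
static void canonical_tag_path(char *out, size_t outlen, const char *root,
			       char **tags, size_t ntags)
{
	qsort(tags, ntags, sizeof(*tags), cmp_tag);
	snprintf(out, outlen, "%s", root);
	for (size_t i = 0; i < ntags; i++) {
		strncat(out, "/", outlen - strlen(out) - 1);
		strncat(out, tags[i], outlen - strlen(out) - 1);
	}
}

int main(void)
{
	char *tags[] = { "reports", "2017", "firm1" };
	char path[4096];

	canonical_tag_path(path, sizeof(path), "~/vtagfs/root",
			   tags, sizeof(tags) / sizeof(tags[0]));
	/* prints ~/vtagfs/root/2017/firm1/reports regardless of tag order */
	printf("%s\n", path);
	return 0;
}

The symlinked views for all the other tag orderings would then just point at
this one canonical location.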

I hope this feature will be useful. Thank you all for your patience :)

Best regards,

Alexandr.



Re: [PATCH v2 0/7] cleanup __btrfs_map_block

2017-03-15 Thread David Sterba
On Tue, Mar 14, 2017 at 01:33:54PM -0700, Liu Bo wrote:
> This is attempting to make __btrfs_map_block less scary :)
> 
> The major changes are
> 
> 1) split operations for discard out of __btrfs_map_block and we don't copy
> discard operations for the target device of dev replace since they're not
> as important as writes.
> 
> 2) put dev replace stuff into helpers since they're basically
> self-contained.

Thanks, I'm going to add the branch to the 4.12 queue (right now the branch
is misc-next but it could change),

https://marc.info/?l=linux-btrfs&m=148741582021588

and fix that one too.


Re: Home storage with btrfs

2017-03-15 Thread Duncan
Hérikz Nawarro posted on Mon, 13 Mar 2017 08:29:32 -0300 as excerpted:

> Today is safe to use btrfs for home storage? No raid, just secure
> storage for some files and create snapshots from it.


I'll echo the others... but with emphasis on a few caveats the others 
mentioned but didn't give the emphasis I thought they deserved:

1) Btrfs is, as I repeatedly put it in post after post, "stabilizing, but 
not yet fully stable and mature."  In general, that means it's likely to 
work quite or even very well for you (as it has done for us) if you don't 
try the too unusual or get too cocky, but get too close to the edge and 
you just might find yourself over that edge.  Don't worry too much, tho, 
those edges are clearly marked if you're paying attention, and just by 
asking here, you're already paying way more attention than too many we 
see here... /after/ they've found themselves over the edge.  That's a 
_very_ good sign. =:^)

2) "Stabilizing, not fully stable and mature", means even more than ever, 
if you value your data more than the time, hassle and resources necessary 
to have backups, you HAVE them, tested and available for practical use 
should it be necessary.

Of course any sysadmin (and that's what you are for at least your own 
systems if you're making this choice) worth the name will tell you the 
value of the data is really defined by the number of backups it has, not 
by any arbitrary claims to value absent those backups.  No backups, you 
simply didn't value the data enough to have them, whatever claims of 
value you might otherwise try to make.  Backups, you /did/ value the data.

And of course the corollary to that first sysadmin's rule of backups is 
that a backup untested as restorable isn't yet a backup, only a 
potential backup, because the job isn't finished and it can't be properly 
called a backup until you know you can restore from it if necessary.

And lest anyone get the wrong idea, a snapshot is /not/ a backup for 
purposes of the above rules.  It's on the same filesystem and hardware 
media and if that goes down... you've lost it just the same.  And since 
that filesystem is still stabilizing, you really must be even more 
prepared for it to go down, even if the chances are still quite good it 
won't.

3) "Stabilizing, not fully stable and mature", also means that since the 
current best-practices code is still a moving target, you better be 
prepared to move with it.  The list-recommended kernels are the latest 
two releases of either the current or (mainline) LTS kernel series.  On 
the current track, 4.10 is out, so 4.10 and 4.9 are supported.  If you're 
still on 4.8 or earlier and can't point to a very specific known reason, 
you're behind.  On the LTS track, 4.9 is the latest LTS kernel as well, 
with 4.4 the one before that.  4.1's the one before that but that's a 
very long time ago in btrfs-development time, and while we'll generally 
still /try/ to help, honestly, our memory and thus our effectiveness at 
trying to help are going to be down dramatically from that of the 
recommended series.

If you prefer longer term "enterprise" or just Debian-stable distro 
support, fine, but honestly, the sort of stability they target doesn't 
have much in common with a still stabilizing btrfs, and the chances are 
/extremely/ high that either one or the other isn't a good match for your 
needs.   Either you want/need a more leading edge aka current distro 
which btrfs as still stabilizing fits in well with, or you want/need the 
stability of those longer term releases, and btrfs as still very actively 
stabilizing sticks out like a sore thumb and you're very likely to be 
better off on something that's actually considered stable, ext4 or xfs, 
perhaps, or my longer term stability favorite, reiserfs (which tends to 
be so stable in part because there's nobody screwing with it and messing 
things up any more; reference the period when the mainline kernel devs 
switched the otherwise quite stable ext3 to the rather less stable 
data=writeback mode, for instance).

4) Keep the number of snapshots per subvolume under tight control as 
already suggested.  A few hundred, NOT a few thousand.  Easy enough if 
you do those snapshots manually, but easy enough to get thousands if 
you're using an automated tool such as snapper and not paying attention to 
thinning out the old ones.

5) Stay away from quotas.  Either you need the feature and thus need a 
more mature filesystem where it's actually stable and does what it says 
on the label, or you don't, in which case you'll save yourself a /lot/ of 
headaches keeping them off.  Maybe someday...

6) Stay away from raid56 mode.  It has known problems ATM and is simply 
not ready.

FWIW, single-device and raid1 mode are the best tested and most reliable 
(within the single-device limitations for it, of course).  But even raid1 
mode has some caveats about rebuilding that it might be wise to 
familiarize yourself with /before/ they happen, if