Re: [PATCH 3/5] btrfs: raid56: Use correct stolen pages to calculate P/Q
On Fri, Feb 03, 2017 at 04:20:21PM +0800, Qu Wenruo wrote:
> In the following situation, scrub will calculate wrong parity to
> overwrite the correct one:
>
> RAID5 full stripe:
>
> Before:
> | Dev 1         | Dev 2         | Dev 3         |
> | Data stripe 1 | Data stripe 2 | Parity stripe |
> ------------------------------------------------- 0
> | 0x0000 (Bad)  | 0xcdcd        | 0x0000        |
> ------------------------------------------------- 4K
> | 0xcdcd        | 0xcdcd        | 0x0000        |
> ...
> | 0xcdcd        | 0xcdcd        | 0x0000        |
> ------------------------------------------------- 64K
>
> After scrubbing dev3 only:
>
> | Dev 1         | Dev 2         | Dev 3         |
> | Data stripe 1 | Data stripe 2 | Parity stripe |
> ------------------------------------------------- 0
> | 0xcdcd (Good) | 0xcdcd        | 0xcdcd (Bad)  |
> ------------------------------------------------- 4K
> | 0xcdcd        | 0xcdcd        | 0x0000        |
> ...
> | 0xcdcd        | 0xcdcd        | 0x0000        |
> ------------------------------------------------- 64K
>
> The call trace of such corruption is as follows:
>
> scrub_bio_end_io_worker() gets called for each extent read out
> |- scrub_block_complete()
>    |- Data extent csum mismatch
>    |- scrub_handle_errored_block()
>       |- scrub_recheck_block()
>          |- scrub_submit_raid56_bio_wait()
>             |- raid56_parity_recover()
>
> Now we have an rbio with correct data stripe 1 recovered.
> Let's call it "good_rbio".
>
> scrub_parity_check_and_repair()
> |- raid56_parity_submit_scrub_rbio()
>    |- lock_stripe_add()
>    |  |- steal_rbio()
>    |     |- Recovered data are stolen from "good_rbio" and stored into
>    |        rbio->stripe_pages[].
>    |        Now rbio->bio_pages[] are bad data read from disk.

At this point we should already know whether rbio->bio_pages are
corrupted, because rbio->bio_pages are indexed from the list
sparity->pages, and we only call scrub_parity_put() after finishing the
endio of reading all pages linked at sparity->pages.

Since the previous checksum failure has already triggered a recovery and
we got the correct data on that rbio, instead of adding this corrupted
page into the new rbio, it would be fine to skip it and use only
rbio->stripe_pages, which can be stolen from the previous good rbio.

Thanks,

-liubo

>    |- async_scrub_parity()
>       |- scrub_parity_work() (delayed call to scrub_parity_work)
>
> scrub_parity_work()
> |- raid56_parity_scrub_stripe()
>    |- validate_rbio_for_parity_scrub()
>       |- finish_parity_scrub()
>          |- Recalculate parity using *BAD* pages in rbio->bio_pages[],
>             so the good parity is overwritten with the *BAD* one
>
> The fix is to introduce 2 new members, bad_ondisk_a/b, to struct
> btrfs_raid_bio, to tell the scrub code to use the correct data pages to
> re-calculate parity.
>
> Reported-by: Goffredo Baroncelli
> Signed-off-by: Qu Wenruo
> ---
>  fs/btrfs/raid56.c | 62 +++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 58 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> index d2a9a1ee5361..453eefdcb591 100644
> --- a/fs/btrfs/raid56.c
> +++ b/fs/btrfs/raid56.c
> @@ -133,6 +133,16 @@ struct btrfs_raid_bio {
>  	/* second bad stripe (for raid6 use) */
>  	int failb;
>  
> +	/*
> +	 * For steal_rbio, we can steal the recovered correct page,
> +	 * but in finish_parity_scrub(), we still use the bad on-disk
> +	 * page to calculate parity.
> +	 * Use these members to tell finish_parity_scrub() to use the
> +	 * correct pages.
> +	 */
> +	int bad_ondisk_a;
> +	int bad_ondisk_b;
> +
>  	int scrubp;
>  	/*
>  	 * number of pages needed to represent the full
> @@ -310,6 +320,12 @@ static void steal_rbio(struct btrfs_raid_bio *src, struct btrfs_raid_bio *dest)
>  	if (!test_bit(RBIO_CACHE_READY_BIT, &src->flags))
>  		return;
>  
> +	/* Record recovered stripe number */
> +	if (src->faila != -1)
> +		dest->bad_ondisk_a = src->faila;
> +	if (src->failb != -1)
> +		dest->bad_ondisk_b = src->failb;
> +
>  	for (i = 0; i < dest->nr_pages; i++) {
>  		s = src->stripe_pages[i];
>  		if (!s || !PageUptodate(s)) {
> @@ -999,6 +1015,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_fs_info *fs_info,
>  	rbio->stripe_npages = stripe_npages;
>  	rbio->faila = -1;
>  	rbio->failb = -1;
> +	rbio->bad_ondisk_a = -1;
> +	rbio->bad_ondisk_b = -1;
>  	atomic_set(&rbio->refs, 1);
>  	atomic_set(&rbio->error, 0);
>  	atomic_set(&rbio->stripes_pending, 0);
> @@ -2261,6 +2279,9 @@ static int alloc_rbio_essential_pages(struct btrfs_raid_bio *rbio)
>  	int bit;
>  	int index;
>  	struct page *page;
> +	struct page *bio_page;
> +	void *ptr;
> +	void *bio_ptr;
>
[PATCH 2/2] btrfs-progs: convert: Make btrfs_reserved_ranges const
Since the btrfs_reserved_ranges array is only used to store btrfs reserved
ranges, nothing will (nor should) modify it at run time, so making it static
and const is better.

This also eliminates the use of the magic number 3.

Signed-off-by: Qu Wenruo
---
 convert/main.c      | 16 ++++++++--------
 convert/source-fs.c |  6 ------
 convert/source-fs.h |  8 ++++--
 3 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/convert/main.c b/convert/main.c
index 73c9d889..96358c62 100644
--- a/convert/main.c
+++ b/convert/main.c
@@ -218,7 +218,7 @@ static int create_image_file_range(struct btrfs_trans_handle *trans,
 	 * migrate block will fail as there is already a file extent.
 	 */
 	for (i = 0; i < ARRAY_SIZE(btrfs_reserved_ranges); i++) {
-		struct simple_range *reserved = &btrfs_reserved_ranges[i];
+		const struct simple_range *reserved = &btrfs_reserved_ranges[i];
 
 		/*
 		 * |-- reserved --|
@@ -320,7 +320,7 @@ static int migrate_one_reserved_range(struct btrfs_trans_handle *trans,
 				      struct btrfs_root *root,
 				      struct cache_tree *used,
 				      struct btrfs_inode_item *inode, int fd,
-				      u64 ino, struct simple_range *range,
+				      u64 ino, const struct simple_range *range,
 				      u32 convert_flags)
 {
 	u64 cur_off = range->start;
@@ -423,7 +423,7 @@ static int migrate_reserved_ranges(struct btrfs_trans_handle *trans,
 	int ret = 0;
 
 	for (i = 0; i < ARRAY_SIZE(btrfs_reserved_ranges); i++) {
-		struct simple_range *range = &btrfs_reserved_ranges[i];
+		const struct simple_range *range = &btrfs_reserved_ranges[i];
 
 		if (range->start > total_bytes)
 			return ret;
@@ -609,7 +609,7 @@ static int wipe_reserved_ranges(struct cache_tree *tree, u64 min_stripe_size,
 	int ret;
 
 	for (i = 0; i < ARRAY_SIZE(btrfs_reserved_ranges); i++) {
-		struct simple_range *range = &btrfs_reserved_ranges[i];
+		const struct simple_range *range = &btrfs_reserved_ranges[i];
 
 		ret = wipe_one_reserved_range(tree, range->start, range->len,
 					      min_stripe_size, ensure_size);
@@ -1370,7 +1370,7 @@ static int read_reserved_ranges(struct btrfs_root *root, u64 ino,
 	int ret = 0;
 
 	for (i = 0; i < ARRAY_SIZE(btrfs_reserved_ranges); i++) {
-		struct simple_range *range = &btrfs_reserved_ranges[i];
+		const struct simple_range *range = &btrfs_reserved_ranges[i];
 
 		if (range->start + range->len >= total_bytes)
 			break;
@@ -1395,7 +1395,7 @@ static bool is_subset_of_reserved_ranges(u64 start, u64 len)
 	bool ret = false;
 
 	for (i = 0; i < ARRAY_SIZE(btrfs_reserved_ranges); i++) {
-		struct simple_range *range = &btrfs_reserved_ranges[i];
+		const struct simple_range *range = &btrfs_reserved_ranges[i];
 
 		if (start >= range->start && start + len <= range_end(range)) {
 			ret = true;
@@ -1620,7 +1620,7 @@ static int do_rollback(const char *devname)
 	int i;
 
 	for (i = 0; i < ARRAY_SIZE(btrfs_reserved_ranges); i++) {
-		struct simple_range *range = &btrfs_reserved_ranges[i];
+		const struct simple_range *range = &btrfs_reserved_ranges[i];
 
 		reserved_ranges[i] = calloc(1, range->len);
 		if (!reserved_ranges[i]) {
@@ -1730,7 +1730,7 @@ close_fs:
 	for (i = ARRAY_SIZE(btrfs_reserved_ranges) - 1; i >= 0; i--) {
 		u64 real_size;
-		struct simple_range *range = &btrfs_reserved_ranges[i];
+		const struct simple_range *range = &btrfs_reserved_ranges[i];
 
 		if (range_end(range) >= fsize)
 			continue;
diff --git a/convert/source-fs.c b/convert/source-fs.c
index 7cf515b0..8217c893 100644
--- a/convert/source-fs.c
+++ b/convert/source-fs.c
@@ -22,12 +22,6 @@
 #include "convert/common.h"
 #include "convert/source-fs.h"
 
-struct simple_range btrfs_reserved_ranges[3] = {
-	{ 0, SZ_1M },
-	{ BTRFS_SB_MIRROR_OFFSET(1), SZ_64K },
-	{ BTRFS_SB_MIRROR_OFFSET(2), SZ_64K }
-};
-
 static int intersect_with_sb(u64 bytenr, u64 num_bytes)
 {
 	int i;
diff --git a/convert/source-fs.h b/convert/source-fs.h
index 9f611150..7aabe96b 100644
--- a/convert/source-fs.h
+++ b/convert/source-fs.h
@@ -32,7 +32,11 @@ struct simple_range {
 	u64 len;
 };
 
-extern struct simple_range btrfs_reserved_ranges[3];
+static const struct simple_range btrfs_reserved_ranges[] = {
+	{ 0, SZ_1M },
+	{ BTRFS_SB_MIRROR_OFFSET(1), SZ_64K },
+	{ BTRFS_SB_MIRROR_OFFSET(2), SZ_64K }
+};
 
 struct
[PATCH 1/2] btrfs-progs: kerncompat: Fix re-definition of __bitwise
In the latest Linux API headers, __bitwise is already defined in
/usr/include/linux/types.h, so kerncompat.h re-defines __bitwise and
causes a gcc warning.

Fix it by checking whether __bitwise is already defined.

Signed-off-by: Qu Wenruo
---
The patch is based on the devel branch with the following head:

commit 64abe9f619b614e589339046b6c45dfb8fa8e2a9
Author: David Sterba
Date:   Wed Mar 15 12:28:16 2017 +0100

    btrfs-progs: tests: misc/019, use fssum
---
 kerncompat.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kerncompat.h b/kerncompat.h
index 958bea43..fa96715f 100644
--- a/kerncompat.h
+++ b/kerncompat.h
@@ -317,11 +317,13 @@ static inline void assert_trace(const char *assertion, const char *filename,
 #define container_of(ptr, type, member) ({			\
 	const typeof( ((type *)0)->member ) *__mptr = (ptr);	\
 	(type *)( (char *)__mptr - offsetof(type,member) );})
+#ifndef __bitwise
 #ifdef __CHECKER__
 #define __bitwise __bitwise__
 #else
 #define __bitwise
-#endif
+#endif /* __CHECKER__ */
+#endif /* __bitwise */
 
 /* Alignment check */
 #define IS_ALIGNED(x, a)	(((x) & ((typeof(x))(a) - 1)) == 0)
--
2.12.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] btrfs: replace hardcoded value with SEQ_NONE macro
At 03/15/2017 10:38 PM, David Sterba wrote:
> On Mon, Mar 13, 2017 at 02:32:04PM -0600, ednadol...@gmail.com wrote:
>> From: Edmund Nadolski
>>
>> Define the SEQ_NONE macro to replace (u64)-1 in places where said
>> value triggers a special-case ref search behavior.
>>
>> index 9c41fba..20915a6 100644
>> --- a/fs/btrfs/backref.h
>> +++ b/fs/btrfs/backref.h
>> @@ -23,6 +23,8 @@
>>  #include "ulist.h"
>>  #include "extent_io.h"
>>
>> +#define SEQ_NONE ((u64)-1)

The name SEQ_NONE doesn't sound that good to me.

The (u64)-1 is there to tell the backref walker to only search the
current root, with no need to worry about delayed_refs, since the caller
(qgroup) will ensure that no delayed ref exists.

The name SEQ_NONE reads a little like 0, which is far from the original
meaning.

What about SEQ_FINAL or SEQ_LAST? Since the time we use (u64)-1 is just
before switching commit roots, it would be better for the name to
indicate that.

Thanks,
Qu

> Can you please move the definition to ctree.h, near line 660, where
> seq_list and SEQ_LIST_INIT are defined, so they're all grouped
> together?
Re: [PATCH] btrfs: remove unused qgroup members from btrfs_trans_handle
At 03/15/2017 11:17 PM, David Sterba wrote:
> The members have been effectively unused since "Btrfs: rework qgroup
> accounting" (fcebe4562dec83b3); there's no substitute for
> assert_qgroups_uptodate so it's removed as well.
>
> Signed-off-by: David Sterba

Reviewed-by: Qu Wenruo

Thanks for the cleanup,
Qu

> ---
>  fs/btrfs/extent-tree.c       |  1 -
>  fs/btrfs/qgroup.c            | 12 ------------
>  fs/btrfs/qgroup.h            |  1 -
>  fs/btrfs/tests/btrfs-tests.c |  1 -
>  fs/btrfs/transaction.c       |  3 ---
>  fs/btrfs/transaction.h       |  2 --
>  6 files changed, 20 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index be5477676cc8..b5682abf6f68 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3003,7 +3003,6 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
>  		goto again;
>  	}
>  out:
> -	assert_qgroups_uptodate(trans);
>  	trans->can_flush_pending_bgs = can_flush_pending_bgs;
>  	return 0;
>  }
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index a5da750c1087..2fa0b10d239f 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -2487,18 +2487,6 @@ void btrfs_qgroup_free_refroot(struct btrfs_fs_info *fs_info,
>  	spin_unlock(&fs_info->qgroup_lock);
>  }
>  
> -void assert_qgroups_uptodate(struct btrfs_trans_handle *trans)
> -{
> -	if (list_empty(&trans->qgroup_ref_list) && !trans->delayed_ref_elem.seq)
> -		return;
> -	btrfs_err(trans->fs_info,
> -		"qgroups not uptodate in trans handle %p: list is%s empty, seq is %#x.%x",
> -		trans, list_empty(&trans->qgroup_ref_list) ? "" : " not",
> -		(u32)(trans->delayed_ref_elem.seq >> 32),
> -		(u32)trans->delayed_ref_elem.seq);
> -	BUG();
> -}
> -
>  /*
>   * returns < 0 on error, 0 when more leafs are to be scanned.
>   * returns 1 when done.
> diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
> index 26932a8a1993..96fc56ebf55a 100644
> --- a/fs/btrfs/qgroup.h
> +++ b/fs/btrfs/qgroup.h
> @@ -196,7 +196,6 @@ static inline void btrfs_qgroup_free_delayed_ref(struct btrfs_fs_info *fs_info,
>  	btrfs_qgroup_free_refroot(fs_info, ref_root, num_bytes);
>  	trace_btrfs_qgroup_free_delayed_ref(fs_info, ref_root, num_bytes);
>  }
> -void assert_qgroups_uptodate(struct btrfs_trans_handle *trans);
>  
>  #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
>  int btrfs_verify_qgroup_counts(struct btrfs_fs_info *fs_info, u64 qgroupid,
> diff --git a/fs/btrfs/tests/btrfs-tests.c b/fs/btrfs/tests/btrfs-tests.c
> index ea272432c930..b18ab8f327a5 100644
> --- a/fs/btrfs/tests/btrfs-tests.c
> +++ b/fs/btrfs/tests/btrfs-tests.c
> @@ -237,7 +237,6 @@ void btrfs_init_dummy_trans(struct btrfs_trans_handle *trans)
>  {
>  	memset(trans, 0, sizeof(*trans));
>  	trans->transid = 1;
> -	INIT_LIST_HEAD(&trans->qgroup_ref_list);
>  	trans->type = __TRANS_DUMMY;
>  }
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 61b807de3e16..9db3b4ca0264 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -572,7 +572,6 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
>  	h->type = type;
>  	h->can_flush_pending_bgs = true;
> -	INIT_LIST_HEAD(&h->qgroup_ref_list);
>  	INIT_LIST_HEAD(&h->new_bgs);
>  
>  	smp_mb();
> @@ -917,7 +916,6 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
>  		wake_up_process(info->transaction_kthread);
>  		err = -EIO;
>  	}
> -	assert_qgroups_uptodate(trans);
>  
>  	kmem_cache_free(btrfs_trans_handle_cachep, trans);
>  	if (must_run_delayed_refs) {
> @@ -2223,7 +2221,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>  	switch_commit_roots(cur_trans, fs_info);
>  
> -	assert_qgroups_uptodate(trans);
>  	ASSERT(list_empty(&cur_trans->dirty_bgs));
>  	ASSERT(list_empty(&cur_trans->io_bgs));
>  	update_super_roots(fs_info);
> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> index 5dfb5590fff6..2e560d2abdff 100644
> --- a/fs/btrfs/transaction.h
> +++ b/fs/btrfs/transaction.h
> @@ -125,8 +125,6 @@ struct btrfs_trans_handle {
>  	unsigned int type;
>  	struct btrfs_root *root;
>  	struct btrfs_fs_info *fs_info;
> -	struct seq_list delayed_ref_elem;
> -	struct list_head qgroup_ref_list;
>  	struct list_head new_bgs;
>  };
[PATCH 2/3] fstests: btrfs: Add testcase for btrfs dedupe and metadata balance race test
Btrfs balance with inband dedupe enable/disable will expose a lot of
hidden dedupe bugs:
1) Enable/disable race bug
2) Btrfs dedupe tree balance corrupting delayed_ref
3) Btrfs dedupe disable racing with balance, causing a balance BUG_ON()

Reported-by: Satoru Takeuchi
Signed-off-by: Qu Wenruo
---
 tests/btrfs/201     | 112 ++++++++++++++++++++++++++++++++++++++++++++
 tests/btrfs/201.out |   2 +
 tests/btrfs/group   |   1 +
 3 files changed, 115 insertions(+)
 create mode 100755 tests/btrfs/201
 create mode 100644 tests/btrfs/201.out

diff --git a/tests/btrfs/201 b/tests/btrfs/201
new file mode 100755
index ..d6913c13
--- /dev/null
+++ b/tests/btrfs/201
@@ -0,0 +1,112 @@
+#! /bin/bash
+# FS QA Test 201
+#
+# Btrfs inband dedupe enable/disable race with metadata balance
+#
+# This test covers the following bugs exposed during development:
+# 1) enable/disable race
+# 2) tree balance causing delayed ref corruption
+# 3) disable and balance causing a BUG_ON()
+#
+#-----------------------------------------------------------------------
+# Copyright (c) 2016 Fujitsu.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+	killall $FSSTRESS_PROG &> /dev/null
+	kill $trigger_pid &> /dev/null
+	kill $balance_pid &> /dev/null
+	wait
+
+	# See comment later
+	$BTRFS_UTIL_PROG balance cancel $SCRATCH_MNT &> /dev/null
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_btrfs_command dedupe
+_require_btrfs_fs_feature dedupe
+
+# Use 64K dedupe size to keep compatibility for 64K page size
+dedupe_bs=64K
+
+_scratch_mkfs >> $seqres.full 2>&1
+_scratch_mount
+
+mkdir -p $SCRATCH_MNT/stressdir
+
+runtime=$((60 * $TIME_FACTOR))
+
+trigger_work()
+{
+	while true; do
+		_run_btrfs_util_prog dedupe enable -s inmemory \
+			-b $dedupe_bs $SCRATCH_MNT
+		sleep 1
+		_run_btrfs_util_prog dedupe disable $SCRATCH_MNT
+		sleep 1
+	done
+}
+
+# redirect all output, as error output like 'balance cancelled by user'
+# will pollute the golden output.
+_btrfs_stress_balance -m $SCRATCH_MNT &> /dev/null &
+balance_pid=$!
+
+$FSSTRESS_PROG $(_scale_fsstress_args -p 1 -n 1000) $FSSTRESS_AVOID \
+	-d $SCRATCH_MNT/stressdir > /dev/null 2>&1 &
+
+trigger_work &
+trigger_pid=$!
+
+sleep $runtime
+killall $FSSTRESS_PROG &> /dev/null
+kill $trigger_pid &> /dev/null
+kill $balance_pid &> /dev/null
+wait
+
+# Manually stop balance as it's possible balance is still running for a short
+# time.  And we don't want to pollute $seqres.full, so call $BTRFS_UTIL_PROG
+# directly
+$BTRFS_UTIL_PROG balance cancel $SCRATCH_MNT &> /dev/null
+
+echo "Silence is golden"
+# success, all done
+status=0
+exit
diff --git a/tests/btrfs/201.out b/tests/btrfs/201.out
new file mode 100644
index ..5ac973f5
--- /dev/null
+++ b/tests/btrfs/201.out
@@ -0,0 +1,2 @@
+QA output created by 201
+Silence is golden
diff --git a/tests/btrfs/group b/tests/btrfs/group
index bf001d3c..f87d995c 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -143,3 +143,4 @@
 137 auto quick send
 138 auto compress
 200 auto ib-dedupe
+201 auto ib-dedupe
--
2.12.0
[PATCH] fstests: generic: Test space allocation when there is only fragmented space
This test case checks whether the file system works well when handling a
large write while all available space is fragmented.

This can expose a bug in an unmerged btrfs patch, which wrongly modified
the delayed allocation code to exit before allocating all space, causing
a hang at unmount time.

The wrong patch is:
[PATCH v6 1/2] btrfs: Fix metadata underflow caused by
btrfs_reloc_clone_csum error

The test case will:
1) Fill a small filesystem with page-sized small files
   All these files have a sequential number as their file name.
2) Remove the files with an odd number as file name
   This frees almost half of the space.
3) Try to write a file which takes 1/8 of the file system

The method used to create a fragmented fs may not be generic enough, but
it should work for most extent-based filesystems, unless a filesystem
allocates extents from both ends of its free space.

Cc: Filipe Manana
Cc: Liu Bo
Signed-off-by: Qu Wenruo
---
 tests/generic/416     | 99 +++++++++++++++++++++++++++++++++++++++++++
 tests/generic/416.out |  3 ++
 tests/generic/group   |  1 +
 3 files changed, 103 insertions(+)
 create mode 100755 tests/generic/416
 create mode 100644 tests/generic/416.out

diff --git a/tests/generic/416 b/tests/generic/416
new file mode 100755
index 000..925524b
--- /dev/null
+++ b/tests/generic/416
@@ -0,0 +1,99 @@
+#! /bin/bash
+# FS QA Test 416
+#
+# Test fs behavior when a large write request can't be met by one single
+# extent
+#
+# Inspired by a bug in a btrfs fix, which doesn't get exposed by current test
+# cases
+#
+#-----------------------------------------------------------------------
+# Copyright (c) 2017 Fujitsu.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs generic
+_supported_os IRIX Linux
+_require_scratch
+
+fs_size=$((128 * 1024 * 1024))
+page_size=$(get_page_size)
+
+# We will never reach this number though
+nr_files=$(($fs_size / $page_size))
+
+# Use a small fs to make the fill faster
+_scratch_mkfs_sized $fs_size >> $seqres.full 2>&1
+
+_scratch_mount
+
+fill_fs()
+{
+	dir=$1
+	for i in $(seq -w $nr_files); do
+		# xfs_io can't return a correct value when it hits ENOSPC, use
+		# dd here to detect ENOSPC
+		dd if=/dev/zero of=$SCRATCH_MNT/$i bs=$page_size count=1 \
+			&> /dev/null
+		if [ $? -ne 0 ]; then
+			break
+		fi
+	done
+}
+
+fill_fs $SCRATCH_MNT
+
+# remount to sync everything into the fs, and drop all caches
+_scratch_remount
+
+# remove all files with odd file names, which should free nearly half
+# of the space
+rm $SCRATCH_MNT/*[13579]
+sync
+
+# We should be able to write at least 1/8 of the whole fs size.
+# The number 1/8 is for btrfs, which only has about 47M for data,
+# and half of the 47M is already taken up, so only 1/8 is safe here
+$XFS_IO_PROG -f -c "pwrite 0 $(($fs_size / 8))" $SCRATCH_MNT/large_file | \
+	_filter_xfs_io
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/416.out b/tests/generic/416.out
new file mode 100644
index 000..8d2ffac
--- /dev/null
+++ b/tests/generic/416.out
@@ -0,0 +1,3 @@
+QA output created by 416
+wrote 16777216/16777216 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
diff --git a/tests/generic/group b/tests/generic/group
index b510d41..59f94f9 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -418,3 +418,4 @@
 413 auto quick
 414 auto quick clone
 415 auto clone
+416 auto enospc
--
2.9.3
[PATCH v10 0/5] In-band de-duplication for btrfs-progs
The patchset can be fetched from github:
https://github.com/adam900710/btrfs-progs.git dedupe_20170306

Inband dedupe (in-memory backend only) ioctl support for btrfs-progs.

User/reviewer/tester can still use the previous btrfs-progs patchset to
test; this update just cleans up unsupported functions, like the on-disk
backend and any on-disk format change.

v7 changes:
  Update ctree.h to follow kernel structure change
  Update print-tree to follow kernel structure change
v8 changes:
  Move dedupe props and on-disk backend support out of the patchset.
  Change command group name to "dedupe-inband", to avoid confusion with
  possible out-of-band dedupe. Suggested by Mark.
  Rebase to latest devel branch.
v9 changes:
  Follow kernel ioctl change to support the FORCE flag, the new reconf
  ioctl, and more precise error reporting.
v10 changes:
  Rebase to v4.10.
  Add BUILD_ASSERT for btrfs_ioctl_dedupe_args

Qu Wenruo (5):
  btrfs-progs: Basic framework for dedupe-inband command group
  btrfs-progs: dedupe: Add enable command for dedupe command group
  btrfs-progs: dedupe: Add disable support for inband de-duplication
  btrfs-progs: dedupe: Add status subcommand
  btrfs-progs: dedupe: introduce reconfigure subcommand

 Documentation/Makefile.in                  |   1 +
 Documentation/btrfs-dedupe-inband.asciidoc | 167 +++++++++
 Documentation/btrfs.asciidoc               |   4 +
 Makefile                                   |   2 +-
 btrfs-completion                           |   6 +-
 btrfs.c                                    |   2 +
 cmds-dedupe-ib.c                           | 437 +++++++++++++++++++++
 commands.h                                 |   2 +
 dedupe-ib.h                                |  41 +++
 ioctl.h                                    |  38 +++
 10 files changed, 698 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/btrfs-dedupe-inband.asciidoc
 create mode 100644 cmds-dedupe-ib.c
 create mode 100644 dedupe-ib.h
--
2.12.0
[PATCH v10 1/5] btrfs-progs: Basic framework for dedupe-inband command group
Add a basic ioctl header and command group framework for later use,
along with a basic man page doc.

Signed-off-by: Qu Wenruo
---
 Documentation/Makefile.in                  |   1 +
 Documentation/btrfs-dedupe-inband.asciidoc |  40 +++++
 Documentation/btrfs.asciidoc               |   4 +++
 Makefile                                   |   2 +-
 btrfs.c                                    |   2 ++
 cmds-dedupe-ib.c                           |  48 ++++++
 commands.h                                 |   2 ++
 dedupe-ib.h                                |  41 +++++
 ioctl.h                                    |  36 ++++++
 9 files changed, 175 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/btrfs-dedupe-inband.asciidoc
 create mode 100644 cmds-dedupe-ib.c
 create mode 100644 dedupe-ib.h

diff --git a/Documentation/Makefile.in b/Documentation/Makefile.in
index 539c6b55..f175ae1e 100644
--- a/Documentation/Makefile.in
+++ b/Documentation/Makefile.in
@@ -28,6 +28,7 @@ MAN8_TXT += btrfs-qgroup.asciidoc
 MAN8_TXT += btrfs-replace.asciidoc
 MAN8_TXT += btrfs-restore.asciidoc
 MAN8_TXT += btrfs-property.asciidoc
+MAN8_TXT += btrfs-dedupe-inband.asciidoc
 
 # Category 5 manual page
 MAN5_TXT += btrfs-man5.asciidoc
diff --git a/Documentation/btrfs-dedupe-inband.asciidoc b/Documentation/btrfs-dedupe-inband.asciidoc
new file mode 100644
index ..9ee2bc75
--- /dev/null
+++ b/Documentation/btrfs-dedupe-inband.asciidoc
@@ -0,0 +1,40 @@
+btrfs-dedupe(8)
+===============
+
+NAME
+----
+btrfs-dedupe-inband - manage in-band (write time) de-duplication of a btrfs
+filesystem
+
+SYNOPSIS
+--------
+*btrfs dedupe-inband* <subcommand> <args>
+
+DESCRIPTION
+-----------
+*btrfs dedupe-inband* is used to enable/disable or show the current in-band
+de-duplication status of a btrfs filesystem.
+
+Kernel support for in-band de-duplication starts from 4.8.
+
+WARNING: In-band de-duplication is still an experimental feature of btrfs,
+use with caution.
+
+SUBCOMMAND
+----------
+Nothing yet
+
+EXIT STATUS
+-----------
+*btrfs dedupe-inband* returns a zero exit status if it succeeds. Non zero is
+returned in case of failure.
+
+AVAILABILITY
+------------
+*btrfs* is part of btrfs-progs.
+Please refer to the btrfs wiki http://btrfs.wiki.kernel.org for
+further details.
+
+SEE ALSO
+--------
+`mkfs.btrfs`(8),
diff --git a/Documentation/btrfs.asciidoc b/Documentation/btrfs.asciidoc
index 100a6adf..64fc0d2c 100644
--- a/Documentation/btrfs.asciidoc
+++ b/Documentation/btrfs.asciidoc
@@ -50,6 +50,10 @@ COMMANDS
 	Do off-line check on a btrfs filesystem. +
 	See `btrfs-check`(8) for details.
 
+*dedupe*::
+	Control btrfs in-band (write time) de-duplication. +
+	See `btrfs-dedupe`(8) for details.
+
 *device*::
 	Manage devices managed by btrfs, including add/delete/scan and so on. +
diff --git a/Makefile b/Makefile
index 67fbc483..24445493 100644
--- a/Makefile
+++ b/Makefile
@@ -102,7 +102,7 @@ cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
 	       cmds-restore.o cmds-rescue.o chunk-recover.o super-recover.o \
 	       cmds-property.o cmds-fi-usage.o cmds-inspect-dump-tree.o \
 	       cmds-inspect-dump-super.o cmds-inspect-tree-stats.o cmds-fi-du.o \
-	       mkfs/common.o
+	       mkfs/common.o cmds-dedupe-ib.o
 libbtrfs_objects = send-stream.o send-utils.o kernel-lib/rbtree.o btrfs-list.o \
 		   kernel-lib/crc32c.o \
 		   uuid-tree.o utils-lib.o rbtree-utils.o
diff --git a/btrfs.c b/btrfs.c
index 9214ae6e..1f055d75 100644
--- a/btrfs.c
+++ b/btrfs.c
@@ -201,6 +201,8 @@ static const struct cmd_group btrfs_cmd_group = {
 		{ "quota", cmd_quota, NULL, &quota_cmd_group, 0 },
 		{ "qgroup", cmd_qgroup, NULL, &qgroup_cmd_group, 0 },
 		{ "replace", cmd_replace, NULL, &replace_cmd_group, 0 },
+		{ "dedupe-inband", cmd_dedupe_ib, NULL, &dedupe_ib_cmd_group,
+			0 },
 		{ "help", cmd_help, cmd_help_usage, NULL, 0 },
 		{ "version", cmd_version, cmd_version_usage, NULL, 0 },
 		NULL_CMD_STRUCT
diff --git a/cmds-dedupe-ib.c b/cmds-dedupe-ib.c
new file mode 100644
index ..f4d31386
--- /dev/null
+++ b/cmds-dedupe-ib.c
@@ -0,0 +1,48 @@
+/*
+ * Copyright (C) 2017 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
[PATCH v10 2/5] btrfs-progs: dedupe: Add enable command for dedupe command group
Add the enable subcommand for the dedupe command group.

Signed-off-by: Qu Wenruo
---
 Documentation/btrfs-dedupe-inband.asciidoc | 114 ++++++++++-
 btrfs-completion                           |   6 +-
 cmds-dedupe-ib.c                           | 225 +++++++++++++++++++++
 ioctl.h                                    |   2 +
 4 files changed, 345 insertions(+), 2 deletions(-)

diff --git a/Documentation/btrfs-dedupe-inband.asciidoc b/Documentation/btrfs-dedupe-inband.asciidoc
index 9ee2bc75..82f970a6 100644
--- a/Documentation/btrfs-dedupe-inband.asciidoc
+++ b/Documentation/btrfs-dedupe-inband.asciidoc
@@ -22,7 +22,119 @@ use with caution.
 
 SUBCOMMAND
 ----------
-Nothing yet
+*enable* [options] <path>::
+Enable in-band de-duplication for a filesystem.
++
+`Options`
++
+-f|--force
+Force the 'enable' command to be executed.
+Will skip the memory limit check and allow 'enable' to be executed even if
+in-band de-duplication is already enabled.
++
+NOTE: If re-enabling dedupe with the '-f' option, any unspecified parameter
+will be reset to its default value.
+
+-s|--storage-backend <backend>
+Specify the de-duplication hash storage backend.
+Only the 'inmemory' backend is supported yet.
+If not specified, the default value is 'inmemory'.
++
+Refer to the *BACKENDS* section for more information.
+
+-b|--blocksize <blocksize>
+Specify the dedupe block size.
+Supported values are powers of 2 from '16K' to '8M'.
+The default value is '128K'.
++
+Refer to the *BLOCKSIZE* section for more information.
+
+-a|--hash-algorithm <algorithm>
+Specify the hash algorithm.
+Only 'sha256' is supported yet.
+
+-l|--limit-hash <limit>
+Specify the maximum number of hashes stored in memory.
+Only works for the 'inmemory' backend.
+Conflicts with the '-m' option.
++
+Only positive values are valid.
+The default value is '32K'.
+
+-m|--limit-memory <limit>
+Specify the maximum memory used for hashes.
+Only works for the 'inmemory' backend.
+Conflicts with the '-l' option.
++
+Only values larger than or equal to '1024' are valid.
+No default value.
++
+NOTE: The memory limit will be rounded down to the kernel internal hash size,
+so the memory limit shown in 'btrfs dedupe status' may differ from the
+specified value.
+
+WARNING: Too large a value for '-l' or '-m' will easily trigger OOM.
+Please use with caution according to system memory.
+
+NOTE: In-band de-duplication is not compatible with compression yet.
+And compression has higher priority than in-band de-duplication, meaning that
+if compression and de-duplication are enabled at the same time, only
+compression will work.
+
+BACKENDS
+--------
+Btrfs in-band de-duplication will support different storage backends, with
+different use cases and features.
+
+In-memory backend::
+This backend provides backward compatibility and more fine-tuning options,
+but the hash pool is non-persistent and may exhaust kernel memory if not set
+up properly.
++
+This backend can be used on old btrfs (without the '-O dedupe' mkfs option).
+When used on old btrfs, this backend needs to be enabled manually after mount.
++
+Designed for fast hash search speed, the in-memory backend keeps all dedupe
+hashes in memory. (Although overall performance is still much the same as the
+'ondisk' backend if all 'ondisk' hashes can be cached in memory.)
++
+It only keeps a limited number of hashes in memory to avoid exhausting memory;
+hashes over the limit are dropped following least-recently-used behavior.
+So this backend has a consistent overhead for a given limit, but can\'t ensure
+all duplicated blocks will be de-duplicated.
++
+After umount and mount, the in-memory backend needs to refill its hash pool.
+
+On-disk backend::
+This backend provides a persistent hash pool, with smarter memory management
+for the hash pool, but it\'s not backward-compatible, meaning it must be used
+with the '-O dedupe' mkfs option and older kernels can\'t mount it read-write.
++
+Designed for de-duplication rate, the hash pool is stored as a btrfs B+ tree
+on disk. This behavior may cause extra disk IO for hash searches under high
+memory pressure.
++
+After umount and mount, the on-disk backend still has its hashes on disk; no
+need to refill its dedupe hash pool.
+
+Currently, only the 'inmemory' backend is supported in btrfs-progs.
+
+DEDUPE BLOCK SIZE
+-----------------
+In-band de-duplication is done at dedupe block size granularity.
+Any data smaller than the dedupe block size won\'t go through in-band
+de-duplication.
+
+The dedupe block size affects dedupe rate and fragmentation heavily.
+
+A smaller block size will cause more fragments, but a higher dedupe rate.
+
+A larger block size will cause fewer fragments, but a lower dedupe rate.
+
+The in-band de-duplication rate is highly related to the workload pattern,
+so it\'s highly recommended to align the dedupe block size to the workload
+block size to make full use of de-duplication.
 
 EXIT STATUS
 -----------
diff --git a/btrfs-completion b/btrfs-completion
index 3ede77b6..50f7ea2b 100644
--- a/btrfs-completion
+++ b/btrfs-completion
@@ -29,7 +29,7 @@ _btrfs()
 
 	local cmd=${words[1]}
 
-	commands='subvolume filesystem balance
[PATCH 3/3] fstests: btrfs: Test inband dedupe with data balance.
Btrfs balance will relocate data extents, but their hashes are removed too late, at run_delayed_ref() time, which can cause extent refs to increase during balance, making either find_data_references() hit a WARN_ON() or run_delayed_refs() fail and abort the transaction. Add a concurrency test for inband dedupe and data balance. Signed-off-by: Qu Wenruo--- tests/btrfs/203 | 109 tests/btrfs/203.out | 3 ++ tests/btrfs/group | 1 + 3 files changed, 113 insertions(+) create mode 100755 tests/btrfs/203 create mode 100644 tests/btrfs/203.out diff --git a/tests/btrfs/203 b/tests/btrfs/203 new file mode 100755 index ..aea756cb --- /dev/null +++ b/tests/btrfs/203 @@ -0,0 +1,109 @@ +#! /bin/bash +# FS QA Test 203 +# +# Btrfs inband dedupe with balance concurrency test +# +# This can spot an inband dedupe error which will increase the delayed ref on +# a data extent inside a RO block group +# +#--- +# Copyright (c) 2016 Fujitsu. All Rights Reserved. +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +#--- +# + +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo "QA output created by $seq"

+here=`pwd` +tmp=/tmp/$$ +status=1 # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15 + +_cleanup() +{ + cd / + kill $populate_pid &> /dev/null + kill $balance_pid &> /dev/null + wait + # Check later comment for reason + $BTRFS_UTIL_PROG balance cancel $SCRATCH_MNT &> /dev/null + rm -f $tmp.* +} + +# get standard environment, filters and checks +. ./common/rc +. ./common/filter +. ./common/reflink + +# remove previous $seqres.full before test +rm -f $seqres.full + +# real QA test starts here + +_supported_fs btrfs +_supported_os Linux +_require_scratch +_require_cp_reflink +_require_btrfs_command dedupe +_require_btrfs_fs_feature dedupe + +dedupe_bs=128k +file_size_in_kilo=4096 +init_file=$SCRATCH_MNT/foo +run_time=$((60 * $TIME_FACTOR)) + +_scratch_mkfs >> $seqres.full 2>&1 +_scratch_mount + +do_dedupe_balance_test() +{ + _run_btrfs_util_prog dedupe enable -b $dedupe_bs -s inmemory $SCRATCH_MNT + + # create the initial file and fill hash pool + $XFS_IO_PROG -f -c "pwrite -S 0x0 -b $dedupe_bs 0 $dedupe_bs" -c "fsync" \ + $init_file | _filter_xfs_io + + _btrfs_stress_balance $SCRATCH_MNT >/dev/null 2>&1 & + balance_pid=$! + + # Populate fs with all 0 data, to trigger enough in-band dedupe work + # to race with balance + _populate_fs -n 5 -f 1000 -d 1 -r $SCRATCH_MNT \ + -s $file_size_in_kilo &> /dev/null & + populate_pid=$! + + sleep $run_time + + kill $populate_pid + kill $balance_pid + wait + + # Sometimes even we killed $balance_pid and wait returned, + # balance may still be running, use balance cancel to wait it. 
+ # As this is just a workaround, we don't want it to pollute seqres, + # so call $BTRFS_UTIL_PROG directly + $BTRFS_UTIL_PROG balance cancel $SCRATCH_MNT &> /dev/null + + rm $SCRATCH_MNT/* -rf &> /dev/null + _run_btrfs_util_prog dedupe disable $SCRATCH_MNT +} + +do_dedupe_balance_test + +# success, all done +status=0 +exit diff --git a/tests/btrfs/203.out b/tests/btrfs/203.out new file mode 100644 index ..404394c3 --- /dev/null +++ b/tests/btrfs/203.out @@ -0,0 +1,3 @@ +QA output created by 203 +wrote 131072/131072 bytes at offset 0 +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) diff --git a/tests/btrfs/group b/tests/btrfs/group index f87d995c..2ef7a498 100644 --- a/tests/btrfs/group +++ b/tests/btrfs/group @@ -144,3 +144,4 @@ 138 auto compress 200 auto ib-dedupe 201 auto ib-dedupe +203 auto ib-dedupe balance -- 2.12.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v10 5/5] btrfs-progs: dedupe: introduce reconfigure subcommand
Introduce a reconfigure subcommand to co-operate with the new kernel ioctl modification. Signed-off-by: Qu Wenruo--- Documentation/btrfs-dedupe-inband.asciidoc | 7 +++ cmds-dedupe-ib.c | 73 +++--- 2 files changed, 64 insertions(+), 16 deletions(-) diff --git a/Documentation/btrfs-dedupe-inband.asciidoc b/Documentation/btrfs-dedupe-inband.asciidoc index df068c31..5fc4bb0d 100644 --- a/Documentation/btrfs-dedupe-inband.asciidoc +++ b/Documentation/btrfs-dedupe-inband.asciidoc @@ -86,6 +86,13 @@ And compression has higher priority than in-band de-duplication, means if compression and de-duplication is enabled at the same time, only compression will work. +*reconfigure* [options] :: +Re-configure in-band de-duplication parameters of a filesystem. ++ +In-band de-duplication must be enabled first before re-configuration. ++ +[Options] are the same as for 'btrfs dedupe-inband enable'. + *status* :: Show current in-band de-duplication status of a filesystem. diff --git a/cmds-dedupe-ib.c b/cmds-dedupe-ib.c index 5fd26009..397946fa 100644 --- a/cmds-dedupe-ib.c +++ b/cmds-dedupe-ib.c @@ -69,7 +69,6 @@ static const char * const cmd_dedupe_ib_enable_usage[] = { NULL }; - #define report_fatal_parameter(dargs, old, member, type, err_val, fmt) \ if (dargs->member != old->member && dargs->member == (type)(err_val)) { \ error("unsupported dedupe "#member": %"#fmt"", old->member);\ @@ -92,6 +91,10 @@ static void report_parameter_error(struct btrfs_ioctl_dedupe_args *dargs, } report_option_parameter(dargs, old, flags, u8, -1, x); } + if (dargs->status == 0 && old->cmd == BTRFS_DEDUPE_CTL_RECONF) { + error("must enable dedupe before reconfiguration"); + return; + } report_fatal_parameter(dargs, old, cmd, u16, -1, u); report_fatal_parameter(dargs, old, blocksize, u64, -1, llu); report_fatal_parameter(dargs, old, backend, u16, -1, u); @@ -102,14 +105,17 @@ static void report_parameter_error(struct btrfs_ioctl_dedupe_args *dargs, return; } -static int cmd_dedupe_ib_enable(int argc, char **argv)
+static int enable_reconfig_dedupe(int argc, char **argv, int reconf) { int ret; int fd = -1; char *path; u64 blocksize = BTRFS_DEDUPE_BLOCKSIZE_DEFAULT; + int blocksize_set = 0; u16 hash_algo = BTRFS_DEDUPE_HASH_SHA256; + int hash_algo_set = 0; u16 backend = BTRFS_DEDUPE_BACKEND_INMEMORY; + int backend_set = 0; u64 limit_nr = 0; u64 limit_mem = 0; u64 sys_mem = 0; @@ -131,20 +137,22 @@ static int cmd_dedupe_ib_enable(int argc, char **argv) { NULL, 0, NULL, 0} }; - c = getopt_long(argc, argv, "s:b:a:l:m:", long_options, NULL); + c = getopt_long(argc, argv, "s:b:a:l:m:f", long_options, NULL); if (c < 0) break; switch (c) { case 's': - if (!strcasecmp("inmemory", optarg)) + if (!strcasecmp("inmemory", optarg)) { backend = BTRFS_DEDUPE_BACKEND_INMEMORY; - else { + backend_set = 1; + } else { error("unsupported dedupe backend: %s", optarg); exit(1); } break; case 'b': blocksize = parse_size(optarg); + blocksize_set = 1; break; case 'a': if (strcmp("sha256", optarg)) { @@ -224,26 +232,40 @@ static int cmd_dedupe_ib_enable(int argc, char **argv) return 1; } memset(&dargs, -1, sizeof(dargs)); - dargs.cmd = BTRFS_DEDUPE_CTL_ENABLE; - dargs.blocksize = blocksize; - dargs.hash_algo = hash_algo; - dargs.limit_nr = limit_nr; - dargs.limit_mem = limit_mem; - dargs.backend = backend; - if (force) - dargs.flags |= BTRFS_DEDUPE_FLAG_FORCE; - else - dargs.flags = 0; + if (reconf) { + dargs.cmd = BTRFS_DEDUPE_CTL_RECONF; + if (blocksize_set) + dargs.blocksize = blocksize; + if (hash_algo_set) + dargs.hash_algo = hash_algo; + if (backend_set) + dargs.backend = backend; + dargs.limit_nr = limit_nr; + dargs.limit_mem = limit_mem; + } else { + dargs.cmd = BTRFS_DEDUPE_CTL_ENABLE; + dargs.blocksize = blocksize; + dargs.hash_algo = hash_algo; + dargs.limit_nr = limit_nr; + dargs.limit_mem =
[PATCH v10 4/5] btrfs-progs: dedupe: Add status subcommand
Add status subcommand for dedupe command group. Signed-off-by: Qu Wenruo--- Documentation/btrfs-dedupe-inband.asciidoc | 3 ++ btrfs-completion | 2 +- cmds-dedupe-ib.c | 81 ++ 3 files changed, 85 insertions(+), 1 deletion(-) diff --git a/Documentation/btrfs-dedupe-inband.asciidoc b/Documentation/btrfs-dedupe-inband.asciidoc index de32eb97..df068c31 100644 --- a/Documentation/btrfs-dedupe-inband.asciidoc +++ b/Documentation/btrfs-dedupe-inband.asciidoc @@ -86,6 +86,9 @@ And compression has higher priority than in-band de-duplication, means if compression and de-duplication is enabled at the same time, only compression will work. +*status* :: +Show current in-band de-duplication status of a filesystem. + BACKENDS Btrfs in-band de-duplication will support different storage backends, with diff --git a/btrfs-completion b/btrfs-completion index 9a6c73ba..fbaae0cc 100644 --- a/btrfs-completion +++ b/btrfs-completion @@ -40,7 +40,7 @@ _btrfs() commands_property='get set list' commands_quota='enable disable rescan' commands_qgroup='assign remove create destroy show limit' -commands_dedupe='enable disable' +commands_dedupe='enable disable status' commands_replace='start status cancel' if [[ "$cur" == -* && $cword -le 3 && "$cmd" != "help" ]]; then diff --git a/cmds-dedupe-ib.c b/cmds-dedupe-ib.c index a8b10924..5fd26009 100644 --- a/cmds-dedupe-ib.c +++ b/cmds-dedupe-ib.c @@ -299,12 +299,93 @@ out: return 0; } +static const char * const cmd_dedupe_ib_status_usage[] = { + "btrfs dedupe status ", + "Show current in-band(write time) de-duplication status of a btrfs.", + NULL +}; + +static int cmd_dedupe_ib_status(int argc, char **argv) +{ + struct btrfs_ioctl_dedupe_args dargs; + DIR *dirstream; + char *path; + int fd; + int ret; + int print_limit = 1; + + if (check_argc_exact(argc, 2)) + usage(cmd_dedupe_ib_status_usage); + + path = argv[1]; + fd = open_file_or_dir(path, &dirstream); + if (fd < 0) { + error("failed to open file or directory: %s", path); + ret = 1; + goto out; + } + 
memset(&dargs, 0, sizeof(dargs)); + dargs.cmd = BTRFS_DEDUPE_CTL_STATUS; + + ret = ioctl(fd, BTRFS_IOC_DEDUPE_CTL, &dargs); + if (ret < 0) { + error("failed to get inband deduplication status: %s", + strerror(errno)); + ret = 1; + goto out; + } + ret = 0; + if (dargs.status == 0) { + printf("Status: \t\t\tDisabled\n"); + goto out; + } + printf("Status:\t\t\tEnabled\n"); + + if (dargs.hash_algo == BTRFS_DEDUPE_HASH_SHA256) + printf("Hash algorithm:\t\tSHA-256\n"); + else + printf("Hash algorithm:\t\tUnrecognized(%x)\n", + dargs.hash_algo); + + if (dargs.backend == BTRFS_DEDUPE_BACKEND_INMEMORY) { + printf("Backend:\t\tIn-memory\n"); + print_limit = 1; + } else { + printf("Backend:\t\tUnrecognized(%x)\n", + dargs.backend); + } + + printf("Dedup Blocksize:\t%llu\n", dargs.blocksize); + + if (print_limit) { + u64 cur_mem; + + /* Limit nr may be 0 */ + if (dargs.limit_nr) + cur_mem = dargs.current_nr * (dargs.limit_mem / + dargs.limit_nr); + else + cur_mem = 0; + + printf("Number of hash: \t[%llu/%llu]\n", dargs.current_nr, + dargs.limit_nr); + printf("Memory usage: \t\t[%s/%s]\n", + pretty_size(cur_mem), + pretty_size(dargs.limit_mem)); + } +out: + close_file_or_dir(fd, dirstream); + return ret; +} + const struct cmd_group dedupe_ib_cmd_group = { dedupe_ib_cmd_group_usage, dedupe_ib_cmd_group_info, { { "enable", cmd_dedupe_ib_enable, cmd_dedupe_ib_enable_usage, NULL, 0}, { "disable", cmd_dedupe_ib_disable, cmd_dedupe_ib_disable_usage, NULL, 0}, + { "status", cmd_dedupe_ib_status, cmd_dedupe_ib_status_usage, + NULL, 0}, NULL_CMD_STRUCT } }; -- 2.12.0
[PATCH 0/3] Btrfs in-band de-duplication test cases
Btrfs in-band de-duplication test cases for the in-memory backend, covering the bugs exposed during development. Qu Wenruo (3): fstests: btrfs: Add basic test for btrfs in-band de-duplication fstests: btrfs: Add testcase for btrfs dedupe and metadata balance race test fstests: btrfs: Test inband dedupe with data balance. common/defrag | 13 ++ tests/btrfs/200 | 116 tests/btrfs/200.out | 22 ++ tests/btrfs/201 | 112 ++ tests/btrfs/201.out | 2 + tests/btrfs/203 | 109 tests/btrfs/203.out | 3 ++ tests/btrfs/group | 4 ++ 8 files changed, 381 insertions(+) create mode 100755 tests/btrfs/200 create mode 100644 tests/btrfs/200.out create mode 100755 tests/btrfs/201 create mode 100644 tests/btrfs/201.out create mode 100755 tests/btrfs/203 create mode 100644 tests/btrfs/203.out -- 2.12.0
[PATCH v10 3/5] btrfs-progs: dedupe: Add disable support for inband deduplication
Add disable subcommand for dedupe command group. Signed-off-by: Qu Wenruo--- Documentation/btrfs-dedupe-inband.asciidoc | 5 btrfs-completion | 2 +- cmds-dedupe-ib.c | 42 ++ 3 files changed, 48 insertions(+), 1 deletion(-) diff --git a/Documentation/btrfs-dedupe-inband.asciidoc b/Documentation/btrfs-dedupe-inband.asciidoc index 82f970a6..de32eb97 100644 --- a/Documentation/btrfs-dedupe-inband.asciidoc +++ b/Documentation/btrfs-dedupe-inband.asciidoc @@ -22,6 +22,11 @@ use with caution. SUBCOMMAND -- +*disable* :: +Disable in-band de-duplication for a filesystem. ++ +This will trash all stored dedupe hashes. ++ *enable* [options] :: Enable in-band de-duplication for a filesystem. + diff --git a/btrfs-completion b/btrfs-completion index 50f7ea2b..9a6c73ba 100644 --- a/btrfs-completion +++ b/btrfs-completion @@ -40,7 +40,7 @@ _btrfs() commands_property='get set list' commands_quota='enable disable rescan' commands_qgroup='assign remove create destroy show limit' -commands_dedupe='enable' +commands_dedupe='enable disable' commands_replace='start status cancel' if [[ "$cur" == -* && $cword -le 3 && "$cmd" != "help" ]]; then diff --git a/cmds-dedupe-ib.c b/cmds-dedupe-ib.c index cc9928aa..a8b10924 100644 --- a/cmds-dedupe-ib.c +++ b/cmds-dedupe-ib.c @@ -259,10 +259,52 @@ out: return ret; } +static const char * const cmd_dedupe_ib_disable_usage[] = { + "btrfs dedupe disable ", + "Disable in-band(write time) de-duplication of a btrfs.", + NULL +}; + +static int cmd_dedupe_ib_disable(int argc, char **argv) +{ + struct btrfs_ioctl_dedupe_args dargs; + DIR *dirstream; + char *path; + int fd; + int ret; + + if (check_argc_exact(argc, 2)) + usage(cmd_dedupe_ib_disable_usage); + + path = argv[1]; + fd = open_file_or_dir(path, &dirstream); + if (fd < 0) { + error("failed to open file or directory: %s", path); + return 1; + } + memset(&dargs, 0, sizeof(dargs)); + dargs.cmd = BTRFS_DEDUPE_CTL_DISABLE; + + ret = ioctl(fd, BTRFS_IOC_DEDUPE_CTL, &dargs); + if (ret < 0) { + error("failed to disable inband deduplication: %s", + strerror(errno)); + ret = 1; + goto out; + } + ret = 0; + +out: + close_file_or_dir(fd, dirstream); + return ret; +} + const struct cmd_group dedupe_ib_cmd_group = { dedupe_ib_cmd_group_usage, dedupe_ib_cmd_group_info, { { "enable", cmd_dedupe_ib_enable, cmd_dedupe_ib_enable_usage, NULL, 0}, + { "disable", cmd_dedupe_ib_disable, cmd_dedupe_ib_disable_usage, + NULL, 0}, NULL_CMD_STRUCT } }; -- 2.12.0
[PATCH 1/3] fstests: btrfs: Add basic test for btrfs in-band de-duplication
Add basic test for btrfs in-band de-duplication(inmemory backend), including: 1) Enable 3) Dedup rate 4) File correctness 5) Disable Signed-off-by: Qu Wenruo--- common/defrag | 13 ++ tests/btrfs/200 | 116 tests/btrfs/200.out | 22 ++ tests/btrfs/group | 2 + 4 files changed, 153 insertions(+) create mode 100755 tests/btrfs/200 create mode 100644 tests/btrfs/200.out diff --git a/common/defrag b/common/defrag index d279382f..0a41714f 100644 --- a/common/defrag +++ b/common/defrag @@ -59,6 +59,19 @@ _extent_count() $XFS_IO_PROG -c "fiemap" $1 | tail -n +2 | grep -v hole | wc -l| $AWK_PROG '{print $1}' } +# Get the number of unique file extents +# Unique file extents means they have different ondisk bytenr +# Some filesystem supports reflinkat() or in-band de-dup can create +# a file whose all file extents points to the same ondisk bytenr +# this can be used to test if such reflinkat() or in-band de-dup works +_extent_count_uniq() +{ + file=$1 + $XFS_IO_PROG -c "fiemap" $file >> $seqres.full 2>&1 + $XFS_IO_PROG -c "fiemap" $file | tail -n +2 | grep -v hole |\ + $AWK_PROG '{print $3}' | sort | uniq | wc -l +} + _check_extent_count() { min=$1 diff --git a/tests/btrfs/200 b/tests/btrfs/200 new file mode 100755 index ..1b3e46fd --- /dev/null +++ b/tests/btrfs/200 @@ -0,0 +1,116 @@ +#! /bin/bash +# FS QA Test 200 +# +# Basic btrfs inband dedupe test for inmemory backend +# +#--- +# Copyright (c) 2016 Fujitsu. All Rights Reserved. +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. 
+# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +#--- +# + +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo "QA output created by $seq" + +here=`pwd` +tmp=/tmp/$$ +status=1 # failure is the default! +trap "_cleanup; exit \$status" 0 1 2 3 15 + +_cleanup() +{ + cd / + rm -f $tmp.* +} + +# get standard environment, filters and checks +. ./common/rc +. ./common/filter +. ./common/defrag + +# remove previous $seqres.full before test +rm -f $seqres.full + +# real QA test starts here + +_supported_fs btrfs +_supported_os Linux +_require_scratch +_require_btrfs_command dedupe +_require_btrfs_fs_feature dedupe + +# File size is twice the maximum file extent of btrfs +# So even fallbacked to non-dedupe, it will have at least 2 extents +file_size=256m + +_scratch_mkfs >> $seqres.full 2>&1 +_scratch_mount + +do_dedupe_test() +{ + dedupe_bs=$1 + + echo "Testing inmemory dedupe backend with block size $dedupe_bs" + _run_btrfs_util_prog dedupe enable -f -s inmemory -b $dedupe_bs \ + $SCRATCH_MNT + # do sync write to ensure dedupe hash is added into dedupe pool + $XFS_IO_PROG -f -c "pwrite -b $dedupe_bs 0 $dedupe_bs" -c "fsync"\ + $SCRATCH_MNT/initial_block | _filter_xfs_io + + # do sync write to ensure we can get stable fiemap later + $XFS_IO_PROG -f -c "pwrite -b $dedupe_bs 0 $file_size" -c "fsync"\ + $SCRATCH_MNT/real_file | _filter_xfs_io + + # Test if real_file is de-duplicated + nr_uniq_extents=$(_extent_count_uniq $SCRATCH_MNT/real_file) + nr_total_extents=$(_extent_count $SCRATCH_MNT/real_file) + nr_deduped_extents=$(($nr_total_extents - $nr_uniq_extents)) + + echo "deduped/total: $nr_deduped_extents/$nr_total_extents" \ + >> $seqres.full + # Allow a small amount of dedupe miss, as commit interval or + # memory pressure may break a dedupe_bs block and cause + # small extent which won't go through dedupe routine + 
_within_tolerance "number of deduped extents" $nr_deduped_extents \ + $nr_total_extents 5% -v + + # Also check the md5sum to ensure data is not corrupted + md5=$(_md5_checksum $SCRATCH_MNT/real_file) + echo "md5sum: $md5" +} + +# Test inmemory dedupe first, use 64K dedupe bs to keep compatibility +# with 64K page size +do_dedupe_test 64K + +# Test 128K(default) dedupe bs +do_dedupe_test 128K + +# Test 1M dedupe bs +do_dedupe_test 1M + +# Check dedupe disable +_run_btrfs_util_prog dedupe disable $SCRATCH_MNT + +# success, all done +status=0 +exit
Re: [PATCH 4/4] btrfs: add dummy callback for readpage_io_failed and drop checks
On Mon, Feb 20, 2017 at 07:31:33PM +0100, David Sterba wrote: > Make extent_io_ops::readpage_io_failed_hook callback mandatory and > define a dummy function for btrfs_extent_io_ops. As the failed IO > callback is not performance critical, the branch vs extra trade off does > not hurt. > > Signed-off-by: David Sterba> --- > fs/btrfs/disk-io.c | 2 +- > fs/btrfs/extent_io.c | 2 +- > fs/btrfs/extent_io.h | 2 +- > fs/btrfs/inode.c | 7 +++ > 4 files changed, 10 insertions(+), 3 deletions(-) > > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c > index 0715b6f3f686..fbf4921f4d60 100644 > --- a/fs/btrfs/disk-io.c > +++ b/fs/btrfs/disk-io.c > @@ -4658,7 +4658,7 @@ static const struct extent_io_ops btree_extent_io_ops = > { > .readpage_end_io_hook = btree_readpage_end_io_hook, > /* note we're sharing with inode.c for the merge bio hook */ > .merge_bio_hook = btrfs_merge_bio_hook, > + .readpage_io_failed_hook = btree_io_failed_hook, > > /* optional callbacks */ > - .readpage_io_failed_hook = btree_io_failed_hook, > }; > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c > index f5cff93ab152..eaee7bb2ff7c 100644 > --- a/fs/btrfs/extent_io.c > +++ b/fs/btrfs/extent_io.c > @@ -2578,7 +2578,7 @@ static void end_bio_extent_readpage(struct bio *bio) > if (likely(uptodate)) > goto readpage_ok; > > - if (tree->ops && tree->ops->readpage_io_failed_hook) { > + if (tree->ops) { > ret = tree->ops->readpage_io_failed_hook(page, mirror); > if (!ret && !bio->bi_error) > uptodate = 1; > diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h > index 5c5e2e6cfb9e..63c8cc970b1c 100644 > --- a/fs/btrfs/extent_io.h > +++ b/fs/btrfs/extent_io.h > @@ -102,6 +102,7 @@ struct extent_io_ops { > int (*merge_bio_hook)(struct page *page, unsigned long offset, > size_t size, struct bio *bio, > unsigned long bio_flags); > + int (*readpage_io_failed_hook)(struct page *page, int failed_mirror); > > /* >* Optional hooks, called if the pointer is not NULL > @@ -109,7 +110,6 @@ struct extent_io_ops { 
> int (*fill_delalloc)(struct inode *inode, struct page *locked_page, >u64 start, u64 end, int *page_started, >unsigned long *nr_written); > - int (*readpage_io_failed_hook)(struct page *page, int failed_mirror); > > int (*writepage_start_hook)(struct page *page, u64 start, u64 end); > void (*writepage_end_io_hook)(struct page *page, u64 start, u64 end, > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 72faf9b5616a..a74191fa3934 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -10503,6 +10503,12 @@ static int btrfs_tmpfile(struct inode *dir, struct > dentry *dentry, umode_t mode) > > } > > +__attribute__((const)) > +static int dummy_readpage_io_failed_hook(struct page *page, int > failed_mirror) > +{ > + return 0; > +} > + > static const struct inode_operations btrfs_dir_inode_operations = { > .getattr= btrfs_getattr, > .lookup = btrfs_lookup, > @@ -10545,6 +10551,7 @@ static const struct extent_io_ops btrfs_extent_io_ops > = { > .submit_bio_hook = btrfs_submit_bio_hook, > .readpage_end_io_hook = btrfs_readpage_end_io_hook, > .merge_bio_hook = btrfs_merge_bio_hook, > + .readpage_io_failed_hook = dummy_readpage_io_failed_hook, This has made us not call bio_readpage_error() to correct corrupted data... Thanks, -liubo > > /* optional callbacks */ > .fill_delalloc = run_delalloc_range, > -- > 2.10.1 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Home storage with btrfs
On Wed, 15 Mar 2017 23:26:32 +0100, Kai Krakow wrote: > Well, bugs can hit you with every filesystem. Nothing as complex as a Meh... I fooled myself. Find the mistake... ;-) SPOILER: "Nothing" should be "something". -- Regards, Kai Replies to list-only preferred.
Re: Home storage with btrfs
On Wed, 15 Mar 2017 23:41:41 +0100, Kai Krakow wrote: > On Wed, 15 Mar 2017 23:26:32 +0100, > Kai Krakow wrote: > > > Well, bugs can hit you with every filesystem. Nothing as complex as > > a > > Meh... I fooled myself. Find the mistake... ;-) > > SPOILER: > > "Nothing" should be "something". *doublefacepalm* Please forget what I wrote. The original sentence is correct. I should get some coffee or go to bed. :-\ -- Regards, Kai Replies to list-only preferred.
Re: Home storage with btrfs
On Wed, 15 Mar 2017 07:55:51 +0000 (UTC), Duncan <1i5t5.dun...@cox.net> wrote: > Hérikz Nawarro posted on Mon, 13 Mar 2017 08:29:32 -0300 as excerpted: > > > Today is safe to use btrfs for home storage? No raid, just secure > > storage for some files and create snapshots from it. > > > I'll echo the others... but with emphasis on a few caveats the others > mentioned but didn't give the emphasis I thought they deserved: > > 1) Btrfs is, as I repeatedly put it in post after post, "stabilizing, > but not yet fully stable and mature." In general, that means it's > likely to work quite or even very well for you (as it has done for > us) if you don't try the too unusual or get too cocky, but get too > close to the edge and you just might find yourself over that edge. > Don't worry too much, tho, those edges are clearly marked if you're > paying attention, and just by asking here, you're already paying way > more attention than too many we see here... /after/ they've found > themselves over the edge. That's a _very_ good sign. =:^) Well, bugs can hit you with every filesystem. Nothing as complex as a file system can ever be proven bug free (except FAT maybe). But as a general-purpose-no-fancy-features-needed FS, btrfs should be on par with other FS these days. > 2) "Stabilizing, not fully stable and mature", means even more than > ever, if you value your data more than the time, hassle and resources > necessary to have backups, you HAVE them, tested and available for > practical use should it be necessary. This is totally not dependent on "stabilizing, not fully stable and mature". If your data matters to you, do backups. It's that simple. If you don't do backups, your data isn't important - by definition.
> Of course any sysadmin (and that's what you are for at least your own > systems if you're making this choice) worth the name will tell you > the value of the data is really defined by the number of backups it > has, not by any arbitrary claims to value absent those backups. No > backups, you simply didn't value the data enough to have them, > whatever claims of value you might otherwise try to make. Backups, > you /did/ value the data. Yes. :-) > And of course the corollary to that first sysadmin's rule of backups > is that an untested as restorable backup isn't yet a backup, only a > potential backup, because the job isn't finished and it can't be > properly called a backup until you know you can restore from it if > necessary. Even more true. :-) > And lest anyone get the wrong idea, a snapshot is /not/ a backup for > purposes of the above rules. It's on the same filesystem and > hardware media and if that goes down... you've lost it just the > same. And since that filesystem is still stabilizing, you really > must be even more prepared for it to go down, even if the chances are > still quite good it won't. A good backup should follow the 3-2-1 rule: Have 3 different backup copies, 2 different media, and store at least 1 copy external/off-site. For customers, we usually deploy a strategy like this for Windows machines: Do one local backup using Windows Image Backup to a local NAS to backup from inside the VM, use a different software to do image backups from outside of the VM to the local NAS, mirror the "outside image" to a remote location (cloud storage). And keep some backup history. Overwriting the one existing backup with a new one won't help you anything. All involved software should be able to do efficient delta backups, otherwise mirroring offsite may be no fun. In linux, I'm using borgbackup and rsync to have something similar. Using borgbackup to a local storage, and syncing it offsite with rsync gives me the 2-1 rule part. 
You can get the third rule by using rsync to also mirror the local FS off the machine. But that's usually overkill for personal backups. Instead, I only have a third copy of the most valuable data like photos, dev stuff, documents, etc. BTW: For me, different media also means different FS types. So a bug in one FS wouldn't easily hit the other. [snip] > 4) Keep the number of snapshots per subvolume under tight control as > already suggested. A few hundred, NOT a few thousand. Easy enough > if you do those snapshots manually, but easy enough to get thousands > if you're not paying attention to thin out the old ones and using an > automated tool such as snapper. Borgbackup is so fast and storage efficient that you could run it easily multiple times per day. That in turn means I don't need to rely on regular snapshots to undo mistakes. I only use snapshots before doing some knowingly risky stuff to have fast recovery. But that's all; nothing else should need snapshots (unless you are doing more advanced stuff like container cloning, VM instance spawning, ...). > 5) Stay away from quotas. Either you need the feature and thus need > a more mature filesystem where it's actually stable and does what it > says on the label, or you don't, in which
[PATCH 1/8] nowait aio: Introduce IOCB_RW_FLAG_NOWAIT
From: Goldwyn RodriguesThis flag informs kernel to bail out if an AIO request will block for reasons such as file allocations, or a writeback triggered, or would block while allocating requests while performing direct I/O. Unfortunately, aio_flags is not checked for validity. If we add the flags to aio_flags, it would break existing applications which have it set to anything besides zero or IOCB_FLAG_RESFD. So, we are using aio_reserved1 and renaming it to aio_rw_flags. IOCB_RW_FLAG_NOWAIT is translated to IOCB_NOWAIT for iocb->ki_flags. Signed-off-by: Goldwyn Rodrigues --- fs/aio.c | 10 +- include/linux/fs.h | 1 + include/uapi/linux/aio_abi.h | 9 - 3 files changed, 18 insertions(+), 2 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index f52d925..41409ac 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1541,11 +1541,16 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb, ssize_t ret; /* enforce forwards compatibility on users */ - if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2)) { + if (unlikely(iocb->aio_reserved2)) { pr_debug("EINVAL: reserve field set\n"); return -EINVAL; } + if (unlikely(iocb->aio_rw_flags & ~IOCB_RW_FLAG_NOWAIT)) { + pr_debug("EINVAL: aio_rw_flags set with incompatible flags\n"); + return -EINVAL; + } + /* prevent overflows */ if (unlikely( (iocb->aio_buf != (unsigned long)iocb->aio_buf) || @@ -1586,6 +1591,9 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb, req->common.ki_flags |= IOCB_EVENTFD; } + if (iocb->aio_rw_flags & IOCB_RW_FLAG_NOWAIT) + req->common.ki_flags |= IOCB_NOWAIT; + ret = put_user(KIOCB_KEY, _iocb->aio_key); if (unlikely(ret)) { pr_debug("EFAULT: aio_key\n"); diff --git a/include/linux/fs.h b/include/linux/fs.h index 7251f7b..e8d9346 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -270,6 +270,7 @@ struct writeback_control; #define IOCB_DSYNC (1 << 4) #define IOCB_SYNC (1 << 5) #define IOCB_WRITE (1 << 6) +#define IOCB_NOWAIT(1 << 7) struct kiocb { struct file 
*ki_filp; diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h index bb2554f..6d98cbe 100644 --- a/include/uapi/linux/aio_abi.h +++ b/include/uapi/linux/aio_abi.h @@ -54,6 +54,13 @@ enum { */ #define IOCB_FLAG_RESFD(1 << 0) +/* + * Flags for aio_rw_flags member of "struct iocb". + * IOCB_RW_FLAG_NOWAIT - Set if the user wants the iocb to fail if it + * would block for operations such as disk allocation. + */ +#define IOCB_RW_FLAG_NOWAIT(1 << 1) + /* read() from /dev/aio returns these structures. */ struct io_event { __u64 data; /* the data field from the iocb */ @@ -79,7 +86,7 @@ struct io_event { struct iocb { /* these are internal to the kernel/libc. */ __u64 aio_data; /* data to be returned in event's data */ - __u32 PADDED(aio_key, aio_reserved1); + __u32 PADDED(aio_key, aio_rw_flags); /* the kernel sets aio_key to the req # */ /* common fields */ -- 2.10.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
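The forward-compatibility check in io_submit_one() above can be condensed into a small stand-alone sketch: any bit in aio_rw_flags outside the known mask is rejected with -EINVAL, so old applications fail loudly instead of having new flags silently ignored. The helper below is hypothetical (not part of the patch); the flag value matches the patch, the mask name is invented.

```c
#include <errno.h>
#include <stdint.h>

#define IOCB_RW_FLAG_NOWAIT (1 << 1)   /* value from the patch */
#define KNOWN_RW_FLAGS      IOCB_RW_FLAG_NOWAIT  /* invented mask name */

/* Mirror of the io_submit_one() validity checks: reserved fields must be
 * zero, and only known aio_rw_flags bits are accepted. */
static int check_rw_flags(uint32_t aio_rw_flags, uint64_t aio_reserved2)
{
    if (aio_reserved2)                  /* reserved field must stay zero */
        return -EINVAL;
    if (aio_rw_flags & ~KNOWN_RW_FLAGS) /* unknown flag bits */
        return -EINVAL;
    return 0;
}
```

This is why the patch moves the flag into the formerly-reserved field: aio_reserved1 was already validated to be zero, so repurposing it keeps the rejection path for unknown bits.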
[PATCH 2/8] nowait aio: Return if cannot get hold of i_rwsem
From: Goldwyn Rodrigues

A failure to lock i_rwsem would mean there is I/O being performed by
another thread. So, let's bail.

Reviewed-by: Christoph Hellwig
Signed-off-by: Goldwyn Rodrigues
---
 mm/filemap.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 1694623..e08f3b9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2982,7 +2982,12 @@ ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	struct inode *inode = file->f_mapping->host;
 	ssize_t ret;
 
-	inode_lock(inode);
+	if (!inode_trylock(inode)) {
+		/* Don't sleep on inode rwsem */
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			return -EAGAIN;
+		inode_lock(inode);
+	}
 	ret = generic_write_checks(iocb, from);
 	if (ret > 0)
 		ret = __generic_file_write_iter(iocb, from);
-- 
2.10.2
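The trylock-or-bail pattern this patch applies to i_rwsem generalizes to any sleeping lock. A minimal user-space sketch of the same shape, with a pthread mutex standing in for the inode rwsem (lock_or_bail is an invented name):

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* With nowait set, a contended lock means another writer is active, so
 * fail with -EAGAIN instead of sleeping; otherwise fall back to the
 * normal blocking acquisition, exactly as the patch does for i_rwsem. */
static int lock_or_bail(pthread_mutex_t *lock, bool nowait)
{
    if (pthread_mutex_trylock(lock) != 0) {
        if (nowait)
            return -EAGAIN;        /* caller retries from another context */
        pthread_mutex_lock(lock);  /* blocking path, as before the patch */
    }
    return 0;
}
```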
[PATCH 3/8] nowait aio: return if direct write will trigger writeback
From: Goldwyn RodriguesFind out if the write will trigger a wait due to writeback. If yes, return -EAGAIN. This introduces a new function filemap_range_has_page() which returns true if the file's mapping has a page within the range mentioned. Return -EINVAL for buffered AIO: there are multiple causes of delay such as page locks, dirty throttling logic, page loading from disk etc. which cannot be taken care of. Signed-off-by: Goldwyn Rodrigues --- include/linux/fs.h | 2 ++ mm/filemap.c | 50 +++--- 2 files changed, 49 insertions(+), 3 deletions(-) diff --git a/include/linux/fs.h b/include/linux/fs.h index e8d9346..4a30e8f 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2514,6 +2514,8 @@ extern int filemap_fdatawait(struct address_space *); extern void filemap_fdatawait_keep_errors(struct address_space *); extern int filemap_fdatawait_range(struct address_space *, loff_t lstart, loff_t lend); +extern int filemap_range_has_page(struct address_space *, loff_t lstart, + loff_t lend); extern int filemap_write_and_wait(struct address_space *mapping); extern int filemap_write_and_wait_range(struct address_space *mapping, loff_t lstart, loff_t lend); diff --git a/mm/filemap.c b/mm/filemap.c index e08f3b9..c020e23 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -376,6 +376,39 @@ int filemap_flush(struct address_space *mapping) } EXPORT_SYMBOL(filemap_flush); +/** + * filemap_range_has_page - check if a page exists in range. + * @mapping: address space structure to wait for + * @start_byte:offset in bytes where the range starts + * @end_byte: offset in bytes where the range ends (inclusive) + * + * Find at least one page in the range supplied, usually used to check if + * direct writing in this range will trigger a writeback. 
+ */ +int filemap_range_has_page(struct address_space *mapping, + loff_t start_byte, loff_t end_byte) +{ + pgoff_t index = start_byte >> PAGE_SHIFT; + pgoff_t end = end_byte >> PAGE_SHIFT; + struct pagevec pvec; + int ret; + + if (end_byte < start_byte) + return 0; + + if (mapping->nrpages == 0) + return 0; + + pagevec_init(, 0); + ret = pagevec_lookup(, mapping, index, 1); + if (!ret) + return 0; + ret = (pvec.pages[0]->index <= end); + pagevec_release(); + return ret; +} +EXPORT_SYMBOL(filemap_range_has_page); + static int __filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte, loff_t end_byte) { @@ -2640,6 +2673,9 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from) pos = iocb->ki_pos; + if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT)) + return -EINVAL; + if (limit != RLIM_INFINITY) { if (iocb->ki_pos >= limit) { send_sig(SIGXFSZ, current, 0); @@ -2709,9 +2745,17 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from) write_len = iov_iter_count(from); end = (pos + write_len - 1) >> PAGE_SHIFT; - written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 1); - if (written) - goto out; + if (iocb->ki_flags & IOCB_NOWAIT) { + /* If there are pages to writeback, return */ + if (filemap_range_has_page(inode->i_mapping, pos, + pos + iov_iter_count(from))) + return -EAGAIN; + } else { + written = filemap_write_and_wait_range(mapping, pos, + pos + write_len - 1); + if (written) + goto out; + } /* * After a write we want buffered reads to be sure to go to disk to get -- 2.10.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
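The byte-to-page-index arithmetic in filemap_range_has_page() can be checked in isolation: convert the inclusive byte range to page indices, then ask whether the first cached page at or after the start index still falls inside the range. In this sketch the page cache is modeled as a sorted array of page indices — an assumption for illustration; the kernel walks the mapping's radix tree via pagevec_lookup().

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* 4 KiB pages, as on x86-64 */

/* Model of filemap_range_has_page(): true iff some cached page index
 * lies in [start_byte >> PAGE_SHIFT, end_byte >> PAGE_SHIFT]. */
static bool range_has_page(const uint64_t *cached, int ncached,
                           uint64_t start_byte, uint64_t end_byte)
{
    uint64_t index = start_byte >> PAGE_SHIFT;
    uint64_t end = end_byte >> PAGE_SHIFT;
    int i;

    if (end_byte < start_byte || ncached == 0)
        return false;

    for (i = 0; i < ncached; i++)       /* first page at or after index */
        if (cached[i] >= index)
            return cached[i] <= end;    /* still inside the range? */
    return false;
}
```

Note the range is inclusive of end_byte, which is why the caller passes `pos + write_len - 1` style bounds.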
[PATCH 8/8] nowait aio: btrfs
From: Goldwyn RodriguesReturn EAGAIN if any of the following checks fail + i_rwsem is not lockable + NODATACOW or PREALLOC is not set + Cannot nocow at the desired location + Writing beyond end of file which is not allocated Signed-off-by: Goldwyn Rodrigues --- fs/btrfs/file.c | 25 - fs/btrfs/inode.c | 3 +++ 2 files changed, 23 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 520cb72..a870e5d 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1823,12 +1823,29 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb, ssize_t num_written = 0; bool sync = (file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host); ssize_t err; - loff_t pos; - size_t count; + loff_t pos = iocb->ki_pos; + size_t count = iov_iter_count(from); loff_t oldsize; int clean_page = 0; - inode_lock(inode); + if ((iocb->ki_flags & IOCB_NOWAIT) && + (iocb->ki_flags & IOCB_DIRECT)) { + /* Don't sleep on inode rwsem */ + if (!inode_trylock(inode)) + return -EAGAIN; + /* +* We will allocate space in case nodatacow is not set, +* so bail +*/ + if (!(BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW | + BTRFS_INODE_PREALLOC)) || + check_can_nocow(BTRFS_I(inode), pos, ) <= 0) { + inode_unlock(inode); + return -EAGAIN; + } + } else + inode_lock(inode); + err = generic_write_checks(iocb, from); if (err <= 0) { inode_unlock(inode); @@ -1862,8 +1879,6 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb, */ update_time_for_write(inode); - pos = iocb->ki_pos; - count = iov_iter_count(from); start_pos = round_down(pos, fs_info->sectorsize); oldsize = i_size_read(inode); if (start_pos > oldsize) { diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index c40060c..788bb93 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -8613,6 +8613,9 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter) dio_data.overwrite = 1; inode_unlock(inode); relock = true; + } else if (iocb->ki_flags & IOCB_NOWAIT) { + ret = -EAGAIN; + goto out; } ret = 
btrfs_delalloc_reserve_space(inode, offset, count); if (ret) -- 2.10.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
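The btrfs gating condition above reduces to a small decision function: a nowait direct write may proceed only if no space allocation will be needed, i.e. the inode is nodatacow or preallocated *and* the target range can actually be written in place. Everything here is illustrative — the flag bit values and the helper name are assumptions, and check_can_nocow()'s result is passed in as a plain integer.

```c
#include <errno.h>

#define BTRFS_INODE_NODATACOW (1 << 0)   /* illustrative bit values */
#define BTRFS_INODE_PREALLOC  (1 << 1)

/* can_nocow models check_can_nocow()'s return: > 0 means the range can
 * be written without COW allocation. */
static int nowait_write_ok(unsigned int inode_flags, int can_nocow)
{
    if (!(inode_flags & (BTRFS_INODE_NODATACOW | BTRFS_INODE_PREALLOC)))
        return -EAGAIN;   /* COW write: would allocate space, bail */
    if (can_nocow <= 0)
        return -EAGAIN;   /* range not nocow-able at this offset */
    return 0;
}
```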
[PATCH 7/8] nowait aio: xfs
From: Goldwyn RodriguesIf IOCB_NOWAIT is set, bail if the i_rwsem is not lockable immediately. IF IOMAP_NOWAIT is set, return EAGAIN in xfs_file_iomap_begin if it needs allocation either due to file extension, writing to a hole, or COW or waiting for other DIOs to finish. Signed-off-by: Goldwyn Rodrigues --- fs/xfs/xfs_file.c | 15 +++ fs/xfs/xfs_iomap.c | 13 + 2 files changed, 24 insertions(+), 4 deletions(-) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 35703a8..08a5eef 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -541,8 +541,11 @@ xfs_file_dio_aio_write( iolock = XFS_IOLOCK_SHARED; } - xfs_ilock(ip, iolock); - + if (!xfs_ilock_nowait(ip, iolock)) { + if (iocb->ki_flags & IOCB_NOWAIT) + return -EAGAIN; + xfs_ilock(ip, iolock); + } ret = xfs_file_aio_write_checks(iocb, from, ); if (ret) goto out; @@ -553,9 +556,13 @@ xfs_file_dio_aio_write( * otherwise demote the lock if we had to take the exclusive lock * for other reasons in xfs_file_aio_write_checks. */ - if (unaligned_io) + if (unaligned_io) { + /* If we are going to wait for other DIO to finish, bail */ + if ((iocb->ki_flags & IOCB_NOWAIT) && +atomic_read(>i_dio_count)) + return -EAGAIN; inode_dio_wait(inode); - else if (iolock == XFS_IOLOCK_EXCL) { + } else if (iolock == XFS_IOLOCK_EXCL) { xfs_ilock_demote(ip, XFS_IOLOCK_EXCL); iolock = XFS_IOLOCK_SHARED; } diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index 288ee5b..6843725 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -1015,6 +1015,11 @@ xfs_file_iomap_begin( if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) { if (flags & IOMAP_DIRECT) { + /* A reflinked inode will result in CoW alloc */ + if (flags & IOMAP_NOWAIT) { + error = -EAGAIN; + goto out_unlock; + } /* may drop and re-acquire the ilock */ error = xfs_reflink_allocate_cow(ip, , , ); @@ -1032,6 +1037,14 @@ xfs_file_iomap_begin( if ((flags & IOMAP_WRITE) && imap_needs_alloc(inode, , nimaps)) { /* +* If nowait is set bail since we are 
going to make +* allocations. +*/ + if (flags & IOMAP_NOWAIT) { + error = -EAGAIN; + goto out_unlock; + } + /* * We cap the maximum length we map here to MAX_WRITEBACK_PAGES * pages to keep the chunks of work done where somewhat symmetric * with the work writeback does. This is a completely arbitrary -- 2.10.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
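The unaligned-DIO branch of the XFS patch can be isolated the same way: if other direct I/O is in flight, a nowait submitter must not call inode_dio_wait(), so it bails with -EAGAIN. unaligned_dio_check() is an invented helper that models only the decision, not the locking around it.

```c
#include <errno.h>

/* dio_in_flight models atomic_read(&inode->i_dio_count). */
static int unaligned_dio_check(int nowait, int dio_in_flight)
{
    if (!dio_in_flight)
        return 0;          /* nothing to wait for, proceed */
    if (nowait)
        return -EAGAIN;    /* would sleep in inode_dio_wait(), bail */
    return 1;              /* caller should call inode_dio_wait() */
}
```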
[PATCH 5/8] nowait aio: return on congested block device
From: Goldwyn RodriguesA new flag BIO_NOWAIT is introduced to identify bio's orignating from iocb with IOCB_NOWAIT. This flag indicates to return immediately if a request cannot be made instead of retrying. Signed-off-by: Goldwyn Rodrigues --- block/blk-core.c | 12 ++-- block/blk-mq-sched.c | 3 +++ block/blk-mq.c| 4 fs/direct-io.c| 11 +-- include/linux/bio.h | 6 ++ include/linux/blk_types.h | 1 + 6 files changed, 33 insertions(+), 4 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 0eeb99e..2e5cba2 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1232,6 +1232,11 @@ static struct request *get_request(struct request_queue *q, unsigned int op, if (!IS_ERR(rq)) return rq; + if (bio && bio_flagged(bio, BIO_NOWAIT)) { + blk_put_rl(rl); + return ERR_PTR(-EAGAIN); + } + if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) { blk_put_rl(rl); return rq; @@ -2014,7 +2019,7 @@ blk_qc_t generic_make_request(struct bio *bio) do { struct request_queue *q = bdev_get_queue(bio->bi_bdev); - if (likely(blk_queue_enter(q, false) == 0)) { + if (likely(blk_queue_enter(q, bio_flagged(bio, BIO_NOWAIT)) == 0)) { struct bio_list hold; struct bio_list lower, same; @@ -2040,7 +2045,10 @@ blk_qc_t generic_make_request(struct bio *bio) bio_list_merge(_list_on_stack, ); bio_list_merge(_list_on_stack, ); } else { - bio_io_error(bio); + if (unlikely(bio_flagged(bio, BIO_NOWAIT))) + bio_wouldblock_error(bio); + else + bio_io_error(bio); } bio = bio_list_pop(current->bio_list); } while (bio); diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index 09af8ff..40e78b5 100644 --- a/block/blk-mq-sched.c +++ b/block/blk-mq-sched.c @@ -119,6 +119,9 @@ struct request *blk_mq_sched_get_request(struct request_queue *q, if (likely(!data->hctx)) data->hctx = blk_mq_map_queue(q, data->ctx->cpu); + if (likely(bio) && bio_flagged(bio, BIO_NOWAIT)) + data->flags |= BLK_MQ_REQ_NOWAIT; + if (e) { data->flags |= BLK_MQ_REQ_INTERNAL; diff --git a/block/blk-mq.c 
b/block/blk-mq.c index 159187a..942ce8c 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1518,6 +1518,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio) rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, ); if (unlikely(!rq)) { __wbt_done(q->rq_wb, wb_acct); + if (bio && bio_flagged(bio, BIO_NOWAIT)) + bio_wouldblock_error(bio); return BLK_QC_T_NONE; } @@ -1642,6 +1644,8 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio) rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, ); if (unlikely(!rq)) { __wbt_done(q->rq_wb, wb_acct); + if (bio && bio_flagged(bio, BIO_NOWAIT)) + bio_wouldblock_error(bio); return BLK_QC_T_NONE; } diff --git a/fs/direct-io.c b/fs/direct-io.c index a04ebea..f6835d3 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -386,6 +386,9 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio, else bio->bi_end_io = dio_bio_end_io; + if (dio->iocb->ki_flags & IOCB_NOWAIT) + bio_set_flag(bio, BIO_NOWAIT); + sdio->bio = bio; sdio->logical_offset_in_bio = sdio->cur_page_fs_offset; } @@ -480,8 +483,12 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio) unsigned i; int err; - if (bio->bi_error) - dio->io_error = -EIO; + if (bio->bi_error) { + if (bio_flagged(bio, BIO_NOWAIT)) + dio->io_error = -EAGAIN; + else + dio->io_error = -EIO; + } if (dio->is_async && dio->op == REQ_OP_READ && dio->should_dirty) { err = bio->bi_error; diff --git a/include/linux/bio.h b/include/linux/bio.h index 8e52119..1a92707 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -425,6 +425,12 @@ static inline void bio_io_error(struct bio *bio) bio_endio(bio); } +static inline void bio_wouldblock_error(struct bio *bio) +{ + bio->bi_error = -EAGAIN; + bio_endio(bio); +} + struct request_queue; extern int bio_phys_segments(struct request_queue *, struct bio *); diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index
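The nowait completion path introduced above — -EAGAIN instead of the generic -EIO — can be modeled with a two-field stand-in for struct bio, so the caller can distinguish "request refused, queue congested" from "media failure". The BIO_NOWAIT bit value here is illustrative, not the kernel's.

```c
#include <errno.h>

/* struct bio reduced to the two fields this sketch needs. */
struct bio {
    int bi_error;
    unsigned int bi_flags;
};

#define BIO_NOWAIT (1U << 0)   /* illustrative bit, not the kernel's value */

static void bio_endio(struct bio *bio) { (void)bio; /* completion stub */ }

/* Model of the patch's bio_wouldblock_error(): complete with -EAGAIN. */
static void bio_wouldblock_error(struct bio *bio)
{
    bio->bi_error = -EAGAIN;
    bio_endio(bio);
}

/* Model of dio_bio_complete(): map a per-bio error to the dio-wide one. */
static int dio_map_error(const struct bio *bio)
{
    if (!bio->bi_error)
        return 0;
    return (bio->bi_flags & BIO_NOWAIT) ? -EAGAIN : -EIO;
}
```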
[PATCH 6/8] nowait aio: ext4
From: Goldwyn RodriguesReturn EAGAIN if any of the following checks fail for direct I/O: + i_rwsem is lockable + Writing beyond end of file (will trigger allocation) + Blocks are not allocated at the write location Signed-off-by: Goldwyn Rodrigues --- fs/ext4/file.c | 48 +++- 1 file changed, 31 insertions(+), 17 deletions(-) diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 8210c1f..e223b9f 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -127,27 +127,22 @@ ext4_unaligned_aio(struct inode *inode, struct iov_iter *from, loff_t pos) return 0; } -/* Is IO overwriting allocated and initialized blocks? */ -static bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len) +/* Are IO blocks allocated */ +static bool ext4_blocks_mapped(struct inode *inode, loff_t pos, loff_t len, + struct ext4_map_blocks *map) { - struct ext4_map_blocks map; unsigned int blkbits = inode->i_blkbits; int err, blklen; if (pos + len > i_size_read(inode)) return false; - map.m_lblk = pos >> blkbits; - map.m_len = EXT4_MAX_BLOCKS(len, pos, blkbits); - blklen = map.m_len; + map->m_lblk = pos >> blkbits; + map->m_len = EXT4_MAX_BLOCKS(len, pos, blkbits); + blklen = map->m_len; - err = ext4_map_blocks(NULL, inode, , 0); - /* -* 'err==len' means that all of the blocks have been preallocated, -* regardless of whether they have been initialized or not. To exclude -* unwritten extents, we need to check m_flags. 
-*/ - return err == blklen && (map.m_flags & EXT4_MAP_MAPPED); + err = ext4_map_blocks(NULL, inode, map, 0); + return err == blklen; } static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from) @@ -204,6 +199,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from) { struct inode *inode = file_inode(iocb->ki_filp); int o_direct = iocb->ki_flags & IOCB_DIRECT; + int nowait = iocb->ki_flags & IOCB_NOWAIT; int unaligned_aio = 0; int overwrite = 0; ssize_t ret; @@ -216,7 +212,13 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from) return ext4_dax_write_iter(iocb, from); #endif - inode_lock(inode); + if (o_direct && nowait) { + if (!inode_trylock(inode)) + return -EAGAIN; + } else { + inode_lock(inode); + } + ret = ext4_write_checks(iocb, from); if (ret <= 0) goto out; @@ -235,9 +237,21 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from) iocb->private = /* Check whether we do a DIO overwrite or not */ - if (o_direct && ext4_should_dioread_nolock(inode) && !unaligned_aio && - ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from))) - overwrite = 1; + if (o_direct && !unaligned_aio) { + struct ext4_map_blocks map; + if (ext4_blocks_mapped(inode, iocb->ki_pos, + iov_iter_count(from), )) { + /* To exclude unwritten extents, we need to check +* m_flags. +*/ + if (ext4_should_dioread_nolock(inode) && + (map.m_flags & EXT4_MAP_MAPPED)) + overwrite = 1; + } else if (iocb->ki_flags & IOCB_NOWAIT) { + ret = -EAGAIN; + goto out; + } + } ret = __generic_file_write_iter(iocb, from); inode_unlock(inode); -- 2.10.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
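The reworked ext4 decision — lockless overwrite fast path, regular path, or -EAGAIN — fits in a small table-like function: mapped and initialized blocks permit the dioread_nolock overwrite path, while unmapped blocks under IOCB_NOWAIT mean allocation would be needed, so the write bails. Names and the enum below are invented for illustration.

```c
enum ext4_dio_path { DIO_OVERWRITE, DIO_REGULAR, DIO_EAGAIN };

/* blocks_mapped: ext4_blocks_mapped() result; map_mapped: the
 * EXT4_MAP_MAPPED test that excludes unwritten extents. */
static enum ext4_dio_path pick_path(int blocks_mapped, int map_mapped,
                                    int dioread_nolock, int nowait)
{
    if (blocks_mapped) {
        if (dioread_nolock && map_mapped)
            return DIO_OVERWRITE;   /* allocated + initialized: overwrite */
        return DIO_REGULAR;         /* allocated but maybe unwritten */
    }
    if (nowait)
        return DIO_EAGAIN;          /* would allocate blocks: bail */
    return DIO_REGULAR;
}
```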
[PATCH 4/8] nowait-aio: Introduce IOMAP_NOWAIT
From: Goldwyn Rodrigues

IOCB_NOWAIT translates to IOMAP_NOWAIT for iomaps. This is used by XFS
in the XFS patch.

Signed-off-by: Goldwyn Rodrigues
---
 fs/iomap.c            | 2 ++
 include/linux/iomap.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/fs/iomap.c b/fs/iomap.c
index 141c3cd..d1c8175 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -885,6 +885,8 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	} else {
 		dio->flags |= IOMAP_DIO_WRITE;
 		flags |= IOMAP_WRITE;
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			flags |= IOMAP_NOWAIT;
 	}
 
 	if (mapping->nrpages) {
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 7291810..53f6af8 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -51,6 +51,7 @@ struct iomap {
 #define IOMAP_REPORT	(1 << 2) /* report extent status, e.g. FIEMAP */
 #define IOMAP_FAULT	(1 << 3) /* mapping for page fault */
 #define IOMAP_DIRECT	(1 << 4) /* direct I/O */
+#define IOMAP_NOWAIT	(1 << 5) /* Don't wait for writeback */
 
 struct iomap_ops {
 	/*
-- 
2.10.2
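The translation iomap_dio_rw() performs is a one-way flag mapping on the write path only. A sketch using the bit values from the patches (treat them as illustrative; the helper name is invented):

```c
#define IOCB_NOWAIT  (1 << 7)   /* values as defined in the patches */
#define IOMAP_WRITE  (1 << 0)
#define IOMAP_NOWAIT (1 << 5)

/* Carry the iocb-level NOWAIT bit into iomap flags, but only when the
 * operation is a write, matching the iomap_dio_rw() hunk above. */
static unsigned int iomap_flags_for(unsigned int ki_flags, int is_write)
{
    unsigned int flags = 0;

    if (is_write) {
        flags |= IOMAP_WRITE;
        if (ki_flags & IOCB_NOWAIT)
            flags |= IOMAP_NOWAIT;
    }
    return flags;
}
```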
[PATCH 0/8 v3] No wait AIO
Formerly known as non-blocking AIO. This series adds a nonblocking
feature to asynchronous I/O writes. io_submit() can be delayed for a
number of reasons:
 - Block allocation for files
 - Data writebacks for direct I/O
 - Sleeping while waiting to acquire i_rwsem
 - Congested block device

The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if any of
these conditions are met. This way userspace can push most of the
write()s to the kernel to the best of its ability to complete, and if a
submission returns -EAGAIN, can defer it to another thread.

In order to enable this, IOCB_RW_FLAG_NOWAIT is introduced in
uapi/linux/aio_abi.h. If set for aio_rw_flags, it translates to
IOCB_NOWAIT for struct kiocb, BIO_NOWAIT for the bio and IOMAP_NOWAIT
for iomap. aio_rw_flags is a new field replacing aio_reserved1. We could
not use aio_flags because it is not currently checked for invalidity in
the kernel.

This feature is provided for direct I/O of asynchronous I/O only. I have
tested it against xfs, ext4, and btrfs.

Changes since v1:
 + changed name from _NONBLOCKING to *_NOWAIT
 + filemap_range_has_page call moved closer to (just before) calling
   filemap_write_and_wait_range()
 + BIO_NOWAIT limited to get_request()
 + XFS fixes
   - included reflink
   - use of xfs_ilock_nowait() instead of a XFS_IOLOCK_NONBLOCKING flag
   - Translate the flag through IOMAP_NOWAIT (iomap) to check for block
     allocation for the file
 + ext4 coding style

Changes since v2:
 + Using aio_reserved1 as aio_rw_flags instead of aio_flags
 + blk-mq support
 + xfs up to date with kernel and reflink changes

-- 
Goldwyn
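From userspace, the intended pattern is: submit with the nowait flag set, and on -EAGAIN hand the iocb to a blocking worker. The sketch below abstracts io_submit() behind a hypothetical callback so the dispatch logic stands alone; fake_submit models an always-congested device for the test.

```c
#include <errno.h>

/* submit_fn stands in for io_submit() with/without the nowait flag;
 * it is an assumption of this illustration, not a real API. */
typedef int (*submit_fn)(int nowait);

/* Try the fast nowait submission first; on -EAGAIN, mark the request
 * as deferred and resubmit on the (possibly sleeping) slow path. */
static int submit_with_fallback(submit_fn submit, int *deferred)
{
    int ret = submit(1);       /* IOCB_RW_FLAG_NOWAIT set */
    if (ret == -EAGAIN) {
        *deferred = 1;         /* would normally queue to a worker thread */
        ret = submit(0);       /* worker path: flag clear, may sleep */
    }
    return ret;
}

/* Fake backend: nowait submissions always hit congestion. */
static int fake_submit(int nowait)
{
    return nowait ? -EAGAIN : 0;
}
```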
Re: [PATCH v2 0/7] cleanup __btrfs_map_block
On Wed, Mar 15, 2017 at 02:07:53PM +0100, David Sterba wrote:
> On Tue, Mar 14, 2017 at 01:33:54PM -0700, Liu Bo wrote:
> > This is attempting to make __btrfs_map_block less scary :)
> >
> > The major changes are
> >
> > 1) split operations for discard out of __btrfs_map_block, and we don't copy
> > discard operations for the target device of dev replace since they're not
> > as important as writes.
> >
> > 2) put dev replace stuff into helpers since they're basically
> > self-contained.
>
> Thanks, I'm going to add the branch to the 4.12 queue (right now the branch
> is misc-next but it could change),
>
> https://marc.info/?l=linux-btrfs&m=148741582021588
>
> and fix that one too.

Oh, sorry about that, copy-and-paste...

Thanks,

-liubo
possible deadlock between fsfreeze and asynchronous faults
Hello,

Here is another lockdep splat I got:

[ 1131.517411] ==
[ 1131.518059] [ INFO: possible circular locking dependency detected ]
[ 1131.518059] 4.11.0-rc1-nbor #147 Tainted: GW
[ 1131.518059] ---
[ 1131.518059] xfs_io/2661 is trying to acquire lock:
[ 1131.518059]  (sb_internal#2){.+}, at: [] percpu_down_write+0x25/0x120
[ 1131.518059]
[ 1131.518059] but task is already holding lock:
[ 1131.518059]  (sb_pagefaults){..}, at: [] percpu_down_write+0x25/0x120
[ 1131.518059]
[ 1131.518059] which lock already depends on the new lock.
[ 1131.518059]
[ 1131.518059] the existing dependency chain (in reverse order) is:
[ 1131.518059]
[ 1131.518059] -> #4 (sb_pagefaults){..}:
[ 1131.518059]        lock_acquire+0xc5/0x220
[ 1131.518059]        __sb_start_write+0x119/0x1d0
[ 1131.518059]        btrfs_page_mkwrite+0x51/0x420
[ 1131.518059]        do_page_mkwrite+0x38/0xb0
[ 1131.518059]        __handle_mm_fault+0x6b5/0xef0
[ 1131.518059]        handle_mm_fault+0x175/0x300
[ 1131.518059]        __do_page_fault+0x1e0/0x4d0
[ 1131.518059]        trace_do_page_fault+0xaa/0x270
[ 1131.518059]        do_async_page_fault+0x19/0x70
[ 1131.518059]        async_page_fault+0x28/0x30
[ 1131.518059]
[ 1131.518059] -> #3 (>mmap_sem){++}:
[ 1131.518059]        lock_acquire+0xc5/0x220
[ 1131.518059]        down_read+0x47/0x70
[ 1131.518059]        get_user_pages_unlocked+0x4f/0x1a0
[ 1131.518059]        get_user_pages_fast+0x81/0x170
[ 1131.518059]        iov_iter_get_pages+0xc1/0x300
[ 1131.518059]        __blockdev_direct_IO+0x14f8/0x34e0
[ 1131.518059]        btrfs_direct_IO+0x1e8/0x390
[ 1131.518059]        generic_file_direct_write+0xb5/0x160
[ 1131.518059]        btrfs_file_write_iter+0x26d/0x500
[ 1131.518059]        aio_write+0xdb/0x190
[ 1131.518059]        do_io_submit+0x5aa/0x830
[ 1131.518059]        SyS_io_submit+0x10/0x20
[ 1131.518059]        entry_SYSCALL_64_fastpath+0x23/0xc6
[ 1131.518059]
[ 1131.518059] -> #2 (>dio_sem){.+}:
[ 1131.518059]        lock_acquire+0xc5/0x220
[ 1131.518059]        down_write+0x44/0x80
[ 1131.518059]        btrfs_log_changed_extents+0x7c/0x660
[ 1131.518059]        btrfs_log_inode+0xb78/0xf50
[ 1131.518059]        btrfs_log_inode_parent+0x2a9/0xa70
[ 1131.518059]        btrfs_log_dentry_safe+0x74/0xa0
[ 1131.518059]        btrfs_sync_file+0x321/0x4d0
[ 1131.518059]        vfs_fsync_range+0x46/0xc0
[ 1131.518059]        vfs_fsync+0x1c/0x20
[ 1131.518059]        do_fsync+0x38/0x60
[ 1131.518059]        SyS_fsync+0x10/0x20
[ 1131.518059]        entry_SYSCALL_64_fastpath+0x23/0xc6
[ 1131.518059]
[ 1131.518059] -> #1 (>log_mutex){+.+...}:
[ 1131.518059]        lock_acquire+0xc5/0x220
[ 1131.518059]        __mutex_lock+0x7c/0x960
[ 1131.518059]        mutex_lock_nested+0x1b/0x20
[ 1131.518059]        btrfs_record_unlink_dir+0x3e/0xb0
[ 1131.518059]        btrfs_unlink+0x72/0xf0
[ 1131.518059]        vfs_unlink+0xbe/0x1b0
[ 1131.518059]        do_unlinkat+0x244/0x280
[ 1131.518059]        SyS_unlinkat+0x1d/0x30
[ 1131.518059]        entry_SYSCALL_64_fastpath+0x23/0xc6
[ 1131.518059]
[ 1131.518059] -> #0 (sb_internal#2){.+}:
[ 1131.518059]        __lock_acquire+0x16f1/0x17c0
[ 1131.518059]        lock_acquire+0xc5/0x220
[ 1131.518059]        down_write+0x44/0x80
[ 1131.518059]        percpu_down_write+0x25/0x120
[ 1131.518059]        freeze_super+0xbf/0x1a0
[ 1131.518059]        do_vfs_ioctl+0x598/0x770
[ 1131.518059]        SyS_ioctl+0x4c/0x90
[ 1131.518059]        entry_SYSCALL_64_fastpath+0x23/0xc6
[ 1131.518059]
[ 1131.518059] other info that might help us debug this:
[ 1131.518059]
[ 1131.518059] Chain exists of:
[ 1131.518059]   sb_internal#2 --> >mmap_sem --> sb_pagefaults
[ 1131.518059]
[ 1131.518059]  Possible unsafe locking scenario:
[ 1131.518059]
[ 1131.518059]        CPU0                    CPU1
[ 1131.518059]
[ 1131.518059]   lock(sb_pagefaults);
[ 1131.518059]                                lock(>mmap_sem);
[ 1131.518059]                                lock(sb_pagefaults);
[ 1131.518059]   lock(sb_internal#2);
[ 1131.518059]
[ 1131.518059]  *** DEADLOCK ***
[ 1131.518059]
[ 1131.518059] 3 locks held by xfs_io/2661:
[ 1131.518059]  #0:  (sb_writers#11){.+}, at: [] percpu_down_write+0x25/0x120
[ 1131.518059]  #1:  (>s_umount_key#33){+.}, at: [] freeze_super+0x93/0x1a0
[ 1131.518059]  #2:  (sb_pagefaults){..}, at: [] percpu_down_write+0x25/0x120
[ 1131.518059]
[ 1131.518059] stack backtrace:
[ 1131.518059] CPU: 0 PID: 2661 Comm: xfs_io Tainted: GW 4.11.0-rc1-nbor #147
[ 1131.518059] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 1131.518059] Call Trace:
[ 1131.518059]  dump_stack+0x85/0xc9
[ 1131.518059]  print_circular_bug+0x2ac/0x2ba
[ 1131.518059]  __lock_acquire+0x16f1/0x17c0 [
[PATCH 5/7] btrfs: remove redundant parameter from reada_find_zone
We can read fs_info from dev.

Signed-off-by: David Sterba
---
 fs/btrfs/reada.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index 5edf7328f67d..c1fc79cd4b2a 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -235,10 +235,10 @@ int btree_readahead_hook(struct extent_buffer *eb, int err)
 	return ret;
 }
 
-static struct reada_zone *reada_find_zone(struct btrfs_fs_info *fs_info,
-					  struct btrfs_device *dev, u64 logical,
+static struct reada_zone *reada_find_zone(struct btrfs_device *dev, u64 logical,
 					  struct btrfs_bio *bbio)
 {
+	struct btrfs_fs_info *fs_info = dev->fs_info;
 	int ret;
 	struct reada_zone *zone;
 	struct btrfs_block_group_cache *cache = NULL;
@@ -372,7 +372,7 @@ static struct reada_extent *reada_find_extent(struct btrfs_fs_info *fs_info,
 		if (!dev->bdev)
 			continue;
 
-		zone = reada_find_zone(fs_info, dev, logical, bbio);
+		zone = reada_find_zone(dev, logical, bbio);
 		if (!zone)
 			continue;
-- 
2.12.0
[PATCH 1/7] btrfs: preallocate radix tree node for readahead
We can preallocate the node so insertion does not have to do that under
the lock. The GFP flags for the per-device radix tree are initialized to
GFP_NOFS & ~__GFP_DIRECT_RECLAIM, but we can use GFP_KERNEL, same as
the allocation above it, and also because readahead is optional and not
on any critical writeout path.

Signed-off-by: David Sterba
---
 fs/btrfs/reada.c   | 7 +++++++
 fs/btrfs/volumes.c | 2 +-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index e88bca87f5d2..fdae8ca79401 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -270,6 +270,12 @@ static struct reada_zone *reada_find_zone(struct btrfs_fs_info *fs_info,
 	if (!zone)
 		return NULL;
 
+	ret = radix_tree_preload(GFP_KERNEL);
+	if (ret) {
+		kfree(zone);
+		return NULL;
+	}
+
 	zone->start = start;
 	zone->end = end;
 	INIT_LIST_HEAD(&zone->list);
@@ -299,6 +305,7 @@ static struct reada_zone *reada_find_zone(struct btrfs_fs_info *fs_info,
 		zone = NULL;
 	}
 	spin_unlock(&fs_info->reada_lock);
+	radix_tree_preload_end();
 
 	return zone;
 }
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 73d56eef5e60..f158b8657ae3 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -247,7 +247,7 @@ static struct btrfs_device *__alloc_device(void)
 	atomic_set(&dev->reada_in_flight, 0);
 	atomic_set(&dev->dev_stats_ccnt, 0);
 	btrfs_device_data_ordered_init(dev);
-	INIT_RADIX_TREE(&dev->reada_zones, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
+	INIT_RADIX_TREE(&dev->reada_zones, GFP_KERNEL);
 	INIT_RADIX_TREE(&dev->reada_extents, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
 
 	return dev;
-- 
2.12.0
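The preload idiom in this patch — allocate before taking the lock, consume the preallocation under it — can be shown with a one-slot placeholder tree. This is only the shape of radix_tree_preload()/radix_tree_preload_end(), not their implementation; the names and the single-slot "tree" are stand-ins.

```c
#include <stdlib.h>

struct node { int key; };

static struct node *preloaded;   /* per-task scratch slot, like the
                                  * radix tree's per-CPU preload pool */

/* May sleep: must be called before the spinlock is taken. */
static int tree_preload(void)
{
    preloaded = malloc(sizeof(*preloaded));
    return preloaded ? 0 : -1;
}

/* Lock held: must not allocate; only consumes the preallocation.
 * (The inserted node is intentionally not freed in this sketch.) */
static int tree_insert_locked(int key)
{
    if (!preloaded)
        return -1;               /* no preload available: caller's bug */
    preloaded->key = key;
    preloaded = NULL;            /* consumed by the insertion */
    return 0;
}
```

The ordering is the whole point: because the allocation happens before the spinlock, the insert path cannot sleep while holding it, which is why the patch can also switch the tree's GFP flags to GFP_KERNEL.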
[PATCH 7/7] btrfs: remove local blocksize variable in reada_find_extent
The name is misleading and the local variable serves no purpose.

Signed-off-by: David Sterba
---
 fs/btrfs/reada.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index 91df381a60ce..64425c3fe4f5 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -318,7 +318,6 @@ static struct reada_extent *reada_find_extent(struct btrfs_fs_info *fs_info,
 	struct btrfs_bio *bbio = NULL;
 	struct btrfs_device *dev;
 	struct btrfs_device *prev_dev;
-	u32 blocksize;
 	u64 length;
 	int real_stripes;
 	int nzones = 0;
@@ -339,7 +338,6 @@ static struct reada_extent *reada_find_extent(struct btrfs_fs_info *fs_info,
 	if (!re)
 		return NULL;
 
-	blocksize = fs_info->nodesize;
 	re->logical = logical;
 	re->top = *top;
 	INIT_LIST_HEAD(&re->extctl);
@@ -349,10 +347,10 @@ static struct reada_extent *reada_find_extent(struct btrfs_fs_info *fs_info,
 	/*
 	 * map block
 	 */
-	length = blocksize;
+	length = fs_info->nodesize;
 	ret = btrfs_map_block(fs_info, BTRFS_MAP_GET_READ_MIRRORS, logical,
 			      &length, &bbio, 0);
-	if (ret || !bbio || length < blocksize)
+	if (ret || !bbio || length < fs_info->nodesize)
 		goto error;
 
 	if (bbio->num_stripes > BTRFS_MAX_MIRRORS) {
-- 
2.12.0
[PATCH 4/7] btrfs: remove redundant parameter from btree_readahead_hook
We can read fs_info from eb.

Signed-off-by: David Sterba
---
 fs/btrfs/ctree.h   | 3 +--
 fs/btrfs/disk-io.c | 4 ++--
 fs/btrfs/reada.c   | 4 ++--
 3 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 29b7fc28c607..173fac68323a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3671,8 +3671,7 @@ struct reada_control *btrfs_reada_add(struct btrfs_root *root,
 			      struct btrfs_key *start, struct btrfs_key *end);
 int btrfs_reada_wait(void *handle);
 void btrfs_reada_detach(void *handle);
-int btree_readahead_hook(struct btrfs_fs_info *fs_info,
-			 struct extent_buffer *eb, int err);
+int btree_readahead_hook(struct extent_buffer *eb, int err);
 
 static inline int is_fstree(u64 rootid)
 {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1d4c30327247..995b28179af9 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -762,7 +762,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 err:
 	if (reads_done &&
 	    test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
-		btree_readahead_hook(fs_info, eb, ret);
+		btree_readahead_hook(eb, ret);
 
 	if (ret) {
 		/*
@@ -787,7 +787,7 @@ static int btree_io_failed_hook(struct page *page, int failed_mirror)
 	eb->read_mirror = failed_mirror;
 	atomic_dec(&eb->io_pages);
 	if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
-		btree_readahead_hook(eb->fs_info, eb, -EIO);
+		btree_readahead_hook(eb, -EIO);
 
 	return -EIO;	/* we fixed nothing */
 }
diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index 4c5a9b241cab..5edf7328f67d 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -209,9 +209,9 @@ static void __readahead_hook(struct btrfs_fs_info *fs_info,
 	return;
 }
 
-int btree_readahead_hook(struct btrfs_fs_info *fs_info,
-			 struct extent_buffer *eb, int err)
+int btree_readahead_hook(struct extent_buffer *eb, int err)
 {
+	struct btrfs_fs_info *fs_info = eb->fs_info;
 	int ret = 0;
 	struct reada_extent *re;
 
-- 
2.12.0
[PATCH 3/7] btrfs: preallocate radix tree node for global readahead tree
We can preallocate the node so the insertion does not have to do that under
the lock. The GFP flags for the global radix tree are initialized to
GFP_NOFS & ~__GFP_DIRECT_RECLAIM, but we can use GFP_KERNEL, because
readahead is optional and not on any critical writeout path.

Signed-off-by: David Sterba
---
 fs/btrfs/disk-io.c | 2 +-
 fs/btrfs/reada.c   | 7 +++++++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 08b74daf35d0..1d4c30327247 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2693,7 +2693,7 @@ int open_ctree(struct super_block *sb,
 	fs_info->commit_interval = BTRFS_DEFAULT_COMMIT_INTERVAL;
 	fs_info->avg_delayed_ref_runtime = NSEC_PER_SEC >> 6; /* div by 64 */
 	/* readahead state */
-	INIT_RADIX_TREE(&fs_info->reada_tree, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
+	INIT_RADIX_TREE(&fs_info->reada_tree, GFP_KERNEL);
 	spin_lock_init(&fs_info->reada_lock);
 
 	fs_info->thread_pool_size = min_t(unsigned long,
diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index dd78af5d265d..4c5a9b241cab 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -391,6 +391,10 @@ static struct reada_extent *reada_find_extent(struct btrfs_fs_info *fs_info,
 		goto error;
 	}
 
+	ret = radix_tree_preload(GFP_KERNEL);
+	if (ret)
+		goto error;
+
 	/* insert extent in reada_tree + all per-device trees, all or nothing */
 	btrfs_dev_replace_lock(&fs_info->dev_replace, 0);
 	spin_lock(&fs_info->reada_lock);
@@ -400,13 +404,16 @@ static struct reada_extent *reada_find_extent(struct btrfs_fs_info *fs_info,
 		re_exist->refcnt++;
 		spin_unlock(&fs_info->reada_lock);
 		btrfs_dev_replace_unlock(&fs_info->dev_replace, 0);
+		radix_tree_preload_end();
 		goto error;
 	}
 	if (ret) {
 		spin_unlock(&fs_info->reada_lock);
 		btrfs_dev_replace_unlock(&fs_info->dev_replace, 0);
+		radix_tree_preload_end();
 		goto error;
 	}
+	radix_tree_preload_end();
 	prev_dev = NULL;
 	dev_replace_is_ongoing = btrfs_dev_replace_is_ongoing(
 			&fs_info->dev_replace);
-- 
2.12.0
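The pattern in the patch above — do the allocation with a sleeping allocator *before* taking the lock, so the insert under the lock never has to allocate — can be sketched in plain userspace C. This is a toy analogy only: a linked-list map and a pthread mutex stand in for the kernel radix tree and spinlock, and all names (`map_preload`, `map_insert`, etc.) are illustrative, not kernel API.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

struct node { unsigned long key; void *val; struct node *next; };

struct map {
	pthread_mutex_t lock;
	struct node *head;
};

/* Phase 1: "preload" -- allocation may sleep, done outside the lock. */
static struct node *map_preload(void)
{
	return calloc(1, sizeof(struct node));
}

/*
 * Phase 2: insert under the lock using the preallocated node; never
 * allocates. Returns 0 on success, -1 if the key already exists (the
 * caller then frees the unused node, like the -EEXIST path above).
 */
static int map_insert(struct map *m, struct node *n, unsigned long key,
		      void *val)
{
	struct node *it;

	pthread_mutex_lock(&m->lock);
	for (it = m->head; it; it = it->next) {
		if (it->key == key) {
			pthread_mutex_unlock(&m->lock);
			return -1;
		}
	}
	n->key = key;
	n->val = val;
	n->next = m->head;
	m->head = n;
	pthread_mutex_unlock(&m->lock);
	return 0;
}

static void *map_lookup(struct map *m, unsigned long key)
{
	struct node *it;
	void *val = NULL;

	pthread_mutex_lock(&m->lock);
	for (it = m->head; it; it = it->next)
		if (it->key == key)
			val = it->val;
	pthread_mutex_unlock(&m->lock);
	return val;
}
```

The point mirrors the patch: radix_tree_preload() stocks per-CPU nodes up front, so the later radix_tree_insert() under fs_info->reada_lock cannot fail on allocation, and every exit path must balance it with radix_tree_preload_end().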
[PATCH 0/7] Readahead cleanups
I spotted some GFP_NOFS uses in readahead and converted them to GFP_KERNEL,
with a few cleanups along the way.

David Sterba (7):
  btrfs: preallocate radix tree node for readahead
  btrfs: use simpler readahead zone lookups
  btrfs: preallocate radix tree node for global readahead tree
  btrfs: remove redundant parameter from btree_readahead_hook
  btrfs: remove redundant parameter from reada_find_zone
  btrfs: remove redundant parameter from reada_start_machine_dev
  btrfs: remove local blocksize variable in reada_find_extent

 fs/btrfs/ctree.h   |  3 +-
 fs/btrfs/disk-io.c |  6 ++--
 fs/btrfs/reada.c   | 89 ++++++++++++++++++++++++++---------------------------
 fs/btrfs/volumes.c |  2 +-
 4 files changed, 51 insertions(+), 49 deletions(-)

-- 
2.12.0
[PATCH 2/7] btrfs: use simpler readahead zone lookups
No point using radix_tree_gang_lookup if we're looking up just one slot.

Signed-off-by: David Sterba
---
 fs/btrfs/reada.c | 52 ++++++++++++++++++++--------------------------------
 1 file changed, 22 insertions(+), 30 deletions(-)

diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index fdae8ca79401..dd78af5d265d 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -246,11 +246,9 @@ static struct reada_zone *reada_find_zone(struct btrfs_fs_info *fs_info,
 	u64 end;
 	int i;
 
-	zone = NULL;
 	spin_lock(&fs_info->reada_lock);
-	ret = radix_tree_gang_lookup(&dev->reada_zones, (void **)&zone,
-				     logical >> PAGE_SHIFT, 1);
-	if (ret == 1 && logical >= zone->start && logical <= zone->end) {
+	zone = radix_tree_lookup(&dev->reada_zones, logical >> PAGE_SHIFT);
+	if (zone && logical >= zone->start && logical <= zone->end) {
 		kref_get(&zone->refcnt);
 		spin_unlock(&fs_info->reada_lock);
 		return zone;
@@ -297,9 +295,9 @@ static struct reada_zone *reada_find_zone(struct btrfs_fs_info *fs_info,
 	if (ret == -EEXIST) {
 		kfree(zone);
-		ret = radix_tree_gang_lookup(&dev->reada_zones, (void **)&zone,
-					     logical >> PAGE_SHIFT, 1);
-		if (ret == 1 && logical >= zone->start && logical <= zone->end)
+		zone = radix_tree_lookup(&dev->reada_zones,
+					 logical >> PAGE_SHIFT);
+		if (zone && logical >= zone->start && logical <= zone->end)
 			kref_get(&zone->refcnt);
 		else
 			zone = NULL;
@@ -604,7 +602,6 @@ static int reada_pick_zone(struct btrfs_device *dev)
 	u64 top_elems = 0;
 	u64 top_locked_elems = 0;
 	unsigned long index = 0;
-	int ret;
 
 	if (dev->reada_curr_zone) {
 		reada_peer_zones_set_lock(dev->reada_curr_zone, 0);
@@ -615,9 +612,8 @@ static int reada_pick_zone(struct btrfs_device *dev)
 	while (1) {
 		struct reada_zone *zone;
 
-		ret = radix_tree_gang_lookup(&dev->reada_zones,
-					     (void **)&zone, index, 1);
-		if (ret == 0)
+		zone = radix_tree_lookup(&dev->reada_zones, index);
+		if (!zone)
 			break;
 		index = (zone->end >> PAGE_SHIFT) + 1;
 		if (zone->locked) {
@@ -669,19 +665,18 @@ static int reada_start_machine_dev(struct btrfs_fs_info *fs_info,
 	 * a contiguous block of extents, we could also coagulate them or use
 	 * plugging to speed things up
 	 */
-	ret = radix_tree_gang_lookup(&dev->reada_extents, (void **)&re,
-				     dev->reada_next >> PAGE_SHIFT, 1);
-	if (ret == 0 || re->logical > dev->reada_curr_zone->end) {
+	re = radix_tree_lookup(&dev->reada_extents,
+			       dev->reada_next >> PAGE_SHIFT);
+	if (!re || re->logical > dev->reada_curr_zone->end) {
 		ret = reada_pick_zone(dev);
 		if (!ret) {
 			spin_unlock(&fs_info->reada_lock);
 			return 0;
 		}
-		re = NULL;
-		ret = radix_tree_gang_lookup(&dev->reada_extents, (void **)&re,
-					     dev->reada_next >> PAGE_SHIFT, 1);
+		re = radix_tree_lookup(&dev->reada_extents,
+				       dev->reada_next >> PAGE_SHIFT);
 	}
-	if (ret == 0) {
+	if (!re) {
 		spin_unlock(&fs_info->reada_lock);
 		return 0;
 	}
@@ -809,7 +804,6 @@ static void dump_devs(struct btrfs_fs_info *fs_info, int all)
 	struct btrfs_device *device;
 	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
 	unsigned long index;
-	int ret;
 	int i;
 	int j;
 	int cnt;
@@ -821,9 +815,9 @@ static void dump_devs(struct btrfs_fs_info *fs_info, int all)
 		index = 0;
 		while (1) {
 			struct reada_zone *zone;
 
-			ret = radix_tree_gang_lookup(&device->reada_zones,
-						     (void **)&zone, index, 1);
-			if (ret == 0)
+			zone = radix_tree_lookup(&device->reada_zones, index);
+			if (!zone)
 				break;
 			pr_debug(" zone %llu-%llu elems %llu locked %d devs",
 				zone->start, zone->end, zone->elems,
 				zone->locked);
@@ -841,11 +835,10 @@ static void dump_devs(struct btrfs_fs_info *fs_info, int all)
 	cnt = 0;
 	index = 0;
 	while (all) {
-		struct reada_extent *re = NULL;
+		struct reada_extent *re;
 
-		ret = radix_tree_gang_lookup(&device->reada_extents,
-
[PATCH 6/7] btrfs: remove redundant parameter from reada_start_machine_dev
We can read fs_info from dev.

Signed-off-by: David Sterba
---
 fs/btrfs/reada.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index c1fc79cd4b2a..91df381a60ce 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -649,9 +649,9 @@ static int reada_pick_zone(struct btrfs_device *dev)
 	return 1;
 }
 
-static int reada_start_machine_dev(struct btrfs_fs_info *fs_info,
-				   struct btrfs_device *dev)
+static int reada_start_machine_dev(struct btrfs_device *dev)
 {
+	struct btrfs_fs_info *fs_info = dev->fs_info;
 	struct reada_extent *re = NULL;
 	int mirror_num = 0;
 	struct extent_buffer *eb = NULL;
@@ -763,8 +763,7 @@ static void __reada_start_machine(struct btrfs_fs_info *fs_info)
 	list_for_each_entry(device, &fs_devices->devices, dev_list) {
 		if (atomic_read(&device->reada_in_flight) < MAX_IN_FLIGHT)
-			enqueued += reada_start_machine_dev(fs_info,
-							    device);
+			enqueued += reada_start_machine_dev(device);
 	}
 	mutex_unlock(&fs_devices->device_list_mutex);
 	total += enqueued;
-- 
2.12.0
[PATCH] btrfs: remove unused qgroup members from btrfs_trans_handle
The members have been effectively unused since "Btrfs: rework qgroup
accounting" (fcebe4562dec83b3); there's no substitute for
assert_qgroups_uptodate so it's removed as well.

Signed-off-by: David Sterba
---
 fs/btrfs/extent-tree.c       |  1 -
 fs/btrfs/qgroup.c            | 12 ------------
 fs/btrfs/qgroup.h            |  1 -
 fs/btrfs/tests/btrfs-tests.c |  1 -
 fs/btrfs/transaction.c       |  3 ---
 fs/btrfs/transaction.h       |  2 --
 6 files changed, 20 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index be5477676cc8..b5682abf6f68 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3003,7 +3003,6 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 		goto again;
 	}
 out:
-	assert_qgroups_uptodate(trans);
 	trans->can_flush_pending_bgs = can_flush_pending_bgs;
 	return 0;
 }
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index a5da750c1087..2fa0b10d239f 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2487,18 +2487,6 @@ void btrfs_qgroup_free_refroot(struct btrfs_fs_info *fs_info,
 	spin_unlock(&fs_info->qgroup_lock);
 }
 
-void assert_qgroups_uptodate(struct btrfs_trans_handle *trans)
-{
-	if (list_empty(&trans->qgroup_ref_list) && !trans->delayed_ref_elem.seq)
-		return;
-	btrfs_err(trans->fs_info,
-		"qgroups not uptodate in trans handle %p: list is%s empty, seq is %#x.%x",
-		trans, list_empty(&trans->qgroup_ref_list) ? "" : " not",
-		(u32)(trans->delayed_ref_elem.seq >> 32),
-		(u32)trans->delayed_ref_elem.seq);
-	BUG();
-}
-
 /*
  * returns < 0 on error, 0 when more leafs are to be scanned.
  * returns 1 when done.
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 26932a8a1993..96fc56ebf55a 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -196,7 +196,6 @@ static inline void btrfs_qgroup_free_delayed_ref(struct btrfs_fs_info *fs_info,
 	btrfs_qgroup_free_refroot(fs_info, ref_root, num_bytes);
 	trace_btrfs_qgroup_free_delayed_ref(fs_info, ref_root, num_bytes);
 }
-void assert_qgroups_uptodate(struct btrfs_trans_handle *trans);
 
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 int btrfs_verify_qgroup_counts(struct btrfs_fs_info *fs_info, u64 qgroupid,
diff --git a/fs/btrfs/tests/btrfs-tests.c b/fs/btrfs/tests/btrfs-tests.c
index ea272432c930..b18ab8f327a5 100644
--- a/fs/btrfs/tests/btrfs-tests.c
+++ b/fs/btrfs/tests/btrfs-tests.c
@@ -237,7 +237,6 @@ void btrfs_init_dummy_trans(struct btrfs_trans_handle *trans)
 {
 	memset(trans, 0, sizeof(*trans));
 	trans->transid = 1;
-	INIT_LIST_HEAD(&trans->qgroup_ref_list);
 	trans->type = __TRANS_DUMMY;
 }
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 61b807de3e16..9db3b4ca0264 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -572,7 +572,6 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
 	h->type = type;
 	h->can_flush_pending_bgs = true;
-	INIT_LIST_HEAD(&h->qgroup_ref_list);
 	INIT_LIST_HEAD(&h->new_bgs);
 
 	smp_mb();
@@ -917,7 +916,6 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
 			wake_up_process(info->transaction_kthread);
 		err = -EIO;
 	}
-	assert_qgroups_uptodate(trans);
 	kmem_cache_free(btrfs_trans_handle_cachep, trans);
 
 	if (must_run_delayed_refs) {
@@ -2223,7 +2221,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 
 	switch_commit_roots(cur_trans, fs_info);
 
-	assert_qgroups_uptodate(trans);
 	ASSERT(list_empty(&cur_trans->dirty_bgs));
 	ASSERT(list_empty(&cur_trans->io_bgs));
 	update_super_roots(fs_info);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 5dfb5590fff6..2e560d2abdff 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -125,8 +125,6 @@ struct btrfs_trans_handle {
 	unsigned int type;
 	struct btrfs_root *root;
 	struct btrfs_fs_info *fs_info;
-	struct seq_list delayed_ref_elem;
-	struct list_head qgroup_ref_list;
 	struct list_head new_bgs;
 };
-- 
2.12.0
Re: [PATCH 1/2] btrfs: provide enumeration for __merge_refs mode argument
On Mon, Mar 13, 2017 at 02:32:03PM -0600, ednadol...@gmail.com wrote:
> @@ -809,14 +814,12 @@ static int __add_missing_keys(struct btrfs_fs_info
> *fs_info,
>  /*
>   * merge backrefs and adjust counts accordingly
>   *
> - * mode = 1: merge identical keys, if key is set
>   *    FIXME: if we add more keys in __add_prelim_ref, we can merge more here.
>   *           additionally, we could even add a key range for the blocks we
>   *           looked into to merge even more (-> replace unresolved refs by
>   *           those having a parent).

The 'FIXME' seems to refer to mode = 1, but now that you remove it, it's
not clear what it's referring to. Mentioning MERGE_IDENTICAL_KEYS would
be good.
Re: [PATCH 2/2] btrfs: replace hardcoded value with SEQ_NONE macro
On Mon, Mar 13, 2017 at 02:32:04PM -0600, ednadol...@gmail.com wrote:
> From: Edmund Nadolski
>
> Define the SEQ_NONE macro to replace (u64)-1 in places where said
> value triggers a special-case ref search behavior.

> index 9c41fba..20915a6 100644
> --- a/fs/btrfs/backref.h
> +++ b/fs/btrfs/backref.h
> @@ -23,6 +23,8 @@
>  #include "ulist.h"
>  #include "extent_io.h"
>
> +#define SEQ_NONE ((u64)-1)

Can you please move the definition to ctree.h, near line 660, where
seq_list and SEQ_LIST_INIT are defined, so they're all grouped together?
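For context, the kind of special-case behavior the macro names can be shown with a tiny standalone sketch. Only the `SEQ_NONE` definition below matches the patch; the `seq_allows` helper is hypothetical, not btrfs code, and merely illustrates a sentinel value disabling a filter:

```c
#include <stdint.h>

typedef uint64_t u64;

/* Same definition as in the patch; (u64)-1 is the all-ones sentinel. */
#define SEQ_NONE ((u64)-1)

/*
 * Hypothetical helper: a ref search that filters entries by sequence
 * number, where SEQ_NONE means "no sequence filtering at all".
 */
static int seq_allows(u64 entry_seq, u64 seq)
{
	if (seq == SEQ_NONE)	/* special case: ignore sequence checks */
		return 1;
	return entry_seq < seq ? 1 : 0;
}
```

Naming the sentinel makes call sites self-documenting compared to a bare `(u64)-1`, which is the whole point of the patch.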
claim of vtagfs feature
Hi to everyone here! Firstly, thank you all for this amazing project!

I want to propose a "virtual TAG file system" (vtagfs) feature for BTRFS.

What is it? It is a feature to simplify using and searching data (files)
with common tags. Some tags may be derived from a file attribute, like the
year of creation/change, so a user can access the same file through
different (virtual) paths. For example, two different files under
~/vtagfs/root/:

  firm1/reports/2016/report
  firm1/reports/2017/report
  (firm1/reports/2017/.tags/report, or just .tags -- a sysfile with all
  tags for this file/dir)

...and 2017/reports/firm1/report would be a link to the second one. A
default automatic lock could also be set on editing files with an "old
year" tag, and so on.

How it could work:

  /etc/vtagfs/             - place for global configs
  ~/.vtagfs/               - user space for configs
  ~/vtagfs/                - default or fixed place for data and links
  ~/vtagfs/.tags/          - place for data used by vtagfs itself
  ~/vtagfs/root/           - (root is the fixed dir for all data operated
                             on by users' vtagfs, to be known by other
                             programs)
  ~/vtagfs/root/tag1/      - place for data with tag1 (only tag1)
  ~/vtagfs/root/tag1/.tag  - file marker that this dir is used as a tag;
                             it may be used to declare possible sets of
                             tags with this one
  ~/vtagfs/root/tag1/tag2/ - place for data with tag1 & tag2
  ...
  ~/vtagfs/tag1/tag2       - symlink shown to the user
  ~/vtagfs/tag1/tag3       - symlink shown to the user
  ~/vtagfs/tag2/tag1       - symlink shown to the user
  ~/vtagfs/tag2/tag3       - symlink shown to the user
  ~/vtagfs/tag1/tag2/tag3/ - symlink shown to the user
  ...

Surely there must be something like a tagManager to define and operate on
the tags used by the user. Or/and it could be made transparent to
user/system requests like mkdir, so `mkdir tag1` inside ~/vtagfs/ would
create tag1, and `cp file1 tag1/tag2/tag3/` would set these tags on file1
and effectively do `cp file1 ~/vtagfs/root/tag1/tag2/tag3/`, which would
be the real place of the file, known to the system.

Hope this feature will be useful; thank you all for your patience :)

Best regards,
Alexandr.
Re: [PATCH v2 0/7] cleanup __btrfs_map_block
On Tue, Mar 14, 2017 at 01:33:54PM -0700, Liu Bo wrote:
> This is attempting to make __btrfs_map_block less scary :)
>
> The major changes are
>
> 1) split operations for discard out of __btrfs_map_block, and we don't
> copy discard operations for the target device of dev replace since
> they're not as important as writes.
>
> 2) put dev replace stuff into helpers since they're basically
> self-independent.

Thanks, I'm going to add the branch to the 4.12 queue (right now the
branch is misc-next but it could change),

https://marc.info/?l=linux-btrfs&m=148741582021588

and fix that one too.
Re: Home storage with btrfs
Hérikz Nawarro posted on Mon, 13 Mar 2017 08:29:32 -0300 as excerpted:

> Today is safe to use btrfs for home storage? No raid, just secure
> storage for some files and create snapshots from it.

I'll echo the others... but with emphasis on a few caveats the others
mentioned but didn't give the emphasis I thought they deserved:

1) Btrfs is, as I repeatedly put it in post after post, "stabilizing, but
not yet fully stable and mature."

In general, that means it's likely to work quite or even very well for you
(as it has done for us) if you don't try the too unusual or get too cocky,
but get too close to the edge and you just might find yourself over that
edge.

Don't worry too much, tho, those edges are clearly marked if you're paying
attention, and just by asking here, you're already paying way more
attention than too many we see here... /after/ they've found themselves
over the edge. That's a _very_ good sign. =:^)

2) "Stabilizing, not fully stable and mature" means, even more than ever,
that if you value your data more than the time, hassle and resources
necessary to have backups, you HAVE them, tested and available for
practical use should it be necessary.

Of course any sysadmin (and that's what you are for at least your own
systems if you're making this choice) worth the name will tell you the
value of the data is really defined by the number of backups it has, not
by any arbitrary claims to value absent those backups. No backups, you
simply didn't value the data enough to have them, whatever claims of value
you might otherwise try to make. Backups, you /did/ value the data.

And of course the corollary to that first sysadmin's rule of backups is
that an untested-as-restorable backup isn't yet a backup, only a potential
backup, because the job isn't finished and it can't be properly called a
backup until you know you can restore from it if necessary.

And lest anyone get the wrong idea, a snapshot is /not/ a backup for
purposes of the above rules. It's on the same filesystem and hardware
media, and if that goes down... you've lost it just the same. And since
that filesystem is still stabilizing, you really must be even more
prepared for it to go down, even if the chances are still quite good it
won't.

3) "Stabilizing, not fully stable and mature" also means that since the
current best-practices code is still a moving target, you'd better be
prepared to move with it.

The list-recommended kernels are the latest two releases of either the
current or (mainline) LTS kernel series. On the current track, 4.10 is
out, so 4.10 and 4.9 are supported. If you're still on 4.8 or earlier and
can't point to a very specific known reason, you're behind.

On the LTS track, 4.9 is the latest LTS kernel as well, with 4.4 the one
before that. 4.1's the one before that, but that's a very long time ago in
btrfs-development time, and while we'll generally still /try/ to help,
honestly, our memory and thus our effectiveness at trying to help are
going to be down dramatically from that of the recommended series.

If you prefer longer-term "enterprise" or just Debian-stable distro
support, fine, but honestly, the sort of stability they target doesn't
have much in common with a still-stabilizing btrfs, and the chances are
/extremely/ high that either one or the other isn't a good match for your
needs. Either you want/need a more leading-edge aka current distro, which
btrfs as still stabilizing fits in well with, or you want/need the
stability of those longer-term releases, and btrfs as still very actively
stabilizing sticks out like a sore thumb and you're very likely to be
better off on something that's actually considered stable -- ext4 or xfs,
perhaps, or my longer-term stability favorite, reiserfs (which tends to be
so stable in part because there's nobody screwing with it and messing
things up any more; reference the period when the mainline kernel devs
switched the otherwise quite stable ext3 to the rather less stable
data=writeback mode, for instance).

4) Keep the number of snapshots per subvolume under tight control, as
already suggested. A few hundred, NOT a few thousand. Easy enough if you
do those snapshots manually, but easy enough to get thousands if you're
not paying attention to thin out the old ones and using an automated tool
such as snapper.

5) Stay away from quotas. Either you need the feature and thus need a more
mature filesystem where it's actually stable and does what it says on the
label, or you don't, in which case you'll save yourself a /lot/ of
headaches keeping them off. Maybe someday...

6) Stay away from raid56 mode. It has known problems ATM and is simply not
ready.

FWIW, single-device and raid1 mode are the best tested and most reliable
(within the single-device limitations for it, of course). But even raid1
mode has some caveats about rebuilding that it might be wise to
familiarize yourself with /before/ they happen, if