Re: [DOC] BTRFS Volume operations, Device Lists and Locks all in one page
On Jul 11, 2018 at 15:50, Anand Jain wrote:
>
>
> BTRFS Volume operations, Device Lists and Locks all in one page:
>
> Devices are managed in two contexts, the scan context and the mounted
> context. In the scan context the threads originate from the btrfs_control
> ioctl, and in the mounted context the threads originate from the mount
> point ioctl.
> Apart from these two contexts, there can also be two transient states
> where the device state is transitioning from the scan to the mount
> context or from the mount to the scan context.
>
> Device List and Locks:-
>
> Count: btrfs_fs_devices::num_devices
> List : btrfs_fs_devices::devices -> btrfs_devices::dev_list
> Lock : btrfs_fs_devices::device_list_mutex
>
> Count: btrfs_fs_devices::rw_devices

So btrfs_fs_devices::num_devices = btrfs_fs_devices::rw_devices + RO
devices. How are seed and RO devices different in this case?

> List : btrfs_fs_devices::alloc_list -> btrfs_devices::dev_alloc_list
> Lock : btrfs_fs_info::chunk_mutex

At least the chunk_mutex is also shared with the chunk allocator, or we
should have some mutex in btrfs_fs_devices rather than in fs_info.
Right?

> Lock: set_bit btrfs_fs_info::flags::BTRFS_FS_EXCL_OP
>
> FSID List and Lock:-
>
> Count : None
> HEAD : Global::fs_uuids -> btrfs_fs_devices::fs_list
> Lock : Global::uuid_mutex
>
> After the fs_devices is mounted, the btrfs_fs_devices::opened > 0.

fs_devices::opened should be btrfs_fs_devices::num_devices if no device
is missing, and -1 or -2 for the degraded case, right?

> In the scan context we have the following device operations..
>
> Device SCAN:- creates the btrfs_fs_devices and its corresponding
> btrfs_device entries; also checks for and frees duplicate device entries.
> Lock: uuid_mutex
> SCAN
> if (found_duplicate && btrfs_fs_devices::opened == 0)
>     Free_duplicate
> Unlock: uuid_mutex
>
> Device READY:- checks if the volume is ready. Also does an implicit scan
> and duplicate device free as in Device SCAN.
> Lock: uuid_mutex
> SCAN
> if (found_duplicate && btrfs_fs_devices::opened == 0)
>     Free_duplicate
> Check READY
> Unlock: uuid_mutex
>
> Device FORGET:- (planned) free a given or all unmounted devices and
> empty fs_devices, if any.
> Lock: uuid_mutex
> if (found_duplicate && btrfs_fs_devices::opened == 0)
>     Free duplicate
> Unlock: uuid_mutex
>
> Device mount operation -> a transient state leading to the mounted context
> Lock: uuid_mutex
> Find, SCAN, btrfs_fs_devices::opened++
> Unlock: uuid_mutex
>
> Device umount operation -> a transient state leading to the unmounted
> context or scan context
> Lock: uuid_mutex
> btrfs_fs_devices::opened--
> Unlock: uuid_mutex
>
> In the mounted context we have the following device operations..
>
> Device Rename through SCAN:- This is a special case where the device
> path gets renamed after it has been mounted. (Ubuntu changes the boot
> path during boot up, so we need this feature.) Currently this is part of
> Device SCAN as above. And we need the locks as below, because a
> dynamically disappearing device might clean up btrfs_device::name.
> Lock: btrfs_fs_devices::device_list_mutex
> Rename
> Unlock: btrfs_fs_devices::device_list_mutex
>
> Commit Transaction:- Write all supers.
> Lock: btrfs_fs_devices::device_list_mutex
> Write all supers of btrfs_devices::dev_list
> Unlock: btrfs_fs_devices::device_list_mutex
>
> Device add:- Add a new device to the existing mounted volume.
> set_bit: btrfs_fs_info::flags::BTRFS_FS_EXCL_OP
> Lock: btrfs_fs_devices::device_list_mutex
> Lock: btrfs_fs_info::chunk_mutex
> List_add btrfs_devices::dev_list
> List_add btrfs_devices::dev_alloc_list
> Unlock: btrfs_fs_info::chunk_mutex
> Unlock: btrfs_fs_devices::device_list_mutex
>
> Device remove:- Remove a device from the mounted volume.
> set_bit: btrfs_fs_info::flags::BTRFS_FS_EXCL_OP
> Lock: btrfs_fs_devices::device_list_mutex
> Lock: btrfs_fs_info::chunk_mutex
> List_del btrfs_devices::dev_list
> List_del btrfs_devices::dev_alloc_list
> Unlock: btrfs_fs_info::chunk_mutex
> Unlock: btrfs_fs_devices::device_list_mutex
>
> Device Replace:- Replace a device.
> set_bit: btrfs_fs_info::flags::BTRFS_FS_EXCL_OP
> Lock: btrfs_fs_devices::device_list_mutex
> Lock: btrfs_fs_info::chunk_mutex
> List_update btrfs_devices::dev_list

Here we still just add the new device, without deleting the existing
one, until the replace is finished.

> List_update btrfs_devices::dev_alloc_list
> Unlock: btrfs_fs_info::chunk_mutex
> Unlock: btrfs_fs_devices::device_list_mutex
>
> Sprouting:- Add a RW device to the mounted RO seed device, to make
> the mount point writable.
> The following steps are used to hold the seed and sprout fs_devices.
> (The first two steps are not necessary for the sprouting; they are there
> to ensure the seed device remains scanned, and that might change.)
> . Clone the (mounted) fs_devices, let's call it old_devices
> . Now add old_devices to fs_uuids (yeah, there
Why original mode doesn't use swap? (Original: Re: btrfs check lowmem, take 2)
On Jul 12, 2018 at 01:09, Chris Murphy wrote:
> On Tue, Jul 10, 2018 at 12:09 PM, Marc MERLIN wrote:
>> Thanks to Su and Qu, I was able to get my filesystem to a point that
>> it's mountable.
>> I then deleted loads of snapshots and I'm down to 26.
>>
>> It now looks like this:
>> gargamel:~# btrfs fi show /mnt/mnt
>> Label: 'dshelf2'  uuid: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d
>>        Total devices 1 FS bytes used 12.30TiB
>>        devid 1 size 14.55TiB used 13.81TiB path /dev/mapper/dshelf2
>>
>> gargamel:~# btrfs fi df /mnt/mnt
>> Data, single: total=13.57TiB, used=12.19TiB
>> System, DUP: total=32.00MiB, used=1.55MiB
>> Metadata, DUP: total=124.50GiB, used=115.62GiB
>> Metadata, single: total=216.00MiB, used=0.00B
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>> Problems:
>> 1) btrfs check --repair _still_ takes all 32GB of RAM and crashes the
>> server, despite my deleting lots of snapshots.
>> Is it because I have too many files then?
>
> I think original mode needs most of the metadata in memory.
>
> I'm not understanding why btrfs check won't use swap like at least
> xfs_repair does, and I'm pretty sure e2fsck will as well.

I don't understand either. Isn't memory from malloc() swappable?

Thanks,
Qu

> Using 128G swap on nvme with original check is still gonna be faster
> than lowmem mode.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v14.8 09/14] btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface
From: Wang Xiaoguang

Unlike the choice between the in-memory and on-disk dedupe backends, only
the SHA256 hash method is supported so far, so implement the
btrfs_dedupe_calc_hash() interface using SHA256.

Signed-off-by: Qu Wenruo
Signed-off-by: Wang Xiaoguang
Reviewed-by: Josef Bacik
Signed-off-by: Lu Fengqi
---
 fs/btrfs/dedupe.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index e3084deb1eb7..14c8d245480e 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -651,3 +651,53 @@ int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
 	}
 	return ret;
 }
+
+int btrfs_dedupe_calc_hash(struct btrfs_fs_info *fs_info,
+			   struct inode *inode, u64 start,
+			   struct btrfs_dedupe_hash *hash)
+{
+	int i;
+	int ret;
+	struct page *p;
+	struct shash_desc *shash;
+	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+	struct crypto_shash *tfm = dedupe_info->dedupe_driver;
+	u64 dedupe_bs;
+	u64 sectorsize = fs_info->sectorsize;
+
+	shash = kmalloc(sizeof(*shash) + crypto_shash_descsize(tfm), GFP_NOFS);
+	if (!shash)
+		return -ENOMEM;
+
+	if (!fs_info->dedupe_enabled || !hash)
+		return 0;
+
+	if (WARN_ON(dedupe_info == NULL))
+		return -EINVAL;
+
+	WARN_ON(!IS_ALIGNED(start, sectorsize));
+
+	dedupe_bs = dedupe_info->blocksize;
+
+	shash->tfm = tfm;
+	shash->flags = 0;
+	ret = crypto_shash_init(shash);
+	if (ret)
+		return ret;
+	for (i = 0; sectorsize * i < dedupe_bs; i++) {
+		char *d;
+
+		p = find_get_page(inode->i_mapping,
+				  (start >> PAGE_SHIFT) + i);
+		if (WARN_ON(!p))
+			return -ENOENT;
+		d = kmap(p);
+		ret = crypto_shash_update(shash, d, sectorsize);
+		kunmap(p);
+		put_page(p);
+		if (ret)
+			return ret;
+	}
+	ret = crypto_shash_final(shash, hash->hash);
+	return ret;
+}
--
2.18.0
[PATCH v14.8 03/14] btrfs: dedupe: Introduce dedupe framework and its header
From: Wang Xiaoguang

Introduce the header for the btrfs in-band (write time) de-duplication
framework and the needed declarations.

The new de-duplication framework is going to support 2 different dedupe
methods and 1 dedupe hash.

Signed-off-by: Qu Wenruo
Signed-off-by: Wang Xiaoguang
Signed-off-by: Lu Fengqi
---
 fs/btrfs/ctree.h           |   7 ++
 fs/btrfs/dedupe.h          | 136 ++++++++++++++++++++++++++++++++++-
 fs/btrfs/disk-io.c         |   1 +
 include/uapi/linux/btrfs.h |  34 ++++++++
 4 files changed, 176 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8743fdcfe139..ad31ccac86a3 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1136,6 +1136,13 @@ struct btrfs_fs_info {
 	spinlock_t ref_verify_lock;
 	struct rb_root block_tree;
 #endif
+
+	/*
+	 * Inband de-duplication related structures
+	 */
+	unsigned long dedupe_enabled:1;
+	struct btrfs_dedupe_info *dedupe_info;
+	struct mutex dedupe_ioctl_lock;
 };

 static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index 90281a7a35a8..681cf4717396 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -6,7 +6,139 @@
 #ifndef BTRFS_DEDUPE_H
 #define BTRFS_DEDUPE_H

-/* later in-band dedupe will expand this struct */
-struct btrfs_dedupe_hash;
+#include
+#include
+#include
+static const int btrfs_hash_sizes[] = { 32 };
+
+/*
+ * For callers outside of dedupe.c
+ *
+ * Different dedupe backends should have their own hash structure
+ */
+struct btrfs_dedupe_hash {
+	u64 bytenr;
+	u32 num_bytes;
+
+	/* last field is a variable length array of dedupe hash */
+	u8 hash[];
+};
+
+struct btrfs_dedupe_info {
+	/* dedupe blocksize */
+	u64 blocksize;
+	u16 backend;
+	u16 hash_algo;
+
+	struct crypto_shash *dedupe_driver;
+
+	/*
+	 * Use a mutex to protect both backends.
+	 * Even for the in-memory backend, the rb-tree can be quite large,
+	 * so a mutex is better for such a use case.
+	 */
+	struct mutex lock;
+
+	/* following members are only used by the in-memory backend */
+	struct rb_root hash_root;
+	struct rb_root bytenr_root;
+	struct list_head lru_list;
+	u64 limit_nr;
+	u64 current_nr;
+};
+
+struct btrfs_trans_handle;
+
+static inline int btrfs_dedupe_hash_hit(struct btrfs_dedupe_hash *hash)
+{
+	return (hash && hash->bytenr);
+}
+
+int btrfs_dedupe_hash_size(u16 algo);
+struct btrfs_dedupe_hash *btrfs_dedupe_alloc_hash(u16 algo);
+
+/*
+ * Initialize inband dedupe info.
+ * Called at dedupe enable time.
+ *
+ * Return 0 for success
+ * Return <0 for any error
+ * (from unsupported param to tree creation error for some backends)
+ */
+int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info,
+			struct btrfs_ioctl_dedupe_args *dargs);
+
+/*
+ * Disable dedupe and invalidate all its dedupe data.
+ * Called at dedupe disable time.
+ *
+ * Return 0 for success
+ * Return <0 for any error
+ * (tree operation error for some backends)
+ */
+int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info);
+
+/*
+ * Get current dedupe status.
+ * Return 0 for success
+ * No possible error yet
+ */
+void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
+			 struct btrfs_ioctl_dedupe_args *dargs);
+
+/*
+ * Calculate hash for dedupe.
+ * Caller must ensure [start, start + dedupe_bs) has valid data.
+ *
+ * Return 0 for success
+ * Return <0 for any error
+ * (error from hash codes)
+ */
+int btrfs_dedupe_calc_hash(struct btrfs_fs_info *fs_info,
+			   struct inode *inode, u64 start,
+			   struct btrfs_dedupe_hash *hash);
+
+/*
+ * Search for duplicated extents by calculated hash
+ * Caller must call btrfs_dedupe_calc_hash() first to get the hash.
+ *
+ * @inode: the inode we are writing to
+ * @file_pos: offset inside the inode
+ * As we will increase the extent ref immediately after a hash match,
+ * we need @file_pos and @inode in this case.
+ *
+ * Return > 0 for a hash match, and the extent ref will be
+ * *INCREASED*, and hash->bytenr/num_bytes will record the existing
+ * extent data.
+ * Return 0 for a hash miss. Nothing is done
+ * Return <0 for any error
+ * (tree operation error for some backends)
+ */
+int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
+			struct inode *inode, u64 file_pos,
+			struct btrfs_dedupe_hash *hash);
+
+/*
+ * Add a dedupe hash into dedupe info
+ * Return 0 for success
+ * Return <0 for any error
+ * (tree operation error for some backends)
+ */
+int btrfs_dedupe_add(struct btrfs_trans_handle *trans,
+		     struct btrfs_fs_info *fs_info,
+		     struct btrfs_dedupe_hash *hash);
+
+/*
+ * Remove a dedupe hash from dedupe info
+ * Return 0 for success
+ * Return <0 for any error
+ * (tree operation error for some
[PATCH v14.8 11/14] btrfs: dedupe: Inband in-memory only de-duplication implement
From: Qu Wenruo

Core implementation of inband de-duplication.

It reuses the async_cow_start() facility to calculate the dedupe hash,
and uses the dedupe hash to do inband de-duplication at the extent level.

The workflow is as below:
1) Run the delalloc range for an inode
2) Calculate the hash for the delalloc range at the unit of dedupe_bs
3) For the hash match (duplicated) case, just increase the source extent
   ref and insert the file extent.
   For the hash mismatch case, go through the normal cow_file_range()
   fallback, and add the hash into the dedupe tree.
   Compression for the hash-miss case is not supported yet.

The current implementation stores all dedupe hashes in an in-memory
rb-tree, with LRU behavior to enforce the limit.

Signed-off-by: Wang Xiaoguang
Signed-off-by: Qu Wenruo
Signed-off-by: Lu Fengqi
---
 fs/btrfs/ctree.h       |   4 +-
 fs/btrfs/dedupe.h      |  18 +++
 fs/btrfs/extent-tree.c |  31 ++++-
 fs/btrfs/extent_io.c   |   5 +-
 fs/btrfs/extent_io.h   |   1 +
 fs/btrfs/file.c        |   3 +
 fs/btrfs/inode.c       | 305 ++++++++++++++++++++++++++++++++++------
 fs/btrfs/ioctl.c       |   1 +
 fs/btrfs/relocation.c  |  17 +++
 9 files changed, 329 insertions(+), 56 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ad31ccac86a3..8fff17adc8d2 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -107,9 +107,11 @@ static inline u32 count_max_extents(u64 size, u64 max_extent_size)
 enum btrfs_metadata_reserve_type {
 	BTRFS_RESERVE_NORMAL,
 	BTRFS_RESERVE_COMPRESS,
+	BTRFS_RESERVE_DEDUPE,
 };

-u64 btrfs_max_extent_size(enum btrfs_metadata_reserve_type reserve_type);
+u64 btrfs_max_extent_size(struct btrfs_inode *inode,
+			  enum btrfs_metadata_reserve_type reserve_type);
 int inode_need_compress(struct inode *inode, u64 start, u64 end);

 struct btrfs_mapping_tree {
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index f19f6a8ff2ba..ebcbb89d79a0 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -9,6 +9,7 @@
 #include
 #include
 #include
+#include "btrfs_inode.h"

 static const int btrfs_hash_sizes[] = { 32 };

@@ -50,6 +51,23 @@ struct btrfs_dedupe_info {

 struct btrfs_trans_handle;

+static inline u64 btrfs_dedupe_blocksize(struct btrfs_inode *inode)
+{
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+
+	return fs_info->dedupe_info->blocksize;
+}
+
+static inline int inode_need_dedupe(struct inode *inode)
+{
+	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
+
+	if (!fs_info->dedupe_enabled)
+		return 0;
+
+	return 1;
+}
+
 static inline int btrfs_dedupe_hash_hit(struct btrfs_dedupe_hash *hash)
 {
 	return (hash && hash->bytenr);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 225ebcb1fd09..7a3a9d3fb0b9 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -28,6 +28,7 @@
 #include "sysfs.h"
 #include "qgroup.h"
 #include "ref-verify.h"
+#include "dedupe.h"

 #undef SCRAMBLE_DELAYED_REFS

@@ -2612,6 +2613,17 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
 		btrfs_pin_extent(fs_info, head->bytenr,
 				 head->num_bytes, 1);
 	if (head->is_data) {
+		/*
+		 * If insert_reserved is given, it means
+		 * a new extent is reserved, then deleted
+		 * in one transaction, and the inc/dec get merged to 0.
+		 *
+		 * In this case, we need to remove its dedupe
+		 * hash.
+		 */
+		ret = btrfs_dedupe_del(trans, fs_info, head->bytenr);
+		if (ret < 0)
+			return ret;
 		ret = btrfs_del_csums(trans, fs_info, head->bytenr,
 				      head->num_bytes);
 	}
@@ -6017,15 +6029,17 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
 	spin_unlock(&block_rsv->lock);
 }

-u64 btrfs_max_extent_size(enum btrfs_metadata_reserve_type reserve_type)
+u64 btrfs_max_extent_size(struct btrfs_inode *inode,
+			  enum btrfs_metadata_reserve_type reserve_type)
 {
 	if (reserve_type == BTRFS_RESERVE_NORMAL)
 		return BTRFS_MAX_EXTENT_SIZE;
 	else if (reserve_type == BTRFS_RESERVE_COMPRESS)
 		return SZ_128K;
-
-	ASSERT(0);
-	return BTRFS_MAX_EXTENT_SIZE;
+	else if (reserve_type == BTRFS_RESERVE_DEDUPE)
+		return btrfs_dedupe_blocksize(inode);
+	else
+		return BTRFS_MAX_EXTENT_SIZE;
 }

 int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
@@ -6036,7 +6050,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
 	enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL;
 	int ret = 0;
 	bool delalloc_lock = true;
[PATCH v14.8 14/14] btrfs: dedupe: Introduce new reconfigure ioctl
From: Qu Wenruo

Introduce a new reconfigure ioctl and a new FORCE flag for the in-band
dedupe ioctls.

Now the dedupe enable and reconfigure ioctls are stateful.

| Current state | Ioctl   | Next state  |
| Disabled      | enable  | Enabled     |
| Enabled       | enable  | Not allowed |
| Enabled       | reconf  | Enabled     |
| Enabled       | disable | Disabled    |
| Disabled      | disable | Disabled    |
| Disabled      | reconf  | Not allowed |
(While disable is always stateless.)

For those who prefer stateless ioctls (myself, for example), a new FORCE
flag is introduced. In FORCE mode, enable/disable is completely stateless.

| Current state | Ioctl   | Next state |
| Disabled      | enable  | Enabled    |
| Enabled       | enable  | Enabled    |
| Enabled       | disable | Disabled   |
| Disabled      | disable | Disabled   |

Also, the reconfigure ioctl will only modify the specified fields, while
for enable, unspecified fields will be filled with default values.
For example:
 # btrfs dedupe enable --block-size 64k /mnt
 # btrfs dedupe reconfigure --limit-hash 1m /mnt
Will lead to:
 dedupe blocksize: 64K
 dedupe hash limit nr: 1m

While for enable:
 # btrfs dedupe enable --force --block-size 64k /mnt
 # btrfs dedupe enable --force --limit-hash 1m /mnt
Will reset the blocksize to the default value:
 dedupe blocksize: 128K << reset
 dedupe hash limit nr: 1m

Suggested-by: David Sterba
Signed-off-by: Qu Wenruo
Signed-off-by: Lu Fengqi
---
 fs/btrfs/dedupe.c          | 131 ++++++++++++++++++++++++++++-------
 fs/btrfs/dedupe.h          |  13 ++++
 fs/btrfs/ioctl.c           |  13 ++++
 include/uapi/linux/btrfs.h |  11 +++-
 4 files changed, 143 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index f068321fdd1c..71b090c2938f 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -29,6 +29,40 @@ static inline struct inmem_hash *inmem_alloc_hash(u16 algo)
 			GFP_NOFS);
 }

+/*
+ * Copy from the current dedupe info to fill dargs.
+ * For the reconf case, only fill members which are uninitialized.
+ */
+static void get_dedupe_status(struct btrfs_dedupe_info *dedupe_info,
+			      struct btrfs_ioctl_dedupe_args *dargs)
+{
+	int reconf = (dargs->cmd == BTRFS_DEDUPE_CTL_RECONF);
+
+	dargs->status = 1;
+
+	if (!reconf || (reconf && dargs->blocksize == (u64)-1))
+		dargs->blocksize = dedupe_info->blocksize;
+	if (!reconf || (reconf && dargs->backend == (u16)-1))
+		dargs->backend = dedupe_info->backend;
+	if (!reconf || (reconf && dargs->hash_algo == (u16)-1))
+		dargs->hash_algo = dedupe_info->hash_algo;
+
+	/*
+	 * For the re-configure case, if the limit is not being modified,
+	 * it will be set to 0, unlike the other fields
+	 */
+	if (!reconf || !(dargs->limit_nr || dargs->limit_mem)) {
+		dargs->limit_nr = dedupe_info->limit_nr;
+		dargs->limit_mem = dedupe_info->limit_nr *
+			(sizeof(struct inmem_hash) +
+			 btrfs_hash_sizes[dedupe_info->hash_algo]);
+	}
+
+	/* current_nr doesn't make sense for the reconfigure case */
+	if (!reconf)
+		dargs->current_nr = dedupe_info->current_nr;
+}
+
 void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
 			 struct btrfs_ioctl_dedupe_args *dargs)
 {
@@ -45,15 +79,7 @@ void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
 		return;
 	}
 	mutex_lock(&dedupe_info->lock);
-	dargs->status = 1;
-	dargs->blocksize = dedupe_info->blocksize;
-	dargs->backend = dedupe_info->backend;
-	dargs->hash_algo = dedupe_info->hash_algo;
-	dargs->limit_nr = dedupe_info->limit_nr;
-	dargs->limit_mem = dedupe_info->limit_nr *
-		(sizeof(struct inmem_hash) +
-		 btrfs_hash_sizes[dedupe_info->hash_algo]);
-	dargs->current_nr = dedupe_info->current_nr;
+	get_dedupe_status(dedupe_info, dargs);
 	mutex_unlock(&dedupe_info->lock);
 	memset(dargs->__unused, -1, sizeof(dargs->__unused));
 }
@@ -102,17 +128,50 @@ static int init_dedupe_info(struct btrfs_dedupe_info **ret_info,
 static int check_dedupe_parameter(struct btrfs_fs_info *fs_info,
 				  struct btrfs_ioctl_dedupe_args *dargs)
 {
-	u64 blocksize = dargs->blocksize;
-	u64 limit_nr = dargs->limit_nr;
-	u64 limit_mem = dargs->limit_mem;
-	u16 hash_algo = dargs->hash_algo;
-	u8 backend = dargs->backend;
+	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+
+	u64 blocksize;
+	u64 limit_nr;
+	u64
[PATCH v14.8 07/14] btrfs: delayed-ref: Add support for increasing data ref under spinlock
From: Qu Wenruo

For in-band dedupe, btrfs needs to increase a data ref with the
delayed_refs lock held, so add a new function,
btrfs_add_delayed_data_ref_locked(), to increase an extent ref with
delayed_refs already locked.

Export init_delayed_ref_head and init_delayed_ref_common for inband
dedupe.

Signed-off-by: Qu Wenruo
Reviewed-by: Josef Bacik
Signed-off-by: Lu Fengqi
---
 fs/btrfs/delayed-ref.c | 49 ++++++++++++++++++++++++++++--------------
 fs/btrfs/delayed-ref.h | 16 ++++++++++++++
 2 files changed, 51 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 03dec673d12a..10de8011ada7 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -526,7 +526,7 @@ update_existing_head_ref(struct btrfs_delayed_ref_root *delayed_refs,
 	spin_unlock(&existing->lock);
 }

-static void init_delayed_ref_head(struct btrfs_delayed_ref_head *head_ref,
+void btrfs_init_delayed_ref_head(struct btrfs_delayed_ref_head *head_ref,
 			struct btrfs_qgroup_extent_record *qrecord,
 			u64 bytenr, u64 num_bytes, u64 ref_root,
 			u64 reserved, int action, bool is_data,
@@ -654,7 +654,7 @@ add_delayed_ref_head(struct btrfs_trans_handle *trans,
 }

 /*
- * init_delayed_ref_common - Initialize the structure which represents a
+ * btrfs_init_delayed_ref_common - Initialize the structure which represents a
  *			     modification to an extent.
  *
  * @fs_info:    Internal to the mounted filesystem mount structure.
@@ -678,7 +678,7 @@ add_delayed_ref_head(struct btrfs_trans_handle *trans,
  *	     when recording a metadata extent or BTRFS_SHARED_DATA_REF_KEY/
  *	     BTRFS_EXTENT_DATA_REF_KEY when recording data extent
  */
-static void init_delayed_ref_common(struct btrfs_fs_info *fs_info,
+void btrfs_init_delayed_ref_common(struct btrfs_fs_info *fs_info,
 				struct btrfs_delayed_ref_node *ref,
 				u64 bytenr, u64 num_bytes, u64 ref_root,
 				int action, u8 ref_type)
@@ -734,7 +734,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
 		ref_type = BTRFS_SHARED_BLOCK_REF_KEY;
 	else
 		ref_type = BTRFS_TREE_BLOCK_REF_KEY;
-	init_delayed_ref_common(fs_info, &ref->node, bytenr, num_bytes,
+	btrfs_init_delayed_ref_common(fs_info, &ref->node, bytenr, num_bytes,
 				ref_root, action, ref_type);
 	ref->root = ref_root;
 	ref->parent = parent;
@@ -751,7 +751,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
 		goto free_head_ref;
 	}

-	init_delayed_ref_head(head_ref, record, bytenr, num_bytes,
+	btrfs_init_delayed_ref_head(head_ref, record, bytenr, num_bytes,
 			      ref_root, 0, action, false, is_system);
 	head_ref->extent_op = extent_op;

@@ -788,6 +788,29 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
 	return -ENOMEM;
 }

+/*
+ * Do the real delayed data ref insert.
+ * Caller must hold delayed_refs->lock and have allocated memory
+ * for dref, head_ref and record.
+ */
+int btrfs_add_delayed_data_ref_locked(struct btrfs_trans_handle *trans,
+		struct btrfs_delayed_ref_head *head_ref,
+		struct btrfs_qgroup_extent_record *qrecord,
+		struct btrfs_delayed_data_ref *ref, int action,
+		int *qrecord_inserted_ret, int *old_ref_mod,
+		int *new_ref_mod)
+{
+	struct btrfs_delayed_ref_root *delayed_refs;
+
+	head_ref = add_delayed_ref_head(trans, head_ref, qrecord,
+					action, qrecord_inserted_ret,
+					old_ref_mod, new_ref_mod);
+
+	delayed_refs = &trans->transaction->delayed_refs;
+
+	return insert_delayed_ref(trans, delayed_refs, head_ref, &ref->node);
+}
+
 /*
  * add a delayed data ref. it's similar to btrfs_add_delayed_tree_ref.
  */
@@ -814,7 +837,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
 		ref_type = BTRFS_SHARED_DATA_REF_KEY;
 	else
 		ref_type = BTRFS_EXTENT_DATA_REF_KEY;
-	init_delayed_ref_common(fs_info, &ref->node, bytenr, num_bytes,
+	btrfs_init_delayed_ref_common(fs_info, &ref->node, bytenr, num_bytes,
 				ref_root, action, ref_type);
 	ref->root = ref_root;
 	ref->parent = parent;
@@ -839,8 +862,8 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
 		}
 	}

-	init_delayed_ref_head(head_ref, record, bytenr, num_bytes, ref_root,
-			      reserved, action, true, false);
+	btrfs_init_delayed_ref_head(head_ref, record, bytenr, num_bytes,
+				    ref_root, reserved, action, true, false);
[PATCH v14.8 13/14] btrfs: relocation: Enhance error handling to avoid BUG_ON
From: Qu Wenruo

Since the introduction of the btrfs dedupe tree, it's possible that
balance can race with dedupe disabling.

When this happens, dedupe_enabled will make btrfs_get_fs_root() return
PTR_ERR(-ENOENT).
But due to a bug in the error handling branch, when this happens
backref_cache->nr_nodes is increased but the node is neither added to
the backref_cache nor is nr_nodes decreased, causing a BUG_ON() in
backref_cache_cleanup():

[ 2611.668810] ------------[ cut here ]------------
[ 2611.669946] kernel BUG at /home/sat/ktest/linux/fs/btrfs/relocation.c:243!
[ 2611.670572] invalid opcode: [#1] SMP
[ 2611.686797] Call Trace:
[ 2611.687034]  [] btrfs_relocate_block_group+0x1b3/0x290 [btrfs]
[ 2611.687706]  [] btrfs_relocate_chunk.isra.40+0x47/0xd0 [btrfs]
[ 2611.688385]  [] btrfs_balance+0xb22/0x11e0 [btrfs]
[ 2611.688966]  [] btrfs_ioctl_balance+0x391/0x3a0 [btrfs]
[ 2611.689587]  [] btrfs_ioctl+0x1650/0x2290 [btrfs]
[ 2611.690145]  [] ? lru_cache_add+0x3a/0x80
[ 2611.690647]  [] ? lru_cache_add_active_or_unevictable+0x4c/0xc0
[ 2611.691310]  [] ? handle_mm_fault+0xcd4/0x17f0
[ 2611.691842]  [] ? cp_new_stat+0x153/0x180
[ 2611.692342]  [] ? __vma_link_rb+0xfd/0x110
[ 2611.692842]  [] ? vma_link+0xb9/0xc0
[ 2611.693303]  [] do_vfs_ioctl+0xa1/0x5a0
[ 2611.693781]  [] ? __do_page_fault+0x1b4/0x400
[ 2611.694310]  [] SyS_ioctl+0x41/0x70
[ 2611.694758]  [] entry_SYSCALL_64_fastpath+0x12/0x71
[ 2611.695331] Code: ff 48 8b 45 bf 49 83 af a8 05 00 00 01 49 89 87 a0 05
00 00 e9 2e fd ff ff b8 f4 ff ff ff e9 e4 fb ff ff 0f 0b 0f 0b 0f 0b 0f
0b <0f> 0b 0f 0b 41 89 c6 e9 b8 fb ff ff e8 9e a6 e8 e0 4c 89 e7 44
[ 2611.697870] RIP  [] relocate_block_group+0x741/0x7a0 [btrfs]
[ 2611.698818]  RSP

This patch calls remove_backref_node() in the error handling branch, and
catches the returned -ENOENT in relocate_tree_block() so balancing can
continue.

Reported-by: Satoru Takeuchi
Signed-off-by: Qu Wenruo
---
 fs/btrfs/relocation.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 3841cddef6ab..573ab5a04be5 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -885,6 +885,13 @@ struct backref_node *build_backref_tree(struct reloc_control *rc,
 		root = read_fs_root(rc->extent_root->fs_info, key.offset);
 		if (IS_ERR(root)) {
 			err = PTR_ERR(root);
+			/*
+			 * Don't forget to clean up the current node.
+			 * It may not have been added to backref_cache but
+			 * nr_nodes was increased.
+			 * This will cause BUG_ON() in backref_cache_cleanup().
+			 */
+			remove_backref_node(&rc->backref_cache, cur);
 			goto out;
 		}

@@ -3058,14 +3065,21 @@ int relocate_tree_blocks(struct btrfs_trans_handle *trans,
 	}

-	rb_node = rb_first(blocks);
-	while (rb_node) {
+	for (rb_node = rb_first(blocks); rb_node; rb_node = rb_next(rb_node)) {
 		block = rb_entry(rb_node, struct tree_block, rb_node);

 		node = build_backref_tree(rc, &block->key,
 					  block->level, block->bytenr);
 		if (IS_ERR(node)) {
+			/*
+			 * The root (currently only the dedupe tree) of the
+			 * tree block is going to be freed and can't be
+			 * reached. Just skip it and continue balancing.
+			 */
+			if (PTR_ERR(node) == -ENOENT)
+				continue;
 			err = PTR_ERR(node);
-			goto out;
+			break;
 		}

 		ret = relocate_tree_block(trans, rc, node, &block->key,
@@ -3073,11 +3087,9 @@ int relocate_tree_blocks(struct btrfs_trans_handle *trans,
 		if (ret < 0) {
 			if (ret != -EAGAIN || rb_node == rb_first(blocks))
 				err = ret;
-			goto out;
+			break;
 		}
-		rb_node = rb_next(rb_node);
 	}
-out:
 	err = finish_pending_nodes(trans, rc, path, err);

 out_free_path:
--
2.18.0
[PATCH v14.8 04/14] btrfs: dedupe: Introduce function to initialize dedupe info
From: Wang Xiaoguang Add generic function to initialize dedupe info. Signed-off-by: Qu Wenruo Signed-off-by: Wang Xiaoguang Reviewed-by: Josef Bacik Signed-off-by: Lu Fengqi --- fs/btrfs/Makefile | 2 +- fs/btrfs/dedupe.c | 174 + fs/btrfs/dedupe.h | 13 ++- include/uapi/linux/btrfs.h | 4 +- 4 files changed, 189 insertions(+), 4 deletions(-) create mode 100644 fs/btrfs/dedupe.c diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile index ca693dd554e9..78fdc87dba39 100644 --- a/fs/btrfs/Makefile +++ b/fs/btrfs/Makefile @@ -10,7 +10,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \ export.o tree-log.o free-space-cache.o zlib.o lzo.o zstd.o \ compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \ reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \ - uuid-tree.o props.o free-space-tree.o tree-checker.o + uuid-tree.o props.o free-space-tree.o tree-checker.o dedupe.o btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c new file mode 100644 index ..23b9cd8ae3ff --- /dev/null +++ b/fs/btrfs/dedupe.c @@ -0,0 +1,174 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2016 Fujitsu. All rights reserved. 
+ */ + +#include "ctree.h" +#include "dedupe.h" +#include "btrfs_inode.h" +#include "transaction.h" +#include "delayed-ref.h" + +struct inmem_hash { + struct rb_node hash_node; + struct rb_node bytenr_node; + struct list_head lru_list; + + u64 bytenr; + u32 num_bytes; + + u8 hash[]; +}; + +static int init_dedupe_info(struct btrfs_dedupe_info **ret_info, + struct btrfs_ioctl_dedupe_args *dargs) +{ + struct btrfs_dedupe_info *dedupe_info; + + dedupe_info = kzalloc(sizeof(*dedupe_info), GFP_NOFS); + if (!dedupe_info) + return -ENOMEM; + + dedupe_info->hash_algo = dargs->hash_algo; + dedupe_info->backend = dargs->backend; + dedupe_info->blocksize = dargs->blocksize; + dedupe_info->limit_nr = dargs->limit_nr; + + /* only support SHA256 yet */ + dedupe_info->dedupe_driver = crypto_alloc_shash("sha256", 0, 0); + if (IS_ERR(dedupe_info->dedupe_driver)) { + int ret; + + ret = PTR_ERR(dedupe_info->dedupe_driver); + kfree(dedupe_info); + return ret; + } + + dedupe_info->hash_root = RB_ROOT; + dedupe_info->bytenr_root = RB_ROOT; + dedupe_info->current_nr = 0; + INIT_LIST_HEAD(_info->lru_list); + mutex_init(_info->lock); + + *ret_info = dedupe_info; + return 0; +} + +/* + * Helper to check if parameters are valid. + * The first invalid field will be set to (-1), to info user which parameter + * is invalid. + * Except dargs->limit_nr or dargs->limit_mem, in that case, 0 will returned + * to info user, since user can specify any value to limit, except 0. + */ +static int check_dedupe_parameter(struct btrfs_fs_info *fs_info, + struct btrfs_ioctl_dedupe_args *dargs) +{ + u64 blocksize = dargs->blocksize; + u64 limit_nr = dargs->limit_nr; + u64 limit_mem = dargs->limit_mem; + u16 hash_algo = dargs->hash_algo; + u8 backend = dargs->backend; + + /* +* Set all reserved fields to -1, allow user to detect +* unsupported optional parameters. 
+*/ + memset(dargs->__unused, -1, sizeof(dargs->__unused)); + if (blocksize > BTRFS_DEDUPE_BLOCKSIZE_MAX || + blocksize < BTRFS_DEDUPE_BLOCKSIZE_MIN || + blocksize < fs_info->sectorsize || + !is_power_of_2(blocksize) || + blocksize < PAGE_SIZE) { + dargs->blocksize = (u64)-1; + return -EINVAL; + } + if (hash_algo >= ARRAY_SIZE(btrfs_hash_sizes)) { + dargs->hash_algo = (u16)-1; + return -EINVAL; + } + if (backend >= BTRFS_DEDUPE_BACKEND_COUNT) { + dargs->backend = (u8)-1; + return -EINVAL; + } + + /* Backend specific check */ + if (backend == BTRFS_DEDUPE_BACKEND_INMEMORY) { + /* only one limit is accepted for enable*/ + if (dargs->limit_nr && dargs->limit_mem) { + dargs->limit_nr = 0; + dargs->limit_mem = 0; + return -EINVAL; + } + + if (!limit_nr && !limit_mem) + dargs->limit_nr = BTRFS_DEDUPE_LIMIT_NR_DEFAULT; + else { + u64 tmp = (u64)-1; + + if (limit_mem) { + tmp = div_u64(limit_mem, + (sizeof(struct inmem_hash)) + +
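The blocksize checks in check_dedupe_parameter() above can be modeled in userspace. This is a minimal sketch, not the kernel code: the MIN/MAX constants here are illustrative assumptions (the changelog only mentions the 8 MiB upper limit being introduced in v3; the real values live in the series' dedupe.h), and `sectorsize`/`page_size` are passed in explicitly instead of coming from `fs_info` and `PAGE_SIZE`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative assumptions -- the series defines the real
 * BTRFS_DEDUPE_BLOCKSIZE_MIN/MAX in fs/btrfs/dedupe.h. */
#define DEDUPE_BLOCKSIZE_MIN (16ULL * 1024)
#define DEDUPE_BLOCKSIZE_MAX (8ULL * 1024 * 1024)

static bool is_power_of_2(uint64_t n)
{
	return n && (n & (n - 1)) == 0;
}

/* Mirrors the blocksize validation in check_dedupe_parameter():
 * the blocksize must lie within [MIN, MAX], be a power of two,
 * and be no smaller than the fs sectorsize or the page size. */
static bool blocksize_valid(uint64_t blocksize, uint64_t sectorsize,
			    uint64_t page_size)
{
	return blocksize <= DEDUPE_BLOCKSIZE_MAX &&
	       blocksize >= DEDUPE_BLOCKSIZE_MIN &&
	       blocksize >= sectorsize &&
	       blocksize >= page_size &&
	       is_power_of_2(blocksize);
}
```

On a reject, the kernel sets the offending field to (u64)-1 so the caller can tell which parameter was bad; this sketch only returns the verdict.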
[PATCH v14.8 12/14] btrfs: dedupe: Add ioctl for inband deduplication
From: Wang Xiaoguang Add ioctl interface for inband deduplication, which includes: 1) enable 2) disable 3) status And a pseudo RO compat flag, to imply that btrfs now supports inband dedup. However we don't add any ondisk format change, it's just a pseudo RO compat flag. All these ioctl interfaces are state-less, which means caller don't need to bother previous dedupe state before calling them, and only need to care the final desired state. For example, if user want to enable dedupe with specified block size and limit, just fill the ioctl structure and call enable ioctl. No need to check if dedupe is already running. These ioctls will handle things like re-configure or disable quite well. Also, for invalid parameters, enable ioctl interface will set the field of the first encountered invalid parameter to (-1) to inform caller. While for limit_nr/limit_mem, the value will be (0). Signed-off-by: Qu Wenruo Signed-off-by: Wang Xiaoguang Signed-off-by: Lu Fengqi --- fs/btrfs/dedupe.c | 50 fs/btrfs/dedupe.h | 17 +++--- fs/btrfs/disk-io.c | 3 ++ fs/btrfs/ioctl.c | 67 ++ fs/btrfs/sysfs.c | 2 ++ include/uapi/linux/btrfs.h | 12 ++- 6 files changed, 145 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c index 14c8d245480e..f068321fdd1c 100644 --- a/fs/btrfs/dedupe.c +++ b/fs/btrfs/dedupe.c @@ -29,6 +29,35 @@ static inline struct inmem_hash *inmem_alloc_hash(u16 algo) GFP_NOFS); } +void btrfs_dedupe_status(struct btrfs_fs_info *fs_info, +struct btrfs_ioctl_dedupe_args *dargs) +{ + struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info; + + if (!fs_info->dedupe_enabled || !dedupe_info) { + dargs->status = 0; + dargs->blocksize = 0; + dargs->backend = 0; + dargs->hash_algo = 0; + dargs->limit_nr = 0; + dargs->current_nr = 0; + memset(dargs->__unused, -1, sizeof(dargs->__unused)); + return; + } + mutex_lock(_info->lock); + dargs->status = 1; + dargs->blocksize = dedupe_info->blocksize; + dargs->backend = dedupe_info->backend; + dargs->hash_algo 
= dedupe_info->hash_algo; + dargs->limit_nr = dedupe_info->limit_nr; + dargs->limit_mem = dedupe_info->limit_nr * + (sizeof(struct inmem_hash) + +btrfs_hash_sizes[dedupe_info->hash_algo]); + dargs->current_nr = dedupe_info->current_nr; + mutex_unlock(_info->lock); + memset(dargs->__unused, -1, sizeof(dargs->__unused)); +} + static int init_dedupe_info(struct btrfs_dedupe_info **ret_info, struct btrfs_ioctl_dedupe_args *dargs) { @@ -409,6 +438,27 @@ static void unblock_all_writers(struct btrfs_fs_info *fs_info) percpu_up_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1); } +int btrfs_dedupe_cleanup(struct btrfs_fs_info *fs_info) +{ + struct btrfs_dedupe_info *dedupe_info; + + fs_info->dedupe_enabled = 0; + /* same as disable */ + smp_wmb(); + dedupe_info = fs_info->dedupe_info; + fs_info->dedupe_info = NULL; + + if (!dedupe_info) + return 0; + + if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY) + inmem_destroy(dedupe_info); + + crypto_free_shash(dedupe_info->dedupe_driver); + kfree(dedupe_info); + return 0; +} + int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info) { struct btrfs_dedupe_info *dedupe_info; diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h index ebcbb89d79a0..85a87093ab04 100644 --- a/fs/btrfs/dedupe.h +++ b/fs/btrfs/dedupe.h @@ -96,6 +96,15 @@ static inline struct btrfs_dedupe_hash *btrfs_dedupe_alloc_hash(u16 algo) int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, struct btrfs_ioctl_dedupe_args *dargs); + +/* + * Get inband dedupe info + * Since it needs to access different backends' hash size, which + * is not exported, we need such simple function. + */ +void btrfs_dedupe_status(struct btrfs_fs_info *fs_info, +struct btrfs_ioctl_dedupe_args *dargs); + /* * Disable dedupe and invalidate all its dedupe data. * Called at dedupe disable time. @@ -107,12 +116,10 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info); /* - * Get current dedupe status. 
- * Return 0 for success - * No possible error yet + * Cleanup current btrfs_dedupe_info + * Called in umount time */ -void btrfs_dedupe_status(struct btrfs_fs_info *fs_info, -struct btrfs_ioctl_dedupe_args *dargs); +int btrfs_dedupe_cleanup(struct btrfs_fs_info *fs_info); /* * Calculate hash for dedupe. diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index
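The status ioctl above reports `limit_mem` derived from `limit_nr`, while the enable path (patch 04) goes the other way with div_u64 when the user supplies a memory limit. A hedged sketch of that two-way conversion — the fixed-part size of `struct inmem_hash` used here is an assumed placeholder, since the real size depends on the kernel's rb_node/list_head layout:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed per-entry cost: sizeof(struct inmem_hash) + hash length.
 * 64 bytes for the fixed part is illustrative only; SHA-256 is the
 * sole hash algorithm the series supports so far. */
#define INMEM_HASH_FIXED 64ULL
#define SHA256_LEN       32ULL

/* btrfs_dedupe_status(): limit_nr -> reported limit_mem. */
static uint64_t limit_nr_to_mem(uint64_t limit_nr)
{
	return limit_nr * (INMEM_HASH_FIXED + SHA256_LEN);
}

/* init_dedupe_info(): user-supplied limit_mem -> limit_nr
 * (div_u64 in the kernel, since only one limit may be given). */
static uint64_t limit_mem_to_nr(uint64_t limit_mem)
{
	return limit_mem / (INMEM_HASH_FIXED + SHA256_LEN);
}
```

Because enable accepts exactly one of the two limits and status always reports both, the pair of conversions keeps the reported numbers consistent regardless of which form the user configured.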
[PATCH v14.8 00/14] Btrfs In-band De-duplication
This patchset can be fetched from github: https://github.com/littleroad/linux.git dedupe_latest This is just a normal rebase update. Now the new base is v4.18-rc4 Normal test cases from auto group exposes no regression, and ib-dedupe group can pass without problem. xfstests ib-dedupe group can be fetched from github: https://github.com/littleroad/xfstests-dev.git btrfs_dedupe_latest Changelog: v2: Totally reworked to handle multiple backends v3: Fix a stupid but deadly on-disk backend bug Add handle for multiple hash on same bytenr corner case to fix abort trans error Increase dedup rate by enhancing delayed ref handler for both backend. Move dedup_add() to run_delayed_ref() time, to fix abort trans error. Increase dedup block size up limit to 8M. v4: Add dedup prop for disabling dedup for given files/dirs. Merge inmem_search() and ondisk_search() into generic_search() to save some code Fix another delayed_ref related bug. Use the same mutex for both inmem and ondisk backend. Move dedup_add() back to btrfs_finish_ordered_io() to increase dedup rate. v5: Reuse compress routine for much simpler dedup function. Slightly improved performance due to above modification. Fix race between dedup enable/disable Fix for false ENOSPC report v6: Further enable/disable race window fix. Minor format change according to checkpatch. v7: Fix one concurrency bug with balance. Slightly modify return value from -EINVAL to -EOPNOTSUPP for btrfs_dedup_ioctl() to allow progs to distinguish unsupported commands and wrong parameter. Rebased to integration-4.6. v8: Rename 'dedup' to 'dedupe'. Add support to allow dedupe and compression work at the same time. Fix several balance related bugs. Special thanks to Satoru Takeuchi, who exposed most of them. Small dedupe hit case performance improvement. v9: Re-order the patchset to completely separate pure in-memory and any on-disk format change. Fold bug fixes into its original patch. v10: Adding back missing bug fix patch. 
Reduce on-disk item size. Hide dedupe ioctl under CONFIG_BTRFS_DEBUG. v11: Remove other backend and props support to focus on the framework and in-memory backend. Suggested by David. Better disable and buffered write race protection. Comprehensive fix to dedupe metadata ENOSPC problem. v12: Stateful 'enable' ioctl and new 'reconf' ioctl New FORCE flag for enable ioctl to allow stateless ioctl Precise error report and extendable ioctl structure. v12.1 Rebase to David's for-next-20160704 branch Add co-ordinate patch for subpage and dedupe patchset. v12.2 Rebase to David's for-next-20160715 branch Add co-ordinate patch for other patchset. v13 Rebase to David's for-next-20160906 branch Fix a reserved space leak bug, which only frees quota reserved space but not space_info->byte_may_use. v13.1 Rebase to Chris' for-linux-4.9 branch v14 Use generic ENOSPC fix for both compression and dedupe. v14.1 Further split ENOSPC fix. v14.2 Rebase to v4.11-rc2. Co-operate with count_max_extent() to calculate num_extents. No longer rely on qgroup fixes. v14.3 Rebase to v4.12-rc1. v14.4 Rebase to kdave/for-4.13-part1. v14.5 Rebase to v4.15-rc3. v14.6 Rebase to v4.17-rc5. v14.7 Replace SHASH_DESC_ON_STACK with kmalloc to remove VLA. Fixed the following errors by switching to div_u64. ├── arm-allmodconfig │ └── ERROR:__aeabi_uldivmod-fs-btrfs-btrfs.ko-undefined └── i386-allmodconfig └── ERROR:__udivdi3-fs-btrfs-btrfs.ko-undefined v14.8 Rebase to v4.18-rc4. 
Qu Wenruo (4): btrfs: delayed-ref: Add support for increasing data ref under spinlock btrfs: dedupe: Inband in-memory only de-duplication implement btrfs: relocation: Enhance error handling to avoid BUG_ON btrfs: dedupe: Introduce new reconfigure ioctl Wang Xiaoguang (10): btrfs: introduce type based delalloc metadata reserve btrfs: Introduce COMPRESS reserve type to fix false enospc for compression btrfs: dedupe: Introduce dedupe framework and its header btrfs: dedupe: Introduce function to initialize dedupe info btrfs: dedupe: Introduce function to add hash into in-memory tree btrfs: dedupe: Introduce function to remove hash from in-memory tree btrfs: dedupe: Introduce function to search for an existing hash btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface btrfs: ordered-extent: Add support for dedupe btrfs: dedupe: Add ioctl for inband deduplication fs/btrfs/Makefile| 2 +- fs/btrfs/ctree.h | 54 ++- fs/btrfs/dedupe.c| 836 +++ fs/btrfs/dedupe.h| 183 +++- fs/btrfs/delayed-ref.c | 49 +- fs/btrfs/delayed-ref.h | 16 + fs/btrfs/disk-io.c | 4 + fs/btrfs/extent-tree.c | 69 ++- fs/btrfs/extent_io.c | 8 +- fs/btrfs/extent_io.h | 2 + fs/btrfs/file.c | 36 +- fs/btrfs/free-space-cache.c | 6 +-
[PATCH v14.8 02/14] btrfs: Introduce COMPRESS reserve type to fix false enospc for compression
From: Wang Xiaoguang When testing btrfs compression, sometimes we got ENOSPC error, though fs still has much free space, xfstests generic/171, generic/172, generic/173, generic/174, generic/175 can reveal this bug in my test environment when compression is enabled. After some debugging work, we found that it's btrfs_delalloc_reserve_metadata() which sometimes tries to reserve too much metadata space, even for very small data range. In btrfs_delalloc_reserve_metadata(), the number of metadata bytes to reserve is calculated by the difference between outstanding extents and reserved extents. But due to bad designed drop_outstanding_extent() function, it can make the difference too big, and cause problem. The problem happens in the following flow with compression enabled. 1) Buffered write 128M data with 128K blocksize outstanding_extents = 1 reserved_extents = 1024 (128M / 128K, one blocksize will get one reserved_extent) Note: it's btrfs_merge_extent_hook() to merge outstanding extents. But reserved extents are still 1024. 2) Allocate extents for dirty range cow_file_range_async() split above large extent into small 128K extents. Let's assume 2 compressed extents have been split. So we have: outstanding_extents = 3 reserved_extents = 1024 range [0, 256K) has extents allocated 3) One ordered extent get finished btrfs_finish_ordered_io() |- btrfs_delalloc_release_metadata() |- drop_outstanding_extent() drop_outstanding_extent() will free *ALL* redundant reserved extents. So we have: outstanding_extents = 2 (One has finished) reserved_extents = 2 4) Continue allocating extents for dirty range cow_file_range_async() continue handling the remaining range. When the whole 128M range is done and assume no more ordered extents have finished. outstanding_extents = 1023 (One has finished in Step 3) reserved_extents = 2 (*ALL* freed in Step 3) 5) Another buffered write happens to the file btrfs_delalloc_reserve_metadata() will calculate metadata space. 
The calculation is: meta_to_reserve = (outstanding_extents - reserved_extents) * \ nodesize * max_tree_level(8) * 2 If nodesize is 16K, it's 1021 * 16K * 8 * 2, near 256M. If nodesize is 64K, it's about 1G. That's totally insane. The fix is to introduce new reserve type, COMPRESSION, to info outstanding extents calculation algorithm, to get correct outstanding_extents based extent size. So in Step 1), outstanding_extents = 1024 reserved_extents = 1024 Step 2): outstanding_extents = 1024 reserved_extents = 1024 Step 3): outstanding_extents = 1023 reserved_extents = 1023 And in Step 5) we reserve correct amount of metadata space. Signed-off-by: Wang Xiaoguang Signed-off-by: Qu Wenruo Signed-off-by: Lu Fengqi --- fs/btrfs/ctree.h | 2 ++ fs/btrfs/extent-tree.c | 2 ++ fs/btrfs/extent_io.c | 7 ++-- fs/btrfs/extent_io.h | 1 + fs/btrfs/file.c| 3 ++ fs/btrfs/inode.c | 81 +++--- fs/btrfs/ioctl.c | 2 ++ fs/btrfs/relocation.c | 3 ++ 8 files changed, 86 insertions(+), 15 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index f906aab71116..8743fdcfe139 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -106,9 +106,11 @@ static inline u32 count_max_extents(u64 size, u64 max_extent_size) */ enum btrfs_metadata_reserve_type { BTRFS_RESERVE_NORMAL, + BTRFS_RESERVE_COMPRESS, }; u64 btrfs_max_extent_size(enum btrfs_metadata_reserve_type reserve_type); +int inode_need_compress(struct inode *inode, u64 start, u64 end); struct btrfs_mapping_tree { struct extent_map_tree map_tree; diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 8e7ad123aa95..225ebcb1fd09 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -6021,6 +6021,8 @@ u64 btrfs_max_extent_size(enum btrfs_metadata_reserve_type reserve_type) { if (reserve_type == BTRFS_RESERVE_NORMAL) return BTRFS_MAX_EXTENT_SIZE; + else if (reserve_type == BTRFS_RESERVE_COMPRESS) + return SZ_128K; ASSERT(0); return BTRFS_MAX_EXTENT_SIZE; diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c 
index e55843f536bc..25d1c302dd47 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -596,7 +596,7 @@ int __clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, btrfs_debug_check_extent_io_range(tree, start, end); if (bits & EXTENT_DELALLOC) - bits |= EXTENT_NORESERVE; + bits |= EXTENT_NORESERVE | EXTENT_COMPRESS; if (delete) bits |= ~EXTENT_CTLBITS; @@ -1489,6 +1489,7 @@ static noinline u64 find_delalloc_range(struct extent_io_tree *tree, u64 cur_start = *start; u64 found = 0; u64 total_bytes = 0; + unsigned int pre_state;
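The over-reservation arithmetic from the commit message above is easy to check numerically. A small sketch reproducing the formula exactly as stated (nodesize times max tree level 8, times 2), using the step-5 numbers where the outstanding/reserved difference has grown to 1021 extents:

```c
#include <assert.h>
#include <stdint.h>

/* meta_to_reserve = (outstanding_extents - reserved_extents) *
 *                   nodesize * max_tree_level(8) * 2
 * as quoted in the commit message of this patch. */
static uint64_t meta_to_reserve(uint64_t outstanding, uint64_t reserved,
				uint64_t nodesize)
{
	return (outstanding - reserved) * nodesize * 8 * 2;
}
```

With a 16 KiB nodesize this yields 1021 * 16K * 16 = 267,649,024 bytes (just over 255 MiB, the "near 256M" in the message), and with a 64 KiB nodesize roughly 1 GiB — for a single small buffered write, which is what triggers the false ENOSPC.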
[PATCH v14.8 10/14] btrfs: ordered-extent: Add support for dedupe
From: Wang Xiaoguang Add ordered-extent support for dedupe. Note, current ordered-extent support only supports non-compressed source extent. Support for compressed source extent will be added later. Signed-off-by: Qu Wenruo Signed-off-by: Wang Xiaoguang Reviewed-by: Josef Bacik --- fs/btrfs/ordered-data.c | 46 + fs/btrfs/ordered-data.h | 13 2 files changed, 55 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c index 78cdf572ca9c..520d384d1923 100644 --- a/fs/btrfs/ordered-data.c +++ b/fs/btrfs/ordered-data.c @@ -13,6 +13,7 @@ #include "extent_io.h" #include "disk-io.h" #include "compression.h" +#include "dedupe.h" static struct kmem_cache *btrfs_ordered_extent_cache; @@ -171,7 +172,8 @@ static inline struct rb_node *tree_search(struct btrfs_ordered_inode_tree *tree, */ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, u64 start, u64 len, u64 disk_len, - int type, int dio, int compress_type) + int type, int dio, int compress_type, + struct btrfs_dedupe_hash *hash) { struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); struct btrfs_root *root = BTRFS_I(inode)->root; @@ -192,6 +194,33 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, entry->inode = igrab(inode); entry->compress_type = compress_type; entry->truncated_len = (u64)-1; + entry->hash = NULL; + /* +* A hash hit means we have already incremented the extents delayed +* ref. +* We must handle this even if another process is trying to +* turn off dedupe, otherwise we will leak a reference. 
+*/ + if (hash && (hash->bytenr || root->fs_info->dedupe_enabled)) { + struct btrfs_dedupe_info *dedupe_info; + + dedupe_info = root->fs_info->dedupe_info; + if (WARN_ON(dedupe_info == NULL)) { + kmem_cache_free(btrfs_ordered_extent_cache, + entry); + return -EINVAL; + } + entry->hash = btrfs_dedupe_alloc_hash(dedupe_info->hash_algo); + if (!entry->hash) { + kmem_cache_free(btrfs_ordered_extent_cache, entry); + return -ENOMEM; + } + entry->hash->bytenr = hash->bytenr; + entry->hash->num_bytes = hash->num_bytes; + memcpy(entry->hash->hash, hash->hash, + btrfs_hash_sizes[dedupe_info->hash_algo]); + } + if (type != BTRFS_ORDERED_IO_DONE && type != BTRFS_ORDERED_COMPLETE) set_bit(type, >flags); @@ -246,15 +275,23 @@ int btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, { return __btrfs_add_ordered_extent(inode, file_offset, start, len, disk_len, type, 0, - BTRFS_COMPRESS_NONE); + BTRFS_COMPRESS_NONE, NULL); } +int btrfs_add_ordered_extent_dedupe(struct inode *inode, u64 file_offset, + u64 start, u64 len, u64 disk_len, int type, + struct btrfs_dedupe_hash *hash) +{ + return __btrfs_add_ordered_extent(inode, file_offset, start, len, + disk_len, type, 0, + BTRFS_COMPRESS_NONE, hash); +} int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset, u64 start, u64 len, u64 disk_len, int type) { return __btrfs_add_ordered_extent(inode, file_offset, start, len, disk_len, type, 1, - BTRFS_COMPRESS_NONE); + BTRFS_COMPRESS_NONE, NULL); } int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset, @@ -263,7 +300,7 @@ int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset, { return __btrfs_add_ordered_extent(inode, file_offset, start, len, disk_len, type, 0, - compress_type); + compress_type, NULL); } /* @@ -568,6 +605,7 @@ void btrfs_put_ordered_extent(struct btrfs_ordered_extent *entry) list_del(>list); kfree(sum); } + kfree(entry->hash); kmem_cache_free(btrfs_ordered_extent_cache, entry); } } diff --git 
a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index 5bad40387023..1d86674758d7 100644 ---
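The hash copy in __btrfs_add_ordered_extent() above deep-copies the caller's hash into the ordered extent so the two lifetimes are independent (the copy is freed in btrfs_put_ordered_extent()). A userspace model of that flexible-array pattern — the struct layout here is a simplified stand-in for btrfs_dedupe_hash, not the kernel definition:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for btrfs_dedupe_hash: fixed header plus a
 * flexible array whose length depends on the hash algorithm. */
struct dedupe_hash {
	uint64_t bytenr;
	uint32_t num_bytes;
	uint8_t  hash[];
};

/* Deep-copy a hash, as the ordered-extent setup does with
 * btrfs_dedupe_alloc_hash() + memcpy of the digest bytes. */
static struct dedupe_hash *hash_dup(const struct dedupe_hash *src,
				    size_t hash_len)
{
	struct dedupe_hash *d = malloc(sizeof(*d) + hash_len);

	if (!d)
		return NULL;
	d->bytenr = src->bytenr;
	d->num_bytes = src->num_bytes;
	memcpy(d->hash, src->hash, hash_len);
	return d;
}
```

The allocation-failure path matters in the kernel version: on ENOMEM the half-built ordered extent must be freed before returning, which is exactly what the diff's `kmem_cache_free(btrfs_ordered_extent_cache, entry)` branches do.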
[PATCH v14.8 01/14] btrfs: introduce type based delalloc metadata reserve
From: Wang Xiaoguang Introduce type based metadata reserve parameter for delalloc space reservation/freeing function. The problem we are going to solve is, btrfs use different max extent size for different mount options. For compression, the max extent size is 128K, while for non-compress write it's 128M. And furthermore, split/merge extent hook highly depends that max extent size. Such situation contributes to quite a lot of false ENOSPC. So this patch introduces the facility to help solve these false ENOSPC related to different max extent size. Currently, only normal 128M extent size is supported. More types will follow soon. Signed-off-by: Wang Xiaoguang Signed-off-by: Qu Wenruo Signed-off-by: Lu Fengqi --- fs/btrfs/ctree.h | 43 ++--- fs/btrfs/extent-tree.c | 48 --- fs/btrfs/file.c | 30 + fs/btrfs/free-space-cache.c | 6 +- fs/btrfs/inode-map.c | 9 ++- fs/btrfs/inode.c | 115 +-- fs/btrfs/ioctl.c | 23 +++ fs/btrfs/ordered-data.c | 6 +- fs/btrfs/ordered-data.h | 3 +- fs/btrfs/relocation.c| 22 --- fs/btrfs/tests/inode-tests.c | 15 +++-- 11 files changed, 223 insertions(+), 97 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 118346aceea9..f906aab71116 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -92,11 +92,24 @@ static const int btrfs_csum_sizes[] = { 4 }; /* * Count how many BTRFS_MAX_EXTENT_SIZE cover the @size */ -static inline u32 count_max_extents(u64 size) +static inline u32 count_max_extents(u64 size, u64 max_extent_size) { - return div_u64(size + BTRFS_MAX_EXTENT_SIZE - 1, BTRFS_MAX_EXTENT_SIZE); + return div_u64(size + max_extent_size - 1, max_extent_size); } +/* + * Type based metadata reserve type + * This affects how btrfs reserve metadata space for buffered write. 
+ * + * This is caused by the different max extent size for normal COW + * and compression, and further in-band dedupe + */ +enum btrfs_metadata_reserve_type { + BTRFS_RESERVE_NORMAL, +}; + +u64 btrfs_max_extent_size(enum btrfs_metadata_reserve_type reserve_type); + struct btrfs_mapping_tree { struct extent_map_tree map_tree; }; @@ -2760,8 +2773,9 @@ int btrfs_check_data_free_space(struct inode *inode, void btrfs_free_reserved_data_space(struct inode *inode, struct extent_changeset *reserved, u64 start, u64 len); void btrfs_delalloc_release_space(struct inode *inode, - struct extent_changeset *reserved, - u64 start, u64 len, bool qgroup_free); + struct extent_changeset *reserved, + u64 start, u64 len, bool qgroup_free, + enum btrfs_metadata_reserve_type reserve_type); void btrfs_free_reserved_data_space_noquota(struct inode *inode, u64 start, u64 len); void btrfs_trans_release_chunk_metadata(struct btrfs_trans_handle *trans); @@ -2771,13 +2785,17 @@ int btrfs_subvolume_reserve_metadata(struct btrfs_root *root, void btrfs_subvolume_release_metadata(struct btrfs_fs_info *fs_info, struct btrfs_block_rsv *rsv); void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes, - bool qgroup_free); + bool qgroup_free, + enum btrfs_metadata_reserve_type reserve_type); -int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes); +int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes, + enum btrfs_metadata_reserve_type reserve_type); void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes, -bool qgroup_free); + bool qgroup_free, + enum btrfs_metadata_reserve_type reserve_type); int btrfs_delalloc_reserve_space(struct inode *inode, - struct extent_changeset **reserved, u64 start, u64 len); + struct extent_changeset **reserved, u64 start, u64 len, + enum btrfs_metadata_reserve_type reserve_type); void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type); struct btrfs_block_rsv 
*btrfs_alloc_block_rsv(struct btrfs_fs_info *fs_info, unsigned short type); @@ -3188,7 +3206,11 @@ int btrfs_start_delalloc_inodes(struct btrfs_root *root); int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, int nr); int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end, unsigned int extra_bits, - struct extent_state **cached_state, int dedupe); +
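The reworked count_max_extents() above is the heart of this patch: the max extent size becomes a parameter instead of the hard-coded BTRFS_MAX_EXTENT_SIZE, because compression splits delalloc into 128 KiB extents while normal COW uses 128 MiB. A sketch of the round-up division (div_u64 in the kernel):

```c
#include <assert.h>
#include <stdint.h>

/* How many max_extent_size-sized extents cover @size (round up),
 * matching the reworked count_max_extents() in ctree.h. */
static uint32_t count_max_extents(uint64_t size, uint64_t max_extent_size)
{
	return (uint32_t)((size + max_extent_size - 1) / max_extent_size);
}
```

For a 128 MiB buffered write this gives 1024 outstanding extents under the 128 KiB compression limit but only 1 under the normal 128 MiB limit — the very mismatch between outstanding and reserved extent counts that patch 02 fixes.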
[PATCH v14.8 06/14] btrfs: dedupe: Introduce function to remove hash from in-memory tree
From: Wang Xiaoguang Introduce static function inmem_del() to remove hash from in-memory dedupe tree. And implement btrfs_dedupe_del() and btrfs_dedup_disable() interfaces. Also for btrfs_dedupe_disable(), add new functions to wait existing writer and block incoming writers to eliminate all possible race. Cc: Mark Fasheh Signed-off-by: Qu Wenruo Signed-off-by: Wang Xiaoguang --- fs/btrfs/dedupe.c | 132 +++--- 1 file changed, 126 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c index a0911dcdf502..3232fe5ae530 100644 --- a/fs/btrfs/dedupe.c +++ b/fs/btrfs/dedupe.c @@ -175,12 +175,6 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, return ret; } -int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info) -{ - /* Place holder for bisect, will be implemented in later patches */ - return 0; -} - static int inmem_insert_hash(struct rb_root *root, struct inmem_hash *hash, int hash_len) { @@ -323,3 +317,129 @@ int btrfs_dedupe_add(struct btrfs_trans_handle *trans, return inmem_add(dedupe_info, hash); return -EINVAL; } + +static struct inmem_hash * +inmem_search_bytenr(struct btrfs_dedupe_info *dedupe_info, u64 bytenr) +{ + struct rb_node **p = _info->bytenr_root.rb_node; + struct rb_node *parent = NULL; + struct inmem_hash *entry = NULL; + + while (*p) { + parent = *p; + entry = rb_entry(parent, struct inmem_hash, bytenr_node); + + if (bytenr < entry->bytenr) + p = &(*p)->rb_left; + else if (bytenr > entry->bytenr) + p = &(*p)->rb_right; + else + return entry; + } + + return NULL; +} + +/* Delete a hash from in-memory dedupe tree */ +static int inmem_del(struct btrfs_dedupe_info *dedupe_info, u64 bytenr) +{ + struct inmem_hash *hash; + + mutex_lock(_info->lock); + hash = inmem_search_bytenr(dedupe_info, bytenr); + if (!hash) { + mutex_unlock(_info->lock); + return 0; + } + + __inmem_del(dedupe_info, hash); + mutex_unlock(_info->lock); + return 0; +} + +/* Remove a dedupe hash from dedupe tree */ +int btrfs_dedupe_del(struct 
btrfs_trans_handle *trans, +struct btrfs_fs_info *fs_info, u64 bytenr) +{ + struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info; + + if (!fs_info->dedupe_enabled) + return 0; + + if (WARN_ON(dedupe_info == NULL)) + return -EINVAL; + + if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY) + return inmem_del(dedupe_info, bytenr); + return -EINVAL; +} + +static void inmem_destroy(struct btrfs_dedupe_info *dedupe_info) +{ + struct inmem_hash *entry, *tmp; + + mutex_lock(_info->lock); + list_for_each_entry_safe(entry, tmp, _info->lru_list, lru_list) + __inmem_del(dedupe_info, entry); + mutex_unlock(_info->lock); +} + +/* + * Helper function to wait and block all incoming writers + * + * Use rw_sem introduced for freeze to wait/block writers. + * So during the block time, no new write will happen, so we can + * do something quite safe, espcially helpful for dedupe disable, + * as it affect buffered write. + */ +static void block_all_writers(struct btrfs_fs_info *fs_info) +{ + struct super_block *sb = fs_info->sb; + + percpu_down_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1); + down_write(>s_umount); +} + +static void unblock_all_writers(struct btrfs_fs_info *fs_info) +{ + struct super_block *sb = fs_info->sb; + + up_write(>s_umount); + percpu_up_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1); +} + +int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info) +{ + struct btrfs_dedupe_info *dedupe_info; + int ret; + + dedupe_info = fs_info->dedupe_info; + + if (!dedupe_info) + return 0; + + /* Don't allow disable status change in RO mount */ + if (fs_info->sb->s_flags & MS_RDONLY) + return -EROFS; + + /* +* Wait for all unfinished writers and block further writers. +* Then sync the whole fs so all current write will go through +* dedupe, and all later write won't go through dedupe. 
+*/ + block_all_writers(fs_info); + ret = sync_filesystem(fs_info->sb); + fs_info->dedupe_enabled = 0; + fs_info->dedupe_info = NULL; + unblock_all_writers(fs_info); + if (ret < 0) + return ret; + + /* now we are OK to clean up everything */ + if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY) + inmem_destroy(dedupe_info); + + crypto_free_shash(dedupe_info->dedupe_driver); + kfree(dedupe_info); + return 0; +} -- 2.18.0 -- To
[PATCH v14.8 08/14] btrfs: dedupe: Introduce function to search for an existing hash
From: Wang Xiaoguang Introduce static function inmem_search() to handle the job for in-memory hash tree. The trick is, we must ensure the delayed ref head is not being run at the time we search the for the hash. With inmem_search(), we can implement the btrfs_dedupe_search() interface. Signed-off-by: Qu Wenruo Signed-off-by: Wang Xiaoguang Reviewed-by: Josef Bacik Signed-off-by: Lu Fengqi --- fs/btrfs/dedupe.c | 208 ++ 1 file changed, 208 insertions(+) diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c index 3232fe5ae530..e3084deb1eb7 100644 --- a/fs/btrfs/dedupe.c +++ b/fs/btrfs/dedupe.c @@ -8,6 +8,7 @@ #include "btrfs_inode.h" #include "transaction.h" #include "delayed-ref.h" +#include "qgroup.h" struct inmem_hash { struct rb_node hash_node; @@ -443,3 +444,210 @@ int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info) kfree(dedupe_info); return 0; } + +/* + * Caller must ensure the corresponding ref head is not being run. + */ +static struct inmem_hash * +inmem_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash) +{ + struct rb_node **p = _info->hash_root.rb_node; + struct rb_node *parent = NULL; + struct inmem_hash *entry = NULL; + u16 hash_algo = dedupe_info->hash_algo; + int hash_len = btrfs_hash_sizes[hash_algo]; + + while (*p) { + parent = *p; + entry = rb_entry(parent, struct inmem_hash, hash_node); + + if (memcmp(hash, entry->hash, hash_len) < 0) { + p = &(*p)->rb_left; + } else if (memcmp(hash, entry->hash, hash_len) > 0) { + p = &(*p)->rb_right; + } else { + /* Found, need to re-add it to LRU list head */ + list_del(>lru_list); + list_add(>lru_list, _info->lru_list); + return entry; + } + } + return NULL; +} + +static int inmem_search(struct btrfs_dedupe_info *dedupe_info, + struct inode *inode, u64 file_pos, + struct btrfs_dedupe_hash *hash) +{ + int ret; + struct btrfs_root *root = BTRFS_I(inode)->root; + struct btrfs_trans_handle *trans; + struct btrfs_delayed_ref_root *delayed_refs; + struct btrfs_delayed_ref_head *head; + struct 
btrfs_delayed_ref_head *insert_head; + struct btrfs_delayed_data_ref *insert_dref; + struct btrfs_qgroup_extent_record *insert_qrecord = NULL; + struct inmem_hash *found_hash; + int free_insert = 1; + int qrecord_inserted = 0; + u64 ref_root = root->root_key.objectid; + u64 bytenr; + u32 num_bytes; + + insert_head = kmem_cache_alloc(btrfs_delayed_ref_head_cachep, GFP_NOFS); + if (!insert_head) + return -ENOMEM; + insert_head->extent_op = NULL; + + insert_dref = kmem_cache_alloc(btrfs_delayed_data_ref_cachep, GFP_NOFS); + if (!insert_dref) { + kmem_cache_free(btrfs_delayed_ref_head_cachep, insert_head); + return -ENOMEM; + } + if (test_bit(BTRFS_FS_QUOTA_ENABLED, >fs_info->flags) && + is_fstree(ref_root)) { + insert_qrecord = kmalloc(sizeof(*insert_qrecord), GFP_NOFS); + if (!insert_qrecord) { + kmem_cache_free(btrfs_delayed_ref_head_cachep, + insert_head); + kmem_cache_free(btrfs_delayed_data_ref_cachep, + insert_dref); + return -ENOMEM; + } + } + + trans = btrfs_join_transaction(root); + if (IS_ERR(trans)) { + ret = PTR_ERR(trans); + goto free_mem; + } + +again: + mutex_lock(_info->lock); + found_hash = inmem_search_hash(dedupe_info, hash->hash); + /* If we don't find a duplicated extent, just return. */ + if (!found_hash) { + ret = 0; + goto out; + } + bytenr = found_hash->bytenr; + num_bytes = found_hash->num_bytes; + + btrfs_init_delayed_ref_head(insert_head, insert_qrecord, bytenr, + num_bytes, ref_root, 0, BTRFS_ADD_DELAYED_REF, true, + false); + + btrfs_init_delayed_ref_common(trans->fs_info, _dref->node, + bytenr, num_bytes, ref_root, BTRFS_ADD_DELAYED_REF, + BTRFS_EXTENT_DATA_REF_KEY); + insert_dref->root = ref_root; + insert_dref->parent = 0; + insert_dref->objectid = btrfs_ino(BTRFS_I(inode)); + insert_dref->offset = file_pos; + + delayed_refs = >transaction->delayed_refs; + + spin_lock(_refs->lock); + head = btrfs_find_delayed_ref_head(>transaction->delayed_refs, + bytenr); + if (!head) { +
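The inmem_search_hash() walk above orders the tree by memcmp() over the full digest. A simplified, array-based stand-in for that lookup — a sorted array searched with the same comparison rule, omitting the rb-tree plumbing and the LRU re-add on hit:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HASH_LEN 32	/* SHA-256, the only algorithm supported so far */

/* Binary search over a memcmp()-sorted table of digests; the kernel
 * does the same comparison while descending an rb-tree, and on a hit
 * additionally moves the entry to the head of the LRU list. */
static int find_hash(const uint8_t (*table)[HASH_LEN], int nr,
		     const uint8_t *needle)
{
	int lo = 0, hi = nr - 1;

	while (lo <= hi) {
		int mid = lo + (hi - lo) / 2;
		int cmp = memcmp(needle, table[mid], HASH_LEN);

		if (cmp < 0)
			hi = mid - 1;
		else if (cmp > 0)
			lo = mid + 1;
		else
			return mid;	/* duplicate block found */
	}
	return -1;			/* no dedupe hit */
}
```

What the sketch cannot show is the hard part of btrfs_dedupe_search(): a hit is only usable if the matching extent's delayed ref head is not being run, which is why the real code joins a transaction and takes delayed_refs->lock before trusting the result.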
[PATCH v10.3 4/5] btrfs-progs: dedupe: Add status subcommand
From: Qu Wenruo Add status subcommand for dedupe command group. Signed-off-by: Qu Wenruo --- Documentation/btrfs-dedupe-inband.asciidoc | 3 + btrfs-completion | 2 +- cmds-dedupe-ib.c | 81 ++ 3 files changed, 85 insertions(+), 1 deletion(-) diff --git a/Documentation/btrfs-dedupe-inband.asciidoc b/Documentation/btrfs-dedupe-inband.asciidoc index de32eb97d9dd..df068c31ca3a 100644 --- a/Documentation/btrfs-dedupe-inband.asciidoc +++ b/Documentation/btrfs-dedupe-inband.asciidoc @@ -86,6 +86,9 @@ And compression has higher priority than in-band de-duplication, means if compression and de-duplication is enabled at the same time, only compression will work. +*status* :: +Show current in-band de-duplication status of a filesystem. + BACKENDS Btrfs in-band de-duplication will support different storage backends, with diff --git a/btrfs-completion b/btrfs-completion index 2f113e01fb01..c8e67b459341 100644 --- a/btrfs-completion +++ b/btrfs-completion @@ -41,7 +41,7 @@ _btrfs() commands_quota='enable disable rescan' commands_qgroup='assign remove create destroy show limit' commands_replace='start status cancel' - commands_dedupe='enable disable' + commands_dedupe='enable disable status' if [[ "$cur" == -* && $cword -le 3 && "$cmd" != "help" ]]; then COMPREPLY=( $( compgen -W '--help' -- "$cur" ) ) diff --git a/cmds-dedupe-ib.c b/cmds-dedupe-ib.c index 031766c1d91c..854cbda131a3 100644 --- a/cmds-dedupe-ib.c +++ b/cmds-dedupe-ib.c @@ -302,12 +302,93 @@ out: return 0; } +static const char * const cmd_dedupe_ib_status_usage[] = { + "btrfs dedupe status ", + "Show current in-band(write time) de-duplication status of a btrfs.", + NULL +}; + +static int cmd_dedupe_ib_status(int argc, char **argv) +{ + struct btrfs_ioctl_dedupe_args dargs; + DIR *dirstream; + char *path; + int fd; + int ret; + int print_limit = 1; + + if (check_argc_exact(argc, 2)) + usage(cmd_dedupe_ib_status_usage); + + path = argv[1]; + fd = open_file_or_dir(path, ); + if (fd < 0) { + error("failed to open file or 
directory: %s", path);
+		ret = 1;
+		goto out;
+	}
+	memset(&dargs, 0, sizeof(dargs));
+	dargs.cmd = BTRFS_DEDUPE_CTL_STATUS;
+
+	ret = ioctl(fd, BTRFS_IOC_DEDUPE_CTL, &dargs);
+	if (ret < 0) {
+		error("failed to get inband deduplication status: %s",
+		      strerror(errno));
+		ret = 1;
+		goto out;
+	}
+	ret = 0;
+	if (dargs.status == 0) {
+		printf("Status: \t\t\tDisabled\n");
+		goto out;
+	}
+	printf("Status:\t\t\tEnabled\n");
+
+	if (dargs.hash_algo == BTRFS_DEDUPE_HASH_SHA256)
+		printf("Hash algorithm:\t\tSHA-256\n");
+	else
+		printf("Hash algorithm:\t\tUnrecognized(%x)\n",
+		       dargs.hash_algo);
+
+	if (dargs.backend == BTRFS_DEDUPE_BACKEND_INMEMORY) {
+		printf("Backend:\t\tIn-memory\n");
+		print_limit = 1;
+	} else {
+		printf("Backend:\t\tUnrecognized(%x)\n",
+		       dargs.backend);
+	}
+
+	printf("Dedup Blocksize:\t%llu\n", dargs.blocksize);
+
+	if (print_limit) {
+		u64 cur_mem;
+
+		/* Limit nr may be 0 */
+		if (dargs.limit_nr)
+			cur_mem = dargs.current_nr * (dargs.limit_mem /
+					dargs.limit_nr);
+		else
+			cur_mem = 0;
+
+		printf("Number of hash: \t[%llu/%llu]\n", dargs.current_nr,
+		       dargs.limit_nr);
+		printf("Memory usage: \t\t[%s/%s]\n",
+		       pretty_size(cur_mem),
+		       pretty_size(dargs.limit_mem));
+	}
+out:
+	close_file_or_dir(fd, dirstream);
+	return ret;
+}
+
 const struct cmd_group dedupe_ib_cmd_group = {
 	dedupe_ib_cmd_group_usage, dedupe_ib_cmd_group_info, {
 		{ "enable", cmd_dedupe_ib_enable, cmd_dedupe_ib_enable_usage,
 		  NULL, 0},
 		{ "disable", cmd_dedupe_ib_disable, cmd_dedupe_ib_disable_usage,
 		  NULL, 0},
+		{ "status", cmd_dedupe_ib_status, cmd_dedupe_ib_status_usage,
+		  NULL, 0},
 		NULL_CMD_STRUCT
 	}
};
-- 
2.18.0
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v10.3 3/5] btrfs-progs: dedupe: Add disable support for inband deduplication
From: Qu Wenruo

Add disable subcommand for dedupe command group.

Signed-off-by: Qu Wenruo
Signed-off-by: Lu Fengqi
---
 Documentation/btrfs-dedupe-inband.asciidoc |  5 +++
 btrfs-completion                           |  2 +-
 cmds-dedupe-ib.c                           | 42 ++
 3 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/Documentation/btrfs-dedupe-inband.asciidoc b/Documentation/btrfs-dedupe-inband.asciidoc
index 82f970a69953..de32eb97d9dd 100644
--- a/Documentation/btrfs-dedupe-inband.asciidoc
+++ b/Documentation/btrfs-dedupe-inband.asciidoc
@@ -22,6 +22,11 @@ use with caution.
 
 SUBCOMMAND
 --
+*disable* <path>::
+Disable in-band de-duplication for a filesystem.
++
+This will trash all stored dedupe hashes.
++
 *enable* [options] <path>::
 Enable in-band de-duplication for a filesystem.
 +

diff --git a/btrfs-completion b/btrfs-completion
index 69e02ad11990..2f113e01fb01 100644
--- a/btrfs-completion
+++ b/btrfs-completion
@@ -41,7 +41,7 @@ _btrfs()
 	commands_quota='enable disable rescan'
 	commands_qgroup='assign remove create destroy show limit'
 	commands_replace='start status cancel'
-	commands_dedupe='enable'
+	commands_dedupe='enable disable'
 
 	if [[ "$cur" == -* && $cword -le 3 && "$cmd" != "help" ]]; then
 		COMPREPLY=( $( compgen -W '--help' -- "$cur" ) )

diff --git a/cmds-dedupe-ib.c b/cmds-dedupe-ib.c
index cb62d0064167..031766c1d91c 100644
--- a/cmds-dedupe-ib.c
+++ b/cmds-dedupe-ib.c
@@ -262,10 +262,52 @@ out:
 	return ret;
 }
 
+static const char * const cmd_dedupe_ib_disable_usage[] = {
+	"btrfs dedupe disable <path>",
+	"Disable in-band(write time) de-duplication of a btrfs.",
+	NULL
+};
+
+static int cmd_dedupe_ib_disable(int argc, char **argv)
+{
+	struct btrfs_ioctl_dedupe_args dargs;
+	DIR *dirstream;
+	char *path;
+	int fd;
+	int ret;
+
+	if (check_argc_exact(argc, 2))
+		usage(cmd_dedupe_ib_disable_usage);
+
+	path = argv[1];
+	fd = open_file_or_dir(path, &dirstream);
+	if (fd < 0) {
+		error("failed to open file or directory: %s", path);
+		return 1;
+	}
+	memset(&dargs, 0, sizeof(dargs));
+	dargs.cmd = BTRFS_DEDUPE_CTL_DISABLE;
+
+	ret =
ioctl(fd, BTRFS_IOC_DEDUPE_CTL, &dargs);
+	if (ret < 0) {
+		error("failed to disable inband deduplication: %s",
+		      strerror(errno));
+		ret = 1;
+		goto out;
+	}
+	ret = 0;
+
+out:
+	close_file_or_dir(fd, dirstream);
+	return ret;
+}
+
 const struct cmd_group dedupe_ib_cmd_group = {
 	dedupe_ib_cmd_group_usage, dedupe_ib_cmd_group_info, {
 		{ "enable", cmd_dedupe_ib_enable, cmd_dedupe_ib_enable_usage,
 		  NULL, 0},
+		{ "disable", cmd_dedupe_ib_disable, cmd_dedupe_ib_disable_usage,
+		  NULL, 0},
 		NULL_CMD_STRUCT
 	}
};
-- 
2.18.0
[PATCH v10.3 5/5] btrfs-progs: dedupe: introduce reconfigure subcommand
From: Qu Wenruo Introduce reconfigure subcommand to co-operate with new kernel ioctl modification. Signed-off-by: Qu Wenruo --- Documentation/btrfs-dedupe-inband.asciidoc | 7 ++ cmds-dedupe-ib.c | 75 +- 2 files changed, 66 insertions(+), 16 deletions(-) diff --git a/Documentation/btrfs-dedupe-inband.asciidoc b/Documentation/btrfs-dedupe-inband.asciidoc index df068c31ca3a..5fc4bb0d5940 100644 --- a/Documentation/btrfs-dedupe-inband.asciidoc +++ b/Documentation/btrfs-dedupe-inband.asciidoc @@ -86,6 +86,13 @@ And compression has higher priority than in-band de-duplication, means if compression and de-duplication is enabled at the same time, only compression will work. +*reconfigure* [options] :: +Re-configure in-band de-duplication parameters of a filesystem. ++ +In-band de-duplication must be enabled first before re-configuration. ++ +[Options] are the same as for 'btrfs dedupe-inband enable'. + *status* :: Show current in-band de-duplication status of a filesystem. diff --git a/cmds-dedupe-ib.c b/cmds-dedupe-ib.c index 854cbda131a3..925d5a8f756a 100644 --- a/cmds-dedupe-ib.c +++ b/cmds-dedupe-ib.c @@ -56,7 +56,6 @@ static const char * const cmd_dedupe_ib_enable_usage[] = { NULL }; - #define report_fatal_parameter(dargs, old, member, type, err_val, fmt) \ ({ \ if (dargs->member != old->member && \ @@ -88,6 +87,12 @@ static void report_parameter_error(struct btrfs_ioctl_dedupe_args *dargs, } report_option_parameter(dargs, old, flags, u8, -1, x); } + + if (dargs->status == 0 && old->cmd == BTRFS_DEDUPE_CTL_RECONF) { + error("must enable dedupe before reconfiguration"); + return; + } + if (report_fatal_parameter(dargs, old, cmd, u16, -1, u) || report_fatal_parameter(dargs, old, blocksize, u64, -1, llu) || report_fatal_parameter(dargs, old, backend, u16, -1, u) || @@ -100,14 +105,17 @@ static void report_parameter_error(struct btrfs_ioctl_dedupe_args *dargs, old->limit_nr, old->limit_mem); } -static int cmd_dedupe_ib_enable(int argc, char **argv) +static int 
enable_reconfig_dedupe(int argc, char **argv, int reconf) { int ret; int fd = -1; char *path; u64 blocksize = BTRFS_DEDUPE_BLOCKSIZE_DEFAULT; + int blocksize_set = 0; u16 hash_algo = BTRFS_DEDUPE_HASH_SHA256; + int hash_algo_set = 0; u16 backend = BTRFS_DEDUPE_BACKEND_INMEMORY; + int backend_set = 0; u64 limit_nr = 0; u64 limit_mem = 0; u64 sys_mem = 0; @@ -129,20 +137,22 @@ static int cmd_dedupe_ib_enable(int argc, char **argv) { NULL, 0, NULL, 0} }; - c = getopt_long(argc, argv, "s:b:a:l:m:", long_options, NULL); + c = getopt_long(argc, argv, "s:b:a:l:m:f", long_options, NULL); if (c < 0) break; switch (c) { case 's': - if (!strcasecmp("inmemory", optarg)) + if (!strcasecmp("inmemory", optarg)) { backend = BTRFS_DEDUPE_BACKEND_INMEMORY; - else { + backend_set = 1; + } else { error("unsupported dedupe backend: %s", optarg); exit(1); } break; case 'b': blocksize = parse_size(optarg); + blocksize_set = 1; break; case 'a': if (strcmp("sha256", optarg)) { @@ -226,26 +236,40 @@ static int cmd_dedupe_ib_enable(int argc, char **argv) return 1; } memset(&dargs, -1, sizeof(dargs)); - dargs.cmd = BTRFS_DEDUPE_CTL_ENABLE; - dargs.blocksize = blocksize; - dargs.hash_algo = hash_algo; - dargs.limit_nr = limit_nr; - dargs.limit_mem = limit_mem; - dargs.backend = backend; - if (force) - dargs.flags |= BTRFS_DEDUPE_FLAG_FORCE; - else - dargs.flags = 0; + if (reconf) { + dargs.cmd = BTRFS_DEDUPE_CTL_RECONF; + if (blocksize_set) + dargs.blocksize = blocksize; + if (hash_algo_set) + dargs.hash_algo = hash_algo; + if (backend_set) + dargs.backend = backend; + dargs.limit_nr = limit_nr; + dargs.limit_mem = limit_mem; + } else { + dargs.cmd = BTRFS_DEDUPE_CTL_ENABLE; + dargs.blocksize = blocksize; + dargs.hash_algo = hash_algo; + dargs.limit_nr =
[PATCH v10.3 1/5] btrfs-progs: Basic framework for dedupe-inband command group
From: Qu Wenruo Add basic ioctl header and command group framework for later use. Along with basic man page doc. Signed-off-by: Qu Wenruo Signed-off-by: Lu Fengqi --- Documentation/Makefile.in | 1 + Documentation/btrfs-dedupe-inband.asciidoc | 40 ++ Documentation/btrfs.asciidoc | 4 +++ Makefile | 3 +- btrfs.c| 2 ++ cmds-dedupe-ib.c | 35 +++ commands.h | 2 ++ dedupe-ib.h| 28 +++ ioctl.h| 36 +++ 9 files changed, 150 insertions(+), 1 deletion(-) create mode 100644 Documentation/btrfs-dedupe-inband.asciidoc create mode 100644 cmds-dedupe-ib.c create mode 100644 dedupe-ib.h diff --git a/Documentation/Makefile.in b/Documentation/Makefile.in index 184647c41940..402155fae001 100644 --- a/Documentation/Makefile.in +++ b/Documentation/Makefile.in @@ -28,6 +28,7 @@ MAN8_TXT += btrfs-qgroup.asciidoc MAN8_TXT += btrfs-replace.asciidoc MAN8_TXT += btrfs-restore.asciidoc MAN8_TXT += btrfs-property.asciidoc +MAN8_TXT += btrfs-dedupe-inband.asciidoc # Category 5 manual page MAN5_TXT += btrfs-man5.asciidoc diff --git a/Documentation/btrfs-dedupe-inband.asciidoc b/Documentation/btrfs-dedupe-inband.asciidoc new file mode 100644 index ..9ee2bc75db3a --- /dev/null +++ b/Documentation/btrfs-dedupe-inband.asciidoc @@ -0,0 +1,40 @@ +btrfs-dedupe(8) +== + +NAME + +btrfs-dedupe-inband - manage in-band (write time) de-duplication of a btrfs +filesystem + +SYNOPSIS + +*btrfs dedupe-inband* + +DESCRIPTION --- +*btrfs dedupe-inband* is used to enable/disable or show current in-band de-duplication +status of a btrfs filesystem. + +Kernel support for in-band de-duplication starts from 4.8. + +WARNING: In-band de-duplication is still an experimental feature of btrfs, +use with caution. + +SUBCOMMAND -- +Nothing yet + +EXIT STATUS --- +*btrfs dedupe-inband* returns a zero exit status if it succeeds. Non-zero is +returned in case of failure. + +AVAILABILITY + +*btrfs* is part of btrfs-progs. +Please refer to the btrfs wiki http://btrfs.wiki.kernel.org for +further details. 
+ +SEE ALSO + +`mkfs.btrfs`(8), diff --git a/Documentation/btrfs.asciidoc b/Documentation/btrfs.asciidoc index 7316ac094413..d37ae3571bd3 100644 --- a/Documentation/btrfs.asciidoc +++ b/Documentation/btrfs.asciidoc @@ -50,6 +50,10 @@ COMMANDS Do off-line check on a btrfs filesystem. + See `btrfs-check`(8) for details. +*dedupe*:: + Control btrfs in-band(write time) de-duplication. + + See `btrfs-dedupe`(8) for details. + *device*:: Manage devices managed by btrfs, including add/delete/scan and so on. + diff --git a/Makefile b/Makefile index 544410e6440c..1ebed7135714 100644 --- a/Makefile +++ b/Makefile @@ -123,7 +123,8 @@ cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \ cmds-restore.o cmds-rescue.o chunk-recover.o super-recover.o \ cmds-property.o cmds-fi-usage.o cmds-inspect-dump-tree.o \ cmds-inspect-dump-super.o cmds-inspect-tree-stats.o cmds-fi-du.o \ - mkfs/common.o check/mode-common.o check/mode-lowmem.o + mkfs/common.o check/mode-common.o check/mode-lowmem.o \ + cmds-dedupe-ib.o libbtrfs_objects = send-stream.o send-utils.o kernel-lib/rbtree.o btrfs-list.o \ kernel-lib/crc32c.o messages.o \ uuid-tree.o utils-lib.o rbtree-utils.o diff --git a/btrfs.c b/btrfs.c index 2d39f2ced3e8..2168f5a8bc7f 100644 --- a/btrfs.c +++ b/btrfs.c @@ -255,6 +255,8 @@ static const struct cmd_group btrfs_cmd_group = { { "quota", cmd_quota, NULL, &quota_cmd_group, 0 }, { "qgroup", cmd_qgroup, NULL, &qgroup_cmd_group, 0 }, { "replace", cmd_replace, NULL, &replace_cmd_group, 0 }, + { "dedupe-inband", cmd_dedupe_ib, NULL, &dedupe_ib_cmd_group, + 0 }, { "help", cmd_help, cmd_help_usage, NULL, 0 }, { "version", cmd_version, cmd_version_usage, NULL, 0 }, NULL_CMD_STRUCT diff --git a/cmds-dedupe-ib.c b/cmds-dedupe-ib.c new file mode 100644 index ..73c923a797da --- /dev/null +++ b/cmds-dedupe-ib.c @@ -0,0 +1,35 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2017 Fujitsu. All rights reserved. 
+ */ + +#include +#include +#include + +#include "ctree.h" +#include "ioctl.h" + +#include "commands.h" +#include "utils.h" +#include "kerncompat.h" +#include "dedupe-ib.h" + +static const char * const dedupe_ib_cmd_group_usage[] = { + "btrfs dedupe-inband [options] ", + NULL +}; + +static const char dedupe_ib_cmd_group_info[] = +"manage inband(write time)
[PATCH v10.3 2/5] btrfs-progs: dedupe: Add enable command for dedupe command group
From: Qu Wenruo Add enable subcommand for dedupe command group. Signed-off-by: Qu Wenruo Signed-off-by: Lu Fengqi --- Documentation/btrfs-dedupe-inband.asciidoc | 114 +- btrfs-completion | 6 +- cmds-dedupe-ib.c | 241 + ioctl.h| 2 + 4 files changed, 361 insertions(+), 2 deletions(-) diff --git a/Documentation/btrfs-dedupe-inband.asciidoc b/Documentation/btrfs-dedupe-inband.asciidoc index 9ee2bc75db3a..82f970a69953 100644 --- a/Documentation/btrfs-dedupe-inband.asciidoc +++ b/Documentation/btrfs-dedupe-inband.asciidoc @@ -22,7 +22,119 @@ use with caution. SUBCOMMAND -- -Nothing yet +*enable* [options] :: +Enable in-band de-duplication for a filesystem. ++ +`Options` ++ +-f|--force +Force 'enable' command to be executed. +Will skip memory limit check and allow 'enable' to be executed even if in-band +de-duplication is already enabled. ++ +NOTE: If re-enabling dedupe with the '-f' option, any unspecified parameter will be +reset to its default value. + +-s|--storage-backend +Specify de-duplication hash storage backend. +Only 'inmemory' backend is supported yet. +If not specified, default value is 'inmemory'. ++ +Refer to *BACKENDS* section for more information. + +-b|--blocksize +Specify dedupe block size. +Supported values are powers of 2 from '16K' to '8M'. +Default value is '128K'. ++ +Refer to *BLOCKSIZE* section for more information. + +-a|--hash-algorithm +Specify hash algorithm. +Only 'sha256' is supported yet. + +-l|--limit-hash +Specify maximum number of hashes stored in memory. +Only works for 'inmemory' backend. +Conflicts with '-m' option. ++ +Only positive values are valid. +Default value is '32K'. + +-m|--limit-memory +Specify maximum memory used for hashes. +Only works for 'inmemory' backend. +Conflicts with '-l' option. ++ +Only values larger than or equal to '1024' are valid. +No default value. ++ +NOTE: Memory limit will be rounded down to kernel internal hash size, +so the memory limit shown in 'btrfs dedupe status' may be different +from the . 
+ +WARNING: Too large a value for '-l' or '-m' will easily trigger OOM. +Please use with caution according to system memory. + +NOTE: In-band de-duplication is not compatible with compression yet. +And compression has higher priority than in-band de-duplication, means if +compression and de-duplication is enabled at the same time, only compression +will work. + +BACKENDS + +Btrfs in-band de-duplication will support different storage backends, with +different use cases and features. + +In-memory backend:: +This backend provides backward-compatibility, and more fine-tuning options. +But the hash pool is non-persistent and may exhaust kernel memory if not set up +properly. ++ +This backend can be used on old btrfs (without '-O dedupe' mkfs option). +When used on old btrfs, this backend needs to be enabled manually after mount. ++ +Designed for fast hash search speed, the in-memory backend will keep all dedupe +hashes in memory. (Although overall performance is still much the same with +'ondisk' backend if all 'ondisk' hash can be cached in memory) ++ +And it only keeps a limited number of hashes in memory to avoid exhausting memory. +Hashes over the limit will be dropped following least-recently-used (LRU) behavior. +So this backend has a consistent overhead for a given limit but can't ensure +all duplicated blocks will be de-duplicated. ++ +After umount and mount, the in-memory backend needs to refill its hash pool. + +On-disk backend:: +This backend provides a persistent hash pool, with smarter memory management +for the hash pool. +But it's not backward-compatible, meaning it must be used with '-O dedupe' mkfs +option and older kernels can't mount it read-write. ++ +Designed for de-duplication rate, the hash pool is stored as a btrfs B+ tree on disk. +This behavior may cause extra disk IO for hash search under high memory +pressure. ++ +After umount and mount, the on-disk backend still has its hashes on disk, no need to +refill its dedupe hash pool. 
+ +Currently, only 'inmemory' backend is supported in btrfs-progs. + +DEDUPE BLOCK SIZE + +In-band de-duplication is done at dedupe block size. +Any data smaller than dedupe block size won't go through in-band +de-duplication. + +And dedupe block size affects dedupe rate and fragmentation heavily. + +Smaller block size will cause more fragments, but higher dedupe rate. + +Larger block size will cause less fragments, but lower dedupe rate. + +In-band de-duplication rate is highly related to the workload pattern. +So it's highly recommended to align dedupe block size to the workload +block size to make full use of de-duplication. EXIT STATUS --- diff --git a/btrfs-completion b/btrfs-completion index ae683f4ecf61..69e02ad11990 100644 --- a/btrfs-completion +++ b/btrfs-completion @@ -29,7 +29,7 @@ _btrfs() local cmd=${words[1]} -
[PATCH v10.3 0/5] In-band de-duplication for btrfs-progs
Patchset can be fetched from github: https://github.com/littleroad/btrfs-progs.git dedupe_latest

Inband dedupe (in-memory backend only) ioctl support for btrfs-progs.

v7 changes: Update ctree.h to follow kernel structure change. Update print-tree to follow kernel structure change.
V8 changes: Move dedup props and on-disk backend support out of the patchset. Change command group name to "dedupe-inband", to avoid confusion with possible out-of-band dedupe. Suggested by Mark. Rebase to latest devel branch.
V9 changes: Follow the kernel's ioctl change to support FORCE flag, new reconf ioctl, and more precise error reporting.
v10 changes: Rebase to v4.10. Add BUILD_ASSERT for btrfs_ioctl_dedupe_args.
v10.1 changes: Rebase to v4.14.
v10.2 changes: Rebase to v4.16.1.
v10.3 changes: Rebase to v4.17.

Qu Wenruo (5):
  btrfs-progs: Basic framework for dedupe-inband command group
  btrfs-progs: dedupe: Add enable command for dedupe command group
  btrfs-progs: dedupe: Add disable support for inband deduplication
  btrfs-progs: dedupe: Add status subcommand
  btrfs-progs: dedupe: introduce reconfigure subcommand

 Documentation/Makefile.in                  |   1 +
 Documentation/btrfs-dedupe-inband.asciidoc | 167 
 Documentation/btrfs.asciidoc               |   4 +
 Makefile                                   |   3 +-
 btrfs-completion                           |   6 +-
 btrfs.c                                    |   2 +
 cmds-dedupe-ib.c                           | 442 +
 commands.h                                 |   2 +
 dedupe-ib.h                                |  28 ++
 ioctl.h                                    |  38 ++
 10 files changed, 691 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/btrfs-dedupe-inband.asciidoc
 create mode 100644 cmds-dedupe-ib.c
 create mode 100644 dedupe-ib.h
-- 
2.18.0
Re: Transaction aborted (error -28) btrfs_run_delayed_refs*0x163/0x190
On 07/10/2018 09:38 AM, Martin Raiber wrote:
> This is probably a known issue. See
> https://www.spinics.net/lists/linux-btrfs/msg75647.html
> You could apply the patch in this thread and mount with enospc_debug to
> confirm it is the same issue.

OK, I've applied the patch, by hand, and hopefully put it in the right place. Need to learn to patch better.

Booted the (rebuilt with make ; make modules_install syslinux etc) kernel with the option enospc_debug for the two btrfs file systems (1st entry for each in fstab). I was not expecting to get the issue to appear quickly as it took several days to hit previously. However, on checking I see another error, not sure if it is related, still is in extent-tree.c.

https://drive.google.com/file/d/1K12MfpWFB1aHSXBga1Rym5terbmHeDfI/view?usp=sharing

Pete
Re: Corrupted FS with "open_ctree failed" and "failed to recover balance: -5"
On Wed, Jul 11, 2018 at 9:37 AM, Udo Waechter wrote:
> Hello everyone,
>
> I have a corrupted filesystem which I can't seem to recover.
>
> The machine is:
> Debian Linux, kernel 4.9 and btrfs-progs v4.13.3
>
> I have a HDD RAID5 with LVM and the volume in question is a LVM volume.
> On top of that I had a RAID1 SSD cache with lvm-cache.
>
> Yesterday both! SSDs died within minutes. This led to the corrupted
> filesystem that I have now.
>
> I hope I followed the procedure correctly.
>
> What I tried so far:
> * "mount -o usebackuproot,ro " and "nospace_cache" "clear_cache" and all
> permutations of these mount options
>
> I'm getting:
>
> [96926.830400] BTRFS info (device dm-2): trying to use backup root at
> mount time
> [96926.830406] BTRFS info (device dm-2): disk space caching is enabled
> [96926.927978] BTRFS error (device dm-2): parent transid verify failed
> on 321269628928 wanted 3276017 found 3275985
> [96926.938619] BTRFS error (device dm-2): parent transid verify failed
> on 321269628928 wanted 3276017 found 3275985
> [96926.940705] BTRFS error (device dm-2): failed to recover balance: -5
> [96926.985801] BTRFS error (device dm-2): open_ctree failed
>
> The weird thing is that I can't really find information about the
> "failed to recover balance: -5" error. - There was no rebalancing
> running during the crash.
>
> * btrfs-find-root: https://pastebin.com/qkjnSUF7 - It bothers me that I
> don't see any "good generations" as described here:
> https://btrfs.wiki.kernel.org/index.php/Restore
>
> * "btrfs rescue" - it starts, then goes to "looping on XYZ" then stops
>
> * "btrfs rescue super-recover -v" gives:
>
> All Devices:
> Device: id = 1, name = /dev/vg00/...
> Before Recovering:
> [All good supers]:
> device name = /dev/vg00/...
> superblock bytenr = 65536
>
> device name = /dev/vg00/...
> superblock bytenr = 67108864
>
> device name = /dev/vg00/...
> superblock bytenr = 274877906944
>
> [All bad supers]:
>
> All supers are valid, no need to recover
>
> * Unfortunately I did a "btrfs rescue zero-log" at some point :( - As it
> turns out that might have been a bad idea
>
> * Also, a "btrfs check --init-extent-tree" - https://pastebin.com/jATDCFZy
>
> The volume contained qcow2 images for VMs. I need only one of those,
> since one piece of important software decided to not do backups :(
>
> Any help is highly appreciated.

You should ask for help sooner. It's much harder to give advice after you've modified the file system multiple times since the original problem happened. But maybe someone has ideas on the way forward, other than 'btrfs restore' which is the offline scrape tool.
https://btrfs.wiki.kernel.org/index.php/Restore

There's a bunch of fixes between btrfs-progs 4.13 and 4.17, which is now current.

But anyway with lvmcache and the SSDs dying, it sounds like there are too many transaction commits to Btrfs that are lost in the failed lvmcache.

Also, gmail considers your email phishing. So something with your mail is misconfigured for use on lists.

"This message has a from address in zoide.net but has failed zoide.net's required tests for authentication. Learn more"

My best guess from the header is that dmarc is set by your email provider to fail, and while many mail clients ignore this, Google honors it. And it's the dmarc fail that makes it incompatible with email lists because lists always rewrite the email posting (they add footers and rewrite headers).
Authentication-Results: mx.google.com; dkim=neutral (body hash did not verify) header.i=@zoide.net header.s=mx header.b=vATMNdwx; spf=pass (google.com: best guess record for domain of linux-btrfs-ow...@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-btrfs-ow...@vger.kernel.org; dmarc=fail (p=REJECT sp=REJECT dis=QUARANTINE) header.from=zoide.net

-- 
Chris Murphy
Re: btrfs check mode normal still hard crash-hanging systems
On Wed, Jul 11, 2018 at 11:09:56AM -0600, Chris Murphy wrote:
> On Tue, Jul 10, 2018 at 12:09 PM, Marc MERLIN wrote:
> > Thanks to Su and Qu, I was able to get my filesystem to a point that
> > it's mountable.
> > I then deleted loads of snapshots and I'm down to 26.
> >
> > It now looks like this:
> > gargamel:~# btrfs fi show /mnt/mnt
> > Label: 'dshelf2' uuid: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d
> > Total devices 1 FS bytes used 12.30TiB
> > devid 1 size 14.55TiB used 13.81TiB path /dev/mapper/dshelf2
> >
> > gargamel:~# btrfs fi df /mnt/mnt
> > Data, single: total=13.57TiB, used=12.19TiB
> > System, DUP: total=32.00MiB, used=1.55MiB
> > Metadata, DUP: total=124.50GiB, used=115.62GiB
> > Metadata, single: total=216.00MiB, used=0.00B
> > GlobalReserve, single: total=512.00MiB, used=0.00B
> >
> > Problems
> > 1) btrfs check --repair _still_ takes all 32GB of RAM and crashes the
> > server, despite my deleting lots of snapshots.
> > Is it because I have too many files then?
>
> I think the original mode needs most of the metadata in memory.
>
> I'm not understanding why btrfs check won't use swap like at least
> xfs_repair and pretty sure e2fsck will as well.
>
> Using 128G swap on nvme with original check is still gonna be faster
> than lowmem mode.

Yeah, that's been also a concern/question of mine all these years, even if Su isn't working on that code, and likely is the wrong person to ask.

Personally, my take is that if btrfs wants to be taken seriously, at the very least its fsck tool should not hard crash a system you run it on. (and it really does the worst kind of hard crash I've ever seen, OOM can't trigger fast enough, linux doesn't panic, so it can't self reboot either, it just hard dies and hangs)

Maybe David knows?

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
Re: btrfs check lowmem, take 2
On Tue, Jul 10, 2018 at 12:09 PM, Marc MERLIN wrote:
> Thanks to Su and Qu, I was able to get my filesystem to a point that
> it's mountable.
> I then deleted loads of snapshots and I'm down to 26.
>
> It now looks like this:
> gargamel:~# btrfs fi show /mnt/mnt
> Label: 'dshelf2' uuid: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d
> Total devices 1 FS bytes used 12.30TiB
> devid 1 size 14.55TiB used 13.81TiB path /dev/mapper/dshelf2
>
> gargamel:~# btrfs fi df /mnt/mnt
> Data, single: total=13.57TiB, used=12.19TiB
> System, DUP: total=32.00MiB, used=1.55MiB
> Metadata, DUP: total=124.50GiB, used=115.62GiB
> Metadata, single: total=216.00MiB, used=0.00B
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> Problems
> 1) btrfs check --repair _still_ takes all 32GB of RAM and crashes the
> server, despite my deleting lots of snapshots.
> Is it because I have too many files then?

I think the original mode needs most of the metadata in memory.

I'm not understanding why btrfs check won't use swap like at least xfs_repair and pretty sure e2fsck will as well.

Using 128G swap on nvme with original check is still gonna be faster than lowmem mode.

-- 
Chris Murphy
[PATCH v2] btrfs: use customized batch size for total_bytes_pinned
In commit b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly pinned bytes") we use total_bytes_pinned to track how many bytes we are going to free in this transaction. When we are close to ENOSPC, we check it and know if we can make the allocation by committing the current transaction. For every data/metadata extent we are going to free, we add total_bytes_pinned in btrfs_free_extent() and btrfs_free_tree_block(), and release it in unpin_extent_range() when we finish the transaction. So this is a variable we frequently update but rarely read - just the suitable use of percpu_counter.

But in the previous commit we update total_bytes_pinned with the default batch size of 32, making every update essentially a spin lock protected update. Since every spin lock/unlock operation involves syncing a globally used variable and some kind of barrier in an SMP system, this is more expensive than using total_bytes_pinned as a simple atomic64_t.

So fix this by using a customized batch size. Since we only read total_bytes_pinned when we are close to ENOSPC and fail to alloc a new chunk, we can use a really large batch size and have nearly no penalty in most cases.

[Test]
We test the patch on a 4-core x86 machine:
1. fallocate a 16GiB size test file.
2. take a snapshot (so all following writes will be cow writes).
3. run a 180 sec, 4 jobs, 4K random write fio on the test file.

We also add a temporary lockdep class on percpu_counter's spin lock used by total_bytes_pinned to track lock_stat.
[Results]

unpatched:
lock_stat version 0.4
---
class name  con-bounces  contentions  waittime-min  waittime-max  waittime-total  waittime-avg  acq-bounces  acquisitions  holdtime-min  holdtime-max  holdtime-total  holdtime-avg
total_bytes_pinned_percpu:  82  82  0.21  0.61  29.46  0.36  298340  635973  0.09  11.01  173476.25  0.27

patched:
lock_stat version 0.4
---
class name  con-bounces  contentions  waittime-min  waittime-max  waittime-total  waittime-avg  acq-bounces  acquisitions  holdtime-min  holdtime-max  holdtime-total  holdtime-avg
total_bytes_pinned_percpu:  1  1  0.62  0.62  0.62  0.62  13601  31542  0.14  9.61  11016.90  0.35

[Analysis]
Since the spin lock only protects a single in-memory variable, the contentions (number of lock acquisitions that had to wait) in both unpatched and patched versions are low. But when we look at acquisitions and acq-bounces, we get much lower counts in the patched version. Here the most important metric is acq-bounces. It means how many times the lock gets transferred between different cpus, so the patch can really reduce cacheline bouncing of the spin lock (also the global counter of percpu_counter) in an SMP system.

Fixes: b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly pinned bytes")
Signed-off-by: Ethan Lien
---
V2: Rewrite commit comments. Add lock_stat test. Pull dirty_metadata_bytes out to a separate patch.

 fs/btrfs/ctree.h       |  1 +
 fs/btrfs/extent-tree.c | 46 --
 2 files changed, 32 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 118346aceea9..df682a521635 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -422,6 +422,7 @@ struct btrfs_space_info {
 	 * time the transaction commits.
 	 */
 	struct percpu_counter total_bytes_pinned;
+	s32 total_bytes_pinned_batch;
 	struct list_head list;
 	/* Protected by the spinlock 'lock'.
	 */

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3d9fe58c0080..937113534ef4 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -758,7 +758,8 @@ static void add_pinned_bytes(struct btrfs_fs_info *fs_info, s64 num_bytes,
 
 	space_info = __find_space_info(fs_info, flags);
 	ASSERT(space_info);
-	percpu_counter_add(&space_info->total_bytes_pinned, num_bytes);
+	percpu_counter_add_batch(&space_info->total_bytes_pinned, num_bytes,
+				 space_info->total_bytes_pinned_batch);
 }
 
 /*
@@ -2598,8 +2599,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
 		flags = BTRFS_BLOCK_GROUP_METADATA;
 	space_info = __find_space_info(fs_info, flags);
 	ASSERT(space_info);
-	percpu_counter_add(&space_info->total_bytes_pinned,
-			   -head->num_bytes);
+	percpu_counter_add_batch(&space_info->total_bytes_pinned,
+				 -head->num_bytes,
+
Corrupted FS with "open_ctree failed" and "failed to recover balance: -5"
Hello everyone,

I have a corrupted filesystem which I can't seem to recover.

The machine is: Debian Linux, kernel 4.9 and btrfs-progs v4.13.3

I have a HDD RAID5 with LVM and the volume in question is an LVM volume.
On top of that I had a RAID1 SSD cache with lvm-cache. Yesterday both(!)
SSDs died within minutes. This led to the corrupted filesystem that I
have now. I hope I followed the procedure correctly.

What I tried so far:

* "mount -o usebackuproot,ro " and "nospace_cache", "clear_cache" and
  all permutations of these mount options. I'm getting:

[96926.830400] BTRFS info (device dm-2): trying to use backup root at mount time
[96926.830406] BTRFS info (device dm-2): disk space caching is enabled
[96926.927978] BTRFS error (device dm-2): parent transid verify failed on 321269628928 wanted 3276017 found 3275985
[96926.938619] BTRFS error (device dm-2): parent transid verify failed on 321269628928 wanted 3276017 found 3275985
[96926.940705] BTRFS error (device dm-2): failed to recover balance: -5
[96926.985801] BTRFS error (device dm-2): open_ctree failed

  The weird thing is that I can't really find information about the
  "failed to recover balance: -5" error. There was no rebalance running
  during the crash.

* btrfs-find-root: https://pastebin.com/qkjnSUF7 - It bothers me that I
  don't see any "good generations" as described here:
  https://btrfs.wiki.kernel.org/index.php/Restore

* "btrfs rescue" - it starts, then goes to "looping on XYZ" then stops

* "btrfs rescue super-recover -v" gives:
All Devices:
	Device: id = 1, name = /dev/vg00/...
Before Recovering:
	[All good supers]:
		device name = /dev/vg00/...
		superblock bytenr = 65536
		device name = /dev/vg00/...
		superblock bytenr = 67108864
		device name = /dev/vg00/...
		superblock bytenr = 274877906944
	[All bad supers]:

All supers are valid, no need to recover

* Unfortunately I did a "btrfs rescue zero-log" at some point :( - As it
  turns out that might have been a bad idea

* Also, a "btrfs check --init-extent-tree" - https://pastebin.com/jATDCFZy

The volume contained qcow2 images for VMs. I need only one of those,
since one piece of important software decided to not do backups :(

Any help is highly appreciated.

Many thanks, udo.
Re: [PATCH v3 2/2] btrfs: get fs_devices pointer form btrfs_scan_one_device
On 07/11/2018 09:22 AM, Gu Jinxiang wrote:

> Instead of pointer to btrfs_fs_devices as an arg in
> btrfs_scan_one_device, better to make it as a return value.

Yep, this was in the list to fix. However I didn't like the idea of
returning the btrfs_fs_devices pointer; instead return the btrfs_device
pointer, so that we can still retrieve its fs_devices.

Thanks, Anand

> Signed-off-by: Gu Jinxiang 
> ---
> Changelog:
> v3: as commented by the robot, use PTR_ERR_OR_ZERO, and rebase to misc-next.
> v2: as commented by Nikolay, use ERR_CAST instead of casting the type manually.
>
>  fs/btrfs/super.c   | 29 ++---
>  fs/btrfs/volumes.c | 14 +++---
>  fs/btrfs/volumes.h |  4 ++--
>  3 files changed, 27 insertions(+), 20 deletions(-)
>
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 78b5d51c7bc7..20e1ee338a95 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -916,11 +916,13 @@ static int btrfs_parse_early_options(const char *options, fmode_t flags,
>  			error = -ENOMEM;
>  			goto out;
>  		}
> -		error = btrfs_scan_one_device(device_name,
> -				flags, holder, &fs_devices);
> +		fs_devices = btrfs_scan_one_device(device_name,
> +				flags, holder);
>  		kfree(device_name);
> -		if (error)
> +		if (IS_ERR(fs_devices)) {
> +			error = PTR_ERR(fs_devices);
>  			goto out;
> +		}
>  		}
>  	}
>
> @@ -1537,9 +1539,11 @@ static struct dentry *btrfs_mount_root(struct file_system_type *fs_type,
>  		return ERR_PTR(error);
>  	}
>
> -	error = btrfs_scan_one_device(device_name, mode, fs_type, &fs_devices);
> -	if (error)
> +	fs_devices = btrfs_scan_one_device(device_name, mode, fs_type);
> +	if (IS_ERR(fs_devices)) {
> +		error = PTR_ERR(fs_devices);
>  		goto error_sec_opts;
> +	}
>
>  	/*
>  	 * Setup a dummy root and fs_info for test/set super. This is because
> @@ -2220,7 +2224,7 @@ static long btrfs_control_ioctl(struct file *file, unsigned int cmd,
>  		unsigned long arg)
>  {
>  	struct btrfs_ioctl_vol_args *vol;
> -	struct btrfs_fs_devices *fs_devices;
> +	struct btrfs_fs_devices *fs_devices = NULL;
>  	int ret = -ENOTTY;
>
>  	if (!capable(CAP_SYS_ADMIN))
> @@ -2232,14 +2236,17 @@ static long btrfs_control_ioctl(struct file *file, unsigned int cmd,
>  	switch (cmd) {
>  	case BTRFS_IOC_SCAN_DEV:
> -		ret = btrfs_scan_one_device(vol->name, FMODE_READ,
> -					    &btrfs_root_fs_type, &fs_devices);
> +		fs_devices = btrfs_scan_one_device(vol->name, FMODE_READ,
> +						   &btrfs_root_fs_type);
> +		ret = PTR_ERR_OR_ZERO(fs_devices);
>  		break;
>  	case BTRFS_IOC_DEVICES_READY:
> -		ret = btrfs_scan_one_device(vol->name, FMODE_READ,
> -					    &btrfs_root_fs_type, &fs_devices);
> -		if (ret)
> +		fs_devices = btrfs_scan_one_device(vol->name, FMODE_READ,
> +						   &btrfs_root_fs_type);
> +		if (IS_ERR(fs_devices)) {
> +			ret = PTR_ERR(fs_devices);
>  			break;
> +		}
>  		ret = !(fs_devices->num_devices == fs_devices->total_devices);
>  		break;
>  	case BTRFS_IOC_GET_SUPPORTED_FEATURES:
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index af2704de9ff9..6a6321e41f1b 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1212,14 +1212,14 @@ static int btrfs_read_disk_super(struct block_device *bdev, u64 bytenr,
>   * and we are not allowed to call set_blocksize during the scan. The superblock
>   * is read via pagecache
>   */
> -int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
> -			  struct btrfs_fs_devices **fs_devices_ret)
> +struct btrfs_fs_devices *btrfs_scan_one_device(const char *path, fmode_t flags,
> +					       void *holder)
>  {
>  	struct btrfs_super_block *disk_super;
>  	struct btrfs_device *device;
>  	struct block_device *bdev;
>  	struct page *page;
> -	int ret = 0;
> +	struct btrfs_fs_devices *ret = NULL;
>  	u64 bytenr;
>
>  	/*
> @@ -1233,19 +1233,19 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
>  	bdev = blkdev_get_by_path(path, flags, holder);
>  	if (IS_ERR(bdev))
> -		return PTR_ERR(bdev);
> +		return ERR_CAST(bdev);
>
>  	if (btrfs_read_disk_super(bdev, bytenr, &page, &disk_super)) {
> -		ret = -EINVAL;
> +		ret = ERR_PTR(-EINVAL);
>  		goto error_bdev_put;
>  	}
>
>  	mutex_lock(&uuid_mutex);
>  	device = device_list_add(path,
Re: [PATCH v3 1/2] btrfs: make fs_devices to be a local variable
On 07/11/2018 09:22 AM, Gu Jinxiang wrote:

> fs_devices is always passed to btrfs_scan_one_device which overrides
> it. And in the call stack below fs_devices is passed to
> btrfs_scan_one_device from btrfs_mount_root. And in btrfs_mount_root
> the output fs_devices of this call stack is not used.
>   btrfs_mount_root
>   -> btrfs_parse_early_options
>      -> btrfs_scan_one_device
> So, there is no need to pass fs_devices from btrfs_mount_root; using a
> local variable in btrfs_parse_early_options is enough.
>
> Signed-off-by: Gu Jinxiang 

Other than the two nits below,
Reviewed-by: Anand Jain 

> ---
> Changelog:
> v3: rebase to misc-next.
> v2: deal with Nikolay's comment, make the changelog clearer.
>
>  fs/btrfs/super.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index e04bcf0b0ed4..78b5d51c7bc7 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -884,11 +884,12 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
>   * only when we need to allocate a new super block.
>   */
>  static int btrfs_parse_early_options(const char *options, fmode_t flags,
> -		void *holder, struct btrfs_fs_devices **fs_devices)
> +		void *holder)

While here pls indent the 2nd line argument to be below the
const char *options.

> {
>  	substring_t args[MAX_OPT_ARGS];
>  	char *device_name, *opts, *orig, *p;
>  	int error = 0;
> +	struct btrfs_fs_devices *fs_devices = NULL;

It's a good idea to align the declarations to avoid space wastage:

	char *device_name, *opts, *orig, *p;
+	struct btrfs_fs_devices *fs_devices = NULL;
	int error = 0;

Thanks, Anand

>  	if (!options)
>  		return 0;
> @@ -916,7 +917,7 @@ static int btrfs_parse_early_options(const char *options, fmode_t flags,
>  			goto out;
>  		}
>  		error = btrfs_scan_one_device(device_name,
> -				flags, holder, fs_devices);
> +				flags, holder, &fs_devices);
>  		kfree(device_name);
>  		if (error)
>  			goto out;
> @@ -1524,8 +1525,7 @@ static struct dentry *btrfs_mount_root(struct file_system_type *fs_type,
>  	if (!(flags & SB_RDONLY))
>  		mode |= FMODE_WRITE;
>
> -	error = btrfs_parse_early_options(data, mode, fs_type,
> -					  &fs_devices);
> +	error = btrfs_parse_early_options(data, mode, fs_type);
>  	if (error) {
>  		return ERR_PTR(error);
>  	}

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PULL] volume and uuid_mutex cleanups
Hi David,

Here I have put together a set of volume related patches which were sent
to the ML as independent patches earlier. These have been reviewed and
tested. Please pull.

  g...@github.com:asj/btrfs-devel.git misc-next-for-kdave

[Anand:2]
6049bd5e9694 btrfs: add helper function check device delete able
8c96747831b0 btrfs: add helper btrfs_num_devices() to deduce num_devices
dd61850ee7cf btrfs: warn for num_devices below 0
17c285ada2e4 btrfs: use the assigned fs_devices instead of the dereference
e2f7c8a0f67b btrfs: do device clone using the btrfs_scan_one_device
89325c85d655 btrfs: fix race between free_stale_devices and close_fs_devices
0dfd68121520 btrfs: drop uuid_mutex in btrfs_free_extra_devids()

[David:2]
6fa6985bd169 btrfs: fix mount and ioctl device scan ioctl race
e9f25a7b239d btrfs: reorder initialization before the mount locks uuid_mutex
2c5058cdf788 btrfs: lift uuid_mutex to callers of btrfs_parse_early_options
8ffc96e797bb btrfs: lift uuid_mutex to callers of btrfs_open_devices
39a2036c1d13 btrfs: lift uuid_mutex to callers of btrfs_scan_one_device

[Anand:1]
e735e867d314 btrfs: fix btrfs_free_stale_devices() with needed locks
bdc6cc879388 btrfs: btrfs_free_stale_devices() rename local variables
0dd1ff5cc6be btrfs: fix device_list_add() missing device_list_mutex()
622f0a7c31fe btrfs: do btrfs_free_stale_devices() outside of device_list_add()

[David:1]
7302fc024079 btrfs: restore uuid_mutex in btrfs_open_devices

[Nikolay]
4da856347110 btrfs: drop pending list in device close

Thanks, Anand
About hung task on generic/041
Hi,

When I run generic/041 with v4.18-rc3 (kasan and hung task detection
turned on), the btrfs-transaction kthread will trigger the hung task
timeout (stalled at wait_event in btrfs_commit_transaction). At the same
time, you can see that xfs_io -c fsync occupies 100% of the CPU. I am
not sure whether this is a problem. Any suggestion?

[Wed Jul 11 15:50:08 2018] INFO: task btrfs-transacti:1053 blocked for more than 120 seconds.
[Wed Jul 11 15:50:08 2018]       Not tainted 4.18.0-rc3-custom #14
[Wed Jul 11 15:50:08 2018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Wed Jul 11 15:50:08 2018] btrfs-transacti D    0  1053      2 0x8000
[Wed Jul 11 15:50:08 2018] Call Trace:
[Wed Jul 11 15:50:08 2018]  ? __schedule+0x5b2/0x1380
[Wed Jul 11 15:50:08 2018]  ? check_flags.part.23+0x240/0x240
[Wed Jul 11 15:50:08 2018]  ? firmware_map_remove+0x187/0x187
[Wed Jul 11 15:50:08 2018]  ? ___preempt_schedule+0x16/0x18
[Wed Jul 11 15:50:08 2018]  ? mark_held_locks+0x6e/0x90
[Wed Jul 11 15:50:08 2018]  ? _raw_spin_unlock_irqrestore+0x59/0x70
[Wed Jul 11 15:50:08 2018]  ? preempt_count_sub+0x14/0xc0
[Wed Jul 11 15:50:08 2018]  ? _raw_spin_unlock_irqrestore+0x46/0x70
[Wed Jul 11 15:50:08 2018]  ? prepare_to_wait_event+0x191/0x410
[Wed Jul 11 15:50:08 2018]  ? prepare_to_wait_exclusive+0x210/0x210
[Wed Jul 11 15:50:08 2018]  ? print_usage_bug+0x3a0/0x3a0
[Wed Jul 11 15:50:08 2018]  ? do_raw_spin_unlock+0x10f/0x1e0
[Wed Jul 11 15:50:08 2018]  ? do_raw_spin_trylock+0x120/0x120
[Wed Jul 11 15:50:08 2018]  schedule+0xca/0x260
[Wed Jul 11 15:50:08 2018]  ? rcu_lockdep_current_cpu_online+0x12b/0x160
[Wed Jul 11 15:50:08 2018]  ? __schedule+0x1380/0x1380
[Wed Jul 11 15:50:08 2018]  ? ___might_sleep+0x126/0x370
[Wed Jul 11 15:50:08 2018]  ? init_wait_entry+0xc7/0x100
[Wed Jul 11 15:50:08 2018]  ? __wake_up_locked_key_bookmark+0x20/0x20
[Wed Jul 11 15:50:08 2018]  ? __btrfs_run_delayed_items+0x1e5/0x280 [btrfs]
[Wed Jul 11 15:50:08 2018]  ? __might_sleep+0x31/0xd0
[Wed Jul 11 15:50:08 2018]  btrfs_commit_transaction+0x122a/0x1640 [btrfs]
[Wed Jul 11 15:50:08 2018]  ? btrfs_apply_pending_changes+0x90/0x90 [btrfs]
[Wed Jul 11 15:50:08 2018]  ? wait_woken+0x150/0x150
[Wed Jul 11 15:50:08 2018]  ? ret_from_fork+0x27/0x50
[Wed Jul 11 15:50:08 2018]  ? ret_from_fork+0x27/0x50
[Wed Jul 11 15:50:08 2018]  ? deref_stack_reg+0xe0/0xe0
[Wed Jul 11 15:50:08 2018]  ? __module_text_address+0x63/0xa0
[Wed Jul 11 15:50:08 2018]  ? preempt_count_sub+0x14/0xc0
[Wed Jul 11 15:50:08 2018]  ? transaction_kthread+0x161/0x240 [btrfs]
[Wed Jul 11 15:50:08 2018]  ? is_module_text_address+0x2b/0x50
[Wed Jul 11 15:50:08 2018]  ? transaction_kthread+0x1d9/0x240 [btrfs]
[Wed Jul 11 15:50:08 2018]  ? kernel_text_address+0x5a/0x100
[Wed Jul 11 15:50:08 2018]  ? deactivate_slab.isra.27+0x64f/0x7a0
[Wed Jul 11 15:50:08 2018]  ? __save_stack_trace+0x82/0x100
[Wed Jul 11 15:50:08 2018]  ? kasan_kmalloc+0x142/0x170
[Wed Jul 11 15:50:08 2018]  ? kmem_cache_alloc+0xfc/0x2e0
[Wed Jul 11 15:50:08 2018]  ? start_transaction+0x596/0x930 [btrfs]
[Wed Jul 11 15:50:08 2018]  ? transaction_kthread+0x1d9/0x240 [btrfs]
[Wed Jul 11 15:50:08 2018]  ? kthread+0x1b9/0x1e0
[Wed Jul 11 15:50:08 2018]  ? ret_from_fork+0x27/0x50
[Wed Jul 11 15:50:08 2018]  ? deactivate_slab.isra.27+0x64f/0x7a0
[Wed Jul 11 15:50:08 2018]  ? mark_lock+0x149/0xa80
[Wed Jul 11 15:50:08 2018]  ? init_object+0x6b/0x80
[Wed Jul 11 15:50:08 2018]  ? print_usage_bug+0x3a0/0x3a0
[Wed Jul 11 15:50:08 2018]  ? ___slab_alloc+0x62a/0x690
[Wed Jul 11 15:50:08 2018]  ? ___slab_alloc+0x62a/0x690
[Wed Jul 11 15:50:08 2018]  ? __lock_is_held+0x8c/0xe0
[Wed Jul 11 15:50:08 2018]  ? start_transaction+0x596/0x930 [btrfs]
[Wed Jul 11 15:50:08 2018]  ? preempt_count_sub+0x14/0xc0
[Wed Jul 11 15:50:08 2018]  ? rcu_lockdep_current_cpu_online+0x12b/0x160
[Wed Jul 11 15:50:08 2018]  ? rcu_oom_callback+0x40/0x40
[Wed Jul 11 15:50:08 2018]  ? __lock_is_held+0x8c/0xe0
[Wed Jul 11 15:50:08 2018]  ? start_transaction+0x596/0x930 [btrfs]
[Wed Jul 11 15:50:08 2018]  ? rcu_read_lock_sched_held+0x8f/0xa0
[Wed Jul 11 15:50:08 2018]  ? btrfs_record_root_in_trans+0x1f/0xa0 [btrfs]
[Wed Jul 11 15:50:08 2018]  ? start_transaction+0x26b/0x930 [btrfs]
[Wed Jul 11 15:50:08 2018]  ? btrfs_commit_transaction+0x1640/0x1640 [btrfs]
[Wed Jul 11 15:50:08 2018]  ? check_flags.part.23+0x240/0x240
[Wed Jul 11 15:50:08 2018]  ? lock_downgrade+0x380/0x380
[Wed Jul 11 15:50:08 2018]  ? do_raw_spin_unlock+0x10f/0x1e0
[Wed Jul 11 15:50:08 2018]  ? do_raw_spin_unlock+0x10f/0x1e0
[Wed Jul 11 15:50:08 2018]  ? do_raw_spin_trylock+0x120/0x120
[Wed Jul 11 15:50:08 2018]  transaction_kthread+0x219/0x240 [btrfs]
[Wed Jul 11 15:50:08 2018]  ? btrfs_cleanup_transaction+0x6f0/0x6f0 [btrfs]
[Wed Jul 11 15:50:08 2018]  kthread+0x1b9/0x1e0
[Wed Jul 11 15:50:08 2018]  ? kthread_flush_work_fn+0x10/0x10
[Wed Jul 11 15:50:08 2018]  ret_from_fork+0x27/0x50
[Wed Jul 11 15:50:08 2018]
Showing all locks held in the system:
[DOC] BTRFS Volume operations, Device Lists and Locks all in one page
BTRFS Volume operations, Device Lists and Locks all in one page:

Devices are managed in two contexts: the scan context and the mounted
context. In the scan context the threads originate from the
btrfs_control ioctl, and in the mounted context the threads originate
from the mount point ioctl.

Apart from these two contexts, there can also be two transient states
where devices are transitioning from the scan to the mount context or
from the mount to the scan context.

Device List and Locks:-

Count: btrfs_fs_devices::num_devices
List : btrfs_fs_devices::devices -> btrfs_devices::dev_list
Lock : btrfs_fs_devices::device_list_mutex

Count: btrfs_fs_devices::rw_devices
List : btrfs_fs_devices::alloc_list -> btrfs_devices::dev_alloc_list
Lock : btrfs_fs_info::chunk_mutex

Lock: set_bit btrfs_fs_info::flags::BTRFS_FS_EXCL_OP

FSID List and Lock:-

Count : None
HEAD  : Global::fs_uuids -> btrfs_fs_devices::fs_list
Lock  : Global::uuid_mutex

After the fs_devices is mounted, btrfs_fs_devices::opened > 0.

In the scan context we have the following device operations.

Device SCAN:- creates the btrfs_fs_devices and its corresponding
btrfs_device entries; also checks for and frees duplicate device
entries.
	Lock: uuid_mutex
	SCAN
	if (found_duplicate && btrfs_fs_devices::opened == 0)
		Free_duplicate
	Unlock: uuid_mutex

Device READY:- checks if the volume is ready. Also does an implicit
scan and duplicate device free as in Device SCAN.
	Lock: uuid_mutex
	SCAN
	if (found_duplicate && btrfs_fs_devices::opened == 0)
		Free_duplicate
	Check READY
	Unlock: uuid_mutex

Device FORGET:- (planned) frees a given device, or all unmounted
devices and empty fs_devices if any.
	Lock: uuid_mutex
	if (found_duplicate && btrfs_fs_devices::opened == 0)
		Free duplicate
	Unlock: uuid_mutex

Device mount operation -> a transient state leading to the mounted
context
	Lock: uuid_mutex
	Find, SCAN, btrfs_fs_devices::opened++
	Unlock: uuid_mutex

Device umount operation -> a transient state leading to the unmounted
context or scan context
	Lock: uuid_mutex
	btrfs_fs_devices::opened--
	Unlock: uuid_mutex

In the mounted context we have the following device operations.

Device Rename through SCAN:- a special case where the device path gets
renamed after it has been mounted. (Ubuntu changes the boot path during
boot up, so we need this feature.) Currently this is part of Device
SCAN as above. And we need the locks as below, because the dynamically
disappearing device might clean up btrfs_device::name.
	Lock: btrfs_fs_devices::device_list_mutex
	Rename
	Unlock: btrfs_fs_devices::device_list_mutex

Commit Transaction:- writes all supers.
	Lock: btrfs_fs_devices::device_list_mutex
	Write all supers of btrfs_devices::dev_list
	Unlock: btrfs_fs_devices::device_list_mutex

Device add:- adds a new device to the existing mounted volume.
	set_bit: btrfs_fs_info::flags::BTRFS_FS_EXCL_OP
	Lock: btrfs_fs_devices::device_list_mutex
	Lock: btrfs_fs_info::chunk_mutex
	List_add btrfs_devices::dev_list
	List_add btrfs_devices::dev_alloc_list
	Unlock: btrfs_fs_info::chunk_mutex
	Unlock: btrfs_fs_devices::device_list_mutex

Device remove:- removes a device from the mounted volume.
	set_bit: btrfs_fs_info::flags::BTRFS_FS_EXCL_OP
	Lock: btrfs_fs_devices::device_list_mutex
	Lock: btrfs_fs_info::chunk_mutex
	List_del btrfs_devices::dev_list
	List_del btrfs_devices::dev_alloc_list
	Unlock: btrfs_fs_info::chunk_mutex
	Unlock: btrfs_fs_devices::device_list_mutex

Device Replace:- replaces a device.
	set_bit: btrfs_fs_info::flags::BTRFS_FS_EXCL_OP
	Lock: btrfs_fs_devices::device_list_mutex
	Lock: btrfs_fs_info::chunk_mutex
	List_update btrfs_devices::dev_list
	List_update btrfs_devices::dev_alloc_list
	Unlock: btrfs_fs_info::chunk_mutex
	Unlock: btrfs_fs_devices::device_list_mutex

Sprouting:- adds a RW device to the mounted RO seed device, so as to
make the mount point writable. The following steps are used to hold the
seed and sprout fs_devices. (The first two steps are not necessary for
the sprouting; they are there to ensure the seed device remains
scanned, and this might change.)
	. Clone the (mounted) fs_devices; let's call it old_devices
	. Now add old_devices to fs_uuids (yes, there is a duplicate
	  fsid in the list, but we change the other fsid before we
	  release the uuid_mutex, so it's fine)
	. Alloc a new fs_devices; let's call it seed_devices
	. Copy fs_devices into the seed_devices
	. Move the fs_devices devices list into seed_devices
	. Bring seed_devices under fs_devices
	  (fs_devices->seed = seed_devices)
	. Assign a new FSID to the fs_devices and add the new writable
	  device to the fs_devices

In the unmounted context the fs_devices::seed is always NULL. We alloc
the fs_devices::seed only at the time of mount and/or at sprouting. And
free it at the time of umount, or when the seed device is replaced or
deleted.

Locks:

Sprouting:
	Lock: uuid_mutex <-- because fsid rename and Device SCAN
	Reuses Device Add code

Locks: Splitting: