Re: [DOC] BTRFS Volume operations, Device Lists and Locks all in one page
On Jul 11, 2018 at 15:50, Anand Jain wrote:
>
>
> BTRFS Volume operations, Device Lists and Locks all in one page:
>
> Devices are managed in two contexts, the scan context and the mounted
> context. In the scan context the threads originate from the btrfs_control
> ioctl, and in the mounted context the threads originate from the mount
> point ioctl.
> Apart from these two contexts, there can also be two transient states
> where the device state is transitioning from the scan to the mount
> context or from the mount to the scan context.
>
> Device List and Locks:-
>
> Count: btrfs_fs_devices::num_devices
> List : btrfs_fs_devices::devices -> btrfs_devices::dev_list
> Lock : btrfs_fs_devices::device_list_mutex
>
> Count: btrfs_fs_devices::rw_devices

So btrfs_fs_devices::num_devices = btrfs_fs_devices::rw_devices + RO
devices. How are seed and RO devices different in this case?

> List : btrfs_fs_devices::alloc_list -> btrfs_devices::dev_alloc_list
> Lock : btrfs_fs_info::chunk_mutex

At least the chunk_mutex is also shared with the chunk allocator, or we
should have some mutex in btrfs_fs_devices rather than in fs_info.
Right?

> Lock: set_bit btrfs_fs_info::flags::BTRFS_FS_EXCL_OP
>
> FSID List and Lock:-
>
> Count : None
> HEAD : Global::fs_uuids -> btrfs_fs_devices::fs_list
> Lock : Global::uuid_mutex
>
> After the fs_devices is mounted, the btrfs_fs_devices::opened > 0.

fs_devices::opened should be btrfs_fs_devices::num_devices if no device
is missing, and -1 or -2 for the degraded case, right?

> In the scan context we have the following device operations..
>
> Device SCAN:- creates the btrfs_fs_devices and its corresponding
> btrfs_device entries; also checks for and frees duplicate device entries.
> Lock: uuid_mutex
> SCAN
> if (found_duplicate && btrfs_fs_devices::opened == 0)
>     Free_duplicate
> Unlock: uuid_mutex
>
> Device READY:- checks if the volume is ready. Also does an implicit scan
> and duplicate device free as in Device SCAN.
> Lock: uuid_mutex
> SCAN
> if (found_duplicate && btrfs_fs_devices::opened == 0)
>     Free_duplicate
> Check READY
> Unlock: uuid_mutex
>
> Device FORGET:- (planned) free a given or all unmounted devices and
> empty fs_devices, if any.
> Lock: uuid_mutex
> if (found_duplicate && btrfs_fs_devices::opened == 0)
>     Free duplicate
> Unlock: uuid_mutex
>
> Device mount operation -> a transient state leading to the mounted context
> Lock: uuid_mutex
> Find, SCAN, btrfs_fs_devices::opened++
> Unlock: uuid_mutex
>
> Device umount operation -> a transient state leading to the unmounted
> context or scan context
> Lock: uuid_mutex
> btrfs_fs_devices::opened--
> Unlock: uuid_mutex
>
> In the mounted context we have the following device operations..
>
> Device Rename through SCAN:- This is a special case where the device
> path gets renamed after it has been mounted. (Ubuntu changes the boot
> path during boot up, so we need this feature.) Currently this is part of
> Device SCAN as above. And we need the locks as below, because a
> dynamically disappearing device might clean up btrfs_device::name.
> Lock: btrfs_fs_devices::device_list_mutex
> Rename
> Unlock: btrfs_fs_devices::device_list_mutex
>
> Commit Transaction:- Write all supers.
> Lock: btrfs_fs_devices::device_list_mutex
> Write all supers of btrfs_devices::dev_list
> Unlock: btrfs_fs_devices::device_list_mutex
>
> Device add:- Add a new device to the existing mounted volume.
> set_bit: btrfs_fs_info::flags::BTRFS_FS_EXCL_OP
> Lock: btrfs_fs_devices::device_list_mutex
> Lock: btrfs_fs_info::chunk_mutex
> List_add btrfs_devices::dev_list
> List_add btrfs_devices::dev_alloc_list
> Unlock: btrfs_fs_info::chunk_mutex
> Unlock: btrfs_fs_devices::device_list_mutex
>
> Device remove:- Remove a device from the mounted volume.
> set_bit: btrfs_fs_info::flags::BTRFS_FS_EXCL_OP
> Lock: btrfs_fs_devices::device_list_mutex
> Lock: btrfs_fs_info::chunk_mutex
> List_del btrfs_devices::dev_list
> List_del btrfs_devices::dev_alloc_list
> Unlock: btrfs_fs_info::chunk_mutex
> Unlock: btrfs_fs_devices::device_list_mutex
>
> Device Replace:- Replace a device.
> set_bit: btrfs_fs_info::flags::BTRFS_FS_EXCL_OP
> Lock: btrfs_fs_devices::device_list_mutex
> Lock: btrfs_fs_info::chunk_mutex
> List_update btrfs_devices::dev_list

Here we still just add the new device, without deleting the existing
one, until the replace is finished.

> List_update btrfs_devices::dev_alloc_list
> Unlock: btrfs_fs_info::chunk_mutex
> Unlock: btrfs_fs_devices::device_list_mutex
>
> Sprouting:- Add a RW device to the mounted RO seed device, to make
> the mount point writable.
> The following steps are used to hold the seed and sprout fs_devices.
> (The first two steps are not necessary for the sprouting; they are there
> to ensure the seed device remains scanned, and that might change.)
> . Clone the (mounted) fs_devices, let's call it old_devices
> . Now add old_devices to fs_uuids (yeah, there
Why original mode doesn't use swap? (Original: Re: btrfs check lowmem, take 2)
On Jul 12, 2018 at 01:09, Chris Murphy wrote:
> On Tue, Jul 10, 2018 at 12:09 PM, Marc MERLIN wrote:
>> Thanks to Su and Qu, I was able to get my filesystem to a point that
>> it's mountable.
>> I then deleted loads of snapshots and I'm down to 26.
>>
>> It now looks like this:
>> gargamel:~# btrfs fi show /mnt/mnt
>> Label: 'dshelf2'  uuid: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d
>>        Total devices 1 FS bytes used 12.30TiB
>>        devid 1 size 14.55TiB used 13.81TiB path /dev/mapper/dshelf2
>>
>> gargamel:~# btrfs fi df /mnt/mnt
>> Data, single: total=13.57TiB, used=12.19TiB
>> System, DUP: total=32.00MiB, used=1.55MiB
>> Metadata, DUP: total=124.50GiB, used=115.62GiB
>> Metadata, single: total=216.00MiB, used=0.00B
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>> Problems:
>> 1) btrfs check --repair _still_ takes all 32GB of RAM and crashes the
>> server, despite my deleting lots of snapshots.
>> Is it because I have too many files then?
>
> I think original mode needs most of the metadata in memory.
>
> I'm not understanding why btrfs check won't use swap like at least
> xfs_repair does, and I'm pretty sure e2fsck will as well.

I don't understand either. Isn't memory from malloc() swappable?

Thanks,
Qu

> Using 128G swap on nvme with original check is still gonna be faster
> than lowmem mode.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v14.8 09/14] btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface
From: Wang Xiaoguang

Unlike the choice between the in-memory and on-disk dedupe backends, only
the SHA256 hash method is supported so far, so implement the
btrfs_dedupe_calc_hash() interface using SHA256.

Signed-off-by: Qu Wenruo
Signed-off-by: Wang Xiaoguang
Reviewed-by: Josef Bacik
Signed-off-by: Lu Fengqi
---
 fs/btrfs/dedupe.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index e3084deb1eb7..14c8d245480e 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -651,3 +651,53 @@ int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
 	}
 	return ret;
 }
+
+int btrfs_dedupe_calc_hash(struct btrfs_fs_info *fs_info,
+			   struct inode *inode, u64 start,
+			   struct btrfs_dedupe_hash *hash)
+{
+	int i;
+	int ret;
+	struct page *p;
+	struct shash_desc *shash;
+	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+	struct crypto_shash *tfm = dedupe_info->dedupe_driver;
+	u64 dedupe_bs;
+	u64 sectorsize = fs_info->sectorsize;
+
+	shash = kmalloc(sizeof(*shash) + crypto_shash_descsize(tfm), GFP_NOFS);
+	if (!shash)
+		return -ENOMEM;
+
+	if (!fs_info->dedupe_enabled || !hash)
+		return 0;
+
+	if (WARN_ON(dedupe_info == NULL))
+		return -EINVAL;
+
+	WARN_ON(!IS_ALIGNED(start, sectorsize));
+
+	dedupe_bs = dedupe_info->blocksize;
+
+	shash->tfm = tfm;
+	shash->flags = 0;
+	ret = crypto_shash_init(shash);
+	if (ret)
+		return ret;
+	for (i = 0; sectorsize * i < dedupe_bs; i++) {
+		char *d;
+
+		p = find_get_page(inode->i_mapping,
+				  (start >> PAGE_SHIFT) + i);
+		if (WARN_ON(!p))
+			return -ENOENT;
+		d = kmap(p);
+		ret = crypto_shash_update(shash, d, sectorsize);
+		kunmap(p);
+		put_page(p);
+		if (ret)
+			return ret;
+	}
+	ret = crypto_shash_final(shash, hash->hash);
+	return ret;
+}
--
2.18.0
[PATCH v14.8 03/14] btrfs: dedupe: Introduce dedupe framework and its header
From: Wang Xiaoguang

Introduce the header for the btrfs in-band (write time) de-duplication
framework and the needed declarations.

The new de-duplication framework is going to support 2 different dedupe
methods and 1 dedupe hash.

Signed-off-by: Qu Wenruo
Signed-off-by: Wang Xiaoguang
Signed-off-by: Lu Fengqi
---
 fs/btrfs/ctree.h           |   7 ++
 fs/btrfs/dedupe.h          | 136 ++++++++++++++++++++++++++++++++++-
 fs/btrfs/disk-io.c         |   1 +
 include/uapi/linux/btrfs.h |  34 ++++++++
 4 files changed, 176 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8743fdcfe139..ad31ccac86a3 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1136,6 +1136,13 @@ struct btrfs_fs_info {
 	spinlock_t ref_verify_lock;
 	struct rb_root block_tree;
 #endif
+
+	/*
+	 * Inband de-duplication related structures
+	 */
+	unsigned long dedupe_enabled:1;
+	struct btrfs_dedupe_info *dedupe_info;
+	struct mutex dedupe_ioctl_lock;
 };

 static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index 90281a7a35a8..681cf4717396 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -6,7 +6,139 @@
 #ifndef BTRFS_DEDUPE_H
 #define BTRFS_DEDUPE_H

-/* later in-band dedupe will expand this struct */
-struct btrfs_dedupe_hash;
+#include
+#include
+#include
+static const int btrfs_hash_sizes[] = { 32 };
+
+/*
+ * For callers outside of dedupe.c
+ *
+ * Different dedupe backends should have their own hash structure
+ */
+struct btrfs_dedupe_hash {
+	u64 bytenr;
+	u32 num_bytes;
+
+	/* last field is a variable length array of dedupe hash */
+	u8 hash[];
+};
+
+struct btrfs_dedupe_info {
+	/* dedupe blocksize */
+	u64 blocksize;
+	u16 backend;
+	u16 hash_algo;
+
+	struct crypto_shash *dedupe_driver;
+
+	/*
+	 * Use a mutex to protect both backends.
+	 * Even for the in-memory backend, the rb-tree can be quite large,
+	 * so a mutex is better for such a use case.
+	 */
+	struct mutex lock;
+
+	/* following members are only used by the in-memory backend */
+	struct rb_root hash_root;
+	struct rb_root bytenr_root;
+	struct list_head lru_list;
+	u64 limit_nr;
+	u64 current_nr;
+};
+
+struct btrfs_trans_handle;
+
+static inline int btrfs_dedupe_hash_hit(struct btrfs_dedupe_hash *hash)
+{
+	return (hash && hash->bytenr);
+}
+
+int btrfs_dedupe_hash_size(u16 algo);
+struct btrfs_dedupe_hash *btrfs_dedupe_alloc_hash(u16 algo);
+
+/*
+ * Initialize inband dedupe info.
+ * Called at dedupe enable time.
+ *
+ * Return 0 for success
+ * Return <0 for any error
+ * (from unsupported param to tree creation error for some backends)
+ */
+int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info,
+			struct btrfs_ioctl_dedupe_args *dargs);
+
+/*
+ * Disable dedupe and invalidate all its dedupe data.
+ * Called at dedupe disable time.
+ *
+ * Return 0 for success
+ * Return <0 for any error
+ * (tree operation error for some backends)
+ */
+int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info);
+
+/*
+ * Get current dedupe status.
+ * Return 0 for success
+ * No possible error yet
+ */
+void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
+			 struct btrfs_ioctl_dedupe_args *dargs);
+
+/*
+ * Calculate hash for dedupe.
+ * Caller must ensure [start, start + dedupe_bs) has valid data.
+ *
+ * Return 0 for success
+ * Return <0 for any error
+ * (error from hash codes)
+ */
+int btrfs_dedupe_calc_hash(struct btrfs_fs_info *fs_info,
+			   struct inode *inode, u64 start,
+			   struct btrfs_dedupe_hash *hash);
+
+/*
+ * Search for duplicated extents by calculated hash
+ * Caller must call btrfs_dedupe_calc_hash() first to get the hash.
+ *
+ * @inode: the inode we are writing to
+ * @file_pos: offset inside the inode
+ * As we will increase the extent ref immediately after a hash match,
+ * we need @file_pos and @inode in this case.
+ *
+ * Return > 0 for a hash match, and the extent ref will be
+ * *INCREASED*, and hash->bytenr/num_bytes will record the existing
+ * extent data.
+ * Return 0 for a hash miss. Nothing is done
+ * Return <0 for any error
+ * (tree operation error for some backends)
+ */
+int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
+			struct inode *inode, u64 file_pos,
+			struct btrfs_dedupe_hash *hash);
+
+/*
+ * Add a dedupe hash into dedupe info
+ * Return 0 for success
+ * Return <0 for any error
+ * (tree operation error for some backends)
+ */
+int btrfs_dedupe_add(struct btrfs_trans_handle *trans,
+		     struct btrfs_fs_info *fs_info,
+		     struct btrfs_dedupe_hash *hash);
+
+/*
+ * Remove a dedupe hash from dedupe info
+ * Return 0 for success
+ * Return <0 for any error
+ * (tree operation error for some
[PATCH v14.8 11/14] btrfs: dedupe: Inband in-memory only de-duplication implement
From: Qu Wenruo

Core implementation of inband de-duplication.

It reuses the async_cow_start() facility to calculate the dedupe hash,
and uses the dedupe hash to do inband de-duplication at the extent level.

The workflow is as below:
1) Run the delalloc range for an inode
2) Calculate the hash for the delalloc range at the unit of dedupe_bs
3) For the hash match (duplicated) case, just increase the source extent
   ref and insert the file extent.
   For the hash mismatch case, go through the normal cow_file_range()
   fallback, and add the hash into the dedupe tree.
   Compression for the hash-miss case is not supported yet.

The current implementation stores all dedupe hashes in an in-memory
rb-tree, with LRU behavior to enforce the limit.

Signed-off-by: Wang Xiaoguang
Signed-off-by: Qu Wenruo
Signed-off-by: Lu Fengqi
---
 fs/btrfs/ctree.h       |   4 +-
 fs/btrfs/dedupe.h      |  18 +++
 fs/btrfs/extent-tree.c |  31 ++++-
 fs/btrfs/extent_io.c   |   5 +-
 fs/btrfs/extent_io.h   |   1 +
 fs/btrfs/file.c        |   3 +
 fs/btrfs/inode.c       | 305 ++++++++++++++++++++++++++++++++++------
 fs/btrfs/ioctl.c       |   1 +
 fs/btrfs/relocation.c  |  17 +++
 9 files changed, 329 insertions(+), 56 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ad31ccac86a3..8fff17adc8d2 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -107,9 +107,11 @@ static inline u32 count_max_extents(u64 size, u64 max_extent_size)
 enum btrfs_metadata_reserve_type {
 	BTRFS_RESERVE_NORMAL,
 	BTRFS_RESERVE_COMPRESS,
+	BTRFS_RESERVE_DEDUPE,
 };

-u64 btrfs_max_extent_size(enum btrfs_metadata_reserve_type reserve_type);
+u64 btrfs_max_extent_size(struct btrfs_inode *inode,
+			  enum btrfs_metadata_reserve_type reserve_type);
 int inode_need_compress(struct inode *inode, u64 start, u64 end);

 struct btrfs_mapping_tree {
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index f19f6a8ff2ba..ebcbb89d79a0 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -9,6 +9,7 @@
 #include
 #include
 #include
+#include "btrfs_inode.h"

 static const int btrfs_hash_sizes[] = { 32 };

@@ -50,6 +51,23 @@ struct btrfs_dedupe_info {

 struct btrfs_trans_handle;

+static inline u64 btrfs_dedupe_blocksize(struct btrfs_inode *inode)
+{
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+
+	return fs_info->dedupe_info->blocksize;
+}
+
+static inline int inode_need_dedupe(struct inode *inode)
+{
+	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
+
+	if (!fs_info->dedupe_enabled)
+		return 0;
+
+	return 1;
+}
+
 static inline int btrfs_dedupe_hash_hit(struct btrfs_dedupe_hash *hash)
 {
 	return (hash && hash->bytenr);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 225ebcb1fd09..7a3a9d3fb0b9 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -28,6 +28,7 @@
 #include "sysfs.h"
 #include "qgroup.h"
 #include "ref-verify.h"
+#include "dedupe.h"

 #undef SCRAMBLE_DELAYED_REFS

@@ -2612,6 +2613,17 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
 		btrfs_pin_extent(fs_info, head->bytenr,
 				 head->num_bytes, 1);
 	if (head->is_data) {
+		/*
+		 * If insert_reserved is given, it means
+		 * a new extent is reserved, then deleted
+		 * in one transaction, and the inc/dec get merged to 0.
+		 *
+		 * In this case, we need to remove its dedupe
+		 * hash.
+		 */
+		ret = btrfs_dedupe_del(trans, fs_info, head->bytenr);
+		if (ret < 0)
+			return ret;
 		ret = btrfs_del_csums(trans, fs_info, head->bytenr,
 				      head->num_bytes);
 	}
@@ -6017,15 +6029,17 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
 	spin_unlock(&block_rsv->lock);
 }

-u64 btrfs_max_extent_size(enum btrfs_metadata_reserve_type reserve_type)
+u64 btrfs_max_extent_size(struct btrfs_inode *inode,
+			  enum btrfs_metadata_reserve_type reserve_type)
 {
 	if (reserve_type == BTRFS_RESERVE_NORMAL)
 		return BTRFS_MAX_EXTENT_SIZE;
 	else if (reserve_type == BTRFS_RESERVE_COMPRESS)
 		return SZ_128K;
-
-	ASSERT(0);
-	return BTRFS_MAX_EXTENT_SIZE;
+	else if (reserve_type == BTRFS_RESERVE_DEDUPE)
+		return btrfs_dedupe_blocksize(inode);
+	else
+		return BTRFS_MAX_EXTENT_SIZE;
 }

 int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
@@ -6036,7 +6050,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
 	enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL;
 	int ret = 0;
 	bool delalloc_lock = true;
[PATCH v14.8 14/14] btrfs: dedupe: Introduce new reconfigure ioctl
From: Qu Wenruo

Introduce a new reconfigure ioctl and a new FORCE flag for the in-band
dedupe ioctls.

Now the dedupe enable and reconfigure ioctls are stateful.

| Current state | Ioctl   | Next state  |
| Disabled      | enable  | Enabled     |
| Enabled       | enable  | Not allowed |
| Enabled       | reconf  | Enabled     |
| Enabled       | disable | Disabled    |
| Disabled      | disable | Disabled    |
| Disabled      | reconf  | Not allowed |
(While disable is always stateless.)

For those who prefer stateless ioctls (myself, for example), a new FORCE
flag is introduced. In FORCE mode, enable/disable is completely stateless.

| Current state | Ioctl   | Next state |
| Disabled      | enable  | Enabled    |
| Enabled       | enable  | Enabled    |
| Enabled       | disable | Disabled   |
| Disabled      | disable | Disabled   |

Also, the reconfigure ioctl will only modify the specified fields, while
for enable, unspecified fields will be filled with default values.
For example:
 # btrfs dedupe enable --block-size 64k /mnt
 # btrfs dedupe reconfigure --limit-hash 1m /mnt
Will lead to:
 dedupe blocksize: 64K
 dedupe hash limit nr: 1m

While for enable:
 # btrfs dedupe enable --force --block-size 64k /mnt
 # btrfs dedupe enable --force --limit-hash 1m /mnt
Will reset the blocksize to the default value:
 dedupe blocksize: 128K << reset
 dedupe hash limit nr: 1m

Suggested-by: David Sterba
Signed-off-by: Qu Wenruo
Signed-off-by: Lu Fengqi
---
 fs/btrfs/dedupe.c          | 131 ++++++++++++++++++++++++++++-------
 fs/btrfs/dedupe.h          |  13 ++++
 fs/btrfs/ioctl.c           |  13 ++++
 include/uapi/linux/btrfs.h |  11 +++-
 4 files changed, 143 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index f068321fdd1c..71b090c2938f 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -29,6 +29,40 @@ static inline struct inmem_hash *inmem_alloc_hash(u16 algo)
 			GFP_NOFS);
 }

+/*
+ * Copy from the current dedupe info to fill dargs.
+ * For the reconf case, only fill members which are uninitialized.
+ */
+static void get_dedupe_status(struct btrfs_dedupe_info *dedupe_info,
+			      struct btrfs_ioctl_dedupe_args *dargs)
+{
+	int reconf = (dargs->cmd == BTRFS_DEDUPE_CTL_RECONF);
+
+	dargs->status = 1;
+
+	if (!reconf || (reconf && dargs->blocksize == (u64)-1))
+		dargs->blocksize = dedupe_info->blocksize;
+	if (!reconf || (reconf && dargs->backend == (u16)-1))
+		dargs->backend = dedupe_info->backend;
+	if (!reconf || (reconf && dargs->hash_algo == (u16)-1))
+		dargs->hash_algo = dedupe_info->hash_algo;
+
+	/*
+	 * For the re-configure case, if the limit is not being modified,
+	 * it will be set to 0, unlike the other fields
+	 */
+	if (!reconf || !(dargs->limit_nr || dargs->limit_mem)) {
+		dargs->limit_nr = dedupe_info->limit_nr;
+		dargs->limit_mem = dedupe_info->limit_nr *
+			(sizeof(struct inmem_hash) +
+			 btrfs_hash_sizes[dedupe_info->hash_algo]);
+	}
+
+	/* current_nr doesn't make sense for the reconfigure case */
+	if (!reconf)
+		dargs->current_nr = dedupe_info->current_nr;
+}
+
 void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
 			 struct btrfs_ioctl_dedupe_args *dargs)
 {
@@ -45,15 +79,7 @@ void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
 		return;
 	}
 	mutex_lock(&dedupe_info->lock);
-	dargs->status = 1;
-	dargs->blocksize = dedupe_info->blocksize;
-	dargs->backend = dedupe_info->backend;
-	dargs->hash_algo = dedupe_info->hash_algo;
-	dargs->limit_nr = dedupe_info->limit_nr;
-	dargs->limit_mem = dedupe_info->limit_nr *
-		(sizeof(struct inmem_hash) +
-		 btrfs_hash_sizes[dedupe_info->hash_algo]);
-	dargs->current_nr = dedupe_info->current_nr;
+	get_dedupe_status(dedupe_info, dargs);
 	mutex_unlock(&dedupe_info->lock);
 	memset(dargs->__unused, -1, sizeof(dargs->__unused));
 }
@@ -102,17 +128,50 @@ static int init_dedupe_info(struct btrfs_dedupe_info **ret_info,
 static int check_dedupe_parameter(struct btrfs_fs_info *fs_info,
 				  struct btrfs_ioctl_dedupe_args *dargs)
 {
-	u64 blocksize = dargs->blocksize;
-	u64 limit_nr = dargs->limit_nr;
-	u64 limit_mem = dargs->limit_mem;
-	u16 hash_algo = dargs->hash_algo;
-	u8 backend = dargs->backend;
+	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+
+	u64 blocksize;
+	u64 limit_nr;
+	u64
[PATCH v14.8 07/14] btrfs: delayed-ref: Add support for increasing data ref under spinlock
From: Qu Wenruo

For in-band dedupe, btrfs needs to increase a data ref with the
delayed_refs lock held, so add a new function,
btrfs_add_delayed_data_ref_locked(), to increase an extent ref with
delayed_refs already locked.

Export init_delayed_ref_head and init_delayed_ref_common for inband
dedupe.

Signed-off-by: Qu Wenruo
Reviewed-by: Josef Bacik
Signed-off-by: Lu Fengqi
---
 fs/btrfs/delayed-ref.c | 49 ++++++++++++++++++++++++++++--------------
 fs/btrfs/delayed-ref.h | 16 ++++++++++++++
 2 files changed, 51 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 03dec673d12a..10de8011ada7 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -526,7 +526,7 @@ update_existing_head_ref(struct btrfs_delayed_ref_root *delayed_refs,
 	spin_unlock(&existing->lock);
 }

-static void init_delayed_ref_head(struct btrfs_delayed_ref_head *head_ref,
+void btrfs_init_delayed_ref_head(struct btrfs_delayed_ref_head *head_ref,
 			struct btrfs_qgroup_extent_record *qrecord,
 			u64 bytenr, u64 num_bytes, u64 ref_root,
 			u64 reserved, int action, bool is_data,
@@ -654,7 +654,7 @@ add_delayed_ref_head(struct btrfs_trans_handle *trans,
 }

 /*
- * init_delayed_ref_common - Initialize the structure which represents a
+ * btrfs_init_delayed_ref_common - Initialize the structure which represents a
  *			     modification to an extent.
  *
  * @fs_info:    Internal to the mounted filesystem mount structure.
@@ -678,7 +678,7 @@ add_delayed_ref_head(struct btrfs_trans_handle *trans,
  *	     when recording a metadata extent or BTRFS_SHARED_DATA_REF_KEY/
  *	     BTRFS_EXTENT_DATA_REF_KEY when recording data extent
  */
-static void init_delayed_ref_common(struct btrfs_fs_info *fs_info,
+void btrfs_init_delayed_ref_common(struct btrfs_fs_info *fs_info,
 				struct btrfs_delayed_ref_node *ref,
 				u64 bytenr, u64 num_bytes, u64 ref_root,
 				int action, u8 ref_type)
@@ -734,7 +734,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
 		ref_type = BTRFS_SHARED_BLOCK_REF_KEY;
 	else
 		ref_type = BTRFS_TREE_BLOCK_REF_KEY;
-	init_delayed_ref_common(fs_info, &ref->node, bytenr, num_bytes,
+	btrfs_init_delayed_ref_common(fs_info, &ref->node, bytenr, num_bytes,
 				ref_root, action, ref_type);
 	ref->root = ref_root;
 	ref->parent = parent;
@@ -751,7 +751,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
 		goto free_head_ref;
 	}

-	init_delayed_ref_head(head_ref, record, bytenr, num_bytes,
+	btrfs_init_delayed_ref_head(head_ref, record, bytenr, num_bytes,
 			      ref_root, 0, action, false, is_system);
 	head_ref->extent_op = extent_op;

@@ -788,6 +788,29 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
 	return -ENOMEM;
 }

+/*
+ * Do the real delayed data ref insert.
+ * Caller must hold delayed_refs->lock and have allocated memory
+ * for dref, head_ref and record.
+ */
+int btrfs_add_delayed_data_ref_locked(struct btrfs_trans_handle *trans,
+		struct btrfs_delayed_ref_head *head_ref,
+		struct btrfs_qgroup_extent_record *qrecord,
+		struct btrfs_delayed_data_ref *ref, int action,
+		int *qrecord_inserted_ret, int *old_ref_mod,
+		int *new_ref_mod)
+{
+	struct btrfs_delayed_ref_root *delayed_refs;
+
+	head_ref = add_delayed_ref_head(trans, head_ref, qrecord,
+					action, qrecord_inserted_ret,
+					old_ref_mod, new_ref_mod);
+
+	delayed_refs = &trans->transaction->delayed_refs;
+
+	return insert_delayed_ref(trans, delayed_refs, head_ref, &ref->node);
+}
+
 /*
  * add a delayed data ref. it's similar to btrfs_add_delayed_tree_ref.
  */
@@ -814,7 +837,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
 		ref_type = BTRFS_SHARED_DATA_REF_KEY;
 	else
 		ref_type = BTRFS_EXTENT_DATA_REF_KEY;
-	init_delayed_ref_common(fs_info, &ref->node, bytenr, num_bytes,
+	btrfs_init_delayed_ref_common(fs_info, &ref->node, bytenr, num_bytes,
 				ref_root, action, ref_type);
 	ref->root = ref_root;
 	ref->parent = parent;
@@ -839,8 +862,8 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
 		}
 	}

-	init_delayed_ref_head(head_ref, record, bytenr, num_bytes, ref_root,
-			      reserved, action, true, false);
+	btrfs_init_delayed_ref_head(head_ref, record, bytenr, num_bytes,
+				    ref_root, reserved, action, true, false);
[PATCH v14.8 13/14] btrfs: relocation: Enhance error handling to avoid BUG_ON
From: Qu Wenruo

Since the introduction of the btrfs dedupe tree, it's possible that
balance can race with dedupe disabling.

When this happens, dedupe_enabled will make btrfs_get_fs_root() return
PTR_ERR(-ENOENT).
But due to a bug in the error handling branch, when this happens
backref_cache->nr_nodes is increased but the node is neither added to
the backref_cache nor is nr_nodes decreased, causing a BUG_ON() in
backref_cache_cleanup():

[ 2611.668810] ------------[ cut here ]------------
[ 2611.669946] kernel BUG at /home/sat/ktest/linux/fs/btrfs/relocation.c:243!
[ 2611.670572] invalid opcode: [#1] SMP
[ 2611.686797] Call Trace:
[ 2611.687034]  [] btrfs_relocate_block_group+0x1b3/0x290 [btrfs]
[ 2611.687706]  [] btrfs_relocate_chunk.isra.40+0x47/0xd0 [btrfs]
[ 2611.688385]  [] btrfs_balance+0xb22/0x11e0 [btrfs]
[ 2611.688966]  [] btrfs_ioctl_balance+0x391/0x3a0 [btrfs]
[ 2611.689587]  [] btrfs_ioctl+0x1650/0x2290 [btrfs]
[ 2611.690145]  [] ? lru_cache_add+0x3a/0x80
[ 2611.690647]  [] ? lru_cache_add_active_or_unevictable+0x4c/0xc0
[ 2611.691310]  [] ? handle_mm_fault+0xcd4/0x17f0
[ 2611.691842]  [] ? cp_new_stat+0x153/0x180
[ 2611.692342]  [] ? __vma_link_rb+0xfd/0x110
[ 2611.692842]  [] ? vma_link+0xb9/0xc0
[ 2611.693303]  [] do_vfs_ioctl+0xa1/0x5a0
[ 2611.693781]  [] ? __do_page_fault+0x1b4/0x400
[ 2611.694310]  [] SyS_ioctl+0x41/0x70
[ 2611.694758]  [] entry_SYSCALL_64_fastpath+0x12/0x71
[ 2611.695331] Code: ff 48 8b 45 bf 49 83 af a8 05 00 00 01 49 89 87 a0 05
00 00 e9 2e fd ff ff b8 f4 ff ff ff e9 e4 fb ff ff 0f 0b 0f 0b 0f 0b 0f
0b <0f> 0b 0f 0b 41 89 c6 e9 b8 fb ff ff e8 9e a6 e8 e0 4c 89 e7 44
[ 2611.697870] RIP  [] relocate_block_group+0x741/0x7a0 [btrfs]
[ 2611.698818]  RSP

This patch calls remove_backref_node() in the error handling branch, and
catches the returned -ENOENT in relocate_tree_block() so balancing can
continue.

Reported-by: Satoru Takeuchi
Signed-off-by: Qu Wenruo
---
 fs/btrfs/relocation.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 3841cddef6ab..573ab5a04be5 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -885,6 +885,13 @@ struct backref_node *build_backref_tree(struct reloc_control *rc,
 		root = read_fs_root(rc->extent_root->fs_info, key.offset);
 		if (IS_ERR(root)) {
 			err = PTR_ERR(root);
+			/*
+			 * Don't forget to clean up the current node.
+			 * It may not have been added to backref_cache but
+			 * nr_nodes was increased.
+			 * This will cause BUG_ON() in backref_cache_cleanup().
+			 */
+			remove_backref_node(&rc->backref_cache, cur);
 			goto out;
 		}

@@ -3058,14 +3065,21 @@ int relocate_tree_blocks(struct btrfs_trans_handle *trans,
 	}

-	rb_node = rb_first(blocks);
-	while (rb_node) {
+	for (rb_node = rb_first(blocks); rb_node; rb_node = rb_next(rb_node)) {
 		block = rb_entry(rb_node, struct tree_block, rb_node);

 		node = build_backref_tree(rc, &block->key,
 					  block->level, block->bytenr);
 		if (IS_ERR(node)) {
+			/*
+			 * The root (currently only the dedupe tree) of the
+			 * tree block is going to be freed and can't be
+			 * reached. Just skip it and continue balancing.
+			 */
+			if (PTR_ERR(node) == -ENOENT)
+				continue;
 			err = PTR_ERR(node);
-			goto out;
+			break;
 		}

 		ret = relocate_tree_block(trans, rc, node, &block->key,
@@ -3073,11 +3087,9 @@ int relocate_tree_blocks(struct btrfs_trans_handle *trans,
 		if (ret < 0) {
 			if (ret != -EAGAIN || rb_node == rb_first(blocks))
 				err = ret;
-			goto out;
+			break;
 		}
-		rb_node = rb_next(rb_node);
 	}
-out:
 	err = finish_pending_nodes(trans, rc, path, err);

 out_free_path:
--
2.18.0
[PATCH v14.8 04/14] btrfs: dedupe: Introduce function to initialize dedupe info
From: Wang Xiaoguang Add generic function to initialize dedupe info. Signed-off-by: Qu Wenruo Signed-off-by: Wang Xiaoguang Reviewed-by: Josef Bacik Signed-off-by: Lu Fengqi --- fs/btrfs/Makefile | 2 +- fs/btrfs/dedupe.c | 174 + fs/btrfs/dedupe.h | 13 ++- include/uapi/linux/btrfs.h | 4 +- 4 files changed, 189 insertions(+), 4 deletions(-) create mode 100644 fs/btrfs/dedupe.c diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile index ca693dd554e9..78fdc87dba39 100644 --- a/fs/btrfs/Makefile +++ b/fs/btrfs/Makefile @@ -10,7 +10,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \ export.o tree-log.o free-space-cache.o zlib.o lzo.o zstd.o \ compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \ reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \ - uuid-tree.o props.o free-space-tree.o tree-checker.o + uuid-tree.o props.o free-space-tree.o tree-checker.o dedupe.o btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c new file mode 100644 index ..23b9cd8ae3ff --- /dev/null +++ b/fs/btrfs/dedupe.c @@ -0,0 +1,174 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2016 Fujitsu. All rights reserved. 
+ */ + +#include "ctree.h" +#include "dedupe.h" +#include "btrfs_inode.h" +#include "transaction.h" +#include "delayed-ref.h" + +struct inmem_hash { + struct rb_node hash_node; + struct rb_node bytenr_node; + struct list_head lru_list; + + u64 bytenr; + u32 num_bytes; + + u8 hash[]; +}; + +static int init_dedupe_info(struct btrfs_dedupe_info **ret_info, + struct btrfs_ioctl_dedupe_args *dargs) +{ + struct btrfs_dedupe_info *dedupe_info; + + dedupe_info = kzalloc(sizeof(*dedupe_info), GFP_NOFS); + if (!dedupe_info) + return -ENOMEM; + + dedupe_info->hash_algo = dargs->hash_algo; + dedupe_info->backend = dargs->backend; + dedupe_info->blocksize = dargs->blocksize; + dedupe_info->limit_nr = dargs->limit_nr; + + /* only support SHA256 yet */ + dedupe_info->dedupe_driver = crypto_alloc_shash("sha256", 0, 0); + if (IS_ERR(dedupe_info->dedupe_driver)) { + int ret; + + ret = PTR_ERR(dedupe_info->dedupe_driver); + kfree(dedupe_info); + return ret; + } + + dedupe_info->hash_root = RB_ROOT; + dedupe_info->bytenr_root = RB_ROOT; + dedupe_info->current_nr = 0; + INIT_LIST_HEAD(_info->lru_list); + mutex_init(_info->lock); + + *ret_info = dedupe_info; + return 0; +} + +/* + * Helper to check if parameters are valid. + * The first invalid field will be set to (-1), to info user which parameter + * is invalid. + * Except dargs->limit_nr or dargs->limit_mem, in that case, 0 will returned + * to info user, since user can specify any value to limit, except 0. + */ +static int check_dedupe_parameter(struct btrfs_fs_info *fs_info, + struct btrfs_ioctl_dedupe_args *dargs) +{ + u64 blocksize = dargs->blocksize; + u64 limit_nr = dargs->limit_nr; + u64 limit_mem = dargs->limit_mem; + u16 hash_algo = dargs->hash_algo; + u8 backend = dargs->backend; + + /* +* Set all reserved fields to -1, allow user to detect +* unsupported optional parameters. 
+*/ + memset(dargs->__unused, -1, sizeof(dargs->__unused)); + if (blocksize > BTRFS_DEDUPE_BLOCKSIZE_MAX || + blocksize < BTRFS_DEDUPE_BLOCKSIZE_MIN || + blocksize < fs_info->sectorsize || + !is_power_of_2(blocksize) || + blocksize < PAGE_SIZE) { + dargs->blocksize = (u64)-1; + return -EINVAL; + } + if (hash_algo >= ARRAY_SIZE(btrfs_hash_sizes)) { + dargs->hash_algo = (u16)-1; + return -EINVAL; + } + if (backend >= BTRFS_DEDUPE_BACKEND_COUNT) { + dargs->backend = (u8)-1; + return -EINVAL; + } + + /* Backend specific check */ + if (backend == BTRFS_DEDUPE_BACKEND_INMEMORY) { + /* only one limit is accepted for enable*/ + if (dargs->limit_nr && dargs->limit_mem) { + dargs->limit_nr = 0; + dargs->limit_mem = 0; + return -EINVAL; + } + + if (!limit_nr && !limit_mem) + dargs->limit_nr = BTRFS_DEDUPE_LIMIT_NR_DEFAULT; + else { + u64 tmp = (u64)-1; + + if (limit_mem) { + tmp = div_u64(limit_mem, + (sizeof(struct inmem_hash)) + +
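The blocksize checks in check_dedupe_parameter() above can be modeled in userspace. This is a minimal sketch, not the kernel code: the MIN/MAX constants here are illustrative assumptions (the changelog only mentions the 8 MiB upper limit being introduced in v3; the real values live in the series' dedupe.h), and `sectorsize`/`page_size` are passed in explicitly instead of coming from `fs_info` and `PAGE_SIZE`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative assumptions -- the series defines the real
 * BTRFS_DEDUPE_BLOCKSIZE_MIN/MAX in fs/btrfs/dedupe.h. */
#define DEDUPE_BLOCKSIZE_MIN (16ULL * 1024)
#define DEDUPE_BLOCKSIZE_MAX (8ULL * 1024 * 1024)

static bool is_power_of_2(uint64_t n)
{
	return n && (n & (n - 1)) == 0;
}

/* Mirrors the blocksize validation in check_dedupe_parameter():
 * the blocksize must lie within [MIN, MAX], be a power of two,
 * and be no smaller than the fs sectorsize or the page size. */
static bool blocksize_valid(uint64_t blocksize, uint64_t sectorsize,
			    uint64_t page_size)
{
	return blocksize <= DEDUPE_BLOCKSIZE_MAX &&
	       blocksize >= DEDUPE_BLOCKSIZE_MIN &&
	       blocksize >= sectorsize &&
	       blocksize >= page_size &&
	       is_power_of_2(blocksize);
}
```

On a reject, the kernel sets the offending field to (u64)-1 so the caller can tell which parameter was bad; this sketch only returns the verdict.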
[PATCH v14.8 12/14] btrfs: dedupe: Add ioctl for inband deduplication
From: Wang Xiaoguang Add ioctl interface for inband deduplication, which includes: 1) enable 2) disable 3) status And a pseudo RO compat flag, to imply that btrfs now supports inband dedup. However we don't add any ondisk format change, it's just a pseudo RO compat flag. All these ioctl interfaces are state-less, which means caller don't need to bother previous dedupe state before calling them, and only need to care the final desired state. For example, if user want to enable dedupe with specified block size and limit, just fill the ioctl structure and call enable ioctl. No need to check if dedupe is already running. These ioctls will handle things like re-configure or disable quite well. Also, for invalid parameters, enable ioctl interface will set the field of the first encountered invalid parameter to (-1) to inform caller. While for limit_nr/limit_mem, the value will be (0). Signed-off-by: Qu Wenruo Signed-off-by: Wang Xiaoguang Signed-off-by: Lu Fengqi --- fs/btrfs/dedupe.c | 50 fs/btrfs/dedupe.h | 17 +++--- fs/btrfs/disk-io.c | 3 ++ fs/btrfs/ioctl.c | 67 ++ fs/btrfs/sysfs.c | 2 ++ include/uapi/linux/btrfs.h | 12 ++- 6 files changed, 145 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c index 14c8d245480e..f068321fdd1c 100644 --- a/fs/btrfs/dedupe.c +++ b/fs/btrfs/dedupe.c @@ -29,6 +29,35 @@ static inline struct inmem_hash *inmem_alloc_hash(u16 algo) GFP_NOFS); } +void btrfs_dedupe_status(struct btrfs_fs_info *fs_info, +struct btrfs_ioctl_dedupe_args *dargs) +{ + struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info; + + if (!fs_info->dedupe_enabled || !dedupe_info) { + dargs->status = 0; + dargs->blocksize = 0; + dargs->backend = 0; + dargs->hash_algo = 0; + dargs->limit_nr = 0; + dargs->current_nr = 0; + memset(dargs->__unused, -1, sizeof(dargs->__unused)); + return; + } + mutex_lock(_info->lock); + dargs->status = 1; + dargs->blocksize = dedupe_info->blocksize; + dargs->backend = dedupe_info->backend; + dargs->hash_algo 
= dedupe_info->hash_algo; + dargs->limit_nr = dedupe_info->limit_nr; + dargs->limit_mem = dedupe_info->limit_nr * + (sizeof(struct inmem_hash) + +btrfs_hash_sizes[dedupe_info->hash_algo]); + dargs->current_nr = dedupe_info->current_nr; + mutex_unlock(_info->lock); + memset(dargs->__unused, -1, sizeof(dargs->__unused)); +} + static int init_dedupe_info(struct btrfs_dedupe_info **ret_info, struct btrfs_ioctl_dedupe_args *dargs) { @@ -409,6 +438,27 @@ static void unblock_all_writers(struct btrfs_fs_info *fs_info) percpu_up_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1); } +int btrfs_dedupe_cleanup(struct btrfs_fs_info *fs_info) +{ + struct btrfs_dedupe_info *dedupe_info; + + fs_info->dedupe_enabled = 0; + /* same as disable */ + smp_wmb(); + dedupe_info = fs_info->dedupe_info; + fs_info->dedupe_info = NULL; + + if (!dedupe_info) + return 0; + + if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY) + inmem_destroy(dedupe_info); + + crypto_free_shash(dedupe_info->dedupe_driver); + kfree(dedupe_info); + return 0; +} + int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info) { struct btrfs_dedupe_info *dedupe_info; diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h index ebcbb89d79a0..85a87093ab04 100644 --- a/fs/btrfs/dedupe.h +++ b/fs/btrfs/dedupe.h @@ -96,6 +96,15 @@ static inline struct btrfs_dedupe_hash *btrfs_dedupe_alloc_hash(u16 algo) int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, struct btrfs_ioctl_dedupe_args *dargs); + +/* + * Get inband dedupe info + * Since it needs to access different backends' hash size, which + * is not exported, we need such simple function. + */ +void btrfs_dedupe_status(struct btrfs_fs_info *fs_info, +struct btrfs_ioctl_dedupe_args *dargs); + /* * Disable dedupe and invalidate all its dedupe data. * Called at dedupe disable time. @@ -107,12 +116,10 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info); /* - * Get current dedupe status. 
- * Return 0 for success - * No possible error yet + * Cleanup current btrfs_dedupe_info + * Called in umount time */ -void btrfs_dedupe_status(struct btrfs_fs_info *fs_info, -struct btrfs_ioctl_dedupe_args *dargs); +int btrfs_dedupe_cleanup(struct btrfs_fs_info *fs_info); /* * Calculate hash for dedupe. diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index
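The status ioctl above reports `limit_mem` derived from `limit_nr`, while the enable path (patch 04) goes the other way with div_u64 when the user supplies a memory limit. A hedged sketch of that two-way conversion — the fixed-part size of `struct inmem_hash` used here is an assumed placeholder, since the real size depends on the kernel's rb_node/list_head layout:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed per-entry cost: sizeof(struct inmem_hash) + hash length.
 * 64 bytes for the fixed part is illustrative only; SHA-256 is the
 * sole hash algorithm the series supports so far. */
#define INMEM_HASH_FIXED 64ULL
#define SHA256_LEN       32ULL

/* btrfs_dedupe_status(): limit_nr -> reported limit_mem. */
static uint64_t limit_nr_to_mem(uint64_t limit_nr)
{
	return limit_nr * (INMEM_HASH_FIXED + SHA256_LEN);
}

/* init_dedupe_info(): user-supplied limit_mem -> limit_nr
 * (div_u64 in the kernel, since only one limit may be given). */
static uint64_t limit_mem_to_nr(uint64_t limit_mem)
{
	return limit_mem / (INMEM_HASH_FIXED + SHA256_LEN);
}
```

Because enable accepts exactly one of the two limits and status always reports both, the pair of conversions keeps the reported numbers consistent regardless of which form the user configured.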
[PATCH v14.8 00/14] Btrfs In-band De-duplication
This patchset can be fetched from github: https://github.com/littleroad/linux.git dedupe_latest This is just a normal rebase update. Now the new base is v4.18-rc4 Normal test cases from auto group exposes no regression, and ib-dedupe group can pass without problem. xfstests ib-dedupe group can be fetched from github: https://github.com/littleroad/xfstests-dev.git btrfs_dedupe_latest Changelog: v2: Totally reworked to handle multiple backends v3: Fix a stupid but deadly on-disk backend bug Add handle for multiple hash on same bytenr corner case to fix abort trans error Increase dedup rate by enhancing delayed ref handler for both backend. Move dedup_add() to run_delayed_ref() time, to fix abort trans error. Increase dedup block size up limit to 8M. v4: Add dedup prop for disabling dedup for given files/dirs. Merge inmem_search() and ondisk_search() into generic_search() to save some code Fix another delayed_ref related bug. Use the same mutex for both inmem and ondisk backend. Move dedup_add() back to btrfs_finish_ordered_io() to increase dedup rate. v5: Reuse compress routine for much simpler dedup function. Slightly improved performance due to above modification. Fix race between dedup enable/disable Fix for false ENOSPC report v6: Further enable/disable race window fix. Minor format change according to checkpatch. v7: Fix one concurrency bug with balance. Slightly modify return value from -EINVAL to -EOPNOTSUPP for btrfs_dedup_ioctl() to allow progs to distinguish unsupported commands and wrong parameter. Rebased to integration-4.6. v8: Rename 'dedup' to 'dedupe'. Add support to allow dedupe and compression work at the same time. Fix several balance related bugs. Special thanks to Satoru Takeuchi, who exposed most of them. Small dedupe hit case performance improvement. v9: Re-order the patchset to completely separate pure in-memory and any on-disk format change. Fold bug fixes into its original patch. v10: Adding back missing bug fix patch. 
Reduce on-disk item size. Hide dedupe ioctl under CONFIG_BTRFS_DEBUG. v11: Remove other backend and props support to focus on the framework and in-memory backend. Suggested by David. Better disable and buffered write race protection. Comprehensive fix to dedupe metadata ENOSPC problem. v12: Stateful 'enable' ioctl and new 'reconf' ioctl New FORCE flag for enable ioctl to allow stateless ioctl Precise error report and extendable ioctl structure. v12.1 Rebase to David's for-next-20160704 branch Add co-ordinate patch for subpage and dedupe patchset. v12.2 Rebase to David's for-next-20160715 branch Add co-ordinate patch for other patchset. v13 Rebase to David's for-next-20160906 branch Fix a reserved space leak bug, which only frees quota reserved space but not space_info->byte_may_use. v13.1 Rebase to Chris' for-linux-4.9 branch v14 Use generic ENOSPC fix for both compression and dedupe. v14.1 Further split ENOSPC fix. v14.2 Rebase to v4.11-rc2. Co-operate with count_max_extent() to calculate num_extents. No longer rely on qgroup fixes. v14.3 Rebase to v4.12-rc1. v14.4 Rebase to kdave/for-4.13-part1. v14.5 Rebase to v4.15-rc3. v14.6 Rebase to v4.17-rc5. v14.7 Replace SHASH_DESC_ON_STACK with kmalloc to remove VLA. Fixed the following errors by switching to div_u64. ├── arm-allmodconfig │ └── ERROR:__aeabi_uldivmod-fs-btrfs-btrfs.ko-undefined └── i386-allmodconfig └── ERROR:__udivdi3-fs-btrfs-btrfs.ko-undefined v14.8 Rebase to v4.18-rc4. 
Qu Wenruo (4): btrfs: delayed-ref: Add support for increasing data ref under spinlock btrfs: dedupe: Inband in-memory only de-duplication implement btrfs: relocation: Enhance error handling to avoid BUG_ON btrfs: dedupe: Introduce new reconfigure ioctl Wang Xiaoguang (10): btrfs: introduce type based delalloc metadata reserve btrfs: Introduce COMPRESS reserve type to fix false enospc for compression btrfs: dedupe: Introduce dedupe framework and its header btrfs: dedupe: Introduce function to initialize dedupe info btrfs: dedupe: Introduce function to add hash into in-memory tree btrfs: dedupe: Introduce function to remove hash from in-memory tree btrfs: dedupe: Introduce function to search for an existing hash btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface btrfs: ordered-extent: Add support for dedupe btrfs: dedupe: Add ioctl for inband deduplication fs/btrfs/Makefile| 2 +- fs/btrfs/ctree.h | 54 ++- fs/btrfs/dedupe.c| 836 +++ fs/btrfs/dedupe.h| 183 +++- fs/btrfs/delayed-ref.c | 49 +- fs/btrfs/delayed-ref.h | 16 + fs/btrfs/disk-io.c | 4 + fs/btrfs/extent-tree.c | 69 ++- fs/btrfs/extent_io.c | 8 +- fs/btrfs/extent_io.h | 2 + fs/btrfs/file.c | 36 +- fs/btrfs/free-space-cache.c | 6 +-
[PATCH v14.8 02/14] btrfs: Introduce COMPRESS reserve type to fix false enospc for compression
From: Wang Xiaoguang When testing btrfs compression, sometimes we got ENOSPC error, though fs still has much free space, xfstests generic/171, generic/172, generic/173, generic/174, generic/175 can reveal this bug in my test environment when compression is enabled. After some debugging work, we found that it's btrfs_delalloc_reserve_metadata() which sometimes tries to reserve too much metadata space, even for very small data range. In btrfs_delalloc_reserve_metadata(), the number of metadata bytes to reserve is calculated by the difference between outstanding extents and reserved extents. But due to bad designed drop_outstanding_extent() function, it can make the difference too big, and cause problem. The problem happens in the following flow with compression enabled. 1) Buffered write 128M data with 128K blocksize outstanding_extents = 1 reserved_extents = 1024 (128M / 128K, one blocksize will get one reserved_extent) Note: it's btrfs_merge_extent_hook() to merge outstanding extents. But reserved extents are still 1024. 2) Allocate extents for dirty range cow_file_range_async() split above large extent into small 128K extents. Let's assume 2 compressed extents have been split. So we have: outstanding_extents = 3 reserved_extents = 1024 range [0, 256K) has extents allocated 3) One ordered extent get finished btrfs_finish_ordered_io() |- btrfs_delalloc_release_metadata() |- drop_outstanding_extent() drop_outstanding_extent() will free *ALL* redundant reserved extents. So we have: outstanding_extents = 2 (One has finished) reserved_extents = 2 4) Continue allocating extents for dirty range cow_file_range_async() continue handling the remaining range. When the whole 128M range is done and assume no more ordered extents have finished. outstanding_extents = 1023 (One has finished in Step 3) reserved_extents = 2 (*ALL* freed in Step 3) 5) Another buffered write happens to the file btrfs_delalloc_reserve_metadata() will calculate metadata space. 
The calculation is: meta_to_reserve = (outstanding_extents - reserved_extents) * \ nodesize * max_tree_level(8) * 2 If nodesize is 16K, it's 1021 * 16K * 8 * 2, near 256M. If nodesize is 64K, it's about 1G. That's totally insane. The fix is to introduce new reserve type, COMPRESSION, to info outstanding extents calculation algorithm, to get correct outstanding_extents based extent size. So in Step 1), outstanding_extents = 1024 reserved_extents = 1024 Step 2): outstanding_extents = 1024 reserved_extents = 1024 Step 3): outstanding_extents = 1023 reserved_extents = 1023 And in Step 5) we reserve correct amount of metadata space. Signed-off-by: Wang Xiaoguang Signed-off-by: Qu Wenruo Signed-off-by: Lu Fengqi --- fs/btrfs/ctree.h | 2 ++ fs/btrfs/extent-tree.c | 2 ++ fs/btrfs/extent_io.c | 7 ++-- fs/btrfs/extent_io.h | 1 + fs/btrfs/file.c| 3 ++ fs/btrfs/inode.c | 81 +++--- fs/btrfs/ioctl.c | 2 ++ fs/btrfs/relocation.c | 3 ++ 8 files changed, 86 insertions(+), 15 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index f906aab71116..8743fdcfe139 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -106,9 +106,11 @@ static inline u32 count_max_extents(u64 size, u64 max_extent_size) */ enum btrfs_metadata_reserve_type { BTRFS_RESERVE_NORMAL, + BTRFS_RESERVE_COMPRESS, }; u64 btrfs_max_extent_size(enum btrfs_metadata_reserve_type reserve_type); +int inode_need_compress(struct inode *inode, u64 start, u64 end); struct btrfs_mapping_tree { struct extent_map_tree map_tree; diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 8e7ad123aa95..225ebcb1fd09 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -6021,6 +6021,8 @@ u64 btrfs_max_extent_size(enum btrfs_metadata_reserve_type reserve_type) { if (reserve_type == BTRFS_RESERVE_NORMAL) return BTRFS_MAX_EXTENT_SIZE; + else if (reserve_type == BTRFS_RESERVE_COMPRESS) + return SZ_128K; ASSERT(0); return BTRFS_MAX_EXTENT_SIZE; diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c 
index e55843f536bc..25d1c302dd47 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -596,7 +596,7 @@ int __clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, btrfs_debug_check_extent_io_range(tree, start, end); if (bits & EXTENT_DELALLOC) - bits |= EXTENT_NORESERVE; + bits |= EXTENT_NORESERVE | EXTENT_COMPRESS; if (delete) bits |= ~EXTENT_CTLBITS; @@ -1489,6 +1489,7 @@ static noinline u64 find_delalloc_range(struct extent_io_tree *tree, u64 cur_start = *start; u64 found = 0; u64 total_bytes = 0; + unsigned int pre_state;
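The over-reservation arithmetic from the commit message above is easy to check numerically. A small sketch reproducing the formula exactly as stated (nodesize times max tree level 8, times 2), using the step-5 numbers where the outstanding/reserved difference has grown to 1021 extents:

```c
#include <assert.h>
#include <stdint.h>

/* meta_to_reserve = (outstanding_extents - reserved_extents) *
 *                   nodesize * max_tree_level(8) * 2
 * as quoted in the commit message of this patch. */
static uint64_t meta_to_reserve(uint64_t outstanding, uint64_t reserved,
				uint64_t nodesize)
{
	return (outstanding - reserved) * nodesize * 8 * 2;
}
```

With a 16 KiB nodesize this yields 1021 * 16K * 16 = 267,649,024 bytes (just over 255 MiB, the "near 256M" in the message), and with a 64 KiB nodesize roughly 1 GiB — for a single small buffered write, which is what triggers the false ENOSPC.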
[PATCH v14.8 10/14] btrfs: ordered-extent: Add support for dedupe
From: Wang Xiaoguang Add ordered-extent support for dedupe. Note, current ordered-extent support only supports non-compressed source extent. Support for compressed source extent will be added later. Signed-off-by: Qu Wenruo Signed-off-by: Wang Xiaoguang Reviewed-by: Josef Bacik --- fs/btrfs/ordered-data.c | 46 + fs/btrfs/ordered-data.h | 13 2 files changed, 55 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c index 78cdf572ca9c..520d384d1923 100644 --- a/fs/btrfs/ordered-data.c +++ b/fs/btrfs/ordered-data.c @@ -13,6 +13,7 @@ #include "extent_io.h" #include "disk-io.h" #include "compression.h" +#include "dedupe.h" static struct kmem_cache *btrfs_ordered_extent_cache; @@ -171,7 +172,8 @@ static inline struct rb_node *tree_search(struct btrfs_ordered_inode_tree *tree, */ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, u64 start, u64 len, u64 disk_len, - int type, int dio, int compress_type) + int type, int dio, int compress_type, + struct btrfs_dedupe_hash *hash) { struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); struct btrfs_root *root = BTRFS_I(inode)->root; @@ -192,6 +194,33 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, entry->inode = igrab(inode); entry->compress_type = compress_type; entry->truncated_len = (u64)-1; + entry->hash = NULL; + /* +* A hash hit means we have already incremented the extents delayed +* ref. +* We must handle this even if another process is trying to +* turn off dedupe, otherwise we will leak a reference. 
+*/ + if (hash && (hash->bytenr || root->fs_info->dedupe_enabled)) { + struct btrfs_dedupe_info *dedupe_info; + + dedupe_info = root->fs_info->dedupe_info; + if (WARN_ON(dedupe_info == NULL)) { + kmem_cache_free(btrfs_ordered_extent_cache, + entry); + return -EINVAL; + } + entry->hash = btrfs_dedupe_alloc_hash(dedupe_info->hash_algo); + if (!entry->hash) { + kmem_cache_free(btrfs_ordered_extent_cache, entry); + return -ENOMEM; + } + entry->hash->bytenr = hash->bytenr; + entry->hash->num_bytes = hash->num_bytes; + memcpy(entry->hash->hash, hash->hash, + btrfs_hash_sizes[dedupe_info->hash_algo]); + } + if (type != BTRFS_ORDERED_IO_DONE && type != BTRFS_ORDERED_COMPLETE) set_bit(type, >flags); @@ -246,15 +275,23 @@ int btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, { return __btrfs_add_ordered_extent(inode, file_offset, start, len, disk_len, type, 0, - BTRFS_COMPRESS_NONE); + BTRFS_COMPRESS_NONE, NULL); } +int btrfs_add_ordered_extent_dedupe(struct inode *inode, u64 file_offset, + u64 start, u64 len, u64 disk_len, int type, + struct btrfs_dedupe_hash *hash) +{ + return __btrfs_add_ordered_extent(inode, file_offset, start, len, + disk_len, type, 0, + BTRFS_COMPRESS_NONE, hash); +} int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset, u64 start, u64 len, u64 disk_len, int type) { return __btrfs_add_ordered_extent(inode, file_offset, start, len, disk_len, type, 1, - BTRFS_COMPRESS_NONE); + BTRFS_COMPRESS_NONE, NULL); } int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset, @@ -263,7 +300,7 @@ int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset, { return __btrfs_add_ordered_extent(inode, file_offset, start, len, disk_len, type, 0, - compress_type); + compress_type, NULL); } /* @@ -568,6 +605,7 @@ void btrfs_put_ordered_extent(struct btrfs_ordered_extent *entry) list_del(>list); kfree(sum); } + kfree(entry->hash); kmem_cache_free(btrfs_ordered_extent_cache, entry); } } diff --git 
a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index 5bad40387023..1d86674758d7 100644 ---
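The hash copy in __btrfs_add_ordered_extent() above deep-copies the caller's hash into the ordered extent so the two lifetimes are independent (the copy is freed in btrfs_put_ordered_extent()). A userspace model of that flexible-array pattern — the struct layout here is a simplified stand-in for btrfs_dedupe_hash, not the kernel definition:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for btrfs_dedupe_hash: fixed header plus a
 * flexible array whose length depends on the hash algorithm. */
struct dedupe_hash {
	uint64_t bytenr;
	uint32_t num_bytes;
	uint8_t  hash[];
};

/* Deep-copy a hash, as the ordered-extent setup does with
 * btrfs_dedupe_alloc_hash() + memcpy of the digest bytes. */
static struct dedupe_hash *hash_dup(const struct dedupe_hash *src,
				    size_t hash_len)
{
	struct dedupe_hash *d = malloc(sizeof(*d) + hash_len);

	if (!d)
		return NULL;
	d->bytenr = src->bytenr;
	d->num_bytes = src->num_bytes;
	memcpy(d->hash, src->hash, hash_len);
	return d;
}
```

The allocation-failure path matters in the kernel version: on ENOMEM the half-built ordered extent must be freed before returning, which is exactly what the diff's `kmem_cache_free(btrfs_ordered_extent_cache, entry)` branches do.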
[PATCH v14.8 01/14] btrfs: introduce type based delalloc metadata reserve
From: Wang Xiaoguang Introduce type based metadata reserve parameter for delalloc space reservation/freeing function. The problem we are going to solve is, btrfs use different max extent size for different mount options. For compression, the max extent size is 128K, while for non-compress write it's 128M. And furthermore, split/merge extent hook highly depends that max extent size. Such situation contributes to quite a lot of false ENOSPC. So this patch introduces the facility to help solve these false ENOSPC related to different max extent size. Currently, only normal 128M extent size is supported. More types will follow soon. Signed-off-by: Wang Xiaoguang Signed-off-by: Qu Wenruo Signed-off-by: Lu Fengqi --- fs/btrfs/ctree.h | 43 ++--- fs/btrfs/extent-tree.c | 48 --- fs/btrfs/file.c | 30 + fs/btrfs/free-space-cache.c | 6 +- fs/btrfs/inode-map.c | 9 ++- fs/btrfs/inode.c | 115 +-- fs/btrfs/ioctl.c | 23 +++ fs/btrfs/ordered-data.c | 6 +- fs/btrfs/ordered-data.h | 3 +- fs/btrfs/relocation.c| 22 --- fs/btrfs/tests/inode-tests.c | 15 +++-- 11 files changed, 223 insertions(+), 97 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 118346aceea9..f906aab71116 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -92,11 +92,24 @@ static const int btrfs_csum_sizes[] = { 4 }; /* * Count how many BTRFS_MAX_EXTENT_SIZE cover the @size */ -static inline u32 count_max_extents(u64 size) +static inline u32 count_max_extents(u64 size, u64 max_extent_size) { - return div_u64(size + BTRFS_MAX_EXTENT_SIZE - 1, BTRFS_MAX_EXTENT_SIZE); + return div_u64(size + max_extent_size - 1, max_extent_size); } +/* + * Type based metadata reserve type + * This affects how btrfs reserve metadata space for buffered write. 
+ * + * This is caused by the different max extent size for normal COW + * and compression, and further in-band dedupe + */ +enum btrfs_metadata_reserve_type { + BTRFS_RESERVE_NORMAL, +}; + +u64 btrfs_max_extent_size(enum btrfs_metadata_reserve_type reserve_type); + struct btrfs_mapping_tree { struct extent_map_tree map_tree; }; @@ -2760,8 +2773,9 @@ int btrfs_check_data_free_space(struct inode *inode, void btrfs_free_reserved_data_space(struct inode *inode, struct extent_changeset *reserved, u64 start, u64 len); void btrfs_delalloc_release_space(struct inode *inode, - struct extent_changeset *reserved, - u64 start, u64 len, bool qgroup_free); + struct extent_changeset *reserved, + u64 start, u64 len, bool qgroup_free, + enum btrfs_metadata_reserve_type reserve_type); void btrfs_free_reserved_data_space_noquota(struct inode *inode, u64 start, u64 len); void btrfs_trans_release_chunk_metadata(struct btrfs_trans_handle *trans); @@ -2771,13 +2785,17 @@ int btrfs_subvolume_reserve_metadata(struct btrfs_root *root, void btrfs_subvolume_release_metadata(struct btrfs_fs_info *fs_info, struct btrfs_block_rsv *rsv); void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes, - bool qgroup_free); + bool qgroup_free, + enum btrfs_metadata_reserve_type reserve_type); -int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes); +int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes, + enum btrfs_metadata_reserve_type reserve_type); void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes, -bool qgroup_free); + bool qgroup_free, + enum btrfs_metadata_reserve_type reserve_type); int btrfs_delalloc_reserve_space(struct inode *inode, - struct extent_changeset **reserved, u64 start, u64 len); + struct extent_changeset **reserved, u64 start, u64 len, + enum btrfs_metadata_reserve_type reserve_type); void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type); struct btrfs_block_rsv 
*btrfs_alloc_block_rsv(struct btrfs_fs_info *fs_info, unsigned short type); @@ -3188,7 +3206,11 @@ int btrfs_start_delalloc_inodes(struct btrfs_root *root); int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, int nr); int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end, unsigned int extra_bits, - struct extent_state **cached_state, int dedupe); +
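The reworked count_max_extents() above is the heart of this patch: the max extent size becomes a parameter instead of the hard-coded BTRFS_MAX_EXTENT_SIZE, because compression splits delalloc into 128 KiB extents while normal COW uses 128 MiB. A sketch of the round-up division (div_u64 in the kernel):

```c
#include <assert.h>
#include <stdint.h>

/* How many max_extent_size-sized extents cover @size (round up),
 * matching the reworked count_max_extents() in ctree.h. */
static uint32_t count_max_extents(uint64_t size, uint64_t max_extent_size)
{
	return (uint32_t)((size + max_extent_size - 1) / max_extent_size);
}
```

For a 128 MiB buffered write this gives 1024 outstanding extents under the 128 KiB compression limit but only 1 under the normal 128 MiB limit — the very mismatch between outstanding and reserved extent counts that patch 02 fixes.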
[PATCH v14.8 06/14] btrfs: dedupe: Introduce function to remove hash from in-memory tree
From: Wang Xiaoguang Introduce static function inmem_del() to remove hash from in-memory dedupe tree. And implement btrfs_dedupe_del() and btrfs_dedup_disable() interfaces. Also for btrfs_dedupe_disable(), add new functions to wait existing writer and block incoming writers to eliminate all possible race. Cc: Mark Fasheh Signed-off-by: Qu Wenruo Signed-off-by: Wang Xiaoguang --- fs/btrfs/dedupe.c | 132 +++--- 1 file changed, 126 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c index a0911dcdf502..3232fe5ae530 100644 --- a/fs/btrfs/dedupe.c +++ b/fs/btrfs/dedupe.c @@ -175,12 +175,6 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, return ret; } -int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info) -{ - /* Place holder for bisect, will be implemented in later patches */ - return 0; -} - static int inmem_insert_hash(struct rb_root *root, struct inmem_hash *hash, int hash_len) { @@ -323,3 +317,129 @@ int btrfs_dedupe_add(struct btrfs_trans_handle *trans, return inmem_add(dedupe_info, hash); return -EINVAL; } + +static struct inmem_hash * +inmem_search_bytenr(struct btrfs_dedupe_info *dedupe_info, u64 bytenr) +{ + struct rb_node **p = _info->bytenr_root.rb_node; + struct rb_node *parent = NULL; + struct inmem_hash *entry = NULL; + + while (*p) { + parent = *p; + entry = rb_entry(parent, struct inmem_hash, bytenr_node); + + if (bytenr < entry->bytenr) + p = &(*p)->rb_left; + else if (bytenr > entry->bytenr) + p = &(*p)->rb_right; + else + return entry; + } + + return NULL; +} + +/* Delete a hash from in-memory dedupe tree */ +static int inmem_del(struct btrfs_dedupe_info *dedupe_info, u64 bytenr) +{ + struct inmem_hash *hash; + + mutex_lock(_info->lock); + hash = inmem_search_bytenr(dedupe_info, bytenr); + if (!hash) { + mutex_unlock(_info->lock); + return 0; + } + + __inmem_del(dedupe_info, hash); + mutex_unlock(_info->lock); + return 0; +} + +/* Remove a dedupe hash from dedupe tree */ +int btrfs_dedupe_del(struct 
btrfs_trans_handle *trans, +struct btrfs_fs_info *fs_info, u64 bytenr) +{ + struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info; + + if (!fs_info->dedupe_enabled) + return 0; + + if (WARN_ON(dedupe_info == NULL)) + return -EINVAL; + + if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY) + return inmem_del(dedupe_info, bytenr); + return -EINVAL; +} + +static void inmem_destroy(struct btrfs_dedupe_info *dedupe_info) +{ + struct inmem_hash *entry, *tmp; + + mutex_lock(_info->lock); + list_for_each_entry_safe(entry, tmp, _info->lru_list, lru_list) + __inmem_del(dedupe_info, entry); + mutex_unlock(_info->lock); +} + +/* + * Helper function to wait and block all incoming writers + * + * Use rw_sem introduced for freeze to wait/block writers. + * So during the block time, no new write will happen, so we can + * do something quite safe, espcially helpful for dedupe disable, + * as it affect buffered write. + */ +static void block_all_writers(struct btrfs_fs_info *fs_info) +{ + struct super_block *sb = fs_info->sb; + + percpu_down_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1); + down_write(>s_umount); +} + +static void unblock_all_writers(struct btrfs_fs_info *fs_info) +{ + struct super_block *sb = fs_info->sb; + + up_write(>s_umount); + percpu_up_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1); +} + +int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info) +{ + struct btrfs_dedupe_info *dedupe_info; + int ret; + + dedupe_info = fs_info->dedupe_info; + + if (!dedupe_info) + return 0; + + /* Don't allow disable status change in RO mount */ + if (fs_info->sb->s_flags & MS_RDONLY) + return -EROFS; + + /* +* Wait for all unfinished writers and block further writers. +* Then sync the whole fs so all current write will go through +* dedupe, and all later write won't go through dedupe. 
+*/ + block_all_writers(fs_info); + ret = sync_filesystem(fs_info->sb); + fs_info->dedupe_enabled = 0; + fs_info->dedupe_info = NULL; + unblock_all_writers(fs_info); + if (ret < 0) + return ret; + + /* now we are OK to clean up everything */ + if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY) + inmem_destroy(dedupe_info); + + crypto_free_shash(dedupe_info->dedupe_driver); + kfree(dedupe_info); + return 0; +} -- 2.18.0 -- To
[PATCH v14.8 08/14] btrfs: dedupe: Introduce function to search for an existing hash
From: Wang Xiaoguang Introduce static function inmem_search() to handle the job for in-memory hash tree. The trick is, we must ensure the delayed ref head is not being run at the time we search the for the hash. With inmem_search(), we can implement the btrfs_dedupe_search() interface. Signed-off-by: Qu Wenruo Signed-off-by: Wang Xiaoguang Reviewed-by: Josef Bacik Signed-off-by: Lu Fengqi --- fs/btrfs/dedupe.c | 208 ++ 1 file changed, 208 insertions(+) diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c index 3232fe5ae530..e3084deb1eb7 100644 --- a/fs/btrfs/dedupe.c +++ b/fs/btrfs/dedupe.c @@ -8,6 +8,7 @@ #include "btrfs_inode.h" #include "transaction.h" #include "delayed-ref.h" +#include "qgroup.h" struct inmem_hash { struct rb_node hash_node; @@ -443,3 +444,210 @@ int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info) kfree(dedupe_info); return 0; } + +/* + * Caller must ensure the corresponding ref head is not being run. + */ +static struct inmem_hash * +inmem_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash) +{ + struct rb_node **p = _info->hash_root.rb_node; + struct rb_node *parent = NULL; + struct inmem_hash *entry = NULL; + u16 hash_algo = dedupe_info->hash_algo; + int hash_len = btrfs_hash_sizes[hash_algo]; + + while (*p) { + parent = *p; + entry = rb_entry(parent, struct inmem_hash, hash_node); + + if (memcmp(hash, entry->hash, hash_len) < 0) { + p = &(*p)->rb_left; + } else if (memcmp(hash, entry->hash, hash_len) > 0) { + p = &(*p)->rb_right; + } else { + /* Found, need to re-add it to LRU list head */ + list_del(>lru_list); + list_add(>lru_list, _info->lru_list); + return entry; + } + } + return NULL; +} + +static int inmem_search(struct btrfs_dedupe_info *dedupe_info, + struct inode *inode, u64 file_pos, + struct btrfs_dedupe_hash *hash) +{ + int ret; + struct btrfs_root *root = BTRFS_I(inode)->root; + struct btrfs_trans_handle *trans; + struct btrfs_delayed_ref_root *delayed_refs; + struct btrfs_delayed_ref_head *head; + struct 
btrfs_delayed_ref_head *insert_head; + struct btrfs_delayed_data_ref *insert_dref; + struct btrfs_qgroup_extent_record *insert_qrecord = NULL; + struct inmem_hash *found_hash; + int free_insert = 1; + int qrecord_inserted = 0; + u64 ref_root = root->root_key.objectid; + u64 bytenr; + u32 num_bytes; + + insert_head = kmem_cache_alloc(btrfs_delayed_ref_head_cachep, GFP_NOFS); + if (!insert_head) + return -ENOMEM; + insert_head->extent_op = NULL; + + insert_dref = kmem_cache_alloc(btrfs_delayed_data_ref_cachep, GFP_NOFS); + if (!insert_dref) { + kmem_cache_free(btrfs_delayed_ref_head_cachep, insert_head); + return -ENOMEM; + } + if (test_bit(BTRFS_FS_QUOTA_ENABLED, >fs_info->flags) && + is_fstree(ref_root)) { + insert_qrecord = kmalloc(sizeof(*insert_qrecord), GFP_NOFS); + if (!insert_qrecord) { + kmem_cache_free(btrfs_delayed_ref_head_cachep, + insert_head); + kmem_cache_free(btrfs_delayed_data_ref_cachep, + insert_dref); + return -ENOMEM; + } + } + + trans = btrfs_join_transaction(root); + if (IS_ERR(trans)) { + ret = PTR_ERR(trans); + goto free_mem; + } + +again: + mutex_lock(_info->lock); + found_hash = inmem_search_hash(dedupe_info, hash->hash); + /* If we don't find a duplicated extent, just return. */ + if (!found_hash) { + ret = 0; + goto out; + } + bytenr = found_hash->bytenr; + num_bytes = found_hash->num_bytes; + + btrfs_init_delayed_ref_head(insert_head, insert_qrecord, bytenr, + num_bytes, ref_root, 0, BTRFS_ADD_DELAYED_REF, true, + false); + + btrfs_init_delayed_ref_common(trans->fs_info, _dref->node, + bytenr, num_bytes, ref_root, BTRFS_ADD_DELAYED_REF, + BTRFS_EXTENT_DATA_REF_KEY); + insert_dref->root = ref_root; + insert_dref->parent = 0; + insert_dref->objectid = btrfs_ino(BTRFS_I(inode)); + insert_dref->offset = file_pos; + + delayed_refs = >transaction->delayed_refs; + + spin_lock(_refs->lock); + head = btrfs_find_delayed_ref_head(>transaction->delayed_refs, + bytenr); + if (!head) { +
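The inmem_search_hash() walk above orders the tree by memcmp() over the full digest. A simplified, array-based stand-in for that lookup — a sorted array searched with the same comparison rule, omitting the rb-tree plumbing and the LRU re-add on hit:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HASH_LEN 32	/* SHA-256, the only algorithm supported so far */

/* Binary search over a memcmp()-sorted table of digests; the kernel
 * does the same comparison while descending an rb-tree, and on a hit
 * additionally moves the entry to the head of the LRU list. */
static int find_hash(const uint8_t (*table)[HASH_LEN], int nr,
		     const uint8_t *needle)
{
	int lo = 0, hi = nr - 1;

	while (lo <= hi) {
		int mid = lo + (hi - lo) / 2;
		int cmp = memcmp(needle, table[mid], HASH_LEN);

		if (cmp < 0)
			hi = mid - 1;
		else if (cmp > 0)
			lo = mid + 1;
		else
			return mid;	/* duplicate block found */
	}
	return -1;			/* no dedupe hit */
}
```

What the sketch cannot show is the hard part of btrfs_dedupe_search(): a hit is only usable if the matching extent's delayed ref head is not being run, which is why the real code joins a transaction and takes delayed_refs->lock before trusting the result.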
[PATCH v10.3 4/5] btrfs-progs: dedupe: Add status subcommand
From: Qu Wenruo Add status subcommand for dedupe command group. Signed-off-by: Qu Wenruo --- Documentation/btrfs-dedupe-inband.asciidoc | 3 + btrfs-completion | 2 +- cmds-dedupe-ib.c | 81 ++ 3 files changed, 85 insertions(+), 1 deletion(-) diff --git a/Documentation/btrfs-dedupe-inband.asciidoc b/Documentation/btrfs-dedupe-inband.asciidoc index de32eb97d9dd..df068c31ca3a 100644 --- a/Documentation/btrfs-dedupe-inband.asciidoc +++ b/Documentation/btrfs-dedupe-inband.asciidoc @@ -86,6 +86,9 @@ And compression has higher priority than in-band de-duplication, means if compression and de-duplication is enabled at the same time, only compression will work. +*status* :: +Show current in-band de-duplication status of a filesystem. + BACKENDS Btrfs in-band de-duplication will support different storage backends, with diff --git a/btrfs-completion b/btrfs-completion index 2f113e01fb01..c8e67b459341 100644 --- a/btrfs-completion +++ b/btrfs-completion @@ -41,7 +41,7 @@ _btrfs() commands_quota='enable disable rescan' commands_qgroup='assign remove create destroy show limit' commands_replace='start status cancel' - commands_dedupe='enable disable' + commands_dedupe='enable disable status' if [[ "$cur" == -* && $cword -le 3 && "$cmd" != "help" ]]; then COMPREPLY=( $( compgen -W '--help' -- "$cur" ) ) diff --git a/cmds-dedupe-ib.c b/cmds-dedupe-ib.c index 031766c1d91c..854cbda131a3 100644 --- a/cmds-dedupe-ib.c +++ b/cmds-dedupe-ib.c @@ -302,12 +302,93 @@ out: return 0; } +static const char * const cmd_dedupe_ib_status_usage[] = { + "btrfs dedupe status ", + "Show current in-band(write time) de-duplication status of a btrfs.", + NULL +}; + +static int cmd_dedupe_ib_status(int argc, char **argv) +{ + struct btrfs_ioctl_dedupe_args dargs; + DIR *dirstream; + char *path; + int fd; + int ret; + int print_limit = 1; + + if (check_argc_exact(argc, 2)) + usage(cmd_dedupe_ib_status_usage); + + path = argv[1]; + fd = open_file_or_dir(path, ); + if (fd < 0) { + error("failed to open file or 
directory: %s", path);
+		ret = 1;
+		goto out;
+	}
+	memset(&dargs, 0, sizeof(dargs));
+	dargs.cmd = BTRFS_DEDUPE_CTL_STATUS;
+
+	ret = ioctl(fd, BTRFS_IOC_DEDUPE_CTL, &dargs);
+	if (ret < 0) {
+		error("failed to get inband deduplication status: %s",
+		      strerror(errno));
+		ret = 1;
+		goto out;
+	}
+	ret = 0;
+	if (dargs.status == 0) {
+		printf("Status: \t\t\tDisabled\n");
+		goto out;
+	}
+	printf("Status:\t\t\tEnabled\n");
+
+	if (dargs.hash_algo == BTRFS_DEDUPE_HASH_SHA256)
+		printf("Hash algorithm:\t\tSHA-256\n");
+	else
+		printf("Hash algorithm:\t\tUnrecognized(%x)\n",
+		       dargs.hash_algo);
+
+	if (dargs.backend == BTRFS_DEDUPE_BACKEND_INMEMORY) {
+		printf("Backend:\t\tIn-memory\n");
+		print_limit = 1;
+	} else {
+		printf("Backend:\t\tUnrecognized(%x)\n",
+		       dargs.backend);
+	}
+
+	printf("Dedup Blocksize:\t%llu\n", dargs.blocksize);
+
+	if (print_limit) {
+		u64 cur_mem;
+
+		/* Limit nr may be 0 */
+		if (dargs.limit_nr)
+			cur_mem = dargs.current_nr * (dargs.limit_mem /
+					dargs.limit_nr);
+		else
+			cur_mem = 0;
+
+		printf("Number of hash: \t[%llu/%llu]\n", dargs.current_nr,
+		       dargs.limit_nr);
+		printf("Memory usage: \t\t[%s/%s]\n",
+		       pretty_size(cur_mem),
+		       pretty_size(dargs.limit_mem));
+	}
+out:
+	close_file_or_dir(fd, dirstream);
+	return ret;
+}
+
 const struct cmd_group dedupe_ib_cmd_group = {
 	dedupe_ib_cmd_group_usage, dedupe_ib_cmd_group_info, {
 		{ "enable", cmd_dedupe_ib_enable, cmd_dedupe_ib_enable_usage,
 		  NULL, 0},
 		{ "disable", cmd_dedupe_ib_disable, cmd_dedupe_ib_disable_usage,
 		  NULL, 0},
+		{ "status", cmd_dedupe_ib_status, cmd_dedupe_ib_status_usage,
+		  NULL, 0},
 		NULL_CMD_STRUCT
 	}
};
-- 
2.18.0
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v10.3 3/5] btrfs-progs: dedupe: Add disable support for inband deduplication
From: Qu Wenruo

Add disable subcommand for dedupe command group.

Signed-off-by: Qu Wenruo
Signed-off-by: Lu Fengqi
---
 Documentation/btrfs-dedupe-inband.asciidoc |  5 +++
 btrfs-completion                           |  2 +-
 cmds-dedupe-ib.c                           | 42 ++
 3 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/Documentation/btrfs-dedupe-inband.asciidoc b/Documentation/btrfs-dedupe-inband.asciidoc
index 82f970a69953..de32eb97d9dd 100644
--- a/Documentation/btrfs-dedupe-inband.asciidoc
+++ b/Documentation/btrfs-dedupe-inband.asciidoc
@@ -22,6 +22,11 @@ use with caution.
 
 SUBCOMMAND
 --
+*disable* <path>::
+Disable in-band de-duplication for a filesystem.
++
+This will trash all stored dedupe hashes.
++
 *enable* [options] <path>::
 Enable in-band de-duplication for a filesystem.
 +

diff --git a/btrfs-completion b/btrfs-completion
index 69e02ad11990..2f113e01fb01 100644
--- a/btrfs-completion
+++ b/btrfs-completion
@@ -41,7 +41,7 @@ _btrfs()
 	commands_quota='enable disable rescan'
 	commands_qgroup='assign remove create destroy show limit'
 	commands_replace='start status cancel'
-	commands_dedupe='enable'
+	commands_dedupe='enable disable'
 
 	if [[ "$cur" == -* && $cword -le 3 && "$cmd" != "help" ]]; then
 		COMPREPLY=( $( compgen -W '--help' -- "$cur" ) )

diff --git a/cmds-dedupe-ib.c b/cmds-dedupe-ib.c
index cb62d0064167..031766c1d91c 100644
--- a/cmds-dedupe-ib.c
+++ b/cmds-dedupe-ib.c
@@ -262,10 +262,52 @@ out:
 	return ret;
 }
 
+static const char * const cmd_dedupe_ib_disable_usage[] = {
+	"btrfs dedupe disable <path>",
+	"Disable in-band(write time) de-duplication of a btrfs.",
+	NULL
+};
+
+static int cmd_dedupe_ib_disable(int argc, char **argv)
+{
+	struct btrfs_ioctl_dedupe_args dargs;
+	DIR *dirstream;
+	char *path;
+	int fd;
+	int ret;
+
+	if (check_argc_exact(argc, 2))
+		usage(cmd_dedupe_ib_disable_usage);
+
+	path = argv[1];
+	fd = open_file_or_dir(path, &dirstream);
+	if (fd < 0) {
+		error("failed to open file or directory: %s", path);
+		return 1;
+	}
+	memset(&dargs, 0, sizeof(dargs));
+	dargs.cmd = BTRFS_DEDUPE_CTL_DISABLE;
+
+	ret =
ioctl(fd, BTRFS_IOC_DEDUPE_CTL, &dargs);
+	if (ret < 0) {
+		error("failed to disable inband deduplication: %s",
+		      strerror(errno));
+		ret = 1;
+		goto out;
+	}
+	ret = 0;
+
+out:
+	close_file_or_dir(fd, dirstream);
+	return ret;
+}
+
 const struct cmd_group dedupe_ib_cmd_group = {
 	dedupe_ib_cmd_group_usage, dedupe_ib_cmd_group_info, {
 		{ "enable", cmd_dedupe_ib_enable, cmd_dedupe_ib_enable_usage,
 		  NULL, 0},
+		{ "disable", cmd_dedupe_ib_disable, cmd_dedupe_ib_disable_usage,
+		  NULL, 0},
 		NULL_CMD_STRUCT
 	}
};
-- 
2.18.0
[PATCH v10.3 5/5] btrfs-progs: dedupe: introduce reconfigure subcommand
From: Qu Wenruo Introduce reconfigure subcommand to co-operate with new kernel ioctl modification. Signed-off-by: Qu Wenruo --- Documentation/btrfs-dedupe-inband.asciidoc | 7 ++ cmds-dedupe-ib.c | 75 +- 2 files changed, 66 insertions(+), 16 deletions(-) diff --git a/Documentation/btrfs-dedupe-inband.asciidoc b/Documentation/btrfs-dedupe-inband.asciidoc index df068c31ca3a..5fc4bb0d5940 100644 --- a/Documentation/btrfs-dedupe-inband.asciidoc +++ b/Documentation/btrfs-dedupe-inband.asciidoc @@ -86,6 +86,13 @@ And compression has higher priority than in-band de-duplication, means if compression and de-duplication is enabled at the same time, only compression will work. +*reconfigure* [options] :: +Re-configure in-band de-duplication parameters of a filesystem. ++ +In-band de-duplication must be enabled first before re-configuration. ++ +[Options] are the same as for 'btrfs dedupe-inband enable'. + *status* :: Show current in-band de-duplication status of a filesystem. diff --git a/cmds-dedupe-ib.c b/cmds-dedupe-ib.c index 854cbda131a3..925d5a8f756a 100644 --- a/cmds-dedupe-ib.c +++ b/cmds-dedupe-ib.c @@ -56,7 +56,6 @@ static const char * const cmd_dedupe_ib_enable_usage[] = { NULL }; - #define report_fatal_parameter(dargs, old, member, type, err_val, fmt) \ ({ \ if (dargs->member != old->member && \ @@ -88,6 +87,12 @@ static void report_parameter_error(struct btrfs_ioctl_dedupe_args *dargs, } report_option_parameter(dargs, old, flags, u8, -1, x); } + + if (dargs->status == 0 && old->cmd == BTRFS_DEDUPE_CTL_RECONF) { + error("must enable dedupe before reconfiguration"); + return; + } + if (report_fatal_parameter(dargs, old, cmd, u16, -1, u) || report_fatal_parameter(dargs, old, blocksize, u64, -1, llu) || report_fatal_parameter(dargs, old, backend, u16, -1, u) || @@ -100,14 +105,17 @@ static void report_parameter_error(struct btrfs_ioctl_dedupe_args *dargs, old->limit_nr, old->limit_mem); } -static int cmd_dedupe_ib_enable(int argc, char **argv) +static int 
enable_reconfig_dedupe(int argc, char **argv, int reconf) { int ret; int fd = -1; char *path; u64 blocksize = BTRFS_DEDUPE_BLOCKSIZE_DEFAULT; + int blocksize_set = 0; u16 hash_algo = BTRFS_DEDUPE_HASH_SHA256; + int hash_algo_set = 0; u16 backend = BTRFS_DEDUPE_BACKEND_INMEMORY; + int backend_set = 0; u64 limit_nr = 0; u64 limit_mem = 0; u64 sys_mem = 0; @@ -129,20 +137,22 @@ static int cmd_dedupe_ib_enable(int argc, char **argv) { NULL, 0, NULL, 0} }; - c = getopt_long(argc, argv, "s:b:a:l:m:", long_options, NULL); + c = getopt_long(argc, argv, "s:b:a:l:m:f", long_options, NULL); if (c < 0) break; switch (c) { case 's': - if (!strcasecmp("inmemory", optarg)) + if (!strcasecmp("inmemory", optarg)) { backend = BTRFS_DEDUPE_BACKEND_INMEMORY; - else { + backend_set = 1; + } else { error("unsupported dedupe backend: %s", optarg); exit(1); } break; case 'b': blocksize = parse_size(optarg); + blocksize_set = 1; break; case 'a': if (strcmp("sha256", optarg)) { @@ -226,26 +236,40 @@ static int cmd_dedupe_ib_enable(int argc, char **argv) return 1; } memset(&dargs, -1, sizeof(dargs)); - dargs.cmd = BTRFS_DEDUPE_CTL_ENABLE; - dargs.blocksize = blocksize; - dargs.hash_algo = hash_algo; - dargs.limit_nr = limit_nr; - dargs.limit_mem = limit_mem; - dargs.backend = backend; - if (force) - dargs.flags |= BTRFS_DEDUPE_FLAG_FORCE; - else - dargs.flags = 0; + if (reconf) { + dargs.cmd = BTRFS_DEDUPE_CTL_RECONF; + if (blocksize_set) + dargs.blocksize = blocksize; + if (hash_algo_set) + dargs.hash_algo = hash_algo; + if (backend_set) + dargs.backend = backend; + dargs.limit_nr = limit_nr; + dargs.limit_mem = limit_mem; + } else { + dargs.cmd = BTRFS_DEDUPE_CTL_ENABLE; + dargs.blocksize = blocksize; + dargs.hash_algo = hash_algo; + dargs.limit_nr =
[PATCH v10.3 1/5] btrfs-progs: Basic framework for dedupe-inband command group
From: Qu Wenruo Add basic ioctl header and command group framework for later use. Along with basic man page doc. Signed-off-by: Qu Wenruo Signed-off-by: Lu Fengqi --- Documentation/Makefile.in | 1 + Documentation/btrfs-dedupe-inband.asciidoc | 40 ++ Documentation/btrfs.asciidoc | 4 +++ Makefile | 3 +- btrfs.c| 2 ++ cmds-dedupe-ib.c | 35 +++ commands.h | 2 ++ dedupe-ib.h| 28 +++ ioctl.h| 36 +++ 9 files changed, 150 insertions(+), 1 deletion(-) create mode 100644 Documentation/btrfs-dedupe-inband.asciidoc create mode 100644 cmds-dedupe-ib.c create mode 100644 dedupe-ib.h diff --git a/Documentation/Makefile.in b/Documentation/Makefile.in index 184647c41940..402155fae001 100644 --- a/Documentation/Makefile.in +++ b/Documentation/Makefile.in @@ -28,6 +28,7 @@ MAN8_TXT += btrfs-qgroup.asciidoc MAN8_TXT += btrfs-replace.asciidoc MAN8_TXT += btrfs-restore.asciidoc MAN8_TXT += btrfs-property.asciidoc +MAN8_TXT += btrfs-dedupe-inband.asciidoc # Category 5 manual page MAN5_TXT += btrfs-man5.asciidoc diff --git a/Documentation/btrfs-dedupe-inband.asciidoc b/Documentation/btrfs-dedupe-inband.asciidoc new file mode 100644 index ..9ee2bc75db3a --- /dev/null +++ b/Documentation/btrfs-dedupe-inband.asciidoc @@ -0,0 +1,40 @@ +btrfs-dedupe(8) +== + +NAME + +btrfs-dedupe-inband - manage in-band (write time) de-duplication of a btrfs +filesystem + +SYNOPSIS + +*btrfs dedupe-inband* + +DESCRIPTION --- +*btrfs dedupe-inband* is used to enable/disable or show current in-band de-duplication +status of a btrfs filesystem. + +Kernel support for in-band de-duplication starts from 4.8. + +WARNING: In-band de-duplication is still an experimental feature of btrfs, +use with caution. + +SUBCOMMAND -- +Nothing yet + +EXIT STATUS --- +*btrfs dedupe-inband* returns a zero exit status if it succeeds. Non-zero is +returned in case of failure. + +AVAILABILITY + +*btrfs* is part of btrfs-progs. +Please refer to the btrfs wiki http://btrfs.wiki.kernel.org for +further details. 
+ +SEE ALSO + +`mkfs.btrfs`(8), diff --git a/Documentation/btrfs.asciidoc b/Documentation/btrfs.asciidoc index 7316ac094413..d37ae3571bd3 100644 --- a/Documentation/btrfs.asciidoc +++ b/Documentation/btrfs.asciidoc @@ -50,6 +50,10 @@ COMMANDS Do off-line check on a btrfs filesystem. + See `btrfs-check`(8) for details. +*dedupe*:: + Control btrfs in-band(write time) de-duplication. + + See `btrfs-dedupe`(8) for details. + *device*:: Manage devices managed by btrfs, including add/delete/scan and so on. + diff --git a/Makefile b/Makefile index 544410e6440c..1ebed7135714 100644 --- a/Makefile +++ b/Makefile @@ -123,7 +123,8 @@ cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \ cmds-restore.o cmds-rescue.o chunk-recover.o super-recover.o \ cmds-property.o cmds-fi-usage.o cmds-inspect-dump-tree.o \ cmds-inspect-dump-super.o cmds-inspect-tree-stats.o cmds-fi-du.o \ - mkfs/common.o check/mode-common.o check/mode-lowmem.o + mkfs/common.o check/mode-common.o check/mode-lowmem.o \ + cmds-dedupe-ib.o libbtrfs_objects = send-stream.o send-utils.o kernel-lib/rbtree.o btrfs-list.o \ kernel-lib/crc32c.o messages.o \ uuid-tree.o utils-lib.o rbtree-utils.o diff --git a/btrfs.c b/btrfs.c index 2d39f2ced3e8..2168f5a8bc7f 100644 --- a/btrfs.c +++ b/btrfs.c @@ -255,6 +255,8 @@ static const struct cmd_group btrfs_cmd_group = { { "quota", cmd_quota, NULL, &quota_cmd_group, 0 }, { "qgroup", cmd_qgroup, NULL, &qgroup_cmd_group, 0 }, { "replace", cmd_replace, NULL, &replace_cmd_group, 0 }, + { "dedupe-inband", cmd_dedupe_ib, NULL, &dedupe_ib_cmd_group, + 0 }, { "help", cmd_help, cmd_help_usage, NULL, 0 }, { "version", cmd_version, cmd_version_usage, NULL, 0 }, NULL_CMD_STRUCT diff --git a/cmds-dedupe-ib.c b/cmds-dedupe-ib.c new file mode 100644 index ..73c923a797da --- /dev/null +++ b/cmds-dedupe-ib.c @@ -0,0 +1,35 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2017 Fujitsu. All rights reserved. 
+ */ + +#include +#include +#include + +#include "ctree.h" +#include "ioctl.h" + +#include "commands.h" +#include "utils.h" +#include "kerncompat.h" +#include "dedupe-ib.h" + +static const char * const dedupe_ib_cmd_group_usage[] = { + "btrfs dedupe-inband [options] ", + NULL +}; + +static const char dedupe_ib_cmd_group_info[] = +"manage inband(write time)
[PATCH v10.3 2/5] btrfs-progs: dedupe: Add enable command for dedupe command group
From: Qu Wenruo Add enable subcommand for dedupe command group. Signed-off-by: Qu Wenruo Signed-off-by: Lu Fengqi --- Documentation/btrfs-dedupe-inband.asciidoc | 114 +- btrfs-completion | 6 +- cmds-dedupe-ib.c | 241 + ioctl.h| 2 + 4 files changed, 361 insertions(+), 2 deletions(-) diff --git a/Documentation/btrfs-dedupe-inband.asciidoc b/Documentation/btrfs-dedupe-inband.asciidoc index 9ee2bc75db3a..82f970a69953 100644 --- a/Documentation/btrfs-dedupe-inband.asciidoc +++ b/Documentation/btrfs-dedupe-inband.asciidoc @@ -22,7 +22,119 @@ use with caution. SUBCOMMAND -- -Nothing yet +*enable* [options] :: +Enable in-band de-duplication for a filesystem. ++ +`Options` ++ +-f|--force +Force 'enable' command to be executed. +Will skip memory limit check and allow 'enable' to be executed even if in-band +de-duplication is already enabled. ++ +NOTE: If re-enabling dedupe with the '-f' option, any unspecified parameter will be +reset to its default value. + +-s|--storage-backend +Specify de-duplication hash storage backend. +Only 'inmemory' backend is supported yet. +If not specified, default value is 'inmemory'. ++ +Refer to *BACKENDS* section for more information. + +-b|--blocksize +Specify dedupe block size. +Supported values are powers of 2 from '16K' to '8M'. +Default value is '128K'. ++ +Refer to *BLOCKSIZE* section for more information. + +-a|--hash-algorithm +Specify hash algorithm. +Only 'sha256' is supported yet. + +-l|--limit-hash +Specify maximum number of hashes stored in memory. +Only works for 'inmemory' backend. +Conflicts with '-m' option. ++ +Only positive values are valid. +Default value is '32K'. + +-m|--limit-memory +Specify maximum memory used for hashes. +Only works for 'inmemory' backend. +Conflicts with '-l' option. ++ +Only values larger than or equal to '1024' are valid. +No default value. ++ +NOTE: Memory limit will be rounded down to kernel internal hash size, +so the memory limit shown in 'btrfs dedupe status' may be different +from the . 
+ +WARNING: Too large a value for '-l' or '-m' will easily trigger OOM. +Please use with caution according to system memory. + +NOTE: In-band de-duplication is not compatible with compression yet. +And compression has higher priority than in-band de-duplication, means if +compression and de-duplication is enabled at the same time, only compression +will work. + +BACKENDS + +Btrfs in-band de-duplication will support different storage backends, with +different use cases and features. + +In-memory backend:: +This backend provides backward-compatibility, and more fine-tuning options. +But the hash pool is non-persistent and may exhaust kernel memory if not set up +properly. ++ +This backend can be used on old btrfs (without '-O dedupe' mkfs option). +When used on old btrfs, this backend needs to be enabled manually after mount. ++ +Designed for fast hash search speed, the in-memory backend will keep all dedupe +hashes in memory. (Although overall performance is still much the same with +'ondisk' backend if all 'ondisk' hash can be cached in memory) ++ +And it only keeps a limited number of hashes in memory to avoid exhausting memory. +Hashes over the limit will be dropped following least-recently-used (LRU) behavior. +So this backend has a consistent overhead for a given limit but can't ensure +all duplicated blocks will be de-duplicated. ++ +After umount and mount, the in-memory backend needs to refill its hash pool. + +On-disk backend:: +This backend provides a persistent hash pool, with smarter memory management +for the hash pool. +But it's not backward-compatible, meaning it must be used with '-O dedupe' mkfs +option and older kernels can't mount it read-write. ++ +Designed for de-duplication rate, the hash pool is stored as a btrfs B+ tree on disk. +This behavior may cause extra disk IO for hash search under high memory +pressure. ++ +After umount and mount, the on-disk backend still has its hashes on disk, no need to +refill its dedupe hash pool. 
+ +Currently, only 'inmemory' backend is supported in btrfs-progs. + +DEDUPE BLOCK SIZE + +In-band de-duplication is done at dedupe block size. +Any data smaller than dedupe block size won't go through in-band +de-duplication. + +And dedupe block size affects dedupe rate and fragmentation heavily. + +Smaller block size will cause more fragments, but higher dedupe rate. + +Larger block size will cause less fragments, but lower dedupe rate. + +In-band de-duplication rate is highly related to the workload pattern. +So it's highly recommended to align dedupe block size to the workload +block size to make full use of de-duplication. EXIT STATUS --- diff --git a/btrfs-completion b/btrfs-completion index ae683f4ecf61..69e02ad11990 100644 --- a/btrfs-completion +++ b/btrfs-completion @@ -29,7 +29,7 @@ _btrfs() local cmd=${words[1]} -
[PATCH v10.3 0/5] In-band de-duplication for btrfs-progs
Patchset can be fetched from github: https://github.com/littleroad/btrfs-progs.git dedupe_latest

Inband dedupe (in-memory backend only) ioctl support for btrfs-progs.

v7 changes: Update ctree.h to follow kernel structure change. Update print-tree to follow kernel structure change.
V8 changes: Move dedup props and on-disk backend support out of the patchset. Change command group name to "dedupe-inband", to avoid confusion with possible out-of-band dedupe. Suggested by Mark. Rebase to latest devel branch.
V9 changes: Follow the kernel's ioctl change to support FORCE flag, new reconf ioctl, and more precise error reporting.
v10 changes: Rebase to v4.10. Add BUILD_ASSERT for btrfs_ioctl_dedupe_args.
v10.1 changes: Rebase to v4.14.
v10.2 changes: Rebase to v4.16.1.
v10.3 changes: Rebase to v4.17.

Qu Wenruo (5):
  btrfs-progs: Basic framework for dedupe-inband command group
  btrfs-progs: dedupe: Add enable command for dedupe command group
  btrfs-progs: dedupe: Add disable support for inband deduplication
  btrfs-progs: dedupe: Add status subcommand
  btrfs-progs: dedupe: introduce reconfigure subcommand

 Documentation/Makefile.in                  |   1 +
 Documentation/btrfs-dedupe-inband.asciidoc | 167 
 Documentation/btrfs.asciidoc               |   4 +
 Makefile                                   |   3 +-
 btrfs-completion                           |   6 +-
 btrfs.c                                    |   2 +
 cmds-dedupe-ib.c                           | 442 +
 commands.h                                 |   2 +
 dedupe-ib.h                                |  28 ++
 ioctl.h                                    |  38 ++
 10 files changed, 691 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/btrfs-dedupe-inband.asciidoc
 create mode 100644 cmds-dedupe-ib.c
 create mode 100644 dedupe-ib.h
-- 
2.18.0
Re: Transaction aborted (error -28) btrfs_run_delayed_refs*0x163/0x190
On 07/10/2018 09:38 AM, Martin Raiber wrote:
> This is probably a known issue. See
> https://www.spinics.net/lists/linux-btrfs/msg75647.html
> You could apply the patch in this thread and mount with enospc_debug to
> confirm it is the same issue.

OK, I've applied the patch, by hand, and hopefully put it in the right place. Need to learn to patch better.

Booted the (rebuilt with make ; make modules_install syslinux etc) kernel with the option enospc_debug for the two btrfs file systems (1st entry for each in fstab). I was not expecting to get the issue to appear quickly as it took several days to hit previously. However, on checking I see another error, not sure if it is related, still is in extent-tree.c.

https://drive.google.com/file/d/1K12MfpWFB1aHSXBga1Rym5terbmHeDfI/view?usp=sharing

Pete
Re: Corrupted FS with "open_ctree failed" and "failed to recover balance: -5"
On Wed, Jul 11, 2018 at 9:37 AM, Udo Waechter wrote:
> Hello everyone,
>
> I have a corrupted filesystem which I can't seem to recover.
>
> The machine is:
> Debian Linux, kernel 4.9 and btrfs-progs v4.13.3
>
> I have a HDD RAID5 with LVM and the volume in question is a LVM volume.
> On top of that I had a RAID1 SSD cache with lvm-cache.
>
> Yesterday both! SSDs died within minutes. This led to the corrupted
> filesystem that I have now.
>
> I hope I followed the procedure correctly.
>
> What I tried so far:
> * "mount -o usebackuproot,ro " and "nospace_cache" "clear_cache" and all
> permutations of these mount options
>
> I'm getting:
>
> [96926.830400] BTRFS info (device dm-2): trying to use backup root at
> mount time
> [96926.830406] BTRFS info (device dm-2): disk space caching is enabled
> [96926.927978] BTRFS error (device dm-2): parent transid verify failed
> on 321269628928 wanted 3276017 found 3275985
> [96926.938619] BTRFS error (device dm-2): parent transid verify failed
> on 321269628928 wanted 3276017 found 3275985
> [96926.940705] BTRFS error (device dm-2): failed to recover balance: -5
> [96926.985801] BTRFS error (device dm-2): open_ctree failed
>
> The weird thing is that I can't really find information about the
> "failed to recover balance: -5" error. - There was no rebalancing
> running during the crash.
>
> * btrfs-find-root: https://pastebin.com/qkjnSUF7 - It bothers me that I
> don't see any "good generations" as described here:
> https://btrfs.wiki.kernel.org/index.php/Restore
>
> * "btrfs rescue" - it starts, then goes to "looping on XYZ" then stops
>
> * "btrfs rescue super-recover -v" gives:
>
> All Devices:
> Device: id = 1, name = /dev/vg00/...
> Before Recovering:
> [All good supers]:
> device name = /dev/vg00/...
> superblock bytenr = 65536
>
> device name = /dev/vg00/...
> superblock bytenr = 67108864
>
> device name = /dev/vg00/...
> superblock bytenr = 274877906944
>
> [All bad supers]:
>
> All supers are valid, no need to recover
>
> * Unfortunately I did a "btrfs rescue zero-log" at some point :( - As it
> turns out that might have been a bad idea
>
> * Also, a "btrfs check --init-extent-tree" - https://pastebin.com/jATDCFZy
>
> The volume contained qcow2 images for VMs. I need only one of those,
> since one piece of important software decided to not do backups :(
>
> Any help is highly appreciated.

You should ask for help sooner. It's much harder to give advice after you've modified the file system multiple times since the original problem happened. But maybe someone has ideas on the way forward, other than 'btrfs restore' which is the offline scrape tool.
https://btrfs.wiki.kernel.org/index.php/Restore

There's a bunch of fixes between btrfs-progs 4.13 and 4.17, which is now current.

But anyway with lvmcache and the SSDs dying, it sounds like there are too many transaction commits to Btrfs that are lost in the failed lvmcache.

Also, gmail considers your email phishing. So something with your mail is misconfigured for use on lists.

"This message has a from address in zoide.net but has failed zoide.net's required tests for authentication. Learn more"

My best guess from the header is that dmarc is set by your email provider to fail, and while many mail clients ignore this, Google honors it. And it's the dmarc fail that makes it incompatible with email lists because lists always rewrite the email posting (they add footers and rewrite headers).
Authentication-Results: mx.google.com; dkim=neutral (body hash did not verify) header.i=@zoide.net header.s=mx header.b=vATMNdwx; spf=pass (google.com: best guess record for domain of linux-btrfs-ow...@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-btrfs-ow...@vger.kernel.org; dmarc=fail (p=REJECT sp=REJECT dis=QUARANTINE) header.from=zoide.net

-- 
Chris Murphy
Re: btrfs check mode normal still hard crash-hanging systems
On Wed, Jul 11, 2018 at 11:09:56AM -0600, Chris Murphy wrote:
> On Tue, Jul 10, 2018 at 12:09 PM, Marc MERLIN wrote:
> > Thanks to Su and Qu, I was able to get my filesystem to a point that
> > it's mountable.
> > I then deleted loads of snapshots and I'm down to 26.
> >
> > It now looks like this:
> > gargamel:~# btrfs fi show /mnt/mnt
> > Label: 'dshelf2' uuid: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d
> > Total devices 1 FS bytes used 12.30TiB
> > devid 1 size 14.55TiB used 13.81TiB path /dev/mapper/dshelf2
> >
> > gargamel:~# btrfs fi df /mnt/mnt
> > Data, single: total=13.57TiB, used=12.19TiB
> > System, DUP: total=32.00MiB, used=1.55MiB
> > Metadata, DUP: total=124.50GiB, used=115.62GiB
> > Metadata, single: total=216.00MiB, used=0.00B
> > GlobalReserve, single: total=512.00MiB, used=0.00B
> >
> > Problems
> > 1) btrfs check --repair _still_ takes all 32GB of RAM and crashes the
> > server, despite my deleting lots of snapshots.
> > Is it because I have too many files then?
>
> I think the original mode needs most of the metadata in memory.
>
> I'm not understanding why btrfs check won't use swap like at least
> xfs_repair and pretty sure e2fsck will as well.
>
> Using 128G swap on nvme with original check is still gonna be faster
> than lowmem mode.

Yeah, that's been also a concern/question of mine all these years, even if Su isn't working on that code, and likely is the wrong person to ask.

Personally, my take is that if btrfs wants to be taken seriously, at the very least its fsck tool should not hard crash a system you run it on. (and it really does the worst kind of hard crash I've ever seen, OOM can't trigger fast enough, linux doesn't panic, so it can't self reboot either, it just hard dies and hangs)

Maybe David knows?

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
Re: btrfs check lowmem, take 2
On Tue, Jul 10, 2018 at 12:09 PM, Marc MERLIN wrote:
> Thanks to Su and Qu, I was able to get my filesystem to a point that
> it's mountable.
> I then deleted loads of snapshots and I'm down to 26.
>
> It now looks like this:
> gargamel:~# btrfs fi show /mnt/mnt
> Label: 'dshelf2' uuid: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d
> Total devices 1 FS bytes used 12.30TiB
> devid 1 size 14.55TiB used 13.81TiB path /dev/mapper/dshelf2
>
> gargamel:~# btrfs fi df /mnt/mnt
> Data, single: total=13.57TiB, used=12.19TiB
> System, DUP: total=32.00MiB, used=1.55MiB
> Metadata, DUP: total=124.50GiB, used=115.62GiB
> Metadata, single: total=216.00MiB, used=0.00B
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> Problems
> 1) btrfs check --repair _still_ takes all 32GB of RAM and crashes the
> server, despite my deleting lots of snapshots.
> Is it because I have too many files then?

I think the original mode needs most of the metadata in memory.

I'm not understanding why btrfs check won't use swap like at least xfs_repair and pretty sure e2fsck will as well.

Using 128G swap on nvme with original check is still gonna be faster than lowmem mode.

-- 
Chris Murphy
[PATCH v2] btrfs: use customized batch size for total_bytes_pinned
In commit b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly pinned bytes") we use total_bytes_pinned to track how many bytes we are going to free in this transaction. When we are close to ENOSPC, we check it and know if we can make the allocation by committing the current transaction. For every data/metadata extent we are going to free, we add total_bytes_pinned in btrfs_free_extent() and btrfs_free_tree_block(), and release it in unpin_extent_range() when we finish the transaction. So this is a variable we frequently update but rarely read - just the suitable use of percpu_counter.

But in the previous commit we update total_bytes_pinned with the default batch size of 32, making every update essentially a spin lock protected update. Since every spin lock/unlock operation involves syncing a globally used variable and some kind of barrier in an SMP system, this is more expensive than using total_bytes_pinned as a simple atomic64_t.

So fix this by using a customized batch size. Since we only read total_bytes_pinned when we are close to ENOSPC and fail to alloc a new chunk, we can use a really large batch size and have nearly no penalty in most cases.

[Test]
We test the patch on a 4-core x86 machine:
1. fallocate a 16GiB size test file.
2. take a snapshot (so all following writes will be cow writes).
3. run a 180 sec, 4 jobs, 4K random write fio on the test file.

We also add a temporary lockdep class on percpu_counter's spin lock used by total_bytes_pinned to track lock_stat.
[Results]

unpatched:
lock_stat version 0.4
---
class name  con-bounces  contentions  waittime-min  waittime-max  waittime-total  waittime-avg  acq-bounces  acquisitions  holdtime-min  holdtime-max  holdtime-total  holdtime-avg
total_bytes_pinned_percpu:  82  82  0.21  0.61  29.46  0.36  298340  635973  0.09  11.01  173476.25  0.27

patched:
lock_stat version 0.4
---
class name  con-bounces  contentions  waittime-min  waittime-max  waittime-total  waittime-avg  acq-bounces  acquisitions  holdtime-min  holdtime-max  holdtime-total  holdtime-avg
total_bytes_pinned_percpu:  1  1  0.62  0.62  0.62  0.62  13601  31542  0.14  9.61  11016.90  0.35

[Analysis]
Since the spin lock only protects a single in-memory variable, the contentions (number of lock acquisitions that had to wait) in both unpatched and patched versions are low. But when we look at acquisitions and acq-bounces, we get much lower counts in the patched version. Here the most important metric is acq-bounces. It means how many times the lock gets transferred between different cpus, so the patch can really reduce cacheline bouncing of the spin lock (also the global counter of percpu_counter) in an SMP system.

Fixes: b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly pinned bytes")
Signed-off-by: Ethan Lien
---
V2: Rewrite commit comments. Add lock_stat test. Pull dirty_metadata_bytes out to a separate patch.

 fs/btrfs/ctree.h       |  1 +
 fs/btrfs/extent-tree.c | 46 --
 2 files changed, 32 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 118346aceea9..df682a521635 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -422,6 +422,7 @@ struct btrfs_space_info {
 	 * time the transaction commits.
 	 */
 	struct percpu_counter total_bytes_pinned;
+	s32 total_bytes_pinned_batch;
 	struct list_head list;
 	/* Protected by the spinlock 'lock'.
	 */

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3d9fe58c0080..937113534ef4 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -758,7 +758,8 @@ static void add_pinned_bytes(struct btrfs_fs_info *fs_info, s64 num_bytes,
 
 	space_info = __find_space_info(fs_info, flags);
 	ASSERT(space_info);
-	percpu_counter_add(&space_info->total_bytes_pinned, num_bytes);
+	percpu_counter_add_batch(&space_info->total_bytes_pinned, num_bytes,
+				 space_info->total_bytes_pinned_batch);
 }
 
 /*
@@ -2598,8 +2599,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
 		flags = BTRFS_BLOCK_GROUP_METADATA;
 	space_info = __find_space_info(fs_info, flags);
 	ASSERT(space_info);
-	percpu_counter_add(&space_info->total_bytes_pinned,
-			   -head->num_bytes);
+	percpu_counter_add_batch(&space_info->total_bytes_pinned,
+				 -head->num_bytes,
+
Corrupted FS with "open_ctree failed" and "failed to recover balance: -5"
Hello everyone,

I have a corrupted filesystem which I can't seem to recover.

The machine is: Debian Linux, kernel 4.9 and btrfs-progs v4.13.3

I have a HDD RAID5 with LVM and the volume in question is an LVM volume.
On top of that I had a RAID1 SSD cache with lvm-cache. Yesterday both(!)
SSDs died within minutes. This led to the corrupted filesystem that I
have now. I hope I followed the procedure correctly.

What I tried so far:

* "mount -o usebackuproot,ro " and "nospace_cache", "clear_cache" and
  all permutations of these mount options. I'm getting:

[96926.830400] BTRFS info (device dm-2): trying to use backup root at mount time
[96926.830406] BTRFS info (device dm-2): disk space caching is enabled
[96926.927978] BTRFS error (device dm-2): parent transid verify failed on 321269628928 wanted 3276017 found 3275985
[96926.938619] BTRFS error (device dm-2): parent transid verify failed on 321269628928 wanted 3276017 found 3275985
[96926.940705] BTRFS error (device dm-2): failed to recover balance: -5
[96926.985801] BTRFS error (device dm-2): open_ctree failed

  The weird thing is that I can't really find information about the
  "failed to recover balance: -5" error. There was no rebalance running
  during the crash.

* btrfs-find-root: https://pastebin.com/qkjnSUF7 - It bothers me that I
  don't see any "good generations" as described here:
  https://btrfs.wiki.kernel.org/index.php/Restore

* "btrfs rescue" - it starts, then goes to "looping on XYZ" then stops

* "btrfs rescue super-recover -v" gives:
All Devices:
	Device: id = 1, name = /dev/vg00/...
Before Recovering:
	[All good supers]:
		device name = /dev/vg00/...
		superblock bytenr = 65536
		device name = /dev/vg00/...
		superblock bytenr = 67108864
		device name = /dev/vg00/...
		superblock bytenr = 274877906944
	[All bad supers]:

All supers are valid, no need to recover

* Unfortunately I did a "btrfs rescue zero-log" at some point :( - As it
  turns out that might have been a bad idea

* Also, a "btrfs check --init-extent-tree" - https://pastebin.com/jATDCFZy

The volume contained qcow2 images for VMs. I need only one of those,
since one piece of important software decided to not do backups :(

Any help is highly appreciated.

Many thanks, udo.
Re: [PATCH v3 2/2] btrfs: get fs_devices pointer form btrfs_scan_one_device
On 07/11/2018 09:22 AM, Gu Jinxiang wrote:

> Instead of pointer to btrfs_fs_devices as an arg in
> btrfs_scan_one_device, better to make it as a return value.

Yep, this was in the list to fix. However I didn't like the idea of
returning the btrfs_fs_devices pointer; instead return the btrfs_device
pointer, so that we can still retrieve its fs_devices.

Thanks, Anand

> Signed-off-by: Gu Jinxiang 
> ---
> Changelog:
> v3: as commented by the robot, use PTR_ERR_OR_ZERO, and rebase to misc-next.
> v2: as commented by Nikolay, use ERR_CAST instead of casting the type manually.
>
>  fs/btrfs/super.c   | 29 ++---
>  fs/btrfs/volumes.c | 14 +++---
>  fs/btrfs/volumes.h |  4 ++--
>  3 files changed, 27 insertions(+), 20 deletions(-)
>
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 78b5d51c7bc7..20e1ee338a95 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -916,11 +916,13 @@ static int btrfs_parse_early_options(const char *options, fmode_t flags,
>  			error = -ENOMEM;
>  			goto out;
>  		}
> -		error = btrfs_scan_one_device(device_name,
> -				flags, holder, &fs_devices);
> +		fs_devices = btrfs_scan_one_device(device_name,
> +				flags, holder);
>  		kfree(device_name);
> -		if (error)
> +		if (IS_ERR(fs_devices)) {
> +			error = PTR_ERR(fs_devices);
>  			goto out;
> +		}
>  		}
>  	}
>
> @@ -1537,9 +1539,11 @@ static struct dentry *btrfs_mount_root(struct file_system_type *fs_type,
>  		return ERR_PTR(error);
>  	}
>
> -	error = btrfs_scan_one_device(device_name, mode, fs_type, &fs_devices);
> -	if (error)
> +	fs_devices = btrfs_scan_one_device(device_name, mode, fs_type);
> +	if (IS_ERR(fs_devices)) {
> +		error = PTR_ERR(fs_devices);
>  		goto error_sec_opts;
> +	}
>
>  	/*
>  	 * Setup a dummy root and fs_info for test/set super. This is because
> @@ -2220,7 +2224,7 @@ static long btrfs_control_ioctl(struct file *file, unsigned int cmd,
>  		unsigned long arg)
>  {
>  	struct btrfs_ioctl_vol_args *vol;
> -	struct btrfs_fs_devices *fs_devices;
> +	struct btrfs_fs_devices *fs_devices = NULL;
>  	int ret = -ENOTTY;
>
>  	if (!capable(CAP_SYS_ADMIN))
> @@ -2232,14 +2236,17 @@ static long btrfs_control_ioctl(struct file *file, unsigned int cmd,
>  	switch (cmd) {
>  	case BTRFS_IOC_SCAN_DEV:
> -		ret = btrfs_scan_one_device(vol->name, FMODE_READ,
> -					    &btrfs_root_fs_type, &fs_devices);
> +		fs_devices = btrfs_scan_one_device(vol->name, FMODE_READ,
> +						   &btrfs_root_fs_type);
> +		ret = PTR_ERR_OR_ZERO(fs_devices);
>  		break;
>  	case BTRFS_IOC_DEVICES_READY:
> -		ret = btrfs_scan_one_device(vol->name, FMODE_READ,
> -					    &btrfs_root_fs_type, &fs_devices);
> -		if (ret)
> +		fs_devices = btrfs_scan_one_device(vol->name, FMODE_READ,
> +						   &btrfs_root_fs_type);
> +		if (IS_ERR(fs_devices)) {
> +			ret = PTR_ERR(fs_devices);
>  			break;
> +		}
>  		ret = !(fs_devices->num_devices == fs_devices->total_devices);
>  		break;
>  	case BTRFS_IOC_GET_SUPPORTED_FEATURES:
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index af2704de9ff9..6a6321e41f1b 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1212,14 +1212,14 @@ static int btrfs_read_disk_super(struct block_device *bdev, u64 bytenr,
>   * and we are not allowed to call set_blocksize during the scan. The superblock
>   * is read via pagecache
>   */
> -int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
> -			  struct btrfs_fs_devices **fs_devices_ret)
> +struct btrfs_fs_devices *btrfs_scan_one_device(const char *path, fmode_t flags,
> +					       void *holder)
>  {
>  	struct btrfs_super_block *disk_super;
>  	struct btrfs_device *device;
>  	struct block_device *bdev;
>  	struct page *page;
> -	int ret = 0;
> +	struct btrfs_fs_devices *ret = NULL;
>  	u64 bytenr;
>
>  	/*
> @@ -1233,19 +1233,19 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
>  	bdev = blkdev_get_by_path(path, flags, holder);
>  	if (IS_ERR(bdev))
> -		return PTR_ERR(bdev);
> +		return ERR_CAST(bdev);
>
>  	if (btrfs_read_disk_super(bdev, bytenr, &page, &disk_super)) {
> -		ret = -EINVAL;
> +		ret = ERR_PTR(-EINVAL);
>  		goto error_bdev_put;
>  	}
>
>  	mutex_lock(&uuid_mutex);
>  	device = device_list_add(path,
Re: [PATCH v3 1/2] btrfs: make fs_devices to be a local variable
On 07/11/2018 09:22 AM, Gu Jinxiang wrote:

> fs_devices is always passed to btrfs_scan_one_device which overrides
> it. And in the call stack below fs_devices is passed to
> btrfs_scan_one_device from btrfs_mount_root. And in btrfs_mount_root
> the output fs_devices of this call stack is not used.
>   btrfs_mount_root
>   -> btrfs_parse_early_options
>      -> btrfs_scan_one_device
> So, there is no need to pass fs_devices from btrfs_mount_root; using a
> local variable in btrfs_parse_early_options is enough.
>
> Signed-off-by: Gu Jinxiang 

Other than the two nits below,
Reviewed-by: Anand Jain 

> ---
> Changelog:
> v3: rebase to misc-next.
> v2: deal with Nikolay's comment, make the changelog clearer.
>
>  fs/btrfs/super.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index e04bcf0b0ed4..78b5d51c7bc7 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -884,11 +884,12 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
>   * only when we need to allocate a new super block.
>   */
>  static int btrfs_parse_early_options(const char *options, fmode_t flags,
> -		void *holder, struct btrfs_fs_devices **fs_devices)
> +		void *holder)

While here pls indent the 2nd line argument to be below the
const char *options.

> {
>  	substring_t args[MAX_OPT_ARGS];
>  	char *device_name, *opts, *orig, *p;
>  	int error = 0;
> +	struct btrfs_fs_devices *fs_devices = NULL;

It's a good idea to align the declarations to avoid space wastage:

	char *device_name, *opts, *orig, *p;
+	struct btrfs_fs_devices *fs_devices = NULL;
	int error = 0;

Thanks, Anand

>  	if (!options)
>  		return 0;
> @@ -916,7 +917,7 @@ static int btrfs_parse_early_options(const char *options, fmode_t flags,
>  			goto out;
>  		}
>  		error = btrfs_scan_one_device(device_name,
> -				flags, holder, fs_devices);
> +				flags, holder, &fs_devices);
>  		kfree(device_name);
>  		if (error)
>  			goto out;
> @@ -1524,8 +1525,7 @@ static struct dentry *btrfs_mount_root(struct file_system_type *fs_type,
>  	if (!(flags & SB_RDONLY))
>  		mode |= FMODE_WRITE;
>
> -	error = btrfs_parse_early_options(data, mode, fs_type,
> -					  &fs_devices);
> +	error = btrfs_parse_early_options(data, mode, fs_type);
>  	if (error) {
>  		return ERR_PTR(error);
>  	}

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PULL] volume and uuid_mutex cleanups
Hi David,

Here I have put together a set of volume related patches which were sent
to the ML as independent patches earlier. These have been reviewed and
tested. Please pull.

  g...@github.com:asj/btrfs-devel.git misc-next-for-kdave

[Anand:2]
6049bd5e9694 btrfs: add helper function check device delete able
8c96747831b0 btrfs: add helper btrfs_num_devices() to deduce num_devices
dd61850ee7cf btrfs: warn for num_devices below 0
17c285ada2e4 btrfs: use the assigned fs_devices instead of the dereference
e2f7c8a0f67b btrfs: do device clone using the btrfs_scan_one_device
89325c85d655 btrfs: fix race between free_stale_devices and close_fs_devices
0dfd68121520 btrfs: drop uuid_mutex in btrfs_free_extra_devids()

[David:2]
6fa6985bd169 btrfs: fix mount and ioctl device scan ioctl race
e9f25a7b239d btrfs: reorder initialization before the mount locks uuid_mutex
2c5058cdf788 btrfs: lift uuid_mutex to callers of btrfs_parse_early_options
8ffc96e797bb btrfs: lift uuid_mutex to callers of btrfs_open_devices
39a2036c1d13 btrfs: lift uuid_mutex to callers of btrfs_scan_one_device

[Anand:1]
e735e867d314 btrfs: fix btrfs_free_stale_devices() with needed locks
bdc6cc879388 btrfs: btrfs_free_stale_devices() rename local variables
0dd1ff5cc6be btrfs: fix device_list_add() missing device_list_mutex()
622f0a7c31fe btrfs: do btrfs_free_stale_devices() outside of device_list_add()

[David:1]
7302fc024079 btrfs: restore uuid_mutex in btrfs_open_devices

[Nikolay]
4da856347110 btrfs: drop pending list in device close

Thanks, Anand
About hung task on generic/041
Hi,

When I run generic/041 with v4.18-rc3 (kasan and hung task detection
turned on), the btrfs-transaction kthread will trigger the hung task
timeout (stalled at wait_event in btrfs_commit_transaction). At the same
time, you can see that xfs_io -c fsync occupies 100% of the CPU. I am
not sure whether this is a problem. Any suggestion?

[Wed Jul 11 15:50:08 2018] INFO: task btrfs-transacti:1053 blocked for more than 120 seconds.
[Wed Jul 11 15:50:08 2018]       Not tainted 4.18.0-rc3-custom #14
[Wed Jul 11 15:50:08 2018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Wed Jul 11 15:50:08 2018] btrfs-transacti D    0  1053      2 0x8000
[Wed Jul 11 15:50:08 2018] Call Trace:
[Wed Jul 11 15:50:08 2018]  ? __schedule+0x5b2/0x1380
[Wed Jul 11 15:50:08 2018]  ? check_flags.part.23+0x240/0x240
[Wed Jul 11 15:50:08 2018]  ? firmware_map_remove+0x187/0x187
[Wed Jul 11 15:50:08 2018]  ? ___preempt_schedule+0x16/0x18
[Wed Jul 11 15:50:08 2018]  ? mark_held_locks+0x6e/0x90
[Wed Jul 11 15:50:08 2018]  ? _raw_spin_unlock_irqrestore+0x59/0x70
[Wed Jul 11 15:50:08 2018]  ? preempt_count_sub+0x14/0xc0
[Wed Jul 11 15:50:08 2018]  ? _raw_spin_unlock_irqrestore+0x46/0x70
[Wed Jul 11 15:50:08 2018]  ? prepare_to_wait_event+0x191/0x410
[Wed Jul 11 15:50:08 2018]  ? prepare_to_wait_exclusive+0x210/0x210
[Wed Jul 11 15:50:08 2018]  ? print_usage_bug+0x3a0/0x3a0
[Wed Jul 11 15:50:08 2018]  ? do_raw_spin_unlock+0x10f/0x1e0
[Wed Jul 11 15:50:08 2018]  ? do_raw_spin_trylock+0x120/0x120
[Wed Jul 11 15:50:08 2018]  schedule+0xca/0x260
[Wed Jul 11 15:50:08 2018]  ? rcu_lockdep_current_cpu_online+0x12b/0x160
[Wed Jul 11 15:50:08 2018]  ? __schedule+0x1380/0x1380
[Wed Jul 11 15:50:08 2018]  ? ___might_sleep+0x126/0x370
[Wed Jul 11 15:50:08 2018]  ? init_wait_entry+0xc7/0x100
[Wed Jul 11 15:50:08 2018]  ? __wake_up_locked_key_bookmark+0x20/0x20
[Wed Jul 11 15:50:08 2018]  ? __btrfs_run_delayed_items+0x1e5/0x280 [btrfs]
[Wed Jul 11 15:50:08 2018]  ? __might_sleep+0x31/0xd0
[Wed Jul 11 15:50:08 2018]  btrfs_commit_transaction+0x122a/0x1640 [btrfs]
[Wed Jul 11 15:50:08 2018]  ? btrfs_apply_pending_changes+0x90/0x90 [btrfs]
[Wed Jul 11 15:50:08 2018]  ? wait_woken+0x150/0x150
[Wed Jul 11 15:50:08 2018]  ? ret_from_fork+0x27/0x50
[Wed Jul 11 15:50:08 2018]  ? ret_from_fork+0x27/0x50
[Wed Jul 11 15:50:08 2018]  ? deref_stack_reg+0xe0/0xe0
[Wed Jul 11 15:50:08 2018]  ? __module_text_address+0x63/0xa0
[Wed Jul 11 15:50:08 2018]  ? preempt_count_sub+0x14/0xc0
[Wed Jul 11 15:50:08 2018]  ? transaction_kthread+0x161/0x240 [btrfs]
[Wed Jul 11 15:50:08 2018]  ? is_module_text_address+0x2b/0x50
[Wed Jul 11 15:50:08 2018]  ? transaction_kthread+0x1d9/0x240 [btrfs]
[Wed Jul 11 15:50:08 2018]  ? kernel_text_address+0x5a/0x100
[Wed Jul 11 15:50:08 2018]  ? deactivate_slab.isra.27+0x64f/0x7a0
[Wed Jul 11 15:50:08 2018]  ? __save_stack_trace+0x82/0x100
[Wed Jul 11 15:50:08 2018]  ? kasan_kmalloc+0x142/0x170
[Wed Jul 11 15:50:08 2018]  ? kmem_cache_alloc+0xfc/0x2e0
[Wed Jul 11 15:50:08 2018]  ? start_transaction+0x596/0x930 [btrfs]
[Wed Jul 11 15:50:08 2018]  ? transaction_kthread+0x1d9/0x240 [btrfs]
[Wed Jul 11 15:50:08 2018]  ? kthread+0x1b9/0x1e0
[Wed Jul 11 15:50:08 2018]  ? ret_from_fork+0x27/0x50
[Wed Jul 11 15:50:08 2018]  ? deactivate_slab.isra.27+0x64f/0x7a0
[Wed Jul 11 15:50:08 2018]  ? mark_lock+0x149/0xa80
[Wed Jul 11 15:50:08 2018]  ? init_object+0x6b/0x80
[Wed Jul 11 15:50:08 2018]  ? print_usage_bug+0x3a0/0x3a0
[Wed Jul 11 15:50:08 2018]  ? ___slab_alloc+0x62a/0x690
[Wed Jul 11 15:50:08 2018]  ? ___slab_alloc+0x62a/0x690
[Wed Jul 11 15:50:08 2018]  ? __lock_is_held+0x8c/0xe0
[Wed Jul 11 15:50:08 2018]  ? start_transaction+0x596/0x930 [btrfs]
[Wed Jul 11 15:50:08 2018]  ? preempt_count_sub+0x14/0xc0
[Wed Jul 11 15:50:08 2018]  ? rcu_lockdep_current_cpu_online+0x12b/0x160
[Wed Jul 11 15:50:08 2018]  ? rcu_oom_callback+0x40/0x40
[Wed Jul 11 15:50:08 2018]  ? __lock_is_held+0x8c/0xe0
[Wed Jul 11 15:50:08 2018]  ? start_transaction+0x596/0x930 [btrfs]
[Wed Jul 11 15:50:08 2018]  ? rcu_read_lock_sched_held+0x8f/0xa0
[Wed Jul 11 15:50:08 2018]  ? btrfs_record_root_in_trans+0x1f/0xa0 [btrfs]
[Wed Jul 11 15:50:08 2018]  ? start_transaction+0x26b/0x930 [btrfs]
[Wed Jul 11 15:50:08 2018]  ? btrfs_commit_transaction+0x1640/0x1640 [btrfs]
[Wed Jul 11 15:50:08 2018]  ? check_flags.part.23+0x240/0x240
[Wed Jul 11 15:50:08 2018]  ? lock_downgrade+0x380/0x380
[Wed Jul 11 15:50:08 2018]  ? do_raw_spin_unlock+0x10f/0x1e0
[Wed Jul 11 15:50:08 2018]  ? do_raw_spin_unlock+0x10f/0x1e0
[Wed Jul 11 15:50:08 2018]  ? do_raw_spin_trylock+0x120/0x120
[Wed Jul 11 15:50:08 2018]  transaction_kthread+0x219/0x240 [btrfs]
[Wed Jul 11 15:50:08 2018]  ? btrfs_cleanup_transaction+0x6f0/0x6f0 [btrfs]
[Wed Jul 11 15:50:08 2018]  kthread+0x1b9/0x1e0
[Wed Jul 11 15:50:08 2018]  ? kthread_flush_work_fn+0x10/0x10
[Wed Jul 11 15:50:08 2018]  ret_from_fork+0x27/0x50
[Wed Jul 11 15:50:08 2018]
Showing all locks held in the system:
[DOC] BTRFS Volume operations, Device Lists and Locks all in one page
BTRFS Volume operations, Device Lists and Locks all in one page:

Devices are managed in two contexts: the scan context and the mounted
context. In the scan context the threads originate from the
btrfs_control ioctl, and in the mounted context the threads originate
from the mount point ioctl.

Apart from these two contexts, there can also be two transient states
where devices are transitioning from the scan to the mount context or
from the mount to the scan context.

Device List and Locks:-

Count: btrfs_fs_devices::num_devices
List : btrfs_fs_devices::devices -> btrfs_devices::dev_list
Lock : btrfs_fs_devices::device_list_mutex

Count: btrfs_fs_devices::rw_devices
List : btrfs_fs_devices::alloc_list -> btrfs_devices::dev_alloc_list
Lock : btrfs_fs_info::chunk_mutex

Lock: set_bit btrfs_fs_info::flags::BTRFS_FS_EXCL_OP

FSID List and Lock:-

Count : None
HEAD  : Global::fs_uuids -> btrfs_fs_devices::fs_list
Lock  : Global::uuid_mutex

After the fs_devices is mounted, btrfs_fs_devices::opened > 0.

In the scan context we have the following device operations.

Device SCAN:- creates the btrfs_fs_devices and its corresponding
btrfs_device entries; also checks for and frees duplicate device
entries.
	Lock: uuid_mutex
	SCAN
	if (found_duplicate && btrfs_fs_devices::opened == 0)
		Free_duplicate
	Unlock: uuid_mutex

Device READY:- checks if the volume is ready. Also does an implicit
scan and duplicate device free as in Device SCAN.
	Lock: uuid_mutex
	SCAN
	if (found_duplicate && btrfs_fs_devices::opened == 0)
		Free_duplicate
	Check READY
	Unlock: uuid_mutex

Device FORGET:- (planned) frees a given device, or all unmounted
devices and empty fs_devices if any.
	Lock: uuid_mutex
	if (found_duplicate && btrfs_fs_devices::opened == 0)
		Free duplicate
	Unlock: uuid_mutex

Device mount operation -> a transient state leading to the mounted
context
	Lock: uuid_mutex
	Find, SCAN, btrfs_fs_devices::opened++
	Unlock: uuid_mutex

Device umount operation -> a transient state leading to the unmounted
context or scan context
	Lock: uuid_mutex
	btrfs_fs_devices::opened--
	Unlock: uuid_mutex

In the mounted context we have the following device operations.

Device Rename through SCAN:- a special case where the device path gets
renamed after it has been mounted. (Ubuntu changes the boot path during
boot up, so we need this feature.) Currently this is part of Device
SCAN as above. And we need the locks as below, because the dynamically
disappearing device might clean up btrfs_device::name.
	Lock: btrfs_fs_devices::device_list_mutex
	Rename
	Unlock: btrfs_fs_devices::device_list_mutex

Commit Transaction:- writes all supers.
	Lock: btrfs_fs_devices::device_list_mutex
	Write all supers of btrfs_devices::dev_list
	Unlock: btrfs_fs_devices::device_list_mutex

Device add:- adds a new device to the existing mounted volume.
	set_bit: btrfs_fs_info::flags::BTRFS_FS_EXCL_OP
	Lock: btrfs_fs_devices::device_list_mutex
	Lock: btrfs_fs_info::chunk_mutex
	List_add btrfs_devices::dev_list
	List_add btrfs_devices::dev_alloc_list
	Unlock: btrfs_fs_info::chunk_mutex
	Unlock: btrfs_fs_devices::device_list_mutex

Device remove:- removes a device from the mounted volume.
	set_bit: btrfs_fs_info::flags::BTRFS_FS_EXCL_OP
	Lock: btrfs_fs_devices::device_list_mutex
	Lock: btrfs_fs_info::chunk_mutex
	List_del btrfs_devices::dev_list
	List_del btrfs_devices::dev_alloc_list
	Unlock: btrfs_fs_info::chunk_mutex
	Unlock: btrfs_fs_devices::device_list_mutex

Device Replace:- replaces a device.
	set_bit: btrfs_fs_info::flags::BTRFS_FS_EXCL_OP
	Lock: btrfs_fs_devices::device_list_mutex
	Lock: btrfs_fs_info::chunk_mutex
	List_update btrfs_devices::dev_list
	List_update btrfs_devices::dev_alloc_list
	Unlock: btrfs_fs_info::chunk_mutex
	Unlock: btrfs_fs_devices::device_list_mutex

Sprouting:- adds a RW device to the mounted RO seed device, so as to
make the mount point writable. The following steps are used to hold the
seed and sprout fs_devices. (The first two steps are not necessary for
the sprouting; they are there to ensure the seed device remains
scanned, and this might change.)
	. Clone the (mounted) fs_devices; let's call it old_devices
	. Now add old_devices to fs_uuids (yes, there is a duplicate
	  fsid in the list, but we change the other fsid before we
	  release the uuid_mutex, so it's fine)
	. Alloc a new fs_devices; let's call it seed_devices
	. Copy fs_devices into the seed_devices
	. Move the fs_devices devices list into seed_devices
	. Bring seed_devices under fs_devices
	  (fs_devices->seed = seed_devices)
	. Assign a new FSID to the fs_devices and add the new writable
	  device to the fs_devices

In the unmounted context the fs_devices::seed is always NULL. We alloc
the fs_devices::seed only at the time of mount and/or at sprouting. And
free it at the time of umount, or when the seed device is replaced or
deleted.

Locks:

Sprouting:
	Lock: uuid_mutex <-- because fsid rename and Device SCAN
	Reuses Device Add code

Locks: Splitting: