Re: "bad metadata" not fixed by btrfs repair

2016-03-30 Thread Qu Wenruo

First of all.

The "crossing stripe boundary" error message itself is *HARMLESS* for 
recent kernels.


It only means that the metadata extent won't be checked by scrub on recent
kernels, because the scrub code has a limitation: it can only check tree
blocks that fit entirely inside a 64K stripe.
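
To illustrate, the condition being flagged boils down to something like the
following sketch (illustrative only, not the actual btrfs code; the 64K value
is the same as BTRFS_STRIPE_LEN):

#include <stdbool.h>
#include <stdint.h>

#define STRIPE_LEN (64 * 1024ULL)	/* same 64K value as BTRFS_STRIPE_LEN */

/* A tree block "crosses a stripe boundary" when its byte range does not
 * fit inside a single 64K stripe, which is exactly what fsck warns about. */
static bool crosses_stripe_boundary(uint64_t bytenr, uint64_t len)
{
	return (bytenr / STRIPE_LEN) != ((bytenr + len - 1) / STRIPE_LEN);
}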


An old kernel won't hit anything wrong until such a tree block is scrubbed.

When it is scrubbed, an old kernel just hits a BUG_ON().

Recent kernels handle the limitation by checking at extent allocation time
and avoiding the boundary crossing, so a filesystem newly created and used
with a new kernel won't produce this error message at all.


For a filesystem created earlier, the problem can't be avoided, but at least
new kernels will not BUG_ON() when you scrub these extents; they just get
ignored (not great, but at least no BUG_ON).


And the new fsck checks for this case and emits this warning.

Overall, you're OK if you are using recent kernels.

Marc Haber wrote on 2016/03/29 08:43 +0200:

> On Mon, Mar 28, 2016 at 03:35:32PM -0400, Austin S. Hemmelgarn wrote:
>> Did you convert this filesystem from ext4 (or ext3)?
>
> No.
>
>> You hadn't mentioned what version of btrfs-progs you're using, and that is
>> somewhat important for recovery.  I'm not sure if current versions of btrfs
>> check can fix this issue, but I know for a fact that older versions (prior
>> to at least 4.1) can not fix it.
>
> 4.1 for creation and btrfs check.


I assume that you have run an older kernel on it, like v4.1 or v4.2.

Those old kernels lack the check that avoids such extent allocations.





>> As far as what the kernel is involved with, the easy way to check is if it's
>> operating on a mounted filesystem or not.  If it only operates on mounted
>> filesystems, it almost certainly goes through the kernel, if it only
>> operates on unmounted filesystems, it's almost certainly done in userspace
>> (except dev scan and technically fi show).
>
> Then btrfs check is a userspace-only matter, as it wants the fs
> unmounted, and it is irrelevant that I did btrfs check from a rescue
> system with an older kernel, 3.16 if I recall correctly.


It's not recommended to RW mount with an older kernel or to repair with an
older fsck, as an older kernel/btrfsck may allocate extents that cross the
64K boundary.





>> 2. Regarding general support:  If you're using an enterprise distribution
>> (RHEL, SLES, CentOS, OEL, or something similar), you are almost certainly
>> going to get better support from your vendor than from the mailing list or
>> IRC.
>
> My "productive" desktops (fan is one of them) run Debian unstable with
> a current vanilla kernel. At the moment, I can't use 4.5 because it
> acts up with KVM.  When I need a rescue system, I use grml, which
> unfortunately hasn't released since November 2014 and is still with
> kernel 3.16


To fix your problem (i.e. make these error messages disappear, even though
they are harmless on recent kernels), the easiest way is to balance your
metadata, e.g. with "btrfs balance start -m <mountpoint>" on a recent kernel.

As I explained, the bug only lies in metadata, and balance will allocate
new tree blocks, then copy the old data into the new locations.

In its allocation process, a recent kernel will avoid crossing the
boundary, which fixes your problem.


But if you are using old kernels, don't scrub your metadata.

Thanks,
Qu


> Greetings
> Marc






Re: "bad metadata" not fixed by btrfs repair

2016-03-30 Thread Marc Haber
On Wed, Mar 30, 2016 at 03:00:19PM +0800, Qu Wenruo wrote:
> Marc Haber wrote on 2016/03/29 08:43 +0200:
> >On Mon, Mar 28, 2016 at 03:35:32PM -0400, Austin S. Hemmelgarn wrote:
> >>Did you convert this filesystem from ext4 (or ext3)?
> >
> >No.
> >
> >>You hadn't mentioned what version of btrfs-progs you're using, and that is
> >>somewhat important for recovery.  I'm not sure if current versions of btrfs
> >>check can fix this issue, but I know for a fact that older versions (prior
> >>to at least 4.1) can not fix it.
> >
> >4.1 for creation and btrfs check.
> 
> I assume that you have run older kernel on it, like v4.1 or v4.2.

No, the productive system was always on a reasonably recent kernel. I
guess that this instance of btrfs has never been mounted on anything
older than 4.4.4. The rescue system I used to run btrfs check (4.4-1 from
Debian unstable, I updated btrfs-tools on the rescue system before
running btrfs check) had kernel 3.16, but I have never actually mounted
the btrfs there.

> >Then btrfs check is a userspace-only matter, as it wants the fs
> >unmounted, and it is irrelevant that I did btrfs check from a rescue
> >system with an older kernel, 3.16 if I recall correctly.
> 
> Not recommended to use older kernel to RW mount or use older fsck to do
> repair.

Oldest kernel that has mounted this btrfs is 4.4.4, fsck that touched
the fs is 4.4. I'm trying to get hold of btrfs-tools 4.5.

> >My "productive" desktops (fan is one of them) run Debian unstable with
> >a current vanilla kernel. At the moment, I can't use 4.5 because it
> >acts up with KVM.  When I need a rescue system, I use grml, which
> >unfortunately hasn't released since November 2014 and is still with
> >kernel 3.16
> 
> To fix your problem(make these error message just disappear, even they are
> harmless on recent kernels), the most easy one, is to balance your metadata.

This does not work on kernel 4.4.6 with tools 4.4. Truckloads of
kernel traces, "WARNING: CPU: 5 PID: 31021 at
fs/btrfs/extent-tree.c:7897 btrfs_alloc_tree_block+0xeb/0x3d6
[btrfs]()", "BTRFS: block rsv returned -28", full trace is in this
thread.

Greetings
Marc

-- 
-
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany|  lose things."Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421


[PATCH v9 16/19] btrfs: dedupe: Add support for on-disk hash search

2016-03-30 Thread Qu Wenruo
The on-disk backend is now able to search hashes.

Signed-off-by: Wang Xiaoguang 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/dedupe.c | 134 +++---
 fs/btrfs/dedupe.h |   1 +
 2 files changed, 118 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index a274c1c..f2c2dde 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -651,6 +651,79 @@ int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
return 0;
 }
 
+ /*
+ * Return 0 for not found
+ * Return >0 for found and set bytenr_ret
+ * Return <0 for error
+ */
+static int ondisk_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash,
+ u64 *bytenr_ret, u32 *num_bytes_ret)
+{
+   struct btrfs_path *path;
+   struct btrfs_key key;
+   struct btrfs_root *dedupe_root = dedupe_info->dedupe_root;
+   u8 *buf = NULL;
+   u64 hash_key;
+   int hash_len = btrfs_dedupe_sizes[dedupe_info->hash_type];
+   int ret;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   buf = kmalloc(hash_len, GFP_NOFS);
+   if (!buf) {
+   ret = -ENOMEM;
+   goto out;
+   }
+
+   memcpy(&hash_key, hash + hash_len - 8, 8);
+   key.objectid = hash_key;
+   key.type = BTRFS_DEDUPE_HASH_ITEM_KEY;
+   key.offset = (u64)-1;
+
+   ret = btrfs_search_slot(NULL, dedupe_root, &key, path, 0, 0);
+   if (ret < 0)
+   goto out;
+   WARN_ON(ret == 0);
+   while (1) {
+   struct extent_buffer *node;
+   struct btrfs_dedupe_hash_item *hash_item;
+   int slot;
+
+   ret = btrfs_previous_item(dedupe_root, path, hash_key,
+ BTRFS_DEDUPE_HASH_ITEM_KEY);
+   if (ret < 0)
+   goto out;
+   if (ret > 0) {
+   ret = 0;
+   goto out;
+   }
+
+   node = path->nodes[0];
+   slot = path->slots[0];
+   btrfs_item_key_to_cpu(node, &key, slot);
+
+   if (key.type != BTRFS_DEDUPE_HASH_ITEM_KEY ||
+   memcmp(&key.objectid, hash + hash_len - 8, 8))
+   break;
+   hash_item = btrfs_item_ptr(node, slot,
+   struct btrfs_dedupe_hash_item);
+   read_extent_buffer(node, buf, (unsigned long)(hash_item + 1),
+  hash_len);
+   if (!memcmp(buf, hash, hash_len)) {
+   ret = 1;
+   *bytenr_ret = key.offset;
+   *num_bytes_ret = btrfs_dedupe_hash_len(node, hash_item);
+   break;
+   }
+   }
+out:
+   kfree(buf);
+   btrfs_free_path(path);
+   return ret;
+}
+
 /*
  * Caller must ensure the corresponding ref head is not being run.
  */
@@ -681,9 +754,36 @@ inmem_search_hash(struct btrfs_dedupe_info *dedupe_info, 
u8 *hash)
return NULL;
 }
 
-static int inmem_search(struct btrfs_dedupe_info *dedupe_info,
-   struct inode *inode, u64 file_pos,
-   struct btrfs_dedupe_hash *hash)
+/* Wrapper for different backends, caller needs to hold dedupe_info->lock */
+static inline int generic_search_hash(struct btrfs_dedupe_info *dedupe_info,
+ u8 *hash, u64 *bytenr_ret,
+ u32 *num_bytes_ret)
+{
+   if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY) {
+   struct inmem_hash *found_hash;
+   int ret;
+
+   found_hash = inmem_search_hash(dedupe_info, hash);
+   if (found_hash) {
+   ret = 1;
+   *bytenr_ret = found_hash->bytenr;
+   *num_bytes_ret = found_hash->num_bytes;
+   } else {
+   ret = 0;
+   *bytenr_ret = 0;
+   *num_bytes_ret = 0;
+   }
+   return ret;
+   } else if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_ONDISK) {
+   return ondisk_search_hash(dedupe_info, hash, bytenr_ret,
+ num_bytes_ret);
+   }
+   return -EINVAL;
+}
+
+static int generic_search(struct btrfs_dedupe_info *dedupe_info,
+ struct inode *inode, u64 file_pos,
+ struct btrfs_dedupe_hash *hash)
 {
int ret;
struct btrfs_root *root = BTRFS_I(inode)->root;
@@ -693,9 +793,9 @@ static int inmem_search(struct btrfs_dedupe_info 
*dedupe_info,
struct btrfs_delayed_ref_head *insert_head;
struct btrfs_delayed_data_ref *insert_dref;
struct btrfs_qgroup_extent_record *insert_qrecord = NULL;
-   struct inmem_hash *found_hash;
int free_insert = 1;
u64 

[PATCH v9 08/19] btrfs: ordered-extent: Add support for dedupe

2016-03-30 Thread Qu Wenruo
From: Wang Xiaoguang 

Add ordered-extent support for dedupe.

Note that the current ordered-extent support only handles non-compressed
source extents.
Support for compressed source extents will be added later.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/ordered-data.c | 44 
 fs/btrfs/ordered-data.h | 13 +
 2 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 0de7da5..ef24ad1 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -26,6 +26,7 @@
 #include "extent_io.h"
 #include "disk-io.h"
 #include "compression.h"
+#include "dedupe.h"
 
 static struct kmem_cache *btrfs_ordered_extent_cache;
 
@@ -184,7 +185,8 @@ static inline struct rb_node *tree_search(struct 
btrfs_ordered_inode_tree *tree,
  */
 static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
  u64 start, u64 len, u64 disk_len,
- int type, int dio, int compress_type)
+ int type, int dio, int compress_type,
+ struct btrfs_dedupe_hash *hash)
 {
struct btrfs_root *root = BTRFS_I(inode)->root;
struct btrfs_ordered_inode_tree *tree;
@@ -204,6 +206,31 @@ static int __btrfs_add_ordered_extent(struct inode *inode, 
u64 file_offset,
entry->inode = igrab(inode);
entry->compress_type = compress_type;
entry->truncated_len = (u64)-1;
+   entry->hash = NULL;
+   /*
+* A hash hit must go through the dedupe routine at all costs, even if
+* dedupe is disabled, as its delayed ref has already been increased.
+*/
+   if (hash && (hash->bytenr || root->fs_info->dedupe_enabled)) {
+   struct btrfs_dedupe_info *dedupe_info;
+
+   dedupe_info = root->fs_info->dedupe_info;
+   if (WARN_ON(dedupe_info == NULL)) {
+   kmem_cache_free(btrfs_ordered_extent_cache,
+   entry);
+   return -EINVAL;
+   }
+   entry->hash = btrfs_dedupe_alloc_hash(dedupe_info->hash_type);
+   if (!entry->hash) {
+   kmem_cache_free(btrfs_ordered_extent_cache, entry);
+   return -ENOMEM;
+   }
+   entry->hash->bytenr = hash->bytenr;
+   entry->hash->num_bytes = hash->num_bytes;
+   memcpy(entry->hash->hash, hash->hash,
+  btrfs_dedupe_sizes[dedupe_info->hash_type]);
+   }
+
if (type != BTRFS_ORDERED_IO_DONE && type != BTRFS_ORDERED_COMPLETE)
set_bit(type, &entry->flags);
 
@@ -250,15 +277,23 @@ int btrfs_add_ordered_extent(struct inode *inode, u64 
file_offset,
 {
return __btrfs_add_ordered_extent(inode, file_offset, start, len,
  disk_len, type, 0,
- BTRFS_COMPRESS_NONE);
+ BTRFS_COMPRESS_NONE, NULL);
 }
 
+int btrfs_add_ordered_extent_dedupe(struct inode *inode, u64 file_offset,
+  u64 start, u64 len, u64 disk_len, int type,
+  struct btrfs_dedupe_hash *hash)
+{
+   return __btrfs_add_ordered_extent(inode, file_offset, start, len,
+ disk_len, type, 0,
+ BTRFS_COMPRESS_NONE, hash);
+}
 int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset,
 u64 start, u64 len, u64 disk_len, int type)
 {
return __btrfs_add_ordered_extent(inode, file_offset, start, len,
  disk_len, type, 1,
- BTRFS_COMPRESS_NONE);
+ BTRFS_COMPRESS_NONE, NULL);
 }
 
 int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset,
@@ -267,7 +302,7 @@ int btrfs_add_ordered_extent_compress(struct inode *inode, 
u64 file_offset,
 {
return __btrfs_add_ordered_extent(inode, file_offset, start, len,
  disk_len, type, 0,
- compress_type);
+ compress_type, NULL);
 }
 
 /*
@@ -577,6 +612,7 @@ void btrfs_put_ordered_extent(struct btrfs_ordered_extent 
*entry)
list_del(&sum->list);
kfree(sum);
}
+   kfree(entry->hash);
kmem_cache_free(btrfs_ordered_extent_cache, entry);
}
 }
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 23c9605..8a54476 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -139,6 +139,16 @@ struct btrfs_ordered_extent {
struct completion completion;

[PATCH v9 11/19] btrfs: dedupe: add an inode nodedupe flag

2016-03-30 Thread Qu Wenruo
From: Wang Xiaoguang 

Introduce the BTRFS_INODE_NODEDUPE flag, so we can explicitly disable
online data deduplication for specified files.

Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/ctree.h | 1 +
 fs/btrfs/ioctl.c | 6 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 85044bf..0e8933c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2381,6 +2381,7 @@ do {  
 \
 #define BTRFS_INODE_NOATIME(1 << 9)
 #define BTRFS_INODE_DIRSYNC(1 << 10)
 #define BTRFS_INODE_COMPRESS   (1 << 11)
+#define BTRFS_INODE_NODEDUPE   (1 << 12)
 
 #define BTRFS_INODE_ROOT_ITEM_INIT (1 << 31)
 
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 2d1ed93..c48afcb 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -161,7 +161,8 @@ void btrfs_update_iflags(struct inode *inode)
 /*
  * Inherit flags from the parent inode.
  *
- * Currently only the compression flags and the cow flags are inherited.
+ * Currently only the compression flags, dedupe flags and the cow flags
+ * are inherited.
  */
 void btrfs_inherit_iflags(struct inode *inode, struct inode *dir)
 {
@@ -186,6 +187,9 @@ void btrfs_inherit_iflags(struct inode *inode, struct inode 
*dir)
BTRFS_I(inode)->flags |= BTRFS_INODE_NODATASUM;
}
 
+   if (flags & BTRFS_INODE_NODEDUPE)
+   BTRFS_I(inode)->flags |= BTRFS_INODE_NODEDUPE;
+
btrfs_update_iflags(inode);
 }
 
-- 
2.7.4





[PATCH v9 03/19] btrfs: dedupe: Introduce function to add hash into in-memory tree

2016-03-30 Thread Qu Wenruo
From: Wang Xiaoguang 

Introduce the static function inmem_add() to add a hash into the in-memory
tree. Now we can implement the btrfs_dedupe_add() interface.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/dedupe.c | 151 ++
 1 file changed, 151 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 2211588..4e8455e 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -32,6 +32,14 @@ struct inmem_hash {
u8 hash[];
 };
 
+static inline struct inmem_hash *inmem_alloc_hash(u16 type)
+{
+   if (WARN_ON(type >= ARRAY_SIZE(btrfs_dedupe_sizes)))
+   return NULL;
+   return kzalloc(sizeof(struct inmem_hash) + btrfs_dedupe_sizes[type],
+   GFP_NOFS);
+}
+
 static int init_dedupe_info(struct btrfs_dedupe_info **ret_info, u16 type,
u16 backend, u64 blocksize, u64 limit)
 {
@@ -152,3 +160,146 @@ enable:
fs_info->dedupe_enabled = 1;
return ret;
 }
+
+static int inmem_insert_hash(struct rb_root *root,
+struct inmem_hash *hash, int hash_len)
+{
+   struct rb_node **p = &root->rb_node;
+   struct rb_node *parent = NULL;
+   struct inmem_hash *entry = NULL;
+
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct inmem_hash, hash_node);
+   if (memcmp(hash->hash, entry->hash, hash_len) < 0)
+   p = &(*p)->rb_left;
+   else if (memcmp(hash->hash, entry->hash, hash_len) > 0)
+   p = &(*p)->rb_right;
+   else
+   return 1;
+   }
+   rb_link_node(&hash->hash_node, parent, p);
+   rb_insert_color(&hash->hash_node, root);
+   return 0;
+}
+
+static int inmem_insert_bytenr(struct rb_root *root,
+  struct inmem_hash *hash)
+{
+   struct rb_node **p = &root->rb_node;
+   struct rb_node *parent = NULL;
+   struct inmem_hash *entry = NULL;
+
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct inmem_hash, bytenr_node);
+   if (hash->bytenr < entry->bytenr)
+   p = &(*p)->rb_left;
+   else if (hash->bytenr > entry->bytenr)
+   p = &(*p)->rb_right;
+   else
+   return 1;
+   }
+   rb_link_node(&hash->bytenr_node, parent, p);
+   rb_insert_color(&hash->bytenr_node, root);
+   return 0;
+}
+
+static void __inmem_del(struct btrfs_dedupe_info *dedupe_info,
+   struct inmem_hash *hash)
+{
+   list_del(&hash->lru_list);
+   rb_erase(&hash->hash_node, &dedupe_info->hash_root);
+   rb_erase(&hash->bytenr_node, &dedupe_info->bytenr_root);
+
+   if (!WARN_ON(dedupe_info->current_nr == 0))
+   dedupe_info->current_nr--;
+
+   kfree(hash);
+}
+
+/*
+ * Insert a hash into the in-memory dedupe tree.
+ * Will evict the least recently used hashes when the limit is exceeded.
+ *
+ * If the hash matches an existing one, we won't insert it, to
+ * save memory.
+ */
+static int inmem_add(struct btrfs_dedupe_info *dedupe_info,
+struct btrfs_dedupe_hash *hash)
+{
+   int ret = 0;
+   u16 type = dedupe_info->hash_type;
+   struct inmem_hash *ihash;
+
+   ihash = inmem_alloc_hash(type);
+
+   if (!ihash)
+   return -ENOMEM;
+
+   /* Copy the data out */
+   ihash->bytenr = hash->bytenr;
+   ihash->num_bytes = hash->num_bytes;
+   memcpy(ihash->hash, hash->hash, btrfs_dedupe_sizes[type]);
+
+   mutex_lock(&dedupe_info->lock);
+
+   ret = inmem_insert_bytenr(&dedupe_info->bytenr_root, ihash);
+   if (ret > 0) {
+   kfree(ihash);
+   ret = 0;
+   goto out;
+   }
+
+   ret = inmem_insert_hash(&dedupe_info->hash_root, ihash,
+   btrfs_dedupe_sizes[type]);
+   if (ret > 0) {
+   /*
+* We only keep one hash in tree to save memory, so if
+* hash conflicts, free the one to insert.
+*/
+   rb_erase(&ihash->bytenr_node, &dedupe_info->bytenr_root);
+   kfree(ihash);
+   ret = 0;
+   goto out;
+   }
+
+   list_add(&ihash->lru_list, &dedupe_info->lru_list);
+   dedupe_info->current_nr++;
+
+   /* Remove the last dedupe hash if we exceed limit */
+   while (dedupe_info->current_nr > dedupe_info->limit_nr) {
+   struct inmem_hash *last;
+
+   last = list_entry(dedupe_info->lru_list.prev,
+ struct inmem_hash, lru_list);
+   __inmem_del(dedupe_info, last);
+   }
+out:
+   mutex_unlock(&dedupe_info->lock);
+   return 0;
+}
+
+int btrfs_dedupe_add(struct btrfs_trans_handle *trans,
+struct btrfs_fs_info *fs_info,
+  

[PATCH v9 05/19] btrfs: delayed-ref: Add support for increasing data ref under spinlock

2016-03-30 Thread Qu Wenruo
For in-band dedupe, btrfs needs to increase a data ref with delayed_refs
locked, so add a new function btrfs_add_delayed_data_ref_locked() to
increase an extent ref with delayed_refs already locked.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/delayed-ref.c | 30 +++---
 fs/btrfs/delayed-ref.h |  8 
 2 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 430b368..07474e8 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -805,6 +805,26 @@ free_ref:
 }
 
 /*
+ * Do real delayed data ref insert.
+ * Caller must hold delayed_refs->lock and must have allocated memory
+ * for dref, head_ref and qrecord.
+ */
+void btrfs_add_delayed_data_ref_locked(struct btrfs_fs_info *fs_info,
+   struct btrfs_trans_handle *trans,
+   struct btrfs_delayed_data_ref *dref,
+   struct btrfs_delayed_ref_head *head_ref,
+   struct btrfs_qgroup_extent_record *qrecord,
+   u64 bytenr, u64 num_bytes, u64 parent, u64 ref_root,
+   u64 owner, u64 offset, u64 reserved, int action)
+{
+   head_ref = add_delayed_ref_head(fs_info, trans, &head_ref->node,
+   qrecord, bytenr, num_bytes, ref_root, reserved,
+   action, 1);
+   add_delayed_data_ref(fs_info, trans, head_ref, &dref->node, bytenr,
+   num_bytes, parent, ref_root, owner, offset, action);
+}
+
+/*
  * add a delayed data ref. it's similar to btrfs_add_delayed_tree_ref.
  */
 int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
@@ -849,13 +869,9 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info 
*fs_info,
 * insert both the head node and the new ref without dropping
 * the spin lock
 */
-   head_ref = add_delayed_ref_head(fs_info, trans, &head_ref->node, record,
-   bytenr, num_bytes, ref_root, reserved,
-   action, 1);
-
-   add_delayed_data_ref(fs_info, trans, head_ref, &ref->node, bytenr,
-  num_bytes, parent, ref_root, owner, offset,
-  action);
+   btrfs_add_delayed_data_ref_locked(fs_info, trans, ref, head_ref, record,
+   bytenr, num_bytes, parent, ref_root, owner, offset,
+   reserved, action);
spin_unlock(&delayed_refs->lock);
 
return 0;
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index c24b653..2765858 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -239,11 +239,19 @@ static inline void btrfs_put_delayed_ref(struct 
btrfs_delayed_ref_node *ref)
}
 }
 
+struct btrfs_qgroup_extent_record;
 int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
   struct btrfs_trans_handle *trans,
   u64 bytenr, u64 num_bytes, u64 parent,
   u64 ref_root, int level, int action,
   struct btrfs_delayed_extent_op *extent_op);
+void btrfs_add_delayed_data_ref_locked(struct btrfs_fs_info *fs_info,
+   struct btrfs_trans_handle *trans,
+   struct btrfs_delayed_data_ref *dref,
+   struct btrfs_delayed_ref_head *head_ref,
+   struct btrfs_qgroup_extent_record *qrecord,
+   u64 bytenr, u64 num_bytes, u64 parent, u64 ref_root,
+   u64 owner, u64 offset, u64 reserved, int action);
 int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
   struct btrfs_trans_handle *trans,
   u64 bytenr, u64 num_bytes,
-- 
2.7.4





[PATCH v9 13/19] btrfs: dedupe: add per-file online dedupe control

2016-03-30 Thread Qu Wenruo
From: Wang Xiaoguang 

Introduce inode_need_dedupe() to implement per-file online dedupe control.

Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/inode.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 96790d0..c80fd74 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -708,6 +708,18 @@ static void end_dedupe_extent(struct inode *inode, u64 
start,
}
 }
 
+static inline int inode_need_dedupe(struct btrfs_fs_info *fs_info,
+   struct inode *inode)
+{
+   if (!fs_info->dedupe_enabled)
+   return 0;
+
+   if (BTRFS_I(inode)->flags & BTRFS_INODE_NODEDUPE)
+   return 0;
+
+   return 1;
+}
+
 /*
  * phase two of compressed writeback.  This is the ordered portion
  * of the code, which only gets called in the order the work was
@@ -1680,7 +1692,8 @@ static int run_delalloc_range(struct inode *inode, struct 
page *locked_page,
} else if (BTRFS_I(inode)->flags & BTRFS_INODE_PREALLOC && !force_cow) {
ret = run_delalloc_nocow(inode, locked_page, start, end,
 page_started, 0, nr_written);
-   } else if (!inode_need_compress(inode) && !fs_info->dedupe_enabled) {
+   } else if (!inode_need_compress(inode) &&
+  !inode_need_dedupe(fs_info, inode)) {
ret = cow_file_range(inode, locked_page, start, end,
  page_started, nr_written, 1, NULL);
} else {
-- 
2.7.4





[PATCH v9 00/19] Btrfs dedupe framework

2016-03-30 Thread Qu Wenruo
This patchset can be fetched from github:
https://github.com/adam900710/linux.git wang_dedupe_20160330

This March 30th patchset update mostly addresses David's comments on the
patchset structure:
1) Change the patchset sequence
   Now, applying only the first 14 patches already provides a fully
   backward-compatible, in-memory only dedupe backend.

   Only from patch 15 onwards is the on-disk format changed.

   So patches 1~14 are going to be pushed for the next merge window, while
   I'll still submit them all for review purposes.

   This also gives us more time to further polish the on-disk format.

2) Fold small fixes into their original patches

On-disk format change comment from Chris will be addressed in next
iteration soon.

This updated version of inband de-duplication has the following features:
1) ONE unified dedupe framework.
   Most of its code is hidden quietly in dedupe.c, exporting only minimal
   interfaces for its callers.
   Reviewers and further developers will benefit from the unified
   framework.

2) TWO different back-ends with different trade-offs
   One is the improved version of the previous Fujitsu in-memory only
   dedupe.
   The other is an enhanced dedupe implementation based on Liu Bo's work,
   with its tree structure changed to handle bytenr -> hash search for
   hash deletion, without the hideous data backref hack.

3) Support compression with dedupe
   Now dedupe can work with compression.
   This means a dedupe miss case can be compressed, and a dedupe hit case
   can also reuse compressed file extents.

4) Ioctl interface with persistent dedupe status
   As advised by David, we now use an ioctl to enable/disable dedupe.

   And we now have a dedupe status, recorded in the first item of the
   dedupe tree.
   Just like quota, once enabled, no extra ioctl is needed on the next
   mount.

5) Ability to disable dedup for given dirs/files
   It works just like the compression prop method, by adding a new
   xattr.

TODO:
1) Add extent-by-extent comparison for faster but collision-prone algorithms
   The current SHA256 hash is quite slow, and on some old (5+ years old)
   CPUs, the CPU may even be the bottleneck rather than IO.
   But a faster hash will definitely produce collisions, so we need
   extent comparison before we introduce a new dedupe algorithm.

2) Misc end-user related helpers
   Like a handy, easy-to-implement dedupe rate report.
   And a method to query the in-memory hash size for users who want to use
   the 'dedupe enable -l' option but don't know how much RAM they have.

Changelog:
v2:
  Totally reworked to handle multiple backends
v3:
  Fix a stupid but deadly on-disk backend bug
  Add handle for multiple hash on same bytenr corner case to fix abort
  trans error
  Increase dedup rate by enhancing delayed ref handler for both backend.
  Move dedup_add() to run_delayed_ref() time, to fix abort trans error.
  Increase dedup block size up limit to 8M.
v4:
  Add dedup prop for disabling dedup for given files/dirs.
  Merge inmem_search() and ondisk_search() into generic_search() to save
  some code
  Fix another delayed_ref related bug.
  Use the same mutex for both inmem and ondisk backend.
  Move dedup_add() back to btrfs_finish_ordered_io() to increase dedup
  rate.
v5:
  Reuse compress routine for much simpler dedup function.
  Slightly improved performance due to above modification.
  Fix race between dedup enable/disable
  Fix for false ENOSPC report
v6:
  Further enable/disable race window fix.
  Minor format change according to checkpatch.
v7:
  Fix one concurrency bug with balance.
  Slightly modify return value from -EINVAL to -EOPNOTSUPP for
  btrfs_dedup_ioctl() to allow progs to distinguish unsupported commands
  and wrong parameter.
  Rebased to integration-4.6.
v8:
  Rename 'dedup' to 'dedupe'.
  Add support to allow dedupe and compression work at the same time.
  Fix several balance related bugs. Special thanks to Satoru Takeuchi,
  who exposed most of them.
  Small dedupe hit case performance improvement.
v9:
  Re-order the patchset to completely separate pure in-memory and any
  on-disk format change.
  Fold bug fixes into its original patch.

Qu Wenruo (8):
  btrfs: delayed-ref: Add support for increasing data ref under spinlock
  btrfs: dedupe: Inband in-memory only de-duplication implement
  btrfs: dedupe: Add basic tree structure for on-disk dedupe method
  btrfs: dedupe: Introduce interfaces to resume and cleanup dedupe info
  btrfs: dedupe: Add support for on-disk hash search
  btrfs: dedupe: Add support to delete hash for on-disk backend
  btrfs: dedupe: Add support for adding hash for on-disk backend
  btrfs: dedupe: Preparation for compress-dedupe co-work

Wang Xiaoguang (11):
  btrfs: dedupe: Introduce dedupe framework and its header
  btrfs: dedupe: Introduce function to initialize dedupe info
  btrfs: dedupe: Introduce function to add hash into in-memory tree
  btrfs: dedupe: Introduce function to remove hash from in-memory tree
  btrfs: dedupe: Introduce function to search for an exist

[PATCH v9 15/19] btrfs: dedupe: Introduce interfaces to resume and cleanup dedupe info

2016-03-30 Thread Qu Wenruo
Since we will introduce a new on-disk dedupe backend, introduce new
interfaces to resume a previous dedupe setup.

And since we introduce a new tree for the status, also add a disable
handler for it.

Signed-off-by: Wang Xiaoguang 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/dedupe.c  | 197 -
 fs/btrfs/dedupe.h  |  13 
 fs/btrfs/disk-io.c |  25 ++-
 fs/btrfs/disk-io.h |   1 +
 4 files changed, 232 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index cfb7fea..a274c1c 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -21,6 +21,8 @@
 #include "transaction.h"
 #include "delayed-ref.h"
 #include "qgroup.h"
+#include "disk-io.h"
+#include "locking.h"
 
 struct inmem_hash {
struct rb_node hash_node;
@@ -102,10 +104,69 @@ static int init_dedupe_info(struct btrfs_dedupe_info 
**ret_info, u16 type,
return 0;
 }
 
+static int init_dedupe_tree(struct btrfs_fs_info *fs_info,
+   struct btrfs_dedupe_info *dedupe_info)
+{
+   struct btrfs_root *dedupe_root;
+   struct btrfs_key key;
+   struct btrfs_path *path;
+   struct btrfs_dedupe_status_item *status;
+   struct btrfs_trans_handle *trans;
+   int ret;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   trans = btrfs_start_transaction(fs_info->tree_root, 2);
+   if (IS_ERR(trans)) {
+   ret = PTR_ERR(trans);
+   goto out;
+   }
+   dedupe_root = btrfs_create_tree(trans, fs_info,
+  BTRFS_DEDUPE_TREE_OBJECTID);
+   if (IS_ERR(dedupe_root)) {
+   ret = PTR_ERR(dedupe_root);
+   btrfs_abort_transaction(trans, fs_info->tree_root, ret);
+   goto out;
+   }
+   dedupe_info->dedupe_root = dedupe_root;
+
+   key.objectid = 0;
+   key.type = BTRFS_DEDUPE_STATUS_ITEM_KEY;
+   key.offset = 0;
+
+   ret = btrfs_insert_empty_item(trans, dedupe_root, path, &key,
+ sizeof(*status));
+   if (ret < 0) {
+   btrfs_abort_transaction(trans, fs_info->tree_root, ret);
+   goto out;
+   }
+
+   status = btrfs_item_ptr(path->nodes[0], path->slots[0],
+   struct btrfs_dedupe_status_item);
+   btrfs_set_dedupe_status_blocksize(path->nodes[0], status,
+dedupe_info->blocksize);
+   btrfs_set_dedupe_status_limit(path->nodes[0], status,
+   dedupe_info->limit_nr);
+   btrfs_set_dedupe_status_hash_type(path->nodes[0], status,
+   dedupe_info->hash_type);
+   btrfs_set_dedupe_status_backend(path->nodes[0], status,
+   dedupe_info->backend);
+   btrfs_mark_buffer_dirty(path->nodes[0]);
+out:
+   btrfs_free_path(path);
+   if (ret == 0)
+   btrfs_commit_transaction(trans, fs_info->tree_root);
+   return ret;
+}
+
 static int check_dedupe_parameter(struct btrfs_fs_info *fs_info, u16 hash_type,
  u16 backend, u64 blocksize, u64 limit_nr,
  u64 limit_mem, u64 *ret_limit)
 {
+   u64 compat_ro_flag = btrfs_super_compat_ro_flags(fs_info->super_copy);
+
if (blocksize > BTRFS_DEDUPE_BLOCKSIZE_MAX ||
blocksize < BTRFS_DEDUPE_BLOCKSIZE_MIN ||
blocksize < fs_info->tree_root->sectorsize ||
@@ -140,8 +201,12 @@ static int check_dedupe_parameter(struct btrfs_fs_info 
*fs_info, u16 hash_type,
*ret_limit = min(tmp, limit_nr);
}
}
-   if (backend == BTRFS_DEDUPE_BACKEND_ONDISK)
+   if (backend == BTRFS_DEDUPE_BACKEND_ONDISK) {
+   /* Ondisk backend must use RO compat feature */
+   if (!(compat_ro_flag & BTRFS_FEATURE_COMPAT_RO_DEDUPE))
+   return -EOPNOTSUPP;
*ret_limit = 0;
+   }
return 0;
 }
 
@@ -150,11 +215,16 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, 
u16 type, u16 backend,
 {
struct btrfs_dedupe_info *dedupe_info;
u64 limit = 0;
+   u64 compat_ro_flag = btrfs_super_compat_ro_flags(fs_info->super_copy);
+   int create_tree;
int ret = 0;
 
/* only one limit is accepted for enable*/
if (limit_nr && limit_mem)
return -EINVAL;
+   /* enable and disable may modify ondisk data, so block RO fs*/
+   if (fs_info->sb->s_flags & MS_RDONLY)
+   return -EROFS;
 
ret = check_dedupe_parameter(fs_info, type, backend, blocksize,
 limit_nr, limit_mem, &limit);
@@ -179,9 +249,19 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 
type, u16 backend,
}
 
 enable:
+   create_tree = compat_ro_flag & BTRFS_FEATURE_COMPAT_RO_DEDUPE;
+
ret = init_dedupe_info(&dedupe_

[PATCH v9 14/19] btrfs: dedupe: Add basic tree structure for on-disk dedupe method

2016-03-30 Thread Qu Wenruo
Introduce a new tree, the dedupe tree, to record on-disk dedupe hashes,
as persistent hash storage instead of an in-memory only implementation.

Unlike Liu Bo's implementation, in this version we don't use a hack for
bytenr -> hash search, but add a new key type, DEDUPE_BYTENR_ITEM, for that
search case, just like the in-memory backend.

Signed-off-by: Liu Bo 
Signed-off-by: Wang Xiaoguang 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h | 62 +++-
 fs/btrfs/dedupe.h|  5 
 fs/btrfs/disk-io.c   |  6 +
 fs/btrfs/relocation.c|  3 ++-
 include/trace/events/btrfs.h |  3 ++-
 5 files changed, 76 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0e8933c..b19c1f1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -100,6 +100,9 @@ struct btrfs_ordered_sum;
 /* tracks free space in block groups. */
 #define BTRFS_FREE_SPACE_TREE_OBJECTID 10ULL
 
+/* on-disk dedupe tree (EXPERIMENTAL) */
+#define BTRFS_DEDUPE_TREE_OBJECTID 11ULL
+
 /* device stats in the device tree */
 #define BTRFS_DEV_STATS_OBJECTID 0ULL
 
@@ -538,7 +541,8 @@ struct btrfs_super_block {
 #define BTRFS_FEATURE_COMPAT_SAFE_CLEAR0ULL
 
 #define BTRFS_FEATURE_COMPAT_RO_SUPP   \
-   (BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE)
+   (BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE |  \
+BTRFS_FEATURE_COMPAT_RO_DEDUPE)
 
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_SET   0ULL
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_CLEAR 0ULL
@@ -960,6 +964,42 @@ struct btrfs_csum_item {
u8 csum;
 } __attribute__ ((__packed__));
 
+/*
+ * Objectid: 0
+ * Type: BTRFS_DEDUPE_STATUS_ITEM_KEY
+ * Offset: 0
+ */
+struct btrfs_dedupe_status_item {
+   __le64 blocksize;
+   __le64 limit_nr;
+   __le16 hash_type;
+   __le16 backend;
+} __attribute__ ((__packed__));
+
+/*
+ * Objectid: Last 64 bit of the hash
+ * Type: BTRFS_DEDUPE_HASH_ITEM_KEY
+ * Offset: Bytenr of the hash
+ *
+ * Used for hash <-> bytenr search
+ */
+struct btrfs_dedupe_hash_item {
+   /* length of dedupe range */
+   __le32 len;
+
+   /* Hash follows */
+} __attribute__ ((__packed__));
+
+/*
+ * Objectid: bytenr
+ * Type: BTRFS_DEDUPE_BYTENR_ITEM_KEY
+ * offset: Last 64 bit of the hash
+ *
+ * Used for bytenr <-> hash search (for free_extent)
+ * all its content is hash.
+ * So no special item struct is needed.
+ */
+
 struct btrfs_dev_stats_item {
/*
 * grow this item struct at the end for future enhancements and keep
@@ -2168,6 +2208,13 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_CHUNK_ITEM_KEY   228
 
 /*
+ * Dedup item and status
+ */
+#define BTRFS_DEDUPE_STATUS_ITEM_KEY   230
+#define BTRFS_DEDUPE_HASH_ITEM_KEY 231
+#define BTRFS_DEDUPE_BYTENR_ITEM_KEY   232
+
+/*
  * Records the overall state of the qgroups.
  * There's only one instance of this key present,
  * (0, BTRFS_QGROUP_STATUS_KEY, 0)
@@ -3265,6 +3312,19 @@ static inline unsigned long btrfs_leaf_data(struct 
extent_buffer *l)
return offsetof(struct btrfs_leaf, items);
 }
 
+/* btrfs_dedupe_status */
+BTRFS_SETGET_FUNCS(dedupe_status_blocksize, struct btrfs_dedupe_status_item,
+  blocksize, 64);
+BTRFS_SETGET_FUNCS(dedupe_status_limit, struct btrfs_dedupe_status_item,
+  limit_nr, 64);
+BTRFS_SETGET_FUNCS(dedupe_status_hash_type, struct btrfs_dedupe_status_item,
+  hash_type, 16);
+BTRFS_SETGET_FUNCS(dedupe_status_backend, struct btrfs_dedupe_status_item,
+  backend, 16);
+
+/* btrfs_dedupe_hash_item */
+BTRFS_SETGET_FUNCS(dedupe_hash_len, struct btrfs_dedupe_hash_item, len, 32);
+
 /* struct btrfs_file_extent_item */
 BTRFS_SETGET_FUNCS(file_extent_type, struct btrfs_file_extent_item, type, 8);
 BTRFS_SETGET_STACK_FUNCS(stack_file_extent_disk_bytenr,
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index f5d2b45..1ac1bcb 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -60,6 +60,8 @@ struct btrfs_dedupe_hash {
u8 hash[];
 };
 
+struct btrfs_root;
+
 struct btrfs_dedupe_info {
/* dedupe blocksize */
u64 blocksize;
@@ -75,6 +77,9 @@ struct btrfs_dedupe_info {
struct list_head lru_list;
u64 limit_nr;
u64 current_nr;
+
+   /* for persist data like dedup-hash and dedupe status */
+   struct btrfs_root *dedupe_root;
 };
 
 struct btrfs_trans_handle;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index ed6a6fd..c7eda03 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -184,6 +184,7 @@ static struct btrfs_lockdep_keyset {
{ .id = BTRFS_DATA_RELOC_TREE_OBJECTID, .name_stem = "dreloc"   },
{ .id = BTRFS_UUID_TREE_OBJECTID,   .name_stem = "uuid" },
{ .id = BTRFS_FREE_SPACE_TREE_OBJECTID, .name_stem = "free-space" },
+   { .id = BTRFS_DEDUPE_TREE_OBJECTID, .name_stem = "dedupe"   },
{ .id = 0,  .name_stem = "tree" },
 };
 
@@

[PATCH v9 17/19] btrfs: dedupe: Add support to delete hash for on-disk backend

2016-03-30 Thread Qu Wenruo
The on-disk backend can now delete hashes.

Signed-off-by: Wang Xiaoguang 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/dedupe.c | 100 ++
 1 file changed, 100 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index f2c2dde..19fe5ee 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -500,6 +500,104 @@ static int inmem_del(struct btrfs_dedupe_info 
*dedupe_info, u64 bytenr)
return 0;
 }
 
+/*
+ * If prepare_del is given, this will setup search_slot() for delete.
+ * Caller needs to do proper locking.
+ *
+ * Return > 0 for found.
+ * Return 0 for not found.
+ * Return < 0 for error.
+ */
+static int ondisk_search_bytenr(struct btrfs_trans_handle *trans,
+   struct btrfs_dedupe_info *dedupe_info,
+   struct btrfs_path *path, u64 bytenr,
+   int prepare_del)
+{
+   struct btrfs_key key;
+   struct btrfs_root *dedupe_root = dedupe_info->dedupe_root;
+   int ret;
+   int ins_len = 0;
+   int cow = 0;
+
+   if (prepare_del) {
+   if (WARN_ON(trans == NULL))
+   return -EINVAL;
+   cow = 1;
+   ins_len = -1;
+   }
+
+   key.objectid = bytenr;
+   key.type = BTRFS_DEDUPE_BYTENR_ITEM_KEY;
+   key.offset = (u64)-1;
+
+   ret = btrfs_search_slot(trans, dedupe_root, &key, path,
+   ins_len, cow);
+
+   if (ret < 0)
+   return ret;
+   /*
+* Although it's almost impossible, it's still possible that
+* the last 64bits are all 1.
+*/
+   if (ret == 0)
+   return 1;
+
+   ret = btrfs_previous_item(dedupe_root, path, bytenr,
+ BTRFS_DEDUPE_BYTENR_ITEM_KEY);
+   if (ret < 0)
+   return ret;
+   if (ret > 0)
+   return 0;
+   return 1;
+}
+
+static int ondisk_del(struct btrfs_trans_handle *trans,
+ struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
+{
+   struct btrfs_root *dedupe_root = dedupe_info->dedupe_root;
+   struct btrfs_path *path;
+   struct btrfs_key key;
+   int ret;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   key.objectid = bytenr;
+   key.type = BTRFS_DEDUPE_BYTENR_ITEM_KEY;
+   key.offset = 0;
+
+   mutex_lock(&dedupe_info->lock);
+
+   ret = ondisk_search_bytenr(trans, dedupe_info, path, bytenr, 1);
+   if (ret <= 0)
+   goto out;
+
+   btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+   ret = btrfs_del_item(trans, dedupe_root, path);
+   btrfs_release_path(path);
+   if (ret < 0)
+   goto out;
+   /* Search for hash item and delete it */
+   key.objectid = key.offset;
+   key.type = BTRFS_DEDUPE_HASH_ITEM_KEY;
+   key.offset = bytenr;
+
+   ret = btrfs_search_slot(trans, dedupe_root, &key, path, -1, 1);
+   if (WARN_ON(ret > 0)) {
+   ret = -ENOENT;
+   goto out;
+   }
+   if (ret < 0)
+   goto out;
+   ret = btrfs_del_item(trans, dedupe_root, path);
+
+out:
+   btrfs_free_path(path);
+   mutex_unlock(&dedupe_info->lock);
+   return ret;
+}
+
 /* Remove a dedupe hash from dedupe tree */
 int btrfs_dedupe_del(struct btrfs_trans_handle *trans,
 struct btrfs_fs_info *fs_info, u64 bytenr)
@@ -514,6 +612,8 @@ int btrfs_dedupe_del(struct btrfs_trans_handle *trans,
 
if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
return inmem_del(dedupe_info, bytenr);
+   if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_ONDISK)
+   return ondisk_del(trans, dedupe_info, bytenr);
return -EINVAL;
 }
 
-- 
2.7.4





[PATCH v9 09/19] btrfs: dedupe: Inband in-memory only de-duplication implement

2016-03-30 Thread Qu Wenruo
Core implementation of inband de-duplication.
It reuses the async_cow_start() facility to calculate the dedupe hash,
and uses the dedupe hash to do inband de-duplication at the extent level.

The work flow is as below:
1) Run the delalloc range for an inode
2) Calculate the hash for the delalloc range in units of dedupe_bs
3) For a hash match (duplicated) case, just increase the source extent ref
   and insert the file extent.
   For a hash mismatch case, go through the normal cow_file_range()
   fallback, and add the hash into the dedupe tree.
   Compression for the hash miss case is not supported yet.

The current implementation stores all dedupe hashes in an in-memory
rb-tree, with LRU behavior to enforce the limit.

Signed-off-by: Wang Xiaoguang 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent-tree.c |  18 
 fs/btrfs/inode.c   | 235 ++---
 fs/btrfs/relocation.c  |  16 
 3 files changed, 236 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 53e1297..dabd721 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -37,6 +37,7 @@
 #include "math.h"
 #include "sysfs.h"
 #include "qgroup.h"
+#include "dedupe.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -2399,6 +2400,8 @@ static int run_one_delayed_ref(struct btrfs_trans_handle 
*trans,
 
if (btrfs_delayed_ref_is_head(node)) {
struct btrfs_delayed_ref_head *head;
+   struct btrfs_fs_info *fs_info = root->fs_info;
+
/*
 * we've hit the end of the chain and we were supposed
 * to insert this extent into the tree.  But, it got
@@ -2413,6 +2416,15 @@ static int run_one_delayed_ref(struct btrfs_trans_handle 
*trans,
btrfs_pin_extent(root, node->bytenr,
 node->num_bytes, 1);
if (head->is_data) {
+   /*
+* If insert_reserved is given, it means
+* a new extent was reserved, then deleted
+* in one transaction, and inc/dec got merged to 0.
+*
+* In this case, we need to remove its dedup
+* hash.
+*/
+   btrfs_dedupe_del(trans, fs_info, node->bytenr);
ret = btrfs_del_csums(trans, root,
  node->bytenr,
  node->num_bytes);
@@ -6713,6 +6725,12 @@ static int __btrfs_free_extent(struct btrfs_trans_handle 
*trans,
btrfs_release_path(path);
 
if (is_data) {
+   ret = btrfs_dedupe_del(trans, info, bytenr);
+   if (ret < 0) {
+   btrfs_abort_transaction(trans, extent_root,
+   ret);
+   goto out;
+   }
ret = btrfs_del_csums(trans, root, bytenr, num_bytes);
if (ret) {
btrfs_abort_transaction(trans, extent_root, 
ret);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 41a5688..96790d0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -60,6 +60,7 @@
 #include "hash.h"
 #include "props.h"
 #include "qgroup.h"
+#include "dedupe.h"
 
 struct btrfs_iget_args {
struct btrfs_key *location;
@@ -106,7 +107,8 @@ static int btrfs_finish_ordered_io(struct 
btrfs_ordered_extent *ordered_extent);
 static noinline int cow_file_range(struct inode *inode,
   struct page *locked_page,
   u64 start, u64 end, int *page_started,
-  unsigned long *nr_written, int unlock);
+  unsigned long *nr_written, int unlock,
+  struct btrfs_dedupe_hash *hash);
 static struct extent_map *create_pinned_em(struct inode *inode, u64 start,
   u64 len, u64 orig_start,
   u64 block_start, u64 block_len,
@@ -335,6 +337,7 @@ struct async_extent {
struct page **pages;
unsigned long nr_pages;
int compress_type;
+   struct btrfs_dedupe_hash *hash;
struct list_head list;
 };
 
@@ -353,7 +356,8 @@ static noinline int add_async_extent(struct async_cow *cow,
 u64 compressed_size,
 struct page **pages,
 unsigned long nr_pages,
-int compress_type)
+int compress_type,
+struct btrfs_dedupe_hash *hash)
 {
struct async_extent *async_e

[PATCH v9 12/19] btrfs: dedupe: add a property handler for online dedupe

2016-03-30 Thread Qu Wenruo
From: Wang Xiaoguang 

We use btrfs extended attribute "btrfs.dedupe" to record per-file online
dedupe status, so add a dedupe property handler.
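
As a usage illustration, here is a minimal userspace sketch (not part of this
patch) that drives the property directly through the xattr interface,
assuming the handler is reachable via the normal setxattr(2)/removexattr(2)
path just like the existing btrfs.compression property:

#include <string.h>
#include <sys/xattr.h>

/* Disable online dedupe for one file by setting "btrfs.dedupe" to
 * "disable"; removing the xattr (empty value) clears BTRFS_INODE_NODEDUPE
 * again, matching prop_dedupe_apply() in the patch below. */
static int set_file_nodedupe(const char *path, int disable)
{
	if (disable)
		return setxattr(path, "btrfs.dedupe", "disable",
				strlen("disable"), 0);
	return removexattr(path, "btrfs.dedupe");
}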

Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/props.c | 41 +
 1 file changed, 41 insertions(+)

diff --git a/fs/btrfs/props.c b/fs/btrfs/props.c
index 3699212..a430886 100644
--- a/fs/btrfs/props.c
+++ b/fs/btrfs/props.c
@@ -42,6 +42,11 @@ static int prop_compression_apply(struct inode *inode,
  size_t len);
 static const char *prop_compression_extract(struct inode *inode);
 
+static int prop_dedupe_validate(const char *value, size_t len);
+static int prop_dedupe_apply(struct inode *inode, const char *value,
+size_t len);
+static const char *prop_dedupe_extract(struct inode *inode);
+
 static struct prop_handler prop_handlers[] = {
{
.xattr_name = XATTR_BTRFS_PREFIX "compression",
@@ -50,6 +55,13 @@ static struct prop_handler prop_handlers[] = {
.extract = prop_compression_extract,
.inheritable = 1
},
+   {
+   .xattr_name = XATTR_BTRFS_PREFIX "dedupe",
+   .validate = prop_dedupe_validate,
+   .apply = prop_dedupe_apply,
+   .extract = prop_dedupe_extract,
+   .inheritable = 1
+   },
 };
 
 void __init btrfs_props_init(void)
@@ -426,4 +438,33 @@ static const char *prop_compression_extract(struct inode 
*inode)
return NULL;
 }
 
+static int prop_dedupe_validate(const char *value, size_t len)
+{
+   if (!strncmp("disable", value, len))
+   return 0;
+
+   return -EINVAL;
+}
+
+static int prop_dedupe_apply(struct inode *inode, const char *value, size_t 
len)
+{
+   if (len == 0) {
+   BTRFS_I(inode)->flags &= ~BTRFS_INODE_NODEDUPE;
+   return 0;
+   }
+
+   if (!strncmp("disable", value, len)) {
+   BTRFS_I(inode)->flags |= BTRFS_INODE_NODEDUPE;
+   return 0;
+   }
+
+   return -EINVAL;
+}
+
+static const char *prop_dedupe_extract(struct inode *inode)
+{
+   if (BTRFS_I(inode)->flags & BTRFS_INODE_NODEDUPE)
+   return "disable";
 
+   return NULL;
+}
-- 
2.7.4





[PATCH v9 04/19] btrfs: dedupe: Introduce function to remove hash from in-memory tree

2016-03-30 Thread Qu Wenruo
From: Wang Xiaoguang 

Introduce the static function inmem_del() to remove a hash from the
in-memory dedupe tree.
And implement the btrfs_dedupe_del() and btrfs_dedupe_disable() interfaces.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/dedupe.c | 105 ++
 1 file changed, 105 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 4e8455e..a229ded 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -303,3 +303,108 @@ int btrfs_dedupe_add(struct btrfs_trans_handle *trans,
return inmem_add(dedupe_info, hash);
return -EINVAL;
 }
+
+static struct inmem_hash *
+inmem_search_bytenr(struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
+{
+   struct rb_node **p = &dedupe_info->bytenr_root.rb_node;
+   struct rb_node *parent = NULL;
+   struct inmem_hash *entry = NULL;
+
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct inmem_hash, bytenr_node);
+
+   if (bytenr < entry->bytenr)
+   p = &(*p)->rb_left;
+   else if (bytenr > entry->bytenr)
+   p = &(*p)->rb_right;
+   else
+   return entry;
+   }
+
+   return NULL;
+}
+
+/* Delete a hash from in-memory dedupe tree */
+static int inmem_del(struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
+{
+   struct inmem_hash *hash;
+
+   mutex_lock(&dedupe_info->lock);
+   hash = inmem_search_bytenr(dedupe_info, bytenr);
+   if (!hash) {
+   mutex_unlock(&dedupe_info->lock);
+   return 0;
+   }
+
+   __inmem_del(dedupe_info, hash);
+   mutex_unlock(&dedupe_info->lock);
+   return 0;
+}
+
+/* Remove a dedupe hash from dedupe tree */
+int btrfs_dedupe_del(struct btrfs_trans_handle *trans,
+struct btrfs_fs_info *fs_info, u64 bytenr)
+{
+   struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+
+   if (!fs_info->dedupe_enabled)
+   return 0;
+
+   if (WARN_ON(dedupe_info == NULL))
+   return -EINVAL;
+
+   if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+   return inmem_del(dedupe_info, bytenr);
+   return -EINVAL;
+}
+
+static void inmem_destroy(struct btrfs_dedupe_info *dedupe_info)
+{
+   struct inmem_hash *entry, *tmp;
+
+   mutex_lock(&dedupe_info->lock);
+   list_for_each_entry_safe(entry, tmp, &dedupe_info->lru_list, lru_list)
+   __inmem_del(dedupe_info, entry);
+   mutex_unlock(&dedupe_info->lock);
+}
+
+int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
+{
+   struct btrfs_dedupe_info *dedupe_info;
+   int ret;
+
+   /* Here we don't want to increase refs of dedupe_info */
+   fs_info->dedupe_enabled = 0;
+
+   dedupe_info = fs_info->dedupe_info;
+
+   if (!dedupe_info)
+   return 0;
+
+   /* Don't allow disable status change in RO mount */
+   if (fs_info->sb->s_flags & MS_RDONLY)
+   return -EROFS;
+
+   /*
+* Wait for all unfinished write to complete dedupe routine
+* As disable operation is not a frequent operation, we are
+* OK to use heavy but safe sync_filesystem().
+*/
+   down_read(&fs_info->sb->s_umount);
+   ret = sync_filesystem(fs_info->sb);
+   up_read(&fs_info->sb->s_umount);
+   if (ret < 0)
+   return ret;
+
+   fs_info->dedupe_info = NULL;
+
+   /* now we are OK to clean up everything */
+   if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+   inmem_destroy(dedupe_info);
+
+   crypto_free_shash(dedupe_info->dedupe_driver);
+   kfree(dedupe_info);
+   return 0;
+}
-- 
2.7.4





[PATCH v9 10/19] btrfs: dedupe: Add ioctl for inband dedupelication

2016-03-30 Thread Qu Wenruo
From: Wang Xiaoguang 

Add an ioctl interface for inband deduplication, which includes:
1) enable
2) disable
3) status

And a pseudo RO compat flag, to indicate that btrfs now supports inband
dedupe. However, we don't make any on-disk format change; it's just a
pseudo RO compat flag.

All these ioctl interfaces are state-less, which means the caller doesn't
need to know the previous dedupe state before calling them, and only needs
to specify the final desired state.

For example, if a user wants to enable dedupe with a specified block size
and limit, they just fill in the ioctl structure and call the enable ioctl.
There is no need to check whether dedupe is already running.

These ioctls handle things like reconfiguration or disabling quite well.
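
As an illustration of the state-less enable call, here is a rough userspace
sketch. The ioctl name BTRFS_IOC_DEDUPE_CTL, the command/backend/hash-type
constants and the exact field layout of struct btrfs_ioctl_dedupe_args are
assumptions here; the patched include/uapi/linux/btrfs.h below is
authoritative:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/btrfs.h>	/* from a tree with this patchset applied */

/* Enable in-memory inband dedupe on a mounted filesystem; names that are
 * not visible in this patchset's diff are assumed, not authoritative. */
static int dedupe_enable(const char *mntpoint)
{
	struct btrfs_ioctl_dedupe_args dargs = {0};
	int fd = open(mntpoint, O_RDONLY);

	if (fd < 0)
		return -1;
	dargs.cmd = BTRFS_DEDUPE_CTL_ENABLE;	/* assumed command constant */
	dargs.blocksize = 128 * 1024;		/* dedupe block size */
	dargs.backend = BTRFS_DEDUPE_BACKEND_INMEMORY;
	dargs.hash_type = BTRFS_DEDUPE_HASH_SHA256;
	dargs.limit_nr = 32768;			/* in-memory hash count limit */

	if (ioctl(fd, BTRFS_IOC_DEDUPE_CTL, &dargs) < 0) {
		perror("BTRFS_IOC_DEDUPE_CTL");
		close(fd);
		return -1;
	}
	close(fd);
	return 0;
}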

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/dedupe.c  | 48 +
 fs/btrfs/dedupe.h  | 15 +++
 fs/btrfs/disk-io.c |  3 +++
 fs/btrfs/ioctl.c   | 66 ++
 fs/btrfs/sysfs.c   |  2 ++
 include/uapi/linux/btrfs.h | 23 
 7 files changed, 158 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 022ab61..85044bf 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -508,6 +508,7 @@ struct btrfs_super_block {
  * ones specified below then we will fail to mount
  */
 #define BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE(1ULL << 0)
+#define BTRFS_FEATURE_COMPAT_RO_DEDUPE (1ULL << 1)
 
 #define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF   (1ULL << 0)
 #define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL  (1ULL << 1)
diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index bdaea3a..cfb7fea 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -41,6 +41,33 @@ static inline struct inmem_hash *inmem_alloc_hash(u16 type)
GFP_NOFS);
 }
 
+void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
+struct btrfs_ioctl_dedupe_args *dargs)
+{
+   struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+
+   if (!fs_info->dedupe_enabled || !dedupe_info) {
+   dargs->status = 0;
+   dargs->blocksize = 0;
+   dargs->backend = 0;
+   dargs->hash_type = 0;
+   dargs->limit_nr = 0;
+   dargs->current_nr = 0;
+   return;
+   }
+   mutex_lock(&dedupe_info->lock);
+   dargs->status = 1;
+   dargs->blocksize = dedupe_info->blocksize;
+   dargs->backend = dedupe_info->backend;
+   dargs->hash_type = dedupe_info->hash_type;
+   dargs->limit_nr = dedupe_info->limit_nr;
+   dargs->limit_mem = dedupe_info->limit_nr *
+   (sizeof(struct inmem_hash) +
+btrfs_dedupe_sizes[dedupe_info->hash_type]);
+   dargs->current_nr = dedupe_info->current_nr;
+   mutex_unlock(&dedupe_info->lock);
+}
+
 static int init_dedupe_info(struct btrfs_dedupe_info **ret_info, u16 type,
u16 backend, u64 blocksize, u64 limit)
 {
@@ -371,6 +398,27 @@ static void inmem_destroy(struct btrfs_dedupe_info 
*dedupe_info)
mutex_unlock(&dedupe_info->lock);
 }
 
+int btrfs_dedupe_cleanup(struct btrfs_fs_info *fs_info)
+{
+   struct btrfs_dedupe_info *dedupe_info;
+
+   fs_info->dedupe_enabled = 0;
+   /* same as disable */
+   smp_wmb();
+   dedupe_info = fs_info->dedupe_info;
+   fs_info->dedupe_info = NULL;
+
+   if (!dedupe_info)
+   return 0;
+
+   if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+   inmem_destroy(dedupe_info);
+
+   crypto_free_shash(dedupe_info->dedupe_driver);
+   kfree(dedupe_info);
+   return 0;
+}
+
 int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
 {
struct btrfs_dedupe_info *dedupe_info;
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index e5d0d34..f5d2b45 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -103,6 +103,15 @@ static inline struct btrfs_dedupe_hash 
*btrfs_dedupe_alloc_hash(u16 type)
 int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
u64 blocksize, u64 limit_nr, u64 limit_mem);
 
+
+ /*
+ * Get inband dedupe info
+ * Since it needs to access different backends' hash size, which
+ * is not exported, we need such simple function.
+ */
+void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
+struct btrfs_ioctl_dedupe_args *dargs);
+
 /*
  * Disable dedupe and invalidate all its dedupe data.
  * Called at dedupe disable time.
@@ -110,6 +119,12 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 
type, u16 backend,
 int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info);
 
 /*
+ * Cleanup current btrfs_dedupe_info
+ * Called in umount time
+ */
+int btrfs_dedupe_cleanup(struct btrfs_fs_info *fs_info);
+
+/*
  * Calculate hash for dedup.
  * Caller must ensure [start, start + dedupe_bs) has valid data.
  */
di

[PATCH v9 06/19] btrfs: dedupe: Introduce function to search for an existing hash

2016-03-30 Thread Qu Wenruo
From: Wang Xiaoguang 

Introduce the static function inmem_search() to handle the search in the
in-memory hash tree.

The trick is that we must ensure the delayed ref head is not being run at
the time we search for the hash.

With inmem_search(), we can implement the btrfs_dedupe_search()
interface.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/dedupe.c | 185 ++
 1 file changed, 185 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index a229ded..9175a5f 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -20,6 +20,7 @@
 #include "btrfs_inode.h"
 #include "transaction.h"
 #include "delayed-ref.h"
+#include "qgroup.h"
 
 struct inmem_hash {
struct rb_node hash_node;
@@ -408,3 +409,187 @@ int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
kfree(dedupe_info);
return 0;
 }
+
+/*
+ * Caller must ensure the corresponding ref head is not being run.
+ */
+static struct inmem_hash *
+inmem_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash)
+{
+   struct rb_node **p = &dedupe_info->hash_root.rb_node;
+   struct rb_node *parent = NULL;
+   struct inmem_hash *entry = NULL;
+   u16 hash_type = dedupe_info->hash_type;
+   int hash_len = btrfs_dedupe_sizes[hash_type];
+
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct inmem_hash, hash_node);
+
+   if (memcmp(hash, entry->hash, hash_len) < 0) {
+   p = &(*p)->rb_left;
+   } else if (memcmp(hash, entry->hash, hash_len) > 0) {
+   p = &(*p)->rb_right;
+   } else {
+   /* Found, need to re-add it to LRU list head */
+   list_del(&entry->lru_list);
+   list_add(&entry->lru_list, &dedupe_info->lru_list);
+   return entry;
+   }
+   }
+   return NULL;
+}
+
+static int inmem_search(struct btrfs_dedupe_info *dedupe_info,
+   struct inode *inode, u64 file_pos,
+   struct btrfs_dedupe_hash *hash)
+{
+   int ret;
+   struct btrfs_root *root = BTRFS_I(inode)->root;
+   struct btrfs_trans_handle *trans;
+   struct btrfs_delayed_ref_root *delayed_refs;
+   struct btrfs_delayed_ref_head *head;
+   struct btrfs_delayed_ref_head *insert_head;
+   struct btrfs_delayed_data_ref *insert_dref;
+   struct btrfs_qgroup_extent_record *insert_qrecord = NULL;
+   struct inmem_hash *found_hash;
+   int free_insert = 1;
+   u64 bytenr;
+   u32 num_bytes;
+
+   insert_head = kmem_cache_alloc(btrfs_delayed_ref_head_cachep, GFP_NOFS);
+   if (!insert_head)
+   return -ENOMEM;
+   insert_head->extent_op = NULL;
+   insert_dref = kmem_cache_alloc(btrfs_delayed_data_ref_cachep, GFP_NOFS);
+   if (!insert_dref) {
+   kmem_cache_free(btrfs_delayed_ref_head_cachep, insert_head);
+   return -ENOMEM;
+   }
+   if (root->fs_info->quota_enabled &&
+   is_fstree(root->root_key.objectid)) {
+   insert_qrecord = kmalloc(sizeof(*insert_qrecord), GFP_NOFS);
+   if (!insert_qrecord) {
+   kmem_cache_free(btrfs_delayed_ref_head_cachep,
+   insert_head);
+   kmem_cache_free(btrfs_delayed_data_ref_cachep,
+   insert_dref);
+   return -ENOMEM;
+   }
+   }
+
+   trans = btrfs_join_transaction(root);
+   if (IS_ERR(trans)) {
+   ret = PTR_ERR(trans);
+   goto free_mem;
+   }
+
+again:
+   mutex_lock(&dedupe_info->lock);
+   found_hash = inmem_search_hash(dedupe_info, hash->hash);
+   /* If we don't find a duplicated extent, just return. */
+   if (!found_hash) {
+   ret = 0;
+   goto out;
+   }
+   bytenr = found_hash->bytenr;
+   num_bytes = found_hash->num_bytes;
+
+   delayed_refs = &trans->transaction->delayed_refs;
+
+   spin_lock(&delayed_refs->lock);
+   head = btrfs_find_delayed_ref_head(trans, bytenr);
+   if (!head) {
+   /*
+* We can safely insert a new delayed_ref as long as we
+* hold delayed_refs->lock.
+* Only need to use atomic inc_extent_ref()
+*/
+   btrfs_add_delayed_data_ref_locked(root->fs_info, trans,
+   insert_dref, insert_head, insert_qrecord,
+   bytenr, num_bytes, 0, root->root_key.objectid,
+   btrfs_ino(inode), file_pos, 0,
+   BTRFS_ADD_DELAYED_REF);
+   spin_unlock(&delayed_refs->lock);
+
+   /* add_delayed_data_ref_locked will free unused memory */
+   free_insert = 

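For illustration, the "back off if the ref head is being run" trick
described in this patch's commit message boils down to the following
simplified sketch. It is not the literal (truncated) patch code; the
helpers follow the btrfs delayed-ref code of that era and the lock and
label names are assumed to match the patch:

    /* inside inmem_search(), after a hash match was found */
    spin_lock(&delayed_refs->lock);
    head = btrfs_find_delayed_ref_head(trans, bytenr);
    if (head && !mutex_trylock(&head->mutex)) {
            /*
             * The ref head is currently being run. Pin it, drop
             * all our locks, wait for the run to finish and then
             * redo the whole hash search.
             */
            atomic_inc(&head->node.refs);
            spin_unlock(&delayed_refs->lock);
            mutex_unlock(&dedupe_info->lock);

            mutex_lock(&head->mutex);   /* waits for the run */
            mutex_unlock(&head->mutex);
            btrfs_put_delayed_ref(&head->node);
            goto again;
    }

This is the same pattern extent-tree.c uses when it has to check an
extent's references against pending delayed refs.
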
[PATCH v9 01/19] btrfs: dedupe: Introduce dedupe framework and its header

2016-03-30 Thread Qu Wenruo
From: Wang Xiaoguang 

Introduce the header for the btrfs online (write-time) de-duplication
framework and the declarations it needs.

The new de-duplication framework is going to support two different
dedupe backends and one dedupe hash algorithm.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/ctree.h   |   5 ++
 fs/btrfs/dedupe.h  | 134 +
 fs/btrfs/disk-io.c |   1 +
 3 files changed, 140 insertions(+)
 create mode 100644 fs/btrfs/dedupe.h

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 84a6a5b..022ab61 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1860,6 +1860,11 @@ struct btrfs_fs_info {
struct list_head pinned_chunks;
 
int creating_free_space_tree;
+
+   /* Inband de-duplication related structures*/
+   unsigned int dedupe_enabled:1;
+   struct btrfs_dedupe_info *dedupe_info;
+   struct mutex dedupe_ioctl_lock;
 };
 
 struct btrfs_subvolume_writers {
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
new file mode 100644
index 000..40f4808
--- /dev/null
+++ b/fs/btrfs/dedupe.h
@@ -0,0 +1,134 @@
+/*
+ * Copyright (C) 2015 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#ifndef __BTRFS_DEDUPE__
+#define __BTRFS_DEDUPE__
+
+#include 
+#include 
+#include 
+
+/*
+ * Dedup storage backend
+ * On disk is persist storage but overhead is large
+ * In memory is fast but will lose all its hash on umount
+ */
+#define BTRFS_DEDUPE_BACKEND_INMEMORY  0
+#define BTRFS_DEDUPE_BACKEND_ONDISK1
+
+/* Only support inmemory yet, so count is still only 1 */
+#define BTRFS_DEDUPE_BACKEND_COUNT 1
+
+/* Dedup block size limit and default value */
+#define BTRFS_DEDUPE_BLOCKSIZE_MAX (8 * 1024 * 1024)
+#define BTRFS_DEDUPE_BLOCKSIZE_MIN (16 * 1024)
+#define BTRFS_DEDUPE_BLOCKSIZE_DEFAULT (128 * 1024)
+
+/* Hash algorithm, only support SHA256 yet */
+#define BTRFS_DEDUPE_HASH_SHA256   0
+
+static int btrfs_dedupe_sizes[] = { 32 };
+
+/*
+ * For caller outside of dedup.c
+ *
+ * Different dedupe backends should have their own hash structure
+ */
+struct btrfs_dedupe_hash {
+   u64 bytenr;
+   u32 num_bytes;
+
+   /* last field is a variable length array of dedupe hash */
+   u8 hash[];
+};
+
+struct btrfs_dedupe_info {
+   /* dedupe blocksize */
+   u64 blocksize;
+   u16 backend;
+   u16 hash_type;
+
+   struct crypto_shash *dedupe_driver;
+   struct mutex lock;
+
+   /* following members are only used in in-memory dedupe mode */
+   struct rb_root hash_root;
+   struct rb_root bytenr_root;
+   struct list_head lru_list;
+   u64 limit_nr;
+   u64 current_nr;
+};
+
+struct btrfs_trans_handle;
+
+static inline int btrfs_dedupe_hash_hit(struct btrfs_dedupe_hash *hash)
+{
+   return (hash && hash->bytenr);
+}
+
+int btrfs_dedupe_hash_size(u16 type);
+struct btrfs_dedupe_hash *btrfs_dedupe_alloc_hash(u16 type);
+
+/*
+ * Initial inband dedupe info
+ * Called at dedupe enable time.
+ */
+int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
+   u64 blocksize, u64 limit_nr, u64 limit_mem);
+
+/*
+ * Disable dedupe and invalidate all its dedupe data.
+ * Called at dedupe disable time.
+ */
+int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info);
+
+/*
+ * Calculate hash for dedup.
+ * Caller must ensure [start, start + dedupe_bs) has valid data.
+ */
+int btrfs_dedupe_calc_hash(struct btrfs_fs_info *fs_info,
+  struct inode *inode, u64 start,
+  struct btrfs_dedupe_hash *hash);
+
+/*
+ * Search for duplicated extents by calculated hash
+ * Caller must call btrfs_dedupe_calc_hash() first to get the hash.
+ *
+ * @inode: the inode for we are writing
+ * @file_pos: offset inside the inode
+ * As we will increase extent ref immediately after a hash match,
+ * we need @file_pos and @inode in this case.
+ *
+ * Return > 0 for a hash match, and the extent ref will be
+ * *INCREASED*, and hash->bytenr/num_bytes will record the existing
+ * extent data.
+ * Return 0 for a hash miss. Nothing is done
+ */
+int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
+   struct inode *inode, u64 file_pos,
+   struct btrfs_dedupe_hash *hash);
+
+/* Add a dedupe hash 
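
To show how the interfaces declared in this header are meant to fit
together, here is a rough, hypothetical write-path flow. Everything
except the btrfs_dedupe_*() calls, in particular the two helper names
near the bottom, is made up purely for illustration:

    struct btrfs_dedupe_hash *hash;
    int ret;

    hash = btrfs_dedupe_alloc_hash(BTRFS_DEDUPE_HASH_SHA256);
    if (!hash)
            return -ENOMEM;

    ret = btrfs_dedupe_calc_hash(fs_info, inode, start, hash);
    if (ret < 0)
            goto out;

    ret = btrfs_dedupe_search(fs_info, inode, start, hash);
    if (ret < 0)
            goto out;

    if (btrfs_dedupe_hash_hit(hash)) {
            /*
             * Hash hit: btrfs_dedupe_search() already increased the
             * extent ref, so just point the new file extent at
             * hash->bytenr / hash->num_bytes.
             */
            ret = link_file_extent_to_dedupe_hit(inode, start, hash);
    } else {
            /* Hash miss: normal COW, then record the new hash. */
            ret = cow_range_and_add_dedupe_hash(inode, start, hash);
    }
out:
    kfree(hash);    /* assuming a kmalloc-backed hash allocation */
    return ret;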

[PATCH v9 07/19] btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface

2016-03-30 Thread Qu Wenruo
From: Wang Xiaoguang 

Unlike the dedupe backends (in-memory and on-disk), only the SHA256 hash
method is supported so far, so implement the btrfs_dedupe_calc_hash()
interface using SHA256.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/dedupe.c | 49 +
 1 file changed, 49 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 9175a5f..bdaea3a 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -593,3 +593,52 @@ int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
}
return ret;
 }
+
+int btrfs_dedupe_calc_hash(struct btrfs_fs_info *fs_info,
+  struct inode *inode, u64 start,
+  struct btrfs_dedupe_hash *hash)
+{
+   int i;
+   int ret;
+   struct page *p;
+   struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+   struct crypto_shash *tfm = dedupe_info->dedupe_driver;
+   struct {
+   struct shash_desc desc;
+   char ctx[crypto_shash_descsize(tfm)];
+   } sdesc;
+   u64 dedupe_bs;
+   u64 sectorsize = BTRFS_I(inode)->root->sectorsize;
+
+   if (!fs_info->dedupe_enabled || !hash)
+   return 0;
+
+   if (WARN_ON(dedupe_info == NULL))
+   return -EINVAL;
+
+   WARN_ON(!IS_ALIGNED(start, sectorsize));
+
+   dedupe_bs = dedupe_info->blocksize;
+
+   sdesc.desc.tfm = tfm;
+   sdesc.desc.flags = 0;
+   ret = crypto_shash_init(&sdesc.desc);
+   if (ret)
+   return ret;
+   for (i = 0; sectorsize * i < dedupe_bs; i++) {
+   char *d;
+
+   p = find_get_page(inode->i_mapping,
+ (start >> PAGE_CACHE_SHIFT) + i);
+   if (WARN_ON(!p))
+   return -ENOENT;
+   d = kmap(p);
+   ret = crypto_shash_update(&sdesc.desc, d, sectorsize);
+   kunmap(p);
+   page_cache_release(p);
+   if (ret)
+   return ret;
+   }
+   ret = crypto_shash_final(&sdesc.desc, hash->hash);
+   return ret;
+}
-- 
2.7.4



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v9 18/19] btrfs: dedupe: Add support for adding hash for on-disk backend

2016-03-30 Thread Qu Wenruo
The on-disk backend can now add hashes.

Since all the needed on-disk backend functions are in place, also allow
the on-disk backend to be used, by changing DEDUPE_BACKEND_COUNT from
1 (in-memory only) to 2 (in-memory + on-disk).

Signed-off-by: Wang Xiaoguang 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/dedupe.c | 83 +++
 fs/btrfs/dedupe.h |  3 +-
 2 files changed, 84 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 19fe5ee..f1d1255 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -437,6 +437,87 @@ out:
return 0;
 }
 
+static int ondisk_search_bytenr(struct btrfs_trans_handle *trans,
+   struct btrfs_dedupe_info *dedupe_info,
+   struct btrfs_path *path, u64 bytenr,
+   int prepare_del);
+static int ondisk_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash,
+ u64 *bytenr_ret, u32 *num_bytes_ret);
+static int ondisk_add(struct btrfs_trans_handle *trans,
+ struct btrfs_dedupe_info *dedupe_info,
+ struct btrfs_dedupe_hash *hash)
+{
+   struct btrfs_path *path;
+   struct btrfs_root *dedupe_root = dedupe_info->dedupe_root;
+   struct btrfs_key key;
+   struct btrfs_dedupe_hash_item *hash_item;
+   u64 bytenr;
+   u32 num_bytes;
+   int hash_len = btrfs_dedupe_sizes[dedupe_info->hash_type];
+   int ret;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   mutex_lock(&dedupe_info->lock);
+
+   ret = ondisk_search_bytenr(NULL, dedupe_info, path, hash->bytenr, 0);
+   if (ret < 0)
+   goto out;
+   if (ret > 0) {
+   ret = 0;
+   goto out;
+   }
+   btrfs_release_path(path);
+
+   ret = ondisk_search_hash(dedupe_info, hash->hash, &bytenr, &num_bytes);
+   if (ret < 0)
+   goto out;
+   /* Same hash found, don't re-add to save dedupe tree space */
+   if (ret > 0) {
+   ret = 0;
+   goto out;
+   }
+
+   /* Insert hash->bytenr item */
+   memcpy(&key.objectid, hash->hash + hash_len - 8, 8);
+   key.type = BTRFS_DEDUPE_HASH_ITEM_KEY;
+   key.offset = hash->bytenr;
+
+   ret = btrfs_insert_empty_item(trans, dedupe_root, path, &key,
+   sizeof(*hash_item) + hash_len);
+   WARN_ON(ret == -EEXIST);
+   if (ret < 0)
+   goto out;
+   hash_item = btrfs_item_ptr(path->nodes[0], path->slots[0],
+  struct btrfs_dedupe_hash_item);
+   btrfs_set_dedupe_hash_len(path->nodes[0], hash_item, hash->num_bytes);
+   write_extent_buffer(path->nodes[0], hash->hash,
+   (unsigned long)(hash_item + 1), hash_len);
+   btrfs_mark_buffer_dirty(path->nodes[0]);
+   btrfs_release_path(path);
+
+   /* Then bytenr->hash item */
+   key.objectid = hash->bytenr;
+   key.type = BTRFS_DEDUPE_BYTENR_ITEM_KEY;
+   memcpy(&key.offset, hash->hash + hash_len - 8, 8);
+
+   ret = btrfs_insert_empty_item(trans, dedupe_root, path, &key, hash_len);
+   WARN_ON(ret == -EEXIST);
+   if (ret < 0)
+   goto out;
+   write_extent_buffer(path->nodes[0], hash->hash,
+   btrfs_item_ptr_offset(path->nodes[0], path->slots[0]),
+   hash_len);
+   btrfs_mark_buffer_dirty(path->nodes[0]);
+
+out:
+   mutex_unlock(&dedupe_info->lock);
+   btrfs_free_path(path);
+   return ret;
+}
+
 int btrfs_dedupe_add(struct btrfs_trans_handle *trans,
 struct btrfs_fs_info *fs_info,
 struct btrfs_dedupe_hash *hash)
@@ -458,6 +539,8 @@ int btrfs_dedupe_add(struct btrfs_trans_handle *trans,
 
if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
return inmem_add(dedupe_info, hash);
+   if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_ONDISK)
+   return ondisk_add(trans, dedupe_info, hash);
return -EINVAL;
 }
 
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index bfcacd7..1573456 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -31,8 +31,7 @@
 #define BTRFS_DEDUPE_BACKEND_INMEMORY  0
 #define BTRFS_DEDUPE_BACKEND_ONDISK1
 
-/* Only support inmemory yet, so count is still only 1 */
-#define BTRFS_DEDUPE_BACKEND_COUNT 1
+#define BTRFS_DEDUPE_BACKEND_COUNT 2
 
 /* Dedup block size limit and default value */
 #define BTRFS_DEDUPE_BLOCKSIZE_MAX (8 * 1024 * 1024)
-- 
2.7.4
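
As a reading aid, the two dedupe-tree items that ondisk_add() inserts
can be summarized like this (derived from the insertion code above, not
from any documentation in the patch):

    /*
     * On-disk dedupe tree layout, as written by ondisk_add():
     *
     * 1) hash -> bytenr item, used at write time to find a match:
     *      key.objectid = last 64 bits of the hash
     *      key.type     = BTRFS_DEDUPE_HASH_ITEM_KEY
     *      key.offset   = extent bytenr
     *      item data    = struct btrfs_dedupe_hash_item + full hash
     *
     * 2) bytenr -> hash item, used to drop the hash when the extent
     *    is freed:
     *      key.objectid = extent bytenr
     *      key.type     = BTRFS_DEDUPE_BYTENR_ITEM_KEY
     *      key.offset   = last 64 bits of the hash
     *      item data    = full hash
     */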



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: "bad metadata" not fixed by btrfs repair

2016-03-30 Thread Qu Wenruo



Marc Haber wrote on 2016/03/30 09:18 +0200:

On Wed, Mar 30, 2016 at 03:00:19PM +0800, Qu Wenruo wrote:

Marc Haber wrote on 2016/03/29 08:43 +0200:

On Mon, Mar 28, 2016 at 03:35:32PM -0400, Austin S. Hemmelgarn wrote:

Did you convert this filesystem from ext4 (or ext3)?


No.


You hadn't mentioned what version of btrfs-progs you're using, and that is
somewhat important for recovery.  I'm not sure if current versions of btrfs
check can fix this issue, but I know for a fact that older versions (prior
to at least 4.1) can not fix it.


4.1 for creation and btrfs check.


I assume that you have run older kernel on it, like v4.1 or v4.2.


No, the productive system was always on a reasonably recent kernel. I
guess that this instance of btrfs has never been mounted on anything
older than 4.4.4. The rescue system I used to btrfs check (4.4-1 from
Debian unstable, I updated btrfs-tools on the rescue system before
going btrfs check) had kernel 3.16, but I have never actually mounted
the btrfs there.


Then btrfs check is a userspace-only matter, as it wants the fs
unmounted, and it is irrelevant that I did btrfs check from a rescue
system with an older kernel, 3.16 if I recall correctly.


Not recommended to use older kernel to RW mount or use older fsck to do
repair.


Oldest kernel that has mounted this btrfs is 4.4.4, fsck that touched
the fs is 4.4. I'm trying to get hold of btrfs-tools 4.5.


Oh, I just forgot to ask for the btrfs-progs version.

The "crossing stripe boundary" output used to be a false alert, as I
forgot to subtract 1 when checking the extent end position.
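
For illustration, the corrected check is roughly the following sketch;
the 64K value stands in for the scrub stripe length and the helper name
is only illustrative:

    static inline int check_crossing_stripes(u64 start, u64 len)
    {
            u64 stripe_len = 64 * 1024;

            /* "- 1": compare the last byte of the extent, not one past it */
            return (start / stripe_len) != ((start + len - 1) / stripe_len);
    }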


I don't remember the exact version, but updating to 4.5 is never a
bad idea.



My "productive" desktops (fan is one of them) run Debian unstable with
a current vanilla kernel. At the moment, I can't use 4.5 because it
acts up with KVM.  When I need a rescue system, I use grml, which
unfortunately hasn't released since November 2014 and is still with
kernel 3.16


To fix your problem(make these error message just disappear, even they are
harmless on recent kernels), the most easy one, is to balance your metadata.


This does not work on kernel 4.4.6 with tools 4.4. Truckloads of
kernel traces, "WARNING: CPU: 5 PID: 31021 at
fs/btrfs/extent-tree.c:7897 btrfs_alloc_tree_block+0xeb/0x3d6
[btrfs]()", "BTRFS: block rsv returned -28", full trace is in this
thread.


That's ENOSPC, which seems to be a different problem.

Did your btrfs have enough *unallocated* space?

Thanks,
Qu


Greetings
Marc




--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v9 02/19] btrfs: dedupe: Introduce function to initialize dedupe info

2016-03-30 Thread Qu Wenruo
From: Wang Xiaoguang 

Add a generic function to initialize dedupe info.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/Makefile |   2 +-
 fs/btrfs/dedupe.c | 154 ++
 fs/btrfs/dedupe.h |  16 +-
 3 files changed, 169 insertions(+), 3 deletions(-)
 create mode 100644 fs/btrfs/dedupe.c

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 128ce17..1b8c627 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -9,7 +9,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o 
root-tree.o dir-item.o \
   export.o tree-log.o free-space-cache.o zlib.o lzo.o \
   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
-  uuid-tree.o props.o hash.o free-space-tree.o
+  uuid-tree.o props.o hash.o free-space-tree.o dedupe.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
new file mode 100644
index 000..2211588
--- /dev/null
+++ b/fs/btrfs/dedupe.c
@@ -0,0 +1,154 @@
+/*
+ * Copyright (C) 2016 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+#include "ctree.h"
+#include "dedupe.h"
+#include "btrfs_inode.h"
+#include "transaction.h"
+#include "delayed-ref.h"
+
+struct inmem_hash {
+   struct rb_node hash_node;
+   struct rb_node bytenr_node;
+   struct list_head lru_list;
+
+   u64 bytenr;
+   u32 num_bytes;
+
+   u8 hash[];
+};
+
+static int init_dedupe_info(struct btrfs_dedupe_info **ret_info, u16 type,
+   u16 backend, u64 blocksize, u64 limit)
+{
+   struct btrfs_dedupe_info *dedupe_info;
+
+   dedupe_info = kzalloc(sizeof(*dedupe_info), GFP_NOFS);
+   if (!dedupe_info)
+   return -ENOMEM;
+
+   dedupe_info->hash_type = type;
+   dedupe_info->backend = backend;
+   dedupe_info->blocksize = blocksize;
+   dedupe_info->limit_nr = limit;
+
+   /* only support SHA256 yet */
+   dedupe_info->dedupe_driver = crypto_alloc_shash("sha256", 0, 0);
+   if (IS_ERR(dedupe_info->dedupe_driver)) {
+   int ret;
+
+   ret = PTR_ERR(dedupe_info->dedupe_driver);
+   kfree(dedupe_info);
+   return ret;
+   }
+
+   dedupe_info->hash_root = RB_ROOT;
+   dedupe_info->bytenr_root = RB_ROOT;
+   dedupe_info->current_nr = 0;
+   INIT_LIST_HEAD(&dedupe_info->lru_list);
+   mutex_init(&dedupe_info->lock);
+
+   *ret_info = dedupe_info;
+   return 0;
+}
+
+static int check_dedupe_parameter(struct btrfs_fs_info *fs_info, u16 hash_type,
+ u16 backend, u64 blocksize, u64 limit_nr,
+ u64 limit_mem, u64 *ret_limit)
+{
+   if (blocksize > BTRFS_DEDUPE_BLOCKSIZE_MAX ||
+   blocksize < BTRFS_DEDUPE_BLOCKSIZE_MIN ||
+   blocksize < fs_info->tree_root->sectorsize ||
+   !is_power_of_2(blocksize))
+   return -EINVAL;
+   /*
+* For new backend and hash type, we return special return code
+* as they can be easily expended.
+*/
+   if (hash_type >= ARRAY_SIZE(btrfs_dedupe_sizes))
+   return -EOPNOTSUPP;
+   if (backend >= BTRFS_DEDUPE_BACKEND_COUNT)
+   return -EOPNOTSUPP;
+
+   /* Backend specific check */
+   if (backend == BTRFS_DEDUPE_BACKEND_INMEMORY) {
+   if (!limit_nr && !limit_mem)
+   *ret_limit = BTRFS_DEDUPE_LIMIT_NR_DEFAULT;
+   else {
+   u64 tmp = (u64)-1;
+
+   if (limit_mem) {
+   tmp = limit_mem / (sizeof(struct inmem_hash) +
+   btrfs_dedupe_hash_size(hash_type));
+   /* Too small limit_mem to fill a hash item */
+   if (!tmp)
+   return -EINVAL;
+   }
+   if (!limit_nr)
+   limit_nr = (u64)-1;
+
+   *ret_limit = min(tmp, limit_nr);
+   }
+   }
+   if (backend == BTRFS_DEDUPE_BACKEND_ONDISK)
+   

[PATCH v9 19/19] btrfs: dedupe: Preparation for compress-dedupe co-work

2016-03-30 Thread Qu Wenruo
For dedupe to work with compression, new members recording the
compression algorithm and the on-disk extent length are needed.

Add them for the later compression-dedupe co-work.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h| 11 -
 fs/btrfs/dedupe.c   | 64 +++--
 fs/btrfs/dedupe.h   |  2 ++
 fs/btrfs/inode.c|  2 ++
 fs/btrfs/ordered-data.c |  2 ++
 5 files changed, 67 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b19c1f1..88702e1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -984,9 +984,14 @@ struct btrfs_dedupe_status_item {
  * Used for hash <-> bytenr search
  */
 struct btrfs_dedupe_hash_item {
-   /* length of dedupe range */
+   /* length of dedupe range in memory */
__le32 len;
 
+   /* length of dedupe range on disk */
+   __le32 disk_len;
+
+   u8 compression;
+
/* Hash follows */
 } __attribute__ ((__packed__));
 
@@ -3324,6 +3329,10 @@ BTRFS_SETGET_FUNCS(dedupe_status_backend, struct 
btrfs_dedupe_status_item,
 
 /* btrfs_dedupe_hash_item */
 BTRFS_SETGET_FUNCS(dedupe_hash_len, struct btrfs_dedupe_hash_item, len, 32);
+BTRFS_SETGET_FUNCS(dedupe_hash_disk_len, struct btrfs_dedupe_hash_item,
+  disk_len, 32);
+BTRFS_SETGET_FUNCS(dedupe_hash_compression, struct btrfs_dedupe_hash_item,
+  compression, 8);
 
 /* struct btrfs_file_extent_item */
 BTRFS_SETGET_FUNCS(file_extent_type, struct btrfs_file_extent_item, type, 8);
diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index f1d1255..25e5e1d 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -31,6 +31,8 @@ struct inmem_hash {
 
u64 bytenr;
u32 num_bytes;
+   u32 disk_num_bytes;
+   u8 compression;
 
u8 hash[];
 };
@@ -397,6 +399,8 @@ static int inmem_add(struct btrfs_dedupe_info *dedupe_info,
/* Copy the data out */
ihash->bytenr = hash->bytenr;
ihash->num_bytes = hash->num_bytes;
+   ihash->disk_num_bytes = hash->disk_num_bytes;
+   ihash->compression = hash->compression;
memcpy(ihash->hash, hash->hash, btrfs_dedupe_sizes[type]);
 
mutex_lock(&dedupe_info->lock);
@@ -442,7 +446,8 @@ static int ondisk_search_bytenr(struct btrfs_trans_handle 
*trans,
struct btrfs_path *path, u64 bytenr,
int prepare_del);
 static int ondisk_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash,
- u64 *bytenr_ret, u32 *num_bytes_ret);
+ u64 *bytenr_ret, u32 *num_bytes_ret,
+ u32 *disk_num_bytes_ret, u8 *compression);
 static int ondisk_add(struct btrfs_trans_handle *trans,
  struct btrfs_dedupe_info *dedupe_info,
  struct btrfs_dedupe_hash *hash)
@@ -471,7 +476,8 @@ static int ondisk_add(struct btrfs_trans_handle *trans,
}
btrfs_release_path(path);
 
-   ret = ondisk_search_hash(dedupe_info, hash->hash, &bytenr, &num_bytes);
+   ret = ondisk_search_hash(dedupe_info, hash->hash, &bytenr, &num_bytes,
+NULL, NULL);
if (ret < 0)
goto out;
/* Same hash found, don't re-add to save dedupe tree space */
@@ -493,6 +499,10 @@ static int ondisk_add(struct btrfs_trans_handle *trans,
hash_item = btrfs_item_ptr(path->nodes[0], path->slots[0],
   struct btrfs_dedupe_hash_item);
btrfs_set_dedupe_hash_len(path->nodes[0], hash_item, hash->num_bytes);
+   btrfs_set_dedupe_hash_disk_len(path->nodes[0], hash_item,
+  hash->disk_num_bytes);
+   btrfs_set_dedupe_hash_compression(path->nodes[0], hash_item,
+ hash->compression);
write_extent_buffer(path->nodes[0], hash->hash,
(unsigned long)(hash_item + 1), hash_len);
btrfs_mark_buffer_dirty(path->nodes[0]);
@@ -840,7 +850,8 @@ int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
  * Return <0 for error
  */
 static int ondisk_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash,
- u64 *bytenr_ret, u32 *num_bytes_ret)
+ u64 *bytenr_ret, u32 *num_bytes_ret,
+ u32 *disk_num_bytes_ret, u8 *compression_ret)
 {
struct btrfs_path *path;
struct btrfs_key key;
@@ -896,8 +907,19 @@ static int ondisk_search_hash(struct btrfs_dedupe_info 
*dedupe_info, u8 *hash,
   hash_len);
if (!memcmp(buf, hash, hash_len)) {
ret = 1;
-   *bytenr_ret = key.offset;
-   *num_bytes_ret = btrfs_dedupe_hash_len(node, hash_item);
+   if (bytenr_ret)
+   *bytenr_ret = key.offset;
+  

Re: "bad metadata" not fixed by btrfs repair

2016-03-30 Thread Marc Haber
On Wed, Mar 30, 2016 at 04:03:17PM +0800, Qu Wenruo wrote:
> Did your btrfs have enough *unallocated* space?

87 Gig out of a total 200 Gig device size. I guess that should be
enough for a rebalance of 2.8 Gig of metadata.

Greetings
Ma "please excuse my cynism" rc

-- 
-
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany|  lose things."Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[no subject]

2016-03-30 Thread Sanidhya Solanki
subscribe linux-btrfs
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: fix file loss caused by fsync after rename and new inode

2016-03-30 Thread fdmanana
From: Filipe Manana 

If we rename an inode A (be it a file or a directory), create a new
inode B with the old name of inode A and under the same parent directory,
at log tree replay time we end up removing inode A completely. If inode A
is a directory then all its files are gone too.

This is reproducible with the following steps, taken from a couple of
test cases written for fstests which are going to be submitted upstream
soon:

   # Scenario 1

   mkfs.btrfs -f /dev/sdc
   mount /dev/sdc /mnt
   mkdir -p /mnt/a/x
   echo "hello" > /mnt/a/x/foo
   echo "world" > /mnt/a/x/bar
   sync
   mv /mnt/a/x /mnt/a/y
   mkdir /mnt/a/x
   xfs_io -c fsync /mnt/a/x
   

   The next time the fs is mounted, log tree replay happens and
   the directory "y" does not exist nor do the files "foo" and
   "bar" exist anywhere (neither in "y" nor in "x", nor the root
   nor anywhere).

   # Scenario 2

   mkfs.btrfs -f /dev/sdc
   mount /dev/sdc /mnt
   mkdir /mnt/a
   echo "hello" > /mnt/a/foo
   sync
   mv /mnt/a/foo /mnt/a/bar
   echo "world" > /mnt/a/foo
   xfs_io -c fsync /mnt/a/foo
   

   The next time the fs is mounted, log tree replay happens and the
   file "bar" does not exists anymore. A file with the name "foo"
   exists and it matches the second file we created.

Fix this by forcing a transaction commit when such cases happen.
This means we check in the commit root of the subvolume tree if there
was any other inode with the same reference when the inode we are
fsync'ing is a new inode (created in the current transaction).

Two test cases for fstests are being submitted upstream.

Cc: sta...@vger.kernel.org
Signed-off-by: Filipe Manana 
---
 fs/btrfs/tree-log.c | 121 
 1 file changed, 121 insertions(+)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 24d03c7..1142e77 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4415,6 +4415,111 @@ static int btrfs_log_trailing_hole(struct 
btrfs_trans_handle *trans,
return ret;
 }
 
+/*
+ * When we are logging a new inode X, check if it doesn't have a reference that
+ * matches the reference from some other inode Y which is a directory that was
+ * created in a past transaction and was renamed. If we don't do this, then at
+ * log replay time we can lose files inside the directory. Example:
+ *
+ * mkdir /mnt/x
+ * echo "hello world" > /mnt/x/foobar
+ * sync
+ * mv /mnt/x /mnt/y
+ * mkdir /mnt/x # or touch /mnt/x
+ * xfs_io -c fsync /mnt/x
+ * 
+ * mount fs, trigger log replay
+ *
+ * After the log replay procedure, we would lose the first directory and all 
its
+ * files.
+ * For the case where inode Y is not a directory we simply end up losing it:
+ *
+ * echo "123" > /mnt/foo
+ * sync
+ * mv /mnt/foo /mnt/bar
+ * echo "abc" > /mnt/foo
+ * xfs_io -c fsync /mnt/foo
+ * 
+ */
+static int btrfs_check_ref_name_override(struct extent_buffer *eb,
+const int slot,
+const struct btrfs_key *key,
+struct inode *inode)
+{
+   int ret;
+   struct btrfs_path *search_path;
+   char *name = NULL;
+   u32 name_len = 0;
+   u32 item_size = btrfs_item_size_nr(eb, slot);
+   u32 cur_offset = 0;
+   unsigned long ptr = btrfs_item_ptr_offset(eb, slot);
+
+   search_path = btrfs_alloc_path();
+   if (!search_path)
+   return -ENOMEM;
+   search_path->search_commit_root = 1;
+   search_path->skip_locking = 1;
+
+   while (cur_offset < item_size) {
+   u64 parent;
+   u32 this_name_len;
+   u32 this_len;
+   unsigned long name_ptr;
+   struct btrfs_dir_item *di;
+
+   if (key->type == BTRFS_INODE_REF_KEY) {
+   struct btrfs_inode_ref *iref;
+
+   iref = (struct btrfs_inode_ref *)(ptr + cur_offset);
+   parent = key->offset;
+   this_name_len = btrfs_inode_ref_name_len(eb, iref);
+   name_ptr = (unsigned long)(iref + 1);
+   this_len = sizeof(*iref) + this_name_len;
+   } else {
+   struct btrfs_inode_extref *extref;
+
+   extref = (struct btrfs_inode_extref *)(ptr +
+  cur_offset);
+   parent = btrfs_inode_extref_parent(eb, extref);
+   this_name_len = btrfs_inode_extref_name_len(eb, extref);
+   name_ptr = (unsigned long)&extref->name;
+   this_len = sizeof(*extref) + this_name_len;
+   }
+
+   if (this_name_len > name_len) {
+   char *new_name;
+
+   new_name = krealloc(name, this_name_len, GFP_NOFS);
+   if (!new_name) {
+   ret =

[PATCH 2/2] fstests: generic test for fsync after renaming file

2016-03-30 Thread fdmanana
From: Filipe Manana 

Test that if we rename a file, create a new file that has the old name
of the other file and is a child of the same parent directory, fsync the
new inode, power fail and mount the filesystem, we do not lose the first
file and that file has the name it was renamed to.

This test is motivated by an issue found in btrfs which is fixed by the
following patch for the linux kernel:

  "Btrfs: fix file loss caused by fsync after rename and new inode"

Signed-off-by: Filipe Manana 
---
 tests/generic/341 | 90 +++
 tests/generic/341.out | 15 +
 tests/generic/group   |  1 +
 3 files changed, 106 insertions(+)
 create mode 100755 tests/generic/341
 create mode 100644 tests/generic/341.out

diff --git a/tests/generic/341 b/tests/generic/341
new file mode 100755
index 000..b70bd95
--- /dev/null
+++ b/tests/generic/341
@@ -0,0 +1,90 @@
+#! /bin/bash
+# FSQA Test No. 341
+#
+# Test that if we rename a file, create a new file that has the old name of the
+# other file and is a child of the same parent directory, fsync the new inode,
+# power fail and mount the filesystem, we do not lose the first file and that
+# file has the name it was renamed to.
+#
+#---
+#
+# Copyright (C) 2016 SUSE Linux Products GmbH. All Rights Reserved.
+# Author: Filipe Manana 
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   _cleanup_flakey
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmflakey
+
+# real QA test starts here
+_supported_fs generic
+_supported_os Linux
+_require_scratch
+_require_dm_target flakey
+_require_metadata_journaling $SCRATCH_DEV
+
+rm -f $seqres.full
+
+_scratch_mkfs >>$seqres.full 2>&1
+_init_flakey
+_mount_flakey
+
+mkdir $SCRATCH_MNT/a
+$XFS_IO_PROG -f -c "pwrite -S 0xf1 0 16K" $SCRATCH_MNT/a/foo | _filter_xfs_io
+# Make sure everything done so far is durably persisted.
+sync
+
+# Now rename file foo to bar and create a new file named foo under the same
+# directory. After a power failure we must see the two files.
+mv $SCRATCH_MNT/a/foo $SCRATCH_MNT/a/bar
+$XFS_IO_PROG -f -c "pwrite -S 0xba 0 16K" $SCRATCH_MNT/a/foo | _filter_xfs_io
+$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a/foo
+
+echo "File digests before log replay:"
+md5sum $SCRATCH_MNT/a/foo | _filter_scratch
+md5sum $SCRATCH_MNT/a/bar | _filter_scratch
+
+# Simulate a power failure and mount again the filesystem to trigger replay of
+# its journal/log.
+_flakey_drop_and_remount
+
+echo "Directory a/ contents after log replay:"
+ls -R $SCRATCH_MNT/a | _filter_scratch
+
+echo "File digests after log replay:"
+# Must match what we got before the power failure.
+md5sum $SCRATCH_MNT/a/foo | _filter_scratch
+md5sum $SCRATCH_MNT/a/bar | _filter_scratch
+
+_unmount_flakey
+status=0
+exit
diff --git a/tests/generic/341.out b/tests/generic/341.out
new file mode 100644
index 000..29c3566
--- /dev/null
+++ b/tests/generic/341.out
@@ -0,0 +1,15 @@
+QA output created by 341
+wrote 16384/16384 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 16384/16384 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+File digests before log replay:
+9e5d56a1f9b2c93589f9d55480f971a1  SCRATCH_MNT/a/foo
+48c940ba3b8671d3d6ea74e4ccad8ca3  SCRATCH_MNT/a/bar
+Directory a/ contents after log replay:
+SCRATCH_MNT/a:
+bar
+foo
+File digests after log replay:
+9e5d56a1f9b2c93589f9d55480f971a1  SCRATCH_MNT/a/foo
+48c940ba3b8671d3d6ea74e4ccad8ca3  SCRATCH_MNT/a/bar
diff --git a/tests/generic/group b/tests/generic/group
index baaffdf..3ece496 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -343,3 +343,4 @@
 338 auto quick rw
 339 auto dir
 340 auto quick metadata
+341 auto quick metadata
-- 
2.7.0.rc3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] fstests: generic test for fsync after renaming directory

2016-03-30 Thread fdmanana
From: Filipe Manana 

Test that if we rename a directory, create a new file or directory that
has the old name of our former directory and is a child of the same
parent directory, fsync the new inode, power fail and mount the
filesystem, we see our first directory with the new name and no files
under it were lost.

This test is motivated by an issue found in btrfs which is fixed by the
following patch for the linux kernel:

  "Btrfs: fix file loss caused by fsync after rename and new inode"

Signed-off-by: Filipe Manana 
---
 tests/generic/340 | 93 +++
 tests/generic/340.out | 21 
 tests/generic/group   |  1 +
 3 files changed, 115 insertions(+)
 create mode 100755 tests/generic/340
 create mode 100644 tests/generic/340.out

diff --git a/tests/generic/340 b/tests/generic/340
new file mode 100755
index 000..6fe6ee7
--- /dev/null
+++ b/tests/generic/340
@@ -0,0 +1,93 @@
+#! /bin/bash
+# FSQA Test No. 340
+#
+# Test that if we rename a directory, create a new file or directory that has
+# the old name of our former directory and is a child of the same parent
+# directory, fsync the new inode, power fail and mount the filesystem, we see
+# our first directory with the new name and no files under it were lost.
+#
+#---
+#
+# Copyright (C) 2016 SUSE Linux Products GmbH. All Rights Reserved.
+# Author: Filipe Manana 
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   _cleanup_flakey
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmflakey
+
+# real QA test starts here
+_supported_fs generic
+_supported_os Linux
+_require_scratch
+_require_dm_target flakey
+_require_metadata_journaling $SCRATCH_DEV
+
+rm -f $seqres.full
+
+_scratch_mkfs >>$seqres.full 2>&1
+_init_flakey
+_mount_flakey
+
+mkdir -p $SCRATCH_MNT/a/x
+$XFS_IO_PROG -f -c "pwrite -S 0xaf 0 32K" $SCRATCH_MNT/a/x/foo | _filter_xfs_io
+$XFS_IO_PROG -f -c "pwrite -S 0xba 0 32K" $SCRATCH_MNT/a/x/bar | _filter_xfs_io
+# Make sure everything done so far is durably persisted.
+sync
+
+echo "File digests before power failure:"
+md5sum $SCRATCH_MNT/a/x/foo | _filter_scratch
+md5sum $SCRATCH_MNT/a/x/bar | _filter_scratch
+
+# Now rename directory x to y and create a new directory that is also named x.
+# Then fsync the new directory. After a power failure, we must see directories
+# y and x and directory y has the same files (and with the same content) it had
+# before the power failure.
+mv $SCRATCH_MNT/a/x $SCRATCH_MNT/a/y
+mkdir $SCRATCH_MNT/a/x
+$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a/x
+
+# Simulate a power failure and mount again the filesystem to trigger replay of
+# its journal/log.
+_flakey_drop_and_remount
+
+echo "Directory a/ contents after log replay:"
+ls -R $SCRATCH_MNT/a | _filter_scratch
+
+echo "File digests after log replay:"
+# Must match what we got before the power failure.
+md5sum $SCRATCH_MNT/a/y/foo | _filter_scratch
+md5sum $SCRATCH_MNT/a/y/bar | _filter_scratch
+
+_unmount_flakey
+status=0
+exit
diff --git a/tests/generic/340.out b/tests/generic/340.out
new file mode 100644
index 000..f2fe4ca
--- /dev/null
+++ b/tests/generic/340.out
@@ -0,0 +1,21 @@
+QA output created by 340
+wrote 32768/32768 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 32768/32768 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+File digests before power failure:
+b6ef98c3df97dfc5ff17266311c2fb9e  SCRATCH_MNT/a/x/foo
+41107c24d306bdc4fecac4007e9aa214  SCRATCH_MNT/a/x/bar
+Directory a/ contents after log replay:
+SCRATCH_MNT/a:
+x
+y
+
+SCRATCH_MNT/a/x:
+
+SCRATCH_MNT/a/y:
+bar
+foo
+File digests after log replay:
+b6ef98c3df97dfc5ff17266311c2fb9e  SCRATCH_MNT/a/y/foo
+41107c24d306bdc4fecac4007e9aa214  SCRATCH_MNT/a/y/bar
diff --git a/tests/generic/group b/tests/generic/group
index cd2a2b7..baaffdf 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -342,3 +342,4 @@
 337 auto quick metadata
 338 auto quick rw
 339 aut

Re: [PATCH v9 15/19] btrfs: dedupe: Introduce interfaces to resume and cleanup dedupe info

2016-03-30 Thread kbuild test robot
Hi Qu,

[auto build test ERROR on btrfs/next]
[also build test ERROR on v4.6-rc1 next-20160330]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improving the system]

url:
https://github.com/0day-ci/linux/commits/Qu-Wenruo/Btrfs-dedupe-framework/20160330-160940
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
next
config: i386-randconfig-s1-201613 (attached as .config)
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   fs/built-in.o: In function `check_dedupe_parameter':
>> dedupe.c:(.text+0x3675e5): undefined reference to `__udivdi3'
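
An undefined __udivdi3 on an i386 build almost always means a plain "/"
division of a u64. The usual fix in check_dedupe_parameter() would be
div_u64() from linux/math64.h, roughly like this (a sketch, not
necessarily the change that was actually folded into the series):

    #include <linux/math64.h>

    /* instead of: tmp = limit_mem / (sizeof(struct inmem_hash) + hash_size); */
    tmp = div_u64(limit_mem,
                  sizeof(struct inmem_hash) +
                  btrfs_dedupe_hash_size(hash_type));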

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: Binary data


Re: [PATCH] Btrfs: fix unreplayable log after snapshot deletion and parent re-creation

2016-03-30 Thread Filipe Manana
On Thu, Mar 24, 2016 at 5:06 PM,   wrote:
> From: Filipe Manana 
>
> If we delete a snapshot, delete its parent directory, create a new
> directory with the same name as that parent and then fsync either that
> new directory or some file inside it, we end up with a log tree that
> is not possible to replay because the log replay procedure interprets
> the snapshot's directory item as a regular entry and not as a root
> item, resulting in the following failure and trace when mounting the
> filesystem:
>
> [52174.510532] BTRFS info (device dm-0): failed to delete reference to snap, 
> inode 257 parent 257
> [52174.512570] [ cut here ]
> [52174.513278] WARNING: CPU: 12 PID: 28024 at fs/btrfs/inode.c:3986 
> __btrfs_unlink_inode+0x178/0x351 [btrfs]()
> [52174.514681] BTRFS: Transaction aborted (error -2)
> [52174.515630] Modules linked in: btrfs dm_flakey dm_mod overlay 
> crc32c_generic ppdev xor raid6_pq acpi_cpufreq parport_pc tpm_tis sg parport 
> tpm evdev i2c_piix4 proc
> [52174.521568] CPU: 12 PID: 28024 Comm: mount Tainted: GW   
> 4.5.0-rc6-btrfs-next-27+ #1
> [52174.522805] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by 
> qemu-project.org 04/01/2014
> [52174.524053]   8801df2a7710 81264e93 
> 8801df2a7758
> [52174.524053]  0009 8801df2a7748 81051618 
> a03591cd
> [52174.524053]  fffe 88015e6e5000 88016dbc3c88 
> 88016dbc3c88
> [52174.524053] Call Trace:
> [52174.524053]  [] dump_stack+0x67/0x90
> [52174.524053]  [] warn_slowpath_common+0x99/0xb2
> [52174.524053]  [] ? __btrfs_unlink_inode+0x178/0x351 
> [btrfs]
> [52174.524053]  [] warn_slowpath_fmt+0x48/0x50
> [52174.524053]  [] __btrfs_unlink_inode+0x178/0x351 [btrfs]
> [52174.524053]  [] ? iput+0xb0/0x284
> [52174.524053]  [] btrfs_unlink_inode+0x1c/0x3d [btrfs]
> [52174.524053]  [] check_item_in_log+0x1fe/0x29b [btrfs]
> [52174.524053]  [] replay_dir_deletes+0x167/0x1cf [btrfs]
> [52174.524053]  [] fixup_inode_link_count+0x289/0x2aa 
> [btrfs]
> [52174.524053]  [] fixup_inode_link_counts+0xcb/0x105 
> [btrfs]
> [52174.524053]  [] btrfs_recover_log_trees+0x258/0x32c 
> [btrfs]
> [52174.524053]  [] ? replay_one_extent+0x511/0x511 [btrfs]
> [52174.524053]  [] open_ctree+0x1dd4/0x21b9 [btrfs]
> [52174.524053]  [] btrfs_mount+0x97e/0xaed [btrfs]
> [52174.524053]  [] ? trace_hardirqs_on+0xd/0xf
> [52174.524053]  [] mount_fs+0x67/0x131
> [52174.524053]  [] vfs_kern_mount+0x6c/0xde
> [52174.524053]  [] btrfs_mount+0x1ac/0xaed [btrfs]
> [52174.524053]  [] ? trace_hardirqs_on+0xd/0xf
> [52174.524053]  [] ? lockdep_init_map+0xb9/0x1b3
> [52174.524053]  [] mount_fs+0x67/0x131
> [52174.524053]  [] vfs_kern_mount+0x6c/0xde
> [52174.524053]  [] do_mount+0x8a6/0x9e8
> [52174.524053]  [] ? strndup_user+0x3f/0x59
> [52174.524053]  [] SyS_mount+0x77/0x9f
> [52174.524053]  [] entry_SYSCALL_64_fastpath+0x12/0x6b
> [52174.561288] ---[ end trace 6b53049efb1a3ea6 ]---
>
> So when we delete a directory we need to propagate its last_unlink_trans
> value (updated on snapshot deletion) to its parent and then check at
> fsync time for it and fallback for a transaction commit.
>
> A test case for fstests follows.
>
>   seq=`basename $0`
>   seqres=$RESULT_DIR/$seq
>   echo "QA output created by $seq"
>   tmp=/tmp/$$
>   status=1  # failure is the default!
>   trap "_cleanup; exit \$status" 0 1 2 3 15
>
>   _cleanup()
>   {
>   _cleanup_flakey
>   cd /
>   rm -f $tmp.*
>   }
>
>   # get standard environment, filters and checks
>   . ./common/rc
>   . ./common/filter
>   . ./common/dmflakey
>
>   # real QA test starts here
>   _supported_fs btrfs
>   _supported_os Linux
>   _require_scratch
>   _require_dm_target flakey
>   _require_metadata_journaling $SCRATCH_DEV
>
>   rm -f $seqres.full
>
>   _populate_fs()
>   {
>   _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT \
>   $SCRATCH_MNT/testdir/snap
>   _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/testdir/snap
>   rmdir $SCRATCH_MNT/testdir
>   mkdir $SCRATCH_MNT/testdir
>   }
>
>   _scratch_mkfs >>$seqres.full 2>&1
>   _init_flakey
>   _mount_flakey
>
>   mkdir $SCRATCH_MNT/testdir
>   _populate_fs
>   $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir
>   _flakey_drop_and_remount
>
>   echo "Filesystem contents after the first log replay:"
>   ls -R $SCRATCH_MNT | _filter_scratch
>
>   # Now do the same as before but instead of doing an fsync against the 
> directory,
>   # do an fsync against a file inside the directory.
>
>   _populate_fs
>   touch $SCRATCH_MNT/testdir/foobar
>   $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir/foobar
>   _flakey_drop_and_remount
>
>   echo "Filesystem contents after the second log replay:"
>   ls -R $SCRATCH_MNT | _filter_scratch
>
>   _unmount_flakey
>   status=0
>   exit
>
> Signed-off-by: Filipe Manana 

This patch is no longer needed. It is replaced by the following new patch:

https://patchwork.kernel.org/patch

Re: [PATCH 2/2] fstests: generic test for fsync after renaming file

2016-03-30 Thread Filipe Manana
On Wed, Mar 30, 2016 at 10:39 AM,   wrote:
> From: Filipe Manana 
>
> Test that if we rename a file, create a new file that has the old name
> of the other file and is a child of the same parent directory, fsync the
> new inode, power fail and mount the filesystem, we do not lose the first
> file and that file has the name it was renamed to.
>
> This test is motivated by an issue found in btrfs which is fixed by the
> following patch for the linux kernel:
>
>   "Btrfs: fix file loss caused by fsync after rename and new inode"
>
> Signed-off-by: Filipe Manana 

I forgot to mention: this time it's not only btrfs that fails this test
(a miracle). With a 4.5 kernel, f2fs also fails (while ext3/4, xfs and
reiserfs pass, for example), as the file that was renamed is lost
(it fails the same way btrfs does).

> ---
>  tests/generic/341 | 90 
> +++
>  tests/generic/341.out | 15 +
>  tests/generic/group   |  1 +
>  3 files changed, 106 insertions(+)
>  create mode 100755 tests/generic/341
>  create mode 100644 tests/generic/341.out
>
> diff --git a/tests/generic/341 b/tests/generic/341
> new file mode 100755
> index 000..b70bd95
> --- /dev/null
> +++ b/tests/generic/341
> @@ -0,0 +1,90 @@
> +#! /bin/bash
> +# FSQA Test No. 341
> +#
> +# Test that if we rename a file, create a new file that has the old name of 
> the
> +# other file and is a child of the same parent directory, fsync the new 
> inode,
> +# power fail and mount the filesystem, we do not lose the first file and that
> +# file has the name it was renamed to.
> +#
> +#---
> +#
> +# Copyright (C) 2016 SUSE Linux Products GmbH. All Rights Reserved.
> +# Author: Filipe Manana 
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +tmp=/tmp/$$
> +status=1   # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +   _cleanup_flakey
> +   cd /
> +   rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +. ./common/dmflakey
> +
> +# real QA test starts here
> +_supported_fs generic
> +_supported_os Linux
> +_require_scratch
> +_require_dm_target flakey
> +_require_metadata_journaling $SCRATCH_DEV
> +
> +rm -f $seqres.full
> +
> +_scratch_mkfs >>$seqres.full 2>&1
> +_init_flakey
> +_mount_flakey
> +
> +mkdir $SCRATCH_MNT/a
> +$XFS_IO_PROG -f -c "pwrite -S 0xf1 0 16K" $SCRATCH_MNT/a/foo | _filter_xfs_io
> +# Make sure everything done so far is durably persisted.
> +sync
> +
> +# Now rename file foo to bar and create a new file named foo under the same
> +# directory. After a power failure we must see the two files.
> +mv $SCRATCH_MNT/a/foo $SCRATCH_MNT/a/bar
> +$XFS_IO_PROG -f -c "pwrite -S 0xba 0 16K" $SCRATCH_MNT/a/foo | _filter_xfs_io
> +$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a/foo
> +
> +echo "File digests before log replay:"
> +md5sum $SCRATCH_MNT/a/foo | _filter_scratch
> +md5sum $SCRATCH_MNT/a/bar | _filter_scratch
> +
> +# Simulate a power failure and mount again the filesystem to trigger replay 
> of
> +# its journal/log.
> +_flakey_drop_and_remount
> +
> +echo "Directory a/ contents after log replay:"
> +ls -R $SCRATCH_MNT/a | _filter_scratch
> +
> +echo "File digests after log replay:"
> +# Must match what we got before the power failure.
> +md5sum $SCRATCH_MNT/a/foo | _filter_scratch
> +md5sum $SCRATCH_MNT/a/bar | _filter_scratch
> +
> +_unmount_flakey
> +status=0
> +exit
> diff --git a/tests/generic/341.out b/tests/generic/341.out
> new file mode 100644
> index 000..29c3566
> --- /dev/null
> +++ b/tests/generic/341.out
> @@ -0,0 +1,15 @@
> +QA output created by 341
> +wrote 16384/16384 bytes at offset 0
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 16384/16384 bytes at offset 0
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +File digests before log replay:
> +9e5d56a1f9b2c93589f9d55480f971a1  SCRATCH_MNT/a/foo
> +48c940ba3b8671d3d6ea74e4ccad8ca3  SCRATCH_MNT/a/bar
> +Directory a/ contents after log replay:
> +SCRATCH_MNT/a:
> +bar
> +foo
> +File digests after log replay:
> +9e5d56a1f9b2c93589f9d5548

Re: [PATCH v9 02/19] btrfs: dedupe: Introduce function to initialize dedupe info

2016-03-30 Thread kbuild test robot
Hi Wang,

[auto build test ERROR on btrfs/next]
[also build test ERROR on v4.6-rc1 next-20160330]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improving the system]

url:
https://github.com/0day-ci/linux/commits/Qu-Wenruo/Btrfs-dedupe-framework/20160330-160940
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
next
config: i386-randconfig-s1-201613 (attached as .config)
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

Note: the linux-review/Qu-Wenruo/Btrfs-dedupe-framework/20160330-160940 HEAD 
a69abdf2c788b7de14a7dc0c237c1df545d13175 builds fine.
  It only hurts bisectibility.

All errors (new ones prefixed by >>):

   fs/built-in.o: In function `btrfs_dedupe_enable':
>> (.text+0x366ac1): undefined reference to `btrfs_dedupe_disable'
   fs/built-in.o: In function `btrfs_dedupe_enable':
   (.text+0x366b8b): undefined reference to `__udivdi3'
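
The btrfs_dedupe_enable/btrfs_dedupe_disable mismatch only shows up at
this point of the series (the robot notes the full series builds fine),
and the __udivdi3 reference is the same 64-bit division issue reported
against patch 15. If per-patch buildability mattered, one hypothetical
way to keep this step linking would be a temporary fallback in dedupe.h,
dropped as soon as the real implementation lands; it is not part of the
posted series:

    /* hypothetical bisectability stub, removed by a later patch */
    static inline int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
    {
            return -EOPNOTSUPP;
    }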

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: Binary data


Re: [PATCH 11/12] btrfs: introduce helper functions to perform hot replace

2016-03-30 Thread Anand Jain



Hi,

 You are missing the patch set which includes
   https://patchwork.kernel.org/patch/8659651/

 btrfs: refactor btrfs_dev_replace_start for reuse


Thanks, Anand


On 03/29/2016 10:45 PM, kbuild test robot wrote:

Hi Anand,

[auto build test ERROR on btrfs/next]
[also build test ERROR on v4.6-rc1 next-20160329]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improving the system]

url:
https://github.com/0day-ci/linux/commits/Anand-Jain/btrfs-Introduce-a-new-function-to-check-if-all-chunks-a-OK-for-degraded-mount/20160329-222724
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
next
config: sparc64-allmodconfig (attached as .config)
reproduce:
 wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
 chmod +x ~/bin/make.cross
 # save the attached .config to linux build tree
 make.cross ARCH=sparc64

All error/warnings (new ones prefixed by >>):

fs/btrfs/dev-replace.c: In function 'btrfs_auto_replace_start':

fs/btrfs/dev-replace.c:962:8: warning: passing argument 2 of 
'btrfs_dev_replace_start' from incompatible pointer type

  ret = btrfs_dev_replace_start(root, tgt_path,
^
fs/btrfs/dev-replace.c:308:5: note: expected 'struct 
btrfs_ioctl_dev_replace_args *' but argument is of type 'char *'
 int btrfs_dev_replace_start(struct btrfs_root *root,
 ^

fs/btrfs/dev-replace.c:962:8: error: too many arguments to function 
'btrfs_dev_replace_start'

  ret = btrfs_dev_replace_start(root, tgt_path,
^
fs/btrfs/dev-replace.c:308:5: note: declared here
 int btrfs_dev_replace_start(struct btrfs_root *root,
 ^

vim +/btrfs_dev_replace_start +962 fs/btrfs/dev-replace.c

956 if (btrfs_get_spare_device(&tgt_path)) {
957 btrfs_err(root->fs_info,
958 "No spare device found/configured in the 
kernel");
959 return -EINVAL;
960 }
961 
  > 962  ret = btrfs_dev_replace_start(root, tgt_path,
963 src_device->devid,
964 rcu_str_deref(src_device->name),
965 
BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID);

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework

2016-03-30 Thread Alex Lyakas
Thanks for your comments, Qu.

Alex.


On Wed, Mar 30, 2016 at 2:34 AM, Qu Wenruo  wrote:
>
>
> Alex Lyakas wrote on 2016/03/29 19:22 +0200:
>>
>> Greetings Qu Wenruo,
>>
>> I have reviewed the dedup patchset found in the github account you
>> mentioned. I have several questions. Please note that by all means I
>> am not criticizing your design or code. I just want to make sure that
>> my understanding of the code is proper.
>
>
> It's OK to criticize the design or code, and that's how review works.
>
>>
>> 1) You mentioned in several emails that at some point byte-to-byte
>> comparison is to be performed. However, I do not see this in the code.
>> It seems that generic_search() only looks for the hash value match. If
>> there is a match, it goes ahead and adds a delayed ref.
>
>
> I mentioned byte-to-byte comparison as, "not to be implemented in any time
> soon".
>
> Considering the lack of facility to read out extent contents without any
> inode structure, it's not going to be done in any time soon.
>
>>
>> 2) If btrfs_dedupe_search() does not find a match, we unlock the dedup
>> mutex and proceed with the normal COW. What happens if there are
>> several IO streams to different files writing an identical block, but
>> we don't have such block in our dedup DB? Then all
>> btrfs_dedupe_search() calls will not find a match, so all streams will
>> allocate space for their block (which are all identical). At some
>> point, they will call insert_reserved_file_extent() and will call
>> btrfs_dedupe_add(). Since there is a global mutex, the first stream
>> will insert the dedup hash entries into the DB, and all other streams
>> will find that such hash entry already exists. So the end result is
>> that we have the hash entry in the DB, but still we have multiple
>> copies of the same block allocated, due to timing issues. Is this
>> correct?
>
>
> That's right, and that's also unavoidable for the hash initializing stage.
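
For illustration, here is a small userspace sketch of the window described
above (an assumption on my side: this is a toy pthread model written for
this discussion, not code from the patchset; in the kernel the lookup and
insert run under the dedupe mutex instead):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t db_lock = PTHREAD_MUTEX_INITIALIZER;
static int db_has_hash;		/* 0 = hash not in the "dedupe DB" yet */
static int allocations;		/* how many distinct copies got written */

static void *writer(void *arg)
{
	int hit;

	(void)arg;

	/* "btrfs_dedupe_search()": hash lookup under the dedupe lock */
	pthread_mutex_lock(&db_lock);
	hit = db_has_hash;
	pthread_mutex_unlock(&db_lock);

	usleep(1000);	/* widen the window for demonstration */

	if (!hit) {
		/* miss: normal COW path, allocate a new extent */
		__sync_fetch_and_add(&allocations, 1);

		/* "btrfs_dedupe_add()": the first inserter wins; later ones
		 * find the entry already there but have already allocated */
		pthread_mutex_lock(&db_lock);
		db_has_hash = 1;
		pthread_mutex_unlock(&db_lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, writer, NULL);
	pthread_create(&b, NULL, writer, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);

	/* typically prints allocations=2, db entries=1: the timing hole */
	printf("allocations=%d, db entries=%d\n", allocations, db_has_hash);
	return 0;
}
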
>
>>
>> 3) generic_search() competes with __btrfs_free_extent(). Meaning that
>> generic_search() wants to add a delayed ref to an existing extent,
>> whereas __btrfs_free_extent() wants to delete an entry from the dedup
>> DB. The race is resolved as follows:
>> - generic_search attempts to lock the delayed ref head
>> - if it succeeds to lock, then __btrfs_free_extent() is not running
>> right now. So we can add a delayed ref. Later, when delayed ref head
>> will be run, it will figure out what needs to be done (free the extent
>> or not)
>> - if we fail to lock, then there is a delayed ref processing for this
>> bytenr. We drop all locks and redo the search from the top. If
>> __btrfs_free_extent() has deleted the dedup hash meanwhile, we will
>> not find it, and proceed with normal COW.
>> Is my understanding correct?
>
>
> Yes that's correct.
>
>>
>> I have also few nitpicks on the code, will reply to relevant patches.
>
>
> Feel free to comment.
>
> Thanks,
> Qu
>
>>
>> Thanks for doing this work,
>> Alex.
>>
>>
>>
>> On Tue, Mar 22, 2016 at 3:35 AM, Qu Wenruo 
>> wrote:
>>>
>>> This patchset can be fetched from github:
>>> https://github.com/adam900710/linux.git wang_dedupe_20160322
>>>
>>> This updated version of inband de-duplication has the following features:
>>> 1) ONE unified dedup framework.
>>> Most of its code is hidden quietly in dedup.c, exporting only minimal
>>> interfaces to its callers.
>>> Reviewers and future developers would benefit from the unified
>>> framework.
>>>
>>> 2) TWO different back-ends with different trade-offs
>>> One is the improved version of the previous Fujitsu in-memory-only dedup.
>>> The other is an enhanced dedup implementation from Liu Bo, with its
>>> tree structure changed to handle bytenr -> hash searches for deleting
>>> hashes, without the hideous data backref hack.
>>>
>>> 3) Support compression with dedupe
>>> Now dedupe can work with compression.
>>> This means that a dedupe miss can still be compressed, and a dedupe
>>> hit can reuse compressed file extents.
>>>
>>> 4) Ioctl interface with persistent dedup status
>>> As advised by David, we now use an ioctl to enable/disable dedup.
>>>
>>> And we now have a dedup status, recorded in the first item of the
>>> dedup tree.
>>> Just like quota, once enabled, no extra ioctl is needed on the next
>>> mount.
>>>
>>> 5) Ability to disable dedup for given dirs/files
>>> It works just like the compression prop method, by adding a new
>>> xattr.
>>>
>>> TODO:
>>> 1) Add extent-by-extent comparison for faster but more collision-prone
>>> hash algorithms
>>> The current SHA256 hash is quite slow, and on some old (5-year-old)
>>> CPUs, the CPU may even be the bottleneck rather than IO.
>>> But a faster hash will definitely cause collisions, so we need
>>> extent comparison before we introduce a new dedup algorithm.
>>>
>>> 2) Misc end-user related helpers
>>> Like handy and easy to implement dedup rate report.
>>> And method to query in-memory hash size for

Re: Global hotspare functionality

2016-03-30 Thread Austin S. Hemmelgarn

On 2016-03-29 16:26, Chris Murphy wrote:

On Tue, Mar 29, 2016 at 1:59 PM, Austin S. Hemmelgarn
 wrote:

On 2016-03-29 15:24, Yauhen Kharuzhy wrote:


On Tue, Mar 29, 2016 at 10:41:36PM +0800, Anand Jain wrote:



   No. No. No, please don't do that, it would lead to trouble in handling
   slow devices. I purposely didn't do it.



Hmm. Can you explain please? Sometimes admins may want autoreplacement
to work automatically if a drive failed and was removed before
unmounting and remounting again. The simplest way to achieve this is to
add a spare and always mount the FS with the 'degraded' option (we need
to use this option in any case if we have the root fs on RAID, for
instance, to avoid a non-bootable state). So, if the autoreplacement
code also checks for missing drives, this will work without user
intervention. To allow the user to decide whether he wants
autoreplacement, we can add a mount option like '(no)hotspare' (I have
done this already for our project and will send a patch after rebasing
onto your new series). Yes, there are side effects if you want to
experiment with missing drives in the FS, but you can disable
autoreplacement in that case.

If you know about any pitfalls in such scenarios, please point me to
them, I am newbie in FS-related kernel things.


If a disk is particularly slow to start up for some reason (maybe it's going
bad, maybe it's just got a slow interconnect (think SD cards), maybe it's
just really cold so the bearings are seizing up), then this would potentially
force it out of the array when it shouldn't be.

That said, having things set to always allow degraded mounts is _extremely
dangerous_.  If the user does not know anything failed, they also can't know
they need to get anything fixed.  While notification could be used, it also
introduces a period of time where the user is at risk of data loss without
them having explicitly agreed to this risk (by manually telling it to mount
degraded).


I agree, certainly replace should not be automatic by default. And I'm
unconvinced this belongs in kernel code anyway because it's a matter
of policy. Policy stuff goes in user space, where capability to
achieve the policy goes in the kernel.

A reasonable exception is bad device ejection (e.g. mdadm faulty).

Considering spinning devices take a long time to rebuild already and
this probably won't change, a policy I'd like to see upon a drive
going bad (totally vanishing, or producing many read or write errors):
1. Bad device is ejected, volume is degraded.
2. Consider chunks with one remaining stripe (one copy) as degraded.
3. Degraded chunks are read only, so COW changes to non-degraded chunks.
4. Degraded metadata chunks are replicated elsewhere, happens right away.
5. Implied by 4, degraded data chunks aren't immediately replicated
but any changes are, via COW.
6. Option, by policy, to immediately start replicating degraded data
chunks - either with existing storage or hot spare, which is also a
policy choice.

In particular, I'd like to see the single stripe metadata chunks
replicated soon so in case there's another device failure the entire
volume doesn't implode. Yes there's some data loss, still better than
100% data loss.
I've actually considered multiple times writing a daemon in Python to do
this.  In general, I agree that it's mostly policy, and thus should be
in userspace.  At the very least though, we really should have something
in the kernel that we can watch from userspace (be it with select,
epoll, inotify, fanotify, or something else) to tell us when a state
change happens on the filesystem, as right now the only way I can see to
do this is to poll the mount options.
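
As a concrete (if crude) illustration of that polling, a minimal userspace
sketch could look like this (an assumption on my side: plain C written for
this mail, not an existing tool; a real daemon would debounce and actually
notify somebody):

#include <mntent.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	for (;;) {
		FILE *mnts = setmntent("/proc/self/mounts", "r");
		struct mntent *m;

		if (!mnts)
			return 1;
		while ((m = getmntent(mnts)) != NULL) {
			/* report btrfs mounts currently carrying "degraded" */
			if (strcmp(m->mnt_type, "btrfs") == 0 &&
			    hasmntopt(m, "degraded"))
				printf("%s on %s is mounted degraded\n",
				       m->mnt_fsname, m->mnt_dir);
		}
		endmntent(mnts);
		sleep(30);	/* poll interval */
	}
}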



I could possibly understand doing this for something that needs to be
guaranteed to come on line when powered on,  but **only** if it notifies
responsible parties that there was a problem **and** it is explicitly
documented, and even then I'd be wary of doing this unless there was
something in place to handle the possibility of false positives (yes, they
do happen), and to make certain that the failed hardware got replaced as
soon as possible.


Exactly. And I think it's safer to be more aggressive with (fairly)
immediate metadata replication to remaining devices, than it is with
data.

I'm considering this behavior for both single volume setups, as well
as multiple bricks in a cluster. And admittedly it's probably
cheaper/easier to just get n-way copies of metadata than the above
scheme I've written.

And even then, you would still have people with big arrays who would 
want the metadata re-striped immediately on a device failure.  I will, 
however, be extremely happy when n-way replication hits, as I then will 
not need to stack BTRFS raid1 on top of LVM RAID1 to get higher order 
replication levels.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs: page allocation failure

2016-03-30 Thread David Sterba
On Tue, Mar 29, 2016 at 07:04:10PM -1000, Jean-Denis Girard wrote:
> Hi list,
> 
> I just started to use send / receive for backups to another drive.
> That's a great feature, but unfortunately I'm getting page allocation
> failure, see below.
> 
> My backup script does something like this for 11 sub-volumes:
>   btrfs subvolume snapshot -r vol /snaps
>   btrfs fi sync /snaps
>   btrfs send -p /snaps/vol_old /snaps/vol | btrfs receive -v /mnt/backup
>   btrfs fi sync /mnt/backup
>   btrfs subvolume delete -c /snaps/vol_old
>   mv /snaps/vol /snaps/vol_old
>   btrfs subvolume delete -c /backup/vol_old
>   btrfs subvolume snapshot -r :backup/vol \
>   /backup/vol_$(date +'%Y%m%d')
>   btrfs fi sync /backup
>   mv /backup/vol /backup/vol_old
> 
> This is on an up-to-date Fedora 23 system, with kernel
> 4.4.6-300.fc23.x86_64, and btrfs-progs v4.4.1 (recompiled on the
> system). The system is mostly idle when the error happens. The backup
> file system seems clean: btrfs check or scrub report no errors.
> 
> [ 3734.651439] btrfs: page allocation failure: order:4, mode:0x2404040

Order 4 is 64k, and most probably it's the allocation of a nodesize, the
IP offset in the function is close to beginning, there are two other
allocations that are served from the slab.

So do you have a filesystem with a 64k nodesize? Just checking.

The memory is fragmented so a contiguous 64k cannot be found, what we
can do is a fallback to vmalloc, that can assemble the 64k memory from
smaller pages. I'll send a patch.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


good documentation on btrfs internals and on disk layout

2016-03-30 Thread sri
Hi,

I could find very limited documentation related to the on-disk layout of
btrfs and how all the trees are related to each other. Except for the wiki,
which has very specific top-level details, I wasn't able to find more
details on btrfs.

For FSs such as zfs, ext3/4 and XFS there are documents which explain the
on-disk layout of the file systems.

Could anybody please provide pointers for the same, for a better
understanding of the btrfs on-disk layout and how each tree interacts when
multiple disks are configured for btrfs.

Thank you in advance

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: "bad metadata" not fixed by btrfs repair

2016-03-30 Thread Henk Slager
On Wed, Mar 30, 2016 at 9:00 AM, Qu Wenruo  wrote:
> First of all.
>
> The "crossing stripe boundary" error message itself is *HARMLESS* for recent
> kernels.
>
> It only means, that metadata extent won't be checked by scrub on recent
> kernels.
> Because scrub by its codes, has a limitation that, it can only check tree
> blocks which are inside a 64K block.
>
> Old kernel won't have anything wrong, until that tree block is being
> scrubbed.
> When scrubbed, old kernel just BUG_ON().
>
> Now recent kernel will handle such limitation by checking extent allocation
> and avoid crossing boundary, so new created fs with new kernel won't cause
> such error message at all.
>
> But for old created fs, the problem can't be avoided, but at least, new
> kernels will not BUG_ON() when you scrub these extents, they just get
> ignored (not that good, but at least no BUG_ON).
>
> And new fsck will check such case, gives such warning.
>
> Overall, you're OK if you are using recent kernels.
>
> Marc Haber wrote on 2016/03/29 08:43 +0200:
>>
>> On Mon, Mar 28, 2016 at 03:35:32PM -0400, Austin S. Hemmelgarn wrote:
>>>
>>> Did you convert this filesystem from ext4 (or ext3)?
>>
>>
>> No.
>>
>>> You hadn't mentioned what version of btrfs-progs you're using, and that
>>> is
>>> somewhat important for recovery.  I'm not sure if current versions of
>>> btrfs
>>> check can fix this issue, but I know for a fact that older versions
>>> (prior
>>> to at least 4.1) can not fix it.
>>
>>
>> 4.1 for creation and btrfs check.
>
>
> I assume that you have run older kernel on it, like v4.1 or v4.2.
>
> In those old kernels, it lacks the check to avoid such extent allocation
> check.
>
>>
>>> As far as what the kernel is involved with, the easy way to check is if
>>> it's
>>> operating on a mounted filesystem or not.  If it only operates on mounted
>>> filesystems, it almost certainly goes through the kernel, if it only
>>> operates on unmounted filesystems, it's almost certainly done in
>>> userspace
>>> (except dev scan and technically fi show).
>>
>>
>> Then btrfs check is a userspace-only matter, as it wants the fs
>> unmounted, and it is irrelevant that I did btrfs check from a rescue
>> system with an older kernel, 3.16 if I recall correctly.
>
>
> Not recommended to use older kernel to RW mount or use older fsck to do
> repair.
> As it's possible that older kernel/btrfsck may allocate extent that cross
> the 64K boundary.
>
>>
>>> 2. Regarding general support:  If you're using an enterprise distribution
>>> (RHEL, SLES, CentOS, OEL, or something similar), you are almost certainly
>>> going to get better support from your vendor than from the mailing list
>>> or
>>> IRC.
>>
>>
>> My "productive" desktops (fan is one of them) run Debian unstable with
>> a current vanilla kernel. At the moment, I can't use 4.5 because it
>> acts up with KVM.  When I need a rescue system, I use grml, which
>> unfortunately hasn't released since November 2014 and is still with
>> kernel 3.16
>
>
> To fix your problem(make these error message just disappear, even they are
> harmless on recent kernels), the most easy one, is to balance your metadata.

I did a balance with filter -musage=100 (kernel/tools 4.5/4.5) on the
filesystem mentioned here:
http://www.spinics.net/lists/linux-btrfs/msg51405.html

but I still get "bad metadata [ ) crossing stripe boundary" messages,
double the amount compared to 2 months ago.

The kernel operating this fs has always been at most 1 month behind the
'Latest Stable Kernel' (kernel.org terminology).

> As I explained, the bug only lies in metadata, and balance will allocate new
> tree blocks, then copy old data into new locations.
>
> In the allocation process of recent kernel, it will avoid such cross
> boundary, and to fix your problem.
>
> But if you are using old kernels, don't scrub your metadata.
>
> Thanks,
> Qu
>>
>>
>> Greetings
>> Marc
>>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: fallback to vmalloc in btrfs_compare_tree

2016-03-30 Thread David Sterba
The allocation of node could fail if the memory is too fragmented for a
given node size, practically observed with 64k.

http://article.gmane.org/gmane.comp.file-systems.btrfs/54689

Reported-by: Jean-Denis Girard 
Signed-off-by: David Sterba 
---
 fs/btrfs/ctree.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 77592931ab4f..ec7928a27aaa 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
@@ -5361,10 +5362,13 @@ int btrfs_compare_trees(struct btrfs_root *left_root,
goto out;
}
 
-   tmp_buf = kmalloc(left_root->nodesize, GFP_KERNEL);
+   tmp_buf = kmalloc(left_root->nodesize, GFP_KERNEL | __GFP_NOWARN);
if (!tmp_buf) {
-   ret = -ENOMEM;
-   goto out;
+   tmp_buf = vmalloc(left_root->nodesize);
+   if (!tmp_buf) {
+   ret = -ENOMEM;
+   goto out;
+   }
}
 
left_path->search_commit_root = 1;
@@ -5565,7 +5569,7 @@ int btrfs_compare_trees(struct btrfs_root *left_root,
 out:
btrfs_free_path(left_path);
btrfs_free_path(right_path);
-   kfree(tmp_buf);
+   kvfree(tmp_buf);
return ret;
 }
 
-- 
2.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: attempt to mount after crash during rebalance hard crashes server

2016-03-30 Thread Warren, Daniel
Sorry, I had about 3.5MB of xterm buffer, including my test to see if
I would get a panic with the old kernel I had left in grub - I grabbed
the wrong panic.

Running 4.4.6 (which deb packages as 4.4.0 for some reason - I was
confused), I am able to capture this on a mount attempt before my ssh
connection fails:

Mar 30 09:51:38 ds4-ls0 kernel: [67178.590745] BTRFS info (device
dm-45): disk space caching is enabled
Mar 30 09:51:38 ds4-ls0 systemd[1]: systemd-udevd.service: Got
notification message from PID 338 (WATCHDOG=1)
Mar 30 09:51:38 ds4-ls0 systemd-udevd[338]: seq 3514 queued, 'add' 'bdi'
Mar 30 09:51:38 ds4-ls0 systemd-udevd[338]: Validate module index
Mar 30 09:51:38 ds4-ls0 systemd-udevd[338]: Check if link
configuration needs reloading.
Mar 30 09:51:38 ds4-ls0 systemd-udevd[338]: seq 3514 forked new worker [7411]
Mar 30 09:51:38 ds4-ls0 systemd-udevd[7411]: seq 3514 running
Mar 30 09:51:38 ds4-ls0 systemd-udevd[7411]: passed device to netlink
monitor 0x55c10d5c79b0
Mar 30 09:51:38 ds4-ls0 systemd-udevd[7411]: seq 3514 processed
Mar 30 09:51:38 ds4-ls0 systemd-udevd[338]: cleanup idle workers
Mar 30 09:51:38 ds4-ls0 systemd-udevd[7411]: Unload module index
Mar 30 09:51:38 ds4-ls0 systemd-udevd[7411]: Unloaded link
configuration context.
Mar 30 09:51:38 ds4-ls0 systemd-udevd[338]: worker [7411] exited
Mar 30 09:51:38 ds4-ls0 kernel: [67178.841517] BTRFS info (device
dm-45): bdev /dev/dm-31 errs: wr 13870290, rd 9, flush 2798850,
corrupt 0, gen 0
Mar 30 09:52:09 ds4-ls0 kernel: [67207.430391] BUG: unable to handle
kernel NULL pointer dereference at 01f0
Mar 30 09:52:09 ds4-ls0 kernel: [67207.477511] IP:
[] can_overcommit+0x1e/0xf0 [btrfs]
Mar 30 09:52:09 ds4-ls0 kernel: [67207.516215] PGD 0


I ran check last night - the output is about 23MB - don't know if that
is useful, or where to look.

I only posted at the recommendation of someone in IRC, in hopes to be
helpful, as a kernel panic seems an extreme result of a corrupted FS.

This machine is an off site copy of a file archive, I need to either
fix or recreate it to maintain redundancy, but the up-time
requirements are basically 0.

The old kernel is the result of this machine being built when it was
and then basically left as a black box.

If poking at this is not of use to anybody I'll just run check
--repair and see what I get.

Daniel Warren
Unix System Admin, Compliance Infrastructure Architect, ITServices
MCMC LLC


On Tue, Mar 29, 2016 at 6:55 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> Warren, Daniel posted on Tue, 29 Mar 2016 16:21:28 -0400 as excerpted:
>
>> I'm running 4.4.0 from deb sid
>
> Correction.
>
> According to the kernel panic you posted at...
>
> http://pastebin.com/aBF6XmzA
>
> ... you're running kernel 3.16.something.
>
> You might be running btrfs-progs userspace 4.4.0, but on mounted
> filesystems it's the kernel code that counts, not the userspace code.
>
> Btrfs is still stabilizing, and kernel 3.16 is ancient history.  On this
> list we're forward focused and track mainline.  If your distro supports
> btrfs on that old a kernel, that's their business, but we don't track
> what patches they may or may not have backported and thus can't really
> support it here very well, so in that case, you really should be looking
> to your distro for that support, as they know what they've backported and
> what they haven't, and are thus in a far better position to provide that
> support.
>
> On this list, meanwhile, we recommend one of two kernel tracks, both
> mainline, current or LTS.  On current we recommend and provide the best
> support for the latest two kernel series.  With 4.5 out that's 4.5 and
> 4.4.
>
> On the LTS track, the former position was similar, the latest two LTS
> kernel series, with 4.4 being the latest and 4.1 the previous one.
> However, as btrfs has matured, now the second LTS series back, 3.18,
> wasn't bad, and while we still really recommend the last couple LTS
> series, we do recognize that some people will still be on 3.18 and we
> still do our best to support them as well.
>
> But before 3.18, and on non-mainline-LTS kernels more than two back, so
> currently 4.4, while we'll still do the best we can, unless it's a known
> issue recognizable on sight, very often that best is simply to ask that
> people upgrade to something reasonably current and report back with their
> results then, if the problem remains.
>
> As for btrfs-progs userspace, during normal operations, most of the time
> the userspace code simply calls the appropriate kernel functionality to
> do the real work, so userspace version isn't as important.  Mkfs.btrfs is
> an exception, and of course once the filesystem is having issues and
> you're using btrfs check or btrfs restore, along with other tools, to try
> to diagnose and fix the problem or at least to recover files off the
> unmountable filesystem, /then/ it's userspace code doing the work, and
> the userspace version becomes far more important.  And userspace is
> writt

Re: [PATCH] btrfs-progs: mkfs: fix an error when using DUP on multidev fs

2016-03-30 Thread David Sterba
On Fri, Mar 25, 2016 at 10:55:34AM +0900, Satoru Takeuchi wrote:
> To accept DUP on multidev fs, in addition to the following
> commit, we need to mark DUP as an allowed data/metadata
> profile.
> 
> commit 42f1279bf8e9 ("btrfs-progs: mkfs: allow DUP on multidev fs, only warn")
> 
> * actual result
> 
>   =
>   # ./mkfs.btrfs -f -m DUP -d DUP /dev/sdb1 /dev/sdb2
>   btrfs-progs v4.5-24-ga35b7e6
>   See http://btrfs.wiki.kernel.org for more information.
> 
>   WARNING: DUP is not recommended on filesystem with multiple devices
>   ERROR: unable to create FS with metadata profile DUP (have 2 devices but 1 devices are required)
>   =
> 
> * expected result
> 
>   =
>   # ./mkfs.btrfs -f -m dup -d dup /dev/sdb1 /dev/sdb2
>   WARNING: DUP is not recommended on filesystem with multiple devices
>   btrfs-progs v4.5-25-g1a10a3c
>   See http://btrfs.wiki.kernel.org for more information.
> 
>   Label:  (null)
>   UUID:   010d72ff-c87c-4516-8916-5e635719d110
>   Node size:  16384
>   Sector size:4096
>   Filesystem size:28.87GiB
>   Block group profiles:
> Data: DUP   1.01GiB
> Metadata: DUP   1.01GiB
> System:   DUP  12.00MiB
>   SSD detected:   no
>   Incompat features:  extref, skinny-metadata
>   Number of devices:  2
>   Devices:
>  IDSIZE  PATH
>   1   953.00MiB  /dev/sdb1
>   227.94GiB  /dev/sdb2
> 
>   ==
> 
> Signed-off-by: Satoru Takeuchi 

Applied, mkfs tests updated. Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Infinite loop in in btrfs_find_space_cluster() with btrfs_free_cluster::lock held

2016-03-30 Thread Ilya Dryomov
Hi,

We are hitting the attached lockup on a somewhat regular basis during
nightly tests.  Looks like a bunch of CPUs spin in find_free_extent()
on btrfs_free_cluster::lock, which is held by writer, who seems to be
stuck in an endless loop in btrfs_find_space_cluster(), trying to
cleanup bitmaps list.  Smells like a list corruption to me?

The kernel is ancient in btrfs terms - ubuntu's 3.13.0-83-generic, but
the surroundings look sufficiently similar to upstream and given recent
patches like 1b9b922a3a60 ("Btrfs: check for empty bitmap list in
setup_cluster_bitmaps") I thought this might be relevant for upstream.

Thanks,

Ilya
[74750.641965] CPU: 6 PID: 12768 Comm: btrfs-transacti Not tainted 
3.13.0-83-generic #127-Ubuntu
[74750.650976] Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 1.0c 
09/07/2015
[74750.658951] task: 88074b0e8000 ti: 88083c564000 task.ti: 
88083c564000
[74750.666936] RIP: 0010:[]  [] 
_raw_spin_lock+0x37/0x50
[74750.675645] RSP: 0018:88083c565b50  EFLAGS: 0206
[74750.681469] RAX: 18ca RBX: 88083c565b48 RCX: 09a0
[74750.689126] RDX: 09a6 RSI: 09a6 RDI: 88084f6671b0
[74750.696784] RBP: 88083c565b50 R08: 88083c565cef R09: 0001
[74750.704466] R10: a019c9e6 R11: ea001ff7f6c0 R12: 8807e829d4e0
[74750.712140] R13: a02073ab R14: 88083c565ae0 R15: 8807e829d4e0
[74750.719811] FS:  () GS:88087fd8() 
knlGS:
[74750.728450] CS:  0010 DS:  ES:  CR0: 80050033
[74750.734744] CR2: 7f4d2b6a9060 CR3: 01c0e000 CR4: 001407e0
[74750.742437] Stack:
[74750.745001]  88083c565c28 a01b07a3 88074873a000 
880530131d20
[74750.753033]  880791982000 88083c565cef 1000 
00040002
[74750.761065]  a01a3a97 0020 880523f4a800 
0001
[74750.769105] Call Trace:
[74750.772137]  [] find_free_extent+0x213/0xc30 [btrfs]
[74750.779259]  [] ? btrfs_del_items+0x367/0x470 [btrfs]
[74750.786476]  [] btrfs_reserve_extent+0xa8/0x1a0 [btrfs]
[74750.793869]  [] __btrfs_prealloc_file_range+0xe5/0x380 
[btrfs]
[74750.801876]  [] btrfs_prealloc_file_range_trans+0x30/0x40 
[btrfs]
[74750.810136]  [] btrfs_write_dirty_block_groups+0x4d3/0x620 
[btrfs]
[74750.818469]  [] commit_cowonly_roots+0x151/0x213 [btrfs]
[74750.825940]  [] btrfs_commit_transaction+0x483/0x970 
[btrfs]
[74750.833765]  [] transaction_kthread+0x1b5/0x240 [btrfs]
[74750.841156]  [] ? btrfs_cleanup_transaction+0x550/0x550 
[btrfs]
[74750.849244]  [] kthread+0xd2/0xf0
[74750.854729]  [] ? kthread_create_on_node+0x1c0/0x1c0
[74750.861861]  [] ret_from_fork+0x58/0x90
[74750.867868]  [] ? kthread_create_on_node+0x1c0/0x1c0

[74758.651947] CPU: 5 PID: 13299 Comm: kworker/u16:1 Not tainted 
3.13.0-83-generic #127-Ubuntu
[74758.662554] Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 1.0c 
09/07/2015
[74758.672309] Workqueue: writeback bdi_writeback_workfn (flush-btrfs-15)
[74758.681114] task: 880734153000 ti: 88078ad98000 task.ti: 
88078ad98000
[74758.690802] RIP: 0010:[]  [] 
_raw_spin_lock+0x32/0x50
[74758.701202] RSP: 0018:88078ad99728  EFLAGS: 0202
[74758.708662] RAX: 4bf5 RBX: 811f0610 RCX: 09a0
[74758.717910] RDX: 09a4 RSI: 09a4 RDI: 88084f6671b0
[74758.727114] RBP: 88078ad99728 R08: 88078ad998b7 R09: 0001
[74758.736311] R10:  R11: ea001b38bc00 R12: 88084e918770
[74758.745503] R13: 88084e9188c0 R14: 8114f8ee R15: 88078ad99698
[74758.754665] FS:  () GS:88087fd4() 
knlGS:
[74758.764805] CS:  0010 DS:  ES:  CR0: 80050033
[74758.772582] CR2: 7f4d2b695000 CR3: 01c0e000 CR4: 001407e0
[74758.781766] Stack:
[74758.785786]  88078ad99800 a01b07a3 8807e6bc5000 
880853034e70
[74758.795307]  88078ad997b8 88078ad998b7 880853034e40 
00040002
[74758.804838]  88078ad99778 0020 880523f4a800 
0001
[74758.814337] Call Trace:
[74758.818811]  [] find_free_extent+0x213/0xc30 [btrfs]
[74758.827389]  [] ? alloc_extent_state+0x21/0xc0 [btrfs]
[74758.836147]  [] ? __lookup_extent_mapping+0xa0/0x150 
[btrfs]
[74758.845413]  [] btrfs_reserve_extent+0xa8/0x1a0 [btrfs]
[74758.854224]  [] cow_file_range+0x135/0x430 [btrfs]
[74758.862593]  [] run_delalloc_range+0x312/0x350 [btrfs]
[74758.871310]  [] ? 
find_lock_delalloc_range.constprop.43+0x1b9/0x1f0 [btrfs]
[74758.881857]  [] ? ata_scsi_queuecmd+0x133/0x400
[74758.889974]  [] __extent_writepage+0x2f4/0x760 [btrfs]
[74758.898703]  [] ? btrfs_add_delayed_iput+0x61/0xc0 [btrfs]
[74758.907763]  [] ? __blk_run_queue+0x33/0x40
[74758.915511]  [] ? find_get_pages_tag+0xd1/0x180
[74758.923600]  [] ? kmem_cache_alloc_trace+0x3c/0x1f0
[74758.932051]  [] 
ext

Re: btrfs: page allocation failure

2016-03-30 Thread Jean-Denis Girard
Hi David,


Le 30/03/2016 03:50, David Sterba a écrit :
> On Tue, Mar 29, 2016 at 07:04:10PM -1000, Jean-Denis Girard wrote:
>>
>> [ 3734.651439] btrfs: page allocation failure: order:4, mode:0x2404040
> 
> Order 4 is 64k, and most probably it's the allocation of a nodesize, the
> IP offset in the function is close to beginning, there are two other
> allocations that are served from the slab.
> 
> So do you have a filesystem with a 64k nodesize? Just checking.

Yes, I do. Both the main filesystem and the backup filesystem were
created with btrfs -4.4 using: mkfs.btrfs --nodesize 64K ...

> The memory is fragmented so a contiguous 64k cannot be found, what we
> can do is a fallback to vmalloc, that can assemble the 64k memory from
> smaller pages. I'll send a patch.

Great!


Thanks,
-- 
Jean-Denis Girard

SysNux   Systèmes Linux en Polynésie française
http://www.sysnux.pf/ Tél: +689 40.50.10.40 / GSM: +689 87.79.75.27





Re: [PATCH] btrfs: fallback to vmalloc in btrfs_compare_tree

2016-03-30 Thread Liu Bo
On Wed, Mar 30, 2016 at 04:05:43PM +0200, David Sterba wrote:
> The allocation of node could fail if the memory is too fragmented for a
> given node size, practically observed with 64k.

It's not a critical path.  Why not use vmalloc directly?

Thanks,

-liubo

> 
> http://article.gmane.org/gmane.comp.file-systems.btrfs/54689
> 
> Reported-by: Jean-Denis Girard 
> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/ctree.c | 12 
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 77592931ab4f..ec7928a27aaa 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -19,6 +19,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "ctree.h"
>  #include "disk-io.h"
>  #include "transaction.h"
> @@ -5361,10 +5362,13 @@ int btrfs_compare_trees(struct btrfs_root *left_root,
>   goto out;
>   }
>  
> - tmp_buf = kmalloc(left_root->nodesize, GFP_KERNEL);
> + tmp_buf = kmalloc(left_root->nodesize, GFP_KERNEL | __GFP_NOWARN);
>   if (!tmp_buf) {
> - ret = -ENOMEM;
> - goto out;
> + tmp_buf = vmalloc(left_root->nodesize);
> + if (!tmp_buf) {
> + ret = -ENOMEM;
> + goto out;
> + }
>   }
>  
>   left_path->search_commit_root = 1;
> @@ -5565,7 +5569,7 @@ int btrfs_compare_trees(struct btrfs_root *left_root,
>  out:
>   btrfs_free_path(left_path);
>   btrfs_free_path(right_path);
> - kfree(tmp_buf);
> + kvfree(tmp_buf);
>   return ret;
>  }
>  
> -- 
> 2.7.1
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Infinite loop in in btrfs_find_space_cluster() with btrfs_free_cluster::lock held

2016-03-30 Thread Liu Bo
On Wed, Mar 30, 2016 at 05:24:04PM +0200, Ilya Dryomov wrote:
> Hi,
> 
> We are hitting the attached lockup on a somewhat regular basis during
> nightly tests.  Looks like a bunch of CPUs spin in find_free_extent()
> on btrfs_free_cluster::lock, which is held by writer, who seems to be
> stuck in an endless loop in btrfs_find_space_cluster(), trying to
> cleanup bitmaps list.  Smells like a list corruption to me?

My objdump shows that find_free_extent() may wait on 
down_read(&space_info->groups_sem);

One possible thing is that there are too many entries in the bitmap list,
and list_for_each_entry is just stuck there.

Are these stacks from "Blocked more than 120s"?

Thanks,

-liubo

> 
> The kernel is ancient in btrfs terms - ubuntu's 3.13.0-83-generic, but
> the surroundings look sufficiently similar to upstream and given recent
> patches like 1b9b922a3a60 ("Btrfs: check for empty bitmap list in
> setup_cluster_bitmaps") I thought this might be relevant for upstream.
> 
> Thanks,
> 
> Ilya

> [74750.641965] CPU: 6 PID: 12768 Comm: btrfs-transacti Not tainted 
> 3.13.0-83-generic #127-Ubuntu
> [74750.650976] Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 1.0c 
> 09/07/2015
> [74750.658951] task: 88074b0e8000 ti: 88083c564000 task.ti: 
> 88083c564000
> [74750.666936] RIP: 0010:[]  [] 
> _raw_spin_lock+0x37/0x50
> [74750.675645] RSP: 0018:88083c565b50  EFLAGS: 0206
> [74750.681469] RAX: 18ca RBX: 88083c565b48 RCX: 
> 09a0
> [74750.689126] RDX: 09a6 RSI: 09a6 RDI: 
> 88084f6671b0
> [74750.696784] RBP: 88083c565b50 R08: 88083c565cef R09: 
> 0001
> [74750.704466] R10: a019c9e6 R11: ea001ff7f6c0 R12: 
> 8807e829d4e0
> [74750.712140] R13: a02073ab R14: 88083c565ae0 R15: 
> 8807e829d4e0
> [74750.719811] FS:  () GS:88087fd8() 
> knlGS:
> [74750.728450] CS:  0010 DS:  ES:  CR0: 80050033
> [74750.734744] CR2: 7f4d2b6a9060 CR3: 01c0e000 CR4: 
> 001407e0
> [74750.742437] Stack:
> [74750.745001]  88083c565c28 a01b07a3 88074873a000 
> 880530131d20
> [74750.753033]  880791982000 88083c565cef 1000 
> 00040002
> [74750.761065]  a01a3a97 0020 880523f4a800 
> 0001
> [74750.769105] Call Trace:
> [74750.772137]  [] find_free_extent+0x213/0xc30 [btrfs]
> [74750.779259]  [] ? btrfs_del_items+0x367/0x470 [btrfs]
> [74750.786476]  [] btrfs_reserve_extent+0xa8/0x1a0 [btrfs]
> [74750.793869]  [] __btrfs_prealloc_file_range+0xe5/0x380 
> [btrfs]
> [74750.801876]  [] 
> btrfs_prealloc_file_range_trans+0x30/0x40 [btrfs]
> [74750.810136]  [] 
> btrfs_write_dirty_block_groups+0x4d3/0x620 [btrfs]
> [74750.818469]  [] commit_cowonly_roots+0x151/0x213 [btrfs]
> [74750.825940]  [] btrfs_commit_transaction+0x483/0x970 
> [btrfs]
> [74750.833765]  [] transaction_kthread+0x1b5/0x240 [btrfs]
> [74750.841156]  [] ? btrfs_cleanup_transaction+0x550/0x550 
> [btrfs]
> [74750.849244]  [] kthread+0xd2/0xf0
> [74750.854729]  [] ? kthread_create_on_node+0x1c0/0x1c0
> [74750.861861]  [] ret_from_fork+0x58/0x90
> [74750.867868]  [] ? kthread_create_on_node+0x1c0/0x1c0
> 
> [74758.651947] CPU: 5 PID: 13299 Comm: kworker/u16:1 Not tainted 
> 3.13.0-83-generic #127-Ubuntu
> [74758.662554] Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 1.0c 
> 09/07/2015
> [74758.672309] Workqueue: writeback bdi_writeback_workfn (flush-btrfs-15)
> [74758.681114] task: 880734153000 ti: 88078ad98000 task.ti: 
> 88078ad98000
> [74758.690802] RIP: 0010:[]  [] 
> _raw_spin_lock+0x32/0x50
> [74758.701202] RSP: 0018:88078ad99728  EFLAGS: 0202
> [74758.708662] RAX: 4bf5 RBX: 811f0610 RCX: 
> 09a0
> [74758.717910] RDX: 09a4 RSI: 09a4 RDI: 
> 88084f6671b0
> [74758.727114] RBP: 88078ad99728 R08: 88078ad998b7 R09: 
> 0001
> [74758.736311] R10:  R11: ea001b38bc00 R12: 
> 88084e918770
> [74758.745503] R13: 88084e9188c0 R14: 8114f8ee R15: 
> 88078ad99698
> [74758.754665] FS:  () GS:88087fd4() 
> knlGS:
> [74758.764805] CS:  0010 DS:  ES:  CR0: 80050033
> [74758.772582] CR2: 7f4d2b695000 CR3: 01c0e000 CR4: 
> 001407e0
> [74758.781766] Stack:
> [74758.785786]  88078ad99800 a01b07a3 8807e6bc5000 
> 880853034e70
> [74758.795307]  88078ad997b8 88078ad998b7 880853034e40 
> 00040002
> [74758.804838]  88078ad99778 0020 880523f4a800 
> 0001
> [74758.814337] Call Trace:
> [74758.818811]  [] find_free_extent+0x213/0xc30 [btrfs]
> [74758.827389]  [] ? alloc_extent_state+0x21/0xc0 [btrfs]
> [74758.836147]  [] ? __lookup_extent_mapping+0xa0/0x150 
> [btrfs]
> [74758.845413]  [] btrfs_reserve_exten

Re: good documentation on btrfs internals and on disk layout

2016-03-30 Thread Liu Bo
On Wed, Mar 30, 2016 at 01:58:03PM +, sri wrote:
> Hi,
> 
> I could find very limited documentation related to the on-disk layout of
> btrfs and how all the trees are related to each other. Except for the wiki,
> which has very specific top-level details, I wasn't able to find more
> details on btrfs.
> 
> For FSs such as zfs, ext3/4 and XFS there are documents which explain the
> on-disk layout of the file systems.
> 
> Could anybody please provide pointers for the same, for a better
> understanding of the btrfs on-disk layout and how each tree interacts when
> multiple disks are configured for btrfs.

There is a paper[1] about btrfs filesystem which covers all the details.

[1]: BTRFS: The Linux B-Tree Filesystem

Thanks,

-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: fallback to vmalloc in btrfs_compare_tree

2016-03-30 Thread David Sterba
On Wed, Mar 30, 2016 at 10:10:45AM -0700, Liu Bo wrote:
> On Wed, Mar 30, 2016 at 04:05:43PM +0200, David Sterba wrote:
> > The allocation of node could fail if the memory is too fragmented for a
> > given node size, practically observed with 64k.
> 
> It's not a critical path.  Why not use vmalloc directly?

We should try to avoid vmalloc if possible.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Infinite loop in in btrfs_find_space_cluster() with btrfs_free_cluster::lock held

2016-03-30 Thread Ilya Dryomov
On Wed, Mar 30, 2016 at 7:25 PM, Liu Bo  wrote:
> On Wed, Mar 30, 2016 at 05:24:04PM +0200, Ilya Dryomov wrote:
>> Hi,
>>
>> We are hitting the attached lockup on a somewhat regular basis during
>> nightly tests.  Looks like a bunch of CPUs spin in find_free_extent()
>> on btrfs_free_cluster::lock, which is held by writer, who seems to be
>> stuck in an endless loop in btrfs_find_space_cluster(), trying to
>> cleanup bitmaps list.  Smells like a list corruption to me?
>
> My objdump shows that find_free_extent() may wait on 
> down_read(&space_info->groups_sem);

It's spinning on a spinlock:

6228 if (last_ptr) {
6229 spin_lock(&last_ptr->lock);
6230 if (last_ptr->block_group)

>
> One possible thing is that there're too many entries in bitmap, and
> list_for_each_entry just is just stuck there.

I don't think so - look at the two journal_write splats.

>
> Are these stacks from "Blocked more than 120s"?

No, these are all soft lockups.  Sorry, I had to edit it to make it
readable and concentrated on stack traces.  I guess "BUG: soft lockup"
were mixed up with modules and object code all over the place.

Thanks,

Ilya
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: fallback to vmalloc in btrfs_compare_tree

2016-03-30 Thread Jean-Denis Girard
Hi David,

Le 30/03/2016 04:05, David Sterba a écrit :
> The allocation of node could fail if the memory is too fragmented for a
> given node size, practically observed with 64k.
> 
> http://article.gmane.org/gmane.comp.file-systems.btrfs/54689
> 
> Reported-by: Jean-Denis Girard 
> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/ctree.c | 12 
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 77592931ab4f..ec7928a27aaa 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -19,6 +19,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "ctree.h"
>  #include "disk-io.h"
>  #include "transaction.h"
> @@ -5361,10 +5362,13 @@ int btrfs_compare_trees(struct btrfs_root *left_root,
>   goto out;
>   }
>  
> - tmp_buf = kmalloc(left_root->nodesize, GFP_KERNEL);
> + tmp_buf = kmalloc(left_root->nodesize, GFP_KERNEL | __GFP_NOWARN);
>   if (!tmp_buf) {
> - ret = -ENOMEM;
> - goto out;
> + tmp_buf = vmalloc(left_root->nodesize);
> + if (!tmp_buf) {
> + ret = -ENOMEM;
> + goto out;
> + }
>   }
>  
>   left_path->search_commit_root = 1;
> @@ -5565,7 +5569,7 @@ int btrfs_compare_trees(struct btrfs_root *left_root,
>  out:
>   btrfs_free_path(left_path);
>   btrfs_free_path(right_path);
> - kfree(tmp_buf);
> + kvfree(tmp_buf);
>   return ret;
>  }
>  
> 

I adapted / applied the patch for kernel-4.4.6, rebooted and now the
backup completes without error, thanks a lot!

Tested-by: Jean-Denis Girard 


Thanks,
-- 
Jean-Denis Girard

SysNux   Systèmes Linux en Polynésie française
http://www.sysnux.pf/ Tél: +689 40.50.10.40 / GSM: +689 87.79.75.27
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: good documentation on btrfs internals and on disk layout

2016-03-30 Thread Dave Stevens

Quoting Liu Bo :


On Wed, Mar 30, 2016 at 01:58:03PM +, sri wrote:

Hi,

I could find very limited documentation related to the on-disk layout of
btrfs and how all the trees are related to each other. Except for the wiki,
which has very specific top-level details, I wasn't able to find more
details on btrfs.

For FSs such as zfs, ext3/4 and XFS there are documents which explain the
on-disk layout of the file systems.

Could anybody please provide pointers for the same, for a better
understanding of the btrfs on-disk layout and how each tree interacts when
multiple disks are configured for btrfs.


There is a paper[1] about btrfs filesystem which covers all the details.

[1]: BTRFS: The Linux B-Tree Filesystem


and this is where it is:

http://domino.watson.ibm.com/library/CyberDig.nsf/papers/6E1C5B6A1B6EDD9885257A38006B6130/$File/rj10501.pdf

D



Thanks,

-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html





--
"As long as politics is the shadow cast on society by big business,
the attenuation of the shadow will not change the substance."

-- John Dewey





--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


fallocate mode flag for "unshare blocks"?

2016-03-30 Thread Darrick J. Wong
Hi all,

Christoph and I have been working on adding reflink and CoW support to
XFS recently.  Since the purpose of (mode 0) fallocate is to make sure
that future file writes cannot ENOSPC, I extended the XFS fallocate
handler to unshare any shared blocks via the copy on write mechanism I
built for it.  However, Christoph shared the following concerns with
me about that interpretation:

> I know that I suggested unsharing blocks on fallocate, but it turns out
> this is causing problems.  Applications expect falloc to be a fast
> metadata operation, and copying a potentially large number of blocks
> is against that expectation.  This is especially bad for the NFS
> server, which should not be blocked for a long time in a synchronous
> operation.
> 
> I think we'll have to remove the unshare and just fail the fallocate
> for a reflinked region for now.  I still think it makes sense to expose
> an unshare operation, and we probably should make that another
> fallocate mode.

With that in mind, how do you all think we ought to resolve this?
Should we add a new fallocate mode flag that means "unshare the shared
blocks"?  Obviously, this unshare flag cannot be used in conjunction
with hole punching, zero range, insert range, or collapse range.  This
breaks the expectation that writing to a file after fallocate won't
ENOSPC.
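
To make the proposal concrete, a userspace caller of such a mode flag might
look like the sketch below (assumptions on my side: the FALLOC_FL_UNSHARE
name and value are purely illustrative, nothing with that name is defined
by the kernels under discussion, and a kernel without support would simply
fail the call):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>

#ifndef FALLOC_FL_UNSHARE
#define FALLOC_FL_UNSHARE 0x40	/* hypothetical value, for illustration only */
#endif

/* Ask the filesystem to unshare (CoW-copy) any shared blocks in the range,
 * so that later overwrites of that range cannot fail with ENOSPC. */
int unshare_range(int fd, off_t off, off_t len)
{
	/* unlike mode 0, this may copy data and therefore block for a while */
	if (fallocate(fd, FALLOC_FL_UNSHARE, off, len) < 0) {
		perror("fallocate(FALLOC_FL_UNSHARE)");
		return -1;
	}
	return 0;
}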

Or is it ok that fallocate could block, potentially for a long time as
we stream cows through the page cache (or however unshare works
internally)?  Those same programs might not be expecting fallocate to
take a long time.

Can we do better than either solution?  It occurs to me that XFS does
unshare by reading the file data into the pagecache, marking the pages
dirty, and flushing the dirty pages; performance could be improved by
skipping the flush at the end.  We won't ENOSPC, because the XFS
delalloc system is careful enough to check that there are enough free
blocks to handle both the allocation and the metadata updates.  The
only gap in this scheme that I can see is if we fallocate, crash, and
upon restart the program then tries to write without retrying the
fallocate.  Can we trade some performance for the added requirement
that we must fallocate -> write -> fsync, and retry the trio if we
crash before the fsync returns?  I think that's already an implicit
requirement, so we might be ok here.

Opinions?  I rather like the last option, though I've only just
thought of it and have not had time to examine it thoroughly, and it's
specific to XFS. :)

--D
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: good documentation on btrfs internals and on disk layout

2016-03-30 Thread Hugo Mills
On Wed, Mar 30, 2016 at 01:58:03PM +, sri wrote:
> I could find very limited documentation related to the on-disk layout of
> btrfs and how all the trees are related to each other. Except for the wiki,
> which has very specific top-level details, I wasn't able to find more
> details on btrfs.
> 
> For FSs such as zfs, ext3/4 and XFS there are documents which explain the
> on-disk layout of the file systems.
> 
> Could anybody please provide pointers for the same, for a better
> understanding of the btrfs on-disk layout and how each tree interacts when
> multiple disks are configured for btrfs.

   What are you intending to do? You'll need different things
depending on whether you are, for example, using the BTRFS_TREE_SEARCH
ioctl online to gather high-level information, or working your way
through the datapaths from the superblock right down to individual
bytes of a file for offline access.

   If you're using BTRFS_TREE_SEARCH, for example, you won't need to
know anything about the superblocks or the way that trees are
implemented. In fact, it's a good idea if you can avoid getting into
those details at all.
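
   For anyone who does want to poke at the trees online, a minimal
BTRFS_IOC_TREE_SEARCH sketch might look like this (assumptions: written
for this mail rather than taken from btrfs-progs, needs root and
linux/btrfs.h, and it ignores result pagination, just counting whatever
fits into one buffer):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

int main(int argc, char **argv)
{
	struct btrfs_ioctl_search_args args;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <path-on-btrfs>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	memset(&args, 0, sizeof(args));
	args.key.tree_id = 3;			/* chunk tree */
	args.key.max_objectid = (__u64)-1;	/* min_* fields stay 0 */
	args.key.max_offset = (__u64)-1;
	args.key.max_transid = (__u64)-1;
	args.key.max_type = (__u32)-1;
	args.key.nr_items = 4096;		/* "as many as fit" */

	if (ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args) < 0) {
		perror("BTRFS_IOC_TREE_SEARCH");
		return 1;
	}
	/* the kernel rewrites nr_items to the number of items it returned */
	printf("got %u items from the chunk tree\n", args.key.nr_items);
	return 0;
}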

   The high-level view of how the data model fits together is at
[1]. Individual structures referenced in there are best examined in
ctree.h for the details, although there's a little more detailed
description at [2]. There's some documentation on the basic APIs used
for reading the btrees at [3]. If you really _have_ to access trees
yourself, the tree structure is at [4], but see my comment above about
that. The way that the FS-tree metadata is put together to make up
POSIX directory structures is at [5].

   After all that, you're down to looking at the data structures in
ctree.h, and grepping through the source code to see how they're used
(which is how [1] was written in the first place).

   Hugo.

[1] https://btrfs.wiki.kernel.org/index.php/Data_Structures
[2] https://btrfs.wiki.kernel.org/index.php/On-disk_Format
[3] https://btrfs.wiki.kernel.org/index.php/Code_documentation
[4] https://btrfs.wiki.kernel.org/index.php/Btrfs_design
[5] https://btrfs.wiki.kernel.org/index.php/Trees

-- 
Hugo Mills | "There's more than one way to do it" is not a
hugo@... carfax.org.uk | commandment. It is a dire warning.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: fallocate mode flag for "unshare blocks"?

2016-03-30 Thread Austin S. Hemmelgarn

On 2016-03-30 14:27, Darrick J. Wong wrote:

Hi all,

Christoph and I have been working on adding reflink and CoW support to
XFS recently.  Since the purpose of (mode 0) fallocate is to make sure
that future file writes cannot ENOSPC, I extended the XFS fallocate
handler to unshare any shared blocks via the copy on write mechanism I
built for it.  However, Christoph shared the following concerns with
me about that interpretation:


I know that I suggested unsharing blocks on fallocate, but it turns out
this is causing problems.  Applications expect falloc to be a fast
metadata operation, and copying a potentially large number of blocks
is against that expectation.  This is especially bad for the NFS
server, which should not be blocked for a long time in a synchronous
operation.

I think we'll have to remove the unshare and just fail the fallocate
for a reflinked region for now.  I still think it makes sense to expose
an unshare operation, and we probably should make that another
fallocate mode.


With that in mind, how do you all think we ought to resolve this?
Should we add a new fallocate mode flag that means "unshare the shared
blocks"?  Obviously, this unshare flag cannot be used in conjunction
with hole punching, zero range, insert range, or collapse range.  This
breaks the expectation that writing to a file after fallocate won't
ENOSPC.

Or is it ok that fallocate could block, potentially for a long time as
we stream cows through the page cache (or however unshare works
internally)?  Those same programs might not be expecting fallocate to
take a long time.
Nothing that I can find in the man-pages or API documentation for 
Linux's fallocate explicitly says that it will be fast.  There are bits 
that say it should be efficient, but that is not itself well defined 
(given context, I would assume it to mean that it doesn't use as much 
I/O as writing out that many bytes of zero data, not necessarily that it 
will return quickly).  We may have done a lot to make it fast, but that 
doesn't mean by any measure that we guarantee it anywhere (at least, we 
don't guarantee it anywhere I can find).


Can we do better than either solution?  It occurs to me that XFS does
unshare by reading the file data into the pagecache, marking the pages
dirty, and flushing the dirty pages; performance could be improved by
skipping the flush at the end.  We won't ENOSPC, because the XFS
delalloc system is careful enough to check that there are enough free
blocks to handle both the allocation and the metadata updates.  The
only gap in this scheme that I can see is if we fallocate, crash, and
upon restart the program then tries to write without retrying the
fallocate.  Can we trade some performance for the added requirement
that we must fallocate -> write -> fsync, and retry the trio if we
crash before the fsync returns?  I think that's already an implicit
requirement, so we might be ok here.
Most of the software I've seen that doesn't use fallocate like this is 
either doing odd things otherwise, or is just making sure it has space 
for temporary files, so I think it is probably safe to require this.
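
As a sketch of that trio (an assumption on my side: ordinary mode-0
preallocation in plain C, error handling trimmed; on a crash before fsync()
returns, the caller redoes all three steps):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

int reserve_write_sync(int fd, const void *buf, size_t len, off_t off)
{
	if (fallocate(fd, 0, off, len) < 0)	/* reserve the space up front */
		return -1;
	if (pwrite(fd, buf, len, off) != (ssize_t)len)
		return -1;
	return fsync(fd);	/* only after this returns is the range durable */
}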


Opinions?  I rather like the last option, though I've only just
thought of it and have not had time to examine it thoroughly, and it's
specific to XFS. :)
Personally I'm indifferent about how we handle it, as long as it still 
maintains the normal semantics, and it works for reflinked ranges 
(seemingly arbitrary failures for a range in a file should be handled 
properly by an application, but that doesn't mean we shouldn't try to 
reduce their occurrence).


I would like to comment that it would be nice to have an fallocate 
option to force a range to become unshared, but I personally feel we 
should have that alongside the regular functionality, not in-place of it.


It's probably also worth noting that reflinks technically break 
expectations WRT FALLOC_FL_PUNCH_HOLE already.  Most apps I see that use 
PUNCH_HOLE seem to expect it to free space, which won't happen if the 
range is reflinked elsewhere.  There is of course nothing that says that 
it will free space, but that doesn't change user expectations.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] btrfs: avoid overflowing f_bfree

2016-03-30 Thread Luis de Bethencourt
Since mixed block groups accounting isn't byte-accurate and f_bfree is an
unsigned integer, it could overflow. Avoid this.

Signed-off-by: Luis de Bethencourt 
Suggested-by: David Sterba 
---
 fs/btrfs/super.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index bdca79c..93376d0 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2101,6 +2101,11 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
/* Account global block reserve as used, it's in logical size already */
spin_lock(&block_rsv->lock);
buf->f_bfree -= block_rsv->size >> bits;
+   /* Mixed block groups accounting is not byte-accurate, avoid overflow */
+   if (buf->f_bfree >= block_rsv->size >> bits)
+   buf->f_bfree -= block_rsv->size >> bits;
+   else
+   buf->f_bfree = 0;
spin_unlock(&block_rsv->lock);
 
buf->f_bavail = div_u64(total_free_data, factor);
-- 
2.5.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] btrfs: fix mixed block count of available space

2016-03-30 Thread Luis de Bethencourt
Metadata for mixed block is already accounted in total data and should not
be counted as part of the free metadata space.

Signed-off-by: Luis de Bethencourt 
Link: https://bugzilla.kernel.org/show_bug.cgi?id=114281
---
 fs/btrfs/super.c | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 00b8f37..bdca79c 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2051,6 +2051,7 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv;
int ret;
u64 thresh = 0;
+   int mixed = 0;
 
/*
 * holding chunk_muext to avoid allocating new chunks, holding
@@ -2076,8 +2077,17 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
}
}
}
-   if (found->flags & BTRFS_BLOCK_GROUP_METADATA)
-   total_free_meta += found->disk_total - found->disk_used;
+
+   /*
+* Metadata in mixed block group profiles is accounted in data
+*/
+   if (!mixed && found->flags & BTRFS_BLOCK_GROUP_METADATA) {
+   if (found->flags & BTRFS_BLOCK_GROUP_DATA)
+   mixed = 1;
+   else
+   total_free_meta += found->disk_total -
+   found->disk_used;
+   }
 
total_used += found->disk_used;
}
@@ -2115,7 +2125,7 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 */
thresh = 4 * 1024 * 1024;
 
-   if (total_free_meta - thresh < block_rsv->size)
+   if (!mixed && total_free_meta - thresh < block_rsv->size)
buf->f_bavail = 0;
 
buf->f_type = BTRFS_SUPER_MAGIC;
-- 
2.5.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: avoid overflowing f_bfree

2016-03-30 Thread Filipe Manana
On Wed, Mar 30, 2016 at 9:53 PM, Luis de Bethencourt
 wrote:
> Since mixed block groups accounting isn't byte-accurate and f_bfree is an
> unsigned integer, it could overflow. Avoid this.
>
> Signed-off-by: Luis de Bethencourt 
> Suggested-by: David Sterba 
> ---
>  fs/btrfs/super.c | 5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index bdca79c..93376d0 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -2101,6 +2101,11 @@ static int btrfs_statfs(struct dentry *dentry, struct 
> kstatfs *buf)
> /* Account global block reserve as used, it's in logical size already 
> */
> spin_lock(&block_rsv->lock);
> buf->f_bfree -= block_rsv->size >> bits;

You forgot to remove the line above, didn't you?

> +   /* Mixed block groups accounting is not byte-accurate, avoid overflow 
> */
> +   if (buf->f_bfree >= block_rsv->size >> bits)
> +   buf->f_bfree -= block_rsv->size >> bits;
> +   else
> +   buf->f_bfree = 0;
> spin_unlock(&block_rsv->lock);
>
> buf->f_bavail = div_u64(total_free_data, factor);
> --
> 2.5.3
>



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: avoid overflowing f_bfree

2016-03-30 Thread Luis de Bethencourt
On 30/03/16 22:48, Filipe Manana wrote:
> On Wed, Mar 30, 2016 at 9:53 PM, Luis de Bethencourt
>  wrote:
>> Since mixed block groups accounting isn't byte-accurate and f_bfree is an
>> unsigned integer, it could overflow. Avoid this.
>>
>> Signed-off-by: Luis de Bethencourt 
>> Suggested-by: David Sterba 
>> ---
>>  fs/btrfs/super.c | 5 +
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>> index bdca79c..93376d0 100644
>> --- a/fs/btrfs/super.c
>> +++ b/fs/btrfs/super.c
>> @@ -2101,6 +2101,11 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
>> /* Account global block reserve as used, it's in logical size already */
>> spin_lock(&block_rsv->lock);
>> buf->f_bfree -= block_rsv->size >> bits;
> 
> You forgot to remove the line above, didn't you?
> 

Shoot! Indeed I did, sorry. Thanks for noticing.

Sending version 2.

Luis

>> +   /* Mixed block groups accounting is not byte-accurate, avoid overflow */
>> +   if (buf->f_bfree >= block_rsv->size >> bits)
>> +   buf->f_bfree -= block_rsv->size >> bits;
>> +   else
>> +   buf->f_bfree = 0;
>> spin_unlock(&block_rsv->lock);
>>
>> buf->f_bavail = div_u64(total_free_data, factor);
>> --
>> 2.5.3
>>
>> --

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Global hotspare functionality

2016-03-30 Thread Yauhen Kharuzhy
On Tue, Mar 29, 2016 at 10:40:40PM +0300, Yauhen Kharuzhy wrote:
> Hi.
> 
> I am testing hotspare v2 on kernel v4.4.5 (I will try the latest Chris' tree
> later), now with lockdep debugging enabled. At the start of a replacement, a
> lockdep warning is displayed, because kstrdup() is called with GFP_NOFS inside
> an rcu_read_lock()/rcu_read_unlock() block (GFP_NOFS allocations can sleep).

A similar problem exists in btrfs_auto_replace_start(): rcu_str_deref() is
called without rcu_read_lock() (a possible fix is sketched after this message):

int btrfs_auto_replace_start(struct btrfs_root *root,
struct btrfs_device *src_device)
{
int ret;
char *tgt_path;

if (btrfs_get_spare_device(&tgt_path)) {
btrfs_err(root->fs_info,
"No spare device found/configured in the kernel");
return -EINVAL;
}

ret = btrfs_dev_replace_start(root, tgt_path,
src_device->devid,
rcu_str_deref(src_device->name),
BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID);
if (ret)
btrfs_put_spare_device(tgt_path);

kfree(tgt_path);

return 0;
}

[  156.168133] ===
[  156.168963] [ INFO: suspicious RCU usage. ]
[  156.169822] 4.4.5-scst31x+ #20 Not tainted
[  156.170656] ---
[  156.171488] fs/btrfs/dev-replace.c:990 suspicious rcu_dereference_check() 
usage!
[  156.172920] 
[  156.172920] other info that might help us debug this:
[  156.172920] 
[  156.174825] 
[  156.174825] rcu_scheduler_active = 1, debug_locks = 0
[  156.176152] 1 lock held by btrfs-casualty/4807:
[  156.181917]  #0:  (&fs_info->casualty_mutex){+.+...}, at: 
[] casualty_kthread+0x64/0x390 [btrfs]
[  156.193511] 
[  156.193511] stack backtrace:
[  156.194680] CPU: 0 PID: 4807 Comm: btrfs-casualty Not tainted 4.4.5-scst31x+ 
#20
[  156.201650] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS 
VirtualBox 12/01/2006
[  156.219100]   88005d79fda0 813529e3 
88005e19c600
[  156.221216]  0001 88005d79fdd0 810d6407 

[  156.224287]   88005f4a0c00 88005da36000 
88005d79fe08
[  156.226375] Call Trace:
[  156.227078]  [] dump_stack+0x85/0xc2
[  156.228152]  [] lockdep_rcu_suspicious+0xd7/0x110
[  156.229418]  [] btrfs_auto_replace_start+0xa6/0xd0 [btrfs]
[  156.230714]  [] casualty_kthread+0x2c4/0x390 [btrfs]
[  156.231915]  [] ? casualty_kthread+0x19c/0x390 [btrfs]
[  156.233105]  [] ? btrfs_check_devices+0x200/0x200 [btrfs]
[  156.234339]  [] kthread+0xef/0x110
[  156.235309]  [] ? 
__raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[  156.236940]  [] ? kthread_create_on_node+0x200/0x200
[  156.239489]  [] ret_from_fork+0x3f/0x70
[  156.240533]  [] ? kthread_create_on_node+0x200/0x200


-- 
Yauhen Kharuzhy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
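
One possible shape of a fix for the rcu_str_deref() call quoted above is to
hold the RCU read lock only long enough to copy the source device name into a
private buffer, and pass that copy to btrfs_dev_replace_start(). The fragment
below is only a sketch against the quoted hotspare code, not a tested patch;
BTRFS_DEVICE_PATH_NAME_MAX, strscpy() and rcu_read_lock()/rcu_read_unlock()
are existing kernel interfaces, everything else follows the excerpt above.

  char src_name[BTRFS_DEVICE_PATH_NAME_MAX + 1];

  /* Copy the name while the RCU read-side critical section protects it. */
  rcu_read_lock();
  strscpy(src_name, rcu_str_deref(src_device->name), sizeof(src_name));
  rcu_read_unlock();

  /* No RCU-protected pointer is dereferenced outside the lock any more. */
  ret = btrfs_dev_replace_start(root, tgt_path, src_device->devid, src_name,
          BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID);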


[PATCH v2 2/2] btrfs: avoid overflowing f_bfree

2016-03-30 Thread Luis de Bethencourt
Since mixed block groups accounting isn't byte-accurate and f_bfree is an
unsigned integer, it could overflow. Avoid this.

Signed-off-by: Luis de Bethencourt 
Suggested-by: David Sterba 
---
Hi,

Thanks to Filipe Manana for spotting a mistake in the first version of this
patch.

Luis

 fs/btrfs/super.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index bdca79c..fe03efb 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2100,7 +2100,11 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 
/* Account global block reserve as used, it's in logical size already */
spin_lock(&block_rsv->lock);
-   buf->f_bfree -= block_rsv->size >> bits;
+   /* Mixed block groups accounting is not byte-accurate, avoid overflow */
+   if (buf->f_bfree >= block_rsv->size >> bits)
+   buf->f_bfree -= block_rsv->size >> bits;
+   else
+   buf->f_bfree = 0;
spin_unlock(&block_rsv->lock);
 
buf->f_bavail = div_u64(total_free_data, factor);
-- 
2.5.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
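
The guard added in this v2 matters because f_bfree is an unsigned count:
subtracting a global reserve that is larger than the remaining free blocks
would wrap around to a huge bogus value instead of going negative. A minimal
standalone illustration of the same clamp (just the arithmetic, not kernel
code):

  #include <stdio.h>
  #include <stdint.h>

  /* Subtract rsv_blocks from bfree, clamping to 0 instead of wrapping. */
  static uint64_t clamp_sub(uint64_t bfree, uint64_t rsv_blocks)
  {
          return bfree >= rsv_blocks ? bfree - rsv_blocks : 0;
  }

  int main(void)
  {
          uint64_t bfree = 100, rsv = 300;

          /* The naive subtraction wraps around to a huge value. */
          printf("naive:   %llu\n", (unsigned long long)(bfree - rsv));
          /* The clamped version reports 0 free blocks instead. */
          printf("clamped: %llu\n", (unsigned long long)clamp_sub(bfree, rsv));
          return 0;
  }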


Re: "bad metadata" not fixed by btrfs repair

2016-03-30 Thread Qu Wenruo



Henk Slager wrote on 2016/03/30 16:03 +0200:

On Wed, Mar 30, 2016 at 9:00 AM, Qu Wenruo  wrote:

First of all.

The "crossing stripe boundary" error message itself is *HARMLESS* for recent
kernels.

It only means, that metadata extent won't be checked by scrub on recent
kernels.
Because scrub by its codes, has a limitation that, it can only check tree
blocks which are inside a 64K block.

Old kernel won't have anything wrong, until that tree block is being
scrubbed.
When scrubbed, old kernel just BUG_ON().

Now recent kernel will handle such limitation by checking extent allocation
and avoid crossing boundary, so new created fs with new kernel won't cause
such error message at all.

But for old created fs, the problem can't be avoided, but at least, new
kernels will not BUG_ON() when you scrub these extents, they just get
ignored (not that good, but at least no BUG_ON).

And new fsck will check such case, gives such warning.

Overall, you're OK if you are using recent kernels.

Marc Haber wrote on 2016/03/29 08:43 +0200:


On Mon, Mar 28, 2016 at 03:35:32PM -0400, Austin S. Hemmelgarn wrote:


Did you convert this filesystem from ext4 (or ext3)?



No.


You hadn't mentioned what version of btrfs-progs you're using, and that
is
somewhat important for recovery.  I'm not sure if current versions of
btrfs
check can fix this issue, but I know for a fact that older versions
(prior
to at least 4.1) can not fix it.



4.1 for creation and btrfs check.



I assume that you have run older kernel on it, like v4.1 or v4.2.

In those old kernels, it lacks the check to avoid such extent allocation
check.




As far as what the kernel is involved with, the easy way to check is if
it's
operating on a mounted filesystem or not.  If it only operates on mounted
filesystems, it almost certainly goes through the kernel, if it only
operates on unmounted filesystems, it's almost certainly done in
userspace
(except dev scan and technically fi show).



Then btrfs check is a userspace-only matter, as it wants the fs
unmounted, and it is irrelevant that I did btrfs check from a rescue
system with an older kernel, 3.16 if I recall correctly.



Not recommended to use older kernel to RW mount or use older fsck to do
repair.
As it's possible that older kernel/btrfsck may allocate extent that cross
the 64K boundary.




2. Regarding general support:  If you're using an enterprise distribution
(RHEL, SLES, CentOS, OEL, or something similar), you are almost certainly
going to get better support from your vendor than from the mailing list
or
IRC.



My "productive" desktops (fan is one of them) run Debian unstable with
a current vanilla kernel. At the moment, I can't use 4.5 because it
acts up with KVM.  When I need a rescue system, I use grml, which
unfortunately hasn't released since November 2014 and is still with
kernel 3.16



To fix your problem(make these error message just disappear, even they are
harmless on recent kernels), the most easy one, is to balance your metadata.


I did a balance with filter -musage=100  (kernel/tools 4.5/4.5) of the
filesystem mentioned in here:
http://www.spinics.net/lists/linux-btrfs/msg51405.html

but I still get "bad metadata [ ) crossing stripe boundary" messages,
double the amount compared to 2 months ago


Would you please give an example of the output?
So I can check if it's really crossing the boundary.

Thanks,
Qu


Kernel operating this fs has always been maximum 1 month behind
'Latest Stable Kernel' (kernel.org terminology)


As I explained, the bug only lies in metadata, and balance will allocate new
tree blocks, then copy old data into new locations.

In the allocation process of recent kernel, it will avoid such cross
boundary, and to fix your problem.

But if you are using old kernels, don't scrub your metadata.

Thanks,
Qu



Greetings
Marc




--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html






--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: fallocate mode flag for "unshare blocks"?

2016-03-30 Thread Liu Bo
On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote:
> Hi all,
> 
> Christoph and I have been working on adding reflink and CoW support to
> XFS recently.  Since the purpose of (mode 0) fallocate is to make sure
> that future file writes cannot ENOSPC, I extended the XFS fallocate
> handler to unshare any shared blocks via the copy on write mechanism I
> built for it.  However, Christoph shared the following concerns with
> me about that interpretation:
> 
> > I know that I suggested unsharing blocks on fallocate, but it turns out
> > this is causing problems.  Applications expect falloc to be a fast
> > metadata operation, and copying a potentially large number of blocks
> > is against that expextation.  This is especially bad for the NFS
> > server, which should not be blocked for a long time in a synchronous
> > operation.
> > 
> > I think we'll have to remove the unshare and just fail the fallocate
> > for a reflinked region for now.  I still think it makes sense to expose
> > an unshare operation, and we probably should make that another
> > fallocate mode.

I'm expecting fallocate to be fast, too.

Well, btrfs fallocate doesn't allocate space for an extent that is already
shared, because it considers that space already allocated.  So a later
overwrite of such a shared extent may hit ENOSPC errors.

> 
> With that in mind, how do you all think we ought to resolve this?
> Should we add a new fallocate mode flag that means "unshare the shared
> blocks"?  Obviously, this unshare flag cannot be used in conjunction
> with hole punching, zero range, insert range, or collapse range.  This
> breaks the expectation that writing to a file after fallocate won't
> ENOSPC.
> 
> Or is it ok that fallocate could block, potentially for a long time as
> we stream cows through the page cache (or however unshare works
> internally)?  Those same programs might not be expecting fallocate to
> take a long time.
> 
> Can we do better than either solution?  It occurs to me that XFS does
> unshare by reading the file data into the pagecache, marking the pages
> dirty, and flushing the dirty pages; performance could be improved by
> skipping the flush at the end.  We won't ENOSPC, because the XFS
> delalloc system is careful enough to check that there are enough free
> blocks to handle both the allocation and the metadata updates.  The
> only gap in this scheme that I can see is if we fallocate, crash, and
> upon restart the program then tries to write without retrying the
> fallocate.  Can we trade some performance for the added requirement
> that we must fallocate -> write -> fsync, and retry the trio if we
> crash before the fsync returns?  I think that's already an implicit
> requirement, so we might be ok here.
> 
> Opinions?  I rather like the last option, though I've only just
> thought of it and have not had time to examine it thoroughly, and it's
> specific to XFS. :)

I'd vote for another mode for 'unshare the shared blocks'.

Thanks,

-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
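
To make the "separate fallocate mode" option concrete, this is roughly what a
caller would look like if the unshare operation were given its own flag. The
FALLOC_FL_UNSHARE name and value below are hypothetical, made up for this
illustration; no such flag existed at the time of this thread, so current
kernels would simply reject the call.

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  /* Hypothetical flag value, for illustration only. */
  #ifndef FALLOC_FL_UNSHARE
  #define FALLOC_FL_UNSHARE 0x40
  #endif

  int main(int argc, char **argv)
  {
          int fd;

          if (argc != 2) {
                  fprintf(stderr, "usage: %s <file>\n", argv[0]);
                  return 1;
          }
          fd = open(argv[1], O_RDWR);
          if (fd < 0) {
                  perror("open");
                  return 1;
          }
          /* Ask the filesystem to unshare (CoW-copy) any shared blocks in the
           * first 1 MiB, so that later writes to that range cannot ENOSPC. */
          if (fallocate(fd, FALLOC_FL_UNSHARE, 0, 1024 * 1024) < 0)
                  perror("fallocate(FALLOC_FL_UNSHARE)");
          close(fd);
          return 0;
  }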


[PATCH] btrfs: handle non-fatal errors in btrfs_qgroup_inherit()

2016-03-30 Thread Mark Fasheh
create_pending_snapshot() will go readonly on _any_ error return from
btrfs_qgroup_inherit(). If qgroups are enabled, a user can crash their fs by
just making a snapshot and asking it to inherit from an invalid qgroup. For
example:

$ btrfs sub snap -i 1/10 /btrfs/ /btrfs/foo

Will cause a transaction abort.

Fix this by only throwing errors in btrfs_qgroup_inherit() when we know
going readonly is acceptable.

The following xfstests test case reproduces this bug:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"

  here=`pwd`
  tmp=/tmp/$$
  status=1  # failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
cd /
rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # remove previous $seqres.full before test
  rm -f $seqres.full

  # real QA test starts here
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch

  rm -f $seqres.full

  _scratch_mkfs
  _scratch_mount
  _run_btrfs_util_prog quota enable $SCRATCH_MNT
  # The qgroup '1/10' does not exist and should be silently ignored
  _run_btrfs_util_prog subvolume snapshot -i 1/10 $SCRATCH_MNT 
$SCRATCH_MNT/snap1

  _scratch_unmount

  echo "Silence is golden"

  status=0
  exit

Signed-off-by: Mark Fasheh 
---
 fs/btrfs/qgroup.c | 54 --
 1 file changed, 32 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 994dab0..9e11955 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1851,8 +1851,10 @@ out:
 }
 
 /*
- * copy the acounting information between qgroups. This is necessary when a
- * snapshot or a subvolume is created
+ * Copy the acounting information between qgroups. This is necessary
+ * when a snapshot or a subvolume is created. Throwing an error will
+ * cause a transaction abort so we take extra care here to only error
+ * when a readonly fs is a reasonable outcome.
  */
 int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans,
 struct btrfs_fs_info *fs_info, u64 srcid, u64 objectid,
@@ -1882,15 +1884,15 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans,
   2 * inherit->num_excl_copies;
for (i = 0; i < nums; ++i) {
srcgroup = find_qgroup_rb(fs_info, *i_qgroups);
-   if (!srcgroup) {
-   ret = -EINVAL;
-   goto out;
-   }
 
-   if ((srcgroup->qgroupid >> 48) <= (objectid >> 48)) {
-   ret = -EINVAL;
-   goto out;
-   }
+   /*
+* Zero out invalid groups so we can ignore
+* them later.
+*/
+   if (!srcgroup ||
+   ((srcgroup->qgroupid >> 48) <= (objectid >> 48)))
+   *i_qgroups = 0ULL;
+
++i_qgroups;
}
}
@@ -1925,17 +1927,19 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans,
 */
if (inherit) {
i_qgroups = (u64 *)(inherit + 1);
-   for (i = 0; i < inherit->num_qgroups; ++i) {
+   for (i = 0; i < inherit->num_qgroups; ++i, ++i_qgroups) {
+   if (*i_qgroups == 0)
+   continue;
ret = add_qgroup_relation_item(trans, quota_root,
   objectid, *i_qgroups);
-   if (ret)
+   if (ret && ret != -EEXIST)
goto out;
ret = add_qgroup_relation_item(trans, quota_root,
   *i_qgroups, objectid);
-   if (ret)
+   if (ret && ret != -EEXIST)
goto out;
-   ++i_qgroups;
}
+   ret = 0;
}
 
 
@@ -1996,17 +2000,22 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans,
 
i_qgroups = (u64 *)(inherit + 1);
for (i = 0; i < inherit->num_qgroups; ++i) {
-   ret = add_relation_rb(quota_root->fs_info, objectid,
- *i_qgroups);
-   if (ret)
-   goto unlock;
+   if (*i_qgroups) {
+   ret = add_relation_rb(quota_root->fs_info, objectid,
+ *i_qgroups);
+   if (ret)
+   goto unlock;
+   }
++i_qgroups;
}
 
-   for (i = 0; i <  inherit->num_ref_copies; ++i) {
+   for (i = 0; i <  inherit->num_ref_copies; ++i, i_qgroups += 2) {
  

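A short standalone illustration of the (qgroupid >> 48) checks in the hunks
above: btrfs packs the "level/id" notation from the commit message into a
single u64 with the level in the top 16 bits, so shifting right by 48 recovers
the level. The shift value mirrors the kernel's BTRFS_QGROUP_LEVEL_SHIFT; the
rest of the program is made up for the example.

  #include <stdio.h>
  #include <stdint.h>

  #define BTRFS_QGROUP_LEVEL_SHIFT 48

  /* Build a qgroup id from its "level/id" form, e.g. "1/10". */
  static uint64_t make_qgroupid(uint64_t level, uint64_t id)
  {
          return (level << BTRFS_QGROUP_LEVEL_SHIFT) | id;
  }

  int main(void)
  {
          uint64_t qg = make_qgroupid(1, 10); /* the "1/10" group from the example */
          uint64_t subvol = 256;              /* a plain subvolume id, level 0 */

          printf("qgroupid 0x%llx has level %llu\n",
                 (unsigned long long)qg,
                 (unsigned long long)(qg >> BTRFS_QGROUP_LEVEL_SHIFT));
          /* The inherit check only accepts source groups whose level is above
           * the level of the new subvolume (0). */
          printf("level(1/10) > level(256)? %d\n",
                 (qg >> BTRFS_QGROUP_LEVEL_SHIFT) > (subvol >> BTRFS_QGROUP_LEVEL_SHIFT));
          return 0;
  }
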
Re: fallocate mode flag for "unshare blocks"?

2016-03-30 Thread Dave Chinner
On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote:
> Or is it ok that fallocate could block, potentially for a long time as
> we stream cows through the page cache (or however unshare works
> internally)?  Those same programs might not be expecting fallocate to
> take a long time.

Yes, it's perfectly fine for fallocate to block for long periods of
time. See what gfs2 does during preallocation of blocks - it ends up
calling sb_issue_zerout() because it doesn't have unwritten
extents, and hence can block for long periods of time

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bad metadata crossing stripe boundary

2016-03-30 Thread Qu Wenruo



Kai Krakow wrote on 2016/03/28 12:02 +0200:

Changing subject to reflect the current topic...

Am Sun, 27 Mar 2016 21:55:40 +0800
schrieb Qu Wenruo :


I finally got copy&paste data:

# before mounting let's check the FS:

$ sudo btrfsck /dev/disk/by-label/usb-backup
Checking filesystem on /dev/disk/by-label/usb-backup
UUID: 1318ec21-c421-4e36-a44a-7be3d41f9c3f
checking extents
bad metadata [156041216, 156057600) crossing stripe boundary
bad metadata [181403648, 181420032) crossing stripe boundary
bad metadata [392167424, 392183808) crossing stripe boundary
bad metadata [783482880, 783499264) crossing stripe boundary
bad metadata [784924672, 784941056) crossing stripe boundary
bad metadata [130151612416, 130151628800) crossing stripe boundary
bad metadata [162826813440, 162826829824) crossing stripe boundary
bad metadata [162927083520, 162927099904) crossing stripe boundary
bad metadata [619740659712, 619740676096) crossing stripe boundary
bad metadata [619781947392, 619781963776) crossing stripe boundary
bad metadata [619795644416, 619795660800) crossing stripe boundary
bad metadata [619816091648, 619816108032) crossing stripe boundary
bad metadata [620011388928, 620011405312) crossing stripe boundary
bad metadata [890992459776, 890992476160) crossing stripe boundary
bad metadata [891022737408, 891022753792) crossing stripe boundary
bad metadata [891101773824, 891101790208) crossing stripe boundary
bad metadata [891301199872, 891301216256) crossing stripe boundary
bad metadata [1012219314176, 1012219330560) crossing stripe boundary
bad metadata [1017202409472, 1017202425856) crossing stripe boundary
bad metadata [1017365397504, 1017365413888) crossing stripe boundary
bad metadata [1020764422144, 1020764438528) crossing stripe boundary
bad metadata [1251103342592, 1251103358976) crossing stripe boundary
bad metadata [1251144695808, 1251144712192) crossing stripe boundary
bad metadata [1251147055104, 1251147071488) crossing stripe boundary
bad metadata [1259271225344, 1259271241728) crossing stripe boundary
bad metadata [1266223611904, 1266223628288) crossing stripe boundary
bad metadata [1304750063616, 130475008) crossing stripe boundary
bad metadata [1304790106112, 1304790122496) crossing stripe boundary
bad metadata [1304850792448, 1304850808832) crossing stripe boundary
bad metadata [1304869928960, 1304869945344) crossing stripe boundary
bad metadata [1305089540096, 1305089556480) crossing stripe boundary
bad metadata [1309561651200, 1309561667584) crossing stripe boundary
bad metadata [1309581443072, 1309581459456) crossing stripe boundary
bad metadata [1309583671296, 1309583687680) crossing stripe boundary
bad metadata [1309942808576, 1309942824960) crossing stripe boundary
bad metadata [1310050549760, 1310050566144) crossing stripe boundary
bad metadata [1313031585792, 1313031602176) crossing stripe boundary
bad metadata [1313232912384, 1313232928768) crossing stripe boundary
bad metadata [1555210764288, 1555210780672) crossing stripe boundary
bad metadata [1555395182592, 1555395198976) crossing stripe boundary
bad metadata [205057678, 2050576760832) crossing stripe boundary
bad metadata [2050803957760, 2050803974144) crossing stripe boundary
bad metadata [2050969108480, 2050969124864) crossing stripe
boundary


As already mentioned in another reply, this *seems* to be a false alert.
The latest btrfs-progs would help.


No, btrfs-progs 4.5 reports those, too (as far as I understood, it
includes the fixes for bogus "bad metadata" errors, though I thought this
had already been fixed in 4.2.1; I used 4.4.1). There were some wrong
nbytes errors before, which I already repaired using "--repair". I think
that's okay: I had those in the past and it looks like btrfsck can
repair them now (and I don't have to delete and recreate the files).
They caused problems with "du" and "df" in the past, a problem I'm
currently facing too, so it was better to fix them.

With that done, the backup fs now only reports "bad metadata" which
have been there before space cache v2. Full output below.


checking free space tree cache and super generation don't match,
space cache will be invalidated checking fs roots

Err, I found a missing '\n' before "checking fs roots".


A copy-and-paste problem. Claws Mail pretends to be smarter than me
- I missed fixing that one. ;-)


I was searching for the missing '\n' and hoping to find a chance to
submit a new patch.

What a pity. :(




And it seems that fs roots and extent tree are all OK.

Quite surprising.
The only possible problem seems to be outdated space cache.

Maybe mount with "-o clear_cache" will help, but I don't think that's
the cause.


Helped, it automatically reverted the FS back to space cache v1 with
incompat flag cleared. (I wouldn't have enabled v2 if it wasn't
documented that this is possible)


checking csums
checking root refs
found 1860217443214 bytes used err is 0
total csum bytes: 1805105116
total tree bytes: 11793776640
total fs tree bytes: 8220835840
total ex

Re: [kbuild-all] [PATCH 11/12] btrfs: introduce helper functions to perform hot replace

2016-03-30 Thread Fengguang Wu
On Wed, Mar 30, 2016 at 06:13:43PM +0800, Anand Jain wrote:
> 
> 
> Hi,
> 
>  You are missing the patch set which includes
>https://patchwork.kernel.org/patch/8659651/
> 
>  btrfs: refactor btrfs_dev_replace_start for reuse

Sorry, that comes in another patchset, and the robot currently is not
smart enough to understand the relationship between the two patchsets.

Thanks,
Fengguang

> On 03/29/2016 10:45 PM, kbuild test robot wrote:
> >Hi Anand,
> >
> >[auto build test ERROR on btrfs/next]
> >[also build test ERROR on v4.6-rc1 next-20160329]
> >[if your patch is applied to the wrong git tree, please drop us a note to 
> >help improving the system]
> >
> >url:
> >https://github.com/0day-ci/linux/commits/Anand-Jain/btrfs-Introduce-a-new-function-to-check-if-all-chunks-a-OK-for-degraded-mount/20160329-222724
> >base:   
> >https://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git next
> >config: sparc64-allmodconfig (attached as .config)
> >reproduce:
> > wget 
> > https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
> >  -O ~/bin/make.cross
> > chmod +x ~/bin/make.cross
> > # save the attached .config to linux build tree
> > make.cross ARCH=sparc64
> >
> >All error/warnings (new ones prefixed by >>):
> >
> >fs/btrfs/dev-replace.c: In function 'btrfs_auto_replace_start':
> >>>fs/btrfs/dev-replace.c:962:8: warning: passing argument 2 of 
> >>>'btrfs_dev_replace_start' from incompatible pointer type
> >  ret = btrfs_dev_replace_start(root, tgt_path,
> >^
> >fs/btrfs/dev-replace.c:308:5: note: expected 'struct 
> > btrfs_ioctl_dev_replace_args *' but argument is of type 'char *'
> > int btrfs_dev_replace_start(struct btrfs_root *root,
> > ^
> >>>fs/btrfs/dev-replace.c:962:8: error: too many arguments to function 
> >>>'btrfs_dev_replace_start'
> >  ret = btrfs_dev_replace_start(root, tgt_path,
> >^
> >fs/btrfs/dev-replace.c:308:5: note: declared here
> > int btrfs_dev_replace_start(struct btrfs_root *root,
> > ^
> >
> >vim +/btrfs_dev_replace_start +962 fs/btrfs/dev-replace.c
> >
> >956  if (btrfs_get_spare_device(&tgt_path)) {
> >957  btrfs_err(root->fs_info,
> >958  "No spare device found/configured in 
> > the kernel");
> >959  return -EINVAL;
> >960  }
> >961  
> >  > 962  ret = btrfs_dev_replace_start(root, tgt_path,
> >963  src_device->devid,
> >964  
> > rcu_str_deref(src_device->name),
> >965  
> > BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID);
> >
> >---
> >0-DAY kernel test infrastructureOpen Source Technology Center
> >https://lists.01.org/pipermail/kbuild-all   Intel Corporation
> >
> ___
> kbuild-all mailing list
> kbuild-...@lists.01.org
> https://lists.01.org/mailman/listinfo/kbuild-all
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: handle non-fatal errors in btrfs_qgroup_inherit()

2016-03-30 Thread Qu Wenruo



Mark Fasheh wrote on 2016/03/30 17:57 -0700:

create_pending_snapshot() will go readonly on _any_ error return from
btrfs_qgroup_inherit(). If qgroups are enabled, a user can crash their fs by
just making a snapshot and asking it to inherit from an invalid qgroup. For
example:

$ btrfs sub snap -i 1/10 /btrfs/ /btrfs/foo

Will cause a transaction abort.

Fix this by only throwing errors in btrfs_qgroup_inherit() when we know
going readonly is acceptable.

The following xfstests test case reproduces this bug:

   seq=`basename $0`
   seqres=$RESULT_DIR/$seq
   echo "QA output created by $seq"

   here=`pwd`
   tmp=/tmp/$$
   status=1 # failure is the default!
   trap "_cleanup; exit \$status" 0 1 2 3 15

   _cleanup()
   {
cd /
rm -f $tmp.*
   }

   # get standard environment, filters and checks
   . ./common/rc
   . ./common/filter

   # remove previous $seqres.full before test
   rm -f $seqres.full

   # real QA test starts here
   _supported_fs btrfs
   _supported_os Linux
   _require_scratch

   rm -f $seqres.full

   _scratch_mkfs
   _scratch_mount
   _run_btrfs_util_prog quota enable $SCRATCH_MNT
   # The qgroup '1/10' does not exist and should be silently ignored
   _run_btrfs_util_prog subvolume snapshot -i 1/10 $SCRATCH_MNT 
$SCRATCH_MNT/snap1

   _scratch_unmount

   echo "Silence is golden"

   status=0
   exit

Signed-off-by: Mark Fasheh 


Reviewed-by: Qu Wenruo 

Looks good to me, and you're right that the current check is too restrictive
and will cause an annoying transaction abort.


Although silently ignoring an invalid assignment is somewhat too casual, it's
already much better than the current transaction abort.


Thanks,
Qu

---
  fs/btrfs/qgroup.c | 54 --
  1 file changed, 32 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 994dab0..9e11955 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1851,8 +1851,10 @@ out:
  }

  /*
- * copy the acounting information between qgroups. This is necessary when a
- * snapshot or a subvolume is created
+ * Copy the acounting information between qgroups. This is necessary
+ * when a snapshot or a subvolume is created. Throwing an error will
+ * cause a transaction abort so we take extra care here to only error
+ * when a readonly fs is a reasonable outcome.
   */
  int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans,
 struct btrfs_fs_info *fs_info, u64 srcid, u64 objectid,
@@ -1882,15 +1884,15 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans,
   2 * inherit->num_excl_copies;
for (i = 0; i < nums; ++i) {
srcgroup = find_qgroup_rb(fs_info, *i_qgroups);
-   if (!srcgroup) {
-   ret = -EINVAL;
-   goto out;
-   }

-   if ((srcgroup->qgroupid >> 48) <= (objectid >> 48)) {
-   ret = -EINVAL;
-   goto out;
-   }
+   /*
+* Zero out invalid groups so we can ignore
+* them later.
+*/
+   if (!srcgroup ||
+   ((srcgroup->qgroupid >> 48) <= (objectid >> 48)))
+   *i_qgroups = 0ULL;
+
++i_qgroups;
}
}
@@ -1925,17 +1927,19 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans,
 */
if (inherit) {
i_qgroups = (u64 *)(inherit + 1);
-   for (i = 0; i < inherit->num_qgroups; ++i) {
+   for (i = 0; i < inherit->num_qgroups; ++i, ++i_qgroups) {
+   if (*i_qgroups == 0)
+   continue;
ret = add_qgroup_relation_item(trans, quota_root,
   objectid, *i_qgroups);
-   if (ret)
+   if (ret && ret != -EEXIST)
goto out;
ret = add_qgroup_relation_item(trans, quota_root,
   *i_qgroups, objectid);
-   if (ret)
+   if (ret && ret != -EEXIST)
goto out;
-   ++i_qgroups;
}
+   ret = 0;
}


@@ -1996,17 +2000,22 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans,

i_qgroups = (u64 *)(inherit + 1);
for (i = 0; i < inherit->num_qgroups; ++i) {
-   ret = add_relation_rb(quota_root->fs_info, objectid,
- *i_qgroups);
-   if (ret)
-   goto unlock;
+   if (*i_qgroups) {
+   ret = add_relation_rb(quota_root->fs_info, obj

[PATCH] btrfs-progs: fsck: Fix a false metadata extent warning

2016-03-30 Thread Qu Wenruo
At least 2 users from the mailing list reported that btrfsck gives false
alerts of "bad metadata [,) crossing stripe boundary".

However, the reported ranges are all inside the same 64K boundary.
After some checking, all the false alerts share the same feature: the
extent bytenr is divisible by the stripe size (64K).

The cause seems to be that the initial 'max_size' can be 0, causing
'start' + 'max_size' - 1 to cross the stripe boundary.

Fix it by always updating extent_record->crossing_stripes when the
extent_record is updated, to avoid reporting temporary false alerts.

Signed-off-by: Qu Wenruo 
---
 cmds-check.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index d157075..ef23ddb 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -4579,9 +4579,9 @@ static int add_extent_rec(struct cache_tree *extent_cache,
 * As now stripe_len is fixed to BTRFS_STRIPE_LEN, just check
 * it.
 */
-   if (metadata && check_crossing_stripes(rec->start,
-  rec->max_size))
-   rec->crossing_stripes = 1;
+   if (metadata)
+   rec->crossing_stripes = check_crossing_stripes(
+   rec->start, rec->max_size);
check_extent_type(rec);
maybe_free_extent_rec(extent_cache, rec);
return ret;
@@ -4641,8 +4641,8 @@ static int add_extent_rec(struct cache_tree *extent_cache,
}
 
if (metadata)
-   if (check_crossing_stripes(rec->start, rec->max_size))
-   rec->crossing_stripes = 1;
+   rec->crossing_stripes = check_crossing_stripes(rec->start,
+   rec->max_size);
check_extent_type(rec);
return ret;
 }
-- 
2.7.4



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
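
Assuming check_crossing_stripes() compares the 64K stripe index of the first
and last byte of an extent (a plausible reading of the fix above, not a quote
of the btrfs-progs source), the false alert is easy to reproduce with numbers
from the earlier reports: a 64K-aligned start combined with a stale max_size
of 0 makes 'start + max_size - 1' fall into the previous stripe.

  #include <stdio.h>
  #include <stdint.h>

  #define BTRFS_STRIPE_LEN (64 * 1024ULL)

  /* Does [start, start + len) span more than one 64K stripe? */
  static int crosses_stripe(uint64_t start, uint64_t len)
  {
          return (start / BTRFS_STRIPE_LEN) !=
                 ((start + len - 1) / BTRFS_STRIPE_LEN);
  }

  int main(void)
  {
          uint64_t start = 156041216ULL; /* 64K-aligned start from the report */

          /* Real 16K tree block: stays inside one stripe. */
          printf("len 16384: crosses=%d\n", crosses_stripe(start, 16384)); /* 0 */
          /* Stale max_size of 0: the "last byte" lands in the previous stripe. */
          printf("len 0:     crosses=%d\n", crosses_stripe(start, 0));     /* 1 */
          return 0;
  }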


Re: "bad metadata" not fixed by btrfs repair

2016-03-30 Thread Qu Wenruo



Henk Slager wrote on 2016/03/30 16:03 +0200:

On Wed, Mar 30, 2016 at 9:00 AM, Qu Wenruo  wrote:

First of all.

The "crossing stripe boundary" error message itself is *HARMLESS* for recent
kernels.

It only means, that metadata extent won't be checked by scrub on recent
kernels.
Because scrub by its codes, has a limitation that, it can only check tree
blocks which are inside a 64K block.

Old kernel won't have anything wrong, until that tree block is being
scrubbed.
When scrubbed, old kernel just BUG_ON().

Now recent kernel will handle such limitation by checking extent allocation
and avoid crossing boundary, so new created fs with new kernel won't cause
such error message at all.

But for old created fs, the problem can't be avoided, but at least, new
kernels will not BUG_ON() when you scrub these extents, they just get
ignored (not that good, but at least no BUG_ON).

And new fsck will check such case, gives such warning.

Overall, you're OK if you are using recent kernels.

Marc Haber wrote on 2016/03/29 08:43 +0200:


On Mon, Mar 28, 2016 at 03:35:32PM -0400, Austin S. Hemmelgarn wrote:


Did you convert this filesystem from ext4 (or ext3)?



No.


You hadn't mentioned what version of btrfs-progs you're using, and that
is
somewhat important for recovery.  I'm not sure if current versions of
btrfs
check can fix this issue, but I know for a fact that older versions
(prior
to at least 4.1) can not fix it.



4.1 for creation and btrfs check.



I assume that you have run older kernel on it, like v4.1 or v4.2.

In those old kernels, it lacks the check to avoid such extent allocation
check.




As far as what the kernel is involved with, the easy way to check is if
it's
operating on a mounted filesystem or not.  If it only operates on mounted
filesystems, it almost certainly goes through the kernel, if it only
operates on unmounted filesystems, it's almost certainly done in
userspace
(except dev scan and technically fi show).



Then btrfs check is a userspace-only matter, as it wants the fs
unmounted, and it is irrelevant that I did btrfs check from a rescue
system with an older kernel, 3.16 if I recall correctly.



Not recommended to use older kernel to RW mount or use older fsck to do
repair.
As it's possible that older kernel/btrfsck may allocate extent that cross
the 64K boundary.




2. Regarding general support:  If you're using an enterprise distribution
(RHEL, SLES, CentOS, OEL, or something similar), you are almost certainly
going to get better support from your vendor than from the mailing list
or
IRC.



My "productive" desktops (fan is one of them) run Debian unstable with
a current vanilla kernel. At the moment, I can't use 4.5 because it
acts up with KVM.  When I need a rescue system, I use grml, which
unfortunately hasn't released since November 2014 and is still with
kernel 3.16



To fix your problem(make these error message just disappear, even they are
harmless on recent kernels), the most easy one, is to balance your metadata.


I did a balance with filter -musage=100  (kernel/tools 4.5/4.5) of the
filesystem mentioned in here:
http://www.spinics.net/lists/linux-btrfs/msg51405.html

but I still get "bad metadata [ ) crossing stripe boundary" messages,
double the amount compared to 2 months ago

Kernel operating this fs has always been maximum 1 month behind
'Latest Stable Kernel' (kernel.org terminology)


Would you please try the following patch?
https://patchwork.kernel.org/patch/8706891/

It is based on v4.5 and I think it should fix the false alert.

Thanks,
Qu




As I explained, the bug only lies in metadata, and balance will allocate new
tree blocks, then copy old data into new locations.

In the allocation process of recent kernel, it will avoid such cross
boundary, and to fix your problem.

But if you are using old kernels, don't scrub your metadata.

Thanks,
Qu



Greetings
Marc




--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html






--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bad metadata crossing stripe boundary

2016-03-30 Thread Qu Wenruo



Qu Wenruo wrote on 2016/03/31 09:33 +0800:



Kai Krakow wrote on 2016/03/28 12:02 +0200:

Changing subject to reflect the current topic...

Am Sun, 27 Mar 2016 21:55:40 +0800
schrieb Qu Wenruo :


I finally got copy&paste data:

# before mounting let's check the FS:

$ sudo btrfsck /dev/disk/by-label/usb-backup
Checking filesystem on /dev/disk/by-label/usb-backup
UUID: 1318ec21-c421-4e36-a44a-7be3d41f9c3f
checking extents
bad metadata [156041216, 156057600) crossing stripe boundary
bad metadata [181403648, 181420032) crossing stripe boundary
bad metadata [392167424, 392183808) crossing stripe boundary
bad metadata [783482880, 783499264) crossing stripe boundary
bad metadata [784924672, 784941056) crossing stripe boundary
bad metadata [130151612416, 130151628800) crossing stripe boundary
bad metadata [162826813440, 162826829824) crossing stripe boundary
bad metadata [162927083520, 162927099904) crossing stripe boundary
bad metadata [619740659712, 619740676096) crossing stripe boundary
bad metadata [619781947392, 619781963776) crossing stripe boundary
bad metadata [619795644416, 619795660800) crossing stripe boundary
bad metadata [619816091648, 619816108032) crossing stripe boundary
bad metadata [620011388928, 620011405312) crossing stripe boundary
bad metadata [890992459776, 890992476160) crossing stripe boundary
bad metadata [891022737408, 891022753792) crossing stripe boundary
bad metadata [891101773824, 891101790208) crossing stripe boundary
bad metadata [891301199872, 891301216256) crossing stripe boundary
bad metadata [1012219314176, 1012219330560) crossing stripe boundary
bad metadata [1017202409472, 1017202425856) crossing stripe boundary
bad metadata [1017365397504, 1017365413888) crossing stripe boundary
bad metadata [1020764422144, 1020764438528) crossing stripe boundary
bad metadata [1251103342592, 1251103358976) crossing stripe boundary
bad metadata [1251144695808, 1251144712192) crossing stripe boundary
bad metadata [1251147055104, 1251147071488) crossing stripe boundary
bad metadata [1259271225344, 1259271241728) crossing stripe boundary
bad metadata [1266223611904, 1266223628288) crossing stripe boundary
bad metadata [1304750063616, 130475008) crossing stripe boundary
bad metadata [1304790106112, 1304790122496) crossing stripe boundary
bad metadata [1304850792448, 1304850808832) crossing stripe boundary
bad metadata [1304869928960, 1304869945344) crossing stripe boundary
bad metadata [1305089540096, 1305089556480) crossing stripe boundary
bad metadata [1309561651200, 1309561667584) crossing stripe boundary
bad metadata [1309581443072, 1309581459456) crossing stripe boundary
bad metadata [1309583671296, 1309583687680) crossing stripe boundary
bad metadata [1309942808576, 1309942824960) crossing stripe boundary
bad metadata [1310050549760, 1310050566144) crossing stripe boundary
bad metadata [1313031585792, 1313031602176) crossing stripe boundary
bad metadata [1313232912384, 1313232928768) crossing stripe boundary
bad metadata [1555210764288, 1555210780672) crossing stripe boundary
bad metadata [1555395182592, 1555395198976) crossing stripe boundary
bad metadata [205057678, 2050576760832) crossing stripe boundary
bad metadata [2050803957760, 2050803974144) crossing stripe boundary
bad metadata [2050969108480, 2050969124864) crossing stripe
boundary


As already mentioned in another reply, this *seems* to be a false alert.
The latest btrfs-progs would help.


No, btrfs-progs 4.5 reports those, too (as far as I understood, it
includes the fixes for bogus "bad metadata" errors, though I thought this
had already been fixed in 4.2.1; I used 4.4.1). There were some wrong
nbytes errors before, which I already repaired using "--repair". I think
that's okay: I had those in the past and it looks like btrfsck can
repair them now (and I don't have to delete and recreate the files).
They caused problems with "du" and "df" in the past, a problem I'm
currently facing too, so it was better to fix them.

With that done, the backup fs now only reports "bad metadata" which
have been there before space cache v2. Full output below.


checking free space tree cache and super generation don't match,
space cache will be invalidated checking fs roots

Err, I found a missing '\n' before "checking fs roots".


A copy-and-paste problem. Claws Mail pretends to be smarter than me
- I missed fixing that one. ;-)


I was searching for the missing '\n' and hoping to find a chance to
submit a new patch.
What a pity. :(




And it seems that fs roots and extent tree are all OK.

Quite surprising.
The only possible problem seems to be outdated space cache.

Maybe mount with "-o clear_cache" will help, but I don't think that's
the cause.


Helped, it automatically reverted the FS back to space cache v1 with
incompat flag cleared. (I wouldn't have enabled v2 if it wasn't
documented that this is possible)


checking csums
checking root refs
found 1860217443214 bytes used err is 0
total csum bytes: 1805105116
total tree bytes: 11793776

Re: [PATCH 1/2] fstests: generic test for fsync after renaming directory

2016-03-30 Thread Eryu Guan
On Wed, Mar 30, 2016 at 10:38:42AM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> Test that if we rename a directory, create a new file or directory that
> has the old name of our former directory and is a child of the same
> parent directory, fsync the new inode, power fail and mount the
> filesystem, we see our first directory with the new name and no files
> under it were lost.
> 
> This test is motivated by an issue found in btrfs which is fixed by the
> following patch for the linux kernel:
> 
>   "Btrfs: fix file loss caused by fsync after rename and new inode"
> 
> Signed-off-by: Filipe Manana 

Looks good to me. Tested on ext4/3, xfs and btrfs with a 4.6-rc1 kernel;
btrfs failed as expected, ext4/3 and xfs all passed.

Reviewed-by: Eryu Guan 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] fstests: generic test for fsync after renaming file

2016-03-30 Thread Eryu Guan
On Wed, Mar 30, 2016 at 10:39:10AM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> Test that if we rename a file, create a new file that has the old name
> of the other file and is a child of the same parent directory, fsync the
> new inode, power fail and mount the filesystem, we do not lose the first
> file and that file has the name it was renamed to.
> 
> This test is motivated by an issue found in btrfs which is fixed by the
> following patch for the linux kernel:
> 
>   "Btrfs: fix file loss caused by fsync after rename and new inode"
> 
> Signed-off-by: Filipe Manana 

Looks good to me. Tested on ext4/3, xfs and btrfs with a 4.6-rc1 kernel;
btrfs failed as expected, ext4/3 and xfs all passed.

Reviewed-by: Eryu Guan 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
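
A minimal userspace sketch of the syscall sequence these two tests exercise,
as described in the commit messages (rename, create a new inode under the old
name, fsync only the new inode); the power-fail step is of course performed by
the test harness, not by the program. The paths are made up for illustration
and the target directory is assumed to already contain "foo".

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          int fd;

          /* Rename the existing file... */
          if (rename("dir/foo", "dir/bar") < 0) {
                  perror("rename");
                  return 1;
          }
          /* ...create a new file that takes over the old name... */
          fd = open("dir/foo", O_WRONLY | O_CREAT | O_EXCL, 0644);
          if (fd < 0) {
                  perror("open");
                  return 1;
          }
          /* ...and fsync only the new inode.  After a power failure here, both
           * dir/foo and dir/bar are expected to survive. */
          if (fsync(fd) < 0) {
                  perror("fsync");
                  return 1;
          }
          close(fd);
          return 0;
  }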


How to cancel btrfs balance on unmounted filesystem

2016-03-30 Thread Marc Haber
Hi,

one of my problem btrfs instances went into a hung process state
while balancing metadata. This balance is recorded in the filesystem
somehow, and it restarts immediately after mounting the filesystem,
with no chance to issue a btrfs balance cancel command before the
system hangs again.

Is there any possibility to cancel the pending balance without mounting
the fs first?

I have also filed https://bugzilla.kernel.org/show_bug.cgi?id=115581
to adress this in a more elegant way.

Greetings
Marc

-- 
-
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany|  lose things."Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html