[PATCH v2 5/7] Btrfs: remove BUG() in add_data_reference

2017-08-07 Thread Liu Bo
Now that we have a helper to report invalid value of extent inline ref
type, we need to quit gracefully instead of throwing out a kernel panic.

Signed-off-by: Liu Bo 
---
 fs/btrfs/relocation.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index f0bef3c..4806e78 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3774,7 +3774,10 @@ int add_data_references(struct reloc_control *rc,
ret = find_data_references(rc, extent_key,
   eb, dref, blocks);
} else {
-   BUG();
+   ret = -EINVAL;
+   WARN(1,
+"extent %llu slot %d has an invalid inline ref type\n",
+eb->start, path->slots[0]);
}
if (ret) {
err = ret;
-- 
2.9.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 7/7] Btrfs: add one more sanity check for shared ref type

2017-08-07 Thread Liu Bo
Every shared ref has a parent tree block, which can be get from
btrfs_extent_inline_ref_offset().  And the tree block must be aligned
to the nodesize, so we'd know this inline ref is not valid if this
block's bytenr is not aligned to the nodesize, in which case, most
likely the ref type has been misused.

This adds the above mentioned check and also updates
print_extent_item() called by btrfs_print_leaf() to point out the
invalid ref while printing the tree structure.

Signed-off-by: Liu Bo 
---
 fs/btrfs/extent-tree.c | 29 +
 fs/btrfs/print-tree.c  | 27 +--
 2 files changed, 46 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 80b4db3c..61603ec 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1158,19 +1158,40 @@ int btrfs_get_extent_inline_ref_type(struct 
extent_buffer *eb,
 enum btrfs_inline_ref_types is_data)
 {
int type = btrfs_extent_inline_ref_type(eb, iref);
+   u64 offset = btrfs_extent_inline_ref_offset(eb, iref);
 
if (type == BTRFS_TREE_BLOCK_REF_KEY ||
type == BTRFS_SHARED_BLOCK_REF_KEY ||
type == BTRFS_SHARED_DATA_REF_KEY ||
type == BTRFS_EXTENT_DATA_REF_KEY) {
if (is_data == BTRFS_REF_TYPE_BLOCK) {
-   if (type == BTRFS_TREE_BLOCK_REF_KEY ||
-   type == BTRFS_SHARED_BLOCK_REF_KEY)
+   if (type == BTRFS_TREE_BLOCK_REF_KEY)
return type;
+   if (type == BTRFS_SHARED_BLOCK_REF_KEY) {
+   ASSERT(eb->fs_info);
+   /*
+* Every shared one has parent tree
+* block, which must be aligned to
+* nodesize.
+*/
+   if (offset &&
+   IS_ALIGNED(offset, eb->fs_info->nodesize))
+   return type;
+   }
} else if (is_data == BTRFS_REF_TYPE_DATA) {
-   if (type == BTRFS_EXTENT_DATA_REF_KEY ||
-   type == BTRFS_SHARED_DATA_REF_KEY)
+   if (type == BTRFS_EXTENT_DATA_REF_KEY)
return type;
+   if (type == BTRFS_SHARED_DATA_REF_KEY) {
+   ASSERT(eb->fs_info);
+   /*
+* Every shared one has parent tree
+* block, which must be aligned to
+* nodesize.
+*/
+   if (offset &&
+   IS_ALIGNED(offset, eb->fs_info->nodesize))
+   return type;
+   }
} else {
ASSERT(is_data == BTRFS_REF_TYPE_ANY);
return type;
diff --git a/fs/btrfs/print-tree.c b/fs/btrfs/print-tree.c
index 0e52e47..9f8c5ee 100644
--- a/fs/btrfs/print-tree.c
+++ b/fs/btrfs/print-tree.c
@@ -44,7 +44,7 @@ static void print_dev_item(struct extent_buffer *eb,
 static void print_extent_data_ref(struct extent_buffer *eb,
  struct btrfs_extent_data_ref *ref)
 {
-   pr_info("\t\textent data backref root %llu objectid %llu offset %llu 
count %u\n",
+   pr_cont("extent data backref root %llu objectid %llu offset %llu count 
%u\n",
   btrfs_extent_data_ref_root(eb, ref),
   btrfs_extent_data_ref_objectid(eb, ref),
   btrfs_extent_data_ref_offset(eb, ref),
@@ -63,6 +63,7 @@ static void print_extent_item(struct extent_buffer *eb, int 
slot, int type)
u32 item_size = btrfs_item_size_nr(eb, slot);
u64 flags;
u64 offset;
+   int ref_index = 0;
 
if (item_size < sizeof(*ei)) {
 #ifdef BTRFS_COMPAT_EXTENT_TREE_V0
@@ -104,12 +105,20 @@ static void print_extent_item(struct extent_buffer *eb, 
int slot, int type)
iref = (struct btrfs_extent_inline_ref *)ptr;
type = btrfs_extent_inline_ref_type(eb, iref);
offset = btrfs_extent_inline_ref_offset(eb, iref);
+   pr_info("\t\tref#%d: ", ref_index++);
switch (type) {
case BTRFS_TREE_BLOCK_REF_KEY:
-   pr_info("\t\ttree block backref root %llu\n", offset);
+   pr_cont("tree block backref root %llu\n", offset);
break;
case BTRFS_SHARED_BLOCK_REF_KEY:
-   pr_info("\t\tshared block backref parent %llu\n", 
offset);
+   pr_cont("shared block backref parent %llu\n", offset);
+  

[PATCH v2 1/7] Btrfs: add a helper to retrive extent inline ref type

2017-08-07 Thread Liu Bo
An invalid value of extent inline ref type may be read from a
malicious image which may force btrfs to crash.

This adds a helper which does sanity check for the ref type, so we can
know if it's sane, return type if so, otherwise return an error.

v2: add enum type and return BTRFS_REF_TYPE_INVALID instead of -EINVAL

Signed-off-by: Liu Bo 
---
 fs/btrfs/ctree.h   | 11 +++
 fs/btrfs/extent-tree.c | 36 
 2 files changed, 47 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 3f3eb7b..9866420 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2561,6 +2561,17 @@ static inline gfp_t btrfs_alloc_write_mask(struct 
address_space *mapping)
 
 /* extent-tree.c */
 
+enum btrfs_inline_ref_types {
+   BTRFS_REF_TYPE_INVALID = 0,
+   BTRFS_REF_TYPE_BLOCK =   1,
+   BTRFS_REF_TYPE_DATA =2,
+   BTRFS_REF_TYPE_ANY = 3,
+};
+
+int btrfs_get_extent_inline_ref_type(struct extent_buffer *eb,
+struct btrfs_extent_inline_ref *iref,
+enum btrfs_inline_ref_types is_data);
+
 u64 btrfs_csum_bytes_to_leaves(struct btrfs_fs_info *fs_info, u64 csum_bytes);
 
 static inline u64 btrfs_calc_trans_metadata_size(struct btrfs_fs_info *fs_info,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e3b0b41..83a3cdc 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1148,6 +1148,42 @@ static int convert_extent_item_v0(struct 
btrfs_trans_handle *trans,
 }
 #endif
 
+/*
+ * is_data == BTRFS_REF_TYPE_BLOCK, tree block type is required,
+ * is_data == BTRFS_REF_TYPE_DATA, data type is requried,
+ * is_data == BTRFS_REF_TYPE_ANY, either type is OK.
+ */
+int btrfs_get_extent_inline_ref_type(struct extent_buffer *eb,
+struct btrfs_extent_inline_ref *iref,
+enum btrfs_inline_ref_types is_data)
+{
+   int type = btrfs_extent_inline_ref_type(eb, iref);
+
+   if (type == BTRFS_TREE_BLOCK_REF_KEY ||
+   type == BTRFS_SHARED_BLOCK_REF_KEY ||
+   type == BTRFS_SHARED_DATA_REF_KEY ||
+   type == BTRFS_EXTENT_DATA_REF_KEY) {
+   if (is_data == BTRFS_REF_TYPE_BLOCK) {
+   if (type == BTRFS_TREE_BLOCK_REF_KEY ||
+   type == BTRFS_SHARED_BLOCK_REF_KEY)
+   return type;
+   } else if (is_data == BTRFS_REF_TYPE_DATA) {
+   if (type == BTRFS_EXTENT_DATA_REF_KEY ||
+   type == BTRFS_SHARED_DATA_REF_KEY)
+   return type;
+   } else {
+   ASSERT(is_data == BTRFS_REF_TYPE_ANY);
+   return type;
+   }
+   }
+
+   btrfs_print_leaf(eb->fs_info, eb);
+   WARN(1, "eb %llu invalid extent inline ref type %d\n",
+eb->start, type);
+
+   return BTRFS_REF_TYPE_INVALID;
+}
+
 static u64 hash_extent_data_ref(u64 root_objectid, u64 owner, u64 offset)
 {
u32 high_crc = ~(u32)0;
-- 
2.9.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 2/7] Btrfs: convert to use btrfs_get_extent_inline_ref_type

2017-08-07 Thread Liu Bo
Since we have a helper which can do sanity check, this converts all
btrfs_extent_inline_ref_type to it.

Signed-off-by: Liu Bo 
---
 fs/btrfs/backref.c | 11 +--
 fs/btrfs/extent-tree.c | 36 ++--
 fs/btrfs/relocation.c  | 13 +++--
 3 files changed, 50 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index f723c11..1b3d9df 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1012,7 +1012,11 @@ static int __add_inline_refs(struct btrfs_path *path, 
u64 bytenr,
int type;
 
iref = (struct btrfs_extent_inline_ref *)ptr;
-   type = btrfs_extent_inline_ref_type(leaf, iref);
+   type = btrfs_get_extent_inline_ref_type(leaf, iref,
+   BTRFS_REF_TYPE_ANY);
+   if (type == BTRFS_REF_TYPE_INVALID)
+   return -EINVAL;
+
offset = btrfs_extent_inline_ref_offset(leaf, iref);
 
switch (type) {
@@ -1908,7 +1912,10 @@ static int __get_extent_inline_ref(unsigned long *ptr, 
struct extent_buffer *eb,
 
end = (unsigned long)ei + item_size;
*out_eiref = (struct btrfs_extent_inline_ref *)(*ptr);
-   *out_type = btrfs_extent_inline_ref_type(eb, *out_eiref);
+   *out_type = btrfs_get_extent_inline_ref_type(eb, *out_eiref,
+BTRFS_REF_TYPE_ANY);
+   if (*out_type == BTRFS_REF_TYPE_INVALID)
+   return -EINVAL;
 
*ptr += btrfs_extent_inline_ref_size(*out_type);
WARN_ON(*ptr > end);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 83a3cdc..80b4db3c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1453,12 +1453,18 @@ static noinline u32 extent_data_ref_count(struct 
btrfs_path *path,
struct btrfs_extent_data_ref *ref1;
struct btrfs_shared_data_ref *ref2;
u32 num_refs = 0;
+   int type;
 
leaf = path->nodes[0];
btrfs_item_key_to_cpu(leaf, , path->slots[0]);
if (iref) {
-   if (btrfs_extent_inline_ref_type(leaf, iref) ==
-   BTRFS_EXTENT_DATA_REF_KEY) {
+   /*
+* If type is invalid, we should have bailed out earlier than
+* this call.
+*/
+   type = btrfs_get_extent_inline_ref_type(leaf, iref, 
BTRFS_REF_TYPE_DATA);
+   ASSERT(type != BTRFS_REF_TYPE_INVALID);
+   if (type == BTRFS_EXTENT_DATA_REF_KEY) {
ref1 = (struct btrfs_extent_data_ref *)(>offset);
num_refs = btrfs_extent_data_ref_count(leaf, ref1);
} else {
@@ -1619,6 +1625,7 @@ int lookup_inline_extent_backref(struct 
btrfs_trans_handle *trans,
int ret;
int err = 0;
bool skinny_metadata = btrfs_fs_incompat(fs_info, SKINNY_METADATA);
+   int needed;
 
key.objectid = bytenr;
key.type = BTRFS_EXTENT_ITEM_KEY;
@@ -1710,6 +1717,11 @@ int lookup_inline_extent_backref(struct 
btrfs_trans_handle *trans,
BUG_ON(ptr > end);
}
 
+   if (owner >= BTRFS_FIRST_FREE_OBJECTID)
+   needed = BTRFS_REF_TYPE_DATA;
+   else
+   needed = BTRFS_REF_TYPE_BLOCK;
+
err = -ENOENT;
while (1) {
if (ptr >= end) {
@@ -1717,7 +1729,12 @@ int lookup_inline_extent_backref(struct 
btrfs_trans_handle *trans,
break;
}
iref = (struct btrfs_extent_inline_ref *)ptr;
-   type = btrfs_extent_inline_ref_type(leaf, iref);
+   type = btrfs_get_extent_inline_ref_type(leaf, iref, needed);
+   if (type == BTRFS_REF_TYPE_INVALID) {
+   err = -EINVAL;
+   goto out;
+   }
+
if (want < type)
break;
if (want > type) {
@@ -1909,7 +1926,12 @@ void update_inline_extent_backref(struct btrfs_fs_info 
*fs_info,
if (extent_op)
__run_delayed_extent_op(extent_op, leaf, ei);
 
-   type = btrfs_extent_inline_ref_type(leaf, iref);
+   /*
+* If type is invalid, we should have bailed out after
+* lookup_inline_extent_backref().
+*/
+   type = btrfs_get_extent_inline_ref_type(leaf, iref, BTRFS_REF_TYPE_ANY);
+   ASSERT(type != BTRFS_REF_TYPE_INVALID);
 
if (type == BTRFS_EXTENT_DATA_REF_KEY) {
dref = (struct btrfs_extent_data_ref *)(>offset);
@@ -3194,6 +3216,7 @@ static noinline int check_committed_ref(struct btrfs_root 
*root,
struct btrfs_extent_item *ei;
struct btrfs_key key;
u32 item_size;
+   int type;
int ret;
 
key.objectid = bytenr;
@@ -3235,8 +3258,9 @@ static noinline int check_committed_ref(struct btrfs_root 
*root,
 

[PATCH v2 6/7] Btrfs: remove BUG_ON in __add_tree_block

2017-08-07 Thread Liu Bo
The BUG_ON() can be triggered when the caller is processing an invalid
extent inline ref, e.g.

a shared data ref is offered instead of a extent data ref, such that
it tries to find a non-exist tree block and then btrfs_search_slot
returns 1 for no such item.

This replaces the BUG_ON() with a WARN() followed by calling
btrfs_print_leaf() to show more details about what's going on and
returning -EINVAL to upper callers.

Signed-off-by: Liu Bo 
---
 fs/btrfs/relocation.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 4806e78..cc6150e 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -32,6 +32,7 @@
 #include "free-space-cache.h"
 #include "inode-map.h"
 #include "qgroup.h"
+#include "print-tree.h"
 
 /*
  * backref_node, mapping_node and tree_block start with this
@@ -3485,7 +3486,15 @@ static int __add_tree_block(struct reloc_control *rc,
goto again;
}
}
-   BUG_ON(ret);
+   if (ret) {
+   ASSERT(ret == 1);
+   btrfs_print_leaf(rc->extent_root->fs_info, path->nodes[0]);
+   WARN(1,
+"tree block extent item (%llu) is not found in extent tree\n",
+bytenr);
+   ret = -EINVAL;
+   goto out;
+   }
 
ret = add_tree_block(rc, , path, blocks);
 out:
-- 
2.9.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 4/7] Btrfs: remove BUG() in print_extent_item

2017-08-07 Thread Liu Bo
btrfs_print_leaf() is used in btrfs_get_extent_inline_ref_type, so
here we really want to print the invalid value of ref type instead of
causing a kernel panic.

Signed-off-by: Liu Bo 
---
 fs/btrfs/print-tree.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/print-tree.c b/fs/btrfs/print-tree.c
index fcae61e..0e52e47 100644
--- a/fs/btrfs/print-tree.c
+++ b/fs/btrfs/print-tree.c
@@ -121,7 +121,10 @@ static void print_extent_item(struct extent_buffer *eb, 
int slot, int type)
   offset, btrfs_shared_data_ref_count(eb, sref));
break;
default:
-   BUG();
+   btrfs_err(eb->fs_info,
+ "extent %llu has invalid ref type %d\n",
+ eb->start, type);
+   return;
}
ptr += btrfs_extent_inline_ref_size(type);
}
-- 
2.9.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 3/7] Btrfs: remove BUG() in btrfs_extent_inline_ref_size

2017-08-07 Thread Liu Bo
Now that btrfs_get_extent_inline_ref_type() can report if type is a
valid one and all callers can gracefully deal with that, we don't need
to crash here.

Signed-off-by: Liu Bo 
---
 fs/btrfs/ctree.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 9866420..c915f33 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1799,7 +1799,6 @@ static inline u32 btrfs_extent_inline_ref_size(int type)
if (type == BTRFS_EXTENT_DATA_REF_KEY)
return sizeof(struct btrfs_extent_data_ref) +
   offsetof(struct btrfs_extent_inline_ref, offset);
-   BUG();
return 0;
 }
 
-- 
2.9.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 0/7] add sanity check for extent inline ref type

2017-08-07 Thread Liu Bo
An invalid extent inline ref type could be read from a btrfs image and
it ends up with a panic[1], this set is to deal with the insane value
gracefully in patch 1-2 and clean up BUG() in the code in patch 3-6.

Patch 7 adds one more check to see if the ref is a valid shared one.

I'm not sure in the real world what may result in this corruption, but
I've seen several reports on the ML about __btrfs_free_extent saying
something was missing (or simply wrong), while testing this set with
btrfs-corrupt-block, I found that switching ref type could end up that
situation as well, eg. a data extent's ref type
(BTRFS_EXTENT_DATA_REF_KEY) is switched to (BTRFS_TREE_BLOCK_REF_KEY).
Hopefully this can give people more sights next time when that
happens.

[1]:https://www.spinics.net/lists/linux-btrfs/msg65646.html

v2:
- add enum type and return BTRFS_REF_TYPE_INVALID instead of -EINVAL.
- remove one more BUG_ON which is in __add_tree_block.
- add validation check for shared refs.
- improve btrfs_print_leaf to show which refs has something wrong.

Liu Bo (7):
  Btrfs: add a helper to retrive extent inline ref type
  Btrfs: convert to use btrfs_get_extent_inline_ref_type
  Btrfs: remove BUG() in btrfs_extent_inline_ref_size
  Btrfs: remove BUG() in print_extent_item
  Btrfs: remove BUG() in add_data_reference
  Btrfs: remove BUG_ON in __add_tree_block
  Btrfs: add one more sanity check for shared ref type

 fs/btrfs/backref.c | 11 --
 fs/btrfs/ctree.h   | 12 ++-
 fs/btrfs/extent-tree.c | 93 ++
 fs/btrfs/print-tree.c  | 28 ---
 fs/btrfs/relocation.c  | 29 +---
 5 files changed, 155 insertions(+), 18 deletions(-)

-- 
2.9.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kernel BUG at /build/linux-H5UzH8/linux-4.10.0/fs/btrfs/extent_io.c:2318

2017-08-07 Thread Duncan
Piotr Pawłow posted on Mon, 07 Aug 2017 15:26:16 +0200 as excerpted:

> # uname -a
> Linux pps 4.10.0-30-generic #34-Ubuntu SMP
> Mon Jul 31 19:38:17 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

This is a general principles reply and chances are wouldn't help with 
your issue since the spread isn't /too/ large yet and I've no knowledge 
of a specific fix for your issue in newer kernels, but FWIW...

This being the btrfs development list and btrfs, while stabilizing (but 
not yet to be considered fully stable) still being under rather intense 
development, with significant changes every kernel cycle, list focus 
tends to be quite forward looking, with most interest and the best chance 
of fixes on relatively current kernels.

What we try to support, in addition to development kernels, is the latest 
two release kernel series in two tracks, (mainline/Linus) current and LTS.

On the current track, 4.12 is the newest release so 4.12 and 4.11 are 
best supported, tho the 4.11 series is already listed as EOL on 
kernel.org.  On the LTS track, 4.9 and 4.4 are the latest LTS releases, 
with 4.1 the previous one but already well out of btrfs focus range.

That puts 4.10 out of the list's best-supported focus range, with 4.11 
still in focus but EOL.  Beyond that we still try, but you're not as 
likely to get interest from the developers themselves, and for the other 
btrfs-user-regulars here, as the spread gets wider, the support gets more 
difficult as it's harder to remember where things were back then.

So the recommendation would be to either upgrade to 4.12 current if you 
want to stay on current track, or downgrade to the latest 4.9 release if 
you prefer stable track.

Other alternatives would include getting support from your distro if 
you're running a distro kernel (as seems to be your case), since they 
know what they've backported and what they haven't, and are thus in a 
better position to support it, and of course, simply sticking with what 
you have and accepting that you won't get quite the support or interest 
on your list posts since you're out of primary focus range.

Of course you can also keep a current kernel available and boot to it to 
try to duplicate any issues before reporting, while otherwise sticking to 
your older kernel, but unless you have a specific reason to do that, it 
would seem more trouble than simply running the current kernel by default.

So 4.10 isn't /too/ far out of range yet, but I'd strongly consider 
upgrading (or downgrading to 4.9 LTS) as soon as it's reasonably 
convenient, before 4.13 in any case.  Unless you prefer to go the distro 
support route, of course.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: fix out of bounds array access while reading extent buffer

2017-08-07 Thread Liu Bo
There is a cornel case that slip through the checkers in functions
reading extent buffer, ie.

if (start < eb->len) and (start + len > eb->len),
then

a) map_private_extent_buffer() returns immediately because
it's thinking the range spans across two pages,

b) and the checkers in read_extent_buffer(), WARN_ON(start > eb->len)
and WARN_ON(start + len > eb->start + eb->len), both are OK in this
corner case, but it'd actually try to access the eb->pages out of
bounds because of (start + len > eb->len).

The case is found by switching extent inline ref type from shared data
ref to non-shared data ref.

This is adding proper checks in order to avoid invalid memory access,
ie. 'general protection', before it's too late.

Signed-off-by: Liu Bo 
---
 fs/btrfs/extent_io.c | 22 ++
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0aff9b2..d198e87 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5416,13 +5416,19 @@ void read_extent_buffer(struct extent_buffer *eb, void 
*dstv,
char *dst = (char *)dstv;
size_t start_offset = eb->start & ((u64)PAGE_SIZE - 1);
unsigned long i = (start_offset + start) >> PAGE_SHIFT;
+   unsigned long num_pages = num_extent_pages(eb->start, eb->len);
 
-   WARN_ON(start > eb->len);
-   WARN_ON(start + len > eb->start + eb->len);
+   if (start + len > eb->len) {
+   WARN(1, KERN_ERR "btrfs bad mapping eb start %llu len %lu, 
wanted %lu %lu\n",
+eb->start, eb->len, start, len);
+   memset(dst, 0, len);
+   return;
+   }
 
offset = (start_offset + start) & (PAGE_SIZE - 1);
 
while (len > 0) {
+   ASSERT(i < num_pages);
page = eb->pages[i];
 
cur = min(len, (PAGE_SIZE - offset));
@@ -5491,6 +5497,12 @@ int map_private_extent_buffer(struct extent_buffer *eb, 
unsigned long start,
unsigned long end_i = (start_offset + start + min_len - 1) >>
PAGE_SHIFT;
 
+   if (start + min_len > eb->len) {
+   WARN(1, KERN_ERR "btrfs bad mapping eb start %llu len %lu, 
wanted %lu %lu\n",
+  eb->start, eb->len, start, min_len);
+   return -EINVAL;
+   }
+
if (i != end_i)
return 1;
 
@@ -5502,12 +5514,6 @@ int map_private_extent_buffer(struct extent_buffer *eb, 
unsigned long start,
*map_start = ((u64)i << PAGE_SHIFT) - start_offset;
}
 
-   if (start + min_len > eb->len) {
-   WARN(1, KERN_ERR "btrfs bad mapping eb start %llu len %lu, 
wanted %lu %lu\n",
-  eb->start, eb->len, start, min_len);
-   return -EINVAL;
-   }
-
p = eb->pages[i];
kaddr = page_address(p);
*map = kaddr + offset;
-- 
2.9.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs umount hang

2017-08-07 Thread Jeff Mahoney
On 8/7/17 1:19 PM, Jeff Mahoney wrote:
> On 8/7/17 10:12 AM, Angel Shtilianov wrote:
>> Hi there,
>> I'm investigating sporadic hanging during btrfs umount. The FS is
>> contained in a loop mounted file.
>> I have no reproduction scenario and the issue may happen once a day or
>> once a month. It is rare, but frustrating.
>> I have a crashdump (the server has been manually crashed and collected
>> a crashdump), so I could take look through the data structures.
>> What happens is that umount is getting in D state and a the kernel
>> complains about hung tasks. We are using kernel 4.4.y The actual back
>> trace is from 4.4.70, but this happens with all the 4.4 kernels I've
>> used (4.4.30 through 4.4.70).
>> Tasks like:
>> INFO: task kworker/u32:9:27574 blocked for more than 120 seconds.
>> INFO: task kworker/u32:12:27575 blocked for more than 120 seconds.
>> INFO: task btrfs-transacti:31625 blocked for more than 120 seconds.
>> are getting blocked waiting for btrfs_tree_read_lock, which is owned
>> by task umount:31696 (which is also blocked for more than 120 seconds)
>> regarding the lock debug.
>>
>> umount is hung in "cache_block_group", see the '>' mark:
>>while (cache->cached == BTRFS_CACHE_FAST) {
>> struct btrfs_caching_control *ctl;
>>
>> ctl = cache->caching_ctl;
>> atomic_inc(>count);
>> prepare_to_wait(>wait, , TASK_UNINTERRUPTIBLE);
>> spin_unlock(>lock);
>>
>>>schedule();
>>
>> finish_wait(>wait, );
>> put_caching_control(ctl);
>> spin_lock(>lock);
>> }
>>
>> The complete backtraces could be found in the attached log.
>>
>> Do you have any ideas ?
> 
> Hi Angel -
> 
> In your log, it says lockdep is disabled.  What tripped it earlier?
> Lockdep really should be catching locking deadlocks in situations like
> this, if that's really the underlying cause.

Actually, I'm not sure if lockdep would catch this one.  Here's my
hypothesis:

kworker/u32:9 is waiting on a read lock while reading the free space
cache, which means it owns the cache->cached value and will issue the
wakeup when it completes.

umount is waiting on for the wakeup from kworker/u32:9 but is holding
some tree locks in write mode.

If kworker/u32:9 is waiting on the locks that umount holds, we have a
deadlock.

Can you dump the extent buffer that kworker/u32:9 is waiting on?  Part
of that will contain the PID of the holder, and if matches umount, we
found the cause.

-Jeff

-- 
Jeff Mahoney
SUSE Labs



signature.asc
Description: OpenPGP digital signature


Re: Btrfs umount hang

2017-08-07 Thread Jeff Mahoney
On 8/7/17 10:12 AM, Angel Shtilianov wrote:
> Hi there,
> I'm investigating sporadic hanging during btrfs umount. The FS is
> contained in a loop mounted file.
> I have no reproduction scenario and the issue may happen once a day or
> once a month. It is rare, but frustrating.
> I have a crashdump (the server has been manually crashed and collected
> a crashdump), so I could take look through the data structures.
> What happens is that umount is getting in D state and a the kernel
> complains about hung tasks. We are using kernel 4.4.y The actual back
> trace is from 4.4.70, but this happens with all the 4.4 kernels I've
> used (4.4.30 through 4.4.70).
> Tasks like:
> INFO: task kworker/u32:9:27574 blocked for more than 120 seconds.
> INFO: task kworker/u32:12:27575 blocked for more than 120 seconds.
> INFO: task btrfs-transacti:31625 blocked for more than 120 seconds.
> are getting blocked waiting for btrfs_tree_read_lock, which is owned
> by task umount:31696 (which is also blocked for more than 120 seconds)
> regarding the lock debug.
> 
> umount is hung in "cache_block_group", see the '>' mark:
>while (cache->cached == BTRFS_CACHE_FAST) {
> struct btrfs_caching_control *ctl;
> 
> ctl = cache->caching_ctl;
> atomic_inc(>count);
> prepare_to_wait(>wait, , TASK_UNINTERRUPTIBLE);
> spin_unlock(>lock);
> 
>>schedule();
> 
> finish_wait(>wait, );
> put_caching_control(ctl);
> spin_lock(>lock);
> }
> 
> The complete backtraces could be found in the attached log.
> 
> Do you have any ideas ?

Hi Angel -

In your log, it says lockdep is disabled.  What tripped it earlier?
Lockdep really should be catching locking deadlocks in situations like
this, if that's really the underlying cause.

-Jeff

-- 
Jeff Mahoney
SUSE Labs



signature.asc
Description: OpenPGP digital signature


Re: Btrfs umount hang

2017-08-07 Thread Nikolay Borisov


On  7.08.2017 17:12, Angel Shtilianov wrote:
> Hi there,
> I'm investigating sporadic hanging during btrfs umount. The FS is
> contained in a loop mounted file.
> I have no reproduction scenario and the issue may happen once a day or
> once a month. It is rare, but frustrating.
> I have a crashdump (the server has been manually crashed and collected
> a crashdump), so I could take look through the data structures.
> What happens is that umount is getting in D state and a the kernel
> complains about hung tasks. We are using kernel 4.4.y The actual back
> trace is from 4.4.70, but this happens with all the 4.4 kernels I've
> used (4.4.30 through 4.4.70).
> Tasks like:
> INFO: task kworker/u32:9:27574 blocked for more than 120 seconds.
> INFO: task kworker/u32:12:27575 blocked for more than 120 seconds.
> INFO: task btrfs-transacti:31625 blocked for more than 120 seconds.
> are getting blocked waiting for btrfs_tree_read_lock, which is owned
> by task umount:31696 (which is also blocked for more than 120 seconds)
> regarding the lock debug.
> 
> umount is hung in "cache_block_group", see the '>' mark:
>while (cache->cached == BTRFS_CACHE_FAST) {
> struct btrfs_caching_control *ctl;
> 
> ctl = cache->caching_ctl;
> atomic_inc(>count);
> prepare_to_wait(>wait, , TASK_UNINTERRUPTIBLE);
> spin_unlock(>lock);
> 
>>schedule();
> 
> finish_wait(>wait, );
> put_caching_control(ctl);
> spin_lock(>lock);
> }
> 
> The complete backtraces could be found in the attached log.
> 
> Do you have any ideas ?
> Any help will be greatly appreciated.

So by the looks of it while writing dirty bgs and requiring a free block
for the CoW process, cache_block_group() kicks off a caching thread
which should just go and read in the respective block group. So the
newly spawned caching_thread should actually wake up cache_block_group
either due to success, if it manages to find 2megs:
if (total_found > (1024 * 1024 * 2)) {

or in case of failure after the out label. But in both cases it will set
cache->cached to something different than BTRFS_CACHE_FAST and it ought
to exit the loop.


But from your description of the issue I take it the process never comes
back from the schedule, meaning it missed the wakeup from caching_thread
and atm I cannot see how this could happen. Can you print the state of
the 'cache' parameter of cache_block_group ?
> 
> Best regards,
> Angel Shtilianov
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-07 Thread Chris Murphy
On Fri, Aug 4, 2017 at 8:05 AM, Qu Wenruo  wrote:
>
> For example, if one day there is some dm-csum to support verify csum of
> given ranges (and skip unrelated ones specified by higher levels), btrfs
> support for data csum is no longer an exclusive feature.

How would dm-csum differ from dm-integrity?
https://www.kernel.org/doc/Documentation/device-mapper/dm-integrity.txt

By that description it uses a journal to guarantee atomicity. If
multiqueue maybe the performance implications are neutral. But
certainly on spinning drives that would slow things down, especially
if the file system is also journaling, and the workload is metadata
heavy.

-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Btrfs umount hang

2017-08-07 Thread Angel Shtilianov
Hi there,
I'm investigating sporadic hanging during btrfs umount. The FS is
contained in a loop mounted file.
I have no reproduction scenario and the issue may happen once a day or
once a month. It is rare, but frustrating.
I have a crashdump (the server has been manually crashed and collected
a crashdump), so I could take look through the data structures.
What happens is that umount is getting in D state and a the kernel
complains about hung tasks. We are using kernel 4.4.y The actual back
trace is from 4.4.70, but this happens with all the 4.4 kernels I've
used (4.4.30 through 4.4.70).
Tasks like:
INFO: task kworker/u32:9:27574 blocked for more than 120 seconds.
INFO: task kworker/u32:12:27575 blocked for more than 120 seconds.
INFO: task btrfs-transacti:31625 blocked for more than 120 seconds.
are getting blocked waiting for btrfs_tree_read_lock, which is owned
by task umount:31696 (which is also blocked for more than 120 seconds)
regarding the lock debug.

umount is hung in "cache_block_group", see the '>' mark:
   while (cache->cached == BTRFS_CACHE_FAST) {
struct btrfs_caching_control *ctl;

ctl = cache->caching_ctl;
atomic_inc(>count);
prepare_to_wait(>wait, , TASK_UNINTERRUPTIBLE);
spin_unlock(>lock);

>schedule();

finish_wait(>wait, );
put_caching_control(ctl);
spin_lock(>lock);
}

The complete backtraces could be found in the attached log.

Do you have any ideas ?
Any help will be greatly appreciated.

Best regards,
Angel Shtilianov
Jul 21 22:30:34 smallvault22 kernel: [377419.794212] BTRFS info (device loop0): disk space caching is enabled
Jul 21 22:30:34 smallvault22 kernel: [377419.794449] BTRFS: has skinny extents
Jul 21 22:30:43 smallvault22 kernel: [377429.488153] BTRFS info (device loop2): disk space caching is enabled
Jul 21 22:30:43 smallvault22 kernel: [377429.488389] BTRFS: has skinny extents
Jul 21 22:30:44 smallvault22 kernel: [377429.710041] BTRFS info (device loop2): disk space caching is enabled
Jul 21 22:30:44 smallvault22 kernel: [377429.710279] BTRFS: has skinny extents
Jul 21 22:30:44 smallvault22 kernel: [377429.961130] BTRFS info (device loop0): disk space caching is enabled
Jul 21 22:30:44 smallvault22 kernel: [377429.961367] BTRFS: has skinny extents
Jul 21 22:32:04 smallvault22 kernel: [377510.277526] BTRFS: device fsid 7d53957c-847b-4dd6-8a70-63870a34c7f7 devid 1 transid 1222 / dev/loop0
Jul 21 22:32:04 smallvault22 kernel: [377510.294142] BTRFS info (device loop0): disk space caching is enabled
Jul 21 22:32:04 smallvault22 kernel: [377510.294387] BTRFS: has skinny extents
Jul 21 22:32:08 smallvault22 kernel: [377514.199173] BTRFS info (device loop0): new size for /dev/loop0 is 214748364800
Jul 21 22:32:09 smallvault22 kernel: [377514.722687] BTRFS info (device loop0): disk space caching is enabled
Jul 21 22:32:09 smallvault22 kernel: [377514.722924] BTRFS: has skinny extents
Jul 21 22:32:09 smallvault22 kernel: [377514.833162] INFO: task kworker/u32:9:27574 blocked for more than 120 seconds.
Jul 21 22:32:09 smallvault22 kernel: [377514.833543]   Tainted: G   O4.4.70-backup2 #93
Jul 21 22:32:09 smallvault22 kernel: [377514.833767] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 21 22:32:09 smallvault22 kernel: [377514.834153] kworker/u32:9   D 880046142ec8 0 27574  2 0x
Jul 21 22:32:09 smallvault22 kernel: [377514.834536] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
Jul 21 22:32:09 smallvault22 kernel: [377514.834825]  880046142ec8 8802598b62f0 880046142e90 
Jul 21 22:32:09 smallvault22 kernel: [377514.835468]  88046bb19a00 88038b469a00 880046144000 8802598b62f0
Jul 21 22:32:09 smallvault22 kernel: [377514.836097]  88038b469a00 88038b469a00 880046142f08 880046142ee0
Jul 21 22:32:09 smallvault22 kernel: [377514.836724] Call Trace:
Jul 21 22:32:09 smallvault22 kernel: [377514.836947]  [] schedule+0x3c/0x90
Jul 21 22:32:09 smallvault22 kernel: [377514.837206]  [] btrfs_tree_read_lock+0xe5/0x140 [btrfs]
Jul 21 22:32:09 smallvault22 kernel: [377514.837437]  [] ? wait_woken+0xb0/0xb0
Jul 21 22:32:09 smallvault22 kernel: [377514.837669]  [] btrfs_read_lock_root_node+0x34/0x50 [btrfs]
Jul 21 22:32:09 smallvault22 kernel: [377514.838054]  [] btrfs_search_slot+0x70b/0x9d0 [btrfs]
Jul 21 22:32:09 smallvault22 kernel: [377514.838289]  [] ? trace_hardirqs_on+0xd/0x10
Jul 21 22:32:09 smallvault22 kernel: [377514.838524]  [] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
Jul 21 22:32:09 smallvault22 kernel: [377514.838760]  [] btrfs_lookup_file_extent+0x37/0x40 [btrfs]
Jul 21 22:32:09 smallvault22 kernel: [377514.838999]  [] btrfs_get_extent+0x155/0xb30 [btrfs]
Jul 21 22:32:09 smallvault22 kernel: [377514.839249]  [] ? __set_extent_bit+0x15a/0x580 [btrfs]
Jul 21 22:32:09 smallvault22 kernel: 

kernel BUG at /build/linux-H5UzH8/linux-4.10.0/fs/btrfs/extent_io.c:2318

2017-08-07 Thread Piotr Pawłow
Hello,

my btrfs raid1 fs with 4 drives crashed after only one drive developed a
couple of bad blocks.

Seems similar to https://bugzilla.kernel.org/show_bug.cgi?id=196251 but
it happened during normal usage instead of during replace.

As a sidenote, I tried to make the system less noisy during the day,
when it's not doing nightly backups, by setting drives to spin down
after a few minutes. Unfortunately btrfs kept them spinning for no
reason, so I tried systemd's automount to unmount the fs when not in
use. It worked, but it seems it uncovered some kind of umount/mount race
condition, where mount unexpectedly failed with:

sie 06 03:47:39 pps kernel: BTRFS error (device dm-1): super_total_bytes 
3855366488064 mismatch with fs_devices total_rw_bytes 7710732976128
sie 06 03:47:39 pps kernel: BTRFS error (device dm-1): failed to read chunk 
tree: -22
sie 06 03:47:39 pps kernel: BTRFS error (device dm-1): open_ctree failed

...and then the next mount worked properly as if nothing had happened.

Full log: https://pp.siedziba.pl/btrfs_raid1_fail/kernel.log

# uname -a
Linux pps 4.10.0-30-generic #34-Ubuntu SMP Mon Jul 31 19:38:17 UTC 2017 x86_64 
x86_64 x86_64 GNU/Linux

# btrfs device usage /backup
/dev/mapper/pps-backup1, ID: 2
   Device size:   465.76GiB
   Device slack:  0.00B
   Data,RAID1:387.00GiB
   Metadata,RAID1: 16.00GiB
   Unallocated:62.76GiB

/dev/mapper/pps-backup2, ID: 8
   Device size: 1.81TiB
   Device slack:  0.00B
   Data,RAID1:  1.48TiB
   Metadata,RAID1: 26.00GiB
   System,RAID1:   32.00MiB
   Unallocated:   314.29GiB

/dev/mapper/pps-backup3, ID: 6
   Device size:   334.69GiB
   Device slack:  0.00B
   Data,RAID1:268.00GiB
   Metadata,RAID1:  5.00GiB
   System,RAID1:   32.00MiB
   Unallocated:61.66GiB

/dev/mapper/pps-backup4, ID: 7
   Device size:   931.51GiB
   Device slack:  0.00B
   Data,RAID1:863.00GiB
   Metadata,RAID1:  5.00GiB
   System,RAID1:   64.00MiB
   Unallocated:63.45GiB

# btrfs device stats /backup
[/dev/mapper/pps-backup1].write_io_errs0
[/dev/mapper/pps-backup1].read_io_errs 0
[/dev/mapper/pps-backup1].flush_io_errs0
[/dev/mapper/pps-backup1].corruption_errs  0
[/dev/mapper/pps-backup1].generation_errs  0
[/dev/mapper/pps-backup3].write_io_errs0
[/dev/mapper/pps-backup3].read_io_errs 0
[/dev/mapper/pps-backup3].flush_io_errs0
[/dev/mapper/pps-backup3].corruption_errs  0
[/dev/mapper/pps-backup3].generation_errs  0
[/dev/mapper/pps-backup4].write_io_errs0
[/dev/mapper/pps-backup4].read_io_errs 0
[/dev/mapper/pps-backup4].flush_io_errs0
[/dev/mapper/pps-backup4].corruption_errs  0
[/dev/mapper/pps-backup4].generation_errs  0
[/dev/mapper/pps-backup2].write_io_errs1
[/dev/mapper/pps-backup2].read_io_errs 22
[/dev/mapper/pps-backup2].flush_io_errs0
[/dev/mapper/pps-backup2].corruption_errs  0
[/dev/mapper/pps-backup2].generation_errs  0

sie 07 04:45:05 pps kernel: [ cut here ]
sie 07 04:45:05 pps kernel: BTRFS critical (device dm-1): stripe index math 
went horribly wrong, got stripe_index=2176876031, num_stripes=2
sie 07 04:45:05 pps kernel: BTRFS critical (device dm-1): stripe index math 
went horribly wrong, got stripe_index=2176876031, num_stripes=2
sie 07 04:45:05 pps kernel: BTRFS critical (device dm-1): stripe index math 
went horribly wrong, got stripe_index=2176876031, num_stripes=2
sie 07 04:45:05 pps kernel: BTRFS critical (device dm-1): stripe index math 
went horribly wrong, got stripe_index=2176876031, num_stripes=2
sie 07 04:45:05 pps kernel: BTRFS critical (device dm-1): stripe index math 
went horribly wrong, got stripe_index=2176876031, num_stripes=2
sie 07 04:45:05 pps kernel: BTRFS critical (device dm-1): stripe index math 
went horribly wrong, got stripe_index=2176876031, num_stripes=2
sie 07 04:45:05 pps kernel: BTRFS critical (device dm-1): stripe index math 
went horribly wrong, got stripe_index=2176876031, num_stripes=2
sie 07 04:45:05 pps kernel: BTRFS critical (device dm-1): stripe index math 
went horribly wrong, got stripe_index=2176876031, num_stripes=2
sie 07 04:45:05 pps kernel: BTRFS critical (device dm-1): stripe index math 
went horribly wrong, got stripe_index=2176876031, num_stripes=2
sie 07 04:45:05 pps kernel: BTRFS critical (device dm-1): stripe index math 
went horribly wrong, got stripe_index=2176876031, num_stripes=2
sie 07 04:45:05 pps kernel: BTRFS critical (device dm-1): stripe index math 
went horribly wrong, got stripe_index=2176876031, num_stripes=2
sie 07 04:45:05 pps kernel: BTRFS critical (device dm-1): stripe index math 
went horribly wrong, got stripe_index=2176876031, num_stripes=2
sie 07 04:45:05 pps kernel: BTRFS critical (device dm-1): stripe index math 
went horribly wrong, got stripe_index=2176876031, num_stripes=2

Re: Crashed filesystem, nothing helps

2017-08-07 Thread Henk Slager
On Mon, Aug 7, 2017 at 7:12 AM, Thomas Wurfbaum  wrote:
> Now i do a btrfs-find-root, but it runs now since 5 day without a result.
> How long should i wait? Or is it already to late to hope?
>
> mainframe:~ # btrfs-find-root.static /dev/sdb1
> parent transid verify failed on 29376512 wanted 1327723 found 1489835
> parent transid verify failed on 29376512 wanted 1327723 found 1489835
> parent transid verify failed on 29376512 wanted 1327723 found 1489835
> parent transid verify failed on 29376512 wanted 1327723 found 1489835
> Ignoring transid failure
> Superblock thinks the generation is 1490226
> Superblock thinks the level is 1
>
> The process is still running...

I also have a filesystem that has a big difference in wanted and found
transid. I my case the wanted is much higher than the found. I also
tried to repair it for more than 2 weeks in total, but both progs and
kernel fail. I haven't seen that you mounted it just with only ro
mount option. I think in your case chances are low, but it worked in
my case. I could copy all data (7TB unique non-shared thorugh snapshot
or reflink) to a new fs with rsync. Btrfs send and other action done
by btrfs progs quite soon failed on 1 or several parent transid
inconsistencies.

Others on this forum have stated that when there is such a big
difference in transid, the fs is simple unrepairable. I must admit
that this is the truth in practice. Restore with btrfs-restore I tried
once on a other fs that was damaged and at that time the restore gave
quite different content for some file compared to a copy of the same
file from the fs mouted via the kernel.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Power down tests...

2017-08-07 Thread Shyam Prasad N
Hi Chris,
Good points that you make.

We're making use of btrfs raid only. (One of the reasons we want to
move to btrfs)
However, during this test, we haven't run with multi-disk btrfs raid.
We just have one disk. (This test setup doesn't have too many disks)

We do have our metadata replicated as well. For data, we do have
regular async backups.
However, this is something that we noticed (somewhat more frequently)
while testing out btrfs as our data store (as compared to ext4).
We've tests going on with flushoncommit and recovery mount options
running on the same setup. Hopefully, we'll have near-ext4-like
behaviour with this, w.r.t power off recovery.

Regards,
Shyam

On Mon, Aug 7, 2017 at 7:52 AM, Chris Murphy  wrote:
> On Thu, Aug 3, 2017 at 11:51 PM, Shyam Prasad N  
> wrote:
>> Hi all,
>>
>> We're running a couple of experiments on our servers with btrfs
>> (kernel version 4.4).
>> And we're running some abrupt power-off tests for a couple of scenarios:
>>
>> 1. We have a filesystem on top of two different btrfs filesystems
>> (distributed across N disks).
>
> What's the layout from physical devices all the way to your 16M file?
> This is hardware raid, lvm linear, Btrfs raid? All of that matters.
>
> Do the drives have write caching disabled? You might be better off
> with the drive write cache disabled, and then add bcache or dm-cache
> and an SSD to compensate. But that's just speculation on my part. The
> write cache in the drives is definitely volatile. And disabling them
> will definitely make writes slower. So, you might have slightly better
> luck with another layout.
>
> But the bottom line is, you need to figure out a way to avoid *any*
> data loss in your files because otherwise that means the 2nd file
> system has data loss and even corruption. This is not something a file
> system choice can solve. You need reliable power and reliable
> shutdown. And you may also need a cluster file system like ceph or
> glusterfs instead of depending on a single box to stay upright.
>
>
>
> --
> Chris Murphy



-- 
-Shyam
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Power down tests...

2017-08-07 Thread Shyam Prasad N
Hi Chris,

Thanks for the detailed reply. :)
Read my answers inline:

On Mon, Aug 7, 2017 at 7:45 AM, Chris Murphy  wrote:
>
> This is astronomically more complicated than the already complicated
> scenario with one file system on a single normal partition of a well
> behaved (non-lying) single drive.
>
> You have multiple devices, so any one or all of them could drop data
> during the power failure and in different amounts. In the best case
> scenario, at next mount the supers are checked on all the devices, and
> the lowest common denominator generation is found, and therefore the
> lowest common denominator root tree. No matter what it means some data
> is going to be lost.

True. This is something that we're experimenting with, since we can
use many btrfs features. Except for these power off issues, we didn't
face many other issues.

>
> Next there is a file system on top of a file system, I assume it's a
> file that's loopback mounted?
>

Not exactly loopback mounted. We are, however, distributing the data
and metadata across different btrfs files and reading them to present
a filesystem view to the client.

>
> I'd want to know why it fails. And then I'd check all the supers on
> all the devices  with 'btrfs inspect-internal dump-super -fa '.
>
> Are all the copies on a given device the same and valid? Are all the
> copies among all devices the same and valid? I'm expecting there will
> be discrepancies and then you have to figure out if the mount logic is
> really finding the right root to try to mount. I'm not sure if kernel
> code by default reports back in detail what logic its using and
> exactly where it fails, or if you just get the generic open_ctree
> mount failure message.
>
> And then it's an open question whether the supers need fixing, or
> whether the 'usebackuproot' mount option is the way to go. It might
> depend on the status of the supers how that logic ends up working.
> Again, it might be useful if there were debug info that explicitly
> shows the mount logic actually being used, dumped to kernel messages.
> I'm not sure if that code exists when CONFIG_BTRFS_DEBUG is enabled
> (as in, I haven't looked but I've thought it really could come in
> handy in some of the cases we see of mount failure can can't tell
> where things are getting stuck with the existing reporting).
>

Unfortunately, we don't have these data now, since we've started a
fresh batch of similar tests with a couple of new mount options (-o
flushoncommit,recovery). If we hit the issue again, I'll share the
data here.

>
> I can't tell you if that's a bug or not because I'm not sure how your
> software creates these 16M backing files, if they're fallocated or
> touched or what. It's plausible they're created as zero length files,
> and the file system successful creates them, and then data is written
> to them, but before there is either committed metadata or an updated
> super pointing to the new root tree you get a power failure. And in
> that case, I expect a zero length file or maybe some partial amount of
> data is there.
>

The files are first touched, then truncated to 16M, before being written to.
So, it does makes sense then that on recovery, we ended up with
zero-sized files. Btrfs could be showing us a consistent older
filesystem, rather than inconsistent newer one.

>
> Sounds expected for any file system, but chances are there's more
> missing with a CoW file system since by nature it rolls back to the
> most recent sane checkpoint for the fs metadata without any regard to
> what data is lost to make that happen. The goal is to not lose the
> file system in such a case, as some amount of data is always going to
> happen, and why power losses need to be avoided (UPS's and such). The
> fact that you have a file system on top of a file system makes it more
> fragile because the 2nd file system's metadata *IS* data as far as the
> 1st file system is concerned. And that data is considered expendable.
>

Yes, you're right. that is a downside when we stack one FS on top of
another. As long as we minimize the scope of seeing filesystem
inconsistencies, we should be okay. Even if the data is slightly
older.
We were using ext4 for the same purpose with good results on power off
and recovery. With flushoncommit, hopefully, we should see better
results on btrfs as well. Let's see.

>
> commit 5s might make the problem worse by requiring such constant
> flushing of dirty data that you're getting a bunch of disk contention,
> hard to say since there's no details about the workload at the time of
> the power failure. Changing nothing else but but commit= mount option,
> what difference do you see (with a scientific sample) if any between
> commit 5 and default commit 30 when it comes to the amount of data
> loss?

We're not choking the disk with the workload now, if that is what
you're asking. The disks can take a lot more load.

>
> Another thing we don't know is the application or service